# Getting Started with pandas

pandas makes data cleaning and analysis fast and convenient in Python. It is designed for working with tabular or heterogeneous data.

Conventions:
```python
import numpy as np
import pandas as pd
```

## Index

* [Introduction to pandas Data Structures](#introduction-to-pandas-data-structures)
    * [Series](#series)
    * [DataFrame](#dataframe)
    * [Index Object](#index-object)
* [Essential Functionality](#essential-functionality)
    * [Reindexing](#reindexing)
    * [Dropping entries from an Axis](#dropping-entries-from-an-axis)
    * [Indexing, Selection and Filtering](#indexing-selection-and-filtering)
        * [Indexing options with DataFrame](#indexing-options-with-dataframe)
    * [Arithmetic and Data Alignment](#arithmetic-and-data-alignment)
        * [Operations between DataFrame and Series](#operations-between-dataframe-and-series)
    * [Function Application and Mapping](#function-aplication-and-mapping)
        * [Formating with functions](#formating-with-functions)
    * [Sorting and Ranking](#sorting-and-ranking)
    * [Axis indexes with duplicate labels](#axis-with-duplicate-labels)

## Introduction to pandas Data Structures

### Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [10]:
import pandas as pd

obj = pd.Series([6, 2, -3, 9])
print(obj)

obj2 = pd.Series([5, 3, -6, 8], index=range(10, 50, 10))
print(f"\n{obj2}")
print(f"{obj2.index}")
print(f"Object n30: {obj2[30]}")
# Comparisons are element-wise, as in NumPy, and the Boolean result
# can be used as a filter
print(f"Greater than 0: \n{obj2 > 0}"
      f"\nAnd filtering rows: \n{obj2[obj2 > 0]}")
# The 'in' operator checks index labels, not values
print(f"\nIs index 55 in obj2? \n{55 in obj2}")

0    6
1    2
2   -3
3    9
dtype: int64

10    5
20    3
30   -6
40    8
dtype: int64
RangeIndex(start=10, stop=50, step=10)
Object n30: -6
Greater than 0: 
10     True
20     True
30    False
40     True
dtype: bool
And filtering rows: 
10    5
20    3
40    8
dtype: int64

Is index 55 in obj2? 
False


In [15]:
import pandas as pd

pop_by_state = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
# Dictionary to pandas Series (the name avoids shadowing the built-in 'dict')
obj3 = pd.Series(pop_by_state)

print(f"{obj3}")
# pandas Series to dictionary
print(f"\n{obj3.to_dict()}")

# Specifying the index (introduces missing data)
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(pop_by_state, index=states)
print(f"\n{obj4}")
# "California" has no value in the dictionary, so it appears as NaN (Not a Number).
# "Utah" has data but is excluded from 'obj4' because it is not in 'states'.

print(f"\nLooking for missing data: \n{pd.isna(obj4)}")
print(f"\nLooking for NOT missing data: \n{pd.notna(obj4)}")
# Getting rows with missing data
print(f"\nFiltering missing data: \n{obj4[obj4.isna()]}")

# Arithmetic between two Series aligns them by index label automatically
print(f"Obj3 + Obj4: \n{obj3 + obj4}")

# Both the Series and its index have a 'name' attribute
obj4.name = "population"
obj4.index.name = "state"
print(f"Obj4 with name attribute: \n{obj4}")

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Looking for missing data: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Looking for NOT missing data: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Filtering missing data: 
California   NaN
dtype: float64
Obj3 + Obj4: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
Obj4 with name attribute: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64


### DataFrame

A DataFrame is a table of data made up of an ordered, named collection of columns, each of which can hold a different value type. It has both a row index and a column index. You can create a DataFrame from a dictionary of equal-length lists or NumPy arrays.

When assigning a list or array to a column, its length must match the length of the DataFrame. When assigning a Series, its labels are realigned to the DataFrame's index, and missing values are inserted for any index labels not present in the Series.
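
The alignment rule for column assignment can be sketched as follows. The small `df` and the `val` Series are made up for illustration and are not part of the notes above.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada", "Texas"],
                   "pop": [1.5, 2.4, 3.3]},
                  index=["one", "two", "three"])

# The Series only carries labels for two of the three rows
val = pd.Series([-1.2, -1.7], index=["two", "three"])

# Assignment aligns on the index; row "one" has no match and becomes NaN
df["debt"] = val
print(df)

# A plain list is not aligned, so its length must match the DataFrame
length_error = None
try:
    df["bad"] = [1, 2]
except ValueError as exc:
    length_error = str(exc)
print(f"ValueError: {length_error}")
```

Because alignment happens by label, the order of the values in `val` does not matter, only their index labels.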

*Possible data inputs to the DataFrame constructor*
|Type|Notes|
|---|---|
|2D ndarray| A matrix of data, passing optional row and column labels|
|Dictionary of arrays, lists, or tuples |Each sequence becomes a column in the DataFrame; all sequences must be the same length|
|NumPy structured/record array| Treated as the “dictionary of arrays” case|
|Dictionary of Series| Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed|
|Dictionary of dictionaries| Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case|
|List of dictionaries or Series |Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels|
|List of lists or tuples| Treated as the “2D ndarray” case|
|Another DataFrame| The DataFrame’s indexes are used unless different ones are passed|
|NumPy MaskedArray| Like the “2D ndarray” case except masked values are missing in the DataFrame result|
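
A few of the input types from the table, as a minimal sketch. The variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# List of dictionaries: each item becomes a row, keys are unioned into columns
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

# Dictionary of Series: the indexes are unioned to form the row index
from_series = pd.DataFrame({"x": pd.Series([1, 2], index=["i", "j"]),
                            "y": pd.Series([3, 4], index=["j", "k"])})

print(from_array)
print(from_records)
print(from_series)
```

In the last two cases, positions with no data in a given row or column are filled with NaN.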

In [21]:
import pandas as pd
from pandas import DataFrame

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Oregon",
              "Oregon", "Texas", "Texas",],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003, 2004, 2005],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 2.4, 3.1, 3.3, 2.9]
    }

# The 'columns' argument sets the column order when building the DataFrame
df = DataFrame(data, columns=["year", "state", "pop"])
print(f"{df}")

# For a large DataFrame, 'head()' shows only the first 5 rows
print(f"\n{df.head()}")
# and 'tail()' shows the last 5 rows
print(f"\n{df.tail()}")

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
6  2002  Oregon  2.4
7  2003  Oregon  3.1
8  2004   Texas  3.3
9  2005   Texas  2.9

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

   year   state  pop
5  2003  Nevada  3.2
6  2002  Oregon  2.4
7  2003  Oregon  3.1
8  2004   Texas  3.3
9  2005   Texas  2.9


In [3]:
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Oregon",
              "Oregon", "Texas", "Texas",],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003, 2004, 2005],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 2.4, 3.1, 3.3, 2.9]
    }

# A column that is not in 'data' (here 'debt') is filled with missing values
df2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
print(f"Columns df2: {df2.columns}")
print(f"States: \n{df2['state']}"
      f"\nPrinting years: \n{df2.year}")
print(f"\nRetrieving by position: \n{df2.iloc[2]}")

Columns df2: Index(['year', 'state', 'pop', 'debt'], dtype='object')
States: 
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
6    Oregon
7    Oregon
8     Texas
9     Texas
Name: state, dtype: object
Printing years: 
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
6    2002
7    2003
8    2004
9    2005
Name: year, dtype: int64

Retrieving by position: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object


In [16]:
import pandas as pd
import numpy as np

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Oregon",
              "Oregon", "Texas", "Texas",],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003, 2004, 2005],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 2.4, 3.1, 3.3, 2.9]
    }
df2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# Assigning an array to the 'debt' column (its length must match the DataFrame)
df2["debt"] = np.arange(1., 11.)

# Creating a Boolean 'eastern' column that is True where the state is "Ohio"
df2["eastern"] = df2["state"] == "Ohio"
print(f"{df2}")

# Removing the 'eastern' column
del df2["eastern"]
print(f"\nDeleting 'eastern' column: {df2.columns}")

# Note: historically, the column returned from indexing a DataFrame was a
# view on the underlying data, so in-place modifications to that Series were
# reflected in the DataFrame. With Copy-on-Write enabled (planned as the
# default in pandas 3.0) it behaves like a copy instead. To be explicit,
# copy the column with the Series 'copy()' method.

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   1.0     True
1  2001    Ohio  1.7   2.0     True
2  2002    Ohio  3.6   3.0     True
3  2001  Nevada  2.4   4.0    False
4  2002  Nevada  2.9   5.0    False
5  2003  Nevada  3.2   6.0    False
6  2002  Oregon  2.4   7.0    False
7  2003  Oregon  3.1   8.0    False
8  2004   Texas  3.3   9.0    False
9  2005   Texas  2.9  10.0    False

Deleting 'eastern' column: Index(['year', 'state', 'pop', 'debt'], dtype='object')


In [7]:
import pandas as pd

populations = {
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
    "Nevada": {2001: 2.4, 2002: 2.9}
}
# In a DataFrame, the outer dictionary keys will be the columns
frame3 = pd.DataFrame(populations)

print(f"{frame3.T}")
# Transposing discards the column data types if the columns don't all share
# the same data type, so transposing back can yield plain Python objects

# Specifying the index
f3ind = pd.DataFrame(populations, index=[2001, 2002, 2003])
print(f"\n{f3ind}")

serie_data = {"Ohio": frame3["Ohio"][:-1],
              "Nevada": frame3["Nevada"][:2],
              }
sdata = pd.DataFrame(serie_data)
print(f"\n{sdata}")

        2000  2001  2002
Ohio     1.5   1.7   3.6
Nevada   NaN   2.4   2.9

      Ohio  Nevada
2001   1.7     2.4
2002   3.6     2.9
2003   NaN     NaN

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


In [10]:
import pandas as pd

populations = {
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
    "Nevada": {2001: 2.4, 2002: 2.9}
}

frame3 = pd.DataFrame(populations)

frame3.index.name = "year"
frame3.columns.name = "state"
print(f"{frame3}")

# DataFrame doesn't have a name attribute. 'to_numpy()' returns the data
print(f"To numpy: \n{frame3.to_numpy()}")

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada", "Oregon",
              "Oregon", "Texas", "Texas",],
    "year": [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003, 2004, 2005],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 2.4, 3.1, 3.3, 2.9]
    }

frame2 = pd.DataFrame(data)
# If the columns have different data types, 'to_numpy()' picks a dtype that
# accommodates all of them (here: object)
print(f"\n{data}")
print(f"to numpy: \n{frame2.to_numpy()}")

state  Ohio  Nevada
year               
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9
To numpy: 
[[1.5 nan]
 [1.7 2.4]
 [3.6 2.9]]

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada', 'Oregon', 'Oregon', 'Texas', 'Texas'], 'year': [2000, 2001, 2002, 2001, 2002, 2003, 2002, 2003, 2004, 2005], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2, 2.4, 3.1, 3.3, 2.9]}
to numpy: 
[['Ohio' 2000 1.5]
 ['Ohio' 2001 1.7]
 ['Ohio' 2002 3.6]
 ['Nevada' 2001 2.4]
 ['Nevada' 2002 2.9]
 ['Nevada' 2003 3.2]
 ['Oregon' 2002 2.4]
 ['Oregon' 2003 3.1]
 ['Texas' 2004 3.3]
 ['Texas' 2005 2.9]]


### Index Object

Index objects hold the axis labels, including a DataFrame's column names. Any sequence of labels used when constructing a Series or DataFrame is internally converted to an Index.

Unlike a Python set, a pandas Index can contain duplicate labels.

*Some Index methods and properties*
|Method/Property| Description|
|---|---|
|append() |Concatenate with additional Index objects, producing a new Index|
|difference() |Compute set difference as an Index|
|intersection() |Compute set intersection|
|union() |Compute set union|
|isin() |Compute Boolean array indicating whether each value is contained in the passed collection|
|delete() |Compute new Index with element at Index i deleted|
|drop() |Compute new Index by deleting passed values|
|insert() |Compute new Index by inserting element at Index i|
|is_monotonic_increasing |Returns True if each element is greater than or equal to the previous element (named is_monotonic in older pandas versions)|
|is_unique |Returns True if the Index has no duplicate values|
|unique() |Compute the array of unique values in the Index|
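
Some of the set-like methods from the table, sketched on two small made-up Index objects (`left`, `right` and `dup` are illustrative names):

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))         # labels found in either Index
print(left.intersection(right))  # labels found in both
print(left.difference(right))    # labels found only in 'left'
print(left.isin(["a", "c"]))     # Boolean mask, one entry per label

# Unlike a Python set, an Index may hold duplicate labels
dup = pd.Index(["x", "x", "y"])
print(dup.is_unique, list(dup.unique()))

# Index objects are immutable: item assignment raises TypeError
immutable = False
try:
    left[0] = "z"
except TypeError as exc:
    immutable = True
    print(f"TypeError: {exc}")
```

Each method returns a new object and leaves the original Index untouched.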

In [14]:
import numpy as np 
import pandas as pd 

obj = pd.Series(np.arange(3), index=["a","b","c"])
index = obj.index 
print(f"{index}")
# Index objects are immutable 

labels = pd.Index(np.arange(3))

obj2 = pd.Series([1.5, -3.4, 0], index=labels)
print(f"{obj2}")

Index(['a', 'b', 'c'], dtype='object')
0    1.5
1   -3.4
2    0.0
dtype: float64


## Essential Functionality

### Reindexing 

`reindex()` creates a new object with the values rearranged to align with the new index, introducing missing values for any labels that were not already present.

For ordered data, you can interpolate or fill values when reindexing, for example with `method="ffill"` (forward fill).

With a DataFrame, you can reindex the rows (index), the columns, or both.

*reindex function arguments*
|Argument|Description|
|---|---|
|labels| New sequence to use as an index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.|
|index |Use the passed sequence as the new index labels.|
|columns |Use the passed sequence as the new column labels.|
|axis |The axis to reindex, whether "index" (rows) or "columns". The default is "index". You can alternately do reindex(index=new_labels) or reindex(columns=new_labels).|
|method |Interpolation (fill) method; "ffill" fills forward, while "bfill" fills backward.|
|fill_value| Substitute value to use when introducing missing data by reindexing. The default is NaN, so absent labels get null values in the result unless you pass something else.|
|limit |When forward filling or backfilling, the maximum size gap (in number of elements) to fill.|
|tolerance| When forward filling or backfilling, the maximum size gap (in absolute numeric distance) to fill for inexact matches.|
|level |Match simple Index on level of MultiIndex.|
|copy |If True, always copy underlying data even if the new index is equivalent to the old index; if False, do not copy the data when the indexes are equivalent.|

You can also reindex with the `loc` operator, but this only works if all of the new index labels already exist in the DataFrame (whereas `reindex` inserts missing data for new labels).
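
The difference between the two approaches, as a small sketch. The `frame` here is a made-up example, and `fill_value=0` is used only to show how to replace the default NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     index=["a", "c", "d"], columns=["Ohio", "Texas"])

# reindex introduces the new label "b"; fill_value replaces the default NaN
filled = frame.reindex(index=["a", "b", "c"], fill_value=0)
print(filled)

# loc refuses labels that are not already present in the index
loc_failed = False
try:
    frame.loc[["a", "b", "c"]]
except KeyError as exc:
    loc_failed = True
    print(f"KeyError: {exc}")
```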

In [16]:
obj = pd.Series([4.5, 7.2, -5.2, 3.7])
print(f"{obj}")

obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(f"\n Reindex new object to: \n{obj2}")

# Reindex with ffill
obj3= pd.Series(["blue", "purple", "yellow"], index=[0,3,6])
print(f"\n\n{obj3}\nAnd reindex with method=ffill")
obj3.reindex(np.arange(9), method="ffill")

0    4.5
1    7.2
2   -5.2
3    3.7
dtype: float64

 Reindex new object to: 
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
dtype: float64

0      blue
3    purple
6    yellow
dtype: object
And reindex with method=ffill


0      blue
1      blue
2      blue
3    purple
4    purple
5    purple
6    yellow
7    yellow
8    yellow
dtype: object

In [21]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                     index=["a", "d", "c"],
                     columns=["Ohio", "Texas", "California"])
print(f"{frame}")

frame2 = frame.reindex(index=["a", "b", "c", "d"])
print(f"\nReindex in new object:\n{frame2}")

# Reindexing columns
states = ["Texas", "Utah", "California"]
print(f"\n{frame.reindex(columns=states)}")
#frame.reindex(states, axis="columns")

print(f"\nReindex with loc:"
      f"\n{frame.loc[['a','d','c'], ['California','Texas']]}")

   Ohio  Texas  California
a     0      1           2
d     3      4           5
c     6      7           8

Reindex in new object:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   6.0    7.0         8.0
d   3.0    4.0         5.0

   Texas  Utah  California
a      1   NaN           2
d      4   NaN           5
c      7   NaN           8

Reindex with loc:
   California  Texas
a           2      1
d           5      4
c           8      7


### Dropping Entries from an Axis

The `drop()` method returns a new object with the indicated values deleted from an axis.

With a DataFrame, you can drop rows with `index=`, columns with `columns=`, or pass the labels together with `axis`.

In [3]:
import pandas as pd 
import numpy as np 

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
print(f"Original object: \n{obj}")

dp_obj = obj.drop(["d", "c"])
print(f"new object with drop: \n{dp_obj}")

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=["a", "b", "c", "d"],
                     columns=["Ohio", "Texas", "California", "Utah"])

print(f"\nOriginal DataFrame: \n{frame}")

print(f"\nDataFrame dropping two index: \n{frame.drop(index=['b', 'd'])}")

# Equivalent: frame.drop(['California'], axis='columns') or axis=1
print(f"\nDataFrame dropping 'California' column:"
      f"\n{frame.drop(columns=['California'])}")

Original object: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
new object with drop: 
a    0.0
b    1.0
e    4.0
dtype: float64

Original DataFrame: 
   Ohio  Texas  California  Utah
a     0      1           2     3
b     4      5           6     7
c     8      9          10    11
d    12     13          14    15

DataFrame dropping two index: 
   Ohio  Texas  California  Utah
a     0      1           2     3
c     8      9          10    11

DataFrame dropping 'California' column:
   Ohio  Texas  Utah
a     0      1     3
b     4      5     7
c     8      9    11
d    12     13    15


###  Indexing, Selection, and Filtering

The preferred way to select by index is with the special `loc` and `iloc` operators: `loc` works with labels and `iloc` works with integer positions. Plain `[ ]` indexing treats integers as labels if the index contains integers, so its behaviour differs depending on the data type of the index.
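
A short sketch of the label/position distinction on a Series with an integer index. The Series `ser` is illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4.), index=[10, 20, 30, 40])

by_bracket = ser[10]        # integer index: [] looks up the LABEL 10
by_label = ser.loc[20]      # label 20
by_position = ser.iloc[1]   # second element, which is also label 20

# Label slices include the end point; positional slices do not
label_slice = ser.loc[10:30]     # labels 10, 20 and 30
position_slice = ser.iloc[0:2]   # positions 0 and 1 only

print(by_bracket, by_label, by_position)
print(label_slice)
print(position_slice)
```

Using `loc` and `iloc` explicitly keeps the code independent of whether the index happens to contain integers.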

In [10]:
import numpy as np 
import pandas as pd 

obj1 = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj2 = pd.Series(np.arange(4.), index=[10, 20, 30, 40])

print(f"Object1 loc[b:d]: \n{obj1.loc['b':'d']}")
print(f"\nObject2 iloc[[0,1,3]]: \n{obj2.iloc[[0, 1, 3]]}")
print(f"\nObject1 iloc[[0,1,3]]: \n{obj1.iloc[[0, 1, 3]]}")

Object1 loc[b:d]: 
b    1.0
c    2.0
d    3.0
dtype: float64

Object2 iloc[[0,1,3]]: 
10    0.0
20    1.0
40    3.0
dtype: float64

Object1 iloc[[0,1,3]]: 
a    0.0
b    1.0
d    3.0
dtype: float64


In [13]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=["Ohio", "Texas", "California", "Utah"],
                     columns=["a", "b", "c", "d"])

print(f"DataFrame: \n{frame}")

# indexing into DataFrame
print(f"\nColumn 'b': \n{frame['b']}")
print(f"\nColumns 'b','d': \n{frame[['b','d']]}")
print(f"\nRows with column 'c' > 5: \n{frame[frame['c'] > 5]}")

DataFrame: 
             a   b   c   d
Ohio         0   1   2   3
Texas        4   5   6   7
California   8   9  10  11
Utah        12  13  14  15

Column 'b': 
Ohio           1
Texas          5
California     9
Utah          13
Name: b, dtype: int32

Columns 'b','d': 
             b   d
Ohio         1   3
Texas        5   7
California   9  11
Utah        13  15

Rows with column 'c' > 5: 
             a   b   c   d
Texas        4   5   6   7
California   8   9  10  11
Utah        12  13  14  15


In [19]:
### Selection on DataFrame with 'loc' and 'iloc'

import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=["Ohio", "Texas", "California", "Utah"],
                     columns=["a", "b", "c", "d"])

print(f"California Data: \n{frame.loc['California']}")
print(f"\nIndexes 0 and 2: \n{frame.iloc[[0, 2]]}")
print(f"\nTexas and c, d columns: \n{frame.loc['Texas', ['c','d']]}")
print(f"\niloc[:,:3][frame.c > 5]: \n{frame.iloc[:,:3][frame.c > 5]}")

California Data: 
a     8
b     9
c    10
d    11
Name: California, dtype: int32

Indexes 0 and 2: 
            a  b   c   d
Ohio        0  1   2   3
California  8  9  10  11

Texas and c, d columns: 
c    6
d    7
Name: Texas, dtype: int32

iloc[:,:3][frame.c > 5]: 
             a   b   c
Texas        4   5   6
California   8   9  10
Utah        12  13  14


#### *Indexing options with DataFrame*

|Type|Notes|
|---|---|
|df[column] |Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
|df.loc[rows] |Select single row or subset of rows from the DataFrame by label|
|df.loc[:, cols] |Select single column or subset of columns by label|
|df.loc[rows, cols] |Select both row(s) and column(s) by label|
|df.iloc[rows] |Select single row or subset of rows from the DataFrame by integer position|
|df.iloc[:, cols] |Select single column or subset of columns by integer position|
|df.iloc[rows, cols] |Select both row(s) and column(s) by integer position|
|df.at[row, col] |Select a single scalar value by row and column label|
|df.iat[row, col] |Select a single scalar value by row and column position (integers)|
|reindex method |Select either rows or columns by labels|
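
The scalar accessors `at` and `iat` from the table are not used elsewhere in these notes, so here is a minimal sketch that reuses the same state/letter frame as the cells above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Texas", "California", "Utah"],
                     columns=["a", "b", "c", "d"])

by_label = frame.at["Texas", "c"]   # single scalar by row and column label
by_position = frame.iat[1, 2]       # the same cell by integer position

# Rows and columns together: the label slice includes "d",
# while the positional slice 1:4 stops before position 4
subset = frame.loc[["Ohio", "Utah"], "b":"d"]
same_subset = frame.iloc[[0, 3], 1:4]

print(by_label, by_position)
print(subset)
```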

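Integer indexing in pandas works differently from built-in Python data structures. With a Series such as `pd.Series(np.arange(3.))`, `ser[-1]` raises an error because `-1` is looked up as a label, whereas `ser.iloc[-1]` works. If the index is non-integer, such as 'a, b, c', `ser[-2]` has traditionally worked too, although recent pandas versions deprecate this positional fallback, so `iloc` is the safer choice.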
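
The pitfall can be reproduced with a few lines. The names `ser` and `ser2` are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

# With an integer index, [] looks up the label -1, which does not exist
bracket_failed = False
try:
    ser[-1]
except KeyError:
    bracket_failed = True
    print("ser[-1] raised KeyError")

# iloc is always positional, so negative positions behave like a Python list
last = ser.iloc[-1]
print(last)

# With a non-integer index, iloc is still unambiguous
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
second_to_last = ser2.iloc[-2]
print(second_to_last)
```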



In [23]:
### Modifying data from DataFrame

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=["Ohio", "Texas", "California", "Utah"],
                     columns=["a", "b", "c", "d"])

print(f"Original Data: \n{frame}")

# Assignment through loc/iloc modifies 'frame' in place
frame.loc[:, "a"] = 1
print(f"\nColumn 'a' = 1: \n{frame} ")

frame.iloc[2] = 5
print(f"\nRow 2 = 5: \n{frame} ")

# Note: 'x = frame.loc[...] = 1' would only bind x to the scalar 1,
# not to a view or copy of the modified data

Original Data: 
             a   b   c   d
Ohio         0   1   2   3
Texas        4   5   6   7
California   8   9  10  11
Utah        12  13  14  15

Column 'a' = 1: 
            a   b   c   d
Ohio        1   1   2   3
Texas       1   5   6   7
California  1   9  10  11
Utah        1  13  14  15 

Row 2 = 5: 
            a   b   c   d
Ohio        1   1   2   3
Texas       1   5   6   7
California  5   5   5   5
Utah        1  13  14  15 


### Arithmetic and Data Alignment

When you add two Series, any index labels that are not shared by both are still included: the index of the result is the union of the two indexes, with missing values at the labels that don't overlap. With DataFrames, alignment is performed on both rows and columns; the result contains the union of the labels of each DataFrame, and any cell whose row or column label is not found in both objects will be NaN.
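
For the Series case, a minimal sketch with two made-up Series whose indexes only partly overlap:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=["a", "c", "e", "f", "g"])

# The result index is the union of both indexes; labels present in only
# one of the two Series ("d", "f", "g") produce NaN
total = s1 + s2
print(total)
```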

*Flexible arithmetic methods*
|Method|Description|
|---|---|
|add, radd |Methods for addition (+)|
|sub, rsub |Methods for subtraction (-)|
|div, rdiv |Methods for division (/)|
|floordiv, rfloordiv |Methods for floor division (//)|
|mul, rmul |Methods for multiplication (*)|
|pow, rpow |Methods for exponentiation (**)|

In that table, each method has a counterpart starting with *r*, meaning that the arguments are reversed. For example, `1 / df1` is equivalent to `df1.rdiv(1)`.
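
The equivalence can be checked directly. The small `df1` below is illustrative and avoids zeros so that no division by zero occurs.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list("abc"))

reciprocal_op = 1 / df1           # operator form
reciprocal_method = df1.rdiv(1)   # reversed-argument method form

print(reciprocal_op)
print(reciprocal_op.equals(reciprocal_method))
```

The method form is useful when you also need its extra arguments, such as `fill_value` or `axis`.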

In [24]:
import numpy as np 
import pandas as pd
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df1 + df2
# Only the 'Ohio' and 'Texas' rows and the 'b' and 'd' columns exist in both

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


In [27]:
import numpy as np 
import pandas as pd
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                    columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                    columns=list("abcde"))
print(df1+df2)
# Filling df1 + df2 NaN values
df1.add(df2, fill_value=0)

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


#### Operations between DataFrame and Series

With NumPy arrays, subtracting a one-dimensional array from a two-dimensional array performs the operation once for each row (this is referred to as broadcasting). Operations between a DataFrame and a Series are similar: by default, the index of the Series is matched against the columns of the DataFrame and the operation is broadcast down the rows. To broadcast over the columns instead, matching on the rows, use one of the arithmetic methods such as `sub()` and pass the axis; the axis you pass is the axis to match on.
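
The NumPy behaviour described above, as a small sketch with an illustrative array:

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))
first_row = arr[0]            # one-dimensional, shape (4,)

# The 1D array is subtracted from every row of the 2D array
row_diff = arr - first_row
print(row_diff)

# Subtracting a column instead needs a (3, 1) shape so that it
# broadcasts across the columns
first_col = arr[:, :1]
col_diff = arr - first_col
print(col_diff)
```

The second case is the NumPy counterpart of passing `axis="index"` to a DataFrame arithmetic method.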

In [29]:
import numpy as np 
import pandas as pd 

frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series1 = frame.iloc[0]
series2 = frame["d"]
print(f"DataFrame:\n{frame}")
print(f"Series1: \n{series1}")
print(f"\nDataFrame - Series1: \n{frame-series1}")

print(f"\nSubstracting from axis='index':\n{frame.sub(series2, axis='index')}")


DataFrame:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
Series1: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

DataFrame - Series1: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

Substracting from axis='index':
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


### Function Aplication and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.

You can apply a function that operates on one-dimensional arrays to each column or row using the DataFrame's `apply()` method.

In [30]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def f1(x):
    return x.max() - x.min()

print(f"DataFrame:\n{frame}")
print(f"\nDataFrame with np.abs():\n{np.abs(frame)}")
print(f"\nApplying function (max - min):\n{frame.apply(f1)}")
print(f"\nApplying function (max - min) on columns:"
      f"\n{frame.apply(f1, axis='columns')}")

DataFrame:
               b         d         e
Utah    0.449770  1.675518 -0.045813
Ohio    0.970555  0.446453 -1.597612
Texas  -0.338836 -0.164289 -0.338806
Oregon -0.165150 -0.674075  1.129830

DataFrame with np.abs():
               b         d         e
Utah    0.449770  1.675518  0.045813
Ohio    0.970555  0.446453  1.597612
Texas   0.338836  0.164289  0.338806
Oregon  0.165150  0.674075  1.129830

Applying function (max - min):
b    1.309390
d    2.349593
e    2.727442
dtype: float64

Applying function (max - min) on columns:
Utah      1.721332
Ohio      2.568167
Texas     0.174546
Oregon    1.803905
dtype: float64


#### Formating with functions

In [33]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Extracting min and max values
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

## Formating the DataFrame
def format2(x):
    return f"{x:.2f}"

print(f"Calculating min and max from DF: \n{frame.apply(f2)}")
print(f"\nFormating two decimals on DF: \n{frame.map(format2)}")
print(f"\nFormating column 'e': \n{frame['e'].map(format2)}")

Calculating min and max from DF: 
            b         d         e
min -1.261568 -0.666499 -0.599034
max  1.761770  1.475769  0.132470

Formating two decimals on DF: 
            b      d      e
Utah    -1.26  -0.67   0.03
Ohio    -0.75   0.29   0.02
Texas    1.76   0.62   0.13
Oregon   0.17   1.48  -0.60

Formating column 'e': 
Utah       0.03
Ohio       0.02
Texas      0.13
Oregon    -0.60
Name: e, dtype: object


### Sorting and Ranking

To sort a dataset by labels, use the `sort_index()` method; to sort by values, use `sort_values()`. Sorting is ascending by default, which you can change with the `ascending=False` argument. When sorting values, missing values are placed at the end by default; the `na_position="first"` argument puts them first instead. With a DataFrame, you can pass a single column name, as in `sort_values("column")`, or a list of column names to sort by several columns.
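
The sorting options mentioned above, sketched on made-up data (`ser` and `frame` are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series([4, np.nan, 7, np.nan, -3, 2])

desc = ser.sort_values(ascending=False)            # NaN stays at the end
nan_first = ser.sort_values(na_position="first")   # NaN moved to the front

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
by_two = frame.sort_values(["a", "b"])   # sort by "a", then by "b"

print(desc)
print(nan_first)
print(by_two)
```

Note that `ascending` takes a Boolean (or a list of Booleans, one per column), not a string.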

Ranking assigns a rank to each value, starting from the lowest value by default. The `rank()` method returns a new Series or DataFrame in which the values are replaced by their corresponding ranks. For a DataFrame, you can compute ranks over the rows or the columns.

*Tie-breaking methods with rank*
|Method|Description|
|---|---|
|"average" |Default: assign the average rank to each entry in the equal group|
|"min" |Use the minimum rank for the whole group|
|"max" |Use the maximum rank for the whole group|
|"first" |Assign ranks in the order the values appear in the data|
|"dense" |Like method="min", but ranks always increase by 1 between groups rather than the number of equal elements in a group|
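
To compare the tie-breaking methods side by side, the following sketch ranks one Series with each method from the table. The `ranks` frame and its column names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({
    "value": obj,
    "average": obj.rank(),                # ties share the mean rank
    "min": obj.rank(method="min"),        # ties get the lowest rank in the group
    "max": obj.rank(method="max"),        # ties get the highest rank in the group
    "first": obj.rank(method="first"),    # ties broken by order of appearance
    "dense": obj.rank(method="dense"),    # like "min", but without gaps
})
print(ranks)
```

The two 7s and the two 4s are the tied groups, so those are the rows where the columns differ.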

In [40]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])

obj = pd.Series(np.arange(5), index=["d", "a", "b", "c", "e"])

print(f"Sorting obj index:\n{obj.sort_index()}")
print(f"\nSorting frame index: \n{frame.sort_index(axis='columns')}")
print(f"\nSorting frame values c: \n{frame.sort_values('c', ascending=False)}")


Sorting obj index:
a    1
b    2
c    3
d    0
e    4
dtype: int32

Sorting frame index: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

Sorting frame values c: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [45]:
import pandas as pd 
import numpy as np 

frame = pd.DataFrame(np.random.standard_normal((4, 4)),
                     index=["three", "one", "four", "two"],
                     columns=["d", "a", "b", "c"])

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

objr = obj.rank()
objframe = pd.DataFrame({"Original": obj, "Ranking": objr})

print(f"Comparison between original Object and Ranking:\n{objframe}")

print(f"\nOriginal DataFrame: \n{frame}")
print(f"\nDataFrame Ranking columns: \n{frame.rank(axis='columns')}")

Comparison between original Object and Ranking:
   Original  Ranking
0         7      6.5
1        -5      1.0
2         7      6.5
3         4      4.5
4         2      3.0
5         0      2.0
6         4      4.5

Original DataFrame: 
              d         a         b         c
three  0.127778  1.205931 -0.148343 -2.544499
one    0.207475 -1.224809 -0.987458  2.303895
four  -0.156729 -1.092226  0.618376 -0.479203
two   -1.468082  0.127942 -1.560428 -0.819110

DataFrame Ranking columns: 
         d    a    b    c
three  3.0  4.0  2.0  1.0
one    3.0  1.0  2.0  4.0
four   3.0  1.0  4.0  2.0
two    2.0  4.0  1.0  3.0


### Axis with Duplicate Labels
