# *pandas* Data Structures

*pandas* contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It is often used in tandem with numerical computing tools like *NumPy* and *SciPy*, analytical libraries such as *statsmodels* and *scikit-learn*, and data visualisation libraries like *matplotlib*. Whereas *NumPy* is best suited to homogeneous numerical array data, *pandas* is designed for working with tabular or heterogeneous data.

To better appreciate the use of *pandas*, I will introduce its two workhorse data structures - *Series* and *DataFrame*.

First, we need to import the `pandas` library:

In [1]:
import pandas as pd   # we first need to import pandas to use its library
import numpy as np

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. You can think of a Series as a fixed-length, ordered `dict`, as it is a mapping of index values to data values.

### Creating Series

In [2]:
# generating a simple series
obj = pd.Series([4, 7, -5, 3])
print(obj)
print()  # prints a blank line between results

# getting the values of a Series, in array form
print(obj.values)
print()

# this returns a range of numbers
print(obj.index)
print()

# we can generate a series with predefined labels
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

0    4
1    7
2   -5
3    3
dtype: int64

[ 4  7 -5  3]

RangeIndex(start=0, stop=4, step=1)

d    4
b    7
a   -5
c    3
dtype: int64


### Label Indexing

Compared with NumPy arrays, a pandas Series lets you use labels in the index to select a single value or a set of values:

In [3]:
# selecting value of label 'a'
print(obj2['a'])

# same result via positional indexing: retrieve the value in the 3rd row
print(obj2.iloc[2])

# selecting multiple values via list of corresponding labels
print(obj2[['c', 'a', 'd']])

# assigning value to label d and overwrite old value
obj2['d'] = 6
obj2

-5
-5
c    3
a   -5
d    4
dtype: int64


d    6
b    7
a   -5
c    3
dtype: int64
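
Because a Series can be indexed both by label and by position, it can help to state the intent explicitly. The sketch below is an addition to the notebook: it rebuilds the example Series under the illustrative name `s` and uses `loc` for label lookups and `iloc` for positional lookups.

```python
import pandas as pd

# Rebuild the example Series; `s` is an illustrative name, not from the notebook
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = s.loc['a']        # label-based lookup
by_position = s.iloc[2]      # position-based lookup (third entry)
subset = s.loc[['c', 'a']]   # several labels at once, in the order requested

print(by_label, by_position)
print(subset)
```

Using `loc` and `iloc` avoids any ambiguity about whether an integer is meant as a label or as a position.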

### Scalar Multiplication, Boolean Indexing

A Series also supports NumPy-style operations such as filtering with a boolean condition, scalar multiplication, and applying NumPy functions:

In [4]:
# returning only values greater than zero in the series
print(obj2[obj2 > 0])
print()

# multiply every value by 2
print(obj2 * 2)
print()

np.exp(obj2)  # calculate exponential of every value

d    6
b    7
c    3
dtype: int64

d    12
b    14
a   -10
c     6
dtype: int64



d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

### Similar to In-Built `dict` in Python

As noted earlier, a Series behaves like a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

In [5]:
# check for b index/label in obj2
print('b' in obj2)
print()

# creating a Series from a defined dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)
print()

# override order of how keys would appear in the Series
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)

True

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


In the above example, no value for `California` was found in `sdata`, so a `NaN` (missing value) is generated in its place. Also notice that `obj3` includes the value for `Utah` but `obj4` does not, because `Utah` was not in the `states` list passed as the index.

`pandas` also provides functions which can easily detect missing values in a Series:

In [6]:
print(pd.isnull(obj4))
print()

# the instance method does the same as above
print(obj4.isnull())
print()

# the opposite of isnull() is notnull()
obj4.notnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool



California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
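
Once missing values have been detected, a natural next step is to filter them out or replace them. The following is a minimal sketch, not part of the original notebook: the Series `populations` is an illustrative stand-in for `obj4`, and `dropna` and `fillna` are standard Series methods introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for obj4 from the example above
populations = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                        index=['California', 'Ohio', 'Oregon', 'Texas'])

present = populations[populations.notnull()]  # boolean mask keeps non-missing entries
dropped = populations.dropna()                # same result using dropna
filled = populations.fillna(0)                # replace missing entries with 0

print(dropped)
print(filled)
```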

### Arithmetic Operations on Series

A useful Series feature is that it automatically aligns by index label in arithmetic operations:

In [7]:
print(f"obj3:\n{obj3}\n")
print(f"obj4:\n{obj4}\n")

obj3 + obj4

obj3:
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

obj4:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64



California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

**Note:** When adding objects with different indexes, the result's index is the union of the two indexes.

Also, `Utah` now has a missing value (represented by `NaN`) because `Utah` is not present in `obj4`. To understand why, remember that `NaN` is a special floating-point value: almost any arithmetic operation involving it yields `NaN`. For `Utah`, the operation adds its value in `obj3` (5000) to the corresponding value in `obj4`, which is missing, so the result is `NaN`. `California` is `NaN` for the same reason: it is missing from `obj3` and is `NaN` in `obj4`.

In [8]:
# illustration of NaN value
a = float("NaN")
a ** 4

nan
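
If you would rather treat a label that is missing on one side as zero, the `add` method accepts a `fill_value`. The sketch below is an addition to the notebook; `left` and `right` are illustrative names that mirror `obj3` and `obj4`.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
left = pd.Series(sdata)                                                   # mirrors obj3
right = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])  # mirrors obj4

# A label missing on one side only is treated as 0 before adding.
# A label missing on both sides (California here) still gives NaN.
total = left.add(right, fill_value=0)
print(total)
```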

## DataFrame

A DataFrame represents a table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index. Data in a DataFrame is stored as one or more two-dimensional blocks rather than a list, dict or some other collection of one-dimensional arrays.

### Creating DataFrames

To construct a DataFrame:

In [9]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
       }
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


***Note***: *in Jupyter notebook, `pandas` DataFrame objects will be displayed as a more browser-friendly HTML table as an output.*

Specifying a sequence of columns will arrange the columns in that order when generating the DataFrame. You can also override the default index labels:

In [10]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


Notice in the above example that if you pass a column that isn't found in the `dict` data, the corresponding values for that column will be initialised with `NaN` values in the DataFrame. 

Another common form of input is a nested dict of dicts, which pandas converts to a DataFrame directly:

In [11]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In the above example, `pandas` interprets the outer dict keys as the column labels and the inner keys as the row indices. The table below lists possible data inputs to the DataFrame constructor:

**Type** | **Notes**
--- | ---
2D ndarray | A matrix of data, passing optional row and column labels
`dict` of arrays, lists or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array | Treated as the "dict of arrays" case
`dict` of Series | Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed.
`dict` of `dicts` |  Each inner dict becomes a column; keys are unioned to form the row index as in the "dict of Series" case
List of `dicts` or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame's column labels
List of lists or tuples | Treated as the "2D ndarray" case
Another DataFrame | The DataFrame's indexes are used unless different ones are passed
NumPy MaskedArray | Like the "2D ndarray" case except masked values become NA/missing in the DataFrame result
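
Two of the input types in the table can be sketched as follows. This example is an addition to the notebook, and the data and the names `records`, `from_records`, `matrix` and `from_array` are made up for illustration.

```python
import numpy as np
import pandas as pd

# List of dicts: each dict becomes a row; the union of keys forms the columns
records = [{'state': 'Ohio', 'pop': 1.5},
           {'state': 'Nevada', 'pop': 2.4, 'year': 2001}]
from_records = pd.DataFrame(records)
print(from_records)   # 'year' is NaN for the first row

# 2D ndarray with optional row and column labels
matrix = np.arange(6).reshape((2, 3))
from_array = pd.DataFrame(matrix, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(from_array)
```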

### Sneak-Peek using `head` method

For larger DataFrames, you can use the `head` method, which displays only the first five rows by default:

In [12]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


### Indexing
A column in a DataFrame can be retrieved as a Series, which retains the column label as its name and the DataFrame's index labels, like this:

In [13]:
frame2['state']
# one        Ohio
# two        Ohio
# three      Ohio
# four     Nevada
# five     Nevada
# six      Nevada
# Name: state, dtype: object

frame2.year   # attribute access works similarly but is not encouraged

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

**Note**: `frame2[column]` in the above example works for any column name, but `frame2.column` only works when the column name is a valid Python variable name (Recall [chapter](https://github.com/colintanwh/python-basics/blob/master/variables.ipynb) on *Variables* in Python basics)
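
Attribute access can also fail for a less obvious reason: the column name may clash with an existing DataFrame attribute. The sketch below is an addition to the notebook, and the column `land area` is a made-up example. Here `pop` is also the name of a DataFrame method, so `frame.pop` does not return the column.

```python
import pandas as pd

# 'land area' is a hypothetical column used only to show a name with a space
frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'land area': [40.9, 109.8]})

by_key = frame['pop']        # always returns the column as a Series
by_attr = frame.pop          # returns the DataFrame.pop method, not the column
spaced = frame['land area']  # not reachable with dot notation at all

print(type(by_key).__name__)
print(callable(by_attr))
```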

### Assigning Values to a DataFrame

Columns can be modified by assignment. For example, the empty *debt* column can be assigned a scalar value, an array of values, or a Series. Note that in older versions of pandas the column returned by indexing could be a *view* on the underlying data rather than a copy, so in-place modifications to that Series could be reflected in the DataFrame. With Copy-on-Write (the default from pandas 3.0) they are not, so assign back to the DataFrame explicitly, as shown here:

In [14]:
# assigns all 'debt' values to 16.5
frame2['debt'] = 16.5  
print(f"{frame2}\n")

# assigns an array of floats from 0.0 to 5.0 to 'debt'
frame2['debt'] = np.arange(6.)
print(f"{frame2}\n")

# assigns a Series of values to 'debt'
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(f"{frame2}\n")

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN



In the example above, the Series' labels are aligned to the DataFrame's index on assignment; index labels with no matching value (here `one`, `three` and `six`) receive missing values.

When assigning a column that doesn't exist, a new column will be created in the DataFrame:

In [15]:
frame2['eastern'] = frame2.state == 'Ohio'  # True or False
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


### Deleting columns

The `del` keyword can be used to remove a column from a DataFrame:

In [16]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


### The Index Objects

`pandas`'s Index objects are responsible for holding the axis labels and other metadata. Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index. Index objects are immutable and thus can't be modified by the user:

In [17]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
# Index(['a', 'b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))  # creating index objects from pandas

obj2 = pd.Series([1.5, -2.5, 0], index=labels)  # generate series with specified index objects passed in
obj2
# 0    1.5
# 1   -2.5
# 2    0.0
# dtype: float64

0    1.5
1   -2.5
2    0.0
dtype: float64
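
The immutability mentioned above can be checked directly. The following sketch is an addition to the notebook: it tries to overwrite a label in place, which raises a `TypeError`, and then builds a new Index instead. The name `renamed` is illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'          # in-place modification is not allowed
    mutated = True
except TypeError as exc:
    mutated = False
    print(f"TypeError: {exc}")

# To "change" a label, build a new Index; the original is left untouched
renamed = index.insert(1, 'd').delete(2)
print(renamed)
print(index)
```

Immutability is what makes it safe to share one Index object among several data structures.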

A `pandas` Index can contain duplicate labels. Selections with duplicate labels will select all occurrences of that label:

In [18]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
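
To see the effect of duplicate labels on selection, here is a small sketch that is not in the original notebook. The Series `dup` is an illustrative object built on the same labels.

```python
import pandas as pd

dup = pd.Series(range(4), index=['foo', 'foo', 'bar', 'bar'])

all_foo = dup.loc['foo']      # a Series holding both 'foo' entries, not a scalar
print(all_foo)
print(dup.index.is_unique)    # reports whether the labels are all distinct
```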

Index objects have a number of methods and properties for set logic. Some useful ones are summarised below:

**Method** | **Description**
--- | ---
`append` | Concatenate with additional index objects, producing a new Index 
`difference` | Compute set difference as an Index
`intersection` | Compute set intersection
`union` | Compute set union
`isin` | Compute boolean array indicating if each value is contained in the passed collection
`delete` | Compute new Index with element at index `i` deleted
`drop` | Compute new Index by deleting passed values
`insert` | Compute new Index by inserting element at index `i`
`is_monotonic_increasing` | Returns `True` if each element is greater than or equal to the previous element (the older `is_monotonic` alias was removed in pandas 2.0)
`is_unique` | Returns `True` if the index has no duplicate values
`unique` | Compute the array of unique values in the index
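
A few of the set-like methods in the table can be sketched as follows. This is an illustrative addition; the Index objects `left` and `right` are made up for the example.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['c', 'd'])

union = left.union(right)              # labels found in either Index
common = left.intersection(right)      # labels found in both
only_left = left.difference(right)     # labels in left but not in right
membership = left.isin(['a', 'c'])     # boolean array, one entry per label in left

print(union)
print(common)
print(only_left)
print(membership)
```

Each call returns a new object, which is consistent with Index objects being immutable.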

## Essential Functionality

In the following sections, we will delve more deeply into data analysis and manipulation topics using `pandas`. 

### Reindexing

`reindex` helps create a new object with the data conformed to a new index. Calling `reindex` will rearrange the data according to the new index, introducing missing values if any index values were not already present. Consider this example below:

In [19]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
# d    4.5
# b    7.2
# a   -5.3
# c    3.6
# dtype: float64


obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# e    NaN
# dtype: float64

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, you may want to interpolate or fill values when reindexing. The `method` option does this; for example, `ffill` forward-fills the values:

In [20]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3 = obj3.reindex(range(6), method='ffill')
obj3

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
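
For comparison, the `method` option also accepts `'bfill'`, which fills backwards from the next valid label. The sketch below is an addition to the notebook and uses the illustrative names `colours` and `back_filled`.

```python
import pandas as pd

colours = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Each new label takes the value of the next existing label;
# label 5 has nothing after it, so it stays missing
back_filled = colours.reindex(range(6), method='bfill')
print(back_filled)
```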

When passed only a sequence, the `reindex` method reindexes the rows. We can reindex the columns with the `columns` keyword:

In [21]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index = ['a', 'c', 'd'],
                    columns = ['Ohio', 'Texas', 'California'])
print(frame)

states = ['Texas', 'Utah', 'California']
frame = frame.reindex(['a', 'b', 'c', 'd'], columns = states)
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


### Dropping Entries

The `drop` method allows you to remove specified entries from a Series, and rows or columns from a DataFrame. It returns a new object with the indicated value or values deleted from an axis:

In [22]:
obj = pd.Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# e    4.0

new_obj = obj.drop('c')  # drop the entry labelled 'c' from the Series
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Dropping rows/columns from DataFrame:

In [23]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])
#           one  two  three  four
# Ohio        0    1      2     3
# Colorado    4    5      6     7
# Utah        8    9     10    11
# New York   12   13     14    15

# drop specified rows
a = data.drop(['Colorado', 'Ohio'])  
print(a)

# drop specified columns
b = data.drop(['two', 'four'], axis=1)
print(b)

# this does the same as dropping columns
c = data.drop(['two', 'four'], axis='columns')
print(c)

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


By default, `drop` removes rows based on the labels passed in, unless `axis=1` or `axis='columns'` is passed, which instructs pandas to remove columns instead of rows from the DataFrame.

Many functions, like `drop`, which modify the size or shape of a Series/DataFrame, can manipulate an object *in-place* without returning a new object:

In [24]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
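
The difference between the two calling styles is easy to miss, so here is a short sketch (an addition to the notebook, with illustrative variable names). Without `inplace=True` the original object is left alone; with it, the object is modified and the call returns `None`.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

trimmed = obj.drop('c')          # new object; obj still has five entries
size_before = len(obj)

result = obj.drop('d', inplace=True)   # modifies obj itself and returns None
print(size_before, len(obj), result)
```

Be careful with `inplace=True`, since the dropped data cannot be recovered from the object afterwards.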

### Indexing

#### Slicing

Like NumPy arrays, DataFrames support slicing, and they can also be sliced with labels. Slicing with labels behaves differently from normal Python slicing in that the end-point is inclusive. For example:

In [25]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['Ohio','Colorado','Utah','New York'], 
                    columns=['one', 'two', 'three', 'four'])

# corresponds to 2nd to 4th row
print(data['Colorado':'New York'])
print()

# using integer slicing omits the end-point
print(data[1:3])

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11


#### Boolean Indexing

In [26]:
# prints boolean Series of values under column 'three' greater than 5
a = data['three'] > 5
print(a)
print()

# passing the boolean Series back into the DataFrame selects the matching rows
b = data[data['three'] > 5]
print(b)
print()

# working on entire DataFrame
c = data < 5
print(c)
print()

# likewise, we pass the above result back in to see what is displayed
d = data[data < 5]
d

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False



Unnamed: 0,one,two,three,four
Ohio,0.0,1.0,2.0,3.0
Colorado,4.0,,,
Utah,,,,
New York,,,,


Notice in the last example that `NaN` values are displayed wherever the corresponding Boolean result of `data < 5` is `False`. This also makes it easy to assign new values to every position in the DataFrame that fulfils a condition, as below:

In [27]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
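
If you want the same replacement without modifying `data`, the `where` and `mask` methods return a new DataFrame. The sketch below is an addition to the notebook; `clipped` and `masked` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

clipped = data.where(data >= 5, 0)   # keep values where the condition holds, else 0
masked = data.mask(data < 5, 0)      # replace values where the condition holds with 0

print(clipped)
print(data.loc['Ohio', 'two'])       # the original DataFrame is unchanged
```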


### Selection with `loc` and `iloc`

For label-indexing on the rows of a DataFrame, there are two special indexing operators, `loc` and `iloc`. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation, using either axis labels (`loc`) or integers (`iloc`). For example:

In [28]:
# retrieve specified rows and columns
a = data.loc['Colorado', ['two', 'three', 'four']]
print(a)
print()

# retrieve specified row with all columns
b = data.loc['Utah']
print(b)
print()

# loc also accepts slices as arguments
c = data.loc[:'Utah',['two']]
print(c)

two      5
three    6
four     7
Name: Colorado, dtype: int64

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

          two
Ohio        0
Colorado    5
Utah        9


Performing similar selections using `iloc`:

In [29]:
# retrieve specified rows and columns
a = data.iloc[1, [1, 2, 3]]
print(a)
print()

# retrieve specified row with all columns
b = data.iloc[2]
print(b)
print()

# iloc also accepts slices as arguments
c = data.iloc[:, :3]
print(c)
print()

# as iloc's result is still a DataFrame, we can index it further by combining it with a condition like below
data.iloc[:, :3][data.three > 5]

two      5
three    6
four     7
Name: Colorado, dtype: int64

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

          one  two  three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14



Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


The indexing options for a DataFrame are summarised below:

**Type** | **Notes**
--- | ---
`df[val]` | Selects single column or sequence of columns; special case conveniences: boolean array, slice, or boolean DataFrame
`df.loc[val]` | Selects single row or subset of rows by label
`df.loc[:, val]` | Selects single column or subset of columns by label
`df.loc[val1, val2]` | Selects both rows and columns by label
`df.iloc[where]` | Selects single row or subset of rows by integer position
`df.iloc[:, where]` | Select single column or subset of columns by integer position
`df.iloc[where_i, where_j]` | Selects both rows and columns by integer position
`df.at[label_i, label_j]` | Select a single scalar value by row and column label
`df.iat[i, j]` | Selects a single scalar value by row and column position (integers)
`reindex` method | Select either rows or columns by labels
`get_value`, `set_value` methods | Selected a single value by row and column label; removed in pandas 1.0, so use `at` / `iat` instead
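
The scalar accessors `at` and `iat` from the table can be sketched as follows. This example is an addition to the notebook and rebuilds the 4x4 example DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.at['Utah', 'two']    # single value by row and column label
by_position = data.iat[2, 1]         # the same cell by integer position

data.at['Utah', 'two'] = 90          # at can also be used to set a single value
print(by_label, by_position, data.loc['Utah', 'two'])
```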

In [30]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [31]:
ser2 = pd.Series(np.arange(3.), index=['a','b','c'])
ser2.iloc[-1]  # last element by position

2.0

In [32]:
ser[:-1]

0    0.0
1    1.0
dtype: float64

In [33]:
np.random.rand(1, 2)

array([[0.98777208, 0.47525868]])

### Arithmetic and Data Alignment

In one of the earlier examples, I added two Series objects and the result's index was the union of the two indexes. Internal data alignment introduces missing values at label locations that don't overlap. In the case of a DataFrame, alignment is performed on both the rows and the columns:

In [34]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio','Texas','Oregon'])

print(f"{df1}\n")
print(f"{df2}\n")

df1 + df2

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0



Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In the above scenario, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

In [35]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


Even with `fill_value`, missing values still appear in locations that are missing in both `df1` and `df2`.

Relatedly, you can also specify a different fill value when reindexing a Series or DataFrame:

In [36]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,b,d,e
Ohio,0.0,2.0,0
Texas,3.0,5.0,0
Colorado,6.0,8.0,0


You can perform other arithmetic operations on DataFrame like below:

In [37]:
a = 1 / df1
print(f"{a}\n")

# this does the same
b = df1.rdiv(1)
print(f"{b}\n")

                 b         c      d
Ohio           inf  1.000000  0.500
Texas     0.333333  0.250000  0.200
Colorado  0.166667  0.142857  0.125

                 b         c      d
Ohio           inf  1.000000  0.500
Texas     0.333333  0.250000  0.200
Colorado  0.166667  0.142857  0.125



Some of the common arithmetic methods that can be performed on DataFrame are summarised below:

**Method** | **Description**
--- | ---
`add`, `radd` | Methods for addition (+)
`sub`, `rsub` | Methods for subtraction (-)
`div`, `rdiv` | Methods for division (/)
`floordiv`, `rfloordiv` | Methods for floor division (//)
`mul`, `rmul` | Methods for multiplication (\*)
`pow`, `rpow` | Methods for exponentiation (\*\*)
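
Each method in the table has a counterpart starting with `r` that swaps the operands. The sketch below is an addition to the notebook and uses a small made-up DataFrame `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('bcd'))

normal = df.sub(10)       # df - 10
flipped = df.rsub(10)     # 10 - df, operands swapped
floored = df.floordiv(2)  # same as df // 2

print(normal)
print(flipped)
print(floored)
```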

### Operations between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between a DataFrame and a Series is also defined. First, let's consider the difference between a two-dimensional array and one of its rows:

In [38]:
arr = np.arange(12.).reshape((3,4))
print(arr)
print(f"\n{arr[0]}")

# subtract the first row from every row of arr
arr - arr[0]

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]

[0. 1. 2. 3.]


array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

When we subtract `arr[0]` from `arr`, the subtraction is performed once for each row. This is referred to as *broadcasting*. Operations between a DataFrame and a Series are similar:

In [39]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
print(frame)
print(f"\n{series}")

frame - series

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


To broadcast over the columns instead, matching on the rows, use one of the arithmetic methods and pass `axis='index'`:

In [40]:
series2 = frame['d']
print(series2)

frame.sub(series2, axis = 'index')

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64


Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
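
It is worth seeing what happens without the `axis` argument. In the sketch below (an addition to the notebook, with illustrative variable names), the plain `-` operator matches the Series' index against the DataFrame's columns. No labels match, so every value in the result is missing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
col = frame['d']

by_rows = frame.sub(col, axis='index')   # matches on row labels, as intended
naive = frame - col                      # matches col's index against the column labels

# The axis='index' result agrees with plain NumPy broadcasting over a column vector
manual = frame.values - col.values[:, None]

print(by_rows)
print(naive.isna().all().all())
```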


### Function Application and Mapping

NumPy ufuncs (element-wise array functions) also work with pandas objects:

In [41]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
                    index=['Utah','Ohio','Texas', 'Oregon'])
print(frame)
np.abs(frame)

               b         d         e
Utah    0.839060 -0.499929  0.920003
Ohio    0.260242  1.604618  0.834174
Texas  -1.327622  0.625455 -1.422361
Oregon  0.563533  0.536349 -0.133018


Unnamed: 0,b,d,e
Utah,0.83906,0.499929,0.920003
Ohio,0.260242,1.604618,0.834174
Texas,1.327622,0.625455,1.422361
Oregon,0.563533,0.536349,0.133018


Another frequent operation is applying a function that works on one-dimensional arrays to each column or row. DataFrame's `apply` method does exactly this:

In [42]:
# computes difference between maximum and minimum of a Series
f = lambda x: x.max() - x.min()

# applying f to each column (the default, axis='index'), giving one result per column
print(frame.apply(f))

# applying f to each row, giving one result per row
frame.apply(f, axis='columns')

b    2.166681
d    2.104547
e    2.342364
dtype: float64


Utah      1.419932
Ohio      1.344376
Texas     2.047816
Oregon    0.696551
dtype: float64

Several common array statistics (e.g. `sum` and `mean`) are DataFrame methods, thus using `apply` is not necessary.
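
As a quick check of that point, the sketch below (an addition to the notebook, using fixed data rather than random numbers) compares `apply` with the equivalent built-in methods. It also shows that the function passed to `apply` may return a Series instead of a scalar. The helper `min_max` is a hypothetical name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()
via_apply = frame.apply(spread)            # one value per column
via_methods = frame.max() - frame.min()    # same result without apply

def min_max(x):
    # Hypothetical helper: returns two labelled values for each column
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)             # rows 'min' and 'max', one column per input column
print(via_apply)
print(summary)
```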

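Element-wise Python functions can also be used. For example, you can compute a formatted string from each floating-point value in a DataFrame using the `applymap` method (from pandas 2.1 the same operation is available as `DataFrame.map`, and `applymap` is deprecated):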

In [43]:
format = lambda x: '%.2f' % x  # note: this name shadows Python's built-in format()
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,0.84,-0.50,0.92
Ohio,0.26,1.60,0.83
Texas,-1.33,0.63,-1.42
Oregon,0.56,0.54,-0.13


**Note:** The `applymap` method should be used only for DataFrames. In the case of Series, you can perform similar mapping using the `map` method.

In [44]:
frame['e'].map(format)

Utah       0.92
Ohio       0.83
Texas     -1.42
Oregon    -0.13
Name: e, dtype: object

### Sorting and Ranking

