# Operating on Data in Pandas

NumPy provides fast element-wise operations, both for basic arithmetic (addition, subtraction, multiplication, etc.) and for more sophisticated operations (trigonometric, exponential, and logarithmic functions, etc.). Pandas inherits this functionality and adds two useful behaviors:

- Unary operations (negation, trigonometric functions, ...) __preserve index and column labels in the output__.
- Binary operations (addition, multiplication, ...) __automatically align indices__.

In [1]:
import pandas as pd
import numpy as np

### Ufuncs: Index Preservation

Any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.

In [8]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 20, 10))
ser

0     6
1    19
2    14
3    10
4     7
5     6
6    18
7    10
8    10
9     3
dtype: int64

In [9]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 7 | 7 | 2 | 5 |
| 1 | 4 | 1 | 7 | 5 |
| 2 | 1 | 4 | 0 | 9 |


If we apply a NumPy ufunc to either of these objects, the result is another Pandas object *with the indices preserved:*

In [10]:
np.exp(ser)

0    4.034288e+02
1    1.784823e+08
2    1.202604e+06
3    2.202647e+04
4    1.096633e+03
5    4.034288e+02
6    6.565997e+07
7    2.202647e+04
8    2.202647e+04
9    2.008554e+01
dtype: float64

In [11]:
np.sin(df * np.pi / 4)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.7071068 | -0.7071068 | 1.0 | -0.707107 |
| 1 | 1.224647e-16 | 0.7071068 | -0.707107 | -0.707107 |
| 2 | 0.7071068 | 1.224647e-16 | 0.0 | 0.707107 |
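Index preservation is easier to see with a non-default index. The sketch below applies a ufunc to a small labeled ``Series``; the labels and values are made up for illustration and are not part of the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled data, chosen only for illustration
labeled = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# Applying a NumPy ufunc returns a Series rather than a bare array
result = np.exp(labeled)

# The labels of the input carry over to the output unchanged
print(type(result).__name__)
print(result.index.equals(labeled.index))
```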


### Index Alignment

Pandas will align indices while performing binary operations on two ``Series`` or ``DataFrame`` objects. This is very convenient when working with incomplete data.

### Index alignment in Series

Suppose we are combining two different data sources and have only the top three US states by *area* and the top three US states by *population*:

In [12]:
area =       pd.Series({'Alaska': 1723337,      'Texas': 695662,   'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

What happens when we divide these to compute the population density? The result contains the *union* of the indices of the two input ``Series``. Any label that is missing from either input is marked with ``NaN``.

In [13]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [8]:
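# Index.union() replaces the `|` operator, which recent pandas versions no longer treat as a set operation
area.index.union(population.index)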

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
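To see which labels end up with values and which end up as ``NaN``, the following sketch rebuilds the two ``Series`` so that it runs on its own. The names ``density`` and ``shared`` are introduced here only for illustration.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

density = population / area

# Labels present in both inputs are the only ones with a numeric result
shared = area.index.intersection(population.index)
print(sorted(shared))
print(density.dropna().index.tolist())

# Labels present in just one input appear in the result as NaN
print(density[density.isna()].index.tolist())
```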

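If ``NaN`` values are not the desired behavior, the fill value can be changed by calling the corresponding method instead of the operator. For example, ``A.add(B)`` is equivalent to ``A + B``, but accepts an explicit ``fill_value`` for any elements of ``A`` or ``B`` that are missing: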

In [15]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [16]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
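One detail worth checking is what ``fill_value`` does when a value is missing on *both* sides. The sketch below uses two made-up ``Series`` (``C`` and ``D`` are hypothetical names) that each hold a ``NaN`` at label ``0``. The fill is applied only where exactly one side is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: both hold NaN at label 0
C = pd.Series([np.nan, 4, 6], index=[0, 1, 2])
D = pd.Series([np.nan, 3, 5], index=[0, 2, 3])

filled = C.add(D, fill_value=0)

# Label 0: missing on both sides, so the result stays NaN
# Label 1: only in C, so D is treated as 0
# Label 3: only in D, so C is treated as 0
print(filled)
```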

### Index alignment in DataFrames

Index alignment takes place for *both* columns and indices when working with ``DataFrame``s:

In [17]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),  columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),  columns=list('BAC'))
A

|   | A | B |
|---|---|---|
| 0 | 11 | 16 |
| 1 | 9 | 15 |


In [18]:
B

|   | B | A | C |
|---|---|---|---|
| 0 | 2 | 6 | 3 |
| 1 | 8 | 2 | 4 |
| 2 | 2 | 6 | 4 |


In [19]:
A + B

|   | A | B | C |
|---|---|---|---|
| 0 | 17.0 | 18.0 | NaN |
| 1 | 11.0 | 23.0 | NaN |
| 2 | NaN | NaN | NaN |


Indices are aligned regardless of their order in the two objects. Indices in the result are sorted.
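A minimal sketch of this behavior, using two small hypothetical frames (``left`` and ``right``) whose columns are deliberately listed in different orders:

```python
import pandas as pd

# Hypothetical frames; the column orders differ on purpose
left = pd.DataFrame([[1, 2]], columns=['B', 'A'])
right = pd.DataFrame([[10, 20, 30]], columns=['C', 'A', 'B'])

total = left + right

# Values are matched by column label, not by position,
# and the resulting columns come back in sorted order
print(total.columns.tolist())
print(total)
```

Column ``C`` exists only in ``right``, so it is ``NaN`` in the sum.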

We can pass a ``fill_value`` to be used in place of missing entries. Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [20]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

|   | A | B | C |
|---|---|---|---|
| 0 | 17.0 | 18.0 | 15.75 |
| 1 | 11.0 | 23.0 | 16.75 |
| 2 | 18.75 | 14.75 | 16.75 |



The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |


### Operations Between DataFrame and Series

Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and a one-dimensional NumPy array.

In [21]:
A = rng.randint(10, size=(3, 4))
A

array([[8, 6, 1, 3],
       [8, 1, 9, 8],
       [9, 4, 1, 3]])

In [22]:
A - A[0]

array([[ 0,  0,  0,  0],
       [ 0, -5,  8,  5],
       [ 1, -2,  0,  0]])
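The subtraction above relies on NumPy broadcasting. The sketch below spells out the shapes involved and, for contrast, shows one way to get the column-wise counterpart in plain NumPy. The array is re-created literally so the block runs on its own, and the names ``row_wise`` and ``col_wise`` are introduced here for illustration.

```python
import numpy as np

A = np.array([[8, 6, 1, 3],
              [8, 1, 9, 8],
              [9, 4, 1, 3]])

# A[0] has shape (4,); against shape (3, 4) it is subtracted from every row
row_wise = A - A[0]
print(A[0].shape, row_wise.shape)

# To subtract a column instead, give it a trailing axis: shape (3, 1)
col_wise = A - A[:, 1][:, np.newaxis]
print(col_wise)
```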

According to NumPy's broadcasting rules, __subtraction between a 2D array and one of its rows is applied row-wise__. Pandas has the same default convention.

In [23]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | -5 | 8 | 5 |
| 2 | 1 | -2 | 0 | 0 |


__Use the ``axis`` keyword to operate column-wise.__

In [24]:
df.subtract(df['R'], axis=0)

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 2 | 0 | -5 | -3 |
| 1 | 7 | 0 | 8 | 7 |
| 2 | 5 | 0 | -3 | -1 |
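One way to read ``axis=0`` is "match the ``Series`` against the row index instead of the column labels". The sketch below rebuilds the frame literally and checks two equivalent spellings. The transposed version is shown only to illustrate the idea and is not a recommended idiom.

```python
import pandas as pd

df = pd.DataFrame([[8, 6, 1, 3],
                   [8, 1, 9, 8],
                   [9, 4, 1, 3]], columns=list('QRST'))

# axis=0 and axis='index' are two names for the same thing
col_wise = df.subtract(df['R'], axis=0)
by_name = df.subtract(df['R'], axis='index')
print(col_wise.equals(by_name))

# Transposing, subtracting with the default axis, and transposing back
# gives the same values
via_transpose = (df.T - df['R']).T
print((via_transpose == col_wise).all().all())
```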


Like the operations discussed earlier, these ``DataFrame``/``Series`` operations automatically align indices between the two objects.

In [25]:
halfrow = df.iloc[0, ::2]
halfrow

Q    8
S    1
Name: 0, dtype: int64

In [26]:
df - halfrow

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0.0 | NaN | 0.0 | NaN |
| 1 | 0.0 | NaN | 8.0 | NaN |
| 2 | 1.0 | NaN | 0.0 | NaN |
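If the ``NaN`` columns are unwanted, one option is to extend the ``Series`` to all columns before subtracting. The sketch below does this with ``reindex`` and a fill of ``0``. The choice of ``0`` is an assumption for illustration (it leaves the unmatched columns unchanged), and ``full_row`` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame([[8, 6, 1, 3],
                   [8, 1, 9, 8],
                   [9, 4, 1, 3]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]

# Extend halfrow to every column of df, using 0 where it has no entry
full_row = halfrow.reindex(df.columns, fill_value=0)

# Every column now has a matching label, so no NaN is introduced
result = df - full_row
print(result)
```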
