# Operating on Data in Pandas

Pandas inherits much of its element-wise functionality from NumPy ufuncs, with a couple of twists. For unary operations, e.g. negation and trigonometric functions, ufuncs preserve index and column labels in the output; for binary operations, e.g. addition and multiplication, Pandas automatically aligns indices when passing the objects to the ufunc.

## Ufuncs: Index Preservation

Any NumPy ufunc will work on a Pandas Series or DataFrame object.

In [1]:
import pandas as pd
import numpy as np

In [2]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [4]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


Applying a NumPy ufunc to either of these objects returns another Pandas object with the indices preserved:

In [5]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [6]:
np.sin(df * np.pi/4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
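
Index preservation is easier to see with non-default labels. The following is a minimal sketch; the Series name temps and its labels are made up for illustration and are not part of the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical labelled data (labels chosen only for illustration)
temps = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# A unary ufunc returns a new Series rather than a bare array
result = np.exp(temps)

# The labels of the input are carried over unchanged
same_index = result.index.equals(temps.index)
is_series = isinstance(result, pd.Series)
print(result)
print(same_index, is_series)
```

The same check can be made on a DataFrame by comparing both its index and columns attributes before and after the ufunc call.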


## Ufuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas aligns the indices while performing the operation. This is very useful when working with incomplete data.

### Index Alignment in Series

In [8]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [9]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

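The resulting Series contains the union of the indices of the two inputs, with NaN marking any label that is missing from either one. The union can be computed directly from the two indices:

In [10]:
area.index.union(population.index)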

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
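
As a rough check of the union rule, the sketch below rebuilds the two Series from above and compares the labels of the result with the union and the intersection of the input indices. The variable names density, union and shared are introduced here only for illustration.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

density = population / area

# Labels present in either input
union = area.index.union(population.index)
# Labels present in both inputs
shared = area.index.intersection(population.index)

# Every label in the union appears in the result ...
print(set(density.index) == set(union))
# ... but only the shared labels carry a non-NaN value
print(set(density.dropna().index) == set(shared))
```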

In [11]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If NaN is not the desired behaviour, use the add() method in place of the + operator and pass an explicit fill_value for the missing entries:

In [12]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
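
One detail worth keeping in mind is that fill_value only substitutes for a value missing on one side. If a label has no value on either side, the result stays NaN. The sketch below uses made-up Series (A_gaps and B_gaps are hypothetical names) that each contain a NaN at label 2.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: both are missing a value at label 2
A_gaps = pd.Series([2, 4, np.nan], index=[0, 1, 2])
B_gaps = pd.Series([1, np.nan, 5], index=[1, 2, 3])

filled = A_gaps.add(B_gaps, fill_value=0)

# label 0: only A_gaps has a value  -> 2 + 0
# label 1: both have values         -> 4 + 1
# label 2: missing on both sides    -> stays NaN
# label 3: only B_gaps has a value  -> 0 + 5
print(filled)
```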

### Index Alignment in DataFrame

In [14]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [16]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [17]:
A+B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN


As with a Series, we can pass a fill_value if we don't want NaN entries. In this case we fill with the mean of all values in A (computed by first stacking the rows of A):

In [22]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5


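The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s) |
|-----------------|------------------|
| `+` | add() |
| `-` | sub(), subtract() |
| `*` | mul(), multiply() |
| `/` | truediv(), div(), divide() |
| `//` | floordiv() |
| `%` | mod() |
| `**` | pow() |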
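
To illustrate the correspondence between operators and methods, here is a small sketch reusing the two integer Series from the alignment example. The names via_operator, via_method and floor are introduced only for this illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The operator and the method give the same aligned result
via_operator = A * B
via_method = A.mul(B)
print(via_operator.equals(via_method))

# Only the method form accepts extra options such as fill_value
floor = A.floordiv(B, fill_value=1)
print(floor)
```

The method form is mainly useful when an option such as fill_value or axis is needed; otherwise the operator is usually the more readable choice.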

## Ufuncs: Operations Between DataFrame and Series

When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. These operations are similar to those between a two-dimensional and a one-dimensional NumPy array. For example, consider the difference of a two-dimensional array and one of its rows:

In [23]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [24]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise. In Pandas, the convention is the same by default:

In [33]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4


To operate column-wise instead, use the method form of the operator and specify the axis keyword:

In [34]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
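
The two axis choices can be compared side by side. The sketch below rebuilds the same DataFrame from literal values (so it does not depend on the random state) and, as an assumption about how one might check the result, compares the column-wise subtraction with the equivalent NumPy broadcasting expression.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))

# Default behaviour: the Series labels are matched against the columns
by_rows = df.sub(df.iloc[0], axis='columns')   # same as df - df.iloc[0]

# axis='index' (or axis=0): the Series labels are matched against the rows
by_cols = df.sub(df['R'], axis='index')

# Raw NumPy needs an explicit new axis to broadcast down the columns
np_cols = df.to_numpy() - df['R'].to_numpy()[:, np.newaxis]

print(by_rows)
print(by_cols)
```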


Note that these DataFrame/Series operations, like the operations discussed above, automatically align indices between the two elements:

In [35]:
halfrow = df.iloc[0, ::2]
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [36]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN


This means the context of the data is always maintained, which prevents the silly errors that can arise when working with heterogeneous and/or misaligned data in raw NumPy arrays.
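
As a closing illustration of the kind of error this avoids, the sketch below multiplies two Series whose labels are stored in different orders. The names prices and qty and their values are hypothetical. Pandas matches values by label, while the underlying NumPy arrays are combined purely by position.

```python
import pandas as pd

# Hypothetical data: same labels, stored in a different order
prices = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])
qty = pd.Series([1, 2, 3], index=['c', 'a', 'b'])

# Pandas pairs values by label: a -> 10*2, b -> 20*3, c -> 30*1
by_label = prices * qty

# Raw arrays pair values by position: 10*1, 20*2, 30*3
by_position = prices.to_numpy() * qty.to_numpy()

print(by_label)
print(by_position)
```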