# Operating on Data in Pandas

In [2]:
import pandas as pd
import numpy as np

Pandas is designed to work with NumPy, so NumPy's universal functions (ufuncs) can be applied directly to Pandas objects. This keeps element-wise operations vectorized and efficient.

In [3]:
np.random.seed()

In [4]:
rng = np.random.RandomState(42)  # random number generator with a fixed seed
print(rng)
ser = pd.Series(rng.randint(0, 10, 4))  # Series with 4 random integers in [0, 10)
ser

RandomState(MT19937)


0    6
1    3
2    7
3    4
dtype: int32

In [7]:
df=pd.DataFrame(rng.randint(0,10,(3,4)),columns=['A','B','C','D'],index=['1','2','3'])
df

   A  B  C  D
1  6  3  8  2
2  4  2  6  4
3  8  6  1  3


In [13]:
print(np.exp(ser))  # np.exp acts element-wise, exactly as it would on a NumPy array, and keeps the index
print(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64
0    6
1    3
2    7
3    4
dtype: int32


In [14]:
np.sin(df*np.pi/4)

           A             B          C             D
0       -1.0     0.7071068        1.0          -1.0
1  -0.707107  1.224647e-16   0.707107    -0.7071068
2  -0.707107           1.0  -0.707107  1.224647e-16
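
A point the outputs above suggest, shown here as a minimal sketch: a ufunc applied to a Pandas object returns another Pandas object with the same labels. The data and the names `s`, `frame`, `exp_s` and `sin_frame` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the labels are chosen only for illustration.
s = pd.Series([0.0, 1.0, 2.0], index=["a", "b", "c"])
frame = pd.DataFrame([[0.0, 1.0], [2.0, 3.0]],
                     index=["x", "y"], columns=["A", "B"])

exp_s = np.exp(s)                      # unary ufunc on a Series
sin_frame = np.sin(frame * np.pi / 4)  # arithmetic plus ufunc on a DataFrame

# The results are still Pandas objects and carry the original labels.
print(type(exp_s).__name__, list(exp_s.index))
print(type(sin_frame).__name__, list(sin_frame.index), list(sin_frame.columns))
```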


## Index Alignment

When operating on two DataFrames or two Series, Pandas conveniently aligns their indices (row labels and column labels) while performing the operation.

### Alignment in Series

In [10]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,'California': 423967})
population = pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127})

Say we want to find the population density. We can do so by dividing the two Series defined above; the row labels are aligned automatically.

In [11]:
print(population/area)

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
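
The index of this result can be checked against the two inputs. The sketch below rebuilds the same two Series and compares the result's labels with the union and intersection of the input indexes; the names `density`, `all_labels` and `shared` are introduced here only for illustration.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})

density = population / area

# Labels present in either input, and labels present in both.
all_labels = area.index.union(population.index)
shared = area.index.intersection(population.index)

print(list(all_labels))                # every label appears in the result
print(sorted(density.dropna().index))  # only shared labels have a value
```

Labels found in only one of the inputs have nothing to be divided by, which is why they show up as NaN.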


Notice that entries whose label appears in only one of the two Series are filled with NaN. Instead, we can supply a fill value that stands in for the missing entries when the operation is performed.

In [23]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [25]:
A.add(B,fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
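
One detail worth checking, sketched here with made-up values: `fill_value` stands in for a value that is missing on one side only. Where both inputs are missing at the same label, the result stays NaN. The Series `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label 1 is NaN in both Series, label 2 exists only in `a`.
a = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, np.nan], index=[0, 1])

total = a.add(b, fill_value=0)
print(total)
# label 0: both present           -> ordinary sum
# label 1: missing on both sides  -> remains NaN
# label 2: missing only in `b`    -> `b` is treated as 0
```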

### Alignment in DataFrame

In [13]:
A=pd.DataFrame(rng.randint(0,20,(2,2)),columns=['A','B'])
A

    A   B
0   6  18
1  10  10


In [14]:
B=pd.DataFrame(rng.randint(0,10,(3,3)),columns=['B','A','C'])
B

   B  A  C
0  7  4  3
1  7  7  2
2  5  4  1


In [15]:
A+B

      A     B    C
0  10.0  25.0  NaN
1  17.0  17.0  NaN
2   NaN   NaN  NaN


Notice how the entries of the sum are automatically aligned both row-wise and column-wise.

We can fill the missing entries so that they take the mean of all values in A instead of a nonexistent entry.

In [16]:
themean=np.mean(np.array(A))
A.add(B,fill_value=themean)

      A     B     C
0  10.0  25.0  14.0
1  17.0  17.0  13.0
2  15.0  16.0  12.0
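
To make the arithmetic above reproducible without the random generator, the sketch below hard-codes the values of A and B shown in the outputs and computes the fill value with `DataFrame.to_numpy()`. That call is one possible alternative to `np.array(A)`, not the only way to get the mean.

```python
import pandas as pd

# Values copied from the outputs shown above.
A = pd.DataFrame([[6, 18], [10, 10]], columns=['A', 'B'])
B = pd.DataFrame([[7, 4, 3], [7, 7, 2], [5, 4, 1]], columns=['B', 'A', 'C'])

# Mean over every entry of A: (6 + 18 + 10 + 10) / 4
themean = A.to_numpy().mean()

# Wherever A has no entry (row 2, column C), themean is used in its place.
result = A.add(B, fill_value=themean)
print(themean)
print(result)
```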


In [58]:
A  # A.add(B) returns a new DataFrame; A itself is unchanged

    A   B
0   6  18
1  10  10


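| Python operator | Pandas method(s) |
|---|---|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |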
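
A quick way to confirm that each operator and its method form agree is to compare their results directly. The sketch below does this on a small made-up DataFrame; `df` and `checks` are illustrative names.

```python
import pandas as pd

# Hypothetical data used only to compare operators with their method forms.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

checks = {
    '+':  (df + 2).equals(df.add(2)),
    '-':  (df - 2).equals(df.sub(2)),
    '*':  (df * 2).equals(df.mul(2)),
    '/':  (df / 2).equals(df.truediv(2)),
    '//': (df // 2).equals(df.floordiv(2)),
    '%':  (df % 2).equals(df.mod(2)),
    '**': (df ** 2).equals(df.pow(2)),
}
print(checks)
```

The method forms are useful mainly because they accept extra keywords such as `fill_value` and `axis`, which the bare operators cannot take.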

### Alignment in Operations Between Series and DataFrames

Consider this operation, where we find the difference between a two-dimensional array and one of its rows.

In [60]:
A=rng.randint(10,size=(3,4))
A

array([[1, 3, 8, 1],
       [9, 8, 9, 4],
       [1, 3, 6, 7]])

In [61]:
A-A[0]

array([[ 0,  0,  0,  0],
       [ 8,  5,  1,  3],
       [ 0,  0, -2,  6]])

Clearly, A[0] is subtracted from every row of the array A (NumPy broadcasting). Pandas works the same way: by default, the operation is applied row-wise.

In [68]:
df=pd.DataFrame(A,columns=list('QRST'))
df-df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  8  5  1  3
2  0  0 -2  6


Say we want to operate column-wise instead of row-wise. We can do so with the methods listed previously, specifying the axis keyword.

In [71]:
df.subtract(df['R'],axis=0)

   Q  R  S  T
0 -2  0  5 -2
1  1  0  1 -4
2 -2  0  3  4
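
These DataFrame/Series operations align on labels too, just like the earlier examples. As a minimal sketch, the code below rebuilds the same array and subtracts a Series that holds only every second column of the first row; `halfrow` and `diff` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the array A shown above.
A = np.array([[1, 3, 8, 1],
              [9, 8, 9, 4],
              [1, 3, 6, 7]])
df = pd.DataFrame(A, columns=list('QRST'))

# A Series carrying only the labels Q and S from the first row.
halfrow = df.iloc[0, ::2]

# Columns Q and S are matched by label; R and T have no partner and become NaN.
diff = df - halfrow
print(diff)
```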
