# Operations on data in Pandas

From: K.A.

To get started, we need to import the NumPy and Pandas libraries:

In [1]:
import pandas as pd
import numpy as np 


## Universal functions: index preservation

In [33]:
rng = np.random.RandomState(33)
ser = pd.Series(rng.randint(0, 9, 5))  # 5 random integers from 0 to 8 (the upper bound 9 is exclusive)
ser

0    4
1    7
2    8
3    2
4    2
dtype: int64

In [34]:
df = pd.DataFrame(rng.randint(0, 17, (4, 5)), columns=['A', 'B', 'C', 'D', 'F'])
df  # random integers from 0 to 16 (the upper bound 17 is exclusive), 4 rows and 5 labeled columns

    A   B   C   D   F
0   9   3   6  14  10
1  13   3   1  13  12
2   7  12  11  16  16
3  10   6  16   8   9


If you apply a NumPy universal function (ufunc) to either of these objects, the result is another Pandas object with the indexes preserved:


In [35]:
np.exp(ser)  # raises e (the base of the natural logarithm) to the power of each element of ser

0      54.598150
1    1096.633158
2    2980.957987
3       7.389056
4       7.389056
dtype: float64

In [36]:
np.cos(df * np.pi / 3)  # cosine of each element multiplied by pi/3; row and column labels are kept


     A    B    C    D    F
0 -1.0 -1.0  1.0 -0.5 -0.5
1  0.5 -1.0  0.5  0.5  1.0
2  0.5  1.0  0.5 -0.5 -0.5
3 -0.5  1.0 -0.5 -0.5 -1.0
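
Index preservation is easier to see with non-default labels. The following is a minimal sketch; the Series `s` and its labels are made up for illustration and are not part of the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string labels, chosen only for illustration
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

roots = np.sqrt(s)  # the ufunc is applied element by element

print(type(roots).__name__)          # the result is still a Pandas Series
print(roots.index.equals(s.index))   # the labels are carried over unchanged
```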


## Index alignment


When performing binary operations on two Series or DataFrame objects, Pandas will align the indexes during the operation. This is useful when dealing with incomplete data.


In [37]:
area = pd.Series({'Roses': 1723337, 'Tulips': 695662,'Apple': 423967}, name='area')
quantity = pd.Series({'Roses': 38332521, 'Tulips': 26448193,'Orange': 19651127}, name='quantity') 

Let's see what happens if we divide the second Series by the first:

In [38]:
quantity / area  # element-wise division, with elements matched by index label

Apple           NaN
Orange          NaN
Roses     22.243195
Tulips    38.018740
dtype: float64

The resulting Series contains the union of the indexes of the two original Series:

In [39]:
area.index.union(quantity.index)  # union of the two sets of index labels

Index(['Apple', 'Orange', 'Roses', 'Tulips'], dtype='object')

Any label that is present in only one of the two Series gets the value NaN ("Not a Number"), which is how Pandas marks missing data. Index matching works the same way for all of Python's built-in arithmetic operators: any missing values are filled with NaN by default.
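
As a small check of this rule, the sketch below separates the labels shared by both Series from the labels that end up as NaN. It reuses the `area` and `quantity` data from above; the names `ratio`, `shared` and `missing` are only illustrative.

```python
import pandas as pd

area = pd.Series({'Roses': 1723337, 'Tulips': 695662, 'Apple': 423967}, name='area')
quantity = pd.Series({'Roses': 38332521, 'Tulips': 26448193, 'Orange': 19651127}, name='quantity')

ratio = quantity / area

# Labels present in both inputs: these receive a real number
shared = area.index.intersection(quantity.index)
# Labels whose result is NaN: each appears in only one of the inputs
missing = ratio[ratio.isna()].index

print(sorted(shared))
print(sorted(missing))
```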

In [40]:
A = pd.Series([8, 14, 2], index=[0, 1, 2]) 
B = pd.Series([13, 17, 4], index=[1, 2, 3]) 
A + B 

0     NaN
1    27.0
2    19.0
3     NaN
dtype: float64

If NaN values are undesirable, you can use the corresponding object method in place of the operator and supply a fill value. For example, calling A.add(B) is equivalent to A + B, but it also accepts an optional fill_value that is substituted for any element missing from A or B:

In [41]:
A.add(B, fill_value=0)

0     8.0
1    27.0
2    19.0
3     4.0
dtype: float64
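
The other arithmetic operators have method counterparts that accept `fill_value` in the same way. The following sketch uses `sub` and `mul` on the same two Series; the fill values 0 and 1 are an illustrative choice (the neutral elements of subtraction and multiplication), not a recommendation.

```python
import pandas as pd

A = pd.Series([8, 14, 2], index=[0, 1, 2])
B = pd.Series([13, 17, 4], index=[1, 2, 3])

# Method counterparts of the - and * operators
diff = A.sub(B, fill_value=0)   # like A - B, with missing elements treated as 0
prod = A.mul(B, fill_value=1)   # like A * B, with missing elements treated as 1

print(diff)
print(prod)
```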

### Aligning indexes in DataFrame objects

When performing operations on DataFrame objects, the same alignment occurs for both columns and indexes:

In [42]:
A = pd.DataFrame(rng.randint(0, 17, (2, 2)),columns=list('AK'))
A 

    A  K
0   3  7
1  10  3


In [43]:
B = pd.DataFrame(rng.randint(0, 17, (3, 3)),columns=list('ABK'))
B

    A   B   K
0   7   3  11
1  11   0  16
2  12  14  11


In [44]:
A + B

      A   B     K
0  10.0 NaN  18.0
1  21.0 NaN  19.0
2   NaN NaN   NaN


Rows and columns are aligned by label regardless of their order in the two objects, and the index and columns of the result are sorted. As with Series objects, you can use the corresponding arithmetic methods and pass any desired *fill_value* to be used in place of missing entries.
For example, we can fill the missing values with the mean of all values in A (which we compute by first stacking the columns of A into a single Series using the *stack* method):

In [45]:
fill = A.stack().mean()        
A.add(B, fill_value=fill)

       A      B      K
0  10.00   8.75  18.00
1  21.00   5.75  19.00
2  17.75  19.75  16.75
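
That alignment does not depend on the order of rows or columns can be checked directly. The sketch below rebuilds the two frames from the values printed above and adds a reordered copy of B; the name `B_shuffled` is only illustrative.

```python
import pandas as pd

# Same values as the frames A and B shown above
A = pd.DataFrame({'A': [3, 10], 'K': [7, 3]})
B = pd.DataFrame({'A': [7, 11, 12], 'B': [3, 0, 14], 'K': [11, 16, 11]})

# Reorder the rows and columns of B; the labels travel with the data
B_shuffled = B.loc[[2, 0, 1], ['K', 'B', 'A']]

# Both sums are aligned by label, so they should hold the same values
same = (A + B).equals(A + B_shuffled)
print(same)
```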


### Performing operations between DataFrame and Series objects


Operations between a DataFrame and a Series align columns and indexes in the same way. They behave much like operations between a two-dimensional and a one-dimensional NumPy array.

Example: Calculating the difference of a two-dimensional array and one of its rows

In [46]:
A = rng.randint(17, size=(3, 4))        
A 

array([[ 1, 15,  4,  1],
       [11,  1, 15, 15],
       [14, 14,  8, 10]])

In [47]:
A - A[0]

array([[  0,   0,   0,   0],
       [ 10, -14,  11,  14],
       [ 13,  -1,   4,   9]])

According to NumPy's broadcasting rules, subtracting one row from a two-dimensional array is applied row by row.
In Pandas, subtraction also operates row by row by default:

In [48]:
df = pd.DataFrame(A, columns=list('PRST'))      
df - df.iloc[0]

    P   R   S   T
0   0   0   0   0
1  10 -14  11  14
2  13  -1   4   9


If you want to operate column by column instead, you can use the object methods mentioned above and specify the axis:

In [49]:
df.subtract(df['S'], axis=0)

   P   R  S  T
0 -3  11  0 -3
1 -4 -14  0  0
2  6   6  0  2
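
For comparison, the same column-wise subtraction can be written in plain NumPy by keeping the selected column two-dimensional so that it broadcasts across the columns. This is a sketch using the array values printed above; the names `by_col_np` and `by_col_pd` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the array A shown above
A = np.array([[1, 15, 4, 1],
              [11, 1, 15, 15],
              [14, 14, 8, 10]])
df = pd.DataFrame(A, columns=list('PRST'))

# NumPy: A[:, [2]] has shape (3, 1), so it is broadcast across all columns
by_col_np = A - A[:, [2]]
# Pandas: axis=0 matches the index of the Series against the row labels
by_col_pd = df.subtract(df['S'], axis=0)

print(np.array_equal(by_col_np, by_col_pd.to_numpy()))
```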


This preservation and alignment of indexes and columns means that operations on data in Pandas always maintain the data context, which prevents the kinds of errors that can arise when working with heterogeneous or misaligned data in raw NumPy arrays.
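
To make the alignment concrete, the sketch below subtracts a Series that covers only some of the columns. It reuses the array values from the example above; the name `halfrow` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the array A shown above
A = np.array([[1, 15, 4, 1],
              [11, 1, 15, 15],
              [14, 14, 8, 10]])
df = pd.DataFrame(A, columns=list('PRST'))

# Take every second element of the first row: a Series labeled 'P' and 'S' only
halfrow = df.iloc[0, ::2]

# The Series labels are matched against the column names;
# columns with no matching label ('R' and 'T') become NaN
result = df - halfrow
print(result)
```

Columns P and S are subtracted as before, while R and T are filled with NaN because `halfrow` has no entries for them.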