<h1>Operating on Data in Pandas</h1>

In [1]:
# Pandas inherits much of its functionality from NumPy, including universal functions (ufuncs).
# With unary ufuncs, Pandas preserves index and column labels in the output. With binary operations such as
# addition and multiplication, Pandas automatically aligns indices when passing the objects to the ufunc.

import pandas as pd
import numpy as np

<h3>Ufuncs: Index Preservation</h3>

In [2]:
# A simple Series object
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0,10,4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [3]:
# A simple DataFrame object
df = pd.DataFrame(rng.randint(0,10,(3,4)),
                  columns=["A","B","C","D"]
                 )
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


In [4]:
# If we apply a NumPy ufunc to either of these objects, the result is another Pandas object with the indices
# preserved.
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [5]:
# A slightly more complex calculation
np.sin(df * np.pi/4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
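
Index preservation is easier to see with a non-default index. The following is a minimal sketch; the Series `values` and its labels are hypothetical and not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with string labels instead of the default integer index
values = pd.Series([0.0, 1.0, 4.0], index=["a", "b", "c"])

# np.sqrt is a ufunc; the result is still a Series and carries the same labels
roots = np.sqrt(values)
print(roots)
```

The result can still be addressed by label, for example `roots["c"]`.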


<h3>Ufuncs: Index Alignment</h3>

In [6]:
# For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of 
# performing the operation. 

<h4>Index Alignment in Series</h4>

In [7]:
# We are combining two different data sources
area = pd.Series({
    "Alaska":1723337,
    "Texas":695662,
    "California":423967
}, name="area")

population = pd.Series({
    "California":38332521,
    "Texas":26448193,
    "New York":19651127
}, name="population")

In [8]:
# Divide population by area to compute population density; the result's index is the union of both indices
population/area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
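
If only the labels present in both Series are of interest, one option is to restrict both operands to the shared labels before dividing. This is a sketch of one possible approach using `Index.intersection`; the variable names `common` and `density` are introduced here for illustration.

```python
import pandas as pd

area = pd.Series({"Alaska": 1723337, "Texas": 695662, "California": 423967}, name="area")
population = pd.Series({"California": 38332521, "Texas": 26448193, "New York": 19651127},
                       name="population")

# Labels that appear in both indices
common = area.index.intersection(population.index)

# Dividing the restricted Series produces no NaN entries
density = population[common] / area[common]
print(density)
```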

In [9]:
# The resulting index is the union of the indices of the two input Series, which we can compute with the
# Index.union method (using the | operator on Index objects as a set operation is deprecated in recent pandas)
area.index.union(population.index)


Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [10]:
# Any item for which one or the other does not have an entry is marked with NaN ("not a number"), which
# is how Pandas marks missing data.

# Index matching works this way for all of Python's built-in arithmetic operators. Any
# missing values are filled in with NaN by default.
A = pd.Series([2,4,6],index=[0,1,2])
B = pd.Series([1,3,5], index=[1,2,3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [11]:
# If NaN is not the desired behaviour, we can specify a fill value by using the appropriate object method
# in place of the operator.
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
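
The other arithmetic operators have method counterparts that accept `fill_value` in the same way. A minimal sketch with `sub` and `mul`, reusing the two Series from above; the names `diff` and `prod` are introduced here for illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Without fill_value, the method form matches the operator form
assert A.add(B).equals(A + B)

# Missing entries are treated as 0 before subtracting
diff = A.sub(B, fill_value=0)

# For multiplication, 1 is the neutral fill value
prod = A.mul(B, fill_value=1)

print(diff)
print(prod)
```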

In [12]:
A.add??

<h4>Index Alignment in DataFrame</h4>

In [13]:
# A similar type of alignment takes place for both columns and indices when you are performing operations on 
# DataFrames. 
A = pd.DataFrame(rng.randint(0,20,(2,2)),
                columns=list("AB"))
A

   A   B
0  1  11
1  5   1


In [14]:
B = pd.DataFrame(rng.randint(0,10,(3,3)), columns=list("BAC"))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [15]:
# Add the two DataFrames; rows and columns are aligned regardless of their order in the inputs
A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN


In [16]:
# We can use the associated object's arithmetic method and pass any desired fill_value to be used in place of
# missing entries. Here we fill with the mean of all values in A.
fill = A.stack().mean()
A.add(B, fill_value=fill)

      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5
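
Note that `fill_value` replaces an entry only when it is missing from one of the two operands. Where an entry is missing from both, the result stays NaN. The sketch below uses two small hypothetical frames, `left` and `right`, chosen so that some positions exist in neither.

```python
import pandas as pd

# Hypothetical frames with partially overlapping rows and disjoint columns
left = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"B": [10, 20]}, index=[1, 2])

result = left.add(right, fill_value=0)
print(result)

# (0, "A") exists only in left, so the missing right-hand entry is filled with 0.
# (0, "B") and (2, "A") exist in neither frame, so they remain NaN.
```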


<h3>Ufuncs: Operations between DataFrame and Series</h3>

In [18]:
# Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a
# one-dimensional NumPy array.
# Consider the difference between a two-dimensional array and one of its rows:
A = rng.randint(10, size=(3,4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [19]:
# Subtract the first row from every row
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

In [20]:
# According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is 
# applied row-wise. 
# In Pandas, the same convention applies by default
df = pd.DataFrame(A, columns=list("QRST"))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4


In [21]:
# To apply the same operation column-wise, use the object methods and specify the axis
df.subtract(df["R"], axis=0)

   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
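
The two axis settings can be compared side by side. This sketch rebuilds the same array explicitly (rather than from `rng`) so that it runs on its own; the names `row_wise` and `col_wise` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

A = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])
df = pd.DataFrame(A, columns=list("QRST"))

# axis="columns" matches the Series labels against the column labels (the operator default)
row_wise = df.sub(df.iloc[0], axis="columns")
assert row_wise.equals(df - df.iloc[0])

# axis="index" (same as axis=0) matches the Series labels against the row labels
col_wise = df.sub(df["R"], axis="index")

# The NumPy equivalent needs the column kept two-dimensional for broadcasting
print(A - A[:, [1]])
print(col_wise)
```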


In [22]:
# DataFrame/Series operations automatically align indices between the two objects
halfrow = df.iloc[0,::2]
halfrow

Q    3
S    2
Name: 0, dtype: int64

In [23]:
# Subtract halfrow from every row; columns missing from halfrow (R and T) become NaN
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN
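
If the columns absent from the Series should be left unchanged instead of becoming NaN, one possible approach is to reindex the Series to the DataFrame's columns with a fill value of 0 before subtracting. This is a sketch, not the only way to do it; the name `padded` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list("QRST"))
halfrow = df.iloc[0, ::2]

# Extend halfrow to all four columns, using 0 for the columns it lacks (R and T)
padded = halfrow.reindex(df.columns, fill_value=0)

# Q and S are shifted by the halfrow values; R and T keep their original values
result = df - padded
print(result)
```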


In [24]:
# The preservation and alignment of indices and columns means that operations on data in Pandas always maintain
# the data context.
# This prevents the kinds of errors that come up when working with heterogeneous and/or misaligned data
# in NumPy arrays.
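
The kind of error this avoids can be illustrated with two hypothetical Series, `a` and `b`, that hold the same labels in a different order. Pandas pairs values by label, while the underlying NumPy arrays are paired by position.

```python
import pandas as pd

# Hypothetical data: same labels, different order
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([30, 10, 20], index=["z", "x", "y"])

# Label-aligned addition: x -> 1 + 10, y -> 2 + 20, z -> 3 + 30
aligned = a + b

# Positional addition on the raw arrays pairs unrelated entries: 1 + 30, 2 + 10, 3 + 20
positional = a.to_numpy() + b.to_numpy()

print(aligned)
print(positional)
```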