# Operating on Data in Pandas

## A. Ufuncs in Pandas

Because `Pandas` is built on top of `NumPy`, NumPy's universal functions (ufuncs) work directly on Pandas `Series` and `DataFrame` objects, and the result keeps the index and column labels of the input.

In [2]:
# import the libraries 
import pandas as pd
import numpy as np

In [3]:
# let us create a 1D series object in Pandas 
series_1 = pd.Series(np.random.randint(50,size=15))
print(f'The series is:\n{series_1}')

# let us also create a dataframe object
df_1 = pd.DataFrame(data=np.random.randint(50, size=(4,4)),
                    columns=['A','B','C','D'])
print(f'The dataframe is :\n{df_1}')

The series is:
0     26
1     37
2     14
3     19
4     32
5     43
6     15
7     20
8      1
9      8
10    23
11    14
12    21
13     4
14    31
dtype: int32
The dataframe is :
    A   B   C   D
0  37  16  10  28
1  22  21  22  14
2  45  41  17  13
3  33  18  44   2


In [4]:
# NumPy ufuncs can be applied directly to a Series or a DataFrame
print(f'The series being squared is: \n{np.power(series_1,2)}')

# Element-wise exponential (e**x) of the dataframe
print(f'The exponential version of df is:\n{np.exp(df_1)}')

The series being squared is: 
0      676
1     1369
2      196
3      361
4     1024
5     1849
6      225
7      400
8        1
9       64
10     529
11     196
12     441
13      16
14     961
dtype: int32
The exponential version of df is:
              A             B             C             D
0  1.171914e+16  8.886111e+06  2.202647e+04  1.446257e+12
1  3.584913e+09  1.318816e+09  3.584913e+09  1.202604e+06
2  3.493427e+19  6.398435e+17  2.415495e+07  4.424134e+05
3  2.146436e+14  6.565997e+07  1.285160e+19  7.389056e+00
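
The outputs above keep the integer index of the inputs. A minimal sketch of the same behaviour with string labels, which makes the preserved index easier to see (the `readings` Series and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with string labels instead of the default integer index
readings = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# Applying a NumPy ufunc returns a pandas Series, not a bare ndarray
result = np.exp(readings)

print(type(result).__name__)  # Series
print(list(result.index))     # ['a', 'b', 'c']
print(result)
```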


## B. Ufuncs: Index Alignment
### B.1 Index Alignment in Series

In [5]:
# Series of state GDP (in millions of dollars), built from a dictionary
gdp = pd.Series({'California': 3134962, 'Texas': 1852682, 'New York': 1766405}, name='gdp')

# Series of population density, built from a dictionary
population = pd.Series({'California': 97, 'Texas': 41, 'Florida': 155}, name='population_density')

print(gdp)
print(population)

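# divide GDP by population density; the two Series share only some labels,
# so this shows index alignment (the ratio itself is not GDP per capita)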
print(gdp/population)

California    3134962
Texas         1852682
New York      1766405
Name: gdp, dtype: int64
California     97
Texas          41
Florida       155
Name: population_density, dtype: int64
California    32319.195876
Florida                NaN
New York               NaN
Texas         45187.365854
dtype: float64


So the result is indexed by the `union` of the two indices, and any label that is missing from either Series is marked as `NaN` in the result.
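
The alignment step can be made visible on its own. The sketch below rebuilds the two Series from the cell above (the second is renamed `density` here purely for readability) and uses `Index.union` and `Series.align` to show what happens before the division:

```python
import pandas as pd

gdp = pd.Series({'California': 3134962, 'Texas': 1852682, 'New York': 1766405})
# Same values as the `population` Series above; `density` is just an illustrative name
density = pd.Series({'California': 97, 'Texas': 41, 'Florida': 155})

# The result of a binary operation is indexed by the sorted union of both indices
union_index = gdp.index.union(density.index)
print(list(union_index))  # ['California', 'Florida', 'New York', 'Texas']

# align() reindexes both Series to that union, inserting NaN for absent labels
gdp_aligned, density_aligned = gdp.align(density)
print(gdp_aligned)
print(density_aligned)
```

Dividing `gdp_aligned` by `density_aligned` then gives `NaN` wherever either side is `NaN`, which matches the output of `gdp/population`.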

In [6]:
# the fill_value argument substitutes a value for labels missing from one Series, bypassing the NaN
# note: New York becomes inf because its missing divisor is filled with 0
gdp_percapita =  gdp.divide(population, fill_value=0)
gdp_percapita

California    3.231920e+04
Florida       0.000000e+00
New York               inf
Texas         4.518737e+04
dtype: float64
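
In the output above, Florida becomes 0 because its missing GDP is filled with 0, and New York becomes `inf` because its missing divisor is filled with 0. A small sketch of the `fill_value` rule with addition, using made-up Series `a` and `b`: the fill applies when a value is missing on one side only, and a label that is missing on both sides stays `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'z' is NaN in both Series, 'x' and 'w' exist in only one
a = pd.Series({'x': 1.0, 'y': 2.0, 'z': np.nan})
b = pd.Series({'y': 10.0, 'z': np.nan, 'w': 5.0})

total = a.add(b, fill_value=0)
print(total)
# 'w' and 'x' use 0 for the missing side, 'y' is a normal sum,
# and 'z' stays NaN because it is missing in both inputs
```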

### B.2 Index Alignment in DataFrame Objects

In [7]:
# DataFrame for area (in square kilometers)
area1_df = pd.DataFrame({'Area (km^2)': [423967, 695662, 170312]},
                       index=['California', 'Texas', 'Florida'])

# A second DataFrame of areas (in square kilometers), indexed by a different set of states
area2_df = pd.DataFrame({'Area (km^2)': [423967, 695662, 180312]},
                          index=['California', 'Texas', 'New York'])

In [8]:
area1_df.add(area2_df)

            Area (km^2)
California     847934.0
Florida             NaN
New York            NaN
Texas         1391324.0


So, for a DataFrame, the alignment takes place for both the column and the row indices!
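
The example above only differs in its row labels. A minimal sketch with two made-up frames, `left` and `right`, that differ in both rows and columns shows alignment on both axes at once:

```python
import pandas as pd

# Hypothetical data: the frames overlap only in row 'r2' and column 'B'
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r1', 'r2'])
right = pd.DataFrame({'B': [10, 20], 'C': [30, 40]}, index=['r2', 'r3'])

summed = left + right
print(summed)
# Rows are the union r1, r2, r3 and columns the union A, B, C;
# only the cell (r2, B) exists in both frames, every other cell is NaN
```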

## C. Ufuncs: Operations between DataFrame and Series 

Index alignment is maintained here too.

The operations also follow NumPy's broadcasting rules: by default, a `Series` is matched against the DataFrame's column labels and applied to every row.

In [9]:
rng = np.random.default_rng(10)

df = pd.DataFrame(rng.integers(50, size=(4,4)), columns=['A','B','C','D'])
print(df)

print(f'The df after subtracting the first row is: \n{df - df.iloc[0]}')

# axis=0 matches the Series index against the DataFrame's row index,
# so column "A" is subtracted from every column
print(f'The df after subtracting the first column: \n{df.subtract(df["A"], axis=0)}')

    A   B   C   D
0  38  47  13  10
1  39  41  25   7
2  41  25   7   6
3  20  34  20  42
The df after subtracting the first row is: 
    A   B   C   D
0   0   0   0   0
1   1  -6  12  -3
2   3 -22  -6  -4
3 -18 -13   7  32
The df after subtracting the first column: 
   A   B   C   D
0  0   9 -25 -28
1  0   2 -14 -32
2  0 -16 -34 -35
3  0  14   0  22
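
Because these operations align labels before broadcasting, a `Series` that covers only some of the columns leaves the rest as `NaN`. A short sketch with made-up data (`offsets` and `row_offsets` are illustrative names) contrasting the default column-wise matching with `axis=0`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Hypothetical Series labelled with only two of the three column names
offsets = pd.Series({'A': 1, 'C': 5})

# Default: Series labels are matched against column names, then applied to every row
by_columns = df - offsets
print(by_columns)  # column B has no matching label, so it is all NaN

# axis=0: Series labels are matched against the row index instead
row_offsets = pd.Series([1, 2], index=[0, 1])
by_rows = df.sub(row_offsets, axis=0)
print(by_rows)
```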


END!