# Pandas-4: Advanced Operations on DataFrame


### Table of Contents

* Introduction to NaN and handling NaN values
* Categorical Operations and Memory optimization
* GroupBy and Aggregate Functions
* Merge in Pandas DataFrame
* Join in Pandas DataFrame

In [1]:
# import package
import numpy as np
import pandas as pd

In [2]:
# creating a dataframe
df = pd.DataFrame(
    {
        'A':[1,2,np.nan],
        'B':[5,np.nan,np.nan],
        'C':[1,2,3]
    }
)
print(df)

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


**In a DataFrame**
- Row → a horizontal series of values (called a record or observation).
- Column → a vertical series of values (called a field or feature).
- Cell / Element / Scalar value → the single value at the intersection of a row and a column.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       1 non-null      float64
 2   C       3 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 204.0 bytes


In [4]:
df['Country']="IN US ENG".split()
df.set_index('Country',inplace=True)

In [5]:
print(df)

           A    B  C
Country             
IN       1.0  5.0  1
US       2.0  NaN  2
ENG      NaN  NaN  3


#### NaN Values Introduction

`NaN` stands for **Not a Number**. It represents **missing, undefined, or null data** in a DataFrame or Series.


**Key Points**
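- `NaN` is a **floating-point value** (`numpy.nan`); an integer column that receives a `NaN` is upcast to `float64`.
- Used to **mark missing or incomplete data**.
- Arithmetic with `NaN` **propagates** (e.g., `1 + NaN = NaN`), but pandas reductions such as `sum()` and `mean()` **skip `NaN` by default** (`skipna=True`).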
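The behaviour of `NaN` can be checked with a short sketch. The Series names `s_int` and `s_missing` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# An all-integer Series keeps an integer dtype ...
s_int = pd.Series([1, 2, 3])
# ... but the same data with a NaN is stored as float64
s_missing = pd.Series([1, np.nan, 3])
print(s_int.dtype)
print(s_missing.dtype)   # float64

# NaN is a float and never compares equal to itself
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False

# Plain arithmetic propagates NaN
print(np.nan + 1)        # nan

# pandas reductions skip NaN unless skipna=False is passed
print(s_missing.sum())              # 4.0
print(s_missing.sum(skipna=False))  # nan
```

Because `NaN != NaN`, use `isna()` / `notna()` rather than `==` to detect missing values.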


In [6]:
df

           A    B  C
Country
IN       1.0  5.0  1
US       2.0  NaN  2
ENG      NaN  NaN  3


In [14]:
type(df.iloc[1,1])

numpy.float64

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, IN to ENG
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       1 non-null      float64
 2   C       3 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 96.0+ bytes


In [15]:
print(df.isna())   # True for NaN values

             A      B      C
Country                     
IN       False  False  False
US       False   True  False
ENG       True   True  False


In [16]:
print(df.notna())  # True for non-NaN values

             A      B     C
Country                    
IN        True   True  True
US        True  False  True
ENG      False  False  True


#### Handling NaN values

In [17]:
print("\nDropping any rows with a NaN value\n",'-'*35, sep='')
print(df.dropna(axis=0)) # default axis = 0


Dropping any rows with a NaN value
-----------------------------------
           A    B  C
Country             
IN       1.0  5.0  1


In [18]:
print("\nDropping any column with a NaN value\n",'-'*35, sep='')
df.dropna(axis=1)


Dropping any column with a NaN value
-----------------------------------


         C
Country
IN       1
US       2
ENG      3


In [19]:
print("\nKeeping rows with at least 2 non-NaN values using the 'thresh' parameter\n",'-'*68, sep='')
df.dropna(axis=0, thresh=2)


Keeping rows with at least 2 non-NaN values using the 'thresh' parameter
--------------------------------------------------------------------


           A    B  C
Country
IN       1.0  5.0  1
US       2.0  NaN  2
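Note that `thresh` counts **non-NaN** values: `thresh=2` keeps a row only if it has at least two valid entries. The sketch below contrasts `thresh` with the `how` and `subset` parameters of `dropna()`. The frame `demo` is a stand-alone copy of the data above, created only for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, np.nan],
                     'B': [5, np.nan, np.nan],
                     'C': [1, 2, 3]},
                    index=['IN', 'US', 'ENG'])

# thresh=N keeps rows with at least N non-NaN values (ENG has only one)
print(demo.dropna(thresh=2).index.tolist())      # ['IN', 'US']

# how='all' drops a row only if every value in it is NaN
print(demo.dropna(how='all').index.tolist())     # ['IN', 'US', 'ENG']

# subset restricts the NaN check to the listed columns
print(demo.dropna(subset=['A']).index.tolist())  # ['IN', 'US']
```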


In [20]:
print("\nFilling values with a default value\n",'-'*35, sep='')
print(df.fillna(value=-1))  # numeric -1 keeps the columns float; the string '-1' would turn them into object dtype


Filling values with a default value
-----------------------------------
           A    B  C
Country
IN       1.0  5.0  1
US       2.0 -1.0  2
ENG     -1.0 -1.0  3


In [21]:
print("\nFilling values with a computed value (mean of column A here)\n",'-'*60, sep='')
# a scalar fill value is applied to every column, so B is also filled with the mean of A
print(df.fillna(value=df['A'].mean()))


Filling values with a computed value (mean of column A here)
------------------------------------------------------------
           A    B  C
Country             
IN       1.0  5.0  1
US       2.0  1.5  2
ENG      1.5  1.5  3
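A scalar passed to `fillna()` is used for every column, which is why column B above is also filled with the mean of A. A minimal sketch of two common alternatives follows, assuming each column should be filled from its own data. The frame `demo` is an illustrative copy of the data used in this section.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, np.nan],
                     'B': [5, np.nan, np.nan],
                     'C': [1, 2, 3]},
                    index=['IN', 'US', 'ENG'])

# Passing a Series of column means fills each column with its own mean
filled = demo.fillna(demo.mean())
print(filled)

# Forward fill: carry the last valid value down each column
carried = demo.ffill()
print(carried)
```

Here `demo.mean()` is a Series indexed by column name, so pandas matches each fill value to its column.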


#### Reset Index

In [22]:
df.reset_index(inplace=True)

In [23]:
df

  Country    A    B  C
0      IN  1.0  5.0  1
1      US  2.0  NaN  2
2     ENG  NaN  NaN  3


#### Categorical Operations
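- Convert repetitive text columns to the `category` dtype to save memory and speed up operations such as grouping.
- The saving appears only when a column has few distinct values relative to its length. In the 3-row frame here every value is unique, so the category overhead actually raises the reported memory usage (228.0+ → 339.0 bytes in the `info()` outputs below).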

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Country  3 non-null      object 
 1   A        2 non-null      float64
 2   B        1 non-null      float64
 3   C        3 non-null      int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 228.0+ bytes


In [25]:
df['Country'] = df['Country'].astype('category')
df

  Country    A    B  C
0      IN  1.0  5.0  1
1      US  2.0  NaN  2
2     ENG  NaN  NaN  3


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   Country  3 non-null      category
 1   A        2 non-null      float64 
 2   B        1 non-null      float64 
 3   C        3 non-null      int64   
dtypes: category(1), float64(2), int64(1)
memory usage: 339.0 bytes


In [27]:
print(df['Country'].cat.categories)
print(df['Country'].cat.codes)

Index(['ENG', 'IN', 'US'], dtype='object')
0    1
1    2
2    0
dtype: int8
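On the 3-row frame above, the `category` conversion does not reduce the reported memory, because every value is distinct. The following sketch, which uses a hypothetical Series `countries` with the same three labels repeated many times, shows the situation where the conversion is expected to pay off. Exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated 10,000 times each
countries = pd.Series(['IN', 'US', 'ENG'] * 10_000)

# deep=True also counts the memory held by the string values themselves
as_object = countries.memory_usage(deep=True)
as_category = countries.astype('category').memory_usage(deep=True)

print("plain strings :", as_object, "bytes")
print("category      :", as_category, "bytes")
print("category is smaller:", as_category < as_object)
```

A categorical column stores each distinct label once and keeps a small integer code per row, which is what `cat.categories` and `cat.codes` expose.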


#### Memory Optimization
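- Convert columns to smaller data types to save memory, e.g. `int64` (8 bytes per value) to `int16` (2 bytes per value) when the values fit in the smaller range.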

In [28]:
df['C'] = df['C'].astype('int16')
df

  Country    A    B  C
0      IN  1.0  5.0  1
1      US  2.0  NaN  2
2     ENG  NaN  NaN  3


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   Country  3 non-null      category
 1   A        2 non-null      float64 
 2   B        1 non-null      float64 
 3   C        3 non-null      int16   
dtypes: category(1), float64(2), int16(1)
memory usage: 321.0 bytes
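Instead of choosing the target type by hand, `pd.to_numeric` with `downcast` can pick the smallest integer type that holds the data. This is a minimal sketch on a stand-alone Series `s`, not on the notebook's `df`.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')

# downcast='integer' selects the smallest signed integer type that fits the values
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)  # int8

# Data bytes only (index excluded): 3 values * 8 bytes vs 3 values * 1 byte
print(s.memory_usage(index=False))      # 24
print(small.memory_usage(index=False))  # 3
```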


#### Groupby & Mean

In [30]:
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Aman','Ankit','Aditya','Anjali','Ankita','Ayush'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

  Company  Person  Sales
0    GOOG    Aman    200
1    GOOG   Ankit    120
2    MSFT  Aditya    340
3    MSFT  Anjali    124
4      FB  Ankita    243
5      FB   Ayush    350


In [31]:
# Grouping by 'Company' column and listing mean sales
byComp = df.groupby('Company')
print(byComp.Sales.mean())

Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64


#### Groupby & Sum

In [33]:
# Grouping by 'Company' column and listing sum of sales
print(byComp.Sales.sum())

Company
FB      593
GOOG    320
MSFT    464
Name: Sales, dtype: int64


#### Groupby Summary

In [27]:
print(df.groupby('Company').describe())

        Sales                                                        
        count   mean         std    min     25%    50%     75%    max
Company                                                              
FB        2.0  296.5   75.660426  243.0  269.75  296.5  323.25  350.0
GOOG      2.0  160.0   56.568542  120.0  140.00  160.0  180.00  200.0
MSFT      2.0  232.0  152.735065  124.0  178.00  232.0  286.00  340.0
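When only a few statistics are needed, `agg()` computes several of them in one pass instead of the full `describe()` table. The sketch below rebuilds the sales data in a frame called `sales` (an illustrative name) so that it runs on its own.

```python
import pandas as pd

sales = pd.DataFrame({'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
                      'Sales': [200, 120, 340, 124, 243, 350]})

# One row per company, one column per aggregate function
summary = sales.groupby('Company')['Sales'].agg(['mean', 'sum', 'max'])
print(summary)
```

The `mean` and `sum` columns should match the per-company results shown earlier in this section.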


#### Percentiles

In [34]:
# calculate mean for overall sales column
df['Sales'].mean()

np.float64(229.5)

In [35]:
# calculate std deviation for overall sales column
df['Sales'].std()

100.89945490437498

In [36]:
# calculate the 50th percentile (median) of the overall sales column
df['Sales'].quantile(0.5)

np.float64(221.5)
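`quantile()` also accepts a list, which returns several percentiles at once. A short sketch on the same six sales figures, held in an illustrative Series `sales`, using pandas' default linear interpolation:

```python
import pandas as pd

sales = pd.Series([200, 120, 340, 124, 243, 350])

# 25th, 50th and 75th percentiles in one call; the result is indexed by the quantile
q = sales.quantile([0.25, 0.5, 0.75])
print(q)

# The 50th percentile is the median
print(sales.median())  # 221.5
```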

#### Merging DataFrames

In [37]:
# Creating 3 data frames

df1 = pd.DataFrame(    {'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame(    {'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8,9,10,11])

print(df1)
print('-'*68, sep='')
print(df2)
print('-'*68, sep='')
print(df3)

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
--------------------------------------------------------------------
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
--------------------------------------------------------------------
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


* **Concatenation with axis=0**

In [31]:
df_cat1 = pd.concat([df1,df2,df3], axis=0)
print("\nAfter concatenation along row\n",'-'*30, sep='')
print(df_cat1)


After concatenation along row
------------------------------
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


In [32]:
df_cat1.shape

(12, 4)
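The three frames above have non-overlapping indices, so the concatenated index is already unique. When the inputs share index labels, `ignore_index` or `keys` helps keep rows distinguishable. A minimal sketch with two small illustrative frames `a` and `b`:

```python
import pandas as pd

a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
b = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# Default: original labels are kept, so 0 and 1 appear twice
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]

# keys adds an outer index level that records the source frame
labelled = pd.concat([a, b], keys=['first', 'second'])
print(labelled.loc['second'])
```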

* **Concatenation with axis=1**

In [39]:
df_cat2 = pd.concat([df1,df2,df3], axis=1)
print("\nAfter concatenation along column\n",'-'*60, sep='')
print(df_cat2)


After concatenation along column
------------------------------------------------------------
      A    B    C    D    A    B    C    D    A    B    C    D
0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11


In [40]:
# fillna() returns a new DataFrame; df_cat2 itself is left unchanged
print("\nAfter filling missing values with zero\n",'-'*60, sep='')
print(df_cat2.fillna(value=0))


After filling missing values with zero
------------------------------------------------------------
     A   B   C   D   A   B   C   D    A    B    C    D
0   A0  B0  C0  D0   0   0   0   0    0    0    0    0
1   A1  B1  C1  D1   0   0   0   0    0    0    0    0
2   A2  B2  C2  D2   0   0   0   0    0    0    0    0
3   A3  B3  C3  D3   0   0   0   0    0    0    0    0
4    0   0   0   0  A4  B4  C4  D4    0    0    0    0
5    0   0   0   0  A5  B5  C5  D5    0    0    0    0
6    0   0   0   0  A6  B6  C6  D6    0    0    0    0
7    0   0   0   0  A7  B7  C7  D7    0    0    0    0
8    0   0   0   0   0   0   0   0   A8   B8   C8   D8
9    0   0   0   0   0   0   0   0   A9   B9   C9   D9
10   0   0   0   0   0   0   0   0  A10  B10  C10  D10
11   0   0   0   0   0   0   0   0  A11  B11  C11  D11
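The `NaN` blocks in the `axis=1` result come from index alignment: `pd.concat` places frames side by side by index label, and the three frames share no labels. The sketch below uses small illustrative frames (`left_part`, `right_part`, `shifted`) to show the aligned case and the `join` parameter.

```python
import pandas as pd

left_part = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right_part = pd.DataFrame({'B': ['B0', 'B1']}, index=[0, 1])
shifted = pd.DataFrame({'B': ['B1', 'B2']}, index=[1, 2])

# Identical indices: rows line up and no NaN is introduced
side = pd.concat([left_part, right_part], axis=1)
print(side)

# Partly overlapping indices: the default join='outer' keeps the union of labels
outer = pd.concat([left_part, shifted], axis=1)
print(outer.index.tolist())  # [0, 1, 2]

# join='inner' keeps only labels present in every frame
inner = pd.concat([left_part, shifted], axis=1, join='inner')
print(inner.index.tolist())  # [1]
```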


* **Merging by single common key**

`merge()` is used to **combine two DataFrames** based on **common columns or indices**, similar to SQL joins.

---

**Syntax**
```python
pd.merge(left, right, 
         how='inner', 
         on=None, 
         left_on=None, 
         right_on=None, 
         left_index=False, 
         right_index=False, 
         suffixes=('_x', '_y'))
```

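**how: Type of merge/join**
- 'inner' -> only keys present in both DataFrames (intersection, default)
- 'outer' -> all keys from both DataFrames (union); unmatched cells become NaN
- 'left' -> all keys from left; columns from right are NaN where there is no match
- 'right' -> all keys from right; columns from left are NaN where there is no match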
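The four `how` options can be compared on one pair of frames whose keys only partly overlap. In this sketch `left_df` and `right_df` are illustrative names. The `indicator=True` argument adds a `_merge` column that records where each row came from.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_df = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

# Show which keys survive under each join type
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(left_df, right_df, on='key', how=how)
    print(how, sorted(result['key']))

# _merge is 'both', 'left_only' or 'right_only' for each row
flagged = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)
print(flagged[['key', '_merge']])
```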

In [41]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)

print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)


The DataFrame 'left'
------------------------------
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3

The DataFrame 'right'
------------------------------
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3


In [42]:
merge1= pd.merge(left,right, on='key')
print("\nAfter simple merging with 'inner' method\n",'-'*50, sep='')
print(merge1)


After simple merging with 'inner' method
--------------------------------------------------
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3


* **Merging by 2 common keys**

In [43]:
# create DF

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})

print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)

print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)


The DataFrame 'left'
------------------------------
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

The DataFrame 'right'
------------------------------
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3


In [50]:
pd.merge(left, right, on=['key1', 'key2'])

  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2


**Left Merge**

In [44]:
print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)

print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)

pd.merge(left, right, how='left',on=['key1', 'key2'])


The DataFrame 'left'
------------------------------
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

The DataFrame 'right'
------------------------------
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3


  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN


**Right Merge**

In [53]:
pd.merge(left, right, how='right',on=['key1', 'key2'])

  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3


**Outer & Inner Merge**

In [47]:
df1,df2

(    A   B   C   D
 0  A0  B0  C0  D0
 1  A1  B1  C1  D1
 2  A2  B2  C2  D2
 3  A3  B3  C3  D3,
     A   B   C   D
 4  A4  B4  C4  D4
 5  A5  B5  C5  D5
 6  A6  B6  C6  D6
 7  A7  B7  C7  D7)

In [55]:
pd.merge(df1, df2, how='outer')

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7


In [45]:
pd.merge(df1, df2, how='inner')

Empty DataFrame
Columns: [A, B, C, D]
Index: []
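The results above follow from how `pd.merge` picks keys when `on` is omitted: it uses every column name the two frames have in common. `df1` and `df2` share all four columns and have no row in common, so the inner merge is empty and the outer merge returns all eight rows. A smaller sketch with two illustrative frames `x` and `y` that share exactly one row:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})

# No 'on' given: both A and B are used as keys
inner = pd.merge(x, y, how='inner')  # only the row ('A1', 'B1') is in both
outer = pd.merge(x, y, how='outer')  # every distinct (A, B) pair

print(len(inner), len(outer))  # 1 3
```

Passing `on` explicitly is usually safer, since it makes the key columns visible in the code.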


#### Joins in Pandas DataFrame
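- `join()` combines DataFrames mostly on their index; the default is a left join (`how='left'`), which keeps every row of the calling DataFrame.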

In [48]:
import pandas as pd
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

print("\nThe DataFrame 'left'\n",'-'*30, sep='')
print(left)

print("\nThe DataFrame 'right'\n",'-'*30, sep='')
print(right)


The DataFrame 'left'
------------------------------
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

The DataFrame 'right'
------------------------------
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3


In [59]:
left.join(right)

     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2


In [49]:
right.join(left)

     C   D    A    B
K0  C0  D0   A0   B0
K2  C2  D2   A2   B2
K3  C3  D3  NaN  NaN


In [50]:
left.join(right, how='outer')

      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
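An index-based `join()` can also be written as a `merge()` with `left_index=True` and `right_index=True`. In addition, `join(on=...)` matches a column of the calling frame against the index of the other frame. The sketch below shows both, using reduced versions of `left` and `right` plus a hypothetical `orders` frame.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# The same inner join on the index, written both ways
joined = left.join(right, how='inner')
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(joined)
print(merged)

# join(on=...): the 'key' column of orders is looked up in right's index
orders = pd.DataFrame({'key': ['K0', 'K2', 'K2'], 'qty': [1, 2, 3]})
with_c = orders.join(right, on='key')
print(with_c)
```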


**Merge Vs Join**

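| Feature                | `merge()`                              | `join()`                                                                 |
| ---------------------- | -------------------------------------- | ------------------------------------------------------------------------ |
| Method Type            | Function: `pd.merge()`                 | DataFrame method: `df.join()`                                            |
| Default join type      | `how='inner'`                          | `how='left'`                                                             |
| Join on Columns        | ✅ Yes                                  | ⚠️ Limited (`on` matches the caller's columns against the other's index) |
| Join on Index          | ✅ Yes (with `left_index/right_index`)  | ✅ Yes (default)                                                          |
| Flexibility            | High                                   | Lower                                                                    |
| Multiple Key Columns   | ✅ Yes                                  | ⚠️ Only against a MultiIndex in the other DataFrame                      |
| SQL-like functionality | ✅ Yes                                  | ⚠️ Partial (`how` supports left, right, inner, outer)                    |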

In short:
* Use merge() for SQL-style joins, column-based joins, or multiple keys.
* Use join() for simple index-based joins or when combining multiple DataFrames on their indices.

---

Happy Learning ! Team DecodeAiML !!