<a href="https://colab.research.google.com/github/Ziad-Fahmy/Data-Mining/blob/main/03_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Missing Data

Let's look at a few convenient pandas methods for detecting, dropping, and filling missing data:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3],
                  'D': [10, np.nan, np.nan]})

In [3]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


**Check for missing data**

In [4]:
df.isna()

Unnamed: 0,A,B,C,D
0,False,False,False,False
1,False,True,False,True
2,True,True,False,True


In [5]:
df.isna().sum(axis=0)

Unnamed: 0,0
A,1
B,2
C,0
D,2


In [6]:
df.isna().sum(axis=1)

Unnamed: 0,0
0,0
1,2
2,3
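
The counts above can be turned into a few other quick summaries. The following is a minimal sketch, assuming the same example frame; the variable names (`missing_fraction`, `cols_with_missing`, `total_missing`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same example frame as in the notebook.
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3],
                   'D': [10, np.nan, np.nan]})

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = df.isna().mean()

# Names of the columns that contain at least one missing value.
cols_with_missing = df.columns[df.isna().any()].tolist()

# Total number of missing cells in the whole frame.
total_missing = int(df.isna().sum().sum())

print(missing_fraction)
print(cols_with_missing)
print(total_missing)
```

Taking the mean of a boolean mask works because `True` counts as 1 and `False` as 0, so the result is the share of missing entries in each column.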


**Dropping missing data**

In [7]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [8]:
df.dropna(axis=0, inplace=True)

In [9]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0


In [10]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3],
                  'D': [10, np.nan, np.nan]})

In [11]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [12]:
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [13]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [17]:
df.dropna(thresh=1, axis=0)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


`thresh` sets the minimum number of non-null values a row (`axis=0`) or column (`axis=1`) must contain to be kept. <br>
With `thresh=1`, a row is kept if it has at least one non-null value, so all three rows of `df` are retained.
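
One way to see what `thresh` does is to count the non-null values by hand and compare. This is only a sketch under the assumption that the same example frame is used; `non_null_per_row`, `manual`, and `kept_cols` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3],
                   'D': [10, np.nan, np.nan]})

thresh = 2

# Number of non-null values in each row.
non_null_per_row = df.notna().sum(axis=1)

# Keep only the rows that reach the threshold, selected manually.
manual = df[non_null_per_row >= thresh]

# The manual selection should match what dropna(thresh=...) returns.
same_as_dropna = manual.equals(df.dropna(thresh=thresh))
print(same_as_dropna)

# The same idea applied to columns: keep columns with at least 2 non-null values.
kept_cols = df.dropna(thresh=2, axis=1)
print(list(kept_cols.columns))
```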

In [20]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,


In [21]:
df.dropna(thresh=3)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
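
Besides `axis` and `thresh`, `dropna` also accepts `how` and `subset`. The sketch below, again assuming the example frame from this notebook, shows how they might be used; the result names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3],
                   'D': [10, np.nan, np.nan]})

# how='all' drops a row only if every value in it is missing.
# No row here is entirely NaN (column C is always filled), so nothing is dropped.
all_missing_dropped = df.dropna(how='all')

# subset restricts the check to the listed columns:
# only rows with a missing value in column A are dropped.
a_required = df.dropna(subset=['A'])

print(len(all_missing_dropped))
print(list(a_required.index))
```

Neither call uses `inplace=True`, so `df` itself is left unchanged and each call returns a new DataFrame.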


**Fill missing data**

In [22]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [23]:
df.fillna(value=10)

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,10.0,2,10.0
2,10.0,10.0,3,10.0
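
A single constant is not always appropriate for every column. `fillna` also accepts a dict that maps column names to fill values; the sketch below assumes the same example frame, and the fill values are arbitrary placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3],
                   'D': [10, np.nan, np.nan]})

# Different (arbitrary) fill value for each column that has missing data.
filled = df.fillna(value={'A': 0, 'B': -1, 'D': 99})

print(filled)
```

Columns that are not listed in the dict are left as they are, and `df` is not modified because `inplace` is not set.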


In [24]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [25]:
df['A']

Unnamed: 0,A
0,1.0
1,2.0
2,


In [26]:
df['A'].mean()

1.5

In [27]:
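# Assign the result back instead of calling fillna(inplace=True) on df['A'],
# which is deprecated chained assignment in recent pandas versions
df['A'] = df['A'].fillna(value=df['A'].mean())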

In [28]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,1.5,,3,


In [29]:
# Fill every remaining NaN, in all columns, with the mean of column A
df.fillna(value=df['A'].mean(), inplace=True)

In [30]:
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,1.5,2,1.5
2,1.5,1.5,3,1.5


In [31]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3],
                  'D': [10, np.nan, np.nan]})
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,,2,
2,,,3,


In [32]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [33]:
for col in df.columns:
    # Assign the result back: df[col].fillna(..., inplace=True) operates on an
    # intermediate copy and raises a FutureWarning (it will stop working in pandas 3.0)
    df[col] = df[col].fillna(value=df[col].mean())
df

Unnamed: 0,A,B,C,D
0,1.0,5.0,1,10.0
1,2.0,5.0,2,10.0
2,1.5,5.0,3,10.0
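
The per-column loop can also be written without an explicit loop, since `fillna` accepts a Series whose index matches the column names. This is a minimal sketch assuming the same example frame; `column_means` and `filled` are names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3],
                   'D': [10, np.nan, np.nan]})

# Mean of each numeric column; NaN values are skipped by default.
column_means = df.mean(numeric_only=True)

# Each column's missing values are filled with that column's own mean.
filled = df.fillna(column_means)

print(filled)
```

This returns a new DataFrame, so it avoids calling an `inplace` method on a selected column altogether.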


# Great Job!