<img src="images/pandas_logo.png" alt="pandas logo" width="400" height="200"/> <br>
# Pandas: Dealing with Missing Data

In this notebook we discuss how to deal with missing data using pandas. pandas represents missing data with two values:
 <ul>
  <li>None</li>
  <li>NaN (Not a Number)</li>
</ul> 

There are several ways to deal with missing data. The following functions are useful for working with null values:
 <ul>
  <li><b>isnull()</b>: Checks for missing values (None/NaN)</li>
  <li><b>notnull()</b>: Checks for non-null values</li>
  <li><b>dropna()</b>: Drops the rows (or columns) that contain null values</li>
  <li><b>fillna()</b>: Fills the null values with a given value</li>
  <li><b>replace()</b>: Replaces the null values with another value</li>
  <li><b>interpolate()</b>: Interpolates missing values with a chosen method, for example the linear method</li>
</ul> 
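
As a quick illustration of the functions listed above, here is a minimal sketch on a small hypothetical Series `s` (not part of the employees data) that holds one `None` and one `NaN`. Both end up as `NaN` in a float Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one None and one NaN
s = pd.Series([1.0, None, 3.0, np.nan, 5.0])

null_mask = s.isnull()                         # True where the value is missing
present_mask = s.notnull()                     # True where the value is present
dropped = s.dropna()                           # keeps only the non-missing entries
filled = s.fillna(0)                           # replaces missing entries with 0
replaced = s.replace(np.nan, -1)               # same idea, expressed with replace()
interpolated = s.interpolate(method="linear")  # fills each gap from its neighbours

print(null_mask.tolist())
print(interpolated.tolist())
```

None of these calls modifies `s`; each returns a new object.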

In [1]:
# Importing the library
import pandas as pd
import numpy as np
import random

# Helper functions for dealing with missing data
def check_null(data):
    """
    Count the null values in every column that has at least one.
    data: data frame
    """
    null_columns = data.columns[data.isnull().any()]
    return data[null_columns].isnull().sum()

def display_null(data,column):
    """
    Return the rows where the given column is null.
    data: data frame
    column: column name in str
    """
    bool_series = pd.isnull(data[column])
    return data[bool_series]

def display_not_null(data,column):
    """
    Return the rows where the given column is not null.
    data: data frame
    column: column name in str
    """
    bool_series = pd.notnull(data[column])
    return data[bool_series]

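def column_is_null(data,column):
    """
    Print the number of null values in one column.
    data: data frame
    column: column name in str
    """
    # Fail with a clear message instead of an unbound variable
    if column not in data.columns:
        raise KeyError(f"Column '{column}' not found in the data frame")
    sum_of_null = data[column].isna().sum()
    print(f"Null values for '{column}': {sum_of_null}")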

In [2]:
# Loading the data
data = pd.read_csv('/home/afrioni/data_science/employees.csv')
data.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [3]:
# Checking Null Values
check_null(data)

First Name            67
Gender               145
Senior Management     67
Team                  43
dtype: int64

In [4]:
# Displaying only the rows where 'Team' is NaN
display_null(data,'Team').head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
10,Louise,Female,8/12/1980,9:01 AM,63241,15.132,True,
23,,Male,6/14/2012,4:19 PM,125792,5.042,,
32,,Male,8/21/1998,2:27 PM,122340,6.417,,
91,James,,1/26/2005,11:00 PM,128771,8.309,False,


In [5]:
# Displaying only the rows where 'Team' is not NaN
display_not_null(data,'Team').head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal


## Removing Null Values

### 1) Dropping all rows that contain null values

In [6]:
# First, we copy the data
data_copy = data.copy()
print('The shape of data containing the null values:',data_copy.shape)
data_copy.head()

The shape of data containing the null values: (1000, 8)


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [7]:
# Dropping all rows that contain null values
data_no_null = data_copy.dropna()
print('The shape of data after dropping the null values:',data_no_null.shape)
data_no_null.head()

The shape of data after dropping the null values: (764, 8)


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal


### 2) Dropping the rows in which all values are missing/null

In [8]:
data_row_null = pd.DataFrame({'A':[np.nan, np.nan, np.nan, 80], 
        'B': [40, np.nan, 45, 556], 
        'C':[51, np.nan, 83, 67], 
        'D':[np.nan, np.nan, np.nan, np.nan]}) 

print(data_row_null)

# dropping the rows in which every value is null
# in this case only row number 1 is entirely null
data_row_null.dropna(how= 'all')

      A      B     C   D
0   NaN   40.0  51.0 NaN
1   NaN    NaN   NaN NaN
2   NaN   45.0  83.0 NaN
3  80.0  556.0  67.0 NaN


Unnamed: 0,A,B,C,D
0,,40.0,51.0,
2,,45.0,83.0,
3,80.0,556.0,67.0,


To drop every column that contains at least one null value, we use:<br>
<b>dropna(axis=1)</b> <br>
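
A minimal sketch of column-wise dropping, assuming a hypothetical frame `df` similar to the one above but with column `B` fully populated, so that at least one column survives. The `how='all'` variant is shown for contrast: it removes only columns that are entirely null.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: B is complete, A and C are partly null, D is entirely null
df = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 80],
                   'B': [40, 20, 45, 556],
                   'C': [51, np.nan, 83, 67],
                   'D': [np.nan, np.nan, np.nan, np.nan]})

# Drop every column that has at least one null value
any_null_dropped = df.dropna(axis=1)

# Drop only the columns in which every value is null
all_null_dropped = df.dropna(axis=1, how='all')

print(list(any_null_dropped.columns))
print(list(all_null_dropped.columns))
```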




## Filling Null Values

### 1) Filling null values with single value

In [9]:
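# Filling null values using fillna()
# Assigning the result back avoids the chained in-place call, which newer pandas versions warn about
data["Team"] = data["Team"].fillna("No Team")

# checking the number of null values in the 'Team' column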
column_is_null(data,'Team')

Null values for 'Team': 0


### 2) Replacing null values with the mean

In [10]:
data_test = pd.DataFrame({ 
    'A' : [np.nan,1,2,3],
    'B' : [5,5,np.nan,np.nan],
    'C' : [10,20,np.nan,40],
    'D' : [100,np.nan,300,400],
    'E' : ['a','a',np.nan,'c']
    })

data_test.head()

Unnamed: 0,A,B,C,D,E
0,,5.0,10.0,100.0,a
1,1.0,5.0,20.0,,a
2,2.0,,,300.0,
3,3.0,,40.0,400.0,c


In [11]:
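def filling_with_mean(data,column):
    """Fill the null values of a column, in place, with the column mean."""
    data_mean = data[column].mean()  # mean() skips NaN by default
    data.loc[data[column].isnull(), column] = data_mean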

In [12]:
filling_with_mean(data_test,'B')

In [13]:
data_test.head()

Unnamed: 0,A,B,C,D,E
0,,5.0,10.0,100.0,a
1,1.0,5.0,20.0,,a
2,2.0,5.0,,300.0,
3,3.0,5.0,40.0,400.0,c
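
The same idea can be applied to every numeric column at once. The following is a sketch, not the approach used above: it assumes the `data_test` frame as originally constructed and passes the Series of column means to `fillna()`, which matches the means to columns by name. The text column `E` is left untouched.

```python
import numpy as np
import pandas as pd

# Same frame as in the cell above, rebuilt so the example is self-contained
data_test = pd.DataFrame({
    'A': [np.nan, 1, 2, 3],
    'B': [5, 5, np.nan, np.nan],
    'C': [10, 20, np.nan, 40],
    'D': [100, np.nan, 300, 400],
    'E': ['a', 'a', np.nan, 'c'],
})

# Means of the numeric columns only; NaN values are skipped
column_means = data_test.mean(numeric_only=True)

# fillna() with a Series fills each column with the value under its own name
filled = data_test.fillna(column_means)

print(filled)
```

Because `fillna()` returns a new frame here, `data_test` itself keeps its null values.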


## Filling null values with random numbers between (mean - std) and (mean + std)

In [14]:
#Create data frame
data_cek= pd.DataFrame({ 
    'id' : [np.nan,1,2,3,4,5,6,7,8,9,10],
    'Age' : [5,5,np.nan,np.nan,np.nan,6,14,3,np.nan,3,3],
    })

data_cek

Unnamed: 0,id,Age
0,,5.0
1,1.0,5.0
2,2.0,
3,3.0,
4,4.0,
5,5.0,6.0
6,6.0,14.0
7,7.0,3.0
8,8.0,
9,9.0,3.0


In [15]:
# Generate 4 random integers (one per missing Age) between low and high
mean_value = data_cek['Age'].mean()
std_value = data_cek['Age'].std()
low_range = round(mean_value - std_value)
high_range = round(mean_value + std_value)
randomlist = random.sample(range(low_range, high_range), 4)
print(randomlist)

[4, 5, 2, 3]


In [16]:
list_array = np.array(randomlist)
data_cek.loc[data_cek['Age'].isnull(), 'Age'] = list_array
data_cek

Unnamed: 0,id,Age
0,,5.0
1,1.0,5.0
2,2.0,4.0
3,3.0,5.0
4,4.0,2.0
5,5.0,6.0
6,6.0,14.0
7,7.0,3.0
8,8.0,3.0
9,9.0,3.0
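
Note that `random.sample` draws without replacement, so it raises a `ValueError` when there are more missing values than integers in the range. A possible alternative, sketched here with NumPy's random generator, draws with replacement and sizes the draw from the number of missing values. The frame name `ages` and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding the same Age values as data_cek
ages = pd.DataFrame({'Age': [5, 5, np.nan, np.nan, np.nan, 6, 14, 3, np.nan, 3, 3]})

mean_value = ages['Age'].mean()
std_value = ages['Age'].std()
low = int(round(mean_value - std_value))
high = int(round(mean_value + std_value))

# Boolean mask of the rows to fill, computed before any assignment
missing = ages['Age'].isnull()

# Seed chosen arbitrarily so the run is repeatable
rng = np.random.default_rng(seed=0)

# integers() excludes the upper bound and may repeat values
ages.loc[missing, 'Age'] = rng.integers(low, high, size=int(missing.sum()))

print(ages)
```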


### 3) Replacing null values with the mode
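
A minimal sketch of this kind of imputation, assuming a small hypothetical frame `teams` rather than the employees data. `mode()` ignores null values by default and returns a Series, because several values can tie for most frequent; the sketch simply takes the first one.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
teams = pd.DataFrame({'Team': ['Finance', 'Legal', np.nan, 'Finance', np.nan]})

# mode() returns a Series (ties are possible); take its first entry
most_common = teams['Team'].mode()[0]

# Fill the missing entries with the most frequent value
teams['Team'] = teams['Team'].fillna(most_common)

print(teams)
```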

You can obtain the data here: <br>
https://drive.google.com/open?id=1PWvHNhNDpckBHsUCLD1WhJyeBdVk7mlw <br>
You can learn more from these tutorials: <br>
[1] https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/ <br>
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html