---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.18 (Pandas-10)</h1>

## _Handling Missing Data.ipynb_

## Learning agenda of this notebook

1. Have an insight about the Dataset
2. Identify the Columns having Null/Missing values
3. Handle/Impute the Null/Missing Values under the `math` Column
4. Handle/Impute the Null/Missing Values under the `group` Column
5. Handle Missing values under a Numeric/Categorical Column using `fillna()`
6. Handle Repeating Values (for same information) under the `session` Column
7. Create a new Column by Modifying an Existing Column
8. Delete Rows Having NaN values using `df.dropna()` method
9. Convert Categorical Variables into Numerical

## 1. Have an Insight about the Dataset

In [2]:
# import the pandas library
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,72.0,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,21,3500,,95.0,93
3,MS04,SAADIA,male,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


- Whenever the **`pd.read_csv()`** function detects a missing value (nothing between two commas in a CSV file, or an empty cell in Excel), it flags it as NaN. There can be many reasons for NaN values; for example, the data may have been gathered through a Google Form in which the field was optional and respondents skipped it.
- We can use the **`df.info()`** method to display the number of non-null values in each column, along with the column names, their datatypes, and the memory usage of the dataframe.
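
The following is a minimal sketch of how empty CSV fields turn into NaN. The three-row CSV text and the name `sample` are made up for illustration and are not part of `group-marks.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text: the empty fields mimic answers that were skipped.
csv_text = """rollno,group,math
MS01,group B,72
MS02,,69
MS03,group A,
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)        # math is float64 because NaN is a float value
print(sample.isna().sum())  # one missing value in group, one in math
```

Note that the `math` column is read as `float64` even though every value present is an integer, which is also why `math` and `english` show up as `float64` in the `df.info()` output.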

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rollno       50 non-null     object 
 1   name         50 non-null     object 
 2   gender       50 non-null     object 
 3   group        47 non-null     object 
 4   session      50 non-null     object 
 5   age          50 non-null     int64  
 6   scholarship  50 non-null     int64  
 7   math         46 non-null     float64
 8   english      47 non-null     float64
 9   urdu         50 non-null     int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 4.0+ KB


## 2. Identify the Columns having Null/Missing values
- The **`df.isna()`** method (recommended over its alias `df.isnull()`) returns a boolean object of the same size that indicates whether each element is an NA value. Missing values are mapped to True; everything else is mapped to False. Remember, characters such as empty strings ``''`` or `numpy.inf` are not considered NA values.
- The **`df.notna()`** method (recommended over its alias `df.notnull()`) returns a boolean object of the same size that indicates whether each element is not an NA value. Non-missing values are mapped to True; missing values are mapped to False.
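
As a small illustration of the remark about empty strings and `numpy.inf`, here is a sketch on two made-up Series (`s` and `t` are hypothetical names, not columns of the dataset):

```python
import numpy as np
import pandas as pd

# Numeric Series: NaN and None count as missing, infinity does not.
s = pd.Series([1.0, np.nan, np.inf, None])
print(s.isna().tolist())   # [False, True, False, True]
print(s.notna().tolist())  # [True, False, True, False]

# Text Series: an empty string is a real value, only None is missing.
t = pd.Series(['a', '', None])
print(t.isna().tolist())   # [False, False, True]
```

If empty strings should count as missing in your data, they have to be converted to NaN explicitly before calling `isna()`.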

In [8]:
df.isna().head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


In [9]:
df.notna().head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True
2,True,True,True,False,True,True,True,False,True,True
3,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True


In [10]:
# Now we can use sum() to get the total count of missing values for each column
df.isna().sum()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           4
english        3
urdu           0
dtype: int64

## 3. Handle/Impute the Null/Missing Values under the `math` Column

### a. Identify the Rows under the `math` Column having Null/Missing values
- The `isna()` method works equally well on Series objects

In [None]:
df.math.isna()

In [None]:
df[df.math.isna()]

In [None]:
df.loc[df.math.isna()]

### b. Replace the Null/Missing Values under the `math` Column
- After detecting the NaN values, the next question is what value we should write in the cells that have null/missing values
- Since this is a numeric column with datatype float64, let us compute the average of the column and put that average in place of the missing values
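
A minimal sketch of mean imputation on a made-up Series (`marks` is a hypothetical name). It relies on the fact that `mean()` skips NaN values by default, so the average is computed over the known marks only:

```python
import numpy as np
import pandas as pd

marks = pd.Series([70.0, np.nan, 50.0, 60.0])

# mean() ignores NaN by default: (70 + 50 + 60) / 3 = 60.0
avg = marks.mean()

# Write the average into the positions where the value is missing.
marks.loc[marks.isna()] = avg

print(avg)             # 60.0
print(marks.tolist())  # [70.0, 60.0, 50.0, 60.0]
```

Filling with the mean leaves the column mean unchanged, but it does reduce the spread of the values, which is worth keeping in mind when many cells are missing.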

In [None]:
# Compute the mean of math column
df.math.mean() 

In [None]:
# List only those records under math column having Null values
df.loc[(df.math.isna()), 'math']

In [None]:
# Let us replace these values with mean value of the math column
df.loc[(df.math.isna()),'math'] = df.math.mean()

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

## 4. Handle/Impute the Null/Missing Values under the `group` Column
- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values.

### a. Identify the Rows under the `group` Column having Null/Missing values

In [None]:
df.group.isna()

In [None]:
df[df.group.isna()]

In [None]:
df.loc[df.group.isna()]

### b. Replace the Null/Missing Values under the `group` Column
- After detecting the NaN values, the next question is what value we should write in the cells that have null/missing values
- Since this is a categorical column with datatype object (group A, group B, group C, ...), let us replace the missing values with the value that occurs most frequently in the column
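
A minimal sketch of mode imputation on a made-up Series (`groups` and `most_common` are hypothetical names). Instead of typing the most frequent label by hand, it reads it from `mode()`, which returns a Series because several values can tie for first place:

```python
import numpy as np
import pandas as pd

groups = pd.Series(['group C', 'group A', np.nan,
                    'group C', 'group B', np.nan])

# mode() ignores NaN by default; [0] picks the first of the most frequent values.
most_common = groups.mode()[0]

# Write that label into the positions where the value is missing.
groups.loc[groups.isna()] = most_common

print(most_common)      # group C
print(groups.tolist())
```

Reading the label from `mode()` keeps the code correct if the data changes and a different group becomes the most frequent one.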

In [None]:
# Use the value_counts() method, which returns a Series containing counts of unique values
# in descending order, with the most frequently occurring element first. It excludes NA values by default.
df.group.value_counts()

In [None]:
# Another way of doing this is to use the mode() method on the column
df.group.mode() 

In [None]:
# List only those records under group column having Null values
df.loc[(df.group.isna()), 'group']

In [None]:
# Let us replace these values with the most frequently occurring value in the `group` column
df.loc[(df.group.isna()),'group'] = 'group C'

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

## 5. Handle Missing values under a Numeric/Categorical Column using `fillna()`

### a. Replace the Null/Missing Values under the math Column using `fillna()`
- This is the recommended way of filling in the null values within columns of your dataset, rather than using `loc`.
```
object.fillna(value=None, method=None, inplace=False)
```
- You must supply either the `value` with which to replace the missing values OR the `method` to be used to replace them
- Returns a new object with the missing values filled, or None if ``inplace=True``
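
The sketch below, on a made-up two-column frame (`frame` and `filled` are hypothetical names), shows that `fillna()` without `inplace=True` returns a new object and leaves the original untouched. It also uses the dictionary form of `value`, which lets a single call fill each column with its own replacement:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'math': [72.0, np.nan, 47.0],
    'group': ['group B', None, 'group A'],
})

# A dict maps each column name to the value used for that column.
filled = frame.fillna({'math': frame['math'].mean(), 'group': 'group B'})

missing_before = int(frame.isna().sum().sum())   # original still has 2 NaN
missing_after = int(filled.isna().sum().sum())   # the returned copy has 0

print(missing_before, missing_after)
print(filled)
```

Assigning the returned object to a name (or back to `frame`) is usually easier to reason about than `inplace=True`, because the original data stays available for comparison.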

In [None]:
# Let us read the dataset again with NA values under math column
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')

In [None]:
df.isna().sum()

In [None]:
df.loc[df.math.isna()]

In [None]:
# This time, instead of loc, use the fillna() method.
# fillna() returns a new Series, so we assign the result back to the column.
# Calling df.math.fillna(..., inplace=True) on a selected column is chained
# assignment and may not update df in recent pandas versions.

df['math'] = df['math'].fillna(value=df['math'].mean())

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

### b. Replace the Null/Missing Values under the `group` Column using `fillna()`

In [None]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')

In [None]:
df.isna().sum()

In [None]:
# This time, instead of loc, use the fillna() method and assign the result back
# df['group'] = df['group'].fillna(value)
df['group'] = df['group'].fillna('group C')

In [None]:
# Confirm the result
df.isna().sum()
#df.info()

In [None]:
# Let us fill the math, english and scholarship columns as well
# (the result is assigned back rather than using inplace=True on a selected column)
df['math'] = df['math'].fillna(df['math'].mean())
df['english'] = df['english'].fillna(df['english'].mean())
df['scholarship'] = df['scholarship'].fillna(df['scholarship'].mean())

In [None]:
# Confirm the result
df.isna().sum()


### c. Replace the Null/Missing Values under the math and group Column using ffill and bfill
- In the examples above, we passed the mean (for a numeric column) or the mode (for a categorical column) as the fill value to the `fillna()` method
```
object.fillna(value=None, method=None, inplace=False)
```

- We can instead pass `ffill` or `bfill` as the `method` argument to the `fillna()` method. This replaces the null values with neighbouring values from the same column of the DataFrame
- Recent pandas releases deprecate the `method` argument in favour of the equivalent `ffill()` and `bfill()` methods
- `ffill` (forward fill): fills a NaN value with the previous value
- `bfill` (back fill): fills a NaN value with the next value
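
A small sketch of both directions on a made-up Series (`s`, `forward` and `backward` are hypothetical names). It uses the `ffill()` and `bfill()` methods, which do the same thing as `fillna(method='ffill')` and `fillna(method='bfill')`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each NaN takes the last valid value seen above it.
forward = s.ffill()

# Back fill: each NaN takes the next valid value below it.
backward = s.bfill()

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # the last element stays NaN: nothing follows it
```

A NaN at the very start of a column survives a forward fill, and a NaN at the very end survives a back fill, so it is worth re-checking `isna().sum()` afterwards.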

<img align="right" width="490" height="100"  src="images/bfill.png"  >
<img align="left" width="490" height="100"  src="images/ffill.png"  >

In [None]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
df.isna().sum()

In [None]:
# Forward fill (ffill)
# If a cell holds NaN, carry forward the previous value in that column
# df.ffill() is equivalent to df.fillna(method='ffill'), which recent pandas versions deprecate
df.ffill(inplace=True)
df.head()

In [None]:
# Confirm the result
df.isna().sum()

## 6. Handle Repeating Values (for same information) under the `session` Column
- If you look at the values under the `session` column, you will see that it is a categorical column containing six different categories (as values).
    - Notice that the categories `MORNING` and `MOR` are the same
    - Similarly, `AFTERNOON` and `AFT` are the same
    - Similarly, `EVENING` and `EVE` are the same
- This happens when data is collected from different sources, where the same information is written in different ways
- So the `session` column has six different categories (as values) but should have only three

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
df.session

In [None]:
# Let us check the counts of unique values inside the session column
df.session.value_counts()

###  Handle  the Repeating Values under the session Column using map()
- To keep the data clean, we will map all these values to only three categories, `MOR`, `AFT` and `EVE`, using the `map()` method.
```
df.session.map(mapping, na_action=None)
```
- The `map()` method substitutes each value in a Series with another value, which may be derived from a `dict`. It returns a new Series after performing the mapping
- Passing `na_action='ignore'` propagates NaN values without passing them to the mapping correspondence.
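
One detail worth seeing in action: when the mapping is a `dict`, any value that has no key in it becomes NaN. The sketch below uses a made-up Series containing an unexpected label `NIGHT` (all names here are hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series(['MORNING', 'MOR', 'EVE', 'NIGHT', np.nan])
mapping = {'MORNING': 'MOR', 'MOR': 'MOR', 'EVENING': 'EVE', 'EVE': 'EVE'}

# 'NIGHT' has no key in the dict, so it is turned into NaN.
mapped = s.map(mapping)

# One way to keep unmapped labels: fall back to the original values.
kept = s.map(mapping).fillna(s)

# With a function, na_action='ignore' skips NaN instead of passing it in.
shortened = s.map(lambda v: v[:3], na_action='ignore')

print(mapped.tolist())
print(kept.tolist())
print(shortened.tolist())
```

Checking `value_counts(dropna=False)` after a dictionary mapping is a quick way to spot labels that were silently turned into NaN.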


In [None]:
# To do this, let us create a new mapping (dictionary) 
dict1 = {
    'MORNING' : 'MOR',
    'MOR' : 'MOR',
    'AFTERNOON' : 'AFT',
    'AFT': 'AFT',
    'EVENING' : 'EVE',
    'EVE': 'EVE'
}

In [None]:
# It returns a series with the same index as caller, the original series remains unchanged. 
# So we have assigned the resulting series to `df.session` series

df.session = df.session.map(dict1)

In [None]:
# Count of the new categories in the session column
# Observe that the session column now contains only three consistent categories
df.session.value_counts()

In [None]:
# Let us verify the result
df.head()

## 7. Create a new Column by Modifying an Existing Column
- We have a column `scholarship` in the dataset, which is in Pak Rupees
- Suppose you want a new column that represents the scholarship in US Dollars
- For that, we add a new column by dividing each value of `scholarship` by 150
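
The conversion can be written with `apply()` or as plain column arithmetic. The sketch below compares the two on a made-up frame (`df_demo` and `rate` are hypothetical names; 150 is the rate used in the text above, not a current exchange rate):

```python
import pandas as pd

df_demo = pd.DataFrame({'scholarship': [2562, 2800, 3500]})
rate = 150  # assumed Pak Rupees per US Dollar, as in the text above

# Element-wise function call, one value at a time.
via_apply = df_demo['scholarship'].apply(lambda x: x / rate)

# Vectorized arithmetic on the whole column at once.
vectorized = df_demo['scholarship'] / rate

print(via_apply.tolist())
print(vectorized.tolist())
```

Both give the same numbers. For simple arithmetic the vectorized form is shorter and usually faster, while `apply()` is useful when the per-value logic cannot be expressed as column arithmetic.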

In [None]:
df.scholarship.head()

In [None]:
df.scholarship.apply(lambda x: x/150)

In [None]:
df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/150)

In [None]:
df.head()

In [None]:
df[['scholarship','Scholarship_in_$']]

## 8. Delete Rows Having NaN values using `df.dropna()` method

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
df.shape

In [None]:
# using dropna() method, you can drop all the rows having NaN values
new_df = df.dropna()
new_df.head()

In [None]:
# Let us verify
new_df.shape
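
By default `dropna()` removes every row that contains at least one NaN, which can discard a lot of data. The sketch below, on a made-up frame (`frame` and the result names are hypothetical), contrasts the default with the `subset` and `axis` parameters:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'group': ['group B', None, 'group A', None],
    'math': [72.0, 69.0, np.nan, np.nan],
    'urdu': [74, 88, 93, 44],
})

# Default: drop rows with a NaN in any column.
any_missing = frame.dropna()

# Only drop rows where the math value is missing.
math_only = frame.dropna(subset=['math'])

# axis=1 drops columns (not rows) that contain a NaN.
complete_columns = frame.dropna(axis=1)

print(any_missing.shape, math_only.shape, complete_columns.shape)
```

Comparing the shapes before and after, as the cells above do with `df.shape`, shows how much data each choice removes.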

## 9. Convert Categorical Variables into Numerical
- Most machine learning algorithms do not accept categorical variables, so we need to convert them into numerical ones.
- We can do this using the pandas function `pd.get_dummies()`, which creates a binary column for each category.
```
pd.get_dummies(data, drop_first=False)
```
- The only required argument is `data`, which can be a dataframe or a series
- The `drop_first` parameter (bool, default False) controls whether to get k-1 dummies out of k categorical levels by removing the first level.
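
To make the "k-1 out of k" wording concrete, here is a sketch on a made-up Series with three levels (`group`, `full` and `reduced` are hypothetical names):

```python
import pandas as pd

group = pd.Series(['group A', 'group B', 'group C', 'group A'], name='group')

# k = 3 levels give 3 indicator columns.
full = pd.get_dummies(group)

# drop_first=True removes the first level, leaving k - 1 = 2 columns.
reduced = pd.get_dummies(group, drop_first=True)

print(list(full.columns))     # ['group A', 'group B', 'group C']
print(list(reduced.columns))  # ['group B', 'group C']
print(reduced)
```

No information is lost in the reduced version: a row with no indicator set in any remaining column must belong to the dropped level, here `group A`.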

### a. Convert all categorical variables into dummy/indicator variables

In [None]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

In [None]:
# Currently we have 10 columns in the data
df.shape

In [None]:
# Convert all categorical variables into dummy/indicator variables
df = pd.get_dummies(df)

In [None]:
# Let us view the dataframe; take note of the number of columns
df.head()

In [None]:
# The number of columns has grown considerably
df.shape

- So we now have 112 columns
- One-hot encoding is a good way to convert categorical columns into numerical ones, but it adds a lot of dimensionality to your data, i.e., it increases the number of columns
- It also becomes difficult to deal with that many columns
- This is a trade-off
- In a later part of the course, we will learn how to do dimensionality reduction
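
Much of the growth comes from identifier-like columns, where nearly every row has its own value. The sketch below uses a made-up four-row frame (all names are hypothetical) and the `columns` parameter of `pd.get_dummies()` to encode only the column that is a genuine category:

```python
import pandas as pd

frame = pd.DataFrame({
    'rollno': ['MS01', 'MS02', 'MS03', 'MS04'],
    'gender': ['female', 'female', 'male', 'male'],
    'age': [28, 33, 21, 44],
})

# Encoding everything: age + 4 rollno indicators + 2 gender indicators.
encoded_all = pd.get_dummies(frame)

# Encoding only gender: rollno and age are left as they are.
encoded_some = pd.get_dummies(frame, columns=['gender'])

print(encoded_all.shape)           # (4, 7)
print(list(encoded_some.columns))
```

Columns such as `rollno` carry no category information, so leaving them out of the encoding (or dropping them before modelling) keeps the column count manageable.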

### b. Perform One-Hot Encoding for Categorical Column `gender` Only
- In our dataframe, the gender column is a categorical column having two values, 'male' and 'female'
- `pd.get_dummies()` will create a dummy binary column for each of these values
- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.


In [None]:
import pandas as pd
df1 = pd.read_csv('datasets/group-marks.csv')
df1.head()

In [None]:
# Convert only gender variable into dummy/indicator variables
df2 = pd.get_dummies(df1[['gender']])
df2.head()

In [None]:
# Same conversion, but drop_first=True keeps only k-1 dummy columns (here just gender_male)
df2 = pd.get_dummies(df1[['gender']], drop_first=True)
df2.head()

In [None]:
df3 = df1.join(df2['gender_male'])
df3.head()