### Data Cleaning

#### Importing libraries

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv("datasets/toy_dataset_modified.csv")
df


| | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|
| 0 | New York City | Male | 49.0 | 112226.0 | No |
| 1 | New York City | Male | 42.0 | 110534.0 | No |
| 2 | New York City | Female | 61.0 | 100665.0 | No |
| 3 | New York City | Female | 58.0 | 98147.0 | Yes |
| 4 | New York City | Female | 43.0 | 93100.0 | No |
| ... | ... | ... | ... | ... | ... |
| 1495 | New York City | Male | 33.0 | 123132.0 | No |
| 1496 | New York City | Female | 48.0 | 96889.0 | No |
| 1497 | New York City | Male | 27.0 | 93822.0 | No |
| 1498 | New York City | Female | 36.0 | 116129.0 | No |


As you can see, we have 1500 rows and 5 columns in total.
We will first analyze the dataset by checking for null values.

Before imputing missing values, we first need to detect them. After confirming the shape of the data, we count the missing values in each column with the `isnull()` method, as shown below.

In [3]:
#see the shape of the data
df.shape


(1500, 5)

### See how many missing data points we have

In [7]:
#sum of null values in each column
df.isnull().sum()

City        0
Gender     21
Age        34
Income      0
Illness     3
dtype: int64

A missing value is an entry in the dataset that is absent or null, for example because it was not recorded during research or data collection.

Missing values are a problem when training a machine learning model for the following reasons:
*   They **reduce the efficiency** of the ML model.
*   They **affect the overall distribution** of data values.
*   They can **bias** the estimates produced by the ML model.
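
To make the first two points concrete, here is a minimal sketch on made-up data (the values below are hypothetical and not taken from the toy dataset). It measures the share of missing values per column and shows that simply dropping incomplete rows shrinks the sample and shifts the mean of a column that had no missing values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset; values are illustrative only
toy = pd.DataFrame({
    "Age": [25.0, 30.0, np.nan, 45.0, np.nan, 60.0],
    "Income": [40000.0, 50000.0, 60000.0, 70000.0, 80000.0, 90000.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# Listwise deletion: drop every row that has at least one missing value
complete_rows = toy.dropna()

rows_before, rows_after = len(toy), len(complete_rows)
income_mean_before = toy["Income"].mean()
income_mean_after = complete_rows["Income"].mean()

print(missing_pct)
print(f"rows: {rows_before} -> {rows_after}")
print(f"mean income: {income_mean_before} -> {income_mean_after}")
```

Imputation, covered next, is one way to avoid losing these rows.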

We can use the `SimpleImputer` class from scikit-learn to replace missing values with a fill value. [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) has a parameter called `strategy` that offers four imputation methods:

*   `strategy='mean'` replaces missing values using the mean of the column.
*   `strategy='median'` replaces missing values using the median of the column.
*   `strategy='most_frequent'` replaces missing values using the most frequent value (the mode) of the column.
*   `strategy='constant'` replaces missing values using the constant given by `fill_value`.
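
The four strategies listed above can be compared side by side. The sketch below uses a small hypothetical single-column array (not the toy dataset) and records the value each strategy puts in place of the missing entry; the choice of `fill_value=0.0` for the constant strategy is an assumption made only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing entry (row 3)
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

results = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # fill_value is only needed when strategy="constant"
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    filled = imputer.fit_transform(X)
    results[strategy] = filled[3, 0]  # the value that replaced the NaN

print(results)
```

On this data the mean (3.75) is pulled up by the value 10, while the median and the most frequent value are both 2.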

### **Imputing Missing Data with Simple Techniques**

#### Imputation for Numeric Features

#### Using Pandas
A common way to impute a numeric feature is to replace missing values with the mean of the feature’s non-missing values. If the data contain outliers, the median is usually the safer choice because it is less affected by extreme values. Either method is easy in pandas:

In [6]:
#impute Income column with mean value
#assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
df['Income'] = df['Income'].fillna(df['Income'].mean())


In [9]:
#impute Age column with median value
#assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())


In [20]:
#check the sum of null values in each column
df.isnull().sum()




City        0
Gender     21
Age         0
Income      0
Illness     0
dtype: int64
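
The choice between mean and median matters most when outliers are present. As a minimal sketch, assume a hypothetical income series (not the toy dataset) with one extreme value and one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme outlier and one missing value
income = pd.Series([40000.0, 42000.0, 45000.0, np.nan, 1000000.0])

mean_value = income.mean()      # pulled upward by the outlier
median_value = income.median()  # robust to the outlier

# fillna returns a new Series; the original is left unchanged
filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(filled_with_mean[3], filled_with_median[3])
```

Here the mean fill is 281750.0, far above every typical value, while the median fill is 43500.0.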

#### Using scikit-learn
The same imputation can be done with scikit-learn's `SimpleImputer`. It computes the fill value (here the column mean) when it is fitted and then replaces the missing entries with it. If the data contain outliers, use `strategy='median'` instead:

In [13]:
#impute income column with mean value using SimpleImputer
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#fit_transform returns a 2-D array, so flatten it before assigning to the column
df['Income'] = imputer.fit_transform(df[['Income']]).ravel()
df.head()

| | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|
| 0 | New York City | Male | 49.0 | 112226.0 | No |
| 1 | New York City | Male | 42.0 | 110534.0 | No |
| 2 | New York City | Female | 61.0 | 100665.0 | No |
| 3 | New York City | Female | 58.0 | 98147.0 | Yes |
| 4 | New York City | Female | 43.0 | 93100.0 | No |
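
One practical reason to prefer `SimpleImputer` over a one-off `fillna` is that the fitted imputer stores its fill value and can apply it to data it has not seen. The sketch below assumes a hypothetical train/test split with made-up values; `statistics_` is the attribute where the fitted imputer keeps the learned value for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test split; values are illustrative only
train = pd.DataFrame({"Income": [100.0, 200.0, np.nan, 300.0]})
test = pd.DataFrame({"Income": [np.nan, 500.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train[["Income"]])         # learn the mean from the training data only

learned_mean = imputer.statistics_[0]  # fill value stored by fit()

# transform() returns a 2-D array, so flatten it before assigning to a column
test_filled = test.copy()
test_filled["Income"] = imputer.transform(test[["Income"]]).ravel()

print(learned_mean)
print(test_filled)
```

The missing test value is filled with the training mean (200.0), not with a statistic computed from the test data.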


#### Imputation for Categorical Features

In [19]:
#impute Illness column with most frequent value using SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#fit_transform returns a 2-D array, so flatten it before assigning to the column
df['Illness'] = imputer.fit_transform(df[['Illness']]).ravel()
df.head()




| | City | Gender | Age | Income | Illness |
|---|---|---|---|---|---|
| 0 | New York City | Male | 49.0 | 112226.0 | No |
| 1 | New York City | Male | 42.0 | 110534.0 | No |
| 2 | New York City | Female | 61.0 | 100665.0 | No |
| 3 | New York City | Female | 58.0 | 98147.0 | Yes |
| 4 | New York City | Female | 43.0 | 93100.0 | No |
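
Categorical columns can also be imputed directly in pandas. The sketch below uses a hypothetical `Gender` series (made-up values, not the toy dataset) and shows two options: filling with the mode, or filling with an explicit placeholder. The placeholder label `"Unknown"` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with missing entries
gender = pd.Series(["Male", "Female", np.nan, "Female", np.nan], name="Gender")

# Option 1: fill with the most frequent category (the mode)
most_common = gender.mode()[0]
filled_mode = gender.fillna(most_common)

# Option 2: keep missingness visible with an explicit placeholder category
filled_unknown = gender.fillna("Unknown")

print(filled_mode.tolist())
print(filled_unknown.tolist())
```

Filling with the mode inflates the most common category, so a placeholder may be preferable when the fact that a value was missing could itself be informative.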
