# <span style="color: #0F749B"> Data Cleaning for Loan Test and Train Datasets </span>
This Jupyter Notebook covers the data cleaning process for the loan train and loan test datasets. Here, we look for features that have missing values, determine methods of imputation and use other data cleaning techniques. 

## <span style="color: #04A4E2"> The Necessary Imports </span>
Before starting, we import the Python libraries needed for the data cleaning process: `pandas` for working with the data frames, and `display` from `IPython.display` for rendering them in the notebook.

In [None]:
# all required imports
import pandas as pd
from IPython.display import display

## <span style="color: #F4AB04"> Loading the Dataset </span>
Next, we load the datasets and look at their descriptions and summary information, so that we know whether there are null values and whether any other data cleaning is needed.

In [None]:
# load the data frames and display them
train_df = pd.read_csv("../data/loan-train.csv")
test_df = pd.read_csv("../data/loan-test.csv")

display(train_df)
display(test_df)

### <span style="color: #F4AB04"> A Deeper Dive into the Training Dataset </span>
To gain a better understanding of the dataset we're dealing with, we need to look deeper into the statistics of the dataset itself. The next cells give us a better idea of how the training dataset looks.

In [None]:
# display the information and descriptive statistics
# info() prints its summary directly and returns None, so it is not wrapped in display()
train_df.info()
display(train_df.describe())

### <span style="color: #F4AB04"> A Deeper Dive into the Testing Dataset </span>
To gain a better understanding of the dataset we're dealing with, we need to look deeper into the statistics of the dataset itself. The next cells give us a better idea of how the testing dataset looks.

In [None]:
# display the information and descriptive statistics
# info() prints its summary directly and returns None, so it is not wrapped in display()
test_df.info()
display(test_df.describe())

### <span style="color: #F4AB04"> Observations and Conclusion </span>
Looking at the statistics behind the datasets, we can see that the train and test datasets are very similar. As a result, it saves time to inspect one dataset, make the data cleaning decisions there and apply them to both datasets.
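
One way to back up this observation is to compare the two data frames column by column. The cell below is a minimal sketch on toy data, not part of the original cleaning process; `compare_schemas` is a hypothetical helper and the toy column names are assumptions used only for illustration.

```python
import pandas as pd


def compare_schemas(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarize how two data frames differ in columns and dtypes."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    shared = sorted(train_cols & test_cols)
    return {
        "only_in_train": sorted(train_cols - test_cols),
        "only_in_test": sorted(test_cols - train_cols),
        "dtype_mismatches": [c for c in shared if train[c].dtype != test[c].dtype],
    }


# toy frames standing in for train_df and test_df (hypothetical data)
toy_train = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "LoanAmount": [120.0, 66.0],
    "Loan_Status": ["Y", "N"],
})
toy_test = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "LoanAmount": [110, 75],
})

schema_report = compare_schemas(toy_train, toy_test)
print(schema_report)
```

Calling `compare_schemas(train_df, test_df)` on the real data frames would list any column that appears in only one dataset, such as a target column, along with shared columns whose data types differ.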

## <span style="color: #04A4E2"> Missing Value Imputations </span>
In these next steps, we look for missing values within the dataset and then decide on a method of imputation. This will allow us to fill out the missing values and ensure a more complete dataset.


### <span style="color: #04A4E2">Features with Null Values & Skew </span>
First, we look for features with null values. For each such feature, we investigate its skew, either by visualizing it or by using `.skew()`, and then decide on a method of imputation.
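
Before imputing anything, it can help to see the null counts and the skew side by side. The following is a minimal sketch on toy data; `null_and_skew_summary` is a hypothetical helper, and the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd


def null_and_skew_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and skewness for numeric columns that contain nulls."""
    numeric = df.select_dtypes(include="number")
    null_counts = numeric.isnull().sum()
    with_nulls = null_counts[null_counts > 0].index
    return pd.DataFrame({
        "null_count": null_counts[with_nulls],
        "skew": numeric[with_nulls].skew(),
    })


# toy data: one column with a missing value and a large outlier, one complete column
toy_df = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, 130.0, 900.0, np.nan],
    "ApplicantIncome": [2500, 3000, 4000, 5200, 6100],
})

summary = null_and_skew_summary(toy_df)
print(summary)
```

A positive value in the `skew` column indicates a right-skewed feature and a negative value a left-skewed one. Only columns that actually contain nulls appear in the summary.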

In [None]:
# dropping Loan_ID for train 
loan_ids = train_df['Loan_ID'].copy()
train_df.drop(columns=['Loan_ID'], inplace=True)

# initializing the lists of features with null values
null_train_numerical_list = []
null_train_categorical_list = []

# select the data types
train_numeric_features = train_df.select_dtypes(include='number')
train_categorical_features = train_df.select_dtypes(include='object')

In [None]:
# dropping Loan_ID for test
# stored under a separate name so that the train IDs in loan_ids are not overwritten
test_loan_ids = test_df['Loan_ID'].copy()
test_df.drop(columns=['Loan_ID'], inplace=True)

# initializing the lists of features with null values
null_test_numerical_list = []
null_test_categorical_list = []

# select the data types
test_numeric_features = test_df.select_dtypes(include='number')
test_categorical_features = test_df.select_dtypes(include='object')

In [None]:
# loop through the numeric train features
for column in train_numeric_features:

    if train_df[column].isnull().any():

        # add to the null numerical features lists
        null_train_numerical_list.append(column)
        null_test_numerical_list.append(column)

        # report the direction of the skew before imputing
        column_skew = train_df[column].skew()
        if column_skew > 0:
            print(f"The feature {column} is right skewed")
        elif column_skew < 0:
            print(f"The feature {column} is left skewed")

        # the median is robust to skew in either direction, so it is used in both cases
        train_df[column] = train_df[column].fillna(train_df[column].median())
        test_df[column] = test_df[column].fillna(test_df[column].median())

In [None]:
# a for loop that loops through the categorical train features
for column in train_categorical_features:
    
    if train_df[column].isnull().any():

        print(f"The feature {column} has null values!")

        # add to the null categorical features list
        null_train_categorical_list.append(column)
        null_test_categorical_list.append(column)

        # applies imputations for both datasets
        train_df[column] = train_df[column].fillna(train_df[column].mode()[0])
        test_df[column] = test_df[column].fillna(test_df[column].mode()[0])

In [None]:
print(train_df.isnull().values.any())
print(test_df.isnull().values.any())
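
The cells above fill each dataset with its own median or mode. A common alternative, shown here only as a sketch and not as the method this notebook uses, is to learn the fill values from the training data and reuse them for the test data, so that both datasets are imputed with the same values. `fit_fill_values` is a hypothetical helper and the toy data is made up.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training data only."""
    fill_values = {}
    for column in train.columns:
        if pd.api.types.is_numeric_dtype(train[column]):
            # median for numeric features
            fill_values[column] = train[column].median()
        else:
            # most frequent value for categorical features
            fill_values[column] = train[column].mode()[0]
    return fill_values


# toy frames standing in for train_df and test_df (hypothetical data)
toy_train = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 200.0],
    "Gender": ["Male", "Male", "Female", None],
})
toy_test = pd.DataFrame({
    "LoanAmount": [np.nan, 90.0],
    "Gender": [None, "Female"],
})

fill_values = fit_fill_values(toy_train)
toy_train_filled = toy_train.fillna(value=fill_values)
toy_test_filled = toy_test.fillna(value=fill_values)
print(toy_test_filled)
```

Passing a dictionary to `fillna` fills each column with its own value, which keeps the numeric and categorical cases in a single call.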

## <span style="color: #F4AB04"> Converting Categorical Features to Numerical </span>
Next, we convert the categorical features to numerical ones. If a categorical feature has only two unique values, we map them to 0 and 1; if it has more than two unique values, we use one hot encoding.
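
To make the difference between the two approaches concrete, here is a small sketch on toy data. The column names and values are assumptions chosen for illustration and are not taken from the cells in this notebook.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# two unique values: a single 0/1 column is enough
married_mapping = {"No": 0, "Yes": 1}
toy_df["Married"] = toy_df["Married"].map(married_mapping).astype("int")

# more than two unique values: one indicator column per category
area_dummies = pd.get_dummies(toy_df["Property_Area"], prefix="Property_Area", dtype=int)
toy_encoded = pd.concat([toy_df.drop(columns=["Property_Area"]), area_dummies], axis=1)
print(toy_encoded)
```

The binary feature stays a single column, while the three-valued feature becomes three indicator columns, one per category.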

### <span style="color: #F4AB04"> Categorical Features with Two Unique Values </span>
To begin, we look for categorical features with two unique values and map those values to 0's and 1's. No one hot encoding is required here.

In [None]:
# loop through all categorical features
for feature in train_categorical_features:

    # skip features that only exist in train
    if feature not in test_df.columns:
        continue

    if train_df[feature].nunique() == 2:
        print(f"The feature {feature} has two unique values!")

        # build one mapping from train and reuse it for test, so that
        # each value gets the same code in both datasets
        unique_train_values = train_df[feature].unique()
        unique_mapping = {unique_train_values[0]: 0, unique_train_values[1]: 1}

        # apply mapping and change data type for train
        train_df[feature] = train_df[feature].map(unique_mapping)
        train_df[feature] = train_df[feature].astype("int")

        # apply mapping and change data type for test
        test_df[feature] = test_df[feature].map(unique_mapping)
        test_df[feature] = test_df[feature].astype("int")

### <span style="color: #F4AB04"> Categorical Features with More Than Two Unique Values</span>
Next, we look for categorical features with more than two unique values and apply one hot encoding to them, so that the dataset is fit for machine learning.

In [None]:
# looping through the categorical features
for feature in train_categorical_features:
    
    if train_df[feature].nunique() > 2:

        print(f"The feature {feature} has more than two unique values!")

        train_encode = pd.get_dummies(train_df[feature], prefix=feature, dtype=int)
        train_df.drop(columns=[feature], inplace=True)
        train_df = pd.concat([train_df, train_encode], axis=1)

# adding the loan IDs back to the dataset
train_df.insert(0, 'Loan_ID', loan_ids)

# displaying the new dataframe
train_df
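
The cell above encodes only the training dataset. If the test dataset is encoded in the same way, its one hot columns may not line up with the training columns when a category is missing from test. The following is a minimal sketch of one way to align them with `reindex`, using toy data; the column name and values are assumptions for illustration.

```python
import pandas as pd

# toy frames standing in for train_df and test_df (hypothetical data)
toy_train = pd.DataFrame({"Dependents": ["0", "1", "2", "3+"]})
toy_test = pd.DataFrame({"Dependents": ["0", "2", "0"]})

train_encoded = pd.get_dummies(toy_train["Dependents"], prefix="Dependents", dtype=int)
test_encoded = pd.get_dummies(toy_test["Dependents"], prefix="Dependents", dtype=int)

# reuse the training columns; categories absent from test become all-zero columns
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```

After the `reindex` call, both encoded frames have the same columns in the same order, so a model trained on one can be applied to the other.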