# Feature Engineering

This section covers the following feature engineering steps:

    Missing values
    Categorical variables: remove rare labels
    Outlier treatment
    Standardise the values of the variables to the same range



Missing values are filled with the mean or median for numerical variables and with the mode for categorical variables.

In [111]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# show all the columns when displaying a dataframe
pd.set_option('display.max_columns', None)

In [112]:
# Load the training data
df_train = pd.read_csv('/workspaces/loan-elegibility-prediction/data/raw/loan-train.csv')

We exclude 'Loan_ID' from the data set because it is only a row identifier and carries no predictive information.

In [113]:
df_train = df_train.drop(columns=['Loan_ID'])

**Missing Values Treatment**

In [114]:
df_train.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

There are missing values in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History features.
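Raw counts are easier to judge as a share of the rows. The following is a minimal sketch of that calculation on a small made-up frame; `toy` and its values are hypothetical stand-ins, not the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Keep only the columns that actually have gaps
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

The same expression can be applied to `df_train` to rank the features by how much of each one is missing.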

We will treat the missing values in all the features one by one.

We can consider these methods to fill the missing values:

    For numerical variables: imputation using mean or median
    For categorical variables: imputation using mode
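The two rules above can be sketched as a single helper. This is only an illustration under stated assumptions: `impute_simple` and the `toy` frame are hypothetical names, and the helper picks the rule from the column dtype.

```python
import numpy as np
import pandas as pd

def impute_simple(df, numeric_strategy="median"):
    """Return a copy with numeric gaps filled by mean/median and other gaps by mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            if numeric_strategy == "median":
                fill = out[col].median()
            else:
                fill = out[col].mean()
        else:
            fill = out[col].mode()[0]  # most frequent label
        out[col] = out[col].fillna(fill)
    return out

# Hypothetical toy data, not the loan data set
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 120.0, 700.0],
    "Married": ["Yes", "Yes", None, "No"],
})
filled = impute_simple(toy)
print(filled)
```

Because the helper decides by dtype, a numeric 0/1 flag such as Credit_History would receive the median rather than the mode; a column like that is better handled explicitly.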

Gender, Married, Dependents, Self_Employed, and Credit_History each have few missing values (at most 50, in Credit_History) and take a small set of discrete values, so we fill them with the mode of each feature.

In [115]:
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which is deprecated and will stop working in pandas 3.0
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df_train[col] = df_train[col].fillna(df_train[col].mode()[0])

To handle the missing values in "Loan_Amount_Term", we first examine how often each term length occurs.

In [116]:
print(df_train.Loan_Amount_Term.unique())
df_train.Loan_Amount_Term.value_counts()

[360. 120. 240.  nan 180.  60. 300. 480.  36.  84.  12.]


Loan_Amount_Term
360.0    512
180.0     44
480.0     15
300.0     13
84.0       4
240.0      4
120.0      3
60.0       2
36.0       2
12.0       1
Name: count, dtype: int64

The term length 360 is by far the most frequent value of "Loan_Amount_Term" (512 of the 600 non-missing rows), so we fill the missing values with this mode.

In [117]:
# Assign the filled column back rather than using inplace=True
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(df_train['Loan_Amount_Term'].mode()[0])


We now turn to "LoanAmount", a numerical variable.
Because the loan amounts contain outliers, the mean is a poor fill value: it is pulled toward the extreme values.
We therefore impute the missing values with the median, which is far less sensitive to outliers.
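A small numeric illustration of why the median is preferred here. The amounts below are made up for the example and are not taken from the loan data.

```python
import numpy as np

# Hypothetical loan amounts with one extreme value
amounts = np.array([100.0, 110.0, 120.0, 130.0, 700.0])

print("mean:  ", amounts.mean())      # pulled upward by the 700 entry
print("median:", np.median(amounts))  # stays at the middle value
```

The single large value roughly doubles the mean relative to the typical amount, while the median is unchanged by it.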

In [118]:
# Assign the filled column back rather than using inplace=True
df_train['LoanAmount'] = df_train['LoanAmount'].fillna(df_train['LoanAmount'].median())


In [119]:
df_train.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

**Categorical Variables Encoding**

In [120]:
# Encode each categorical column as integers. Series.map is used instead of
# replace so that a label missing from the mapping shows up as NaN rather
# than being silently left unchanged.
df_train['Loan_Status'] = df_train['Loan_Status'].map({'Y': 1, 'N': 0})
df_train['Gender'] = df_train['Gender'].map({'Male': 1, 'Female': 0})
df_train['Married'] = df_train['Married'].map({'Yes': 1, 'No': 0})
df_train['Self_Employed'] = df_train['Self_Employed'].map({'Yes': 1, 'No': 0})
# The keys must match the labels in the data exactly
df_train['Education'] = df_train['Education'].map({'Graduate': 1, 'Not Graduate': 0})
df_train['Property_Area'] = df_train['Property_Area'].map({'Rural': 1, 'Semiurban': 2, 'Urban': 3})
df_train['Dependents'] = df_train['Dependents'].map({'0': 0, '1': 1, '2': 2, '3+': 3})

In [121]:
df_train[0:20]

|   | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|--------|---------|------------|-----------|---------------|-----------------|-------------------|------------|------------------|----------------|---------------|-------------|
| 0 | 1 | 0 | 0 | 1 | 0 | 5849 | 0.0 | 128.0 | 360.0 | 1.0 | 3 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 3 | 1 |
| 3 | 1 | 1 | 0 | 0 | 0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 3 | 1 |
| 4 | 1 | 0 | 0 | 1 | 0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 3 | 1 |
| 5 | 1 | 1 | 2 | 1 | 1 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | 3 | 1 |
| 6 | 1 | 1 | 0 | 0 | 0 | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | 3 | 1 |
| 7 | 1 | 1 | 3 | 1 | 0 | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | 2 | 0 |
| 8 | 1 | 1 | 2 | 1 | 0 | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | 3 | 1 |
| 9 | 1 | 1 | 1 | 1 | 0 | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | 2 | 0 |
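A misspelled key in a mapping dictionary is easy to miss, because `replace` leaves unmatched labels untouched. The sketch below shows one way to make such mistakes fail loudly; `encode_checked`, the toy series, and both mappings are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy column using the same labels as the Education feature
education = pd.Series(["Graduate", "Not Graduate", "Graduate"])

good_mapping = {"Graduate": 1, "Not Graduate": 0}
bad_mapping = {"GraduateYes": 1, "Not Graduate": 0}  # deliberately misspelled key

def encode_checked(series, mapping):
    """Map labels to integers and raise if any label has no mapping."""
    encoded = series.map(mapping)
    # Labels that were present in the input but came out as NaN were not mapped
    unmapped = series[encoded.isnull() & series.notnull()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return encoded.astype(int)

print(encode_checked(education, good_mapping).tolist())

try:
    encode_checked(education, bad_mapping)
except ValueError as err:
    print(err)
```

Running a check like this after each encoding step catches labels that the dictionary does not cover before they reach the model.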
