<a href="https://colab.research.google.com/github/chandankumar3it/bank-loan-eda-analysis/blob/main/Bank_Loan_Case_Study_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing Required Libraries

In [63]:
# Importing the core libraries: numpy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [64]:
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [65]:
# Notebook settings to display all the rows and columns for better clarity on the data.

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

# Work on Dataset - application_data.csv

## Reading and Understanding the Dataset

#### Importing the dataset





In [66]:
# importing application_data.csv

appData_df = pd.read_csv("application_data.csv")

#### Understanding the dataset

In [None]:
appData_df.head()

In [68]:
#Checking the rows and columns of the raw dataset

appData_df.shape

(307511, 122)

In [None]:
# Checking information on all the columns, such as data types
appData_df.info(verbose=True)

In [None]:
# Checking the numeric variables of the dataframes
appData_df.describe()

**INSIGHT**


*   There are 122 columns and 307511 rows.
*   Some columns that count days contain negative as well as positive values. These need to be fixed.
*   The amount-related columns have very high values. Standardising is required.









## Data Cleaning & Manipulation

### Data Quality Check - Missing Values

In [165]:
# Checking the percentage of null values present in each of the columns

# Helper that returns the percentage of missing values per column, highest first
def missing_values(df):
    return 100 * df.isnull().mean().sort_values(ascending=False)
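
To make the helper's output concrete, here is a minimal sketch on a toy frame. The frame `toy_df` and its values are hypothetical and only illustrate the shape of the result; the helper is restated so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): "a" is complete, "b" is 50% missing, "c" is 75% missing
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan, "x"],
})

def missing_values(df):
    """Return the percentage of missing values per column, highest first."""
    return 100 * df.isnull().mean().sort_values(ascending=False)

toy_missing = missing_values(toy_df)
print(toy_missing)
```

The result is a Series indexed by column name, which is why it can be filtered with a boolean condition and its `.index` passed straight to `drop`.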


In [None]:
# Missing values columns

null_col = missing_values(appData_df)
null_col

**INSIGHT**


### Remove the columns with Missing values more than 40%

In [None]:
# Creating a variable missing_value_col_40 to store the columns having more than 40% missing values

missing_value_col_40 = null_col[null_col>40]
missing_value_col_40

In [None]:
# Reviewing missing_value_col_40

print(missing_value_col_40)
print()
print("Number of columns having missing values more than 40% :",len(missing_value_col_40))

**INSIGHT**

* There are 49 columns with more than 40% null values, which relate to different area sizes of the apartment owned/rented by the loan applicant.
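
As a rough illustration of the threshold rule, the sketch below drops columns above 40% missing on a toy frame. The column names and values are hypothetical; returning a new frame rather than using `inplace=True` is a choice made here so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% missing
})

threshold = 40  # percent, the cut-off used in this notebook
pct_missing = 100 * toy_df.isnull().mean()
cols_to_drop = pct_missing[pct_missing > threshold].index

# Build a trimmed copy; toy_df itself is left unchanged
trimmed_df = toy_df.drop(columns=cols_to_drop)
print(list(cols_to_drop), trimmed_df.shape)
```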

In [None]:
# We will drop all these columns
missing_value_col_40.index

In [76]:
# Drop all the columns having missing values more than 40%

appData_df.drop(columns = missing_value_col_40.index, inplace = True)

In [77]:
appData_df.shape

(307511, 73)

* **After dropping 49 columns, we are left with 73 columns.**

### Dealing with null values less than 15%

In [None]:
# Columns with null values < 15%

missing_value_col_15 = null_col[null_col<15]
print("Number of columns with null value less than 15% :", len(missing_value_col_15.index))
print(missing_value_col_15)


*   There are 71 columns which have less than 15% missing values



In [None]:
missing_value_col_15.index

In [None]:
# Reviewing the columns
print(missing_value_col_15)
print()
print("Number of columns having missing values less than 15% :",len(missing_value_col_15))

### Analysing & Removing Unnecessary Columns

In [None]:
# Counting unique values in the columns with < 15% missing values

appData_df[missing_value_col_15.index].nunique().sort_values(ascending=False)

* **From the above we can see that the first two (EXT_SOURCE_2, AMT_GOODS_PRICE) are continuous variables and the remaining are categorical variables.**

In [None]:
# Continuous variable - EXT_SOURCE_2

sns.boxplot(x=appData_df['EXT_SOURCE_2'])
plt.show()

In [None]:
# Continuous variable - AMT_GOODS_PRICE

sns.boxplot(x=appData_df['AMT_GOODS_PRICE'])
plt.show()

Observations from the boxplots:
*   For 'EXT_SOURCE_2', no outliers are present, so the data can be used as it is.
*   For 'AMT_GOODS_PRICE', outliers are present, so missing values should be imputed with the median rather than the mean.
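
The boxplot reading can be cross-checked numerically. The sketch below applies the usual 1.5 × IQR whisker rule to a toy series and shows why the median is the safer fill value when an outlier is present. The values are hypothetical, not taken from `application_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value and one missing entry
prices = pd.Series([100, 120, 130, np.nan, 150, 160, 5000], dtype=float)

# Whisker limits as drawn by a standard boxplot (1.5 * IQR rule); NaN is ignored by quantile()
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

# The mean is pulled up by the outlier, the median is not
print("mean:", prices.mean(), "median:", prices.median())

# Fill the gap with the median
imputed = prices.fillna(prices.median())
```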



In [None]:
for col in appData_df.columns:
    print(col)

### Removing the Unused Columns

In [85]:
# Unused columns in the dataset (each column listed once)
unused_col = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE','FLAG_PHONE', 'FLAG_EMAIL',
          'REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','CNT_FAM_MEMBERS',
          'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3','FLAG_DOCUMENT_4',
          'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10',
          'FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
          'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18','FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
          'FLAG_DOCUMENT_21','EXT_SOURCE_2','EXT_SOURCE_3']

In [86]:
# Dropping unused columns
appData_df.drop(labels = unused_col, axis=1, inplace = True)

In [None]:
appData_df.head()

In [88]:
appData_df.shape

(307511, 42)

## Imputing values

In [89]:
# Checking the column 'CODE_GENDER' for the value 'XNA', which means not available

appData_df.CODE_GENDER.value_counts()

F      202448
M      105059
XNA         4
Name: CODE_GENDER, dtype: int64

* The XNA count is very low and Female is the majority, so let's replace XNA with gender 'F'.

In [90]:
# Replacing 'XNA' with 'F' in 'CODE_GENDER'

appData_df.loc[appData_df.CODE_GENDER == 'XNA', 'CODE_GENDER'] = 'F'

In [91]:
# Reviewing the 'CODE_GENDER'
appData_df.CODE_GENDER.value_counts()

F    202452
M    105059
Name: CODE_GENDER, dtype: int64

In [None]:
# checking the CODE_GENDER

appData_df.CODE_GENDER.head(10)

In [93]:
appData_df["CODE_GENDER"].isnull().sum()

0

### Imputing for "OCCUPATION_TYPE" column

In [None]:
#Percentage of each category present in "OCCUPATION_TYPE"

appData_df["OCCUPATION_TYPE"].value_counts(normalize=True)*100

In [95]:
# Checking null value in column OCCUPATION_TYPE
appData_df["OCCUPATION_TYPE"].isnull().sum()

96391

* There are 96391 rows with a null value in the OCCUPATION_TYPE column.

**Insight:**

* From the above, this column is categorical and has missing values.
* To fix this, we will impute the missing values with a new category, "Unknown".

In [96]:
# imputing null values with "Unknown"

appData_df["OCCUPATION_TYPE"] = appData_df["OCCUPATION_TYPE"].fillna("Unknown")

In [97]:
# Reviewing the null values in column OCCUPATION_TYPE
appData_df["OCCUPATION_TYPE"].isnull().sum()

0

In [None]:
# Plotting a percentage graph of each category of "OCCUPATION_TYPE"

plt.figure(figsize = [12,7])
(appData_df["OCCUPATION_TYPE"].value_counts(normalize=True)*100).plot.barh(color= "orange",width = .8)
plt.title("Type of Occupations", fontdict={"fontsize":20}, pad =20)
plt.show()

* **The highest percentage of values belongs to the Unknown group and the second highest to Laborers.**



In [None]:
appData_df.info(verbose=True)

### **Now let's move to other 6 columns :**
**"AMT_REQ_CREDIT_BUREAU_YEAR", "AMT_REQ_CREDIT_BUREAU_QRT","AMT_REQ_CREDIT_BUREAU_MON", "AMT_REQ_CREDIT_BUREAU_WEEK","AMT_REQ_CREDIT_BUREAU_DAY", "AMT_REQ_CREDIT_BUREAU_HOUR"**

In [100]:
appData_df[["AMT_REQ_CREDIT_BUREAU_YEAR","AMT_REQ_CREDIT_BUREAU_QRT","AMT_REQ_CREDIT_BUREAU_MON","AMT_REQ_CREDIT_BUREAU_WEEK",
"AMT_REQ_CREDIT_BUREAU_DAY","AMT_REQ_CREDIT_BUREAU_HOUR"]].describe()

| | AMT_REQ_CREDIT_BUREAU_YEAR | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_HOUR |
|---|---|---|---|---|---|---|
| count | 265992.0 | 265992.0 | 265992.0 | 265992.0 | 265992.0 | 265992.0 |
| mean | 1.899974 | 0.265474 | 0.267395 | 0.034362 | 0.007 | 0.006402 |
| std | 1.869295 | 0.794056 | 0.916002 | 0.204685 | 0.110757 | 0.083849 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 25.0 | 261.0 | 27.0 | 8.0 | 9.0 | 4.0 |


* **The columns above represent the number of enquiries made about the customer, which should be discrete rather than continuous.**
* **From the describe results above, all values are numerical, but the means are fractional, which is not a valid enquiry count. We will therefore impute the missing values with the median for all these columns.**
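
A small sketch of this reasoning on toy data follows. The column names `req_year`, `req_week` and `other` are hypothetical stand-ins. It also shows that passing a Series of medians to `DataFrame.fillna` fills only the columns named in that Series, leaving other columns with missing values untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical enquiry counts plus one unrelated column that also has a gap
toy_df = pd.DataFrame({
    "req_year": [0, 1, np.nan, 3, 1],
    "req_week": [0, np.nan, 0, 0, 1],
    "other": [1.5, np.nan, 2.5, 3.5, 4.5],
})
count_cols = ["req_year", "req_week"]

# The mean of a count column is usually fractional; the median of these toy columns is a whole number
print(toy_df[count_cols].mean())
medians = toy_df[count_cols].median()
print(medians)

# fillna with a Series keyed by column name only touches those columns
filled_df = toy_df.fillna(medians)
print(filled_df)
```

With an even number of observed values the median can still land on a half (for example 0.5), so it is worth checking the computed medians before relying on them as counts.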

In [101]:
# Creating the "amt_req_credit" variable holding the six AMT_REQ_CREDIT_BUREAU_* columns
# (YEAR, QRT, MON, WEEK, DAY, HOUR)

amt_req_credit = ["AMT_REQ_CREDIT_BUREAU_YEAR","AMT_REQ_CREDIT_BUREAU_QRT","AMT_REQ_CREDIT_BUREAU_MON","AMT_REQ_CREDIT_BUREAU_WEEK",
"AMT_REQ_CREDIT_BUREAU_DAY","AMT_REQ_CREDIT_BUREAU_HOUR"]

In [102]:
# Filling missing values with the median; the Series of medians only affects the columns it names

appData_df.fillna(appData_df[amt_req_credit].median(),inplace = True)

In [None]:
missing_values(appData_df).head(10)

**There are still some columns with missing values, but we will not impute them because the missing counts are very low.**

In [104]:
# Casting variables to numeric in the dataset

numerical_columns=['TARGET','CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','REGION_POPULATION_RELATIVE',
                 'DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','HOUR_APPR_PROCESS_START',
                 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
                'DAYS_LAST_PHONE_CHANGE']

In [None]:
appData_df[numerical_columns] = appData_df[numerical_columns].apply(pd.to_numeric)
appData_df.head(10)

In [106]:
appData_df.shape

(307511, 42)

## **Standardising values**

In [None]:
appData_df.describe()

**Insights:**

From above describe result we can see that

* The columns DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH and DAYS_LAST_PHONE_CHANGE, which count days, have negative values, so we will correct them.
* Convert DAYS_BIRTH to AGE in years and DAYS_EMPLOYED to YEARS_EMPLOYED.
* The columns AMT_INCOME_TOTAL, AMT_CREDIT and AMT_GOODS_PRICE have very high values, so we will bin these numerical columns into categorical columns for better understanding.
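
The three steps listed above can be sketched on toy data as follows. The `DAYS_BIRTH` values are hypothetical; the bins and labels are the ones this notebook uses for `AGE_GROUP`. Two `pd.cut` behaviours are worth noting: intervals are right-closed by default, so exactly 20 years falls in "0-20", and values outside the outermost bin edges become `NaN`.

```python
import pandas as pd

# Hypothetical DAYS_BIRTH values, stored as negative day counts
toy_df = pd.DataFrame({"DAYS_BIRTH": [-7300, -10950, -25550, -40150]})

# Step 1: make the day counts positive
toy_df["DAYS_BIRTH"] = toy_df["DAYS_BIRTH"].abs()

# Step 2: convert days to years (365-day years)
toy_df["AGE"] = toy_df["DAYS_BIRTH"] / 365

# Step 3: bin into labelled groups; intervals are right-closed, e.g. (0, 20]
bins = [0, 20, 25, 30, 35, 40, 45, 50, 55, 60, 100]
slots = ["0-20", "20-25", "25-30", "30-35", "35-40", "40-45", "45-50", "50-55", "55-60", "60 Above"]
toy_df["AGE_GROUP"] = pd.cut(toy_df["AGE"], bins=bins, labels=slots)

# The last row (110 years) lies beyond the final bin edge and is left unbinned (NaN)
print(toy_df)
```

The same out-of-range behaviour applies to any binned column, so the widest bin edge should cover the column's maximum if every row is meant to receive a label.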



### **Dealing with columns :**
**"DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION", "DAYS_ID_PUBLISH", "DAYS_LAST_PHONE_CHANGE"**


**The columns DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH and DAYS_LAST_PHONE_CHANGE, which count days, have negative values, so we will correct them.**


In [None]:
# Creating the "days_col" variable to store all the days columns
days_col = ["DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION", "DAYS_ID_PUBLISH", "DAYS_LAST_PHONE_CHANGE"]

appData_df[days_col].describe()

**From the above, the day counts are negative, which is unusual. To correct this, we use the absolute-value function as below.**

In [109]:
#using abs() function to correct the days values

appData_df[days_col]= abs(appData_df[days_col])

In [None]:
# Reviewing the corrected data

appData_df[days_col].describe()

**Now convert the DAYS_BIRTH and DAYS_EMPLOYED columns to years and bin the years for better understanding, which adds two more categorical columns.**

In [111]:
appData_df["AGE"] = appData_df["DAYS_BIRTH"]/365
bins = [0,20,25,30,35,40,45,50,55,60,100]
slots = ["0-20","20-25","25-30","30-35","35-40","40-45","45-50","50-55","55-60","60 Above"]

appData_df["AGE_GROUP"] = pd.cut(appData_df["AGE"], bins=bins, labels=slots)

In [None]:
appData_df["AGE_GROUP"].value_counts(normalize= True)*100

In [113]:
#creating column "EMPLOYEMENT_YEARS" from "DAYS_EMPLOYED"

appData_df["YEARS_EMPLOYED"] = appData_df["DAYS_EMPLOYED"]/365
bins = [0,5,10,15,20,25,30,50]
slots = ["0-5","5-10","10-15","15-20","20-25","25-30","30 Above"]

appData_df["EMPLOYEMENT_YEARS"] = pd.cut(appData_df["YEARS_EMPLOYED"], bins=bins, labels=slots)

In [None]:
appData_df["EMPLOYEMENT_YEARS"].value_counts(normalize= True)*100

**Taking care of Columns: AMT_INCOME_TOTAL, AMT_CREDIT, AMT_GOODS_PRICE**

In [115]:
# Binning Numerical Columns to create a categorical column

# Creating bins for AMT_INCOME_TOTAL in terms of lakhs
appData_df['AMT_INCOME_TOTAL']=appData_df['AMT_INCOME_TOTAL']/100000

bins = [0,1,2,3,4,5,6,7,8,9,10,100]
slots = ['0-1L','1L-2L', '2L-3L','3L-4L','4L-5L','5L-6L','6L-7L','7L-8L','8L-9L','9L-10L','10L Above']

appData_df['AMT_INCOME_RANGE']=pd.cut(appData_df['AMT_INCOME_TOTAL'],bins=bins,labels=slots)

In [None]:
appData_df["AMT_INCOME_RANGE"].value_counts(normalize = True)*100

In [117]:
# Creating bins for AMT_CREDIT in terms of lakhs
appData_df['AMT_CREDIT']=appData_df['AMT_CREDIT']/100000

bins = [0,1,2,3,4,5,6,7,8,9,10,100]
slots = ['0-1L','1L-2L', '2L-3L','3L-4L','4L-5L','5L-6L','6L-7L','7L-8L','8L-9L','9L-10L','10L Above']

appData_df['AMT_CREDIT_RANGE']=pd.cut(appData_df['AMT_CREDIT'],bins=bins,labels=slots)

In [None]:
appData_df["AMT_CREDIT_RANGE"].value_counts(normalize = True)*100

In [119]:
# Creating bins for AMT_GOODS_PRICE in terms of lakhs
appData_df['AMT_GOODS_PRICE']=appData_df['AMT_GOODS_PRICE']/100000

bins = [0,1,2,3,4,5,6,7,8,9,10,100]
slots = ['0-1L','1L-2L', '2L-3L','3L-4L','4L-5L','5L-6L','6L-7L','7L-8L','8L-9L','9L-10L','10L Above']

appData_df['AMT_GOODS_PRICE_RANGE']=pd.cut(appData_df['AMT_GOODS_PRICE'],bins=bins,labels=slots)

In [None]:
appData_df["AMT_GOODS_PRICE_RANGE"].value_counts(normalize = True)*100

## Identifying Outliers

In [None]:
appData_df.describe()

**INSIGHT**
* From the above, the columns with a large difference between the maximum and the 75th percentile, and those whose maximum is implausibly high, are captured below:

In [124]:
outlier_col = ["CNT_CHILDREN","AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE",
               "DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION"]

In [None]:
# Boxplots showing the outliers/distribution of each column in outlier_col
for col in outlier_col:
    plt.figure(figsize = [5,4])
    plt.title(col)
    sns.boxplot(y = appData_df[col])
    plt.show()


**Insight:**

**It can be seen that in the current application data:**

* **CNT_CHILDREN, AMT_ANNUITY, AMT_CREDIT and AMT_GOODS_PRICE have some outliers.**
* **AMT_INCOME_TOTAL has a huge number of outliers, which indicates that a few of the loan applicants have very high incomes compared to the others.**
* **DAYS_BIRTH has no outliers, which means the data is reliable.**
* **DAYS_EMPLOYED has outlier values around 350000 days, which is around 958 years. This is impossible, so it has to be an incorrect entry.**
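
The arithmetic behind the last point, and one possible way to handle such entries, can be sketched as below. The toy values and the 100-year plausibility threshold are assumptions made for illustration. Masking the implausible entries as missing is only one option and is not something this notebook does.

```python
import pandas as pd

# Hypothetical DAYS_EMPLOYED values (already positive), including one impossible entry
days_employed = pd.Series([1200, 3650, 350000, 800], name="DAYS_EMPLOYED")

# 350000 days / 365 is roughly 958.9 years
years_employed = days_employed / 365
print(years_employed.round(1))

# Assumed plausibility threshold: nobody is employed for more than 100 years
max_plausible_years = 100
implausible = years_employed > max_plausible_years

# Replace implausible entries with NaN so they do not distort summary statistics
cleaned_days = days_employed.mask(implausible)
print(int(implausible.sum()), "implausible value(s) flagged")
```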

In [None]:
appData_df.nunique().sort_values()

In [None]:
# Checking the data types alongside the unique-value counts above to identify categorical columns

appData_df.info()

### Converting Desired columns from Object to categorical column

In [None]:
appData_df.columns

In [152]:
#from the list, we have taken out the desired columns for conversion

categorical_columns = ['NAME_CONTRACT_TYPE','CODE_GENDER','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE',
                       'NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','OCCUPATION_TYPE','WEEKDAY_APPR_PROCESS_START',
                       'ORGANIZATION_TYPE','FLAG_OWN_REALTY','LIVE_CITY_NOT_WORK_CITY',
                       'REG_CITY_NOT_LIVE_CITY','REG_CITY_NOT_WORK_CITY','REG_REGION_NOT_WORK_REGION',
                       'LIVE_REGION_NOT_WORK_REGION','WEEKDAY_APPR_PROCESS_START',
                       'CNT_CHILDREN']

In [153]:
for col in categorical_columns:
    appData_df[col] = pd.Categorical(appData_df[col])

In [155]:
len(categorical_columns) # The list has 18 entries; WEEKDAY_APPR_PROCESS_START is listed twice, so 17 distinct columns are converted

18
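
One practical effect of converting object columns to the categorical dtype is lower memory use when a column has few distinct values. The sketch below compares the two on a toy series. The labels are illustrative and the exact byte counts depend on the pandas version, so none are quoted.

```python
import pandas as pd

# Illustrative labels repeated many times, as in a low-cardinality column
values = ["Laborers", "Managers", "Unknown"] * 10000

as_object = pd.Series(values, dtype=object)
as_category = pd.Series(pd.Categorical(values))

# deep=True counts the memory of the Python strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The distinct labels are stored once; each row holds only a small integer code
print(as_category.cat.categories.tolist())
```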

In [None]:
appData_df.info()

**Insight**

* **After imputing and binning, we have 49 columns, and we will do the data analysis on these columns.**

# Work on dataset - previous_application.csv

In [158]:
# importing previous_application.csv

prev_appl = pd.read_csv("previous_application.csv")

In [None]:
prev_appl.head()

In [161]:
#Checking rows and columns of the raw data
prev_appl.shape

(1248492, 37)

In [None]:
#Checking information of all the columns like data types
prev_appl.info()

* **There are 37 columns with various data types (object, int, float) and 1248492 rows.**

In [None]:
# Checking the numeric variables of the dataframes
prev_appl.describe()

**Insight**

* **There are 37 columns and 1248492 rows.**
* **Some columns that count days contain negative as well as positive values; we will fix them.**

In [167]:
#checking how many null values are present in each of the columns in percentage
missing_values(prev_appl)

RATE_INTEREST_PRIVILEGED       99.644852
RATE_INTEREST_PRIMARY          99.644852
AMT_DOWN_PAYMENT               53.341471
RATE_DOWN_PAYMENT              53.341471
NAME_TYPE_SUITE                49.110207
NFLAG_INSURED_ON_APPROVAL      40.134819
DAYS_TERMINATION               40.134819
DAYS_LAST_DUE                  40.134819
DAYS_LAST_DUE_1ST_VERSION      40.134819
DAYS_FIRST_DUE                 40.134819
DAYS_FIRST_DRAWING             40.134819
AMT_GOODS_PRICE                22.971553
AMT_ANNUITY                    22.207351
CNT_PAYMENT                    22.207191
PRODUCT_COMBINATION             0.021226
NAME_YIELD_GROUP                0.000080
NAME_SELLER_INDUSTRY            0.000080
SELLERPLACE_AREA                0.000080
CHANNEL_TYPE                    0.000080
AMT_CREDIT                      0.000080
NAME_PORTFOLIO                  0.000000
NAME_PRODUCT_TYPE               0.000000
SK_ID_PREV                      0.000000
NAME_GOODS_CATEGORY             0.000000
NAME_CLIENT_TYPE

In [169]:
#creating a variable prev_null_col_50 for storing null columns having missing values more than 50%

prev_missing = missing_values(prev_appl)
prev_null_col_50 = prev_missing[prev_missing > 50]

In [170]:
prev_null_col_50

RATE_INTEREST_PRIVILEGED    99.644852
RATE_INTEREST_PRIMARY       99.644852
AMT_DOWN_PAYMENT            53.341471
RATE_DOWN_PAYMENT           53.341471
dtype: float64

* **There are only 4 columns with more than 50% missing values.**

In [171]:
#dropping null columns having missing values more than 50%

prev_appl.drop(columns = prev_null_col_50.index, inplace = True)

In [172]:
#creating a variable prev_null_col_15 for storing null columns having missing values more than 15%

prev_missing = missing_values(prev_appl)
prev_null_col_15 = prev_missing[prev_missing > 15]

In [173]:
prev_null_col_15

NAME_TYPE_SUITE              49.110207
DAYS_FIRST_DRAWING           40.134819
DAYS_TERMINATION             40.134819
DAYS_LAST_DUE                40.134819
DAYS_LAST_DUE_1ST_VERSION    40.134819
DAYS_FIRST_DUE               40.134819
NFLAG_INSURED_ON_APPROVAL    40.134819
AMT_GOODS_PRICE              22.971553
AMT_ANNUITY                  22.207351
CNT_PAYMENT                  22.207191
dtype: float64

In [174]:
prev_appl[prev_null_col_15.index]

In [None]:
prev_appl.columns

In [177]:
# Listing down unused columns
unused_prev_appl = ['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START','FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY']

prev_appl.drop(unused_prev_appl,axis =1, inplace = True)

prev_appl.shape

(1248492, 29)

In [178]:
# Imputing missing values with "Unknown" as this is a categorical column
prev_appl["NAME_TYPE_SUITE"] = prev_appl["NAME_TYPE_SUITE"].fillna("Unknown")

missing_values(prev_appl)

NFLAG_INSURED_ON_APPROVAL    40.134819
DAYS_TERMINATION             40.134819
DAYS_LAST_DUE                40.134819
DAYS_LAST_DUE_1ST_VERSION    40.134819
DAYS_FIRST_DUE               40.134819
DAYS_FIRST_DRAWING           40.134819
AMT_GOODS_PRICE              22.971553
AMT_ANNUITY                  22.207351
CNT_PAYMENT                  22.207191
PRODUCT_COMBINATION           0.021226
AMT_CREDIT                    0.000080
NAME_YIELD_GROUP              0.000080
NAME_SELLER_INDUSTRY          0.000080
SELLERPLACE_AREA              0.000080
CHANNEL_TYPE                  0.000080
NAME_PRODUCT_TYPE             0.000000
SK_ID_PREV                    0.000000
NAME_PORTFOLIO                0.000000
SK_ID_CURR                    0.000000
NAME_CLIENT_TYPE              0.000000
NAME_TYPE_SUITE               0.000000
CODE_REJECT_REASON            0.000000
NAME_PAYMENT_TYPE             0.000000
DAYS_DECISION                 0.000000
NAME_CONTRACT_STATUS          0.000000
NAME_CASH_LOAN_PURPOSE   

* **There are missing values in the columns 'DAYS_FIRST_DUE', 'DAYS_TERMINATION', 'DAYS_FIRST_DRAWING', 'DAYS_LAST_DUE_1ST_VERSION' and 'DAYS_LAST_DUE'. These columns count days, so we will keep the null values as they are.**

In [None]:
# Analysing numerical columns using describe

prev_appl[prev_null_col_15.index].describe()

In [None]:
# Creating a variable "prev_days_col" listing the days columns to convert from negative to positive

prev_days_col = ['DAYS_DECISION','DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION']

prev_appl[prev_days_col].describe()

In [None]:
# Converting Negative days to positive days

prev_appl[prev_days_col] = abs(prev_appl[prev_days_col])

prev_appl[prev_null_col_15.index].describe()

In [183]:
# Grouping days into years, e.g. 369 days falls in the group "2" (within 2 years)

bins = [0,1*365,2*365,3*365,4*365,5*365,6*365,7*365,10*365]
slots = ["1","2","3","4","5","6","7","7 above"]
prev_appl['YEARLY_DECISION'] = pd.cut(prev_appl['DAYS_DECISION'],bins,labels=slots)

In [None]:
prev_appl['YEARLY_DECISION'].value_counts(normalize=True)*100

**Insight:**

* **Almost 35% of loan applicants applied for a new loan within 1 year of the previous loan decision.**

In [None]:
prev_appl.nunique()

In [None]:
missing_values(prev_appl)

#### Dealing with continuous variables "AMT_ANNUITY", "AMT_GOODS_PRICE"
#### To impute null values in continuous variables, we plot the distribution of each column and use
* **Median if the distribution is skewed**
* **Mode if imputing with it preserves the distribution pattern.**
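
As a numerical complement to reading the plots, skewness can be estimated with `Series.skew()`. The sketch below uses hypothetical values, and the cut-off of 1 is a common rule of thumb assumed here, not a rule from this notebook.

```python
import pandas as pd

# Hypothetical right-skewed amounts: most values are small, one is very large
amounts = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], dtype=float)

skewness = amounts.skew()
mean_value = amounts.mean()
median_value = amounts.median()
print(skewness, mean_value, median_value)

# Assumed rule of thumb: treat |skew| > 1 as strongly skewed and prefer the median
fill_value = median_value if abs(skewness) > 1 else mean_value
print("fill value:", fill_value)
```

A positive skew with the mean well above the median matches the "single peak on the left" shape, which is the case where the median is preferred.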

In [None]:
#plotting a kdeplot to understand distribution of "AMT_ANNUITY"

plt.figure(figsize=(12,6))
sns.kdeplot(prev_appl['AMT_ANNUITY'])
plt.show()

**Insight:**
* **There is a single peak on the left side of the distribution with a long right tail, which indicates the presence of outliers. Imputing with the mean would therefore not be the right approach, so we impute with the median.**

In [None]:
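# Imputing missing values with the median (assigning back, since inplace fillna on a selected column may not update the dataframe in recent pandas)

prev_appl['AMT_ANNUITY'] = prev_appl['AMT_ANNUITY'].fillna(prev_appl['AMT_ANNUITY'].median())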

In [None]:
# Plotting kde plot for "AMT_GOODS_PRICE" to understand the distribution

plt.figure(figsize=(12,6))
sns.kdeplot(prev_appl['AMT_GOODS_PRICE'])
plt.show()

* **There are several peaks along the distribution. Let's impute using the mode, mean and median and see if the distribution is still about the same.**

In [None]:
# Creating new dataframe for "AMT_GOODS_PRICE" with columns imputed with mode, median and mean

statsDF = pd.DataFrame()
statsDF['AMT_GOODS_PRICE_mode'] = prev_appl['AMT_GOODS_PRICE'].fillna(prev_appl['AMT_GOODS_PRICE'].mode()[0])
statsDF['AMT_GOODS_PRICE_median'] = prev_appl['AMT_GOODS_PRICE'].fillna(prev_appl['AMT_GOODS_PRICE'].median())
statsDF['AMT_GOODS_PRICE_mean'] = prev_appl['AMT_GOODS_PRICE'].fillna(prev_appl['AMT_GOODS_PRICE'].mean())

cols = ['AMT_GOODS_PRICE_mode', 'AMT_GOODS_PRICE_median','AMT_GOODS_PRICE_mean']

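# Top-left panel: original non-null data; remaining panels: one per imputation strategy.
# sns.distplot is deprecated, so histplot with a KDE overlay is used instead.
plt.figure(figsize=(18,10))
plt.suptitle('Distribution of original data vs imputed data')
plt.subplot(221)
sns.histplot(prev_appl['AMT_GOODS_PRICE'].dropna(), kde=True, stat='density')
for i, col in enumerate(cols):
    plt.subplot(2, 2, i + 2)
    sns.histplot(statsDF[col], kde=True, stat='density')
plt.show()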


* **In this case the original distribution is closest to the distribution of the data imputed with the mode, so we will impute the missing values with the mode.**

In [None]:
# Imputing null values with the mode (assigning back instead of inplace fillna on a selected column)

prev_appl['AMT_GOODS_PRICE'] = prev_appl['AMT_GOODS_PRICE'].fillna(prev_appl['AMT_GOODS_PRICE'].mode()[0])

#### Imputing CNT_PAYMENT with 0, as the NAME_CONTRACT_STATUS for these rows indicates that most of these loans were not started:

In [None]:
# Counting the NAME_CONTRACT_STATUS categories for rows where CNT_PAYMENT is null

prev_appl.loc[prev_appl['CNT_PAYMENT'].isnull(),'NAME_CONTRACT_STATUS'].value_counts()

In [None]:
# Imputing null values with 0 (assigning back instead of inplace fillna on a selected column)

prev_appl['CNT_PAYMENT'] = prev_appl['CNT_PAYMENT'].fillna(0)

In [None]:
prev_appl.columns

In [175]:
# Converting the required categorical columns from object to categorical dtype

prev_catgorical_col = ['NAME_CASH_LOAN_PURPOSE','NAME_CONTRACT_STATUS','NAME_PAYMENT_TYPE',
                    'CODE_REJECT_REASON','NAME_CLIENT_TYPE','NAME_GOODS_CATEGORY','NAME_PORTFOLIO',
                   'NAME_PRODUCT_TYPE','CHANNEL_TYPE','NAME_SELLER_INDUSTRY','NAME_YIELD_GROUP','PRODUCT_COMBINATION',
                    'NAME_CONTRACT_TYPE']

for col in prev_catgorical_col:
    prev_appl[col] =pd.Categorical(prev_appl[col])