## About Company

Happy Customer Bank is a mid-sized private bank that deals in all kinds of loans. It has a presence across all major cities in India and focuses on lending products. It also has a digital arm that sources customers from the internet.

## Problem

Digital arms of banks today face challenges with lead conversion. They source leads through channels such as search, display, email campaigns, and affiliate partners. Happy Customer Bank faces the same challenge of a low conversion ratio. The bank has asked us to identify the customer segments with a higher conversion ratio for a specific loan product, so that it can target these customers specifically. For this purpose it has provided a partial data set covering salaried customers only, from the last 3 months. The bank also captures basic details about customers, such as gender, DOB, existing EMI, employer name, loan amount required, monthly income, city, interaction data, and many others. Let's look at the process at Happy Customer Bank.

In [None]:
from IPython.display import Image
Image(filename = "Images/Process_at_Happy_Customer_Bank.png",width=800, height=400)

In the process above, customer applications can drop out at two main stages: at login and at approval/rejection by the bank. We need to identify the segment of customers with a higher disbursal rate in the next 30 days.

## Data Set
We have a train and a test data set. The train set contains both the input and output variable(s); the task is to predict the probability of disbursal for the test set.

Source of data:

https://discuss.analyticsvidhya.com/t/hackathon-3-x-predict-customer-worth-for-happy-customer-bank/3802

## Input variables:


    ID - Unique ID (can not be used for predictions)
    Gender- Sex
    City - Current City
    Monthly_Income - Monthly Income in rupees
    DOB - Date of Birth
    Lead_Creation_Date - Lead Created on date
    Loan_Amount_Applied - Loan Amount Requested (INR)
    Loan_Tenure_Applied - Loan Tenure Requested (in years)
    Existing_EMI - EMI of Existing Loans (INR)
    Employer_Name - Employer Name
    Salary_Account- Salary account with Bank
    Mobile_Verified - Mobile Verified (Y/N)
    Var5- Continuous classified variable
    Var1- Categorical variable with multiple levels
    Loan_Amount_Submitted- Loan Amount Revised and Selected after seeing Eligibility
    Loan_Tenure_Submitted- Loan Tenure Revised and Selected after seeing Eligibility (Years)
    Interest_Rate- Interest Rate of Submitted Loan Amount
    Processing_Fee- Processing Fee of Submitted Loan Amount (INR)
    EMI_Loan_Submitted- EMI of Submitted Loan Amount (INR)
    Filled_Form- Filled Application form post quote
    Device_Type- Device from which application was made (Browser/ Mobile)
    Var2- Categorical Variable with multiple Levels
    Source- Categorical Variable with multiple Levels
    Var4- Categorical Variable with multiple Levels
    
## Outcomes:

    LoggedIn- Application Logged (Variable for understanding the problem – cannot be used in prediction)
    Disbursed- Loan Disbursed (Target Variable)

# -------------------------------------------------------------------------------------------------------------

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Read the data. The file is encoded as ISO-8859-1 ("Latin alphabet No. 1"), hence encoding='latin_1'.
data = pd.read_csv('Data/HappyCustomerBank/Train_nyOWmfK.csv',encoding='latin_1')
print(data.shape)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.City.value_counts()

# Data preparation

### Cities

Information about cities is important. We assume that the number of applications from a given city correlates with the actual city size. Instead of using an additional file with the population of individual cities in India, we can use the number of applications as an approximate measure of city size.
Missing values can be filled with the category "NotGiven".

In [None]:
data['City'].isna().sum()

In [None]:
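# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas versions.
data['City'] = data['City'].fillna('NotGiven')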

In [None]:
cities_dict = {}
cities_bins = pd.cut(data.City.value_counts(),
                     bins=5,
                     labels=['V','IV','III','II','I'])
cities_bins.head(10)

In [None]:
# Map each city name (index) to its size tier (value).
cities_dict = dict(zip(cities_bins.index, cities_bins.values))

#cities_dict

In [None]:
data['City_grouped'] = data.City.map(cities_dict)

In [None]:
data.City_grouped.head()
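
The grouping step above can be illustrated on a small toy example. This is only a sketch: the city names "A" to "E" and their counts are invented for illustration and do not come from the data set. It shows that pd.cut with bins=5 splits the range of application counts into five equal-width intervals, so the tier of each city depends on how its count compares to the largest one.

```python
import pandas as pd

# Hypothetical toy data: five invented cities with very different application counts.
toy = pd.Series(["A"] * 100 + ["B"] * 55 + ["C"] * 30 + ["D"] * 4 + ["E"] * 1)

# Number of applications per city (A: 100, B: 55, C: 30, D: 4, E: 1).
counts = toy.value_counts()

# Five equal-width bins over the range of counts; the lowest bin gets label 'V'.
tiers = pd.cut(counts, bins=5, labels=["V", "IV", "III", "II", "I"])

# City name -> tier, then mapped back onto every row.
tier_by_city = tiers.to_dict()
grouped = toy.map(tier_by_city)
print(tier_by_city)
```

Because the bins have equal width, a few very large cities stretch the range and many small cities can land in the same lowest tier. If that turns out to be a problem, quantile-based binning (pd.qcut) would be one alternative to consider.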

In [None]:
cities_chosen = data.City.value_counts().index[:3]
data["City_name"] = data.City.where(data['City'].isin(cities_chosen), other = 'other' )

In [None]:
data["City_name"].value_counts()

In [None]:
data.drop(['City'], axis=1, inplace=True)
data.head()

### Date of birth

Assuming that the dataset was compiled in 2015, we can estimate the age of each client as the year the lead was created minus the year of birth. The rest of the information stored in these date variables can be dropped.

In [None]:
data['DOB'] = pd.to_datetime(data['DOB'], format='%d-%b-%y')
data['Lead_Creation_Date'] = pd.to_datetime(data['Lead_Creation_Date'], format='%d-%b-%y')
data['Age'] = data.Lead_Creation_Date.dt.year - data.DOB.dt.year

In [None]:
data.Age.hist()
plt.show()

In [None]:
data[data['Age'] < 0].head(3)

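The DOB column stores only the last two digits of the birth year. When parsed with the %y format, two-digit years 00-68 are interpreted as 2000-2068 and 69-99 as 1969-1999, so a birth year such as 65 is read as 2065 instead of 1965. That is why some ages come out negative; adding 100 years corrects them.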

In [None]:
data['Age'] = data.Age.where(data.Age > 0, data.Age+100)
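
The effect of two-digit years can be reproduced with the standard library alone. The sketch below uses made-up dates (they are not taken from the data set) and the same '%d-%b-%y' format as above, then applies the same rule as the correction cell: ages that are not positive get 100 added.

```python
from datetime import datetime

DATE_FORMAT = "%d-%b-%y"  # same format string as used for DOB above


def estimate_age(dob_text, lead_text):
    """Return (raw_age, corrected_age) from two dates with two-digit years."""
    dob = datetime.strptime(dob_text, DATE_FORMAT)
    lead = datetime.strptime(lead_text, DATE_FORMAT)
    raw_age = lead.year - dob.year
    # Mirror of the notebook's rule: keep positive ages, otherwise add 100.
    corrected_age = raw_age if raw_age > 0 else raw_age + 100
    return raw_age, corrected_age


# Hypothetical examples: '65' is parsed as 2065, '78' as 1978.
parsed_year = datetime.strptime("23-May-65", DATE_FORMAT).year
raw_65, fixed_65 = estimate_age("23-May-65", "15-May-15")
raw_78, fixed_78 = estimate_age("23-May-78", "15-May-15")
print(parsed_year, raw_65, fixed_65, raw_78, fixed_78)
```

Note that this rule also shifts an age of exactly 0 to 100, which matches the `data.Age > 0` condition used in the notebook.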

In [None]:
data = data.drop(['DOB','Lead_Creation_Date'], axis=1)

In [None]:
data.Age.hist()
plt.show()

### Missing data

In [None]:
data.info()

Four columns, e.g. Loan_Amount_Applied, have the same number of missing values. It is difficult to fill these gaps with a random, mean, or median value. Because only a small percentage of rows is affected, we can remove them from the data set.

In [None]:
data = data.dropna(subset=['Loan_Amount_Applied'])
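
Dropping rows based on a single column only clears the other columns if their missing values sit in the same rows. A minimal sketch of how this could be checked, using a hypothetical toy frame (the values are invented; only the column names are borrowed from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two columns are missing in exactly the same rows.
toy = pd.DataFrame({
    "Loan_Amount_Applied": [100000.0, np.nan, 250000.0, np.nan],
    "Loan_Tenure_Applied": [2.0, np.nan, 5.0, np.nan],
    "Monthly_Income": [20000, 35000, 18000, 50000],
})

cols = ["Loan_Amount_Applied", "Loan_Tenure_Applied"]
missing = toy[cols].isna()

# True only if every row is either missing in all of the columns or in none of them.
same_rows = missing.all(axis=1).equals(missing.any(axis=1))

# Dropping on one column then removes the gaps in the other column as well.
cleaned = toy.dropna(subset=["Loan_Amount_Applied"])
remaining_missing = int(cleaned[cols].isna().sum().sum())
print(same_rows, len(cleaned), remaining_missing)
```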

### Salary Account

11,693 bank name entries are missing. We fill them by sampling with replacement from the observed bank names, so that each bank is assigned roughly in proportion to its observed frequency.

In [None]:
data['Salary_Account'].isna().sum()

In [None]:
data['Salary_Account'].value_counts(dropna=False)

In [None]:
import random

mask = data['Salary_Account'].isnull()
samples = random.choices(data['Salary_Account'][~mask].values , k=mask.sum())
data.loc[mask, 'Salary_Account'] = samples

In [None]:
data['Salary_Account'].isna().sum()
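
The sampling-based imputation above can be sketched on a plain Python list. The toy column and its proportions are hypothetical, and the fixed seed is an added assumption that only serves to make the sketch repeatable (the notebook cell itself is unseeded).

```python
import random
from collections import Counter

# Hypothetical toy column; None marks a missing salary account.
accounts = ["HDFC Bank"] * 6 + ["ICICI Bank"] * 3 + [None] * 3

# Observed (non-missing) values define the sampling distribution.
observed = [a for a in accounts if a is not None]
n_missing = accounts.count(None)

# Seeded generator so the example gives the same result on every run (assumption).
rng = random.Random(42)
samples = iter(rng.choices(observed, k=n_missing))

# Replace each missing entry with the next sampled value.
filled = [a if a is not None else next(samples) for a in accounts]

print(Counter(filled))
```

Since values are drawn from the observed entries, frequent banks are picked more often, which keeps the overall distribution roughly unchanged. The imputed values are still random guesses, so a separate "Unknown" category would be another reasonable option.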