# Bank Fears Loanliness
This is my submission for the HackerEarth coding challenge. The task is to estimate the probability that a member will default on a loan.

You can learn more about the challenge and the coding task here - https://www.hackerearth.com/challenge/competitive/machine-learning-challenge-one/machine-learning/bank-fears-loanliness/
Note: If the description is not available, you can get the dataset (which contains the descriptions as well) here -
https://www.dropbox.com/s/hhk58gf4zl0paco/Bank%20Fears%20Loanliness.zip?dl=0

In [1]:
# importing basic libraries we need to do our analysis
import pandas as pd
import numpy as np
# sklearn library for machine learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

ImportError: DLL load failed: %1 is not a valid Win32 application.

In [2]:
# Let's start by reading in the data file and understanding what our data looks like.
train_df = pd.read_csv("train_indessa.csv")
test_df = pd.read_csv("test_indessa.csv")
train_df

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,loan_status
0,58189336,14350,14350,14350.00,36 months,,19.19,E,E3,clerk,...,0.0,74.0,INDIVIDUAL,,26th week,0.0,0.0,28699.0,30800.0,0
1,70011223,4800,4800,4800.00,36 months,BAT1586599,10.99,B,B4,Human Resources Specialist,...,0.0,,INDIVIDUAL,,9th week,0.0,0.0,9974.0,32900.0,0
2,70255675,10000,10000,10000.00,36 months,BAT1586599,7.26,A,A4,Driver,...,0.0,,INDIVIDUAL,,9th week,0.0,65.0,38295.0,34900.0,0
3,1893936,15000,15000,15000.00,36 months,BAT4808022,19.72,D,D5,Us office of Personnel Management,...,0.0,,INDIVIDUAL,,135th week,0.0,0.0,55564.0,24700.0,0
4,7652106,16000,16000,16000.00,36 months,BAT2833642,10.64,B,B2,LAUSD-HOLLYWOOD HIGH SCHOOL,...,0.0,,INDIVIDUAL,,96th week,0.0,0.0,47159.0,47033.0,0
5,10247268,15000,15000,14950.00,36 months,BAT2575549,8.90,A,A5,Design Consultant,...,0.0,,INDIVIDUAL,,113th week,0.0,0.0,350619.0,29500.0,0
6,8089625,5000,5000,4975.00,36 months,,7.90,A,A4,TOYOTA OF NORTH HOLLYWOOD,...,0.0,,INDIVIDUAL,,117th week,0.0,1023.0,13272.0,55500.0,1
7,23043116,6000,6000,6000.00,36 months,,9.17,B,B1,Banker,...,0.0,54.0,INDIVIDUAL,,78th week,0.0,0.0,272579.0,11800.0,0
8,45900933,6000,6000,6000.00,36 months,BAT4136152,13.99,C,C4,LVN,...,0.0,,INDIVIDUAL,,44th week,0.0,0.0,281521.0,62100.0,0
9,41272507,34550,34550,34550.00,60 months,BAT4694572,17.14,D,D4,Registered Nurse,...,0.0,,INDIVIDUAL,,52th week,0.0,0.0,76034.0,33200.0,0


In [3]:
#As we can see the dataset is huge and there are a lot of rows/columns in it. 
#We look at the shape of the data and later try to understand what these columns mean.
train_df.shape

(532428, 45)

### Cleaning the dataset
Our training dataset has 532428 rows and 45 columns. This means we have 532428 instances, or different records for the customers of our bank. The 45 columns are the fields recorded for each instance, including the `member_id` identifier and the `loan_status` target.

The main step of any data science or machine learning problem is setting up the data so that it can be used in our machine learning model to yield the highest accuracy. In other words, feature extraction, feature selection, and data tidying play a more important role than the machine learning model/algorithm we use.
Thus, we look at the unique values in each column to understand the sparsity of the data and to drop the columns that are unnecessary.
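One way to put numbers on the sparsity described above is to compute, for every column, the fraction of missing values and the fraction of zeros. The sketch below runs on a small made-up frame standing in for `train_df`; the 75% cut-off is an illustrative assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    "loan_amnt": [14350, 4800, 10000, 15000],
    "desc": [np.nan, np.nan, np.nan, "debt consolidation"],
    "tot_coll_amt": [0.0, 0.0, 65.0, 0.0],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isnull().mean().sort_values(ascending=False)

# Fraction of zeros per numeric column.
zero_fraction = (toy.select_dtypes("number") == 0).mean()

# Hypothetical cut-off: flag columns that are at least 75% missing.
mostly_missing = null_fraction[null_fraction >= 0.75].index.tolist()

print(null_fraction)
print(zero_fraction)
print(mostly_missing)
```

Running the same three expressions on `train_df` would give a ranked list of candidate columns to drop or impute.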

In [4]:
# basic exploration of a column of our dataset. Looking at three different things
def studyColumn(df, columnName):
    # different ways of looking at unique values in a column
    #df["member_id"].unique()
    print("Unique values in column are: ", df[columnName].unique())
    # counting the total number of null or nan values
    print("Total number of null values is: ", df[columnName].isnull().sum())
    # counting the relative frequency of each unique value in a column
    print("The frequency of each unique value is: ", df[columnName].value_counts(normalize = True))

In [5]:
studyColumn(train_df, "emp_length")

Unique values in column are:  ['9 years' '< 1 year' '2 years' '10+ years' '5 years' '8 years' '7 years'
 '4 years' 'n/a' '1 year' '3 years' '6 years']
Total number of null values is:  0
The frequency of each unique value is:  10+ years    0.328880
2 years      0.088793
< 1 year     0.079359
3 years      0.079213
1 year       0.064238
5 years      0.062718
4 years      0.059315
n/a          0.050506
7 years      0.050110
8 years      0.049665
6 years      0.048346
9 years      0.038856
Name: emp_length, dtype: float64


In [6]:
# Dropping columns

'''As we can see in our dataset we have a lot of empty columns and a lot of columns with 0 values or 
90% - 95% of 0s. There are various methods to fill the empty values, however for the first dry run and to just get a 
baseline for future reference we will simply ignore/drop these columns to create our first model'''

def droppingCols(df):
    # 'term' looks like "36 months". Note that rstrip() removes any trailing characters
    # from the given set rather than a literal suffix; that is safe here because
    # digits are not in the set.
    df['term'] = df['term'].map(lambda x: x.rstrip(' months'))
    # converting to a numeric dtype
    df['term'] = pd.to_numeric(df['term'])
    # multiplying by 4 to approximate the term in weeks
    df['term'] *= 4
    # removing "th week" from each value in 'last_week_pay'; already in weeks so no need to convert
    df['last_week_pay'] = df['last_week_pay'].map(lambda x: x.rstrip('th week'))

    # stripping the trailing "+ years" / " year" from emp_length;
    # this leaves "< 1" as-is and turns "n/a" into "n/"
    df['emp_length'] = df['emp_length'].map(lambda x: x.rstrip('+ years, year'))
    # mapping "< 1" to 0 and "n/" to np.nan for further processing
    # (assigning the result back instead of calling replace(inplace=True) on a column selection)
    df['emp_length'] = df['emp_length'].replace(to_replace = "< 1", value = 0)
    df['emp_length'] = df['emp_length'].replace(to_replace = "n/", value = np.nan)

    # from our exploration we have selected the following columns to remove from our dataset
    columnToRemove = ['batch_enrolled', 'grade', 'sub_grade', 'emp_title', 'verification_status', 'desc', 
                      'purpose', 'title', 'zip_code', 'addr_state', 'mths_since_last_delinq', 'mths_since_last_record', 
                      'total_rec_late_fee', 'mths_since_last_major_derog', 'verification_status_joint']
    # only do this once per run; running it again without reloading the dataset raises an error (columns already removed)
    df.drop(columnToRemove, axis = 1, inplace = True)
    return df
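The string clean-up in `droppingCols` relies on `str.rstrip`, which strips any trailing characters that belong to the given set, not a literal suffix. The sketch below shows what that does to the values seen in `term` and `emp_length`, and offers `parse_emp_length`, a hypothetical helper (not part of the notebook) that extracts the number explicitly.

```python
import re

# str.rstrip treats its argument as a set of characters to remove from the right.
print("36 months".rstrip(" months"))        # -> 36
print("< 1 year".rstrip("+ years, year"))   # -> < 1
print("n/a".rstrip("+ years, year"))        # -> n/  (the trailing "a" is in the set)


def parse_emp_length(text):
    """Hypothetical helper: '10+ years' -> 10, '< 1 year' -> 0, 'n/a' -> None."""
    if text.strip().startswith("<"):
        return 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


for raw in ["10+ years", "< 1 year", "1 year", "n/a"]:
    print(raw, "->", parse_emp_length(raw))
```

An explicit parser like this avoids depending on which letters happen to appear at the end of each value.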

In [7]:
# Creating a new column in our df
# money_bank_paid = funded_amnt - funded_amnt_inv (subtracting in this order to avoid negative values)
def addCols(df):
    df['money_bank_paid'] = df['funded_amnt'] - df['funded_amnt_inv']

    # Converting string based abstract classes to category type
    df['home_ownership'] = df['home_ownership'].astype('category')
    df['pymnt_plan'] = df['pymnt_plan'].astype('category')
    df['initial_list_status'] = df['initial_list_status'].astype('category')
    df['application_type'] = df['application_type'].astype('category')

    # all columns with dtype category, replaced by their integer codes
    categoryColumns = df.select_dtypes(['category']).columns
    df[categoryColumns] = df[categoryColumns].apply(lambda x: x.cat.codes)

    # way to convert a single column
    #df['home_ownership'] = df['home_ownership'].cat.rename_categories(range(len(df['home_ownership'].cat.categories)))
    return df
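Because `addCols` is called separately on the training and test frames, each call builds its own category list, so the same label can receive different integer codes if a level is missing from one of the files. Whether that happens with this dataset is not checked here; the sketch below uses made-up values to show the effect and one possible fix, a shared `CategoricalDtype`.

```python
import pandas as pd

# Made-up stand-ins for train_df['home_ownership'] and test_df['home_ownership'].
train_col = pd.Series(["RENT", "OWN", "MORTGAGE", "RENT"])
test_col = pd.Series(["RENT", "OWN"])

# Independent encoding: codes follow each series' own sorted category list.
train_independent = train_col.astype("category").cat.codes.tolist()
test_independent = test_col.astype("category").cat.codes.tolist()
print(train_independent)  # RENT is encoded as 2 here
print(test_independent)   # but RENT is encoded as 1 here

# Shared encoding: build one category list from both series and reuse it.
shared = pd.CategoricalDtype(categories=sorted(set(train_col) | set(test_col)))
train_shared = train_col.astype(shared).cat.codes.tolist()
test_shared = test_col.astype(shared).cat.codes.tolist()
print(train_shared)
print(test_shared)        # RENT is now 2 in both
```

With a shared dtype, a code means the same label in both frames, which is what the model implicitly assumes at prediction time.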

In [8]:
def replaceEmpty(df):
    # replacing empty or whitespace-only cells with nan, and then nan with 0
    # (the pattern is anchored so that strings which merely contain a space are left alone)
    df = df.replace(r'^\s*$', np.nan, regex=True)
    df = df.replace(np.nan, 0)
    df = df.replace('NA', 0, regex=True)
    return df
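When `replace` is called with `regex=True` and a non-string replacement such as `np.nan`, pandas swaps out the whole cell if the pattern matches anywhere inside it, so it matters whether the whitespace pattern is anchored. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up cells: a normal string with a space, a blank cell, an empty cell, a number-like string.
cells = pd.Series(["Fully Paid", "   ", "", "36"], dtype=object)

# Unanchored: any cell that contains whitespace becomes NaN, including "Fully Paid".
unanchored = cells.replace(r"\s+", np.nan, regex=True)

# Anchored: only cells that are empty or consist solely of whitespace become NaN.
anchored = cells.replace(r"^\s*$", np.nan, regex=True)

print(unanchored.isna().tolist())
print(anchored.isna().tolist())
```

The anchored form matches the stated goal of treating blank cells as missing without discarding multi-word values.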

In [9]:
# applying the changes to both dataframes
# first the training dataset, then the test dataset
train_df = droppingCols(train_df)
train_df = addCols(train_df)
train_df = replaceEmpty(train_df)

test_df = droppingCols(test_df)
test_df = addCols(test_df)
test_df = replaceEmpty(test_df)


### Building the Model
Now that our dataset is cleaned and the empty and null values are dealt with, we can build our first model on it and get a baseline for accuracy and predictions. We will use a random forest for this first model. A random forest is an ensemble of decision trees, and you can read more about it here - 
https://en.wikipedia.org/wiki/Random_forest

In [10]:
# removing extra indexes from our Y_test and Y_train
def dropIndex(df):
    df = df.reset_index(level=None, drop=False, name=None, inplace=False)
    #df.drop('index', axis = 1, inplace = True)
    #df.set_index('loan_status', drop=True, append=False, inplace=True, verify_integrity=False)
    return df

In [11]:
# our training dataset split in to x and y
Y_train = train_df['loan_status']
X_train = train_df.drop('loan_status', axis = 1)

In [12]:
# We are using the random forest for this problem
RFmodel = RandomForestRegressor(n_estimators=10)
# fitting the model on the training dataset
RFmodel.fit(X_train, Y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
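The notebook imports `train_test_split` but fits on the full training set, so the only feedback on model quality comes from the submission itself. A local hold-out check could look like the sketch below. It runs on synthetic data rather than `X_train`/`Y_train`, and the choice of ROC AUC as the score is an assumption about a suitable metric for ranked default probabilities, not something stated in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / Y_train: 4 features, binary target driven by the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance the same in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same model family and size as in the notebook; random_state added for repeatability.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_fit, y_fit)

# A regressor trained on 0/1 labels predicts averages of those labels, i.e. values in [0, 1].
val_pred = model.predict(X_val)

auc = roc_auc_score(y_val, val_pred)
print(f"hold-out ROC AUC: {auc:.3f}")
```

Swapping the synthetic arrays for `X_train` and `Y_train` would give a baseline score before anything is submitted.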

In [13]:
# Now that we have our model, we can run it on the test dataset to predict the results
Y_pred = RFmodel.predict(test_df)

In [14]:
# Creating a temp df to write the results in to a csv for submission
temp = pd.DataFrame(test_df['member_id'])
temp['loan_status'] = Y_pred
temp.reset_index(level = None, drop = True, inplace = True)
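# moving exact 0 and 1 predictions slightly away from the extremes
# (assigning the result back instead of calling replace(inplace=True) on a column selection)
temp['loan_status'] = temp['loan_status'].replace(0, 0.01)
temp['loan_status'] = temp['loan_status'].replace(1, 0.99)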
temp.to_csv("submission.csv", sep = ",", index = False)
temp

Unnamed: 0,member_id,loan_status
0,11937648,0.01
1,38983318,0.01
2,27999917,0.01
3,61514932,0.01
4,59622821,0.01
5,28822038,0.01
6,10718089,0.90
7,58114582,0.01
8,35023176,0.80
9,1268247,0.01
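The submission cell only adjusts predictions that are exactly 0 or 1. If the intent is to keep every prediction inside the range 0.01 to 0.99 (an assumption about the goal), `np.clip` does that in one step and also catches values such as 0.005. A minimal sketch on made-up predictions:

```python
import numpy as np

# Made-up predictions standing in for Y_pred.
raw = np.array([0.0, 0.1, 0.9, 1.0, 0.005])

# Values below 0.01 are raised to 0.01 and values above 0.99 are lowered to 0.99.
clipped = np.clip(raw, 0.01, 0.99)
print(clipped)
```

In the notebook this would be applied to `Y_pred` before it is written to the `loan_status` column.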
