# Capstone Project

### Objective
Now that you have been equipped with the skills to use different Machine Learning algorithms, you will have the opportunity, over the course of five weeks, to practice and apply them on a dataset. In this project, you will complete a notebook in which you build a classifier to predict whether a loan will be paid off or not.

You will load a historical dataset of previous loan applications, clean the data, and apply different classification algorithms to it. You are expected to use the following algorithms to build your models:

- k-Nearest Neighbour
- Decision Tree
- Support Vector Machine
- Logistic Regression

The results are reported as the accuracy of each classifier, using the following metrics where applicable:

- Jaccard index
- F1-score
- LogLoss
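
As a minimal sketch of how these three metrics can be computed with scikit-learn, the cell below uses small hand-made label lists. The values in `y_true`, `y_pred` and `y_prob` are hypothetical and are not taken from the loan dataset. Log loss needs predicted probabilities rather than hard labels, so it only applies to classifiers that can output probabilities.

```python
from sklearn.metrics import jaccard_score, f1_score, log_loss

# Hypothetical toy labels: 0 = PAIDOFF, 1 = COLLECTION
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Hypothetical predicted probabilities of class 1 for the same six cases
y_prob = [0.1, 0.6, 0.8, 0.9, 0.2, 0.4]

# Jaccard index for the positive class: TP / (TP + FP + FN)
jaccard = jaccard_score(y_true, y_pred)

# F1-score for the positive class: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

# Log loss penalizes confident but wrong probability estimates
ll = log_loss(y_true, y_prob)

print('Jaccard index:', jaccard)
print('F1-score:', f1)
print('LogLoss:', ll)
```

By default `jaccard_score` and `f1_score` score only the positive class (label 1); the `average` and `pos_label` parameters change that behaviour.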


### Data pre-processing

In [None]:
#Importing all necessary packages

import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

#Install Seaborn for data visualization
!conda install -c anaconda seaborn -y
import seaborn as sns

In [None]:
#Getting the dataset
!wget -O loan_train.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv

In [None]:
#Load the dataset into a pandas DataFrame
df = pd.read_csv('loan_train.csv')
df.head()
df.shape

In [None]:
#Convert any dates into proper dates
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

In [None]:
#Understand the counts of the loan statuses
df['loan_status'].value_counts()

In [None]:
#Visualize dataset with Seaborn (gender, principal, loan status)
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
#Visualize dataset with Seaborn (gender, age, loan status)
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [None]:
#Create a day-of-week column from the effective date (0 = Monday, 6 = Sunday)
df['dayofweek'] = df['effective_date'].dt.dayofweek

#Visualize dataset with Seaborn (gender, day of week, loan status)
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

#create a weekend binary variable if day of the week is > 4 (0- Mon, 1- Tues, 2- Wed, 3- Thu, 4-Fri, 5- Sat, 6-Sun)
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>4)  else 0)
df.head()
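
To make the weekend rule concrete, here is a small sketch on three hand-picked dates (hypothetical examples, not rows from the dataset). It uses a vectorized comparison instead of `apply`, which should give the same result as the lambda above under the same `> 4` threshold.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Saturday and a Monday
dates = pd.Series(pd.to_datetime(['2016-09-08', '2016-09-10', '2016-09-12']))

# dayofweek follows the pandas convention: Monday = 0 ... Sunday = 6
dow = dates.dt.dayofweek

# Vectorized equivalent of the lambda: 1 for Saturday/Sunday, 0 otherwise
weekend = (dow > 4).astype(int)

print(pd.DataFrame({'date': dates, 'dayofweek': dow, 'weekend': weekend}))
```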

### Converting categorical to binary variables 

In [None]:
#Convert the Gender categorical variable into 0 for male and 1 for female
df['Gender'] = df['Gender'].replace(to_replace=['male','female'], value=[0,1])

#Convert the loan_status categorical variable into 0 for PAIDOFF and 1 for COLLECTION
df['loan_status'] = df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1])

#Assigning the result back to the column ensures the dataframe is overwritten with the new binary encoding
df.head()

In [None]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)

In [None]:
#Assign all features we want to a new dataframe
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature.head()
Feature.shape

#Use one-hot encoding to convert the categorical education variable to binary variables and append them to the feature dataframe
EducationDummies = pd.get_dummies(df[['education']])
EducationDummies.shape

Feature = pd.concat([Feature,EducationDummies], axis=1)
#Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
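
The following sketch shows what `pd.get_dummies` produces on a tiny hand-made frame. The category values are hypothetical and only serve to illustrate the output. Because a one-column DataFrame is passed in, each new column name is prefixed with `education_`, which matters if any of those columns is dropped by name later on. The `dtype=int` argument is an optional choice made here so that the output is 0/1 rather than boolean.

```python
import pandas as pd

# Hypothetical toy data; the category names are illustrative only
toy = pd.DataFrame({'education': ['college', 'High School', 'college']})

# One binary column per category, named '<column>_<category>'
dummies = pd.get_dummies(toy[['education']], dtype=int)

print(dummies.columns.tolist())
print(dummies)
```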

### Converting the feature set into normalized values

In [None]:
#Putting everything in a new dataframe, because a lot of data manipulation will happen
X = Feature
y = df['loan_status'].values

#Ensuring they are still compatibly shaped
X.shape
y.shape

In [None]:
#Normalizing the features (standardizing each one to zero mean and unit variance) using scikit-learn
X = preprocessing.StandardScaler().fit(X).transform(X)
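
As a quick sanity check of what `StandardScaler` does, the sketch below scales a small made-up array (hypothetical values, not the loan data) and confirms that every column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy feature matrix: two columns on very different scales
toy = np.array([[1000.0, 30.0],
                [800.0, 15.0],
                [300.0, 7.0]])

# Same call pattern as above: fit the scaler, then transform the same data
scaled = preprocessing.StandardScaler().fit(toy).transform(toy)

# Each column should now have mean ~0 and standard deviation ~1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

print('Column means:', col_means)
print('Column standard deviations:', col_stds)
```

One design consideration: fitting the scaler on all rows before splitting lets the test rows influence the scaling statistics. A stricter variant would fit the scaler on the training split only and reuse it to transform the test split.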

# Classification

The actual classification algorithms start from here:

- k-Nearest Neighbour (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression


## K-Nearest Neighbour (KNN)

In [None]:
#import additional required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [None]:
#Splitting my dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #42 being the answer to life, the universe and everything
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)