<a href="https://colab.research.google.com/github/angelicatrento/machine_learning/blob/master/HMEQ_MachineLearning_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2019 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# HMEQ Machine Learning Classification Solution


## Overview

This Jupyter notebook uses the TensorFlow API and scikit-learn to solve the HMEQ classification problem.
The data has 13 columns: 12 input features and a binary target, where 1 means the loan defaulted and 0 means it did not.


## Setup

In [0]:
!pip install scikit-learn

!pip install tensorflow-estimator

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

In [0]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow.compat.v2.feature_column as fc

import tensorflow as tf

import math
import seaborn as sns

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from mpl_toolkits.mplot3d import Axes3D
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Load the dataset
The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%).
For each applicant, 12 input variables were recorded.

*   BAD : 1 = client defaulted on loan; 0 = loan repaid

*   LOAN : Amount of the loan request
*   MORTDUE : Amount due on existing mortgage
*   VALUE : Value of current property
*   REASON : Reason for the loan (DebtCon = debt consolidation; HomeImp = home improvement)
*   JOB : Six occupational categories
*   YOJ : Years at present job
*   DEROG : Number of major derogatory reports
*   DELINQ : Number of delinquent credit lines
*   CLAGE : Age of oldest trade line in months
*   NINQ : Number of recent credit inquiries
*   CLNO : Number of credit lines
*   DEBTINC : Debt-to-income ratio

In [0]:
#Load dataset 
file = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSoYNcoKkdU3fIwWUXQMOEB_hWg3HR1qwd-LuSqrGKvgdHYqjKVTKaMdD9-ODE35Hykg4LraaMHnAFl/pub?gid=1831163667&single=true&output=csv';
dfhmeq = pd.read_csv(file)


## Explore the data

The dataset contains the following features

In [0]:
dfhmeq.head()

In [0]:
dfhmeq.describe()

Dataset has 5960 examples

In [0]:

# Column names for the CSV. The file does have a header row; it is skipped
# below with skiprows=[0] and replaced by these names.
feature_names = ["BAD","LOAN","MORTDUE","VALUE" ,'REASON','JOB'
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC"]

numeric_feature_names = ["BAD","LOAN","MORTDUE","VALUE" 
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC"]

# Load the source data once from the comma-separated CSV file.
loan_source_data = pd.read_csv(file, header=None,sep=',',skiprows=[0],names=feature_names, encoding='latin-1')

# Independent copies for the different cleanup strategies used later.
loan_data_remove_missing = loan_source_data.copy()
loan_data = loan_source_data.copy()
loan_data_clean = loan_source_data.copy()


print("Data set loaded. Num examples: ", len(loan_data))


## Data cleanup

In [0]:
# numeric_only=True skips the text columns (REASON, JOB), which have no mean.
loan_data_clean.fillna(loan_data_clean.mean(numeric_only=True), inplace=True)

loan_data_clean.describe()

We use `fillna` to replace NaN values in the numeric columns with the mean of the column.
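
As a minimal sketch of what mean imputation does, the toy frame below (hypothetical values, not rows from the HMEQ data) fills numeric gaps with the column mean. It assumes pandas 1.5 or later for `numeric_only=True`. The categorical column is left untouched by the mean fill and needs its own strategy; filling it with the most frequent value is one option and is an assumption of this sketch.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the HMEQ data.
toy = pd.DataFrame({
    "YOJ": [2.0, None, 10.0],
    "DEBTINC": [30.0, 40.0, None],
    "JOB": ["Mgr", None, "Mgr"],
})

# Column means are computed over the numeric columns only.
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the value stored under its name.
filled = toy.fillna(numeric_means)

# JOB has no mean, so its gap survives; fill it with the most frequent value.
filled["JOB"] = filled["JOB"].fillna(filled["JOB"].mode()[0])

print(filled)
```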

In [0]:
CATEGORICAL_COLUMNS = ["REASON","JOB"]

# BAD is the label, so it is not listed among the input features.
NUMERIC_COLUMNS = ["LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC"]

# Reload the raw data and drop incomplete rows; this replaces the mean-filled frame above.
loan_data_clean = pd.read_csv(file, header=None,sep=',',skiprows=[0],usecols= ["REASON","JOB","LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC","BAD"],names=feature_names, encoding='latin-1')

loan_data_clean.dropna(axis=0, how='any', inplace=True)

# Shuffle the rows.
loan_data_clean = loan_data_clean.reindex(np.random.permutation(loan_data_clean.index))


In [0]:
LOAN = tf.feature_column.numeric_column("LOAN")
MORTDUE = tf.feature_column.numeric_column("MORTDUE")
VALUE = tf.feature_column.numeric_column("VALUE")
REASON = tf.feature_column.categorical_column_with_vocabulary_list("REASON", ['DebtCon', 'HomeImp'])
JOB = tf.feature_column.categorical_column_with_vocabulary_list("JOB", ['Office', 'Other', 'ProfExe', 'Mgr', 'Sales', 'Self'])
YOJ = tf.feature_column.numeric_column("YOJ")
DEROG = tf.feature_column.numeric_column("DEROG")
DELINQ = tf.feature_column.numeric_column("DELINQ")
CLAGE = tf.feature_column.numeric_column("CLAGE")
NINQ = tf.feature_column.numeric_column("NINQ")
CLNO = tf.feature_column.numeric_column("CLNO")
DEBTINC = tf.feature_column.numeric_column("DEBTINC")
feature_columns = [LOAN, MORTDUE, VALUE, REASON, JOB, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC]

In [0]:
data = pd.read_csv(file)
# numeric_only=True skips the text columns (REASON, JOB), which have no mean.
data.fillna(data.mean(numeric_only=True), inplace=True)

DEROG_x_DELINQ = tf.feature_column.crossed_column(["DEROG", "DELINQ"], hash_bucket_size=100)

feature_columns = [LOAN, MORTDUE, VALUE, REASON, JOB, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC]

feature_columns_derived = [LOAN, MORTDUE, VALUE, REASON, JOB, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC, DEROG_x_DELINQ]

## Features Histogram

In [0]:
dfhmeq.BAD.hist(bins=20)

In [0]:
dfhmeq.DEROG.hist(bins=20)

In [0]:
for feature_name in numeric_feature_names:
  data.hist(column=feature_name,bins = 40,figsize=(10,4))

## Data analysis 

The bar plots below show how the categorical features available to the training algorithm are distributed.

The heatmap shows how the numeric features correlate with each other and with the target value BAD.

In [0]:
dfhmeq.JOB.value_counts().plot(kind='barh')

In [0]:
dfhmeq.REASON.value_counts().plot(kind='barh')

In [0]:
# Correlations are only defined for the numeric columns.
corr = loan_data_clean.corr(numeric_only=True)

#Plot figsize
fig, ax = plt.subplots(figsize=(10,8))

#Generate Color Map
colormap = sns.diverging_palette(220, 10, as_cmap=True)

#Generate Heat Map, allow annotations and place floats in map
sns.heatmap(corr, cmap=colormap, annot=True, fmt=".2f")

#Apply xticks at the cell centres (heatmap cells are centred on i + 0.5)
plt.xticks(np.arange(len(corr.columns)) + 0.5, corr.columns)

#Apply yticks at the cell centres
plt.yticks(np.arange(len(corr.columns)) + 0.5, corr.columns)

#show plot
plt.show()

In [0]:
for feature_name_1 in numeric_feature_names:
    for feature_name_2 in numeric_feature_names:
         if feature_name_1 != feature_name_2: 
             plt.xlabel(feature_name_1)
             plt.ylabel(feature_name_2)
             plt.scatter(loan_data_clean[feature_name_1],loan_data_clean[feature_name_2])
             plt.show()

## Train & Test Data

Splitting the data lets us assess the quality of a predictive model on examples it has not seen during training.

During training, we fit a model to the training data as closely as possible so that it can make accurate predictions.

Typically 70 to 80% of the data is used for training and the remaining 20 to 30% for testing or validation.
One way of validating the model is to split the data into three sets: train, validation and test. The training set is used to fit candidate classifiers, the validation set to compare them and tune their parameters, and the test set to estimate how the final model would perform in practice.

For the purpose of this project, we will only randomly split our data into train and test sets.

We will use the `train_test_split` function from the scikit-learn library.
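
The three-way split described above can be sketched without any library. The helper `three_way_split` and its default fractions are illustrative assumptions, not part of this notebook; with scikit-learn the same effect is usually obtained by calling `train_test_split` twice.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 100 hypothetical row ids stand in for the loan records.
train_rows, val_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```

Every row lands in exactly one of the three sets, so the test set is never seen while fitting or tuning.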


### Linear Regression
We will first try a `LinearRegression` model and plot the results.

In [0]:
loan_data_clean = loan_data_clean.reindex(np.random.permutation(loan_data_clean.index))

loan_data_remove_missing.dropna(axis=0, how='any', inplace=True)
loan_data_remove_missing = loan_data_remove_missing.reindex(np.random.permutation(loan_data_remove_missing.index))


X_source = loan_data_remove_missing[['DEROG','DELINQ','YOJ']].values
y_source = loan_data_remove_missing['BAD'].values

X_source_training_data, X_source_test_data, y_source_training_data, y_source_test_data = train_test_split(X_source, y_source, test_size=0.3, random_state=0)

regressor_source = LinearRegression()  
regressor_source.fit(X_source_training_data, y_source_training_data)


y_source_pred = regressor_source.predict(X_source_test_data)

df_source = pd.DataFrame({'Actual': y_source_test_data, 'Predicted': y_source_pred})
df1_source = df_source.head(25)

df1_source

As the table shows, `LinearRegression` does not give good results on our data: it predicts continuous values rather than the 0/1 class labels of the target.
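
One reason the regression output is hard to compare with the 0/1 labels is that it is continuous. A common workaround, shown here as a sketch with made-up predictions and an assumed cut-off of 0.5, is to threshold the output before scoring it as a classifier. The helper names are hypothetical.

```python
def to_labels(scores, threshold=0.5):
    """Map continuous regression outputs to 0/1 class labels."""
    return [1 if score >= threshold else 0 for score in scores]

def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Hypothetical values, for illustration only.
actual = [0, 0, 1, 0, 1]
scores = [0.08, 0.31, 0.62, 0.55, 0.12]

labels = to_labels(scores)
print(labels, accuracy(actual, labels))
```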

In [0]:
df1_source.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

In [0]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_source_test_data, y_source_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_source_test_data, y_source_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_source_test_data, y_source_pred)))

### Decision Tree Classifier

In [0]:

X_train, X_test, y_train, y_test = train_test_split(data.DELINQ, data.BAD, test_size=0.30, random_state=42)



from sklearn import tree

c = tree.DecisionTreeClassifier()
c.fit(X_train.values.reshape(-1, 1), y_train)

y_test_size = y_test.size
y_train_size = y_train.size

accu_train = np.sum(c.predict(X_train.values.reshape(-1, 1)) == y_train)/y_train_size
accu_test = np.sum(c.predict(X_test.values.reshape(-1, 1)) == y_test)/y_test_size

print("Accuracy on Train: ", accu_train)
print("Accuracy on Test: ", accu_test)
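
Accuracy is easier to judge against a baseline. The dataset description reports 1,189 defaults among 5,960 loans, so a model that always predicts "repaid" is already right about 80% of the time. The sketch below computes that majority-class baseline; the helper name is an assumption for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Label counts taken from the dataset description: 1,189 defaults out of 5,960.
labels = [1] * 1189 + [0] * (5960 - 1189)
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

A classifier is only useful here if its test accuracy clearly exceeds this figure.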

In [0]:
train = (loan_data_clean.head(2673))
test = (loan_data_clean.tail(691))


feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = train[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float64))

In [0]:
len(test)

### Models from sklearn 

We are going to test some different algorithms from the sk-learn library

In [0]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_source_training_data = sc.fit_transform(X_source_training_data)
X_source_test_data = sc.transform(X_source_test_data)

#### Random Forest Regressor

In [0]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=2000, random_state=0)
regressor.fit(X_source_training_data, y_source_training_data)
y_pred = regressor.predict(X_source_training_data)

In [0]:
y_source_pred = regressor.predict(X_source_test_data)

In [0]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_source_test_data, y_source_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_source_test_data, y_source_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_source_test_data, y_source_pred)))

In [0]:

df_source_new = pd.DataFrame({'Actual': y_source_test_data, 'Predicted': y_source_pred})
df1_source_new = df_source_new.head(25)

df1_source_new

In [0]:

df1_source_new.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

### Percentage of BAD per REASON

In [0]:
CATEGORICAL_COLUMNS = ["REASON","JOB"]

# BAD is the label, so it is not listed among the input features.
NUMERIC_COLUMNS = ["LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC"]

loan_data_clean = pd.read_csv(file, header=None,sep=',',skiprows=[0],usecols= ["REASON","JOB","LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC","BAD"],names=feature_names, encoding='latin-1')

loan_data_clean.dropna(axis=0, how='any', inplace=True)

loan_data_clean = loan_data_clean.reindex(np.random.permutation(loan_data_clean.index))


train = (loan_data_clean.head(2673))
test = (loan_data_clean.tail(691))


# Separate the label from the input features.
y_train = train.pop('BAD')
y_eval = test.pop('BAD')

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = train[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float64))

In [0]:
pd.concat([train, y_train], axis=1).groupby('REASON').BAD.mean().plot(kind='barh').set_xlabel('% BAD')



### Percentage of BAD per JOB

In [0]:
pd.concat([train, y_train], axis=1).groupby('JOB').BAD.mean().plot(kind='barh').set_xlabel('% BAD')

The `input_function` specifies how data is converted to a `tf.data.Dataset` that feeds the input pipeline in a streaming fashion. A `tf.data.Dataset` can take in multiple sources such as a dataframe, a CSV-formatted file, and more.

In [0]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  return input_function

train_input_fn = make_input_fn(train, y_train)
eval_input_fn = make_input_fn(test, y_eval, num_epochs=1, shuffle=False)

# Inspect the column dictionary that from_tensor_slices receives.
dict(train)

You can inspect the dataset:

In [0]:
ds = make_input_fn(train, y_train, batch_size=100)()

print(ds.take(1))
for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys()))
  print()
  print('A batch of LOAN:', feature_batch['LOAN'].numpy())
  print()
  print('A batch of Labels:', label_batch.numpy())
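
To make the order of operations in `make_input_fn` concrete, here is a plain-Python analogue of `batch(batch_size).repeat(num_epochs)`. It is only a sketch of the semantics (no shuffling, no tensors), and the generator name is hypothetical. Because batching happens before repeating, the short final batch appears once per epoch.

```python
def batch_then_repeat(rows, batch_size, num_epochs):
    """Yield batches of rows, replaying the whole sequence num_epochs times."""
    for _ in range(num_epochs):
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]

# Five hypothetical examples, batches of two, two epochs.
batches = list(batch_then_repeat([10, 20, 30, 40, 50], batch_size=2, num_epochs=2))
for batch in batches:
    print(batch)
```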

You can also inspect the result of a specific feature column using the `tf.keras.layers.DenseFeatures` layer:

In [0]:
feature_columns

In [0]:
print(feature_columns)
DELINQ_column = feature_columns[7]
tf.keras.layers.DenseFeatures([DELINQ_column])(feature_batch).numpy()

In [0]:
print(feature_columns)
DEROG_column = feature_columns[6]
tf.keras.layers.DenseFeatures([DEROG_column])(feature_batch).numpy()

`DenseFeatures` only accepts dense tensors. To inspect a categorical column, you need to transform it to an indicator column first:

In [0]:
JOB_column = feature_columns[1]

tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(JOB_column)])(feature_batch).numpy()
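
An indicator column is essentially a one-hot encoding over the vocabulary. The sketch below reproduces the idea in plain Python with the JOB vocabulary listed in the earlier feature-column cell. Mapping values outside the vocabulary to an all-zero vector is an assumption of this sketch, and the helper name is hypothetical.

```python
JOB_VOCABULARY = ['Office', 'Other', 'ProfExe', 'Mgr', 'Sales', 'Self']

def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of value and 0 elsewhere."""
    return [1 if entry == value else 0 for entry in vocabulary]

print(one_hot('Mgr', JOB_VOCABULARY))
print(one_hot('Unknown', JOB_VOCABULARY))
```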

In [0]:

train.REASON.unique()

After adding all the base features to the model, let's train the model. Training a model is just a single command using the `tf.estimator` API:

#### Linear Classifier 

From the TensorFlow library.

In [0]:
CATEGORICAL_COLUMNS = ["REASON","JOB"]

# BAD is the label, so it is not listed among the input features.
NUMERIC_COLUMNS = ["LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC"]

loan_data_clean = pd.read_csv(file, header=None,sep=',',skiprows=[0],usecols= ["REASON","JOB","LOAN","MORTDUE","VALUE"
                 ,"YOJ","DEROG","DELINQ"
,"CLAGE","NINQ","CLNO","DEBTINC","BAD"],names=feature_names, encoding='latin-1')

loan_data_clean.dropna(axis=0, how='any', inplace=True)

loan_data_clean = loan_data_clean.reindex(np.random.permutation(loan_data_clean.index))


train = (loan_data_clean.head(2673))
test = (loan_data_clean.tail(691))


y_train = train.pop('BAD')
y_eval = test.pop('BAD')


feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = train[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float64))
  
  

train_input_fn = make_input_fn(train, y_train)
eval_input_fn = make_input_fn(test, y_eval, num_epochs=1, shuffle=False)

ds = make_input_fn(train, y_train, batch_size=100)()

print(ds.take(1))
for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys()))
  print()
  print('A batch of LOAN:', feature_batch['LOAN'].numpy())
  print()
  print('A batch of Labels:', label_batch.numpy())

In [0]:
print(feature_columns)
DELINQ_column = feature_columns[7]
tf.keras.layers.DenseFeatures([DELINQ_column])(feature_batch).numpy()

In [0]:

print(feature_columns)
DEROG_column = feature_columns[6]
tf.keras.layers.DenseFeatures([DEROG_column])(feature_batch).numpy()

With the base feature columns redefined explicitly, let's train the linear classifier:

In [0]:

ds = make_input_fn(train, y_train, batch_size=10)()

train.YOJ.unique()

LOAN = tf.feature_column.numeric_column("LOAN")
MORTDUE = tf.feature_column.numeric_column("MORTDUE")
VALUE = tf.feature_column.numeric_column("VALUE")
REASON = tf.feature_column.categorical_column_with_vocabulary_list("REASON", ['DebtCon', 'HomeImp'])
JOB = tf.feature_column.categorical_column_with_vocabulary_list("JOB", ['Office', 'Other', 'ProfExe', 'Mgr', 'Sales', 'Self'])
YOJ = tf.feature_column.numeric_column("YOJ")
DEROG = tf.feature_column.numeric_column("DEROG")
DELINQ = tf.feature_column.numeric_column("DELINQ")
CLAGE = tf.feature_column.numeric_column("CLAGE")
NINQ = tf.feature_column.numeric_column("NINQ")
CLNO = tf.feature_column.numeric_column("CLNO")
DEBTINC = tf.feature_column.numeric_column("DEBTINC")
feature_columns = [LOAN, MORTDUE, VALUE, REASON, JOB, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC]


tf.keras.backend.set_floatx('float64')
# BAD takes only the values 0 and 1, so this is a two-class problem.
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns, model_dir=None, n_classes=2)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)

clear_output()
print(result)

In [0]:
DEROG_x_DELINQ = tf.feature_column.crossed_column(["DEROG", "DELINQ"], hash_bucket_size=100)


In [0]:

derived_feature_columns = [DEROG_x_DELINQ]
print(train_input_fn)
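
A crossed column treats each pair of values as one categorical feature and hashes it into a fixed number of buckets. The sketch below mimics that idea with `hashlib`. TensorFlow uses its own hash function, so the bucket ids here will not match what `crossed_column` produces; the helper name and key format are assumptions for illustration.

```python
import hashlib

def crossed_bucket(derog, delinq, hash_bucket_size=100):
    """Hash a (DEROG, DELINQ) pair into one of hash_bucket_size buckets."""
    key = f"{derog}_x_{delinq}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Hypothetical value pairs; equal pairs always land in the same bucket.
for pair in [(0, 0), (1, 0), (0, 1), (1, 0)]:
    print(pair, "->", crossed_bucket(*pair))
```

With only 100 buckets, different pairs can collide, which is the usual trade-off of hashed crosses.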


In [0]:
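# BAD is the target, so it must not be part of the input features;
# including it would leak the label into every model below.
feature_names = ["LOAN","MORTDUE","VALUE","YOJ","DEROG","DELINQ","CLAGE","NINQ","CLNO","DEBTINC"]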
X = loan_data_clean[feature_names]
y = loan_data_clean['BAD']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Logistic Regression 

In [0]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

#### Decision Tree

In [0]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

#### KNN Classifier

In [0]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

#### LDA classifier

In [0]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
     .format(lda.score(X_test, y_test)))

#### Gaussian NB Classifier

In [0]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

#### SVM Classifier

In [0]:
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
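
The numbers in the classification report can be derived from the confusion matrix. The sketch below does this for the positive class (BAD = 1), assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]` for binary 0/1 labels. The counts are made up for illustration and the helper name is hypothetical.

```python
def positive_class_metrics(matrix):
    """Compute precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical counts: 80 true negatives, 5 false positives,
# 10 false negatives and 5 true positives.
precision, recall, f1 = positive_class_metrics([[80, 5], [10, 5]])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these counts the overall accuracy is 85%, yet only a third of the defaults are caught, which is why recall on the BAD class deserves attention on an imbalanced dataset.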

In [0]:

X = loan_data_clean[feature_names]
y = loan_data_clean['BAD']


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [0]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
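
Fitting the scaler on the training set only, and reusing its statistics for the test set, keeps information about the test data out of training. Below is a plain-Python sketch of what `fit_transform` and `transform` do for a single feature, with hypothetical values and helper names.

```python
import statistics

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Scale values with previously fitted statistics."""
    return [(value - mean) / std for value in values]

train_values = [2.0, 4.0, 6.0, 8.0]  # hypothetical training feature
test_values = [5.0, 10.0]            # hypothetical test feature

# The statistics come from the training values only.
mean, std = fit_standardizer(train_values)
train_scaled = standardize(train_values, mean, std)
# The test values reuse the training statistics.
test_scaled = standardize(test_values, mean, std)
print(train_scaled, test_scaled)
```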

In [0]:
from sklearn.ensemble import RandomForestClassifier

# A classifier (not a regressor) is needed here: the report below expects 0/1 class labels.
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

#### Random Forest Classifier Report

In [0]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

In [0]:
df_source_new = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1_source_new = df_source_new.head(25)


df1_source_new.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
