# <font color='#2F4F4F'>1. Defining the Question</font>

## a) Specifying the Data Analysis Question

What is the question or problem you are trying to solve?

I am trying to predict whether a loan offered to a given small or medium enterprise will be paid off.


## b) Defining the Metric for Success

How will you know your project has succeeded?

The project succeeds when I have a model that predicts whether a loan will be paid off with high accuracy, i.e. at least 80%.



## c) Understanding the context 

What is the background information surrounding the research question?

Kopokopo is a company that offers loans to small and medium enterprises.
The dataset records whether each loan was approved and the amount approved.
It also records whether the loan was defaulted on, which is our target variable.
As a data scientist working for Kopokopo, my target is a model that predicts loan default with at least 80% accuracy.

## d) Recording the Experimental Design

What steps will you take to answer the research question?

1. Load datasets and libraries
2. Clean data
3. Perform univariate and bivariate analysis
4. Check that the assumptions of my model aren't violated
5. Train the model
6. Summarize findings
7. Provide recommendations
8. Challenge the solution


## e) Data Relevance

Is the data provided relevant to the research question?

The dataset provided is appropriate and relevant to the research question.

# <font color='#2F4F4F'>2. Data Cleaning & Preparation</font>

In [125]:
# load libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

In [126]:
# load dataset
df = pd.read_csv('kopokopo_dataset - datasets_602860_1082167_SBAcase.11.13.17.csv')
df.head(3)

Unnamed: 0,Selected,LoanNr_ChkDgt,Name,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,New,RealEstate,Portion,daysterm,xx,Default
0,0,1004285007,SIMPLEX OFFICE SOLUTIONS,92801,532420,15074,2001,36,1,1.0,0,0,1,0,Y,N,,15095.0,32812,0,P I F,0,30000,15000,0,0,0.5,1080,16175.0,0
1,1,1004535010,DREAM HOME REALTY,90505,531210,15130,2001,56,1,1.0,0,0,1,0,Y,N,,15978.0,30000,0,P I F,0,30000,15000,0,0,0.5,1680,17658.0,0
2,0,1005005006,"Winset, Inc. dba Bankers Hill",92103,531210,15188,2001,36,10,1.0,0,0,1,0,Y,N,,15218.0,30000,0,P I F,0,30000,15000,0,0,0.5,1080,16298.0,0


In [127]:
# load glossary
glossary = pd.read_csv('Koppokoppo - Glossary - t0001-10.1080_10691898.2018.1434342 (1).csv')
glossary

Unnamed: 0,Variable name,Data type,Description of variable
0,LoanNr_ChkDgt,Text,Identifier – Primary key
1,Name,Text,Borrower name
2,City,Text,Borrower city
3,State,Text,Borrower state
4,Zip,Text,Borrower zip code
5,Bank,Text,Bank name
6,BankState,Text,Bank state
7,ICS,Text,Industry classification system code
8,ApprovalDate,Date/Time,Date SBA commitment issued
9,ApprovalFY,Text,Fiscal year of commitment


From what I can tell, our target variable is 'MIS_Status', as it tells us whether a loan was paid in full ('P I F') or charged off ('CHGOFF').

Using the glossary, we will remove the following variables either because they are not needed or because their definitions are not included in the glossary:
- Selected
- LoanNr_ChkDgt
- Name
- New
- RealEstate
- Portion
- daysterm
- xx
- Default

In [128]:
# dropping unneeded columns
df = df.drop(columns = ['Selected', 'LoanNr_ChkDgt', 'Name', 'New', 'RealEstate', 
                        'Portion', 'daysterm', 'xx', 'Default'])
df.shape

(2102, 21)

In [129]:
# dropping duplicates, if any
df.drop_duplicates(inplace = True)
df.shape

(2102, 21)

In [130]:
# checking for missing values
df.isnull().sum()

Zip                     0
ICS                     0
ApprovalDate            0
ApprovalFY              0
Term                    0
NoEmp                   0
NewExist                1
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr               2
LowDoc                  3
ChgOffDate           1405
DisbursementDate        3
DisbursementGross       0
BalanceGross            0
MIS_Status              0
ChgOffPrinGr            0
GrAppv                  0
SBA_Appv                0
dtype: int64

'ChgOffDate' has missing values for more than half of the records so we will drop that variable. As for the rest of the variables with missing values, we will remove only the records that have null values.

In [131]:
df.drop(columns = ['ChgOffDate'], inplace = True)

df.dropna(inplace = True)
df.isna().sum()

Zip                  0
ICS                  0
ApprovalDate         0
ApprovalFY           0
Term                 0
NoEmp                0
NewExist             0
CreateJob            0
RetainedJob          0
FranchiseCode        0
UrbanRural           0
RevLineCr            0
LowDoc               0
DisbursementDate     0
DisbursementGross    0
BalanceGross         0
MIS_Status           0
ChgOffPrinGr         0
GrAppv               0
SBA_Appv             0
dtype: int64
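
The rule applied above (drop a column when more than half of its values are missing, otherwise drop only the affected rows) can be sketched on a small made-up frame. The toy values and the explicit 0.5 threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values): 'ChgOffDate' is mostly missing, 'LowDoc' has one gap
toy = pd.DataFrame({
    "Term": [36, 56, 36, 240],
    "LowDoc": ["N", None, "N", "Y"],
    "ChgOffDate": [np.nan, np.nan, 16175.0, np.nan],
})

# share of missing values per column
missing_share = toy.isnull().mean()

# drop columns that are more than half empty, then drop rows with any remaining gaps
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
cleaned = toy.drop(columns=sparse_cols).dropna()

print(sparse_cols)
print(cleaned.shape)
```

Computing the share of missing values first makes the threshold explicit, so the same rule can be reused if the data changes.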

In [132]:
# check datatypes
df.dtypes

Zip                    int64
ICS                    int64
ApprovalDate           int64
ApprovalFY             int64
Term                   int64
NoEmp                  int64
NewExist             float64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
UrbanRural             int64
RevLineCr             object
LowDoc                object
DisbursementDate     float64
DisbursementGross      int64
BalanceGross           int64
MIS_Status            object
ChgOffPrinGr           int64
GrAppv                 int64
SBA_Appv               int64
dtype: object

In [133]:
# changing the 'ApprovalDate' and 'DisbursementDate' variables to datetime
df['ApprovalDate'] = pd.to_datetime(df['ApprovalDate'], unit = 'd')
df['DisbursementDate'] = pd.to_datetime(df['DisbursementDate'], unit = 'd')
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2011-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2011-05-01,32812,0,P I F,0,30000,15000
1,90505,531210,2011-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2013-09-30,30000,0,P I F,0,30000,15000
2,92103,531210,2011-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2011-09-01,30000,0,P I F,0,30000,15000
3,92108,531312,2013-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2013-01-31,50000,0,P I F,0,50000,25000
4,91345,531390,2016-02-09,2006,240,65,1.0,3,65,1,1,0,N,2016-04-12,343000,0,P I F,0,343000,343000


In [134]:
# the converted dates land 10 years later than 'ApprovalFY', so we offset them by 10 years
df['ApprovalDate'] = df['ApprovalDate'] - pd.DateOffset(years = 10)
df['DisbursementDate'] = df['DisbursementDate'] - pd.DateOffset(years = 10)
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2001-05-01,32812,0,P I F,0,30000,15000
1,90505,531210,2001-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2003-09-30,30000,0,P I F,0,30000,15000
2,92103,531210,2001-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2001-09-01,30000,0,P I F,0,30000,15000
3,92108,531312,2003-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2003-01-31,50000,0,P I F,0,50000,25000
4,91345,531390,2006-02-09,2006,240,65,1.0,3,65,1,1,0,N,2006-04-12,343000,0,P I F,0,343000,343000
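
A possible alternative to the two-step conversion is to give `pd.to_datetime` a different starting date. The sketch below assumes the serial numbers count days from 1 January 1960; that epoch is an assumption, since the glossary excerpt does not state it. For the first row it yields 9 April 2001 instead of 10 April 2001, because the span 1960 to 1970 contains three leap days while 2001 to 2011 contains two.

```python
import pandas as pd

# serial day counts taken from the first three rows of 'ApprovalDate'
serials = pd.Series([15074, 15130, 15188])

# assumption: the counts start at 1960-01-01 rather than pandas' default 1970-01-01
approval = pd.to_datetime(serials, unit="D", origin="1960-01-01")

print(approval.dt.year.tolist())
```

The extracted years agree with 'ApprovalFY' for these three rows, which is all the later steps rely on.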


In [135]:
# extracting the year in 'DisbursementDate'
df['DisbursementDate'] = pd.DatetimeIndex(df['DisbursementDate']).year
df.head()

Unnamed: 0,Zip,ICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001-04-10,2001,36,1,1.0,0,0,1,0,Y,N,2001,32812,0,P I F,0,30000,15000
1,90505,531210,2001-06-05,2001,56,1,1.0,0,0,1,0,Y,N,2003,30000,0,P I F,0,30000,15000
2,92103,531210,2001-08-02,2001,36,10,1.0,0,0,1,0,Y,N,2001,30000,0,P I F,0,30000,15000
3,92108,531312,2003-01-14,2003,36,6,1.0,0,0,1,0,Y,N,2003,50000,0,P I F,0,50000,25000
4,91345,531390,2006-02-09,2006,240,65,1.0,3,65,1,1,0,N,2006,343000,0,P I F,0,343000,343000


In [136]:
# drop 'ApprovalDate' since it is no longer needed
df.drop(columns = ['ApprovalDate'], inplace = True)

In [137]:
# check unique values in each variable to ensure there is no inconsistency
my_cols = df.columns.to_list()

for col in my_cols:
    print("Variable:", col)
    print("Number of unique values:", df[col].nunique())
    print(df[col].unique())
    print()

Variable: Zip
Number of unique values: 810
[92801 90505 92103 92108 91345 95831 90255 90808 92704 94583 91360 94103
 95746 91354 95965 91505 94550 92101 95695 91356 92507 90662 91604 94080
 90305 95030 94901 94707 95037 91208 95630 93610 92656 92545 92708 92868
 92307 93274 93546 92590 92301 91941 91789 92831 92648 92083 91730 94117
 93611 92260 94931 92701 95945 92688 92887 95628 91502 93604 91221 94070
 92879 94526 92115 90210 93010 94546 92807 91301 92056 91311 92130 90036
 94588 91607 90504 91914 92173 90274 90006 92649 90016 90242 91030 91780
 94706 94115 94519 93906 91606 92612 91201 90010 95521 92084 92627 91342
 91364 92411 95337 90211 90266 95678 91601 94523 91709 92553 91776 95010
 92549 90201 94606 90631 92110 92544 92234 90004 90293 90303 92883 91106
 92501 91303 92404 93433 94104 95821 95621 93905 94510 93654 90039 92630
 95603 92009 91762 91362 92660 92026 94541 90706 95361 94089 92504 94301
 92653 95677 95825 90025 90063 95128 95123 94111 92663 93060 91205 92408
 90606 9

The 'BalanceGross' variable has only one unique value (0) so we will remove it.

In [138]:
df.drop(columns = ['BalanceGross'], inplace = True)
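
Rather than reading through the printed unique values, constant columns could also be found programmatically. This is a minimal sketch on made-up data; `toy` and its values are hypothetical.

```python
import pandas as pd

# toy frame (hypothetical values) with one constant column
toy = pd.DataFrame({
    "Term": [36, 56, 240],
    "BalanceGross": [0, 0, 0],
})

# a column with a single unique value carries no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
```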

We will encode some of our categorical variables for modeling.

In [139]:
df['RevLineCr'] = df['RevLineCr'].replace({'Y' : '1', 'N' : '2', 'T' : '3'})
df['LowDoc'] = df['LowDoc'].replace({'N' : '1', 'Y' : '2', 'S' : '3', 'A' : '4'})
df['MIS_Status'] = df['MIS_Status'].replace({'P I F' : '0', 'CHGOFF' : '1'})

df.head()

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,2001,32812,0,0,30000,15000
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,2003,30000,0,0,30000,15000
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,2001,30000,0,0,30000,15000
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,2003,50000,0,0,50000,25000
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,2006,343000,0,0,343000,343000
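
Note that `replace` leaves any value that is not in the dictionary untouched, which is why row 4 keeps a 'RevLineCr' of 0. As a hedged alternative, `map` can be used to surface such unexpected codes before modeling. The toy series below is made up for illustration.

```python
import pandas as pd

# toy column (hypothetical values) containing one code outside the mapping
rev_line = pd.Series(["Y", "N", "0", "T"])

# same codes as above, stored as integers instead of strings
codes = {"Y": 1, "N": 2, "T": 3}

# unlike replace, map turns values missing from the dictionary into NaN
encoded = rev_line.map(codes)
unmapped = rev_line[encoded.isna()].unique().tolist()

print(unmapped)
```

Listing the unmapped codes makes it a deliberate decision whether to keep them, recode them, or drop those records.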


In [140]:
# changing the variable data types to their appropriate formats
to_obj = ['Zip', 'ICS', 'ApprovalFY', 'NewExist', 'FranchiseCode', 'UrbanRural', 'DisbursementDate']

for item in to_obj:
    df[item] = df[item].astype('object')
    
# confirming our data types have been appropriately modified
df.dtypes

Zip                  object
ICS                  object
ApprovalFY           object
Term                  int64
NoEmp                 int64
NewExist             object
CreateJob             int64
RetainedJob           int64
FranchiseCode        object
UrbanRural           object
RevLineCr            object
LowDoc               object
DisbursementDate     object
DisbursementGross     int64
MIS_Status           object
ChgOffPrinGr          int64
GrAppv                int64
SBA_Appv              int64
dtype: object

There are no other anomalies or inconsistencies with the data so we will now move on to analysis.

In [141]:
df.to_csv('kopokopo_clean.csv', index = False)

df = pd.read_csv('kopokopo_clean.csv')
df.head()

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,2001,32812,0,0,30000,15000
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,2003,30000,0,0,30000,15000
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,2001,30000,0,0,30000,15000
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,2003,50000,0,0,50000,25000
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,2006,343000,0,0,343000,343000


# <font color='#2F4F4F'>3. Data Analysis</font>

Because there are too many variables to carry out individual univariate and bivariate analysis, we will use Pandas Profiling to generate a report for this section.

In [142]:
df.head(5)

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,2001,32812,0,0,30000,15000
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,2003,30000,0,0,30000,15000
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,2001,30000,0,0,30000,15000
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,2003,50000,0,0,50000,25000
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,2006,343000,0,0,343000,343000


In [143]:
from pandas_profiling import ProfileReport



In [144]:
# build the profiling report for the cleaned dataset
profile = ProfileReport(df)

In [145]:
df.drop(columns = ['DisbursementDate', 'GrAppv', 'SBA_Appv'], inplace = True)

In [146]:
df.head(5)

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,MIS_Status,ChgOffPrinGr
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,32812,0,0
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,30000,0,0
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,30000,0,0
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,50000,0,0
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,343000,0,0


# <font color='#2F4F4F'>4. Data Modeling</font>

In [155]:
df.head(5)

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,MIS_Status,ChgOffPrinGr
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,32812,0,0
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,30000,0,0
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,30000,0,0
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,50000,0,0
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,343000,0,0


In [156]:
# split into features (X) and target (Y)

X = df.drop(['MIS_Status'], axis=1)  #Independent/predictor variables
y = df['MIS_Status']  # Dependent/label variable


In [157]:
# splitting into 75-25 training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
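
If the two classes are not equally frequent, a stratified split keeps their proportions the same in the training and test sets. The sketch below uses made-up arrays (`X_toy`, `y_toy`) and is one possible variation on the split above, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 200 rows, 30% of them in class 1
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.array([1] * 60 + [0] * 140)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```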


In [158]:
X.head(5)

Unnamed: 0,Zip,ICS,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,ChgOffPrinGr
0,92801,532420,2001,36,1,1.0,0,0,1,0,1,1,32812,0
1,90505,531210,2001,56,1,1.0,0,0,1,0,1,1,30000,0
2,92103,531210,2001,36,10,1.0,0,0,1,0,1,1,30000,0
3,92108,531312,2003,36,6,1.0,0,0,1,0,1,1,50000,0
4,91345,531390,2006,240,65,1.0,3,65,1,1,0,1,343000,0


In [159]:
# scaling our features with min-max normalisation
# the scaler is fit on the training set only, then applied to both sets
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train = norm.transform(X_train)
X_test = norm.transform(X_test)
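
Min-max normalisation rescales each feature as (x - min) / (max - min), where min and max come from the training set. The sketch below checks this against `MinMaxScaler` on made-up loan terms; the arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy feature (hypothetical loan terms in months)
train = np.array([[36.0], [56.0], [240.0]])
test = np.array([[138.0], [300.0]])

# the scaler learns the minimum and maximum from the training set only
scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# the same transformation written out by hand
manual = (test - train.min()) / (train.max() - train.min())

print(scaled.ravel())
print(manual.ravel())
```

A test value larger than the training maximum ends up above 1, which is expected when the scaler is fit on the training data alone.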



In [160]:
# fit the classifiers to the training data and make predictions on the test set

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_classifier = LogisticRegression()

# Decision Tree 
from sklearn.tree import DecisionTreeClassifier
decision_classifier = DecisionTreeClassifier()

# Support Vector Machine
from sklearn.svm import SVC
svm_classifier = SVC()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
naive_classifier = GaussianNB()

# K-Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier()

# Bagging
from sklearn.ensemble import BaggingClassifier
bagging_meta_classifier = BaggingClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier()

# Ada Boosting
from sklearn.ensemble import AdaBoostClassifier
ada_boost_classifier = AdaBoostClassifier()

# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gbm_classifier = GradientBoostingClassifier() 

# XG Boosting
from xgboost import XGBClassifier
xg_boost_classifier = XGBClassifier()

In [161]:
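# evaluating the confusion matrices of each classifier
# note: confusion_matrix expects (y_true, y_pred); the calls below pass the predictions first,
# so rows are predicted labels and columns are true labels
from sklearn.metrics import classification_report, confusion_matrix
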

# Logistic Regression
logistic_classifier.fit(X_train, y_train)
logistic_y_prediction = logistic_classifier.predict(X_test)
print("Logistic Regression Classifier", confusion_matrix(logistic_y_prediction, y_test))


# Decision Tree
decision_classifier.fit(X_train, y_train)
decision_y_prediction = decision_classifier.predict(X_test) 
print("Decision Trees Classifier", confusion_matrix(decision_y_prediction, y_test))

# Support Vector Machine
svm_classifier.fit(X_train, y_train)
svm_y_prediction = svm_classifier.predict(X_test)
print("SVM Classifier", confusion_matrix(svm_y_prediction, y_test))

# Naive Bayes
naive_classifier.fit(X_train, y_train)
naive_y_prediction = naive_classifier.predict(X_test)
print("Naive Bayes Classifier", confusion_matrix(naive_y_prediction, y_test))

# K-Neighbors
knn_classifier.fit(X_train, y_train)
knn_y_prediction = knn_classifier.predict(X_test)
print("KNN Classifier", confusion_matrix(knn_y_prediction, y_test))

# Bagging
bagging_meta_classifier.fit(X_train, y_train)
bagging_y_classifier = bagging_meta_classifier.predict(X_test)
print("Bagging Classifier", confusion_matrix(bagging_y_classifier, y_test))


# Random Forest
random_forest_classifier.fit(X_train, y_train)
random_forest_y_classifier = random_forest_classifier.predict(X_test)
print("Random Forest Classifier", confusion_matrix(random_forest_y_classifier, y_test))

# Ada Boosting
ada_boost_classifier.fit(X_train, y_train)
ada_boost_y_classifier = ada_boost_classifier.predict(X_test)
print("Ada Boost Classifier", confusion_matrix(ada_boost_y_classifier, y_test))


# Gradient Boosting
gbm_classifier.fit(X_train, y_train)
gbm_y_classifier = gbm_classifier.predict(X_test)
print("GBM Classifier", confusion_matrix(gbm_y_classifier, y_test))


# XG Boosting
xg_boost_classifier.fit(X_train, y_train)
xg_boost_y_classifier = xg_boost_classifier.predict(X_test)
print("XGBoost Classifier", confusion_matrix(xg_boost_y_classifier, y_test))



Logistic Regression Classifier [[331  26]
 [ 30 137]]
Decision Trees Classifier [[361   1]
 [  0 162]]
SVM Classifier [[340  24]
 [ 21 139]]
Naive Bayes Classifier [[275   0]
 [ 86 163]]
KNN Classifier [[337  25]
 [ 24 138]]
Bagging Classifier [[361   0]
 [  0 163]]
Random Forest Classifier [[361   0]
 [  0 163]]
Ada Boost Classifier [[361   0]
 [  0 163]]
GBM Classifier [[361   2]
 [  0 161]]
XGBoost Classifier [[361   0]
 [  0 163]]
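
As a worked example, the headline metrics can be recomputed by hand from the logistic regression matrix printed above. Because the predictions were passed as the first argument, rows are predicted labels and columns are true labels. The variable names are chosen for this sketch only.

```python
import numpy as np

# logistic regression matrix from the output above
# rows = predicted label, columns = true label
cm = np.array([[331, 26],
               [30, 137]])

tp = cm[1, 1]   # predicted 1, truly 1
fp = cm[1, 0]   # predicted 1, truly 0
fn = cm[0, 1]   # predicted 0, truly 1

accuracy = np.trace(cm) / cm.sum()                    # correct predictions over all records
precision = tp / (tp + fp)                            # class 1 precision
recall = tp / (tp + fn)                               # class 1 recall
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

The column sums (361 and 163) are the class supports, so the same arithmetic applies to the other matrices.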


In [162]:
# We now print the classification report
# Precision: of the records predicted as a class, the share that truly belong to it.
# Recall: of the records that truly belong to a class, the share that the model finds.
# The f1-score is the harmonic mean of precision and recall.
# The support is the number of occurrences of the given class in the test set.
# Accuracy is the share of all test records that are classified correctly.


print('Logistic classifier:')
print(classification_report(y_test, logistic_y_prediction))

print('Decision Tree classifier:')
print(classification_report(y_test, decision_y_prediction))

print('SVM Classifier:')
print(classification_report(y_test, svm_y_prediction))

print('KNN Classifier:')
print(classification_report(y_test, knn_y_prediction))

print('Naive Bayes Classifier:')
print(classification_report(y_test, naive_y_prediction)) 


print('Bagging Meta Classifier:')
print(classification_report(y_test, bagging_y_classifier)) 

print('Random Forest Classifier:')
print(classification_report(y_test, random_forest_y_classifier)) 



print('Ada Boost Classifier:')
print(classification_report(y_test, ada_boost_y_classifier)) 

print('GBM Classifier:')
print(classification_report(y_test, gbm_y_classifier)) 

print('XGBoost Classifier:')
print(classification_report(y_test, xg_boost_y_classifier)) 



Logistic classifier:
              precision    recall  f1-score   support

           0       0.93      0.92      0.92       361
           1       0.82      0.84      0.83       163

    accuracy                           0.89       524
   macro avg       0.87      0.88      0.88       524
weighted avg       0.89      0.89      0.89       524

Decision Tree classifier:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       361
           1       1.00      0.99      1.00       163

    accuracy                           1.00       524
   macro avg       1.00      1.00      1.00       524
weighted avg       1.00      1.00      1.00       524

SVM Classifier:
              precision    recall  f1-score   support

           0       0.93      0.94      0.94       361
           1       0.87      0.85      0.86       163

    accuracy                           0.91       524
   macro avg       0.90      0.90      0.90       524
weighted av

# <font color='#2F4F4F'>6. Recommendations</font>

What recommendations can you provide?

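All ten classifiers exceed the 80% accuracy target on the test set. According to the confusion matrices, the Bagging, Random Forest, Ada Boost and XGBoost classifiers classify all 524 test records correctly, while the Decision Tree misclassifies 1 record and GBM misclassifies 2.
Naive Bayes has the lowest accuracy at about 84%, followed by Logistic Regression at about 89%, with KNN and SVM at about 91%.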

The best-performing classifiers are listed below:

1.   Random Forest Classifier
2.   Ada Boost Classifier
3.   GBM Classifier
4.   XGBoost Classifier
5.   Bagging Meta Classifier


# <font color='#2F4F4F'>7. Challenging your Solution</font>

#### a) Did we have the right question?
Yes


#### b) Did we have the right data?
Yes

#### c) What can be done to improve the solution?


We can still apply further model optimization techniques, e.g. additional data cleaning, feature engineering and checking model assumptions, to find the best classifier.
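
One check that may be worth adding, given the near-perfect scores above, is whether any feature gives away the target. 'ChgOffPrinGr' is among the features, and in the rows shown it is 0 for every paid-in-full loan; if it is nonzero only for charged-off loans, a model can read the label straight from it. The sketch below shows the check on made-up values, not on the loan data.

```python
import pandas as pd

# toy data (hypothetical values): the charged-off amount is nonzero only for defaults
toy = pd.DataFrame({
    "ChgOffPrinGr": [0, 0, 0, 12500, 0, 30000],
    "MIS_Status":   [0, 0, 0, 1,     0, 1],
})

# cross-tabulate "any amount charged off" against the target
flag = (toy["ChgOffPrinGr"] > 0).astype(int)
table = pd.crosstab(flag, toy["MIS_Status"])
print(table)

# if the flag agrees with the target on every row, the feature leaks the label
leaks = bool((flag == toy["MIS_Status"]).all())
print(leaks)
```

If the same pattern holds on the real data, dropping that column before training would give a more honest estimate of how well defaults can be predicted in advance.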

### 7.1 Improving the Solution

We will use cross-validation and K-Fold cross-validation to get a more reliable estimate of how our models perform.

#### 7.1.1 Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score as cvs
from sklearn import metrics

In [None]:

log_scores = cvs(logistic_classifier, X, y, cv = 3)
print("Logistic Regression CV scores:", log_scores)

tree_scores = cvs(decision_classifier, X, y, cv = 3)
print("Decision Tree CV scores:", tree_scores)

svm_scores = cvs(svm_classifier , X, y, cv = 3)
print("Support Vector Machine CV scores:", svm_scores)

nb_scores = cvs(naive_classifier, X, y, cv = 3)
print("Naive Bayes CV scores:", nb_scores)

knn_scores = cvs(knn_classifier, X, y, cv = 3)
print("K-Neighbors CV scores:", knn_scores)

bag_scores = cvs(bagging_meta_classifier, X, y, cv = 3)
print("Bagging CV scores:", bag_scores)

rf_scores = cvs(random_forest_classifier, X, y, cv = 3)
print("Random Forest CV scores:", rf_scores)

ada_scores = cvs(ada_boost_classifier, X, y, cv = 3)
print("Ada Boosting CV scores:", ada_scores)

grad_scores = cvs(gbm_classifier, X, y, cv = 3)
print("Gradient Boosting CV scores:", grad_scores)

xgb_scores = cvs(xg_boost_classifier, X, y, cv = 3)
print("XG Boosting CV scores:", xgb_scores)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression CV scores: [0.98567335 0.99713467 0.97560976]
Decision Tree CV scores: [0.98710602 0.99570201 0.98995696]
Support Vector Machine CV scores: [0.88108883 0.89255014 0.86226686]
Naive Bayes CV scores: [0.98710602 0.91833811 0.99139168]
K-Neighbors CV scores: [0.97421203 0.99283668 0.97560976]
Bagging CV scores: [0.98710602 0.99856734 0.9856528 ]
Random Forest CV scores: [0.98710602 1.         0.98852224]
Ada Boosting CV scores: [0.98710602 0.96561605 0.99139168]
Gradient Boosting CV scores: [0.98710602 0.99713467 0.9928264 ]
XG Boosting CV scores: [0.98710602 1.         0.98995696]
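
The convergence warning above suggests scaling the data. One way to do that inside cross-validation is to wrap the scaler and the classifier in a pipeline, so the scaler is refit on the training part of every fold. This is a minimal sketch on synthetic data; `X_demo`, `y_demo` and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for X and y (hypothetical data, not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# the scaler is fit on the training part of each fold only, so no information
# from the held-out fold reaches it; scaled inputs also help the solver converge
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X_demo, y_demo, cv=3)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

Reporting the mean and standard deviation of the fold scores makes the classifiers easier to compare than three separate numbers each.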


#### 7.1.2 K-Folds Cross Validation

In [124]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# the classifiers defined earlier, keyed by display name
models = {
    "Logistic Regression": logistic_classifier,
    "Decision Tree": decision_classifier,
    "Support Vector Machine": svm_classifier,
    "Naive Bayes": naive_classifier,
    "K-Neighbors": knn_classifier,
    "Bagging": bagging_meta_classifier,
    "Random Forest": random_forest_classifier,
    "Ada Boosting": ada_boost_classifier,
    "Gradient Boosting": gbm_classifier,
    "XG Boosting": xg_boost_classifier,
}

# applying KFold with 3 splits onto our features
kf3 = KFold(n_splits = 3, shuffle = True)

print("We are using " + str(kf3.get_n_splits(X)) + " folds.\n")

model_count = 1 # to keep track of the fold we are working with

# creating training and test sets using these folds
for train_index, test_index in kf3.split(X):
    print("Training fold", model_count)

    # KFold yields positional indices, so we select rows with .iloc
    # (X[train_index] looks the integers up as column labels and raises a KeyError)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # fit and assess every classifier on the current fold
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print("Assessing accuracy of model", model_count)
        print(name + " Confusion Matrix:\n", confusion_matrix(y_test, pred))
        print("Accuracy:", accuracy_score(y_test, pred))
        print()

    model_count += 1