# Project 3: Classification
#Daniel Arthur

Index:
Business Understanding: Notebook clearly explains the project’s value for helping a specific stakeholder solve a real-world problem.
    
    Introduction explains the real-world problem the project aims to solve
    Introduction identifies stakeholders who could use the project and how they would use it
    Conclusion summarizes implications of the project for the real-world problem and stakeholders 


Data Understanding:  Notebook clearly describes the source and properties of the data to show how useful the data are for solving the problem of interest

    Describe the data sources and explain why the data are suitable for the project
    Present the size of the dataset and descriptive statistics for all features used in the analysis
    Justify the inclusion of features based on their properties and relevance for the project
    Identify any limitations of the data that have implications for the project


Data Preparation: Notebook shows how you prepare your data and explains why by including

    Instructions or code needed to get and prepare the raw data for analysis
    Code comments and text to explain what your data preparation code does
    Valid justifications for why the steps you took are appropriate for the problem you are solving


Modeling: Notebook demonstrates an iterative approach to model-building.

    Runs and interprets a simple, baseline model for comparison
    Introduces new models that improve on prior models and interprets their results
    Explicitly justifies model changes based on the results of prior models and the problem context
    Explicitly describes any improvements found from running new models



Evaluation: Notebook shows how well a final model solves the real-world problem.

    Justifies choice of metrics using context of the real-world problem and consequences of errors
    Identifies one final model based on performance on the chosen metrics with validation data
    Evaluates the performance of the final model using holdout test data
    Discusses implications of the final model evaluation for solving the real-world problem


Code Quality: Code in notebook and related files meets professional standards 
    Code is easy to read, using comments, spacing, variable names, and function docstrings
    All code runs and no code or comments are included that are not needed for the project 
    Code minimizes repetition, using loops, functions, and classes
    Code adapted from others is properly cited with author names and location of the cited material


Dummy Model\
Log Regression\
Forests\
    Random Forest\
    Decision Trees\
    Bagged Trees\
Weak Learners\
    Addaboost\
    Gradient Boosting\
KNeighborsClassifier\

Introduction

Buisness Understanding

Importing the necessary libraries

In [180]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import statsmodels.api as sm
import statsmodels.formula.api as smf


from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.impute import MissingIndicator, SimpleImputer

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score
# plot_confusion_matrix is a handy visual tool, added in the latest version of scikit-learn
# if you are running an older version, comment out this line and just use confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_roc_curve

Importing the data that will be used.
Data Understanding

In [181]:
data_labels=pd.read_csv('Training_Set_Values.csv')
data_values=pd.read_csv('Training_Set_Labels.csv')
data = data_values.merge(data_labels, on='id')

In [182]:
data.head()

Unnamed: 0,id,status_group,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,functional,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,functional,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,functional,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,non functional,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,functional,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


The target Variable is 'Status_Group'

In [183]:
data.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

Data Preperation

In [184]:
waste_features=['wpt_name','num_private','subvillage','region_code','recorded_by','management_group',
                'extraction_type_group','extraction_type_class','scheme_name','payment','quality_group',
                'quantity_group','source_type','source_class','waterpoint_type_group','ward','installer',
                'public_meeting','permit','date_recorded','construction_year','id']
data.drop(waste_features, axis = 1, inplace = True)

In [185]:
#data.info()

In [186]:
data['funder'] = pd.factorize(data['funder'])[0]
data['scheme_management'] = pd.factorize(data['scheme_management'])[0]
data['extraction_type'] = pd.factorize(data['extraction_type'])[0]
data['management'] = pd.factorize(data['management'])[0]
data['payment_type'] = pd.factorize(data['payment_type'])[0]
data['water_quality'] = pd.factorize(data['water_quality'])[0]
data['quantity'] = pd.factorize(data['quantity'])[0]
data['source'] = pd.factorize(data['source'])[0]
data['waterpoint_type'] = pd.factorize(data['waterpoint_type'])[0]
data['basin'] = pd.factorize(data['basin'])[0]
data['region'] = pd.factorize(data['region'])[0]
data['lga'] = pd.factorize(data['lga'])[0]
data['district_code'] = pd.factorize(data['district_code'])[0]

In [187]:
y = data.status_group
X = data.drop('status_group', axis = 1)

In [188]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [189]:
X_scaled_train = StandardScaler().fit_transform(X_train)
X_scaled_test = StandardScaler().fit_transform(X_test)

Useful Functions

In [190]:
def Evaluate(true, preds, model_name):
    acc = accuracy_score(true, preds)
    f1 = f1_score(true, preds, average = 'macro')
    prec = precision_score(true, preds, average = 'macro')
    rec = recall_score(true, preds, average = 'macro')
    
    print("Model: {}".format(model_name))
    print("Accuracy: {}".format(acc))
    print("F1-Score: {}".format(f1))
    print('Precision: {}'.format(prec))
    print('Recall: {}'.format(rec))

In [191]:
class ModelWithCV():
    '''Structure to save the model and more easily see its crossvalidation'''
    
    def __init__(self, model, model_name, X, y, cv_now=True):
        self.model = model
        self.name = model_name
        self.X = X
        self.y = y
        # For CV results
        self.cv_results = None
        self.cv_mean = None
        self.cv_median = None
        self.cv_std = None
        #
        if cv_now:
            self.cross_validate()
        
    def cross_validate(self, X=None, y=None, kfolds=10):
        '''
        Perform cross-validation and return results.
        
        Args: 
          X:
            Optional; Training data to perform CV on. Otherwise use X from object
          y:
            Optional; Training data to perform CV on. Otherwise use y from object
          kfolds:
            Optional; Number of folds for CV (default is 10)  
        '''
        
        cv_X = X if X else self.X
        cv_y = y if y else self.y

        self.cv_results = cross_val_score(self.model, cv_X, cv_y, cv=kfolds)
        self.cv_mean = np.mean(self.cv_results)
        self.cv_median = np.median(self.cv_results)
        self.cv_std = np.std(self.cv_results)

        
    def print_cv_summary(self):
        cv_summary = (
        f'''CV Results for `{self.name}` model:
            {self.cv_mean:.5f} ± {self.cv_std:.5f} accuracy
        ''')
        print(cv_summary)

        
    def plot_cv(self, ax):
        '''
        Plot the cross-validation values using the array of results and given 
        Axis for plotting.
        '''
        ax.set_title(f'CV Results for `{self.name}` Model')
        # Thinner violinplot with higher bw
        sns.violinplot(y=self.cv_results, ax=ax, bw=.4)
        sns.swarmplot(
                y=self.cv_results,
                color='orange',
                size=10,
                alpha= 0.8,
                ax=ax
        )

        return ax

In [192]:
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), X_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

In [193]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))

In [194]:
def PrintEvaluation(y_train, y_test, X_train, X_test, train_preds, test_preds, model):
    print("Training Metrics")
    Evaluate(y_train, train_preds, model)
    print("")
    
    print('Training Cross Val Score')
    print(cross_val_score(model, X_train, y_test, cv=5))
    print("")
    
    print("Testing Metrics")
    Evaluate(y_test, test_preds, model_name='simple_logreg_model')
    print("")
    
    print('Testing Cross Val Score')
    print(cross_val_score(model, X_test, y_test, cv=5))

Modeling

1st Model - Dummy Model

In [195]:
#instantiate model
dummy_model = DummyClassifier(strategy='stratified')

#fit model
dummy_model.fit(X_scaled_train, y_train)

# make predictions
dummy_model_pred_train = dummy_model.predict(X_scaled_train)
dummy_model_pred_test = dummy_model.predict(X_scaled_test)

In [196]:
# Save Information
dummy_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_scaled_train, 
                        y=y_train)

Model Evaluation

In [197]:
PrintEvaluation(y_train, y_test, X_train, X_test, dummy_model_pred_train, dummy_model_pred_test, dummy_model)

Training Metrics
Model: DummyClassifier(strategy='stratified')
Accuracy: 0.4462962962962963
F1-Score: 0.3318015130760756
Precision: 0.3318045260978902
Recall: 0.33183233944243823

Training Cross Val Score
[0.44680135 0.44781145 0.45454545 0.44882155 0.44494949]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.44912457912457915
F1-Score: 0.33396615416312
Precision: 0.3339863040047922
Recall: 0.3339819090804473

Testing Cross Val Score
[0.45454545 0.44259259 0.4473064  0.44949495 0.44494949]


In [198]:
#fig, ax = plt.subplots()

#ax = dummy_model_results.plot_cv(ax)
#plt.tight_layout();

#dummy_model_results.print_cv_summary()

In [199]:
#fig, ax = plt.subplots()

#fig.suptitle("Dummy Model")

#plot_confusion_matrix(dummy_model, X_scaled_train, y_train, ax=ax, cmap="plasma");

In [200]:
# just the numbers (this should work even with older scikit-learn)
#confusion_matrix(y_train, dummy_model_pred)

2nd Model: Log Regression

In [201]:
# Instantitate Model
simple_logreg_model = LogisticRegression(random_state=2021, penalty='none')

# Fit Model
simple_logreg_model.fit(X_scaled_train, y_train)

# Make Predictions
simple_logreg_pred_train = simple_logreg_model.predict(X_scaled_train)
simple_logreg_pred_test = simple_logreg_model.predict(X_scaled_test)

In [202]:
# Save Information
simple_logreg_results = ModelWithCV(
                        model=simple_logreg_model,
                        model_name='simple_logreg',
                        X=X_scaled_train, 
                        y=y_train
)

Evaluate Models

In [203]:
PrintEvaluation(y_train, y_test, X_scaled_train, X_scaled_test, simple_logreg_pred_train,
                    simple_logreg_pred_test, simple_logreg_model)

Training Metrics
Model: LogisticRegression(penalty='none', random_state=2021)
Accuracy: 0.6374074074074074
F1-Score: 0.4249180534740104
Precision: 0.5040754899410311
Recall: 0.43848257377702243

Training Cross Val Score
[0.54242424 0.54242424 0.54242424 0.54225589 0.54242424]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.6364983164983165
F1-Score: 0.42412569521087423
Precision: 0.4851825189070944
Recall: 0.43837201067062076

Testing Cross Val Score
[0.63417508 0.62794613 0.63198653 0.64410774 0.63535354]


In [204]:
# Saving variable for convenience
#model_results = simple_logreg_results

# Plot CV results
#fig, ax = plt.subplots()
#ax = model_results.plot_cv(ax)
#plt.tight_layout();
# Print CV results
#model_results.print_cv_summary()

In [205]:
#confusion_matrix(y_train, simple_logreg_pred)

In [206]:
#fig, ax = plt.subplots()

#fig.suptitle("Logistic Regression with Numeric Features Only")

#plot_confusion_matrix(simple_logreg_model, X_scaled_train, y_train, ax=ax, cmap="plasma");

3rd Model: Random Forest Classifier

In [207]:
#instanitate model
rfc = RandomForestClassifier(n_estimators=500,max_features='auto',
                                         min_samples_split=8)

#fit model
rfc.fit(X_train, y_train)

#predict with model
rfc_prediction_train = rfc.predict(X_train)
rfc_prediction_test = rfc.predict(X_test)

In [208]:
# Save Information
rfc_model_results = ModelWithCV(
                        model=rfc,
                        model_name='Random Forest Classifier',
                        X=X_train, 
                        y=y_train)

Evaluate Model

In [209]:
#Evaluate(y_test, rfc_prediction_test, rfc)
#confusion_matrix(y_train, rfc_prediction_train)

array([[15790,    74,   285],
       [  699,  1263,   181],
       [  884,    60, 10464]], dtype=int64)

In [210]:
PrintEvaluation(y_train, y_test, X_train, X_test, rfc_prediction_train, rfc_prediction_test, rfc)

Training Metrics
Model: RandomForestClassifier(min_samples_split=8, n_estimators=500)
Accuracy: 0.9264983164983165
F1-Score: 0.8641686929592242
Precision: 0.9234422733328991
Recall: 0.8281270941706819

Training Cross Val Score
[0.49410774 0.49646465 0.49360269 0.50016835 0.49107744]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.8004377104377104
F1-Score: 0.6815955085555426
Precision: 0.7388070005016617
Recall: 0.6557502782626199

Testing Cross Val Score
[0.79646465 0.79360269 0.79057239 0.79713805 0.79949495]


4th Model: Decision Tree Classifier

In [211]:
#instantiate model
dtc = DecisionTreeClassifier(criterion='entropy')

#fit model
dtc.fit(X_train, y_train)

#make predictions
dtc_predictions_train = dtc.predict(X_train)
dtc_predictions_test = dtc.predict(X_test)

In [212]:
# Save Information
dtc_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate Model

In [213]:
PrintEvaluation(y_train, y_test, X_train, X_test, dtc_predictions_train, dtc_predictions_test, dtc)

Training Metrics
Model: DecisionTreeClassifier(criterion='entropy')
Accuracy: 0.997037037037037
F1-Score: 0.9931803680070402
Precision: 0.9948372520216977
Recall: 0.9915452017401787

Training Cross Val Score
[0.44225589 0.44292929 0.4506734  0.44983165 0.44747475]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.7454208754208754
F1-Score: 0.6422453162551603
Precision: 0.640876535251733
Recall: 0.6436755174861489

Testing Cross Val Score
[0.74276094 0.74444444 0.72878788 0.73434343 0.73047138]


Feature importance

In [214]:
# Feature importance
#dtc.feature_importances_

Plot the decision tree

In [215]:
#fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (3,3), dpi=300)
#tree.plot_tree(dtc,
#               feature_names = data.columns, 
#               class_names=np.unique(y).astype('str'),
#               filled = True)
#plt.show()

5th Model: Bagged Trees

In [216]:
# Instantiate a BaggingClassifier
bagged_tree =  BaggingClassifier(DecisionTreeClassifier(criterion='gini', max_depth=5), 
                                 n_estimators=20)

# Fit to the training data
bagged_tree.fit(X_scaled_train, y_train)

# Make Predictions
bagged_tree_predictions_train = bagged_tree.predict(X_scaled_train)
bagged_tree_predictions_test = bagged_tree.predict(X_scaled_test)

In [217]:
# Save Information
bagged_tree_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate the Model

In [218]:
PrintEvaluation(y_train, y_test, X_train, X_test,
                bagged_tree_predictions_train, bagged_tree_predictions_test, bagged_tree)

Training Metrics
Model: BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
                  n_estimators=20)
Accuracy: 0.7147138047138047
F1-Score: 0.5054846994384254
Precision: 0.7176038314934585
Recall: 0.5021809554099562

Training Cross Val Score
[0.54259259 0.54225589 0.54158249 0.54276094 0.54208754]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.6922222222222222
F1-Score: 0.5096219594642347
Precision: 0.6253718379963326
Recall: 0.4993226025899993

Testing Cross Val Score
[0.70757576 0.70218855 0.70858586 0.71885522 0.70909091]


6th Model: Random forests

In [242]:
# Instantiate  RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, max_depth= 15)

# fit a RandomForestClassifier
forest.fit(X_scaled_train, y_train)

# make Predictions
forest_train_prediction = forest.predict(X_scaled_train)
forest_test_prediction = forest.predict(X_scaled_test)

In [220]:
# Save Information
Random_Forests_model_results = ModelWithCV(
                        model=forest,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate

In [243]:
PrintEvaluation(y_train, y_test, X_scaled_train, X_scaled_test,
                forest_train_prediction, forest_test_prediction , forest)

Training Metrics
Model: RandomForestClassifier(max_depth=15, n_estimators=10)
Accuracy: 0.8817845117845118
F1-Score: 0.8068529432028665
Precision: 0.9033609557800872
Recall: 0.7612616713807151

Training Cross Val Score
[0.52626263 0.52643098 0.52828283 0.52643098 0.53131313]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.7344107744107744
F1-Score: 0.5743104690742362
Precision: 0.6936947048132912
Recall: 0.5523432917544999

Testing Cross Val Score
[0.77323232 0.77188552 0.77542088 0.77794613 0.78080808]


In [222]:
#plot_feature_importances(forest)

Look at the trees in your forest

In [223]:
# Instantiate  a RandomForestClassifier
#forest_2 = RandomForestClassifier(n_estimators = 5, max_features= 10, max_depth= 2)

#  fit a RandomForestClassifier
#forest_2.fit(X_scaled_test, y_test)

In [224]:
# First tree from forest_2
#rf_tree_1 = forest_2.estimators_[0]

In [225]:
# Feature importance
#plot_feature_importances(rf_tree_1)

In [226]:
# Second tree from forest_2
#rf_tree_2 = forest_2.estimators_[1]

In [227]:
# Feature importance
#plot_feature_importances(rf_tree_2)

Evaluate

Weak Learners

7th Model: AdaBoostClassifier

In [228]:
# Instantiate an AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(random_state=42)

# Fit the models
adaboost_clf.fit(X_train, y_train)

# AdaBoost model predictions
adaboost_train_preds = adaboost_clf.predict(X_train)
adaboost_test_preds = adaboost_clf.predict(X_test)

In [229]:
# Save Information
AdaBoost_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate the Model

In [230]:
PrintEvaluation(y_train, y_test, X_train, X_test,
                adaboost_train_preds, adaboost_test_preds , adaboost_clf)

Training Metrics
Model: AdaBoostClassifier(random_state=42)
Accuracy: 0.7149831649831649
F1-Score: 0.511089385108986
Precision: 0.6526855030212608
Recall: 0.5079811207360851

Training Cross Val Score
[0.54309764 0.54309764 0.54225589 0.54124579 0.54074074]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.7161616161616161
F1-Score: 0.512649009520762
Precision: 0.6649707101849494
Recall: 0.5099027504312733

Testing Cross Val Score
[0.71632997 0.71313131 0.71329966 0.72794613 0.71498316]


In [231]:
#adaboost_confusion_matrix = confusion_matrix(y_test, adaboost_test_preds)
#adaboost_confusion_matrix

In [232]:
#adaboost_classification_report = classification_report(y_test, adaboost_test_preds)
#print(adaboost_classification_report)

8th Model: GradientBoostingClassifier

In [233]:
# Instantiate an GradientBoostingClassifier
gbt_clf = GradientBoostingClassifier(random_state=42)

# Fit the models
gbt_clf.fit(X_train, y_train)

# GradientBoosting model predictions
gbt_clf_train_preds = gbt_clf.predict(X_train)
gbt_clf_test_preds = gbt_clf.predict(X_test)

In [234]:
# Save Information
GradientBoostingClassifier_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate the Model

In [235]:
PrintEvaluation(y_train, y_test, X_train, X_test,
                gbt_clf_train_preds, gbt_clf_test_preds , gbt_clf)

Training Metrics
Model: GradientBoostingClassifier(random_state=42)
Accuracy: 0.7523905723905724
F1-Score: 0.5800240444284713
Precision: 0.74501847205463
Recall: 0.5592450912633624

Training Cross Val Score
[0.54090909 0.54309764 0.54006734 0.53973064 0.53989899]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.7456228956228956
F1-Score: 0.5753571170220589
Precision: 0.7259104767101037
Recall: 0.5553569988564615

Testing Cross Val Score
[0.7479798  0.74208754 0.746633   0.75319865 0.73804714]


In [236]:
#gbt_confusion_matrix = confusion_matrix(y_test, gbt_clf_test_preds)
#gbt_confusion_matrix

In [237]:
#gbt_classification_report = classification_report(y_test, gbt_clf_test_preds)
#print(gbt_classification_report)

9th Model: KNeighborsClassifier

In [239]:
# Instantiate KNeighborsClassifier
KNC = KNeighborsClassifier()

# Fit the classifier
KNC.fit(X_scaled_train, y_train)

# Predict on the test set
KNC_prediction_train = KNC.predict(X_scaled_test)
KNC_prediction_test = KNC.predict(X_scaled_test)

In [None]:
# Save Information
KNeighborsClassifier_model_results = ModelWithCV(
                        model=KNC,
                        model_name=KNeighborsClassifier,
                        X=X_scaled_train, 
                        y=y_train)

Evaluate Model

In [240]:
PrintEvaluation(y_train, y_test, X_scaled_train, X_scaled_test,
                KNC_prediction_train, KNC_prediction_test , KNC)

Training Metrics
Model: KNeighborsClassifier()
Accuracy: 0.46114478114478114
F1-Score: 0.3293098285571336
Precision: 0.3309834386969518
Recall: 0.33077503137588643

Training Cross Val Score
[0.47777778 0.49545455 0.49343434 0.49579125 0.48164983]

Testing Metrics
Model: simple_logreg_model
Accuracy: 0.7488215488215488
F1-Score: 0.6301863336615193
Precision: 0.660779368879722
Recall: 0.6132580183380363

Testing Cross Val Score
[0.75117845 0.74242424 0.74191919 0.7523569  0.74646465]


Improve model performance

In [None]:
find_best_k(X_scaled_train, y_train, X_scaled_test, y_test)

10th Model: Improved using 'K' learned above

In [None]:
# Instantiate KNeighborsClassifier
KNC2 = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier
KNC.fit(X_scaled_train, y_train)

# Predict on the test set
KNC2_prediction_train = clf.predict(X_scaled_train)
KNC2_prediction_test = clf.predict(X_scaled_test)

In [None]:
# Save Information
KNC2_model_results = ModelWithCV(
                        model=dummy_model,
                        model_name='dummy',
                        X=X_train, 
                        y=y_train)

Evaluate

In [None]:
PrintEvaluation(y_train, y_test, X_scaled_train, X_scaled_test,
                KNC2_prediction_train, KNC2_prediction_test , KNC2)

Models Compared

Conclusion
Buisness Understanding