<style>
h1, h2, h3 {
    color: black;
    font-weight: bold;
}
h4,h5 {
    color: gray
}
</style>
## <span style="font-weight:bold;">Logistic Regression, Decision Tree, K-Nearest Neighbors, and Support Vector Machine Classifiers</span>
### <span style="font-weight:bold;">Overview</span>
This project compares the performance of four classic classifiers (Logistic Regression, Decision Tree, K-Nearest Neighbors, and Support Vector Machines)
using a dataset provided by a Portuguese banking institution.  A configuration .ini file selects which classifiers run, so each 
classifier model can be debugged individually.
Two generic classification functions are defined.  The first fits a basic model with whichever classifier is passed in (the code names this argument `regressor`).  The second 
takes a grid of hyper-parameters and runs a GridSearchCV over the same classifier.  Both return the model training time and 
the model's classification report for comparison at the end.

Classifier performance is compared using Accuracy, Precision, Recall, F1 score, and model training time.
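
The result tables in this document report the weighted-average precision, recall, and F1 score (the `weighted avg` row of scikit-learn's `classification_report`). The sketch below recomputes those averages by hand so the arithmetic is visible. The labels are made up and the helper name `weighted_metrics` is an illustrative assumption, not part of the project.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), averaged over classes weighted by support."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / count
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if (prec_c + rec_c) else 0.0
        weight = count / n  # class share of the test set
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return accuracy, precision, recall, f1

# Made-up, imbalanced labels: eight 'no' (0) and two 'yes' (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because each class's recall is weighted by its support, the weighted recall always equals the accuracy, which is why the Recall and Accuracy columns in the result tables are identical.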

### <span style="font-weight:bold;">Source:</span>  <span style="color:black;">https://archive.ics.uci.edu/dataset/222/bank+marketing</span>
The data is a multivariate business dataset from direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based 
on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be 
subscribed ('yes') or not ('no'). This project uses the dataset bank-additional-full.csv, with 41,188 rows and 20 input columns plus the output column y, ordered by date (from May 2008 to 
November 2010).  The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

### <span style="font-weight:bold;">Project Organization</span>
The project is organized so that it can be used in an automated environment.  Individual directories, configuration files, and trained models
can be written out and read back for testing new data.
#### <span style="font-weight:bold;">Dataset: </span>./data/bank-additional-full.csv
#### <span style="font-weight:bold;">Configuration:</span> ./BankTermDeposit.ini
The configuration file serves multiple purposes: it identifies the source of the model training data, controls the train/test data split ratio, manages verbosity, 
and enables or disables training and testing of each classifier. Additionally, it specifies the name of the trained model for local storage.
#### <span style="font-weight:bold;">Trained Models: </span> specified by model_prefix and model_outputFile in .ini
One file is generated for each classifier activated in the configuration file, named `<model_prefix><classifier><timestamp>.pkl`, for example `BankTermDesposit_LinearRegression1724741541.9205184.pkl`.  The combined results of a run are written to `<model_outputFile><timestamp>.pkl`.
These files are stored in the local directory.  They can be read back and used to classify new, unseen datasets with
the same data frame format.
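
As a minimal sketch of the write-out and read-back cycle, the example below pickles a small dictionary to a temporary directory and loads it again. The dictionary contents and the file name are placeholder values chosen for illustration; the project stores full classifier reports instead.

```python
import os
import pickle
import tempfile
import time

# Stand-in for the per-classifier results; keys and values are illustrative only.
results = {"model_accuracy": [0.91], "train_time": [0.05]}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder name following a <prefix><classifier><timestamp>.pkl pattern.
    file_name = f"ExampleModel_LogisticRegression{time.time()}.pkl"
    path = os.path.join(tmp_dir, file_name)

    # Write the object out in binary mode.
    with open(path, "wb") as model_file:
        pickle.dump(results, model_file)

    # Read it back; the restored object is an equal copy of the original.
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == results)
```

Two cautions apply to any pickle-based workflow: `pickle.load` can execute arbitrary code from a crafted file, so only files from a trusted source should be loaded, and pickled scikit-learn estimators are best loaded with the same library version that produced them.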
#### <span style="font-weight:bold;">Results</span>
Classification results from the selected classifiers are tabulated and printed at the end of the process run.






### <span style="font-weight:bold;">Process Flow</span>
#### <span style="font-weight:bold;">Configuration</span>
Read in necessary system and process flow control related configuration from the supplied .ini file.
#### <span style="font-weight:bold;">Data Pre-processing, cleaning</span>
Prepare the data by removing bad data and null values, and make the resulting data frame available as the common dataset for later process stages.  Details are provided 
later in this document.
#### <span style="font-weight:bold;">Split data for training and testing</span>
The dataset is split according to the proportion specified in the configuration file BankTermDeposit.ini (variable train_test_split, currently 0.3, which is passed to the split as the test fraction).
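
A minimal sketch of the split on synthetic data follows. The `stratify=y` argument is an optional addition that the project does not use; it is shown here because it keeps the class ratio the same in both partitions when one class is rare. The array sizes and the variable name `test_fraction` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 10% positive labels.
rng = np.random.default_rng(12)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

test_fraction = 0.3  # plays the role of train_test_split in the .ini file

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_fraction, random_state=12, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of positive labels in each partition
```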

#### <span style="font-weight:bold;">Generic Classification Modules</span>
These are the two generic functions that perform classification and grid search with the classifier passed in.  The resulting classification report, 
measured training time, and accuracy scores are gathered for later comparison with the other classifiers.  Classifier training time (the data fitting time) is 
calculated by timestamping the start and end of the fit on the training data.
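
The timing idea can be sketched in isolation as below. The notebook itself uses `time.time()`; this sketch swaps in `time.perf_counter()`, a clock intended for measuring short intervals. The helper name `timed_call` and the placeholder workload are assumptions for illustration.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Placeholder workload standing in for a call such as model.fit(X_train, y_train).
total, seconds = timed_call(sum, range(100_000))
print(f"result={total}, fit time={seconds:.7f} s")
```

Wrapping only the fit call keeps pipeline construction and prediction out of the measured interval.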

#### <span style="font-weight:bold;">Classification</span>
The execution of each classifier (Logistic Regression, Decision Tree, K-Nearest Neighbors, and Support Vector Machines) is controlled by the ini file's 
parameters LogisticRegression, SVMGridSearch, DecisionTreeClassifier, and KNNearestNeighbors.  Each module first runs the basic model classification, and then
prepares the hyper-parameter grid and runs the grid search.  The results from these classification tasks are gathered and tabulated later to compare
the different classifiers.
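
The hyper-parameter names in the grids (for example `regressor__C`) follow scikit-learn's `<step name>__<parameter>` convention for pipelines, where `regressor` is the name given to the classifier step. The sketch below shows that convention on a small synthetic problem; the dataset, the grid values, and `cv=3` are assumptions chosen so the search finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem standing in for the bank data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LogisticRegression(max_iter=1000)),
])

# Keys are <step name>__<parameter name>; here the step is called 'regressor'.
param_grid = {"regressor__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"{search.best_score_:.4f}")
```

Because the scaler is inside the pipeline, it is refit on the training folds of each cross-validation split, so no information from the held-out fold leaks into the scaling.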
#### <span style="font-weight:bold;">Results Tabulation</span>
This step tabulates the results gathered from all the classifications performed.  Depending on the size of the dataset, a specific trained model can be stored to a file
and read back later, so it can be used directly for prediction without retraining.

## <span style="font-weight:bold;">Classifier Comparison Result</span>

Basic Classifiers

|    Classifier Model    | Accuracy | Precision | Recall | F1 Score | Train Score | Test Score | Model Fit Time (s) |
|------------------------|----------|-----------|--------|----------|-------------|------------|--------------------|
|    LinearRegression    |  0.9131  |  0.9017   | 0.9131 |  0.9028  |   0.9082    |   0.9131   |     0.0550551      |
|  KNeighborsClassifier  |  0.9037  |  0.8926   | 0.9037 |  0.8962  |   0.9267    |   0.9037   |     0.0156393      |
|     SVMGridSearch      |  0.9134  |  0.9017   | 0.9134 |  0.9020  |   0.9185    |   0.9134   |     4.2046297      |
| DecisionTreeClassifier |  0.8889  |  0.8930   | 0.8889 |  0.8909  |   1.0000    |   0.8889   |     0.1003885      |


GridSearch Best Estimators

|    Classifier Model    | Accuracy | Precision | Recall | F1 Score | Train Score | Test Score | Model Fit Time (s) |                         hyper-parameters                         |
|------------------------|----------|-----------|--------|----------|-------------|------------|--------------------|------------------------------------------------------------------|
|    LinearRegression    |  0.9131  |  0.9017   | 0.9131 |  0.9028  |   0.9081    |   0.9131   |     6.2736654      |      {'regressor__C': 1, 'regressor__solver': 'liblinear'}       |
|  KNeighborsClassifier  |  0.9075  |  0.8934   | 0.9075 |  0.8948  |   1.0000    |   0.9075   |     52.1640265     | {'regressor__n_neighbors': 20, 'regressor__weights': 'distance'} |
|     SVMGridSearch      |  0.9118  |  0.8994   | 0.9118 |  0.8983  |   0.9130    |   0.9118   |    1197.6210067    |          {'regressor__C': 10, 'regressor__gamma': 0.01}          |
| DecisionTreeClassifier |  0.9127  |  0.9094   | 0.9127 |  0.9109  |   0.9345    |   0.9127   |     7.6676013      | {'regressor__max_depth': 10, 'regressor__min_samples_split': 10} |

##### Given the same dataset, all classifiers showed similar accuracy and precision, with similar training and testing scores. (The row labeled LinearRegression is the Logistic Regression model.)
##### For this set of banking data, the Logistic Regression and SVM classifiers gave relatively close results, which suggests the data favors a linear type of classification.
##### SVM and K-Nearest Neighbors, which favor non-linear classification, had by far the longest grid-search fit times (about 1,198 s and 52 s) while yielding similar accuracy.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer

from collections import defaultdict

import re
import string
import time
import pickle
import warnings

warnings.filterwarnings('ignore')

## Configuration
```
The configuration file BankTermDeposit.ini supplies the configuration that controls the process flow of the entire code.
Although this project is presented as a Jupyter Notebook, it can be converted to a Python script and
controlled by the same configuration file.

examples:

train_test_split = 0.3
model_prefix     = BankTermDesposit_
model_outputFile = BankDepositModel_
model_inputFile  = BankDepositModel_


[DATASET]
DataFile = ./data/bank-additional-full.csv
....
[MODELS]
...
LogisticRegression     = TRUE
SVMGridSearch          = TRUE
DecisionTreeClassifier = FALSE
KNNearestNeighbors     = TRUE
...
In this example, DataFile specifies the data file path, and the train/test split fraction is set to 0.3.  Per-classifier model files are prefixed
with model_prefix, and the combined trained-model file is written to and read from the names given by model_outputFile and model_inputFile.
```
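
The notebook compares the raw option strings against `'TRUE'`. A possible alternative, sketched below with an inline stand-in for the .ini file, is to use `configparser`'s typed getters, which also accept lowercase or `yes`/`no` values and can supply a fallback for missing keys. Only a few keys are reproduced here, and placing `train_test_split` under `[DEFAULT]` is inferred from how the notebook reads it.

```python
import configparser

# Inline stand-in for BankTermDeposit.ini; only a few keys are reproduced.
ini_text = """
[DEFAULT]
train_test_split = 0.3

[MODELS]
LogisticRegression     = TRUE
DecisionTreeClassifier = FALSE
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# getboolean accepts TRUE/FALSE, yes/no, on/off and 1/0 in any letter case.
run_logistic = config.getboolean("MODELS", "LogisticRegression")
run_tree = config.getboolean("MODELS", "DecisionTreeClassifier")

# fallback supplies a value when the key is missing from the file.
run_knn = config.getboolean("MODELS", "KNNearestNeighbors", fallback=False)

# getfloat converts the option text to a float.
split = config.getfloat("DEFAULT", "train_test_split")
print(run_logistic, run_tree, run_knn, split)
```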


In [2]:
import configparser

current_time = str(time.time())

config = configparser.ConfigParser()
config.read('BankTermDeposit.ini')
print(f'Configuration Sections: {config.sections()}')

Proc_LogisticRegression     = config['MODELS']['LogisticRegression'    ]
Proc_SVMGridSearch          = config['MODELS']['SVMGridSearch'         ]
Proc_KNNearestNeighbors     = config['MODELS']['KNNearestNeighbors'    ]
Proc_DecisionTreeClassifier = config['MODELS']['DecisionTreeClassifier']

print(f'LogisticRegression     : {Proc_LogisticRegression    }')
print(f'KNNearestNeighbors     : {Proc_KNNearestNeighbors    }')
print(f'SVMGridSearch          : {Proc_SVMGridSearch         }')
print(f'DecisionTreeClassifier : {Proc_DecisionTreeClassifier}')

model_outFile = config['DEFAULT']['model_outputFile'] + current_time + '.pkl'
model_inFile  = config['DEFAULT']['model_inputFile' ] + current_time + '.pkl'
model_prefix_ = config['DEFAULT']['model_prefix'    ]
print(f'Model Output File: {model_outFile}')

grid_search_verbose = int(config['PROCESS']['gridSearchVerbose'])
print(f'grid_search_verbose: {grid_search_verbose}')

dataset_file  = config['DATASET']['DataFile']
print(f'DataFile: {dataset_file}')

dataset_split = float(config['DEFAULT']['train_test_split'])
print(f'dataset_split: {dataset_split}')

readback_test = config['MODELS']['ReadBackTest']
print(f'readback_test: {readback_test}')



Configuration Sections: ['DATASET', 'FEATURE_PROCESSING', 'PROCESS', 'MODELS']
LogisticRegression     : TRUE
KNNearestNeighbors     : TRUE
SVMGridSearch          : TRUE
DecisionTreeClassifier : TRUE
Model Output File: BankDepositModel_1724741541.9205184.pkl
grid_search_verbose: 3
DataFile: ./data/bank-additional-full.csv
dataset_split: 0.3
readback_test: FALSE


## Data Pre-Processing
### Remove Duplicates
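Exact duplicate rows are dropped right after loading and again after encoding.  The run shown below finds 1 duplicate in the second pass and ends with 41,175 rows.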
### Strip extra "" characters
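Although the file has a .csv extension, its fields are delimited by ';' and its text fields are wrapped in double quotes.  Read with pd.read_csv's default comma delimiter, each row lands in a single column,
so the code below splits each row on ';' and strips the quote characters before populating the proper columns.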
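
As an alternative to the manual split, `pd.read_csv` can parse this layout directly when given the delimiter. The sketch below uses two made-up rows that imitate the file's format (`;` separated, text fields quoted); it is an illustration under that assumption, not the loading code used in this notebook.

```python
import io
import pandas as pd

# Two made-up rows in the assumed layout: ';' separated, text fields quoted.
raw = io.StringIO(
    '"age";"job";"marital";"y"\n'
    '56;"housemaid";"married";"no"\n'
    '37;"services";"married";"yes"\n'
)

# sep=';' sets the delimiter; the default quotechar '"' removes the quotes.
df = pd.read_csv(raw, sep=';')

print(df.columns.tolist())
print(df.dtypes)
```

With this approach the numeric columns are typed automatically, so the later per-column conversion is only needed for the categorical fields.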
### Create DataFrame
Create a new data frame from the columns processed in the prior step.  Based on the dataset documentation, set up a dictionary that maps each column name to its type.
Three column types are found: numeric (integer), categorical, and binary (output class). 
Next, perform the same character replacement for all the data in the dataset and populate the entire data frame.
### Map / Convert 'month', 'day_of_week' columns
Map these columns to numeric values (jan = 1 through dec = 12, mon = 1 through sun = 7).
### Label Encoding Categorical and Binary Columns
For the remaining categorical and binary columns, apply LabelEncoder to convert the values to integer codes.
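
A minimal sketch of what `LabelEncoder` does to one column is shown below, using made-up values. The encoder sorts the distinct labels and assigns each an integer code, and `inverse_transform` maps the codes back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for a single categorical column.
marital = ["married", "single", "divorced", "married"]

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

print(list(encoder.classes_))                  # distinct labels in sorted order
print(codes.tolist())                          # integer code for each row
print(list(encoder.inverse_transform(codes)))  # back to the original labels
```

The integer codes imply an ordering that the categories themselves do not have. Distance-based and linear models may be sensitive to that, so one-hot encoding is a commonly used alternative for feature columns.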
### Training, Testing data split
Split the dataset and prepare for classification

In [3]:
model_name_prefix = model_prefix_
df_dataset = dataset_file

df_split   = dataset_split

df = pd.read_csv( df_dataset )
df.drop_duplicates(inplace=True)



In [4]:
########################################################################################
##  Dataset Pre-Processing                                                            ##
########################################################################################
dataset_dtypes = {
    'age'            : 'integer',
    'job'            : 'categorical',
    'marital'        : 'categorical',
    'education'      : 'categorical',
    'default'        : 'categorical',
    'housing'        : 'categorical',
    'loan'           : 'categorical',
    'contact'        : 'categorical',
    'month'          : 'categorical',
    'day_of_week'    : 'categorical',
    'duration'       : 'integer',
    'campaign'       : 'integer',
    'pdays'          : 'integer',
    'previous'       : 'integer',
    'poutcome'       : 'categorical',
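    # The next five columns hold decimal values in the source data.  Tagging them
    # 'integer' truncates them to whole numbers (the results in this document were
    # produced that way); tag them 'float' to keep the fractional part.
    'emp.var.rate'   : 'integer',
    'cons.price.idx' : 'integer',
    'cons.conf.idx'  : 'integer',
    'euribor3m'      : 'integer',
    'nr.employed'    : 'integer',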
    'y'              : 'binary',
}

# remove extra "" from column name, and field data
data_column = df.columns
col_names = data_column.str.split(';')
col_names = [[elem.replace('"', '') for elem in sublist] for sublist in col_names] 

col_names = [item for sublist in col_names for item in sublist]     # flatten column names

orig_column = df.columns[0]
#print(f'{type(orig_column), {orig_column}}')
# Split the data in the original column into multiple columns
df[orig_column] = df[orig_column].astype(str)

split_data = df[orig_column].str.split(';', expand=True)

# remove extra "" for entire data
split_data = split_data.applymap(lambda x: x.replace('"', '') if x else x)

# Check the number of columns in split_data
num_cols = split_data.shape[1]

# Generate column names dynamically or trim the existing col_names list
if len(col_names) == num_cols:
    split_data.columns = col_names
else:
    split_data.columns = [f'{orig_column}_{i}' for i in range(num_cols)]

df = split_data.copy()

# map column to documented data type
for col in df.columns:
    dftype = dataset_dtypes[col]
    if dftype == 'integer':
        df[col] = pd.to_numeric( df[col],errors='coerce').astype('int64')
    elif dftype == 'float':
        df[col] = pd.to_numeric( df[col],errors='coerce').astype('float64')
    else:
        df[col] = df[col].astype('category')
#    print(f'[{col}] ; {df[col].dtype}')
    


In [5]:
# Mapping Month and Weekday Features
month_mapping = {
    'jan'  : 1, 'feb' : 2,  'mar'  : 3,  'apr' : 4,
    'may'  : 5, 'jun' : 6,  'jul'  : 7,  'aug' : 8,
    'sep'  : 9, 'oct' : 10, 'nov'  : 11, 'dec' : 12
}

weekday_mapping = {
    'mon': 1, 'tue' : 2, 'wed' : 3, 'thu': 4,
    'fri': 5, 'sat' : 6, 'sun' : 7
}

# LabelEncode the rest of the columns to numeric values
label_encoders = {}
for column in df.select_dtypes(include=['category']).columns:
    if column not in ['month', 'day_of_week']:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le
        
df['month']       = df['month'].map(month_mapping)
df['day_of_week'] = df['day_of_week'].map(weekday_mapping)



In [6]:
## print(f'{df.head(10)}')
# Display the number of missing values in each column
print("\n Missing Values Count per Column:")
print(df.isnull().sum())

# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows: {duplicate_rows.shape[0]}")

# Remove duplicate rows
df.drop_duplicates(inplace=True)
print(f"Number of rows after removing duplicates: {df.shape[0]}")




 Missing Values Count per Column:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
Number of duplicate rows: 1
Number of rows after removing duplicates: 41175


In [7]:
####################################################################
##  Dataset Preprocessing                                         ##
##  checking                                                      ##
####################################################################

x = df.drop(columns='y')
y  = df['y']
print(f'{x.head()}')
for col in x.columns:
    print(f'[{col}] : {df[col].dtype}')
    

   age  job  marital  education  default  housing  loan  contact month  \
0   56    3        1          0        0        0     0        1     5   
1   57    7        1          3        1        0     0        1     5   
2   37    7        1          3        0        2     0        1     5   
3   40    0        1          1        0        0     0        1     5   
4   56    7        1          3        0        0     2        1     5   

  day_of_week  duration  campaign  pdays  previous  poutcome  emp.var.rate  \
0           1       261         1    999         0         1             1   
1           1       149         1    999         0         1             1   
2           1       226         1    999         0         1             1   
3           1       151         1    999         0         1             1   
4           1       307         1    999         0         1             1   

   cons.price.idx  cons.conf.idx  euribor3m  nr.employed  
0              93          

In [8]:
# Split the data for training and testing

from collections import defaultdict

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=df_split, random_state=12)

models_data = defaultdict(list)


In [9]:
####################################################################
##  Perform training data fitting, and the compute classification ##
##  result.  Keep track of model training time                    ##   
####################################################################
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def ModelClassification(regressor, regressor_text, X_train,y_train, X_test,y_test):

    start_time = time.time()

    model_regressor = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('regressor', regressor)
    ])
    model_fit_param = model_regressor.fit(X_train, y_train)

    model_train_time = time.time() - start_time

    y_pred                 = model_regressor.predict (X_test)
    model_accuracy         = accuracy_score(y_test,y_pred)
    model_confusion_matrix = confusion_matrix(y_test, y_pred)
    
    report_classification  = classification_report(y_test,y_pred)
    report_classification_ = classification_report(y_test,y_pred,output_dict=True)

    model_f1_score         = f1_score( y_test,y_pred,average='weighted')

    training_score = model_regressor.score(X_train,y_train)
    testing_score  = model_regressor.score(X_test,y_test)

    classifier_report = {
        'train_time'       : [ model_train_time ],
        'fit_parameters'   : [ model_fit_param ],
        'model_regressor'  : [ model_regressor ],
        'model_accuracy'   : [ model_accuracy ],
        'model_precision'  : [report_classification_['weighted avg']['precision']],
        'model_recall'     : [report_classification_['weighted avg']['recall']],
        'model_f1_score'   : [report_classification_['weighted avg']['f1-score']],
        'confusion_matrix' : [ model_confusion_matrix ],
        'classify_report'  : [ report_classification, report_classification_ ],
        'train_score'      : [ training_score ],
        'test_score'       : [ testing_score ]
    }
##    print(f'regressor: {regressor_text}\n{classifier_report}')
    return classifier_report



In [10]:
####################################################################
##  Perform training data fitting, and the compute classification ##
##  result.  Keep track of model training time/ This is the Grid  ##
##  Search version                                                ##   
####################################################################

def ModelClassification_GridSearch(regressor, regressor_text, X_train,y_train,X_test,y_test,param_grid):
    start_time = time.time()

    model_regressor = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('regressor', regressor)
    ])
    grid_search = GridSearchCV(model_regressor, param_grid, cv=10, scoring='accuracy')
    model_fit_param = grid_search.fit(X_train, y_train)
    best_regressor  = grid_search.best_estimator_

    model_train_time = time.time() - start_time

    y_pred                 = best_regressor.predict(X_test)
    model_accuracy         = accuracy_score(y_test,y_pred)
    model_confusion_matrix = confusion_matrix(y_test,y_pred)
    
    report_classification = classification_report(y_test,y_pred)
    report_classification_= classification_report(y_test,y_pred, output_dict=True)
    model_f1_score        = f1_score(y_test,y_pred,average='weighted')

    grid_training_score = best_regressor.score(X_train,y_train)
    grid_testing_score  = best_regressor.score(X_test, y_test )

    classifier_report = {
        'train_time'       : [ model_train_time ],
        'model_regressor'  : [ grid_search ],
        'model_accuracy'   : [ model_accuracy ],
        'model_precision'  : [report_classification_['weighted avg']['precision']],
        'model_recall'     : [report_classification_['weighted avg']['recall']],
        'model_f1_score'   : [report_classification_['weighted avg']['f1-score']],
        'fit_parameters'   : [ model_fit_param ],
        'confusion_matrix' : [ model_confusion_matrix ],
        'classify_report'  : [ report_classification,report_classification_ ],
        'train_score'      : [ grid_training_score ],
        'test_score'       : [ grid_testing_score ]
    }
##    print(f'regressor: {regressor_text}\n{classifier_report}')
    return classifier_report
    

In [11]:
####################################################################
## Perform the basic Logistic Regression, and its GridSearchCV    ##
## version with the supplied search hyper-parameter ranges.       ##
## Process controlled by configuration file.                      ##
####################################################################
if Proc_LogisticRegression == 'TRUE':
    classifier_report = ModelClassification(LogisticRegression(), 'Logistic Regression', X_train, y_train, X_test, y_test) 
    param_grid_lr = {
        'regressor__C': [0.1, 1, 10],
        'regressor__solver': ['liblinear', 'saga']
    }
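    # Grid search over the regularization strength C and the solver.
    classifier_gridsearch_report = ModelClassification_GridSearch(LogisticRegression(), 'Logistic Regression GridSearch', X_train, y_train, X_test, y_test, param_grid_lr)
    # NOTE: the results are stored under the key 'LinearRegression', although the model is LogisticRegression.
    models_data['LinearRegression'] = [ classifier_report, classifier_gridsearch_report ]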

    model_outputFile_ = model_prefix_ + 'LinearRegression' + current_time +'.pkl'
    with open( model_outputFile_,'wb') as model_file:
        pickle.dump(models_data['LinearRegression'],model_file)
        print(f'LinearRegression model save to: {model_outputFile_}')

LinearRegression model save to: BankTermDesposit_LinearRegression1724741541.9205184.pkl


In [12]:
####################################################################
## Perform the basic KNNeighbors regression, and its GridSearchCV ##
## version with the supplied search hyper-parameter ranges.       ##
## Process controlled by configuration file.                      ##
####################################################################

if Proc_KNNearestNeighbors == 'TRUE':
    classifier_report = ModelClassification(KNeighborsClassifier(), 'KNeighborsClassifier', X_train, y_train, X_test, y_test) 
    
    param_grid_knn = {
        'regressor__n_neighbors': np.arange(1, 21),
        'regressor__weights': ['uniform', 'distance']
    }
    
    classifier_gridsearch_report = ModelClassification_GridSearch(KNeighborsClassifier(), 'KNeighborsClassifier', X_train, y_train, X_test, y_test, param_grid_knn) 
    models_data['KNeighborsClassifier'] = [ classifier_report, classifier_gridsearch_report ]

    model_outputFile_ = model_prefix_ + 'KNeighborsClassifier' + current_time +'.pkl'
    with open( model_outputFile_,'wb') as model_file:
        pickle.dump(models_data['KNeighborsClassifier'],model_file)
        print(f'KNeighborsClassifier model save to: {model_outputFile_}')
                   

KNeighborsClassifier model save to: BankTermDesposit_KNeighborsClassifier1724741541.9205184.pkl


In [13]:
####################################################################
## Perform the basic Support Vector Machines,and its GridSearchCV ##
## version with the supplied search hyper-parameter ranges.       ##
## Process controlled by configuration file.                      ##
####################################################################

if Proc_SVMGridSearch == 'TRUE':
    classifier_report = ModelClassification(SVC(), 'SVMGridSearch', X_train, y_train, X_test, y_test) 
    param_grid_svc = {
        'regressor__C': [0.1,1, 10], 
        'regressor__gamma': [1,0.1,0.01]
    } 
    
    classifier_gridsearch_report = ModelClassification_GridSearch(SVC(), 'SVMGridSearch', X_train, y_train, X_test, y_test, param_grid_svc) 
    models_data['SVMGridSearch'] = [ classifier_report, classifier_gridsearch_report ]

    model_outputFile_ = model_prefix_ + 'SVMGridSearch' + current_time +'.pkl'
    with open( model_outputFile_,'wb') as model_file:
        pickle.dump(models_data['SVMGridSearch'],model_file)
        print(f'SVMGridSearch model save to: {model_outputFile_}')
    

SVMGridSearch model save to: BankTermDesposit_SVMGridSearch1724741541.9205184.pkl


In [14]:
#########################################################################
## Perform the basic DecisionTree Classification, and its GridSearchCV ##
## version with the supplied search hyper-parameter ranges.            ##
## Process controlled by configuration file.                           ##
#########################################################################

if Proc_DecisionTreeClassifier == 'TRUE':
    classifier_report = ModelClassification(DecisionTreeClassifier(), 'DecisionTreeClassifier', X_train, y_train, X_test, y_test) 
    param_grid_dtc = {
        'regressor__max_depth': [None, 10, 20],
        'regressor__min_samples_split': [2, 5, 10]
    }
    
    classifier_gridsearch_report = ModelClassification_GridSearch(DecisionTreeClassifier(), 'DecisionTreeClassifier', X_train, y_train, X_test, y_test, param_grid_dtc) 
    models_data['DecisionTreeClassifier'] = [ classifier_report, classifier_gridsearch_report ]

    model_outputFile_ = model_prefix_ + 'DecisionTreeClassifier' + current_time +'.pkl'
    with open( model_outputFile_,'wb') as model_file:
        pickle.dump(models_data['DecisionTreeClassifier'],model_file)
        print(f'DecisionTreeClassifier model save to: {model_outputFile_}')
    

DecisionTreeClassifier model save to: BankTermDesposit_DecisionTreeClassifier1724741541.9205184.pkl


In [15]:
#########################################################################
## Tabulate all the results from the classification models.  Output    ##
## model to file for another system to go directly to prediction       ##
## without training the model.                                         ##
#########################################################################
from tabulate import tabulate

f_row, f_acc, f_prec, f_f1, f_recall = 'weighted avg','model_accuracy','model_precision','model_f1_score','model_recall'
f_train, f_trainscore, f_testscore = 'train_time','train_score','test_score'

headers = ['Classifier Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'Train Score', 'Test Score', 'Model Fit Time (s)']
tab_data_basic = []
tab_data_grids = []
for model, report_ in models_data.items():
    c_report      = report_[0]
    b_accuracy    = c_report[f_acc][0]
    b_precision   = c_report[f_prec][0]
    b_recall      = c_report[f_recall][0]
    b_f1_score    = c_report[f_f1][0]
    b_train_time  = c_report[f_train][0]
    b_train_score = c_report[f_trainscore][0]
    b_test_score  = c_report[f_testscore][0]
    b_model       = c_report['model_regressor'][0]

    g_report      = report_[1]
    g_accuracy    = g_report[f_acc][0]
    g_precision   = g_report[f_prec][0]
    g_recall      = g_report[f_recall][0]
    g_f1_score    = g_report[f_f1][0]
    g_train_time  = g_report[f_train][0]
    g_train_score = g_report[f_trainscore][0]
    g_test_score  = g_report[f_testscore][0]
    gsearch       = g_report['model_regressor'][0]

    tdata = [ f'{model}', f'{b_accuracy:.4f}', f'{b_precision:.4f}',  f'{b_recall:.4f}',  f'{b_f1_score:.4f}', f'{b_train_score:.4f}', f'{b_test_score:.4f}', f'{b_train_time:.7f}' ]
    tab_data_basic.append( tdata )

    tdata_g = [ f'{model}', f'{g_accuracy:.4f}', f'{g_precision:.4f}',  f'{g_recall:.4f}',  f'{g_f1_score:.4f}', f'{g_train_score:.4f}' ,f'{g_test_score:.4f}', f'{g_train_time:.7f}', f'{gsearch.best_params_}' ]
    tab_data_grids.append( tdata_g )

print()
##print(f'*****  DataSet: {df_dataset}, test/train split: {df_split}  *****')
print()
basic_table = tabulate(tab_data_basic, headers, tablefmt='pretty')
title = 'Basic Classifiers'
basic_full_table = f'{title}\n{basic_table}'
print(basic_full_table)
print()

gs_headers = ['Classifier Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'Train Score', 'Test Score', 'Model Fit Time (s)', 'hyper-parameters']
gs_table = tabulate(tab_data_grids, gs_headers, tablefmt='pretty')
title = 'GridSearch Best Estimators'
gs_full_table = f'{title}\n{gs_table}'
print(gs_full_table)
print()

import pickle
with open(model_outFile,'wb') as model_file:
    pickle.dump(models_data, model_file)
    print(f'saved models to {model_outFile}')





Basic Classifiers
+------------------------+----------+-----------+--------+----------+-------------+------------+--------------------+
|    Classifier Model    | Accuracy | Precision | Recall | F1 Score | Train Score | Test Score | Model Fit Time (s) |
+------------------------+----------+-----------+--------+----------+-------------+------------+--------------------+
|    LinearRegression    |  0.9131  |  0.9017   | 0.9131 |  0.9028  |   0.9082    |   0.9131   |     0.0597091      |
|  KNeighborsClassifier  |  0.9037  |  0.8926   | 0.9037 |  0.8962  |   0.9267    |   0.9037   |     0.0281515      |
|     SVMGridSearch      |  0.9134  |  0.9017   | 0.9134 |  0.9020  |   0.9185    |   0.9134   |     3.9079716      |
| DecisionTreeClassifier |  0.8877  |  0.8921   | 0.8877 |  0.8898  |   1.0000    |   0.8877   |     0.1100681      |
+------------------------+----------+-----------+--------+----------+-------------+------------+--------------------+

GridSearch Best Estimators
+-------

In [16]:
import pickle
if readback_test == 'TRUE':
    with open(model_outFile,'wb') as model_file:
        pickle.dump(models_data, model_file)
        print(f'saved models to {model_outFile}')
    
    # Load file back ###########################################################
    with open(model_inFile,'rb') as input_model:
        mdata = pickle.load(input_model)
        print(f'Loaded model data from: {model_inFile}')
    
    tab1_data_basic = []
    tab1_data_grids = []
    model_detectors = defaultdict()
    
    for model, report_ in mdata.items():
        c_report      = report_[0]
        b_accuracy    = c_report[f_acc][0]
        b_precision   = c_report[f_prec][0]
        b_recall      = c_report[f_recall][0]
        b_f1_score    = c_report[f_f1][0]
        b_train_time  = c_report[f_train][0]
        b_train_score = c_report[f_trainscore][0]
        b_test_score  = c_report[f_testscore][0]
    
        g_report      = report_[1]
        g_accuracy    = g_report[f_acc][0]
        g_precision   = g_report[f_prec][0]
        g_recall      = g_report[f_recall][0]
        g_f1_score    = g_report[f_f1][0]
        g_train_time  = g_report[f_train][0]
        g_train_score = g_report[f_trainscore][0]
        g_test_score  = g_report[f_testscore][0]
        gsearch       = g_report['model_regressor'][0]
    
        model_detectors[ model ] = gsearch
    
        tdata = [ f'{model}', f'{b_accuracy:.4f}', f'{b_precision:.4f}',  f'{b_recall:.4f}',  f'{b_f1_score:.4f}', f'{b_train_score:.4f}', f'{b_test_score:.4f}', f'{b_train_time:.7f}' ]
        tab1_data_basic.append( tdata )
    
        tdata_g = [ f'{model}', f'{g_accuracy:.4f}', f'{g_precision:.4f}',  f'{g_recall:.4f}',  f'{g_f1_score:.4f}', f'{g_train_score:.4f}' ,f'{g_test_score:.4f}', f'{g_train_time:.7f}', f'{gsearch.best_params_}' ]
        tab1_data_grids.append( tdata_g )
    
    print()
    ##print(f'*****  DataSet: {df_dataset}, test/train split: {df_split}  *****')
    print()
    basic_table = tabulate(tab1_data_basic, headers, tablefmt='pretty')
    title = 'Basic Classifiers'
    basic_full_table = f'{title}\n{basic_table}'
    print(basic_full_table)
    print()
    
    gs_headers = ['Classifier Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'Train Score', 'Test Score', 'Model Fit Time (s)', 'hyper-parameters']
    gs_table = tabulate(tab1_data_grids, gs_headers, tablefmt='pretty')
    title = 'GridSearch Best Estimators'
    gs_full_table = f'{title}\n{gs_table}'
    print(gs_full_table)
    print()
    