# Basic Machine Learning Example

### This example walks through a simple manual process for coming up with an effective model for a classification problem



In [1]:
# import dependencies

import numpy as np
import pandas as pd
import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV, KFold, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
import tqdm

## EVALUATE FUNCTION

Since we will be trying lots of different models, it would be nice to have a single function that will evaluate all our models and provide a standardized reporting format.

This will allow us to easily pick out the model we want to move forward with.

This function takes in a model (or pipeline) and our train/test split data. It fits the model on the training data, predicts on both sets, and prints the F1 score and accuracy for each.

In [3]:
def evaluate(pipeline, X_train, X_test, y_train, y_test):
    '''
    Fit a pipeline on the training data and print F1 and accuracy
    for both the training and test sets
    '''
    pipeline.fit(X_train, y_train)
    y_train_hat = pipeline.predict(X_train)
    y_test_hat = pipeline.predict(X_test)
    # sklearn metrics expect (y_true, y_pred), in that order
    train_f1 = f1_score(y_train, y_train_hat)
    train_acc = accuracy_score(y_train, y_train_hat)
    test_f1 = f1_score(y_test, y_test_hat)
    test_acc = accuracy_score(y_test, y_test_hat)

    print(f"========== Predictor: {type(pipeline).__name__} ==========")
    print(f"Training result: f1: {train_f1:.3f}, acc: {train_acc:.3f}")
    print(f"Test result: f1: {test_f1:.3f}, acc: {test_acc:.3f}")
    print()
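
To see why the function reports F1 alongside accuracy, here is a minimal, dependency-free sketch of both metrics for binary labels. The helper names and the toy labels are illustrative assumptions and not part of the notebook; in the notebook itself `f1_score` and `accuracy_score` from scikit-learn do this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical toy data: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10                    # never predicts the positive class
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # finds one of the two positives

for name, y_pred in [("always_negative", always_negative), ("one_hit", one_hit)]:
    print(f"{name}: acc={accuracy(y_true, y_pred):.3f}, f1={f1_binary(y_true, y_pred):.3f}")
```

On this toy data both predictors have the same accuracy, but only the second one ever finds a positive case, and F1 is the number that shows the difference.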


## DATA

In this case we are reading in blood transfusion data. With this data we are trying to predict whether an individual gave blood in March 2007 based on four features.

#### The features are:
- Recency -> Months since the individual last gave blood
- Frequency -> Total number of times the individual has given blood
- Monetary -> Total amount of blood given, in c.c.
- Time -> Months since the individual first gave blood

In [17]:
# load dataset
df = pd.read_csv("transfusion.csv")
df.head(10)

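| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |
| 5 | 4 | 4 | 1000 | 4 | 0 |
| 6 | 2 | 7 | 1750 | 14 | 1 |
| 7 | 1 | 12 | 3000 | 35 | 0 |
| 8 | 2 | 9 | 2250 | 22 | 1 |
| 9 | 5 | 46 | 11500 | 98 | 1 |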


## Extract Features and Target

In [5]:
# build X and y matrices
X = df.drop(['whether he/she donated blood in March 2007'], axis=1)
y = df[['whether he/she donated blood in March 2007']].values.reshape(-1)


## Preliminary Data Analysis

In [6]:
# make sure there is no nan
# if there is nan, you need to deal with it, either by imputing or discarding
df.isnull().sum(axis = 0)

Recency (months)                              0
Frequency (times)                             0
Monetary (c.c. blood)                         0
Time (months)                                 0
whether he/she donated blood in March 2007    0
dtype: int64

## Data Cleanup

Had the check above (or any others we may want to add) turned up issues, a lot more code could be required here to address them.
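
The two options named in the null check, imputing or discarding, can be sketched without any libraries as follows. The rows and column names here are made up for illustration; with pandas the same ideas would typically be expressed with `DataFrame.dropna` and `DataFrame.fillna`.

```python
from statistics import median

# Hypothetical rows with a missing value (None) in "frequency"
rows = [
    {"recency": 2, "frequency": 50},
    {"recency": 0, "frequency": None},
    {"recency": 1, "frequency": 16},
    {"recency": 4, "frequency": 4},
]


def drop_missing(rows):
    """Discard every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows, column):
    """Return new rows with missing values in one column set to its median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


dropped = drop_missing(rows)                 # loses one row
imputed = impute_median(rows, "frequency")   # keeps all rows, fills the gap
print(len(dropped), imputed[1])
```

Discarding is simpler but shrinks the dataset; imputing keeps every row at the cost of inventing a plausible value.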

## Train Test Split

The stratify argument keeps the class proportions of y the same in the training and test sets, so both sets contain a similar share of donors and non-donors.

In [7]:
# split to training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
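
As a rough picture of what stratification does, the sketch below splits each class separately and then combines the pieces. It is a hand-rolled illustration under assumed toy labels, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class shares in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test portion
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 80 negatives, 20 positives
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"positive share: train={train_share:.2f}, test={test_share:.2f}")
```

Without stratification, a random 20% sample of a small imbalanced dataset could end up with noticeably more or fewer positives than the training set, which would distort the test metrics.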

## Pick A Baseline Model To Evaluate Other Models Against

In this case we are choosing Logistic Regression.

In [8]:
# try LogisticRegression to establish a baseline performance
pipeline = Pipeline([
    ('scale', StandardScaler()), # remember to scale first before feeding data into lgr
    ('lgr', LogisticRegression()),
])
evaluate(pipeline, X_train, X_test, y_train, y_test)

Training result: f1: 0.201, acc: 0.774
Test result: f1: 0.186, acc: 0.767
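
The scaling step in the pipeline can be pictured with the sketch below, which applies the same per-column formula that `StandardScaler` uses: subtract the mean and divide by the population standard deviation. The `standardize` helper is a hypothetical name, and the values are the first five rows shown by `df.head(10)`.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# First five "Monetary (c.c. blood)" and "Recency (months)" values from the table
monetary = [12500, 3250, 4000, 5000, 6000]
recency = [2, 0, 1, 2, 1]

monetary_scaled = standardize(monetary)
recency_scaled = standardize(recency)

# After scaling, both columns live on a comparable scale
print([round(v, 2) for v in monetary_scaled])
print([round(v, 2) for v in recency_scaled])
```

Inside a `Pipeline`, the scaler's mean and standard deviation are learned from the training data during `fit` and then reused unchanged when predicting on the test data.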



## Now Let's Try A Few More...

In [19]:
# try other predictors
evaluate(XGBClassifier(n_jobs=-1), X_train, X_test, y_train, y_test)
evaluate(LGBMClassifier(n_jobs=-1), X_train, X_test, y_train, y_test)
evaluate(RandomForestClassifier(n_jobs=-1), X_train, X_test, y_train, y_test)
evaluate(GradientBoostingClassifier(), X_train, X_test, y_train, y_test)

Training result: f1: 0.798, acc: 0.916
Test result: f1: 0.471, acc: 0.760

Training result: f1: 0.672, acc: 0.871
Test result: f1: 0.464, acc: 0.753

Training result: f1: 0.855, acc: 0.938
Test result: f1: 0.435, acc: 0.740

Training result: f1: 0.589, acc: 0.846
Test result: f1: 0.467, acc: 0.787
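
To compare the candidates side by side, the F1 scores printed so far can be collected and ranked as in the sketch below. It assumes each output block belongs to the model evaluated in the same position, since the header lines are not shown in the outputs; the dictionary layout is just one convenient choice.

```python
# F1 scores copied from the outputs above, in the order the models were evaluated
results = {
    "LogisticRegression": {"train_f1": 0.201, "test_f1": 0.186},
    "XGBClassifier": {"train_f1": 0.798, "test_f1": 0.471},
    "LGBMClassifier": {"train_f1": 0.672, "test_f1": 0.464},
    "RandomForestClassifier": {"train_f1": 0.855, "test_f1": 0.435},
    "GradientBoostingClassifier": {"train_f1": 0.589, "test_f1": 0.467},
}

# Rank by test F1 and report the train/test gap as a rough overfitting signal
ranked = sorted(results.items(), key=lambda kv: kv[1]["test_f1"], reverse=True)
for name, scores in ranked:
    gap = scores["train_f1"] - scores["test_f1"]
    print(f"{name:<28} test f1: {scores['test_f1']:.3f}  gap: {gap:.3f}")

best_name = ranked[0][0]
```

The test F1 scores of the four tree-based models are close to each other, while their train/test gaps differ a lot, so the gap is worth reading alongside the ranking.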



## Let's Pick a Final Model To Move Forward With

From the above evaluations, XGBClassifier has the highest test F1 score (0.471), so it looks like a promising candidate.

We will then tune the classifier's hyperparameters to come up with the best model we can.

## Let's Create Our Tuning Object

In [10]:
# RandomizedSearchCV on XGB
xgb_param_grid = {
    'n_estimators': [10, 20, 50, 100, 200, 300, 400],
    'max_depth': np.arange(5, 20),
    'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2],
    'subsample': np.arange(0.5, 1.0, 0.05),
    'min_child_weight': np.arange(1, 10),
    'colsample_bytree': np.arange(0.2, 1.0, 0.1),
    'gamma': [0, 0.001, 0.002, 0.003, 0.004, 0.005, 1e-2],
    'n_jobs': [-1]
}


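## Let's Find The Best Model We Can

RandomizedSearchCV does not try every combination above. It samples n_iter=100 random combinations from the grid, scores each one with 5-fold cross-validation, and keeps the combination with the best mean F1 score (scoring='f1').

That best model is found in the best_estimator_ attribute of the RandomizedSearchCV object.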
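
The idea behind a randomized search can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: `random_search`, the toy grid, and the made-up scoring function that stands in for cross-validated F1. Unlike scikit-learn, which samples without replacement when every parameter is given as a list, this sketch may draw the same combination more than once.

```python
import random


def random_search(param_grid, score_fn, n_iter=10, seed=0):
    """Sample n_iter random settings from param_grid and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid, much smaller than the one used in the notebook
toy_grid = {"max_depth": [3, 5, 7, 9], "learning_rate": [0.01, 0.05, 0.1, 0.2]}


def toy_score(params):
    # Made-up objective that peaks at max_depth=5, learning_rate=0.1
    return -abs(params["max_depth"] - 5) - 10 * abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(toy_grid, toy_score, n_iter=10)
print(best_params, round(best_score, 3))
```

Because only a sample of the grid is scored, the result is the best of the combinations that happened to be drawn, not necessarily the best combination in the grid.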

In [20]:
predictor = XGBClassifier()
rs = RandomizedSearchCV(predictor, xgb_param_grid, cv=5, scoring='f1', n_jobs=-1, n_iter=100, verbose=1)
rs.fit(X_train, y_train)
evaluate(rs.best_estimator_, X_train, X_test, y_train, y_test)


Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 419 tasks      | elapsed:    9.2s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    9.6s finished


Training result: f1: 0.593, acc: 0.846
Test result: f1: 0.462, acc: 0.767



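## Evaluate Our Model Further

Now we are going to split the full dataset into 10 folds and, for each fold, train the tuned model on the other nine folds and score it on the held-out one. KFold does not shuffle by default, so the folds are consecutive blocks of rows, and cross_val_score reports accuracy by default for a classifier. The spread of the 10 scores helps us decide whether we want to use this model.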

In [21]:
# evaluate model with kfold
kfold = KFold(n_splits=10)
results = cross_val_score(rs.best_estimator_, X, y, cv=kfold, n_jobs=-1)
print("Results: %.2f (%.2f) accuracy" % (results.mean(), results.std()))

Results: 0.77 (0.13) accuracy
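
For intuition about how an unshuffled k-fold split carves up the rows, here is a small sketch that produces the index sets by hand. `kfold_indices` is a hypothetical helper written for illustration; it follows the same sizing rule as `KFold` without shuffling, where the first `n_samples % n_splits` folds get one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for consecutive, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
test_folds = [test for _, test in folds]
for train, test in folds:
    print("train:", train, "test:", test)
```

Because the folds are consecutive blocks, any ordering in the rows carries over into the folds and can make per-fold scores vary more. `KFold(shuffle=True, random_state=...)` or `StratifiedKFold` are possible alternatives when that is a concern.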


## Save The Model For Future Use

In [22]:
# save model
with open('best_xgb_model.pickle', 'wb') as f:
    pickle.dump(rs.best_estimator_, f)
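
Loading the model back later is the mirror image of saving it. The sketch below shows the round trip with a plain dictionary standing in for the fitted estimator and a temporary directory standing in for the working folder; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way
model = {"name": "demo-model", "params": {"max_depth": 5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pickle")  # hypothetical file name

    # Save the object in binary mode
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. It is also safest to load a model with the same library versions that were used to save it.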