# SOLVING CLASSIFICATION PROBLEM

# USING ENSEMBLES AND SAVING A TRAINED MODEL FOR DEPLOYMENT

# Using Adult Income Dataset

The Adult dataset is from the Census Bureau, and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.

There are two class values in the dataset, ‘>50K’ and ‘<=50K’, so this is a binary classification task. The classes are imbalanced, with a skew toward the ‘<=50K’ class label.

‘>50K’: minority class, approximately 25%.
‘<=50K’: majority class, approximately 75%.
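
The class shares quoted above can be checked directly from the label column. The sketch below uses a small made-up Series (the name `income` and its values are illustrative stand-ins, not the real data) to show how `value_counts(normalize=True)` reports the proportions.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real `income` column.
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Share of each class, as fractions that sum to 1.
class_share = income.value_counts(normalize=True)
print(class_share)
```

On the real data, the same call on `Adult_data.income` would show how strong the skew actually is.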

# PART 1

In [None]:
import pandas as pd
import numpy as np

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from sklearn.model_selection import train_test_split
print("Setup complete.")

In [None]:
Adult_data = pd.read_csv("adult.csv")

Adult_data.head(10)

In [None]:
Adult_data.shape

We will also specify two lists, one containing the categorical columns and one containing the numeric columns of interest. The categorical columns of interest are: "workclass", "education", "marital-status", "occupation", "relationship", "race", "gender", "native-country". The numeric columns are: "age", "educational-num", "capital-gain", "capital-loss", "hours-per-week". We will exclude fnlwgt, as it is not a particularly useful variable.

In [None]:
CATEGORICAL_COLUMNS = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country"] 

# No categorical variables are used below because the data encoding stage was skipped. You should encode them and include them.
CONTINUOUS_COLUMNS = ["age", "educational-num", "capital-gain", "capital-loss",
                      "hours-per-week"]
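
One common way to carry out the skipped encoding stage is one-hot encoding. The sketch below is only an illustration on a made-up two-column frame (`toy` and its values are assumptions, not rows from adult.csv); it uses `pd.get_dummies` to turn a categorical column into indicator columns.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "Self-emp", "Private"],
})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["workclass"])
print(encoded.columns.tolist())
```

Applied to the notebook's data, the same call with `columns=CATEGORICAL_COLUMNS` would be one possible starting point; the encoding has to be applied identically to the train and test data.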

In [None]:
#Designate the input features as X
X= Adult_data[CONTINUOUS_COLUMNS]

In [None]:
X.head()

In [None]:
#Designate the outcome or target variable as y
y = Adult_data.income
y.head()

# Sampling (train, first test, and "validation"; validation here is a second test set that will later arise from the cross-validation process)

In [None]:
# The "validation data" (the second test data) helps in tuning
# the model at the training stage
# without necessarily touching the "original test data".

First, let us divide the Adult dataset into train/validation/test partitions. 
A partition of 20% will be used as the test data to evaluate our model at the very end (this stands in for unseen real-world data).
We will not use the test data until the final stage of testing the model. 
The remainder will be subdivided into train and validation data.
We can use the validation data to check any intermediate decision made while training our model on the train data.

A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning model’s hyperparameters.

The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
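
The three-way partition described above can be sketched with two calls to `train_test_split`. This is a minimal illustration on synthetic arrays (`X_toy`, `y_toy` and the 60/20/20 proportions are assumptions chosen for the example): the first call holds out the test set, the second carves a validation set out of what remains.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, alternating labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Step 1: hold out 20% as the final test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Step 2: take 25% of the remaining 80% as validation (20% of the total).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```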

random_state is a seed for the way the data is split. If you use the same seed in the future, with the same data, you will get the same rows in each of the training and validation sets as before. In other words, by setting random_state we seed the random number generator to make its behavior replicable.
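
A quick way to see this is to split the same array twice with the same seed and compare the results. The array `data` and the seed value 42 below are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # arbitrary toy data

# Two independent calls with the same seed.
first_train, first_test = train_test_split(data, test_size=0.25, random_state=42)
second_train, second_test = train_test_split(data, test_size=0.25, random_state=42)

# True when both calls selected the same rows in the same order.
same_split = np.array_equal(first_test, second_test)
print(same_split)
```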

The validation set essentially allows us to check how “overfitted” or “underfitted” our model is.

It allows us both to tune the model complexity to the sweet spot and to get a much better estimate of how the model will perform on unseen data, since the model does not use the validation data to train on.

Note that it is entirely normal (even probable) that the validation accuracy will be lower than the training accuracy. If the two are very similar and both are low, that can be a sign that your model is not complex enough (underfitted).

That said, the training accuracy doesn’t matter much on its own.

The only thing that matters is getting the best possible validation accuracy, since this is actually somewhat reflective of how the model will perform in the wild.
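
The gap between training and validation accuracy can be made visible by varying model complexity. The sketch below is an illustration on synthetic, deliberately noisy data (the dataset, the `flip_y` noise level and the depths tried are all assumptions); it fits decision trees of increasing depth and records both accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random (label noise).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until it fits the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_val, y_val))
    print(depth, scores[depth])
```

The unrestricted tree memorizes the training rows, so its training accuracy says little about how it will behave on new data; the validation column is the one to compare across depths.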

# The Trade-offs

More training data is nice because it means your model sees more examples and thus hopefully finds a better solution. If you have a tiny training data set your model won’t be able to learn general principles and will have bad validation / test set performance (in other words, it won’t work.)

More validation data is nice because it helps you make a better decision about which model is “The Best.” If you don’t have enough validation data, then there will be a lot of noise in your estimate of which model is “The Best” and you might not make a good choice.

More test data is nice because it gives you a better sense of how well your model generalizes to unseen data. If you don’t have enough test data, your final assessment of the model’s generalization ability might not be accurate.

###  Splitting into TRAIN and TEST DATA (we will create a second test set later, the "validation data", during the cross-validation process)

In [None]:
#Without stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
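
For comparison, a stratified split keeps the class proportions the same in both partitions, which can matter with imbalanced labels. The sketch below uses made-up labels with a 75/25 skew (`X_toy` and `y_toy` are illustrative) and passes them to the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 75 zeros and 25 ones.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 75/25 ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=123)

print(y_tr.mean(), y_te.mean())
```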

In [None]:
X_train.shape #Training data features

In [None]:
y_train.shape # Training data target

### Set the test data aside (do not touch it until the end)

In [None]:
X_test.shape # Is the original test data features. Don't touch this until the very last stage

In [None]:
y_test.shape # Is the original test data target. Don't touch this until the very last stage

## Automatic Outlier Detection

In [None]:
## Use DBSCAN 

In [None]:
#import the implementation of this algorithm from sklearn
from sklearn.cluster import DBSCAN

#Use the algorithm for outlier detection, the return in clusters will show the membership of each point
#Any point labelled as -1 is an outlier

outlier_detection = DBSCAN(min_samples = 3, eps = 3)
clusters = outlier_detection.fit_predict(X_train)

#Count total number of outliers as count of those labelled as -1
TotalOutliers=list(clusters).count(-1)
print("Total number of outliers identified is: ",TotalOutliers)

In [None]:
len(clusters)

In [None]:
np.unique(clusters, return_counts = True)

In [None]:
# select all rows that are not outliers and update 
#the X_train and y_train.
mask = clusters != -1
X1_train, y1_train = X_train[mask], y_train[mask]

In [None]:
# summarize the shape of the updated training dataset
print(X1_train.shape, y1_train.shape)
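
To make the -1 convention concrete, here is a small sketch on five hand-picked points (the coordinates and the `eps` and `min_samples` values are chosen only for the illustration). Four points sit close together and one lies far away, so DBSCAN should label the far one as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four tightly packed points and one distant point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])

# Points with fewer than min_samples neighbours within eps, and not reachable
# from a dense point, receive the label -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

keep = labels != -1  # boolean mask of the rows to retain
print(labels, keep.sum())
```

Because DBSCAN works on distances, the choice of `eps` depends on the scale of the features.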

In [None]:
# Use the DBSCAN labels as a reference for comparison (they are not true labels)
ground_truth = clusters
print ("Ground truth: \n", ground_truth)

In [None]:
len(ground_truth)

In [None]:
## Also use Isolation Forest

In [None]:
#import the implementation of this algorithm from sklearn
from sklearn.ensemble import IsolationForest

#Use the algorithm for outlier detection, then use it 
#to predict each point
#Any point labelled as -1 is an outlier

clf = IsolationForest(max_samples=150, random_state = 1, contamination= 'auto')
preds = clf.fit_predict(X_train)
print(preds)
#Count the outliers as the number of points labelled -1
totalOutliers = int((preds == -1).sum())
print("Total number of outliers identified is: ",totalOutliers)

#Count the disagreements with DBSCAN: points that Isolation Forest
#flags as outliers but DBSCAN placed in a cluster (any label other than -1)
newarray = (preds == -1) & (ground_truth != -1)

n_errors = int(newarray.sum())
print("Number of outliers not confirmed by DBSCAN: ",n_errors)
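
DBSCAN numbers its clusters 0, 1, 2, and so on, and reserves -1 for noise, so "not an outlier" corresponds to any label other than -1 rather than to label 0 alone. The sketch below uses two made-up label arrays (both are assumptions for the illustration) to count the points that Isolation Forest flags but DBSCAN does not.

```python
import numpy as np

# Hypothetical outputs of the two detectors for five points.
dbscan_labels = np.array([0, 1, -1, 0, 2])    # -1 = noise, >= 0 = cluster id
iforest_preds = np.array([1, -1, -1, 1, -1])  # -1 = outlier, 1 = inlier

# Flagged by Isolation Forest but assigned to some cluster by DBSCAN.
disagree = (iforest_preds == -1) & (dbscan_labels != -1)
n_disagree = int(disagree.sum())
print(n_disagree)
```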

In [None]:
np.unique(preds, return_counts = True)

In [None]:
newarray

In [None]:
# select all rows that are not outliers and update the X_train and y_train.
mask = preds != -1
X2_train, y2_train = X_train[mask], y_train[mask] # Henceforth we will continue with X2_train, y2_train.

In [None]:
# summarize the shape of the updated training dataset
print(X2_train.shape, y2_train.shape)
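
The same filtering step can be tried on data where the outlier is known in advance. The sketch below is an illustration only: it draws 200 points from a standard normal distribution, appends one extreme point by hand, and checks how Isolation Forest labels it (the data and parameter choices are assumptions).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(0, 1, size=(200, 2))  # bulk of the data
extreme = np.array([[12.0, 12.0]])         # one hand-placed extreme point
data = np.vstack([inliers, extreme])

forest = IsolationForest(random_state=1, contamination="auto").fit(data)
flags = forest.predict(data)               # -1 = outlier, 1 = inlier

filtered = data[flags != -1]               # keep only the rows not flagged
print(flags[-1], filtered.shape)
```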

### So we will go with the Isolation Forest results

# Balancing of the Data

In [None]:
y2_train.value_counts()

In [None]:
#Concatenate y2_train and X2_train to apply balancing; we will separate them again later.
df = pd.concat([y2_train, X2_train], axis=1)

In [None]:
df.head(10)

# Apply Up Sampling Technique

In [None]:
from sklearn.utils import resample

# Separate majority and minority classes. 

df_majority = df[df.income!=">50K"]
df_minority = df[df.income==">50K"]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),     # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Show dataset statistics
print(df_upsampled.describe())
 
# Display new class counts
df_upsampled.income.value_counts()
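
The number of rows to draw can be read from the data rather than typed in, which keeps the cell valid if the training set changes. A minimal sketch on a made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 6 majority rows and 2 minority rows.
toy = pd.DataFrame({"income": ["<=50K"] * 6 + [">50K"] * 2, "age": range(8)})
majority = toy[toy.income == "<=50K"]
minority = toy[toy.income == ">50K"]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=123)

balanced = pd.concat([majority, minority_up])
counts = balanced.income.value_counts()
print(counts)
```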


In [None]:
#Show the class distribution in the upsampled training data
sns.countplot(x= 'income', data=df_upsampled)

In [None]:
#Now check the NEW upsampled dataframe
df_upsampled.head()

In [None]:
#mapping the income column into numerical data using map function
df_upsampled['income'] = df_upsampled['income'].map({'<=50K': 0, '>50K': 1}).astype(int)

In [None]:
df_upsampled.head(10)

In [None]:
df_upsampled.tail(10)

We can see that the income attribute now holds numerical data. The pandas .map() function has replaced every ‘<=50K’ with 0 and every ‘>50K’ with 1, and .astype(int) ensures the replaced values are of type int.
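
One detail worth knowing about `.map()` with a dictionary: any value that is not a key in the dictionary becomes NaN, and a later `.astype(int)` would then fail. The sketch below uses a made-up Series in which the third label (`">50K."`, a hypothetical unexpected spelling) is not in the mapping.

```python
import pandas as pd

# Two expected labels and one hypothetical unexpected spelling.
labels = pd.Series(["<=50K", ">50K", ">50K."])

mapped = labels.map({"<=50K": 0, ">50K": 1})

# Unmapped values show up as NaN, so count them before casting to int.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```

Checking that this count is zero before calling `.astype(int)` is a cheap safeguard.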

In [None]:
df_upsampled.shape

In [None]:
df_upsampled.dtypes #Confirm that all columns have the right data type.

# Check for Missing Data

In [None]:
# get the number of missing data points per column
missing_values_count = df_upsampled.isnull().sum()
missing_values_count

In [None]:
#Save the cleaned training dataset in csv format
df_upsampled.to_csv('Adult_traindata.csv', index=False)

In [None]:
#Designate the input features as X
features=["age", "educational-num", "capital-gain", "capital-loss",
                      "hours-per-week"]
X= df_upsampled[features]
X.head()

In [None]:
#Designate the outcome or target variable as y
y = df_upsampled.income
y.head()

# Split the training dataframe into new train and validation data (X_test1, y_test1)

In [None]:
X_train, X_test1, y_train, y_test1 = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
X_train.shape

In [None]:
X_test1.shape

In [None]:
y_test1.shape

# TRAINING A SINGLE MODEL 

# Fit a Simple Logistic Regression Algorithm

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test1)

In [None]:
y_pred

In [None]:
print("Test_score : ", accuracy_score(y_test1, y_pred))
# compare accuracy of the actual with the predicted

In [None]:
from sklearn.metrics import confusion_matrix
print("Confusion Matrix")
print(confusion_matrix(y_test1, y_pred))

In [None]:
from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(y_test1, y_pred))

# Cross Validation with a Suite of other MACHINE LEARNING ALGORITHMS

In [None]:
# compare standalone models for binary classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    models['lr'] = LogisticRegression()
    models['knn'] = KNeighborsClassifier()
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    return models

# Using Cross Validation Approach for Training

In [None]:
# evaluate a given model using cross-validation 
#(Use the whole X_train and y_train), 
#since it does the partitioning by itself.

In [None]:
#The evaluate_model() function below takes a model 
#instance and returns a list of scores from three 
#repeats of stratified 10-fold cross-validation.

def evaluate_model(model,X_train, y_train):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
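
With 10 folds repeated 3 times, the function returns 30 accuracy values per model. The sketch below reproduces the same setup on synthetic data (the dataset and the choice of logistic regression are assumptions for the illustration) to show the shape of the result.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification data.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=1)

# 10 stratified folds, repeated 3 times with different shuffles.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(), X_syn, y_syn,
                         scoring="accuracy", cv=cv)

print(len(scores), mean(scores), std(scores))
```

The mean summarizes expected accuracy and the standard deviation indicates how much it varies from fold to fold.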

In [None]:
# get the models to evaluate
models = get_models() # Retrieve all the models for us and store them in "models"
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train) # use the cross validation function
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))

# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

We can see that in this case, cart (DecisionTreeClassifier) performs the best with about 77.1 percent mean accuracy.

## Apply Ensemble Modelling

In [None]:
from sklearn.ensemble import StackingClassifier

Next, we can try to combine these five models into a single ensemble model using stacking.
We can use a logistic regression model to learn how to best combine the predictions from each of the separate five models.
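
To see what the meta-learner actually receives, the fitted ensemble's `transform` method returns the base models' outputs for each row. The sketch below is a reduced illustration with only two base models on synthetic data (both choices are assumptions made to keep it small). For a binary problem where the base models expose `predict_proba`, scikit-learn passes one probability column per base model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X_syn, y_syn = make_classification(n_samples=120, n_features=5, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("bayes", GaussianNB())],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,
).fit(X_syn, y_syn)

# One column per base model: the inputs to the meta-learner.
meta_features = stack.transform(X_syn)
print(meta_features.shape)
```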

In [None]:
# get a stacking ensemble of models
def get_stacking():
    # define the base models
    level0 = list()
    level0.append(('lr', LogisticRegression()))
    level0.append(('knn', KNeighborsClassifier()))
    level0.append(('cart', DecisionTreeClassifier()))
    level0.append(('svm', SVC()))
    level0.append(('bayes', GaussianNB()))
    # define meta learner model
    level1 = LogisticRegression()
    
    # define the stacking ensemble (all of them now called "model")
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
    return model

In [None]:
# get a list of models to evaluate
def get_models():
    models = dict()
    models['lr'] = LogisticRegression()
    models['knn'] = KNeighborsClassifier()
    models['cart'] = DecisionTreeClassifier()
    models['svm'] = SVC()
    models['bayes'] = GaussianNB()
    models['stacking'] = get_stacking()
    return models

In [None]:
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train) #Ensembles are here now as "model"
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
#pyplot.boxplot(results, labels=names, showmeans=True)
#pyplot.show()

In this case, we can see that the stacking ensemble appears to perform better than any single model on average, with about 77 percent mean accuracy.

## make a prediction for one example

In [None]:
# define the base models
level0 = list()
level0.append(('lr', LogisticRegression()))
level0.append(('knn', KNeighborsClassifier()))
level0.append(('cart', DecisionTreeClassifier()))
level0.append(('svm', SVC()))
level0.append(('bayes', GaussianNB()))

In [None]:
# define meta learner model
level1 = LogisticRegression()
# define the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)

In [None]:
# fit the model on the training data
model.fit(X_train, y_train)

In [None]:
# make a prediction for one example
data = [[53,9,0,2,45]] # Features(age,educational-num,capital-gain,capital-loss,hours-per-week) in our training data
yhat = model.predict(data) # The model here is the ensemble.
print('Predicted Class: %d' % yhat[0])

Running the example fits the stacking ensemble on the training split and then uses it to make a prediction on a new row of data, as we might when using the model in an application. The predicted class here is 1.

## Save the Model using Pickle

In [None]:
import pickle

In [None]:
# save the model to disk
filename = 'classifierfinalized_model.sav' 
#This is the trained ensemble model.
with open(filename, 'wb') as f:
    pickle.dump(model, f)
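
A quick round-trip check confirms that a pickled model behaves the same after loading. The sketch below uses a small logistic regression on synthetic data and a temporary file (all of these are illustrative assumptions, not the notebook's ensemble), and opens the file in `with` blocks so the handles are closed.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_syn, y_syn)

# Write to a temporary location so nothing in the working directory is touched.
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as fh:
    pickle.dump(clf, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X_syn), restored.predict(X_syn))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.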

## Score the Pickled Model on OUR FIRST Test Data (Real World data in this case)

In [None]:
# Remember the first test data (X_test and y_test)

In [None]:
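# some time later...............
 
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
# The model was trained on 0/1 labels, so encode y_test the same way before scoring
y_test_encoded = y_test.map({'<=50K': 0, '>50K': 1})
result = loaded_model.score(X_test, y_test_encoded) # New incoming data
print(result)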
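
Because the ensemble was trained on 0/1 labels, the held-out labels need the same encoding before they are compared with its predictions. A minimal sketch with made-up values (the labels and predictions below are assumptions for the illustration):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical held-out labels in their original string form.
y_true_raw = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])
# Hypothetical predictions from a model trained on 0/1 labels.
y_pred = [0, 1, 0, 0]

# Apply the same mapping that was used on the training labels.
y_true = y_true_raw.map({"<=50K": 0, ">50K": 1})

acc = accuracy_score(y_true, y_pred)
print(acc)
```

On imbalanced test data, accuracy alone can look good for a model that mostly predicts the majority class, so the confusion matrix and classification report used earlier remain useful here too.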