# AutoML can be a strong tool for finding well-performing models as applications scale, while keeping the modeling process repeatable.
## In this session we explore the capabilities of AutoML through TPOT, with possible extensions to the auto-sklearn library.

## The data set is a subset of the data we worked on for the live in-class assignment on Classification (Machine Learning Engineer@Walmart Labs). For this exercise we use 25k random samples each from the train and test data to demonstrate AutoML libraries.

### Created by Sohini Roychowdhury for FourthBrain.ai

## Task 1, Data Loading: Load the data and create Training and Test data sets.
We look at the first 25k samples of the training and test data sets from the Week 2 Live assignment. The data is already pre-wrangled and one-hot encoded.

In [None]:
## Importing required Libraries
import os
import tensorflow as tf
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# Read the training data
url = 'https://raw.githubusercontent.com/FourthBrain/AutoMLA/main/Train_data.csv'
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
df = pd.read_csv(url, on_bad_lines='skip')
df.head()

Unnamed: 0,NumOfEventsInJourney,NumSessions,interactionTime,maxPrice,minPrice,NumCart,NumView,InsessionCart,InsessionView,year,month,weekday,timeOfDay,Weekend,Purchase
0,1,1,0,50.63,50.63,0,0,0,0,2019,11,7,2,0,1
1,1,1,0,126.0,126.0,0,1,0,1,2019,11,5,1,0,0
2,1,1,0,334.6,334.6,0,1,0,1,2019,11,2,1,0,0
3,1,1,0,43.24,43.24,0,1,0,1,2019,11,3,5,0,0
4,1,1,0,421.38,421.38,0,1,0,2,2019,11,7,2,0,0


In [None]:
X_train=df.iloc[:,:-1].values
y_train=df.iloc[:,-1].values
print(X_train.shape)

(25000, 14)


In [None]:
# Now read and create the test data
url_t = 'https://raw.githubusercontent.com/FourthBrain/AutoMLA/main/Test_data.csv'
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
df_test = pd.read_csv(url_t, on_bad_lines='skip')
X_test=df_test.iloc[:,:-1].values
y_test=df_test.iloc[:,-1].values
print(X_test.shape)
df_test.head()

(25000, 14)


Unnamed: 0,NumOfEventsInJourney,NumSessions,interactionTime,maxPrice,minPrice,NumCart,NumView,InsessionCart,InsessionView,year,month,weekday,timeOfDay,Weekend,Purchase
0,2,1,30,242.99,242.99,0,2,0,4,2019,11,5,4,0,0
1,1,1,0,869.11,869.11,0,1,0,1,2019,11,6,3,0,0
2,1,1,0,715.33,715.33,1,0,1,0,2019,11,5,6,0,0
3,1,1,0,771.27,771.27,0,1,0,1,2019,11,5,3,0,0
4,1,1,0,15.42,15.42,0,1,0,1,2019,11,1,5,0,0


# Task 2, Baselining: Apply Logistic Regression, Linear SVM and Random Forest (as done in class) to get baseline performance for classifying purchase vs non-purchase records.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [None]:
#i. Balanced Logistic Regression
regb = LogisticRegression(random_state=42, C=0.05,class_weight='balanced').fit(X_train, y_train)
reg_predb = regb.predict(X_test)
cmlog = confusion_matrix(y_test, reg_predb)
acc   = accuracy(y_test, reg_predb)
rec   = recall(y_test, reg_predb)
prec  = precision(y_test, reg_predb)
f1    = f1_score(y_test, reg_predb)
# Print the metrics and the confusion matrix
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmlog)

Accuracy = 0.94348, Precision = 0.19660620245757754, Recall = 0.8936170212765957, F1-score = 0.32230215827338127
Confusion Matrix is:
[[23251  1373]
 [   40   336]]
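
As a quick sanity check, the metrics printed above can be recomputed by hand from the confusion matrix. This is a minimal sketch that assumes scikit-learn's usual layout of `[[TN, FP], [FN, TP]]`; the variable names are ours.

```python
# Confusion matrix printed above; scikit-learn orders it as [[TN, FP], [FN, TP]].
tn, fp = 23251, 1373
fn, tp = 40, 336

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted purchases that were real purchases
recall = tp / (tp + fn)     # share of real purchases that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.5f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The low precision comes from the 1373 false positives, which outnumber the 336 true positives by about four to one, so accuracy alone hides how noisy the positive predictions are.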


In [None]:
#ii. Linear SVM
svmm = make_pipeline(StandardScaler(),LinearSVC(random_state=42, tol=1e-1,class_weight='balanced'))
svmm.fit(X_train, y_train)
svm_predb = svmm.predict(X_test)
cms  = confusion_matrix(y_test, svm_predb)
acc  = accuracy(y_test, svm_predb)
rec  = recall(y_test, svm_predb)
prec = precision(y_test, svm_predb)
f1   = f1_score(y_test, svm_predb)
# Print the metrics and the confusion matrix
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cms)

Accuracy = 1.0, Precision = 1.0, Recall = 1.0, F1-score = 1.0
Confusion Matrix is:
[[24624     0]
 [    0   376]]
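
Both the Logistic Regression and Linear SVM baselines above pass `class_weight='balanced'`. scikit-learn documents this heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python. Since the training class counts are not printed in this notebook, it uses the test-set counts from the confusion matrix (24624 and 376) purely as an illustration; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels with the same class counts as the test set above
labels = [0] * 24624 + [1] * 376
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the rare purchase class gets a weight roughly 65 times larger than the majority class, which pushes the models toward higher recall on purchases.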


In [None]:
#iii. Random Forest
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=1, max_depth=20)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
cmrf=confusion_matrix(y_test, rf_pred)
acc  = accuracy(y_test, rf_pred)
rec  = recall(y_test, rf_pred)
prec = precision(y_test, rf_pred)
f1   = f1_score(y_test, rf_pred)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmrf)

Accuracy = 0.99868, Precision = 1.0, Recall = 0.9122340425531915, F1-score = 0.9541029207232268
Confusion Matrix is:
[[24624     0]
 [   33   343]]
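
To put the accuracy values above in context, it helps to compare them against a trivial classifier that always predicts "no purchase". The sketch below uses only the class counts visible in the confusion matrices; the always-negative classifier is a hypothetical reference point, not one of the models trained here.

```python
# Class counts taken from the test-set confusion matrices above
n_negative, n_positive = 24624, 376
n_total = n_negative + n_positive

# A hypothetical classifier that always predicts "no purchase" (class 0)
majority_accuracy = n_negative / n_total
majority_recall = 0 / n_positive  # it never finds a purchase

print(f"positive rate     = {n_positive / n_total:.4%}")
print(f"majority accuracy = {majority_accuracy:.5f}")
print(f"majority recall   = {majority_recall:.1f}")
```

Because only about 1.5% of records are purchases, this trivial baseline already scores 0.98496 accuracy, higher than the balanced Logistic Regression (0.94348). That is why recall, precision and F1-score are the more informative metrics for this task.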


# Task 3: Run AutoML using TPOT (Tree-based Pipeline Optimization Tool). Consider TPOT your Data Science Assistant.

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
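
TPOT performs this exploration with genetic programming. The toy sketch below shows only the general idea of an evolutionary search (mutate candidates, keep the fittest) over two made-up hyperparameters. It is not TPOT's implementation: the `fitness` function is a hypothetical stand-in for a cross-validation score, and TPOT's real search also evolves the structure of the pipeline itself.

```python
import random

def fitness(candidate):
    """Hypothetical stand-in for a cross-validation score (higher is better)."""
    depth, leaf = candidate
    return -((depth - 10) ** 2) - ((leaf - 3) ** 2)

def mutate(candidate, rng):
    """Nudge one hyperparameter by +/-1, keeping both values at least 1."""
    depth, leaf = candidate
    if rng.random() < 0.5:
        depth = max(1, depth + rng.choice([-1, 1]))
    else:
        leaf = max(1, leaf + rng.choice([-1, 1]))
    return (depth, leaf)

def evolve(generations=5, population_size=50, seed=42):
    rng = random.Random(seed)
    population = [(rng.randint(1, 30), rng.randint(1, 20)) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Create one mutated offspring per individual, then keep the fittest
        offspring = [mutate(ind, rng) for ind in population]
        population = sorted(population + offspring, key=fitness, reverse=True)[:population_size]
        history.append(fitness(population[0]))
    return population[0], history

best, history = evolve()
print(best, history)
```

Because the fittest candidates always survive into the next generation, the best score in `history` never decreases, which mirrors the non-decreasing "Current best internal CV score" lines that TPOT prints.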

[Documentation](http://epistasislab.github.io/tpot/using/),
[GitHub Code](https://github.com/EpistasisLab/tpot)

[Paper](https://academic.oup.com/bioinformatics/article/36/1/250/5511404)

## Blog: https://machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/

# Installation

This notebook was created with tpot 0.11.6.

In [None]:
!pip install tpot



In [None]:
# Ignore some noisy warnings while demonstrating TPOT;
# this shouldn't be done in real production code
# (np.warnings was removed in NumPy 1.24, so use the standard library module)
import warnings
warnings.filterwarnings('ignore')

# Task 3a.

We want to find the best classifier for the data set at hand.
TPOT tunes hyperparameters across the following families of classifiers:
1. Naive Bayes
2. Decision Trees
3. RandomForestClassifier
4. GradientBoostingClassifier
5. KNeighborsClassifier
6. Linear SVM
7. Logistic Regression
8. Extreme Gradient Boosting (XGBoost)

Related Blog with description: https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9

In [None]:
from tpot import TPOTClassifier
print('Training features shape: ', X_train.shape)
print('Testing features shape:  ', X_test.shape)

Training features shape:  (25000, 14)
Testing features shape:   (25000, 14)


## The following process will take time (a few hours), but a progress bar shows how far along it is.

In [None]:
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')


Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.9998799999999999

Generation 2 - Current best internal CV score: 0.9998799999999999

Generation 3 - Current best internal CV score: 0.9998799999999999

Generation 4 - Current best internal CV score: 0.9998799999999999

Generation 5 - Current best internal CV score: 0.99992

Best pipeline: RandomForestClassifier(PCA(input_matrix, iterated_power=4, svd_solver=randomized), bootstrap=False, criterion=gini, max_features=0.45, min_samples_leaf=8, min_samples_split=8, n_estimators=100)
0.9998
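
The total of 300 shown in the progress bar above follows from the search settings: according to the TPOT documentation, the number of pipelines evaluated is `population_size + generations * offspring_size`, where `offspring_size` defaults to `population_size`. A small sketch of that arithmetic, using a hypothetical helper name of our own:

```python
def tpot_total_evaluations(generations, population_size, offspring_size=None):
    """Pipelines evaluated: population_size + generations * offspring_size."""
    if offspring_size is None:
        # Assumed default, as documented by TPOT: offspring_size = population_size
        offspring_size = population_size
    return population_size + generations * offspring_size

total = tpot_total_evaluations(generations=5, population_size=50)
print(total)  # 300
```

Each of those pipelines is scored with cross-validation, which is why even this small configuration can take a long time on 25k samples.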


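The run logged above selected PCA followed by a Random Forest. The exported pipeline used in the next cell appears to come from a different run, which selected a linear SVM (C=15, tol=0.01, L2 penalty) stacked with a decision tree; that result is in line with the strong Linear SVM baseline above. The best pipeline TPOT reports can differ from run to run.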

In [None]:
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from sklearn.tree import DecisionTreeClassifier
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LinearSVC(C=15, tol=0.01)),
    DecisionTreeClassifier(max_depth=10, min_samples_leaf=3, min_samples_split=14)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)
 
exported_pipeline.fit(X_train, y_train)
prediction = exported_pipeline.predict(X_test)

In [None]:
cmtp=confusion_matrix(y_test, prediction)
acc  = accuracy(y_test, prediction)
rec  = recall(y_test, prediction)
prec = precision(y_test, prediction)
f1   = f1_score(y_test, prediction)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

Accuracy = 0.99968, Precision = 1.0, Recall = 0.9787234042553191, F1-score = 0.989247311827957
Confusion Matrix is:
[[24624     0]
 [    8   368]]


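So we see that TPOT finds a strong model and its hyperparameters automatically: the exported pipeline reaches an F1-score of 0.989, compared with 0.954 for the Random Forest baseline and 0.322 for Logistic Regression, although the Linear SVM baseline scored 1.0 on this test set.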

# Task 3b

Here we use TPOT to find the best subset of features for the given classification task.
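
The idea behind feature-subset selection can be sketched without TPOT. The example below assumes the subset file follows the layout described in the TPOT documentation for `FeatureSetSelector`: a CSV with `Subset`, `Size` and `Features` columns, with features separated by `;`. The contents of the real `test_feature.csv` are not shown in this notebook, so the subset names, indices and helper functions here are all hypothetical.

```python
import csv
import io

# Hypothetical subset file; the names and column indices are illustrative only.
SUBSET_CSV = """Subset,Size,Features
journey,3,0;1;2
price,2,3;4
"""

def load_subsets(text):
    """Parse rows of (Subset, Size, Features) into {name: [column indices]}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Subset"]: [int(i) for i in row["Features"].split(";")] for row in reader}

def select_columns(rows, indices):
    """Keep only the chosen column indices from each row."""
    return [[row[i] for i in indices] for row in rows]

subsets = load_subsets(SUBSET_CSV)
names = list(subsets)

# First five feature values of the first training row shown earlier
rows = [[1, 1, 0, 50.63, 50.63]]
print(select_columns(rows, subsets[names[0]]))  # like sel_subset=0: first subset
print(select_columns(rows, subsets[names[1]]))  # like sel_subset=1: second subset
```

In the TPOT configuration, `sel_subset` plays the role of the index into `names`: the search tries each candidate subset as the first pipeline step and keeps whichever leads to the best cross-validation score.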

In [None]:
from tpot.config import classifier_config_dict
# Add FeatureSetSelector to the TPOT configuration
classifier_config_dict['tpot.builtins.FeatureSetSelector'] = {
    'subset_list': ['https://raw.githubusercontent.com/FourthBrain/AutoMLA/main/test_feature.csv'],
    # Candidate values for sel_subset: each one is the index of a single feature subset in the file above
    'sel_subset': [0, 1],
    # 'sel_subset': list(combinations(range(3), 2))  # alternative: select two feature subsets at a time
}

tpot = TPOTClassifier(generations=5,
                           population_size=50, verbosity=2,
                           template='FeatureSetSelector-Transformer-Classifier',
                           config_dict=classifier_config_dict)
tpot.fit(X_train, y_train)



Generation 1 - Current best internal CV score: 0.9995200000000001

Generation 2 - Current best internal CV score: 0.9995200000000001

Generation 3 - Current best internal CV score: 0.9996

Generation 4 - Current best internal CV score: 0.9996

Generation 5 - Current best internal CV score: 0.99968

Best pipeline: DecisionTreeClassifier(PCA(FeatureSetSelector(input_matrix, sel_subset=0, subset_list=https://raw.githubusercontent.com/FourthBrain/AutoMLA/main/test_feature.txt), iterated_power=4, svd_solver=randomized), criterion=entropy, max_depth=7, min_samples_leaf=2, min_samples_split=2)



Thus, when restricted to a single feature subset, the best pipeline found selects feature subset 0, applies PCA, and classifies with a decision tree (entropy criterion, maximum depth 7), reaching an internal CV score of 0.99968.

## Another well-known AutoML library is auto-sklearn:
[Code](https://colab.research.google.com/drive/1Au5sGCegoGLLxqrIs85GCCW0oWIHqwQw)