# Comparing the Predictive Power of TPOT vs XGBoost, Quick and Dirty Style


#### This is a quick and dirty look at how TPOT compares to XGBoost with minimal tuning.  
#### I am using the network intrusion detection dataset from Kaggle: https://www.kaggle.com/sampadab17/network-intrusion-detection

## Imports and custom cell behavior

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from tpot import TPOTClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

# see all the data in dataframe
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# widen jupyter cells to page width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Load the data 

In [2]:
df_train = pd.read_csv('Train_data.csv')
# df_test = pd.read_csv('Test_data.csv')

# ===================  Preprocess data  ==================================

## Check for missing data

In [37]:
df_train.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

## A one-line version of the same check (less detail)

In [38]:
df_train.isnull().values.any()

False

## One-hot encode the categorical columns

In [3]:
df_encoded = pd.get_dummies(df_train, prefix=['protocol_type', 'service', 'flag'], columns=['protocol_type', 'service', 'flag'])
# df_test_encoded = pd.get_dummies(df_test, prefix=['protocol_type', 'service', 'flag'], columns=['protocol_type', 'service', 'flag'])
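To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. The `toy` dataframe and its values are hypothetical; only the `pd.get_dummies` call mirrors the cell above. Each category becomes its own indicator column named `<prefix>_<value>`, and non-categorical columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more columns and rows.
toy = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "icmp"],
})

# Same call shape as the notebook, restricted to one categorical column.
toy_encoded = pd.get_dummies(toy, prefix=["protocol_type"], columns=["protocol_type"])

# Untouched columns come first, then one indicator column per category.
encoded_columns = list(toy_encoded.columns)
print(encoded_columns)
```

If the commented-out test file is ever encoded separately, it may produce a different set of columns when a category is missing from one file, so the two frames would need to be aligned before modeling.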

## Grab X and y values from dataframe

In [5]:
X = df_encoded.drop('class', axis=1).values
y = df_encoded['class'].values

## Standardize data
####  Standardize each feature to zero mean and unit variance

In [4]:
std_scaler = StandardScaler()

In [6]:
X_std = std_scaler.fit_transform(X)

## Encode target variables for machine handling

In [9]:
labelencoder = LabelEncoder()
y_encoded = labelencoder.fit_transform(y)
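A small sketch of what `LabelEncoder` does to string targets. The label values here are assumptions for illustration, not read from the dataset. Classes are sorted alphabetically and mapped to integers starting at 0, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values for illustration only.
labels = ["normal", "anomaly", "normal", "anomaly", "anomaly"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the index of each is its integer code.
classes = list(encoder.classes_)

# Map integer predictions back to readable labels.
decoded = list(encoder.inverse_transform(encoded))

print(classes, list(encoded))
```

Checking `labelencoder.classes_` in the notebook shows which integer corresponds to which class when reading predictions.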

## Split data into train/test 

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_std, y_encoded, test_size=0.2, random_state=93)
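One caveat about ordering: the scaler earlier in this notebook was fit on every row before this split, so the test rows contributed to the mean and standard deviation used for scaling. The effect on the scores is probably small, but a leak-free ordering is easy to sketch. The example below uses synthetic data and hypothetical variable names (`X_demo`, `y_demo`); it splits first, fits the scaler on the training rows only, and then reuses those statistics on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and labels.
rng = np.random.default_rng(93)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=93
)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_std = scaler.transform(X_te)      # apply the training statistics to test rows

# Training features now have mean ~0 and standard deviation ~1 per column.
train_means = X_tr_std.mean(axis=0)
train_stds = X_tr_std.std(axis=0)
```

The test columns will not be exactly zero-mean after this, which is expected: they are scaled with the training statistics, just as unseen data would be.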

# ==========================  XGBoost  ====================================

## Load XGB Classifier

In [21]:
model = xgb.XGBClassifier(n_jobs=-1, random_state=93)

## Fit and predict on XGB Classifier

In [22]:
model.fit(X_train,y_train)
preds = model.predict(X_test)

## Print XGB Classifier accuracy score

In [24]:
accuracy = accuracy_score(y_test, preds)
accuracy

0.9954356023020441

# ============================  TPOT  ====================================

## Load TPOT classifier model

In [29]:
pipeline_optimizer = TPOTClassifier(generations=10, population_size=100, n_jobs=-1, cv=2, random_state=93, verbosity=2)

#### TPOT searches the space of pipelines with genetic programming.  This WILL take a long time at the settings above.  Lowering <em>generations</em> and <em>population_size</em> shortens the run time, which is a quick way to see if it's working before kicking off the full search.
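To get a feel for how those two settings drive run time, the sketch below uses the rule of thumb from the TPOT documentation: the search evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The helper name `tpot_total_evaluations` is hypothetical and not part of TPOT.

```python
def tpot_total_evaluations(generations, population_size, offspring_size=None):
    """Approximate number of pipelines TPOT evaluates during a search."""
    # TPOT uses population_size as the offspring size when none is given.
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


# Settings used in this notebook: generations=10, population_size=100.
full_run = tpot_total_evaluations(generations=10, population_size=100)

# A much smaller configuration for a quick smoke test.
smoke_run = tpot_total_evaluations(generations=2, population_size=10)

print(full_run, smoke_run)  # 1100 30
```

Each evaluation is itself cross-validated (`cv=2` here), so the number of model fits is roughly the evaluation count times the number of folds.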

In [30]:
pipeline_optimizer.fit(X_train, y_train)


Generation 1 - Current best internal CV score: 0.9968738825375691
Generation 2 - Current best internal CV score: 0.9968738973106761
Generation 3 - Current best internal CV score: 0.9968738973106761
Generation 4 - Current best internal CV score: 0.997072388775543
Generation 5 - Current best internal CV score: 0.9972708556185652
Generation 6 - Current best internal CV score: 0.9972708556185652
Generation 7 - Current best internal CV score: 0.9974197094441085
Generation 8 - Current best internal CV score: 0.9974197094441085
Generation 9 - Current best internal CV score: 0.9974197094441085
Generation 10 - Current best internal CV score: 0.9974197094441085

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.5, max_depth=9, max_features=0.4, min_samples_leaf=19, min_samples_split=2, n_estimators=100, subsample=0.9500000000000001)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=2,
        disable_update_check=False, early_stop=None, generations=10,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=-1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=100,
        random_state=93, scoring=None, subsample=1.0,
        template='RandomTree', use_dask=False, verbosity=2,
        warm_start=False)

## Print TPOT accuracy score

In [36]:
print(pipeline_optimizer.score(X_test, y_test))

0.9978170271879341


## Final Score Comparison

XGBoost = 0.9954356023020441 <br />
TPOT = 0.9978170271879341

#### We see that TPOT outperforms XGBoost with default params out of the box but takes <em>much</em> longer to train. 
#### TPOT beats XGBoost by about 0.24 percentage points of accuracy (0.9978 vs 0.9954) with a vanilla train, but a different training budget or parameter tuning could show us different results.  
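The size of the gap can be checked directly from the two scores printed above. This sketch reports it two ways: as an absolute difference in percentage points, and as the relative reduction in error rate, which can be a more telling number when both accuracies are this close to 1.

```python
# Accuracy scores copied from the notebook outputs above.
xgb_acc = 0.9954356023020441
tpot_acc = 0.9978170271879341

# Absolute gap, expressed in percentage points.
gap_pp = (tpot_acc - xgb_acc) * 100

# Error rates and the relative reduction TPOT achieves over XGBoost.
xgb_err = 1 - xgb_acc
tpot_err = 1 - tpot_acc
rel_err_reduction = 1 - tpot_err / xgb_err

print(f"gap: {gap_pp:.2f} percentage points")                 # gap: 0.24 percentage points
print(f"relative error reduction: {rel_err_reduction:.1%}")   # relative error reduction: 52.2%
```

Both numbers come from a single train/test split with one random seed, so they describe this run rather than a general ranking of the two tools.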