## TPOT - Auto-ML programming 

The following image depicts how TPOT works:
<img src = 'https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1537396029/output_2_0_d7uh0v.png'/>

For more info visit: https://epistasislab.github.io/tpot/examples/

In [29]:
# import required modules
import pandas as pd
import numpy as np

In [30]:
data = pd.read_csv('online_shoppers_intention-2.csv')
data.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


## Data prep

It's generally a good idea to randomly **shuffle** the rows before modeling, so that any ordering in the source file does not carry over into later steps.

In [31]:
# randomly shuffle the rows
data_shuffle = data.iloc[np.random.permutation(len(data))]
data1 = data_shuffle.reset_index(drop=True)
data1.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,22,2715.0,0.0,0.009091,54.970364,0.0,May,2,2,4,2,New_Visitor,False,True
1,0,0.0,0,0.0,4,0.0,0.2,0.2,0.0,0.0,Dec,1,1,2,1,Returning_Visitor,False,False
2,3,145.2,0,0.0,11,1646.366667,0.0,0.027778,0.0,0.0,June,2,2,2,1,Returning_Visitor,False,False
3,0,0.0,0,0.0,21,1275.9,0.052381,0.101587,0.0,0.0,Sep,2,4,5,1,Returning_Visitor,False,False
4,0,0.0,0,0.0,2,0.0,0.2,0.2,0.0,0.0,Aug,1,1,3,1,Returning_Visitor,False,False
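
The shuffle above draws a new permutation on every run, so the row order is not repeatable between sessions. A minimal sketch of a seeded alternative, assuming pandas' `DataFrame.sample`; the toy frame and the seed value 42 are illustrative and do not come from the original notebook:

```python
import pandas as pd

# Toy stand-in for the shoppers data (illustrative values only)
toy = pd.DataFrame({
    "ProductRelated": [1, 2, 1, 2, 10],
    "Revenue": [False, False, False, False, True],
})

# frac=1 returns every row in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# The same seed gives the same row order on both calls
print(shuffled.equals(shuffled_again))
```

Fixing the seed makes it possible to compare two runs of the notebook without wondering whether a difference comes from the row order.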


- Label categorical values and deal with missing values

In [32]:
# inspect column dtypes to find the categorical columns that need labeling
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int64
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int64
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int64
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null object
OperatingSystems           12330 non-null int64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12330 non-null object
Weekend                    12330 non-null bool
Revenue                    12330 non-null bool
dtypes: bool(

In [33]:
# bin the Month column by quarter
# the new Month_bin column holds the quarter (1-4) of each month
def Month_bin(Month) :
    if Month == 'Jan':
        return 1
    elif Month == 'Feb':
        return 1
    elif Month == 'Mar':
        return 1
    elif Month == 'Apr':
        return 2
    elif Month == 'May':
        return 2
    elif Month == 'June':
        return 2
    elif Month == 'Jul':
        return 3
    elif Month == 'Aug':
        return 3
    elif Month == 'Sep':
        return 3
    elif Month == 'Oct':
        return 4
    elif Month == 'Nov':
        return 4
    elif Month == 'Dec':
        return 4

data1['Month_bin'] = data1['Month'].apply(Month_bin)

# encode VisitorType as an integer code
# stored in a new column, VisitorType_bin
def VisitorType_bin(VisitorType) :
    if VisitorType == 'Returning_Visitor':
        return 1
    elif VisitorType == 'New_Visitor':
        return 2
    elif VisitorType == 'Other':
        return 3

# apply function
data1['VisitorType_bin'] = data1['VisitorType'].apply(VisitorType_bin)

# one-hot encode the binned columns
data1 = pd.get_dummies(data1, columns=['VisitorType_bin', 'Month_bin'])
# convert the dummy columns to int (get_dummies may return bool or uint8, depending on the pandas version)
dummy_cols = ['VisitorType_bin_1', 'VisitorType_bin_2', 'VisitorType_bin_3',
              'Month_bin_1', 'Month_bin_2', 'Month_bin_3', 'Month_bin_4']
data1[dummy_cols] = data1[dummy_cols].astype(int)

data1 = data1.drop(['Month','VisitorType'], axis = 1)

# tpot doesn't accept bool dtype
data1[['Revenue','Weekend']] = data1[['Revenue','Weekend']].astype(int)
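
The two `if`/`elif` chains above are lookup tables, so the same encoding can be sketched with a dictionary and `Series.map`. This is an alternative under stated assumptions, not the notebook's own code: the names `MONTH_TO_QUARTER` and `toy` are hypothetical, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical lookup table: month label -> quarter
MONTH_TO_QUARTER = {
    "Jan": 1, "Feb": 1, "Mar": 1,
    "Apr": 2, "May": 2, "June": 2,
    "Jul": 3, "Aug": 3, "Sep": 3,
    "Oct": 4, "Nov": 4, "Dec": 4,
}

# Toy rows using the same month spellings as the dataset (note "June")
toy = pd.DataFrame({"Month": ["Feb", "May", "June", "Sep", "Dec"]})

# map() looks each label up in the dict; unknown labels would become NaN
toy["Month_bin"] = toy["Month"].map(MONTH_TO_QUARTER)

# One-hot encode the quarter, drop the raw label, and store 0/1 as int
encoded = (
    pd.get_dummies(toy, columns=["Month_bin"])
    .drop(columns="Month")
    .astype(int)
)
print(encoded.columns.tolist())
```

A dictionary keeps the mapping in one place, which makes a misspelled or missing month label easier to spot than in a long chain of branches.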

In [34]:
# store the target separately
target = data1.Revenue.values

In [35]:
# handling missing values
# assumption: exit rates cannot be 0, so zeros are treated as missing
data1['ExitRates'] = data1['ExitRates'].replace(0, np.nan)
data1['ExitRates'] = data1['ExitRates'].fillna(data1['ExitRates'].median())
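
To make the zero-handling step concrete, here is a small worked example on made-up exit rates (the values are illustrative only): zeros become missing, and the median of the remaining values fills them.

```python
import numpy as np
import pandas as pd

# Made-up exit rates; two of them are the "impossible" value 0
exit_rates = pd.Series([0.0, 0.2, 0.1, 0.0, 0.3])

# Step 1: mark zeros as missing
with_nan = exit_rates.replace(0, np.nan)

# Step 2: the median skips NaN, so it is taken over 0.2, 0.1 and 0.3
median = with_nan.median()

# Step 3: fill the missing entries with that median
filled = with_nan.fillna(median)
print(filled.tolist())
```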

## Modeling
- Split the DataFrame into a training set and a testing set, just as you would for any machine learning model.
- You can do this with **train_test_split** from sklearn's **model_selection** module.

In [36]:
from sklearn.model_selection import train_test_split

training_indices, testing_indices = train_test_split(data1.index,
                                                        stratify = target,
                                                        train_size=0.75, test_size=0.25, random_state = 123)

In [37]:
# size of the training set and the testing set
training_indices.size, testing_indices.size

(9247, 3083)
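
The `stratify` argument keeps the class balance the same in both subsets, which matters when one class is much less frequent than the other. A small sketch on synthetic labels (an assumed 80/20 class mix, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 negatives and 20 positives (assumed mix, for illustration)
y = np.array([0] * 80 + [1] * 20)
idx = np.arange(len(y))

# Same call shape as above: split the row indices, stratified on the labels
train_idx, test_idx = train_test_split(
    idx, stratify=y, train_size=0.75, test_size=0.25, random_state=123
)

# Compare the share of positives in each subset
print(len(train_idx), len(test_idx))
print(y[train_idx].mean(), y[test_idx].mean())
```

Splitting the index rather than the data itself, as the notebook does, lets the same indices select both the feature rows and the target values later on.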

**TPOT** training can take several hours to finish, but some hyperparameters can be adjusted to keep the run time bounded:


- **max_time_mins**: how many minutes TPOT has to optimize the pipeline. If not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse.
- **max_eval_time_mins**: how many minutes TPOT has to evaluate a single pipeline. Setting this parameter to higher values will enable TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on assessing time-consuming pipelines. The default is 5.
- **early_stop**: how many generations TPOT waits without improvement before stopping. The optimization process ends if the best score has not improved within the given number of generations.
- **n_jobs**: Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process. Setting n_jobs=-1 will use as many cores as are available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets. The default is 1.
- **subsample**: Fraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. The default is 1.
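
These limits matter because of how many pipelines an unbounded run would try. According to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`, and each pipeline is evaluated with k-fold cross-validation. A rough back-of-the-envelope sketch (the helper name `pipeline_budget` is hypothetical, and the fit count is only an approximation):

```python
def pipeline_budget(population_size, generations, offspring_size=None, cv=5):
    """Return (pipelines evaluated, approximate model fits) for a full run."""
    # TPOT uses population_size as the offspring size when none is given
    if offspring_size is None:
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    # Each pipeline is fitted roughly once per cross-validation fold
    return pipelines, pipelines * cv


# population_size=30 with TPOT's defaults of 100 generations and 5-fold CV
pipelines, fits = pipeline_budget(population_size=30, generations=100)
print(pipelines, fits)
```

With thousands of candidate pipelines in a full run, a wall-clock cap such as `max_time_mins` is usually what ends the search in practice.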

In [44]:
from tpot import TPOTClassifier

tpot = TPOTClassifier(verbosity=2, max_time_mins=30, 
                      max_eval_time_mins=1.2, population_size=30, early_stop=30, n_jobs=-1,scoring="roc_auc")

tpot.fit(data1.drop('Revenue',axis=1).loc[training_indices].values, # X_train
         data1.loc[training_indices,'Revenue'].values) # y_train

Generation 1 - Current best internal CV score: 0.9270869547713263
Generation 2 - Current best internal CV score: 0.9282509958632194
Generation 3 - Current best internal CV score: 0.928301959542355
Generation 4 - Current best internal CV score: 0.928301959542355
Generation 5 - Current best internal CV score: 0.928301959542355
Generation 6 - Current best internal CV score: 0.928443330001204
Generation 7 - Current best internal CV score: 0.9289341891557565
Generation 8 - Current best internal CV score: 0.9289341891557565
Generation 9 - Current best internal CV score: 0.9291855022988834
Generation 10 - Current best internal CV score: 0.9291855022988834
Generation 11 - Current best internal CV score: 0.9292618355685613
Generation 12 - Current best internal CV score: 0.9292618355685613
Generation 13 - Current best internal CV score: 0.9292618355685613
Generation 14 - Current best internal CV score: 0.9292618355685613
Generation 15 - Current best internal CV score: 0.9292618355685613
Generati

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=30, generations=100,
               max_eval_time_mins=1.2, max_time_mins=30, memory=None,
               mutation_rate=0.9, n_jobs=-1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=30,
               random_state=None, scoring='roc_auc', subsample=1.0,
               template=None, use_dask=False, verbosity=2, warm_start=False)

- We trained the model for 30 minutes.
- The best pipeline first passes the input matrix through a OneHotEncoder (minimum_fraction=0.25, sparse=False, threshold=10).
- The final model is a random forest with the following hyperparameters:
    - bootstrap=True, criterion=entropy, max_features=0.4, min_samples_leaf=20, min_samples_split=14, n_estimators=100
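
For reference, the reported random forest settings map directly onto scikit-learn's `RandomForestClassifier`. The sketch below fits a model with those hyperparameters on synthetic data from `make_classification`. It leaves out the OneHotEncoder step (TPOT's own built-in transformer), adds a `random_state` for repeatability, and is not the exported pipeline itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in, not the shoppers dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hyperparameters as reported for the best TPOT pipeline
model = RandomForestClassifier(
    bootstrap=True,
    criterion="entropy",
    max_features=0.4,
    min_samples_leaf=20,
    min_samples_split=14,
    n_estimators=100,
    random_state=0,  # assumption: added here only for repeatability
)
model.fit(X_train, y_train)

# Accuracy on the held-out synthetic rows
accuracy = model.score(X_test, y_test)
print(accuracy)
```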

One key difference here is that we pass both `X_test` and `y_test` in the code below, since the `.score()` method combines __prediction__ and __evaluation__ in a single step.

- Our AUC score on the test set is 93.06%.

In [45]:
tpot.score(data1.drop('Revenue',axis=1).loc[testing_indices].values, #X_test
           data1.loc[testing_indices, 'Revenue'].values) # y_test

0.9306108625313944
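
With `scoring="roc_auc"`, scoring needs both the features (to produce predicted probabilities) and the true labels (to evaluate how well those probabilities rank the rows). The sketch below illustrates the same idea with plain scikit-learn, using a `RandomForestClassifier` on synthetic data as a stand-in for the fitted TPOT pipeline. It is an assumption that this mirrors what `tpot.score` does internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in classifier (not the TPOT pipeline)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_train, y_train)

# One step: the "roc_auc" scorer predicts and evaluates in a single call
one_step = get_scorer("roc_auc")(clf, X_test, y_test)

# Two steps: predict positive-class probabilities, then evaluate them
proba = clf.predict_proba(X_test)[:, 1]
two_step = roc_auc_score(y_test, proba)

print(one_step, two_step)
```

Both routes give the same number; the one-step form just hides the prediction call.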

In [46]:
# we can export this pipeline and reuse it
tpot.export('tpot_pipeline.py')