# Tutorial: Automated Machine Learning
This is the code for the paper entitled "**[IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective](https://arxiv.org/abs/2209.08018)**" published in *Engineering Applications of Artificial Intelligence* (Elsevier's Journal, IF:7.8).<br>
Authors: Li Yang (lyang339@uwo.ca) and Abdallah Shami (Abdallah.Shami@uwo.ca)<br>
Organization: The Optimized Computing and Communications (OC2) Lab, ECE Department, Western University

L. Yang and A. Shami, "IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective", *Engineering Applications of Artificial Intelligence*, vol. 116, pp. 1-33, 2022, doi: https://doi.org/10.1016/j.engappai.2022.105366.

# Code Part 1: Automated Offline/Static/Batch Learning
Batch learning: Batch learning methods analyze static data in batches and often need access to the entire dataset prior to model training. Traditional ML algorithms can effectively solve batch learning tasks. Although batch learning models often achieve high performance due to their ability to learn diverse data patterns, it is often difficult to update these models once created. Therefore, batch learning faces two significant challenges: model degradation and data unavailability.

## Dataset 2: IoTID20
A subset of the IoT network traffic data randomly sampled from the [IoTID20 dataset](https://sites.google.com/view/iot-network-intrusion-dataset/home).   

The IoTID20 dataset was created by using normal and attack virtual machines as network platforms, simulating IoT services with the Node-RED tool, and extracting features with the Information Security Center of Excellence (ISCX) flow meter program. A typical smart home environment was set up to generate the dataset, using five IoT devices or services: a smart fridge, a smart thermostat, motion-activated lights, a weather station, and a remotely activated garage door. The traffic samples from normal and abnormal IoT devices were collected in pcap files.

## Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
import lightgbm as lgb
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import shapiro
from imblearn.over_sampling import SMOTE
import time

In [2]:
import warnings 
warnings.filterwarnings('ignore')

## Read the sampled IoTID20 dataset

In [3]:
df = pd.read_csv("Data/IoT_2020_b_0.01.csv")

In [4]:
df

Unnamed: 0,Flow_ID,Src_IP,Src_Port,Dst_IP,Dst_Port,Timestamp,Flow_Duration,TotLen_Fwd_Pkts,TotLen_Bwd_Pkts,Bwd_Pkt_Len_Max,...,Fwd_Pkts/s,Bwd_Pkts/s,Pkt_Len_Std,FIN_Flag_Cnt,Pkt_Size_Avg,Init_Bwd_Win_Byts,Idle_Mean,Idle_Max,Idle_Min,Label
0,12708,25886,53190,200,9020,3760,124,0.0,2776.0,1388.0,...,0.00,16129.03,0.00,0,2082.0,1869,124.00,124.0,124.0,1
1,60,25883,60357,110,7760,3649,46189,2528.0,0.0,0.0,...,43.30,43.30,698.97,0,632.0,126,15396.33,40225.0,1910.0,1
2,12695,25883,9020,203,52739,2261,73,1388.0,1388.0,1388.0,...,13698.63,13698.63,0.00,0,2082.0,1869,73.00,73.0,73.0,1
3,12690,25886,52717,200,9020,1963,149,0.0,0.0,0.0,...,13422.82,6711.41,0.00,0,0.0,32679,74.50,75.0,74.0,1
4,12690,25886,52717,200,9020,1905,83,0.0,0.0,0.0,...,12048.19,12048.19,0.00,0,0.0,32422,83.00,83.0,83.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6247,12659,25886,10125,200,9020,1057,70,0.0,2776.0,1388.0,...,0.00,28571.43,0.00,0,2082.0,1869,70.00,70.0,70.0,1
6248,13708,25883,60158,233,8899,3264,40,128.0,32.0,32.0,...,100000.00,25000.00,0.00,0,38.4,-1,10.00,21.0,5.0,1
6249,7655,15614,8487,200,554,3697,1283,0.0,0.0,0.0,...,0.00,1558.85,0.00,0,0.0,14600,1283.00,1283.0,1283.0,1
6250,62696,25889,64783,233,9988,3192,222,96.0,32.0,32.0,...,13513.51,4504.50,0.00,0,40.0,-1,74.00,206.0,8.0,1


# 1. Automated Data Pre-Processing

## Automated Transformation/Encoding
Automatically identify and transform string/text features into numerical features to make the data more readable by ML models

In [5]:
# Define the automated data encoding function
def Auto_Encoding(df):
    cat_features=[x for x in df.columns if df[x].dtype=="object"] ## Find string/text features
    for col in cat_features:
        # Fit a fresh encoder on this column only and replace it with integer codes
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

In [6]:
df=Auto_Encoding(df)
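
As a minimal sketch of what the encoding step does, the snippet below applies `LabelEncoder` to a small hypothetical frame. The toy values and the use of `pd.api.types.is_numeric_dtype` to detect text columns are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Src_IP": ["10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "Flow_Duration": [124, 46189, 73],
})

# Encode each non-numeric column separately; numeric columns stay untouched
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = LabelEncoder().fit_transform(toy[col].astype(str))

print(toy)
```

`LabelEncoder` assigns codes in sorted order of the distinct strings, so the codes carry no numeric meaning beyond identifying the category.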

## Automated Imputation
Detect and impute missing values to improve data quality

In [7]:
# Define the automated data imputation function
def Auto_Imputation(df):
    if df.isnull().values.any() or np.isinf(df).values.any(): # if there are any missing or infinite values
        df.replace([np.inf, -np.inf], np.nan, inplace=True)
        df.fillna(0, inplace = True)  # Replace missing values with zeros; other imputation methods are discussed in the paper
    return df

In [8]:
df=Auto_Imputation(df)
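
The following sketch shows the imputation logic on a hypothetical toy frame: infinite values are first converted to `NaN`, then filled. Zero filling mirrors the function above; median filling is shown only as one possible alternative and is an assumption, not necessarily the method the paper recommends.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one infinite and two missing entries
toy = pd.DataFrame({
    "Flow_Byts/s": [1200.0, np.inf, 300.0, np.nan],
    "Pkt_Len_Std": [0.0, 698.97, np.nan, 0.0],
})

# Step 1: treat +/- infinity as missing
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Step 2a: zero filling, as in Auto_Imputation
zero_filled = cleaned.fillna(0)

# Step 2b: per-column median filling (illustrative alternative)
median_filled = cleaned.fillna(cleaned.median())

print(zero_filled)
print(median_filled)
```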

## Automated Normalization
Normalize the range of features to a similar scale to improve data quality

In [9]:
def Auto_Normalization(df):
    stat, p = shapiro(df)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    # interpret
    alpha = 0.05
    numeric_features = df.drop(['Label'],axis = 1).dtypes[df.dtypes != 'object'].index
    
    # The selection strategy is based on the following article: 
    # https://medium.com/@kumarvaishnav17/standardization-vs-normalization-in-machine-learning-3e132a19c8bf
    # Check if the data distribution follows a Gaussian/normal distribution
    # If so, select the Z-score normalization method; otherwise, select the min-max normalization
    # Details are in the paper
    if p > alpha:
        print('Sample looks Gaussian (fail to reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.mean()) / (x.std()))
        print('Z-score normalization is automatically chosen and used')
    else:
        print('Sample does not look Gaussian (reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.min()) / (x.max()-x.min()))
        print('Min-max normalization is automatically chosen and used')
    return df

In [10]:
df=Auto_Normalization(df)

Statistics=0.108, p=0.000
Sample does not look Gaussian (reject H0)
Min-max normalization is automatically chosen and used
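
To make the two scaling rules concrete, here is a small NumPy sketch of Z-score and min-max normalization on a single hypothetical column. The helper names are illustrative, and the guard for constant columns is an added assumption; the function above does not include it.

```python
import numpy as np

def z_score(x):
    """Standardize to zero mean and unit sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # ddof=1 matches pandas' Series.std()

def min_max(x):
    """Rescale to [0, 1]; a constant column maps to zeros (assumed guard)."""
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        return np.zeros_like(x)
    return (x - x.min()) / value_range

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical feature values
print(min_max(x))
print(z_score(x))
print(min_max([7.0, 7.0, 7.0]))     # constant column
```

Without such a guard, a feature whose maximum equals its minimum would produce a division by zero and fill the column with `NaN`.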


## Train-test split
Split the dataset into the training and the test set

In [11]:
X = df.drop(['Label'],axis=1)
y = df['Label']

# An 80%/20% split is used here; it can be changed for specific tasks
#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2,random_state = 0)

## Automated Data Balancing
Generate minority-class samples to address class imbalance and improve data quality.  
The Synthetic Minority Over-sampling Technique (SMOTE) is used.

In [12]:
pd.Series(y_train).value_counts()

1    4717
0     284
Name: Label, dtype: int64

In [13]:
# For binary data (can be modified for multi-class data with the same logic)
def Auto_Balancing(X_train, y_train):
    counts = pd.Series(y_train).value_counts()  # sorted in descending order
    nlarge = counts.iloc[0]  # number of samples in the majority class
    nsmall = counts.iloc[1]  # number of samples in the minority class
    
    # Treat the training set as imbalanced when the majority class is
    # more than 1.5 times the size of the minority class
    if nlarge/nsmall > 1.5:
        smote=SMOTE(sampling_strategy={0:nlarge, 1:nlarge})
        X_train, y_train = smote.fit_resample(X_train, y_train)
        
    return X_train, y_train

In [14]:
X_train, y_train = Auto_Balancing(X_train, y_train)

In [15]:
pd.Series(y_train).value_counts()

1    4717
0    4717
Name: Label, dtype: int64
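
The core idea behind SMOTE is to create each synthetic minority sample on the line segment between a minority sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that interpolation step; the function name, the toy points, and the parameter defaults are hypothetical, and the notebook itself relies on the `imblearn` implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic

# Hypothetical minority-class points on the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies within the region already covered by the minority class.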

## Model learning (for Comparison)

In [16]:
%%time
lg = lgb.LGBMClassifier(verbose = -1)
lg.fit(X_train,y_train)
t1=time.time()
predictions = lg.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
F1-score: 100.0%
Time: 4.78267
Wall time: 245 ms
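
For reference, the four scores printed in these comparison cells can be derived from confusion-matrix counts, and the `Time` value is the average prediction time per test sample in microseconds (the code multiplies seconds per sample by 1,000,000). The sketch below uses hypothetical counts and helper names; note that scikit-learn's binary precision, recall, and F1 treat label 1 as the positive class by default.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def microseconds_per_sample(start, end, n_samples):
    """Average prediction time per test sample, in microseconds."""
    return (end - start) / n_samples * 1_000_000

# Hypothetical counts, chosen only for illustration
acc, prec, rec, f1 = binary_metrics(tp=90, fp=5, fn=10, tn=95)
print(f"Accuracy: {acc:.3%}  Precision: {prec:.3%}  Recall: {rec:.3%}  F1: {f1:.3%}")

# Hypothetical timing: 0.25 s to predict 1250 samples
per_sample = microseconds_per_sample(start=0.0, end=0.25, n_samples=1250)
print(per_sample)
```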


In [17]:
%%time
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
t1=time.time()
predictions = rf.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.83999999999999%
Precision: 99.83%
Recall: 100.0%
F1-score: 99.91499999999999%
Time: 12.75663
Wall time: 915 ms


In [18]:
%%time
nb = GaussianNB()
nb.fit(X_train,y_train)
t1=time.time()
predictions = nb.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 70.584%
Precision: 99.876%
Recall: 68.73899999999999%
F1-score: 81.43299999999999%
Time: 1.56773
Wall time: 25.9 ms


In [19]:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
t1=time.time()
predictions = knn.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 98.801%
Precision: 99.82799999999999%
Recall: 98.893%
F1-score: 99.358%
Time: 220.03672
Wall time: 281 ms


In [20]:
import tensorflow as tf
from keras.layers import Input,Dense,Dropout,BatchNormalization,Activation
from keras import Model
import keras.backend as K
import keras.callbacks as kcallbacks
from keras import optimizers
from keras.optimizers import Adam

from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
def ANN(optimizer = 'sgd',neurons=16,batch_size=1024,epochs=80,activation='relu',patience=8,loss='binary_crossentropy'):
    K.clear_session()
    inputs=Input(shape=(X.shape[1],))
    x=Dense(1000)(inputs)
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.3)(x)
    x=Dense(256)(inputs)  # Note: connected to `inputs`, so the Dense(1000) block above is not part of the final model graph
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.25)(x)
    x=Dense(2,activation='softmax')(x)
    model=Model(inputs=inputs,outputs=x,name='base_nlp')
    model.compile(optimizer='adam',loss='categorical_crossentropy')
#     model.compile(optimizer=Adam(lr = 0.01),loss='categorical_crossentropy',metrics=['accuracy'])
    early_stopping = EarlyStopping(monitor="loss", patience = patience)# early stop patience
    history = model.fit(X, pd.get_dummies(y).values,
              batch_size=batch_size,
              epochs=epochs,
              callbacks = [early_stopping],
              verbose=0) #verbose set to 1 will show the training process
    return model

Using TensorFlow backend.


In [21]:
%%time
ann = KerasClassifier(build_fn=ANN, verbose=0)
ann.fit(X_train,y_train)
t1=time.time()  # time the prediction step, as in the other model cells
predictions = ann.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 97.762%
Precision: 99.74%
Recall: 97.871%
F1-score: 98.79599999999999%
Wall time: 7.16 s


# 2. Automated Feature Engineering
Feature selection method 1: **Information Gain (IG)**, used to remove irrelevant features to improve model efficiency  
Feature selection method 2: **Pearson Correlation**, used to remove redundant features to improve model efficiency and accuracy  

In [22]:
# Remove irrelevant features and select important features
def Feature_Importance_IG(data):
    features = data.drop(['Label'],axis=1).values  # "Label" should be changed to the target class variable name if different
    labels = data['Label'].values
    
    # Extract feature names
    feature_names = list(data.drop(['Label'],axis=1).columns)

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    model = lgb.LGBMRegressor(verbose = -1)
    model.fit(features, labels)
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': model.feature_importances_})

    # Sort features according to importance
    feature_importances = feature_importances.sort_values('importance', ascending = False).reset_index(drop = True)

    # Normalize the feature importances to add up to one
    feature_importances['normalized_importance'] = feature_importances['importance'] / feature_importances['importance'].sum()
    feature_importances['cumulative_importance'] = np.cumsum(feature_importances['normalized_importance'])
    
    cumulative_importance=0.90 # Keep the top-ranked features whose cumulative importance stays within 90%; the remaining features are dropped. It can be changed.

    # Make sure most important features are on top
    feature_importances = feature_importances.sort_values('cumulative_importance')

    # Identify the features not needed to reach the cumulative_importance
    record_low_importance = feature_importances[feature_importances['cumulative_importance'] > cumulative_importance]

    to_drop = list(record_low_importance['feature'])
#     print(feature_importances.drop(['importance'],axis=1))
    return to_drop
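
The cumulative-importance cut-off can be traced on a small example. The sketch below uses hypothetical importance scores; it follows the same rule as the function above, where any feature whose cumulative share exceeds the threshold is dropped, including the feature that first crosses it.

```python
import pandas as pd

# Hypothetical raw importance scores
importances = pd.Series({
    "Flow_Duration": 50,
    "Src_Port": 30,
    "Idle_Mean": 15,
    "FIN_Flag_Cnt": 5,
})

ranked = importances.sort_values(ascending=False)
cumulative = (ranked / ranked.sum()).cumsum()  # 0.50, 0.80, 0.95, 1.00

threshold = 0.90
to_drop = cumulative[cumulative > threshold].index.tolist()
print(cumulative)
print(to_drop)
```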

In [23]:
# Remove redundant features
def Feature_Redundancy_Pearson(data):
    correlation_threshold=0.90 # Only remove features with the redundancy>90%. It can be changed
    features = data.drop(['Label'],axis=1)
    corr_matrix = features.corr()

    # Extract the upper triangle of the correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))  # np.bool was removed from NumPy; the built-in bool works everywhere

    # Select the features with correlations above the threshold
    # Need to use the absolute value
    to_drop = [column for column in upper.columns if any(upper[column].abs() > correlation_threshold)]

    # Dataframe to hold correlated pairs
    record_collinear = pd.DataFrame(columns = ['drop_feature', 'corr_feature', 'corr_value'])

    # Iterate through the columns to drop
    for column in to_drop:

        # Find the correlated features
        corr_features = list(upper.index[upper[column].abs() > correlation_threshold])

        # Find the correlated values
        corr_values = list(upper[column][upper[column].abs() > correlation_threshold])
        drop_features = [column for _ in range(len(corr_features))]    

        # Record the information (need a temp df for now)
        temp_df = pd.DataFrame.from_dict({'drop_feature': drop_features,
                                         'corr_feature': corr_features,
                                         'corr_value': corr_values})
        record_collinear = pd.concat([record_collinear, temp_df], ignore_index = True)  # DataFrame.append was removed in pandas 2.0
#     print(record_collinear)
    return to_drop
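
A toy example helps show why only the upper triangle of the correlation matrix is inspected: each correlated pair is counted once, and the later column of the pair is the one removed. The frame below is hypothetical, with column `b` constructed as an exact multiple of `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical features: b duplicates the information in a, c is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

threshold = 0.90
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(upper)
print(to_drop)
```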

In [24]:
def Auto_Feature_Engineering(df):
    drop1 = Feature_Importance_IG(df)
    dfh1 = df.drop(columns = drop1)
    
    drop2 = Feature_Redundancy_Pearson(dfh1)
    dfh2 = dfh1.drop(columns = drop2)
    
    return dfh2

In [25]:
dfh2 = Auto_Feature_Engineering(df)
dfh2

Unnamed: 0,Flow_ID,Src_IP,Src_Port,Dst_IP,Dst_Port,Timestamp,Flow_Duration,TotLen_Bwd_Pkts,Flow_Byts/s,Flow_Pkts/s,Flow_IAT_Std,Fwd_IAT_Tot,Bwd_IAT_Std,Bwd_Pkts/s,Pkt_Len_Std,Init_Bwd_Win_Byts,Idle_Mean,Label
0,0.198248,0.446464,0.821049,0.413136,0.148302,0.874767,0.001241,0.123680,0.220603,0.006440,0.000000,0.000000,0.0,0.016109,0.000000,0.028534,0.002506,1
1,0.000765,0.446413,0.931680,0.222458,0.127585,0.848929,0.466032,0.000000,0.000539,0.000023,0.336694,0.458352,0.0,0.000023,0.829215,0.001938,0.311194,1
2,0.198045,0.446413,0.139234,0.419492,0.867104,0.525838,0.000726,0.061840,0.374723,0.010947,0.000000,0.000000,0.0,0.013679,0.000000,0.028534,0.001475,1
3,0.197967,0.446464,0.813747,0.413136,0.148302,0.456471,0.001493,0.000000,0.000000,0.008042,0.000011,0.000843,0.0,0.006691,0.000000,0.498657,0.001506,1
4,0.197967,0.446464,0.813747,0.413136,0.148302,0.442970,0.000827,0.000000,0.000000,0.009627,0.000000,0.000000,0.0,0.012028,0.000000,0.494736,0.001678,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6247,0.197483,0.446464,0.156291,0.413136,0.148302,0.245577,0.000696,0.123680,0.390782,0.011417,0.000000,0.000000,0.0,0.028552,0.000000,0.028534,0.001415,1
6248,0.213862,0.446413,0.928608,0.483051,0.146312,0.759311,0.000394,0.001426,0.039416,0.049988,0.000116,0.000216,0.0,0.024980,0.000000,0.000000,0.000202,1
6249,0.119352,0.269300,0.131007,0.413136,0.009109,0.860102,0.012935,0.000000,0.000000,0.000611,0.000000,0.000000,0.0,0.001539,0.000000,0.222794,0.025932,1
6250,0.978750,0.446516,1.000000,0.483051,0.164217,0.742551,0.002230,0.001426,0.005682,0.007195,0.001788,0.000182,0.0,0.004484,0.000000,0.000000,0.001496,1


## Data Split & Balancing (After Feature Engineering)

In [26]:
X = dfh2.drop(['Label'],axis=1)
y = dfh2['Label']

#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2,random_state = 0)

In [27]:
X_train, y_train = Auto_Balancing(X_train, y_train)

# 3. Automated Model Selection
Select the best-performing model among five common machine learning models (Naive Bayes, KNN, random forest, LightGBM, and ANN/MLP) by evaluating their learning performance

### Method 1: Grid Search

In [28]:
# Create a pipeline
pipe = Pipeline([('classifier', GaussianNB())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [GaussianNB()]},
                {'classifier': [KNeighborsClassifier()]},
                {'classifier': [RandomForestClassifier()]},
                {'classifier': [lgb.LGBMClassifier(verbose = -1)]},
                {'classifier': [KerasClassifier(build_fn=ANN, verbose=0)]},
                 ]

In [29]:
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

In [30]:
clf.fit(X, y)

GridSearchCV(cv=5, estimator=Pipeline(steps=[('classifier', GaussianNB())]),
             param_grid=[{'classifier': [GaussianNB()]},
                         {'classifier': [KNeighborsClassifier()]},
                         {'classifier': [RandomForestClassifier()]},
                         {'classifier': [LGBMClassifier(verbose=-1)]},
                         {'classifier': [<keras.wrappers.scikit_learn.KerasClassifier object at 0x0000027B91744320>]}])

In [31]:
print("Best Model:"+ str(clf.best_params_))
print("Accuracy:"+ str(clf.best_score_))

Best Model:{'classifier': LGBMClassifier(verbose=-1)}
Accuracy:0.9993601278976818


In [32]:
clf.cv_results_

{'mean_fit_time': array([3.19132805e-03, 2.39443779e-03, 3.23404932e-01, 1.25419044e-01,
        3.47803822e+00]),
 'std_fit_time': array([0.00039907, 0.00048784, 0.01115818, 0.00786784, 0.06721689]),
 'mean_score_time': array([0.00179529, 0.14380698, 0.01518211, 0.00498695, 0.07730794]),
 'std_score_time': array([0.00039887, 0.00479374, 0.00038913, 0.0006309 , 0.00146038]),
 'param_classifier': masked_array(data=[GaussianNB(), KNeighborsClassifier(),
                    RandomForestClassifier(), LGBMClassifier(verbose=-1),
                    <keras.wrappers.scikit_learn.KerasClassifier object at 0x0000027B91744320>],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'classifier': GaussianNB()},
  {'classifier': KNeighborsClassifier()},
  {'classifier': RandomForestClassifier()},
  {'classifier': LGBMClassifier(verbose=-1)},
  {'classifier': <keras.wrappers.scikit_learn.KerasClassifier at 0x27b91744320>}],
 'split0

The LightGBM model is the best-performing model, with a five-fold cross-validation accuracy of 99.936%.

### Method 2: Bayesian Optimization with Tree Parzen Estimator (BO-TPE)

In [33]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    
    classifier_type = params['type']
    del params['type']
    if classifier_type == 'nb':
        clf = GaussianNB()
    elif classifier_type == 'knn':
        clf = KNeighborsClassifier()
    elif classifier_type == 'rf':
        clf = RandomForestClassifier()
    elif classifier_type == 'lgb':
        clf = lgb.LGBMClassifier(verbose = -1)
    elif classifier_type == 'ann':
        clf = KerasClassifier(build_fn=ANN, verbose=0)
    else:
        return 0
    
    clf.fit(X_train,y_train)
    predictions = clf.predict(X_test)
    score = accuracy_score(y_test,predictions)
    return {'loss':-score, 'status': STATUS_OK }

# Define the search space: the five candidate model types
space = hp.choice('classifier_type', [{'type': 'nb'},{'type': 'knn'},{'type': 'rf'},{'type': 'lgb'},{'type': 'ann'},])

# Find the best-performing model type
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10)
print("Hyperopt estimated optimum {}".format(best))

100%|██████████████████████████████████████████████████████████████| 10/10 [00:20<00:00,  2.06s/trial, best loss: -1.0]
Hyperopt estimated optimum {'classifier_type': 3}


`classifier_type` 3 is the zero-based index into the `hp.choice` list, which corresponds to the LightGBM model; its hold-out accuracy is 100.0%.
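
Because `fmin` reports the position chosen by `hp.choice` rather than the option itself, the index has to be mapped back to a model name. A minimal sketch, assuming the candidate list is kept in the same order as in the search space:

```python
# Same order as the options passed to hp.choice in the search space
candidates = ["nb", "knn", "rf", "lgb", "ann"]

# Value reported by fmin in the run above
best = {"classifier_type": 3}

best_model = candidates[best["classifier_type"]]
print(best_model)
```

Hyperopt also provides `space_eval(space, best)`, which resolves the chosen option directly from the search space.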

# 4. Hyperparameter Optimization
Optimize the best-performing model (LightGBM) by tuning its hyperparameters.

## Cross validation

In [34]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']),
        'learning_rate': abs(float(params['learning_rate'])),
        "num_leaves": int(params['num_leaves']),
        "min_child_samples": int(params['min_child_samples']),
    }
    clf = lgb.LGBMClassifier( **params)
    score = cross_val_score(clf, X, y, scoring='accuracy', cv=StratifiedKFold(n_splits=5)).mean()
    return {'loss':-score, 'status': STATUS_OK }

# Define the hyperparameter configuration space
space = {
    'n_estimators': hp.quniform('n_estimators', 50, 500, 20),
    'max_depth': hp.quniform('max_depth', 5, 50, 1),
    "learning_rate":hp.uniform('learning_rate', 0, 1),
    "num_leaves":hp.quniform('num_leaves',100,2000,100),
    "min_child_samples":hp.quniform('min_child_samples',10,50,5),
}

# Detect the optimal hyperparameter values
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20)
print("LightGBM: Hyperopt estimated optimum {}".format(best))

100%|██████████████████████████████████████████████████████████| 20/20 [00:26<00:00,  1.34s/trial, best loss: -0.99968]
LightGBM: Hyperopt estimated optimum {'learning_rate': 0.5636571315681871, 'max_depth': 16.0, 'min_child_samples': 50.0, 'n_estimators': 180.0, 'num_leaves': 1800.0}
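
`hp.quniform` returns floats such as `16.0`, while LightGBM expects integers for tree-structure parameters. The helper below is a hypothetical convenience (not part of the notebook) that casts the reported optimum so it can be passed on without retyping the values by hand.

```python
# Parameters that LightGBM expects as integers
INTEGER_PARAMS = {"n_estimators", "max_depth", "num_leaves", "min_child_samples"}

def to_lgbm_params(best):
    """Cast the floats returned by hp.quniform back to integers where needed."""
    return {name: int(value) if name in INTEGER_PARAMS else float(value)
            for name, value in best.items()}

# Optimum reported by the run above
best = {'learning_rate': 0.5636571315681871, 'max_depth': 16.0,
        'min_child_samples': 50.0, 'n_estimators': 180.0, 'num_leaves': 1800.0}

params = to_lgbm_params(best)
print(params)
# The result could then be used as, e.g., lgb.LGBMClassifier(**params)
```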


In [36]:
%%time
clf = lgb.LGBMClassifier(max_depth=16, learning_rate=  0.5636571315681871, n_estimators = 180, 
                         num_leaves = 1800, min_child_samples = 50)
clf.fit(X,y)
scores = cross_val_score(clf, X, y, cv=5,scoring='accuracy')
print("Accuracy: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='precision')
print("Precision: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='recall')
print("Recall: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='f1')
print("F1-score: "+ str(round(scores.mean(),5)*100)+"%")

Accuracy: 99.968%
Precision: 99.983%
Recall: 99.983%
F1-score: 99.983%
Wall time: 2.77 s


After hyperparameter optimization, the cross-validation accuracy improves from 99.936% to 99.968%.

## Hold-out validation

In [35]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']),
        'learning_rate': abs(float(params['learning_rate'])),
        "num_leaves": int(params['num_leaves']),
        "min_child_samples": int(params['min_child_samples']),
    }
    clf = lgb.LGBMClassifier( **params)
    clf.fit(X_train,y_train)
    predictions = clf.predict(X_test)
    score = accuracy_score(y_test,predictions)
    return {'loss':-score, 'status': STATUS_OK }

# Define the hyperparameter configuration space
space = {
    'n_estimators': hp.quniform('n_estimators', 50, 500, 20),
    'max_depth': hp.quniform('max_depth', 5, 50, 1),
    "learning_rate":hp.uniform('learning_rate', 0, 1),
    "num_leaves":hp.quniform('num_leaves',100,2000,100),
    "min_child_samples":hp.quniform('min_child_samples',10,50,5),
}

# Detect the optimal hyperparameter values
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50)
print("LightGBM: Hyperopt estimated optimum {}".format(best))

100%|██████████████████████████████████████████████████████████████| 50/50 [00:15<00:00,  3.29trial/s, best loss: -1.0]
LightGBM: Hyperopt estimated optimum {'learning_rate': 0.17566405992887468, 'max_depth': 45.0, 'min_child_samples': 45.0, 'n_estimators': 300.0, 'num_leaves': 400.0}


In [37]:
%%time
clf = lgb.LGBMClassifier(max_depth=45, learning_rate= 0.17566405992887468, n_estimators = 300, 
                         num_leaves = 400, min_child_samples = 45)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
F1-score: 100.0%
Wall time: 355 ms


The hold-out accuracy was already 100.0% before hyperparameter optimization, so tuning leaves it unchanged at 100.0%.

# 5. Combined Algorithm Selection and Hyperparameter Tuning (CASH)
CASH is the process of combining the two AutoML procedures: model selection and hyperparameter optimization.
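
The essence of CASH is a conditional search space: the algorithm is itself a choice, and each algorithm brings its own hyperparameter ranges. The sketch below illustrates this with plain random search on synthetic data. Random search is swapped in here purely for brevity and is not the PSO method used in this tutorial; the dataset, the candidate algorithms, and the ranges are all assumptions.

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the IoTID20 dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Each algorithm carries its own conditional integer hyperparameter ranges
search_space = {
    "naive-bayes": {},
    "k-nn": {"n_neighbors": (3, 10)},
    "decision-tree": {"max_depth": (2, 10), "min_samples_leaf": (1, 10)},
}
builders = {
    "naive-bayes": GaussianNB,
    "k-nn": KNeighborsClassifier,
    "decision-tree": DecisionTreeClassifier,
}

def sample_configuration(rng):
    """Pick an algorithm, then sample only the hyperparameters that belong to it."""
    algorithm = rng.choice(sorted(search_space))
    params = {name: rng.randint(low, high)
              for name, (low, high) in search_space[algorithm].items()}
    return algorithm, params

rng = random.Random(0)
best_score, best_config = -1.0, None
for _ in range(15):
    algorithm, params = sample_configuration(rng)
    model = builders[algorithm](**params)
    score = cross_val_score(model, X_toy, y_toy, cv=3, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_config = score, (algorithm, params)

print(best_config, round(best_score, 4))
```

A smarter optimizer such as PSO or BO-TPE replaces the random sampling step, but the joint algorithm-plus-hyperparameter space has the same shape.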

## Method: Particle Swarm Optimization (PSO)

In [36]:
import optunity
import optunity.metrics

search = {'algorithm': {'k-nn': {'n_neighbors': [3, 10]},
                        'naive-bayes': None,
                        'random-forest': {
                                'n_estimators': [20, 100],
                                'max_features': [5, 12],
                                'max_depth': [5,50],
                                "min_samples_split":[2,11],
                                "min_samples_leaf":[1,11]},
                        'lightgbm': {
                                'n_estimators': [20, 100],
                                'max_depth': [5, 50],
                                'learning_rate': (0, 1),
                                "num_leaves":[100, 2000],
                                "min_child_samples":[10, 50],
                                    },
                        'ann': {
                                'neurons': [10, 100],
                                'epochs': [20, 50],
                                'patience': [3, 20],
                                }
                        }
          
         }
def performance(
                algorithm, n_neighbors=None, 
    n_estimators=None, max_features=None,max_depth=None,min_samples_split=None,min_samples_leaf=None,
    learning_rate=None,num_leaves=None,min_child_samples=None,
    neurons=None,epochs=None,patience=None
):
    # Build the selected model with its sampled hyperparameters
    if algorithm == 'k-nn':
        model = KNeighborsClassifier(n_neighbors=int(n_neighbors))
    elif algorithm == 'naive-bayes':
        model = GaussianNB()
    elif algorithm == 'random-forest':
        model = RandomForestClassifier(n_estimators=int(n_estimators),
                                       max_features=int(max_features),
                                       max_depth=int(max_depth),
                                       min_samples_split=int(min_samples_split),
                                       min_samples_leaf=int(min_samples_leaf))
    elif algorithm == 'lightgbm':
        model = lgb.LGBMClassifier(n_estimators=int(n_estimators),
                                   max_depth=int(max_depth),
                                   learning_rate=float(learning_rate),
                                   num_leaves=int(num_leaves),
                                   min_child_samples=int(min_child_samples),
                                  )
    elif algorithm == 'ann':
        model = KerasClassifier(build_fn=ANN, verbose=0,
                               neurons=int(neurons),
                                epochs=int(epochs),
                                patience=int(patience)
                               )
    else:
        raise ValueError('Unknown algorithm: %s' % algorithm)
    # Fit the model on the training set and predict the test set
    model.fit(X_train,y_train)
    prediction = model.predict(X_test)
    score = accuracy_score(y_test,prediction)
    return score

# Run the CASH process
optimal_configuration, info, _ = optunity.maximize_structured(performance, 
                                                              search_space=search, 
                                                              num_evals=50)
print(optimal_configuration)
print(info.optimum)

{'algorithm': 'lightgbm', 'n_neighbors': None, 'learning_rate': 0.88427734375, 'max_depth': 28.44482421875, 'min_child_samples': 39.66796875, 'n_estimators': 78.9453125, 'num_leaves': 251.220703125}
1.0


In [38]:
%%time
clf = lgb.LGBMClassifier(max_depth=28, learning_rate= 0.88427734375, n_estimators = 78, 
                         num_leaves = 251, min_child_samples = 40)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
F1-score: 100.0%
Wall time: 140 ms


LightGBM with the above hyperparameter values is identified as the optimal model.