# Case Study: Automated Machine Learning (AutoML) for Autonomous Intrusion Detection System Development 
This is the code for the paper entitled "**[Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis](https://ieeexplore.ieee.org/document/10472316)**" published in *IEEE Transactions on Network and Service Management* (IF:5.3).<br>
Authors: Li Yang (liyanghart@gmail.com), Mirna El Rajab, Abdallah Shami, and Sami Muhaidat<br>

L. Yang, M. E. Rajab, A. Shami, and S. Muhaidat, "Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis," IEEE Transactions on Network and Service Management, pp. 1-28, 2024, doi: https://doi.org/10.1109/TNSM.2024.3376631.

# Code Part 1: Automated Offline/Static/Batch Learning
Batch learning methods analyze static data in batches and often need access to the entire dataset before model training. Traditional ML algorithms can effectively solve batch learning tasks. Although batch learning models often achieve high performance because they can learn diverse data patterns, they are difficult to update once created. Batch learning therefore faces two significant challenges: model degradation and data unavailability.

## Dataset 2: 5G-NIDD
A subset of the network traffic data randomly sampled from the [5G-NIDD dataset](https://ieee-dataport.org/documents/5g-nidd-comprehensive-network-intrusion-detection-dataset-generated-over-5g-wireless).  

The 5G-NIDD dataset, created in December 2022, is a fully labeled resource constructed on a functional 5G test network for researchers and practitioners evaluating AI/ML solutions in the context of 5G/6G security [87]. 5G-NIDD encompasses data extracted from a 5G testbed connected to the 5G Test Network (5GTN) at the University of Oulu, Finland. The dataset is derived from two base stations, each featuring an attacker node and multiple benign 5G users. The attacker nodes target a server deployed within the 5GTN MEC environment. The attack scenarios captured in the dataset primarily include DoS attacks and port scans.

## Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
import lightgbm as lgb
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
from scipy.stats import shapiro
from imblearn.over_sampling import SMOTE, ADASYN
from hyperopt import fmin, tpe, hp
import time

In [2]:
import warnings 
warnings.filterwarnings('ignore')

## Read the sampled 5G-NIDD dataset

In [63]:
df = pd.read_csv("Data/5gnidd_0.01_pre-processed.csv")

In [64]:
df

Unnamed: 0,Seq,Dur,RunTime,Mean,Sum,Min,Max,Proto,sTos,dTos,...,SrcWin,DstWin,sVid,dVid,SrcTCPBase,DstTCPBase,TcpRtt,SynAck,AckDat,Label
0,0.000496,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.75,0.000000,0.000000,...,0.000002,0.00000,0.0,0.0,0.388654,0.000000,0.0,0.0,0.0,1
1,0.002442,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.75,0.000000,0.000000,...,0.000000,0.00000,0.0,0.0,0.388654,0.000000,0.0,0.0,0.0,1
2,0.003295,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.75,0.000000,0.000000,...,0.000000,0.00000,0.0,0.0,0.388654,0.000000,0.0,0.0,0.0,1
3,0.003317,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.75,0.000000,0.000000,...,0.000000,0.00000,0.0,0.0,0.388654,0.000000,0.0,0.0,0.0,1
4,0.003375,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.75,0.000000,0.000000,...,0.000000,0.00000,0.0,0.0,0.388669,0.000000,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12154,0.003688,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.00,0.000000,0.000000,...,0.000000,0.00000,1.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0
12155,0.004104,2.000005e-07,2.000005e-07,2.000005e-07,2.000005e-07,2.000005e-07,2.000005e-07,0.75,0.000000,0.000000,...,0.000008,0.00025,0.0,0.0,0.292755,0.030151,0.0,0.0,0.0,0
12156,0.005234,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,1.00,0.000000,0.000000,...,0.000000,0.00000,1.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0
12157,0.000015,4.079091e-02,4.079091e-02,4.079091e-02,4.079091e-02,4.079091e-02,4.079091e-02,0.50,0.830357,0.215054,...,0.000000,0.00000,0.0,1.0,0.000000,0.000000,0.0,0.0,0.0,0


# 1. Automated Data Pre-Processing

## Automated Transformation/Encoding
Automatically identify string/text features and transform them into numerical features with label encoding, so that ML models can process them.

In [65]:
# Define the automated data encoding function
def Auto_Encoding(df):
    cat_features=[x for x in df.columns if df[x].dtype=="object"] ## Find string/text features
    le=LabelEncoder()
    for col in cat_features:
        # Transform each string/text column into integer codes
        df[col] = le.fit_transform(df[col].astype(str))
    return df

In [66]:
df=Auto_Encoding(df)
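
As a minimal illustration of the encoding step, the sketch below applies label encoding to a toy DataFrame. The column names and values are made up for the example and are not taken from 5G-NIDD. `LabelEncoder` assigns integer codes in sorted order of the distinct strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not from the 5G-NIDD dataset)
toy = pd.DataFrame({
    "Proto": ["tcp", "udp", "tcp", "icmp"],
    "Dur": [0.1, 0.2, 0.3, 0.4],
})

# Encode every non-numeric column; numeric columns are left untouched
text_columns = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
for col in text_columns:
    toy[col] = LabelEncoder().fit_transform(toy[col].astype(str))

print(toy["Proto"].tolist())
```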

## Automated normalization
Normalize the range of features to a similar scale to improve data quality

In [67]:
def Auto_Normalization(df):
    stat, p = shapiro(df)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    # interpret
    alpha = 0.05
    numeric_features = df.drop(['Label'],axis = 1).select_dtypes(exclude=['object']).columns
    
    # check if the data distribution follows a Gaussian/normal distribution
    # If so, select the Z-score normalization method; otherwise, select the min-max normalization
    # Details are in the paper
    if p > alpha:
        print('Sample looks Gaussian (fail to reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.mean()) / (x.std()))
        print('Z-score normalization is automatically chosen and used')
    else:
        print('Sample does not look Gaussian (reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.min()) / (x.max()-x.min()))
        print('Min-max normalization is automatically chosen and used')
    return df

In [68]:
df=Auto_Normalization(df)

Statistics=0.563, p=0.000
Sample does not look Gaussian (reject H0)
Min-max normalization is automatically chosen and used
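
The selection rule can be sketched on a single toy feature as follows. This is a simplified illustration under stated assumptions: the function name `choose_normalization` and the synthetic sample are hypothetical, and the test is applied to one column rather than to the whole DataFrame as in the cell above.

```python
import numpy as np
from scipy.stats import shapiro

def choose_normalization(values, alpha=0.05):
    """Return the scaled values and the name of the method that was chosen."""
    values = np.asarray(values, dtype=float)
    _, p = shapiro(values)
    if p > alpha:
        # Normality is not rejected: standardize to zero mean and unit variance
        return (values - values.mean()) / values.std(ddof=1), "z-score"
    # Normality is rejected: rescale to the [0, 1] range
    return (values - values.min()) / (values.max() - values.min()), "min-max"

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)  # clearly non-Gaussian toy sample
scaled, method = choose_normalization(skewed)
print(method, scaled.min(), scaled.max())
```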


## Automated Imputation
Detect and impute missing values to improve data quality

In [69]:
# Define the automated data imputation function
def Auto_Imputation(df):
    # Replace infinities with NaN to unify the treatment of missing data
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    
    # Identify numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    
    # Apply imputation only if necessary
    if df[numeric_cols].isnull().values.any():
        for col in numeric_cols:
            # Check if the entire column is NaN
            if df[col].isna().all():
                # Optional: fill such columns with a placeholder value or drop them
                # Here, we choose to fill with 0 as an example
                df[col] = df[col].fillna(0)
            else:
                # Apply median imputation
                df[col] = df[col].fillna(df[col].median())
    
    # Final check for NaN or infinite values in numeric columns
    if df[numeric_cols].isnull().any().any() or np.isinf(df[numeric_cols].values).any():
        raise ValueError("Numeric data still contains NaN, infinity or a value too large after imputation.")
    
    return df


In [70]:
df=Auto_Imputation(df)
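
The following toy example walks through the same imputation logic on a small DataFrame. The column names and values are hypothetical and only serve to show how infinities, partially missing columns, and fully missing columns are handled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Dur": [1.0, np.inf, 3.0, np.nan],         # contains inf and NaN
    "Rate": [np.nan, np.nan, np.nan, np.nan],  # entirely missing
})

# Treat infinities as missing values
toy = toy.replace([np.inf, -np.inf], np.nan)

for col in toy.columns:
    if toy[col].isna().all():
        toy[col] = toy[col].fillna(0)                  # placeholder for empty columns
    else:
        toy[col] = toy[col].fillna(toy[col].median())  # median of the observed values

print(toy)
```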

## Train-test split
Split the dataset into the training and the test set

In [71]:
X = df.drop(['Label'],axis=1)
y = df['Label']

# An 80%/20% split is used here; it can be changed based on specific tasks
#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2,random_state = 0)

## Automated data balancing
Generate minority-class samples to address class imbalance and improve data quality.  
The Adaptive Synthetic (ADASYN) sampling method is used.

In [72]:
pd.Series(y_train).value_counts()

1    5927
0    3800
Name: Label, dtype: int64

In [73]:
# For binary data (can be modified for multi-class data with the same logic)
def Auto_Balancing(X_train, y_train):
    number0 = pd.Series(y_train).value_counts().iloc[0]
    number1 = pd.Series(y_train).value_counts().iloc[1]
    
    if number0 > number1:
        nlarge = number0
    else:
        nlarge = number1
    
    # Check whether the training set is imbalanced (one class has more than 1.5 times as many samples as the other)
    if (number1/number0 > 1.5) or (number0/number1 > 1.5):
        balanced=ADASYN(n_jobs=-1,sampling_strategy={0:nlarge, 1:nlarge})

        X_train, y_train = balanced.fit_resample(X_train, y_train)
        
    return X_train, y_train

In [74]:
X_train, y_train = Auto_Balancing(X_train, y_train)

In [75]:
pd.Series(y_train).value_counts()

1    5927
0    5912
Name: Label, dtype: int64
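
The imbalance test inside `Auto_Balancing` reduces to comparing the majority/minority ratio against 1.5. The sketch below isolates that check and applies it to the class counts printed before and after balancing; the helper name `is_imbalanced` is hypothetical.

```python
from collections import Counter

def is_imbalanced(labels, threshold=1.5):
    """True when the majority class has more than `threshold` times the minority samples."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()) > threshold

# Class counts reported above, before and after balancing
before = [1] * 5927 + [0] * 3800
after = [1] * 5927 + [0] * 5912
print(is_imbalanced(before), is_imbalanced(after))
```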

## Model learning (for Comparison)

In [76]:
%%time
lg = lgb.LGBMClassifier(verbose = -1)
lg.fit(X_train,y_train)
t1=time.time()
predictions = lg.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.87700000000001%
Precision: 99.799%
Recall: 100.0%
F1-score: 99.899%
Time: 2.05038
Wall time: 183 ms
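
In these cells, `Time` is the average prediction time per test sample in microseconds, since the elapsed seconds are divided by `len(y_test)` and multiplied by 1,000,000. Values such as `99.87700000000001%` come from rounding before multiplying by 100. A possible reporting helper that scales first and rounds afterwards is sketched below; the function name `summarize` and the toy labels are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def summarize(y_true, y_pred):
    """Return the four metrics as percentages rounded to three decimals."""
    scorers = {
        "Accuracy": accuracy_score,
        "Precision": precision_score,
        "Recall": recall_score,
        "F1-score": f1_score,
    }
    # Scale first, then round, so the printed value has no floating-point tail
    return {name: round(fn(y_true, y_pred) * 100, 3) for name, fn in scorers.items()}

# Toy labels (hypothetical)
result = summarize([1, 1, 0, 0], [1, 0, 0, 0])
print(result)
```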


In [77]:
%%time
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
t1=time.time()
predictions = rf.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.91799999999999%
Precision: 99.933%
Recall: 99.933%
F1-score: 99.933%
Time: 9.42224
Wall time: 770 ms


In [78]:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
t1=time.time()
predictions = knn.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.383%
Precision: 99.59599999999999%
Recall: 99.396%
F1-score: 99.496%
Time: 205.88003
Wall time: 507 ms


In [79]:
import tensorflow as tf
from keras.layers import Input,Dense,Dropout,BatchNormalization,Activation
from keras import Model
import keras.backend as K
import keras.callbacks as kcallbacks
from keras import optimizers
from keras.optimizers import Adam

from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
def ANN(optimizer = 'sgd',neurons=32,batch_size=1024,epochs=80,activation='relu',patience=8,loss='binary_crossentropy'):
    K.clear_session()
    inputs=Input(shape=(X.shape[1],))
    x=Dense(1000)(inputs)
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.3)(x)
    x=Dense(256)(inputs)  # NOTE: connected to `inputs`, so the Dense(1000) block above does not feed the output
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.25)(x)
    x=Dense(2,activation='softmax')(x)
    model=Model(inputs=inputs,outputs=x,name='base_nlp')
    model.compile(optimizer='adam',loss='categorical_crossentropy')
#     model.compile(optimizer=Adam(lr = 0.01),loss='categorical_crossentropy',metrics=['accuracy'])
    early_stopping = EarlyStopping(monitor="loss", patience = patience)# early stop patience
    history = model.fit(X, pd.get_dummies(y).values,
              batch_size=batch_size,
              epochs=epochs,
              callbacks = [early_stopping],
              verbose=0) #verbose set to 1 will show the training process
    return model

Using TensorFlow backend.


In [81]:
%%time
ann = KerasClassifier(build_fn=ANN, verbose=0)
ann.fit(X_train,y_train)
predictions = ann.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
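# NOTE: t1 and t2 are not reset in this cell, so the value below repeats the KNN timing from the previous cell
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))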

Accuracy: 99.507%
Precision: 99.59700000000001%
Recall: 99.59700000000001%
F1-score: 99.59700000000001%
Time: 205.88003
Wall time: 5.01 s


# 2. Automated Feature Engineering
Feature selection method 1: **Recursive Feature Elimination (RFE)**, used to remove irrelevant features to improve model efficiency  
Feature selection method 2: **Pearson Correlation**, used to remove redundant features to improve model efficiency and accuracy  

In [82]:
def Feature_Importance_RFE(data, n_features_to_select=20):
    features = data.drop(['Label'], axis=1).values  # "Label" should be changed to the target class variable name if different
    labels = data['Label'].values

    # Extract feature names
    feature_names = list(data.drop(['Label'], axis=1).columns)

    # Create a base estimator
    model = lgb.LGBMRegressor(verbose = -1)

    # Create the RFE object and rank each feature
    rfe = RFE(estimator=model, n_features_to_select=n_features_to_select, step=1)
    rfe.fit(features, labels)

    # Get the feature ranking
    feature_ranking = rfe.ranking_

    # Create a DataFrame for feature importances
    feature_importances = pd.DataFrame({'feature': feature_names, 'ranking': feature_ranking})

    # Sort features according to their ranking
    feature_importances = feature_importances.sort_values('ranking', ascending=True).reset_index(drop=True)

    # Get the features to drop
    to_drop = list(feature_importances[feature_importances['ranking'] > 1]['feature'])

    return to_drop
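
To show how the RFE ranking maps to kept and dropped features, here is a small sketch on synthetic data. It is an illustration only: a decision tree stands in for the LightGBM estimator used above to keep the example dependency-light, and the dataset is generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 features, 3 of them informative
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

# Eliminate one feature per round until 4 remain
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=4, step=1)
rfe.fit(X_toy, y_toy)

# Rank 1 marks the selected features; higher ranks were eliminated earlier
kept = [i for i, rank in enumerate(rfe.ranking_) if rank == 1]
dropped = [i for i, rank in enumerate(rfe.ranking_) if rank > 1]
print("kept:", kept, "dropped:", dropped)
```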

In [83]:
# Remove redundant features
def Feature_Redundancy_Pearson(data):
    correlation_threshold=0.90 # Only remove features with the redundancy>90%. It can be changed
    features = data.drop(['Label'],axis=1)
    corr_matrix = features.corr()

    # Extract the upper triangle of the correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))

    # Select the features with correlations above the threshold
    # Need to use the absolute value
    to_drop = [column for column in upper.columns if any(upper[column].abs() > correlation_threshold)]

    # Dataframe to hold correlated pairs
    record_collinear = pd.DataFrame(columns = ['drop_feature', 'corr_feature', 'corr_value'])

    # Iterate through the columns to drop
    for column in to_drop:

        # Find the correlated features
        corr_features = list(upper.index[upper[column].abs() > correlation_threshold])

        # Find the correlated values
        corr_values = list(upper[column][upper[column].abs() > correlation_threshold])
        drop_features = [column for _ in range(len(corr_features))]    

        # Record the information (need a temp df for now)
        temp_df = pd.DataFrame.from_dict({'drop_feature': drop_features,
                                         'corr_feature': corr_features,
                                         'corr_value': corr_values})
        record_collinear = pd.concat([record_collinear, temp_df], ignore_index = True)
#     print(record_collinear)
    return to_drop
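
The redundancy filter can be checked on a toy DataFrame in which one feature is almost a scaled copy of another. The feature names and data below are synthetic and only illustrate the upper-triangle logic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                       # independent feature
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one that gets dropped.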

In [84]:
def Auto_Feature_Engineering(df):
    drop1 = Feature_Importance_RFE(df)
    dfh1 = df.drop(columns = drop1)
    
    drop2 = Feature_Redundancy_Pearson(dfh1)
    dfh2 = dfh1.drop(columns = drop2)
    
    return dfh2

In [85]:
dfh2 = Auto_Feature_Engineering(df)
dfh2

Unnamed: 0,Seq,Dur,Proto,sTtl,dTtl,sHops,TotBytes,SrcBytes,Offset,sMeanPktSz,...,Load,pLoss,Rate,SrcWin,DstWin,SrcTCPBase,TcpRtt,SynAck,AckDat,Label
0,0.000496,0.000000e+00,0.75,0.211765,0.000000,0.357143,0.000007,0.000092,0.000201,0.042208,...,0.000000,0.0,0.000000,0.000002,0.00000,0.388654,0.0,0.0,0.0,1
1,0.002442,0.000000e+00,0.75,0.145098,0.000000,0.964286,0.000007,0.000092,0.000850,0.042208,...,0.000000,0.0,0.000000,0.000000,0.00000,0.388654,0.0,0.0,0.0,1
2,0.003295,0.000000e+00,0.75,0.149020,0.000000,0.928571,0.000007,0.000092,0.001133,0.042208,...,0.000000,0.0,0.000000,0.000000,0.00000,0.388654,0.0,0.0,0.0,1
3,0.003317,0.000000e+00,0.75,0.211765,0.000000,0.357143,0.000007,0.000092,0.001140,0.042208,...,0.000000,0.0,0.000000,0.000000,0.00000,0.388654,0.0,0.0,0.0,1
4,0.003375,0.000000e+00,0.75,0.211765,0.000000,0.357143,0.000007,0.000092,0.001160,0.042208,...,0.000000,0.0,0.000000,0.000000,0.00000,0.388669,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12154,0.003688,0.000000e+00,1.00,0.976471,0.000000,0.250000,0.000014,0.000117,0.003075,0.053852,...,0.000000,0.0,0.000000,0.000000,0.00000,0.000000,0.0,0.0,0.0,0
12155,0.004104,2.000005e-07,0.75,0.203922,0.250980,0.428571,0.000158,0.000257,0.003462,0.039298,...,0.557175,0.0,1.000000,0.000008,0.00025,0.292755,0.0,0.0,0.0,0
12156,0.005234,0.000000e+00,1.00,0.976471,0.000000,0.250000,0.000014,0.000117,0.004298,0.053852,...,0.000000,0.0,0.000000,0.000000,0.00000,0.000000,0.0,0.0,0.0,0
12157,0.000015,4.079091e-02,0.50,1.000000,0.980392,0.035714,0.000145,0.000304,0.004546,0.069862,...,0.000002,0.0,0.000003,0.000000,0.00000,0.000000,0.0,0.0,0.0,0


## Data Split & Balancing (After Feature Engineering)

In [86]:
X = dfh2.drop(['Label'],axis=1)
y = dfh2['Label']

#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2,random_state = 0)

In [87]:
X_train, y_train = Auto_Balancing(X_train, y_train)

# 3. Automated Model Selection
Select the best-performing model among four common machine learning models (KNN, random forest, LightGBM, and ANN/MLP) by comparing their cross-validation performance.

### Grid Search

In [93]:
# Create a pipeline
pipe = Pipeline([('classifier', KNeighborsClassifier())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [
                {'classifier': [KNeighborsClassifier()]},
                {'classifier': [RandomForestClassifier()]},
                {'classifier': [lgb.LGBMClassifier(verbose = -1)]},
                {'classifier': [KerasClassifier(build_fn=ANN, verbose=0)]},
                 ]

In [94]:
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

In [95]:
clf.fit(X, y)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('classifier', KNeighborsClassifier())]),
             param_grid=[{'classifier': [KNeighborsClassifier()]},
                         {'classifier': [RandomForestClassifier()]},
                         {'classifier': [LGBMClassifier(verbose=-1)]},
                         {'classifier': [<keras.wrappers.scikit_learn.KerasClassifier object at 0x000001658BACF3C8>]}])

In [96]:
print("Best Model:"+ str(clf.best_params_))
print("Accuracy:"+ str(clf.best_score_))

Best Model:{'classifier': RandomForestClassifier()}
Accuracy:0.9961346316222477


In [97]:
clf.cv_results_

{'mean_fit_time': array([2.59265900e-03, 4.29303312e-01, 1.42621040e-01, 5.53982763e+00]),
 'std_fit_time': array([4.88111152e-04, 1.78230136e-02, 9.09718641e-03, 6.47998989e-01]),
 'mean_score_time': array([0.38965359, 0.01775336, 0.00498633, 0.1201952 ]),
 'std_score_time': array([0.01900864, 0.00039945, 0.00089132, 0.01313867]),
 'param_classifier': masked_array(data=[KNeighborsClassifier(), RandomForestClassifier(),
                    LGBMClassifier(verbose=-1),
                    <keras.wrappers.scikit_learn.KerasClassifier object at 0x000001658BACF3C8>],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'classifier': KNeighborsClassifier()},
  {'classifier': RandomForestClassifier()},
  {'classifier': LGBMClassifier(verbose=-1)},
  {'classifier': <keras.wrappers.scikit_learn.KerasClassifier at 0x1658bacf3c8>}],
 'split0_test_score': array([0.92269737, 0.99917763, 0.99465461,        nan]),
 'split1_test_score': arra

Random forest is the best-performing machine learning model, with a best cross-validation accuracy of 99.613%.
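
The grid search above amounts to scoring each candidate with 5-fold cross-validation and keeping the one with the highest mean accuracy. A minimal sketch of that loop on synthetic data follows; the candidate set is an assumption chosen to need only scikit-learn, so it differs from the four models searched above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(),
    "random-forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "decision-tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validation accuracy for each candidate
scores = {name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```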

## Model learning (for Comparison)

In [98]:
%%time
lg = lgb.LGBMClassifier(verbose = -1)
lg.fit(X_train,y_train)
t1=time.time()
predictions = lg.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.91799999999999%
Precision: 99.933%
Recall: 99.933%
F1-score: 99.933%
Time: 2.46016
Wall time: 326 ms


In [99]:
%%time
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
t1=time.time()
predictions = rf.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.87700000000001%
Precision: 99.866%
Recall: 99.933%
F1-score: 99.899%
Time: 9.02354
Wall time: 726 ms


In [100]:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
t1=time.time()
predictions = knn.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 99.465%
Precision: 99.664%
Recall: 99.46300000000001%
F1-score: 99.563%
Time: 271.69228
Wall time: 672 ms


In [101]:
import tensorflow as tf
from keras.layers import Input,Dense,Dropout,BatchNormalization,Activation
from keras import Model
import keras.backend as K
import keras.callbacks as kcallbacks
from keras import optimizers
from keras.optimizers import Adam

from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
def ANN(optimizer = 'sgd',neurons=32,batch_size=1024,epochs=80,activation='relu',patience=8,loss='binary_crossentropy'):
    K.clear_session()
    inputs=Input(shape=(X_train.shape[1],))
    x=Dense(1000)(inputs)
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.3)(x)
    x=Dense(256)(inputs)  # NOTE: connected to `inputs`, so the Dense(1000) block above does not feed the output
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.25)(x)
    x=Dense(2,activation='softmax')(x)
    model=Model(inputs=inputs,outputs=x,name='base_nlp')
    model.compile(optimizer='adam',loss='categorical_crossentropy')
#     model.compile(optimizer=Adam(lr = 0.01),loss='categorical_crossentropy',metrics=['accuracy'])
    early_stopping = EarlyStopping(monitor="loss", patience = patience)# early stop patience
    history = model.fit(X_train, pd.get_dummies(y_train).values,
              batch_size=batch_size,
              epochs=epochs,
              callbacks = [early_stopping],
              verbose=0) #verbose set to 1 will show the training process
    return model

In [102]:
%%time
ann = KerasClassifier(build_fn=ANN, verbose=0)
ann.fit(X_train,y_train)
predictions = ann.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
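# NOTE: t1 and t2 are not reset in this cell, so the value below repeats the KNN timing from the previous cell
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))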

Accuracy: 98.15%
Precision: 97.5%
Recall: 99.53%
F1-score: 98.504%
Time: 271.69228
Wall time: 6.51 s


# 4. Hyperparameter Optimization
Optimize the top-performing machine learning models (LightGBM and random forest) by tuning their hyperparameters.

## Hold-out validation

In [103]:
#Particle Swarm Optimization
import optunity
import optunity.metrics

# Define the hyperparameter configuration space
search = {
    'n_estimators': [50, 500],
    'max_depth': [5, 50],
    'learning_rate': (0, 1),
    "num_leaves":[100, 2000],
    "min_child_samples":[10, 50],
         }
# Define the objective function
def performance(n_estimators=None, max_depth=None,learning_rate=None,num_leaves=None,min_child_samples=None):
    clf = lgb.LGBMClassifier(n_estimators=int(n_estimators),
                                   max_depth=int(max_depth),
                                   learning_rate=float(learning_rate),
                                   num_leaves=int(num_leaves),
                                   min_child_samples=int(min_child_samples),
                                  )
    clf.fit(X_train,y_train)
    prediction = clf.predict(X_test)
    score = accuracy_score(y_test,prediction)
    return score

# Detect the optimal hyperparameter values
optimal_configuration, info, _ = optunity.maximize(performance,
                                                  solver_name='particle swarm',
                                                  num_evals=20,
                                                   **search
                                                  )
print(optimal_configuration)
print("Accuracy:"+ str(info.optimum))

{'n_estimators': 192.3828125, 'max_depth': 41.38671875, 'learning_rate': 0.74609375, 'num_leaves': 122.265625, 'min_child_samples': 49.53125}
Accuracy:0.9995888157894737
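
The optimizer returns continuous values, while most of these hyperparameters are integers. The sketch below shows how the printed configuration maps to the integer arguments used in the next cell, following the same `int()` truncation as the objective function. The name `INTEGER_PARAMS` is introduced here for illustration.

```python
# Configuration printed by the PSO run above
optimal_configuration = {
    'n_estimators': 192.3828125, 'max_depth': 41.38671875,
    'learning_rate': 0.74609375, 'num_leaves': 122.265625,
    'min_child_samples': 49.53125,
}

INTEGER_PARAMS = {'n_estimators', 'max_depth', 'num_leaves', 'min_child_samples'}

# Truncate integer-valued hyperparameters, matching the int() casts in the objective
params = {name: int(value) if name in INTEGER_PARAMS else value
          for name, value in optimal_configuration.items()}
print(params)
```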


In [105]:
%%time
clf = lgb.LGBMClassifier(max_depth=41, learning_rate=  0.74609375, n_estimators = 192, 
                         num_leaves = 122, min_child_samples = 49)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 99.959%
Precision: 99.933%
Recall: 100.0%
F1-score: 99.966%
Wall time: 244 ms


After hyperparameter optimization, the hold-out accuracy of LightGBM improves from 99.918% to 99.959%.
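
For readers unfamiliar with particle swarm optimization, a basic version of the algorithm is sketched below in NumPy. This is not the Optunity implementation used above; the inertia and acceleration coefficients are common textbook choices, and the toy objective with a known maximum stands in for the model accuracy.

```python
import numpy as np

def pso_maximize(objective, bounds, n_particles=20, n_iters=60, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Maximize `objective` over the box `bounds` with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    pos = rng.uniform(low, high, size=(n_particles, len(bounds)))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                                    # personal bests
    best_val = np.array([objective(p) for p in pos])
    g_idx = best_val.argmax()
    g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]   # global best

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Pull each particle toward its own best and the swarm's best
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g_pos - pos))
        pos = np.clip(pos + vel, low, high)                  # stay inside the bounds
        vals = np.array([objective(p) for p in pos])
        improved = vals > best_val
        best_pos[improved], best_val[improved] = pos[improved], vals[improved]
        if best_val.max() > g_val:
            g_idx = best_val.argmax()
            g_pos, g_val = best_pos[g_idx].copy(), best_val[g_idx]
    return g_pos, g_val

# Toy objective with a known maximum at (0.3, 0.7)
toy_objective = lambda p: -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2)
best_point, best_score = pso_maximize(toy_objective, bounds=[(0, 1), (0, 1)])
print(best_point, best_score)
```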

In [106]:
import optunity
import optunity.metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define the hyperparameter configuration space for RandomForestClassifier
search = {
    'n_estimators': [50, 500],  # Number of trees in the forest
    'max_depth': [5, 50],  # Maximum depth of the tree
    'min_samples_split': [2, 11],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 11],  # Minimum number of samples required to be at a leaf node
    'criterion_index': [0, 1],  # Index to select criterion, 0 for "gini", 1 for "entropy"
}

# Define the objective function for RandomForestClassifier
def performance(n_estimators=None, max_depth=None, min_samples_split=None, min_samples_leaf=None, criterion_index=None):
    # Convert criterion_index to actual criterion string
    criterion = ["gini", "entropy"][int(criterion_index)]
    
    # Define and fit the model
    clf = RandomForestClassifier(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        min_samples_split=int(min_samples_split),
        min_samples_leaf=int(min_samples_leaf),
        criterion=criterion
    )
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_test)
    score = accuracy_score(y_test, prediction)
    return score

# Detect the optimal hyperparameter values using PSO
optimal_configuration, info, _ = optunity.maximize(performance,
                                                   solver_name='particle swarm',
                                                   num_evals=20,
                                                   **search
                                                  )

print(optimal_configuration)
print("Accuracy:" + str(info.optimum))


{'n_estimators': 396.5087890625, 'max_depth': 42.19970703125, 'min_samples_split': 3.05029296875, 'min_samples_leaf': 2.4794921875, 'criterion_index': 0.62646484375}
Accuracy:0.9991776315789473


In [110]:
%%time
clf = RandomForestClassifier(max_depth=42, n_estimators = 396, min_samples_split = 3,
                         min_samples_leaf = 2, criterion = 'gini')
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 99.91799999999999%
Precision: 99.933%
Recall: 99.933%
F1-score: 99.933%
Wall time: 2.66 s


# 5. Combined Algorithm Selection and Hyperparameter tuning (CASH)
CASH is the process of combining the two AutoML procedures: model selection and hyperparameter optimization.
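
The key idea is that the algorithm is itself a search dimension, and each algorithm carries its own hyperparameter space. The sketch below illustrates this with plain random search on synthetic data. It is a simplified stand-in, not PSO, and the two candidate algorithms and their ranges are assumptions chosen for brevity.

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
rng = random.Random(0)

def sample_configuration():
    """Draw an algorithm first, then hyperparameters from that algorithm's own space."""
    algorithm = rng.choice(["k-nn", "random-forest"])
    if algorithm == "k-nn":
        return algorithm, {"n_neighbors": rng.randint(3, 10)}
    return algorithm, {"n_estimators": rng.randint(10, 50),
                       "max_depth": rng.randint(2, 10)}

def build(algorithm, params):
    """Instantiate the model that corresponds to a sampled configuration."""
    if algorithm == "k-nn":
        return KNeighborsClassifier(**params)
    return RandomForestClassifier(random_state=0, **params)

best_score, best_config = -1.0, None
for _ in range(8):
    algorithm, params = sample_configuration()
    score = cross_val_score(build(algorithm, params), X_toy, y_toy, cv=3).mean()
    if score > best_score:
        best_score, best_config = score, (algorithm, params)

print(best_config, round(best_score, 4))
```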

## Method: Particle Swarm Optimization (PSO)

In [111]:
import optunity
import optunity.metrics

search = {'algorithm': {'k-nn': {'n_neighbors': [3, 10]},
                        'random-forest': {
                                'n_estimators': [50, 500],
                                'max_features': [5, 12],
                                'max_depth': [5,50],
                                "min_samples_split":[2,11],
                                "min_samples_leaf":[1,11]},
                        'lightgbm': {
                                'n_estimators': [50, 500],
                                'max_depth': [5, 50],
                                'learning_rate': (0, 1),
                                "num_leaves":[100, 2000],
                                "min_child_samples":[10, 50],
                                    },
                        'ann': {
                                'neurons': [10, 100],
                                'epochs': [20, 50],
                                'patience': [3, 20],
                                }
                        }
          
         }
def performance(
                algorithm, n_neighbors=None, 
    n_estimators=None, max_features=None,max_depth=None,min_samples_split=None,min_samples_leaf=None,
    learning_rate=None,num_leaves=None,min_child_samples=None,
    neurons=None,epochs=None,patience=None
):
    # Build the model for the selected algorithm
    if algorithm == 'k-nn':
        model = KNeighborsClassifier(n_neighbors=int(n_neighbors))
    elif algorithm == 'random-forest':
        model = RandomForestClassifier(n_estimators=int(n_estimators),
                                       max_features=int(max_features),
                                       max_depth=int(max_depth),
                                       min_samples_split=int(min_samples_split),
                                       min_samples_leaf=int(min_samples_leaf))
    elif algorithm == 'lightgbm':
        model = lgb.LGBMClassifier(n_estimators=int(n_estimators),
                                   max_depth=int(max_depth),
                                   learning_rate=float(learning_rate),
                                   num_leaves=int(num_leaves),
                                   min_child_samples=int(min_child_samples),
                                  )
    elif algorithm == 'ann':
        model = KerasClassifier(build_fn=ANN, verbose=0,
                               neurons=int(neurons),
                                epochs=int(epochs),
                                patience=int(patience)
                               )
    else:
        raise ValueError('Unknown algorithm: %s' % algorithm)
    # Fit the model and predict the test set
    model.fit(X_train,y_train)
    prediction = model.predict(X_test)
    score = accuracy_score(y_test,prediction)
    return score

# Run the CASH process
optimal_configuration, info, _ = optunity.maximize_structured(performance, 
                                                              search_space=search, 
                                                              num_evals=10)
print(optimal_configuration)
print(info.optimum)

{'algorithm': 'lightgbm', 'epochs': None, 'neurons': None, 'patience': None, 'n_neighbors': None, 'learning_rate': 0.84814453125, 'max_depth': 30.29052734375, 'min_child_samples': 21.73828125, 'n_estimators': 427.2705078125, 'num_leaves': 1191.943359375, 'max_features': None, 'min_samples_leaf': None, 'min_samples_split': None}
0.9995888157894737


In [112]:
%%time
clf = lgb.LGBMClassifier(max_depth=30, learning_rate= 0.84814453125, n_estimators = 427, 
                         num_leaves = 1191, min_child_samples = 21)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 99.959%
Precision: 99.933%
Recall: 100.0%
F1-score: 99.966%
Wall time: 317 ms


LightGBM with the above hyperparameter values is identified as the optimal model.