# Adaptive Boosting Classifier Training and Evaluation

Author: Abigail Dupaya

Description: 
This notebook contains the code used to train and evaluate an Adaptive Boosting (AdaBoost) classifier. The ensemble uses decision tree stumps as its weak learners.
Run the cells in order, because later cells depend on variables defined in earlier ones.

## Step 1: Importing Libraries

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

## Step 2: Loading Preprocessed Data

The data loaded here is the cleaned, balanced, and feature-extracted dataset generated by the data_preprocessing notebook.

In [19]:
# Path for the preprocessed data
csv_path = Path("processed_urls.csv")

# Reading file
try:
    df = pd.read_csv(csv_path, encoding="utf-8", low_memory=False)
except FileNotFoundError:
    raise FileNotFoundError(f"Error: {csv_path} not found. Make sure the file is in the same folder.")

# Drop non-numerical columns that were used for feature extraction
dropped_columns = ['url', 'scheme', 'subdomain', 'registrable_domain', 
                   'suffix', 'path', 'query', 'fragment', 'port', 'username', 'password', 'host']

df = df.drop(columns=dropped_columns, errors='ignore')

# Check if any column has a missing value
df.isnull().sum().any()

np.False_
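
The check above returns a single boolean. As a hedged sketch of a more detailed sanity check, the snippet below reports missing values per column and the share of each class. The miniature frame `df_demo` and its values are made up for illustration; only the column names `len_total`, `count_dots`, and `type` come from this notebook.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the preprocessed data
df_demo = pd.DataFrame({
    "len_total": [22, 23, 87, 140, 31, 64],
    "count_dots": [2, 2, 5, 7, 1, None],  # one deliberately missing value
    "type": [0, 0, 1, 1, 0, 1],
})

# Count missing values per column and keep only the columns that have any
missing_per_column = df_demo.isnull().sum()
has_missing = bool(missing_per_column.any())
print(missing_per_column[missing_per_column > 0])
print(f"Any missing values: {has_missing}")

# Share of each label, useful for confirming how balanced the classes are
class_share = df_demo["type"].value_counts(normalize=True)
print(class_share)
```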

In [20]:
# Split cleaned dataset into features and labels
X = df.drop(columns=['type'])
y = df['type']

In [21]:
# Check that the correct columns for the features are selected
X.head()

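| Unnamed: 0 | is_http | is_https | len_total | len_host | len_path | len_query | len_fragment | count_dots | count_slashes | count_digits | ... | count_percent | count_at | count_question | count_equal | count_ampersand | count_special | entropy_url | keyword_flag | is_shortened | is_ip_host |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 22 | 14 | 0 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3.663533 | 0 | 0 | 0 |
| 1 | 0 | 1 | 23 | 15 | 0 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3.762267 | 0 | 0 | 0 |
| 2 | 0 | 1 | 24 | 16 | 0 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3.855389 | 0 | 0 | 0 |
| 3 | 0 | 1 | 21 | 13 | 0 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3.88018 | 0 | 0 | 0 |
| 4 | 0 | 1 | 25 | 17 | 0 | 0 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3.813661 | 0 | 0 | 0 |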


In [22]:
# Check that the correct column for the labels is selected
y.head()

0    0
1    0
2    0
3    0
4    0
Name: type, dtype: int64

In [23]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
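
The `stratify=y` argument keeps the class proportions the same in both splits. A small sketch with made-up labels (70 of class 0, 30 of class 1; `X_demo` and `y_demo` are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 70/30 class ratio
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, the positive share in each split matches the full data
print(f"Positive share overall: {y_demo.mean():.2f}")
print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```

Without `stratify`, the shares in the two splits can drift apart by chance, which matters more as the minority class gets smaller.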

## Step 3: Train the Model

#### Initial state of model: Adaptive Boosting Classifier (AdaBoost) without Hyperparameter Tuning

An AdaBoost classifier with default hyperparameters was trained and evaluated first. It serves as a baseline for judging whether hyperparameter tuning is worth the extra cost.

In [24]:
# Creating AdaBoost Classifier Object Without Hyperparameter Tuning
stump = DecisionTreeClassifier(random_state=42, max_depth=1)
ada_classifier = AdaBoostClassifier(random_state=42, estimator=stump, n_estimators=50)

# Time the training
print("AdaBoost Classifier Started Training...")

start = time.perf_counter()
ada_classifier.fit(X_train, y_train)
end = time.perf_counter()

print(f"AdaBoost Classifier training time: {end - start:.2f} seconds")

AdaBoost Classifier Started Training...
AdaBoost Classifier training time: 6.84 seconds


In [25]:
# Check how well the model fits the training data
y_train_pred = ada_classifier.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)

# Check the performance of model based on accuracy
y_test_pred = ada_classifier.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"AdaBoost Classifier Training Accuracy: {train_acc:.4f}")
print(f"AdaBoost Classifier Testing Accuracy: {test_acc:.4f}")

AdaBoost Classifier Training Accuracy: 0.9204
AdaBoost Classifier Testing Accuracy: 0.9200


#### Evaluating the Model

In [26]:
cr = classification_report(y_test, y_test_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.90      1.00      0.94    103721
           1       0.99      0.75      0.86     47759

    accuracy                           0.92    151480
   macro avg       0.94      0.87      0.90    151480
weighted avg       0.93      0.92      0.92    151480



#### Final Model: Adaptive Boosting Classifier (AdaBoost) with Hyperparameter Tuning

This was the final model used in the course project.

In [27]:
# Defining hyperparameter grids
param_grid = {
    'estimator__criterion': ["gini", "entropy"],
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.1, 0.5, 1.0]
}
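
To estimate the cost of the search before running it, the grid size can be counted up front. This sketch repeats the grid above and assumes the 5-fold cross-validation used in the next cell; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'estimator__criterion': ["gini", "entropy"],
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.1, 0.5, 1.0]
}

# Every combination of the listed values is one candidate: 2 * 4 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")  # prints "Candidates: 24"
print(f"Model fits during the search: {n_fits}")  # prints "Model fits during the search: 120"
```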

In [28]:
print("AdaBoost Classifier Started Hyperparameter Tuning...")
# Grid Search for AdaBoost to find best parameters
ada_grid_search = GridSearchCV(estimator=ada_classifier, param_grid=param_grid, cv=5, scoring="f1", n_jobs=-1)

# Time the hyperparameter tuning
start = time.perf_counter()
ada_grid_search.fit(X_train, y_train)
end = time.perf_counter()

hyperparameter_tuning_time = end - start

print(f"AdaBoost Classifier Hyperparameter tuning time: {hyperparameter_tuning_time:.2f} seconds")

# Grab the best parameters for the model
best_ada_params = ada_grid_search.best_params_

# Print the best parameters for the model
print(f"Best parameters for AdaBoost Classifier: {best_ada_params}")

AdaBoost Classifier Started Hyperparameter Tuning...
AdaBoost Classifier Hyperparameter tuning time: 296.59 seconds
Best parameters for AdaBoost Classifier: {'estimator__criterion': 'gini', 'learning_rate': 1.0, 'n_estimators': 200}


In [29]:
# Building base model with tuned parameters
stump = DecisionTreeClassifier(
    random_state=42, 
    criterion=best_ada_params['estimator__criterion'],
    max_depth=1)

# Building model with tuned parameters
best_ada_classifier = AdaBoostClassifier(
    random_state=42,
    estimator=stump,
    n_estimators=best_ada_params['n_estimators'],
    learning_rate=best_ada_params['learning_rate']
)

# Retrain the model on the whole training set
# Time the training
print("AdaBoost Classifier Started Training...")
start = time.perf_counter()
best_ada_classifier.fit(X_train, y_train)
end = time.perf_counter()
training_time = end - start

print(f"AdaBoost Classifier Training time: {training_time:.2f} seconds")

AdaBoost Classifier Started Training...
AdaBoost Classifier Training time: 26.78 seconds


In [30]:
# Check how well the model fits the training data
y_train_pred = best_ada_classifier.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)

# Check the performance of model based on accuracy
y_test_pred = best_ada_classifier.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)
print(f"AdaBoost Classifier Training Accuracy: {train_acc:.4f}")
print(f"AdaBoost Classifier Testing Accuracy: {test_acc:.4f}")

AdaBoost Classifier Training Accuracy: 0.9435
AdaBoost Classifier Testing Accuracy: 0.9434


In [31]:
cr = classification_report(y_test, y_test_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.93      0.99      0.96    103721
           1       0.98      0.84      0.90     47759

    accuracy                           0.94    151480
   macro avg       0.96      0.91      0.93    151480
weighted avg       0.95      0.94      0.94    151480



#### Using Custom Threshold to Predict

A custom decision threshold is used instead of the default 0.5 because this project prioritizes F1 score and recall. The threshold was lowered because failing to identify a malicious URL is more harmful than flagging a legitimate URL as malicious.

In [32]:
pred_prob = best_ada_classifier.predict_proba(X_test)[:, 1]  # 1 is malicious

# Custom threshold
threshold = 0.48

# Use threshold to decide the prediction
y_test_pred = (pred_prob >= threshold).astype(int)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"AdaBoost Classifier Testing Accuracy with custom threshold: {test_acc:.4f}")

AdaBoost Classifier Testing Accuracy with custom threshold: 0.9413


## Step 4: Evaluating the Final Model

In [33]:
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1_score = f1_score(y_test, y_test_pred)
test_acc = accuracy_score(y_test, y_test_pred)
cr = classification_report(y_test, y_test_pred)
cm = confusion_matrix(y_test, y_test_pred)

print('\n' + '='*50)
print('FINAL TUNED ADABOOST TEST RESULTS')
print('='*50)
print(f'Accuracy:  {test_acc:.4f}')
print(f'Precision: {test_precision:.4f}')
print(f'Recall:    {test_recall:.4f}')
print(f'F1 Score:  {test_f1_score:.4f}')
print(f'Confusion Matrix:\n{cm}')
print(f'Classification Report:\n{cr}')
print('\n' + '='*50)
print('FINAL TUNED ADABOOST RUNTIME ANALYSIS')
print('='*50)
print(f'Training completed in {training_time:.4f} seconds ({training_time/60:.2f} minutes)')
print(f'Hyperparameter tuning completed in {hyperparameter_tuning_time:.4f} seconds ({hyperparameter_tuning_time/60:.2f} minutes)')



FINAL TUNED ADABOOST TEST RESULTS
Accuracy:  0.9413
Precision: 0.9191
Recall:    0.8923
F1 Score:  0.9055
Confusion Matrix:
[[99971  3750]
 [ 5142 42617]]
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.96      0.96    103721
           1       0.92      0.89      0.91     47759

    accuracy                           0.94    151480
   macro avg       0.94      0.93      0.93    151480
weighted avg       0.94      0.94      0.94    151480


FINAL TUNED ADABOOST RUNTIME ANALYSIS
Training completed in 26.7826 seconds (0.45 minutes)
Hyperparameter tuning completed in 296.5925 seconds (4.94 minutes)
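
As a cross-check, the reported scores can be recomputed by hand from the confusion matrix above. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` with class 1 as malicious. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 99971, 3750
fn, tp = 5142, 42617

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of URLs flagged malicious, the share that truly are
recall = tp / (tp + fn)      # of truly malicious URLs, the share that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # prints "Accuracy:  0.9413"
print(f"Precision: {precision:.4f}")  # prints "Precision: 0.9191"
print(f"Recall:    {recall:.4f}")     # prints "Recall:    0.8923"
print(f"F1 Score:  {f1:.4f}")         # prints "F1 Score:  0.9055"
```

Read this way, the final model misses 5,142 malicious URLs (false negatives) and raises 3,750 false alarms on legitimate URLs (false positives) out of 151,480 test samples.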
