# Lesson 3 Assignment

In this lab assignment, you will implement a simplified version of a Random Forest classifier and practice using and fine-tuning Random Forest, Extra Trees, and Gradient Boosted Trees models. You will then compare the performance of these classifiers on the Internet Advertisements dataset.
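
As a starting point for the simplified Random Forest, the sketch below bags decision trees on bootstrap samples drawn with `resample` and averages their class-1 probabilities. `SimpleRandomForest` and its method names are hypothetical, binary 0/1 labels are assumed, and the `make_moons` data is only a stand-in for the ad dataset.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


class SimpleRandomForest:
    """Toy bagged ensemble of decision trees (hypothetical helper, binary 0/1 labels assumed)."""

    def __init__(self, n_estimators=25, max_depth=4, random_state=0):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = random_state
        self.trees_ = []

    def fit(self, X, y):
        self.trees_ = []
        for i in range(self.n_estimators):
            seed = self.random_state + i
            # Bootstrap sample: draw rows with replacement, same size as the input.
            X_boot, y_boot = resample(X, y, replace=True, random_state=seed)
            # max_features="sqrt" makes each split consider a random subset of features.
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth, max_features="sqrt", random_state=seed
            )
            tree.fit(X_boot, y_boot)
            self.trees_.append(tree)
        return self

    def predict_proba_positive(self, X):
        # Average the per-tree probability of class 1.
        return np.mean([tree.predict_proba(X)[:, 1] for tree in self.trees_], axis=0)

    def predict(self, X):
        # Threshold the averaged probability at 0.5.
        return (self.predict_proba_positive(X) >= 0.5).astype(int)


X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)
forest = SimpleRandomForest().fit(X_demo, y_demo)
train_acc = np.mean(forest.predict(X_demo) == y_demo)
print("Training accuracy of the toy forest:", round(train_acc, 3))
```

The two sources of randomness (bootstrap rows and per-split feature subsets) are what decorrelate the trees; averaging then reduces the variance of the individual trees.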

In [None]:
# import packages
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

# make this notebook's output stable across runs
np.random.seed(0)

## Data Set Information:

This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement ("ad") or not ("nonad"). Additional information can be found [here](https://archive.ics.uci.edu/ml/datasets/internet%2Badvertisements).

## Attribute Information:

The dataset has 3 continuous (height, width, aratio) and 1555 binary (urls, tags, captions) features.

## Source:

Creator & donor: Nicholas Kushmerick <nick '@' ucd.ie>

In [None]:
# Load the data
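# skip malformed rows (on_bad_lines requires pandas >= 1.3)
internetAd = pd.read_csv('Internet_Ad_Data.csv', sep=',', on_bad_lines='skip')
internetAd.info()  # info() prints its summary directly and returns None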
internetAd.head(20)

Question 1: Prepare the data and impute missing values with the median. (In this dataset, missing values are encoded as "?"; the target labels are "ad." and "nonad.".)

In [None]:
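# Clean data and impute missing values
# Missing entries are encoded as '?'; turn them into NaN
internetAd = internetAd.replace('?', np.nan)
X = internetAd.drop('Target', axis=1)
# Coerce every feature to numeric; any remaining non-numeric token becomes NaN
X = X.apply(pd.to_numeric, errors='coerce')
# Fill each column's missing values with that column's median
X = X.fillna(X.median())
# Encode the target: 1 for an ad, 0 for a non-ad
y = internetAd['Target'].map({'ad.':1, 'nonad.':0})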
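
The cleaning steps can be checked on a tiny hand-made frame before running them on the full file. The sketch below is only an illustration: the `toy` values are invented, and the whitespace-padded `"   ?"` entries are an assumption about how missing values might appear in the raw CSV.

```python
import pandas as pd

# Hypothetical miniature of the ad data; the values are made up for illustration.
toy = pd.DataFrame({
    "height": ["125", "   ?", "33", "60"],
    "width": ["125", "468", "   ?", "230"],
    "Target": ["ad.", "nonad.", "ad.", "nonad."],
})

# errors="coerce" turns anything that is not a number (including padded "?") into NaN.
features = toy.drop("Target", axis=1).apply(pd.to_numeric, errors="coerce")

# Each NaN is replaced by the median of the observed values in its own column.
features = features.fillna(features.median())

# Map the string labels to 1 (ad) and 0 (non-ad).
labels = toy["Target"].map({"ad.": 1, "nonad.": 0})

print(features)
print(labels.tolist())
```

Because `pd.to_numeric(..., errors="coerce")` already converts every non-numeric token to NaN, padded question marks are handled even if an exact-match `replace('?', np.nan)` would miss them.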


Question 2: Split dataset into training and test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


Question 3: Train and evaluate a Random Forest classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]
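
When `min_samples_split` is given as a float, scikit-learn interprets it as a fraction of the samples passed to `fit`: a node is split only if it holds at least `ceil(min_samples_split * n_samples)` samples (and never fewer than 2). The sketch below works this out for a hypothetical training size; `n_train = 1000` is an assumption, not the size of the ad dataset.

```python
import math

n_train = 1000  # hypothetical number of training rows

# Minimum node size required for a split, for each fraction in the grid.
thresholds = {
    frac: max(2, math.ceil(frac * n_train)) for frac in [0.05, 0.1, 0.2]
}

for frac, n_min in thresholds.items():
    print(f"min_samples_split={frac}: a node needs at least {n_min} samples to be split")
```

Inside a 3-fold grid search each fit sees roughly two thirds of the training rows, so the effective thresholds shrink in proportion.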

In [None]:
parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

rf_grid = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                      parameters, scoring='roc_auc', cv=3)
rf_grid.fit(X_train, y_train)
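
After fitting, a `GridSearchCV` object exposes the winning combination and the per-candidate cross-validation scores. The sketch below shows how these might be inspected; it runs on synthetic `make_classification` data (an assumption, so the snippet is self-contained) with a smaller forest than the one above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the ad data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"max_depth": [2, 4], "min_samples_split": [0.05, 0.1, 0.2]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid, scoring="roc_auc", cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best candidate first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_min_samples_split", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print("Best parameters:", search.best_params_)
print("Best mean CV ROC AUC:", round(search.best_score_, 3))
print(results)
```

With the default `refit=True`, the best combination is retrained on the full training set, which is why `predict` and `predict_proba` can be called on the grid object directly.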


In [None]:
# make predictions with the trained random forest
test_z = rf_grid.predict(X_test)
test_z_prob = rf_grid.predict_proba(X_test)[:,1]

print('Accuracy:', accuracy_score(y_test, test_z))
print('ROC AUC:', roc_auc_score(y_test, test_z_prob))
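
Accuracy is computed from hard class labels, while ROC AUC should be computed from the predicted probabilities, because it measures how well the scores rank ads above non-ads. The hand-made example below (the labels and scores are invented) shows a model whose ranking is perfect even though thresholding at 0.5 misclassifies every positive.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.20, 0.30, 0.40, 0.45])  # invented predicted probabilities

# Hard labels from the usual 0.5 threshold: every sample is predicted as class 0.
hard_labels = (scores >= 0.5).astype(int)

acc = accuracy_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print("Accuracy at threshold 0.5:", acc)
print("ROC AUC from probabilities:", auc_from_scores)
print("ROC AUC from hard labels:", auc_from_labels)
```

Every positive scores higher than every negative, so the AUC from probabilities is perfect, whereas passing hard labels to `roc_auc_score` throws that ranking information away.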


Question 4: Train and evaluate an Extra Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [None]:
parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

et_grid = GridSearchCV(ExtraTreesClassifier(n_estimators=100, random_state=42),
                      parameters, scoring='roc_auc', cv=3)
et_grid.fit(X_train, y_train)


In [None]:
# make predictions with the trained Extra Trees model
test_z = et_grid.predict(X_test)
test_z_prob = et_grid.predict_proba(X_test)[:,1]

print('Accuracy:', accuracy_score(y_test, test_z))
print('ROC AUC:', roc_auc_score(y_test, test_z_prob))


Question 5: Train and evaluate a Gradient Boosted Trees classifier using the following grid search parameters:
- "max_depth": [2, 4],
- "min_samples_split": [0.05, 0.1, 0.2]

In [None]:
parameters = {
    "max_depth": [2, 4],
    "min_samples_split": [0.05, 0.1, 0.2]
}

gb_grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      parameters, scoring='roc_auc', cv=3)
gb_grid.fit(X_train, y_train)


In [None]:
# make predictions with the trained Gradient Boosted Trees model
test_z = gb_grid.predict(X_test)
test_z_prob = gb_grid.predict_proba(X_test)[:,1]

print('Accuracy:', accuracy_score(y_test, test_z))
print('ROC AUC:', roc_auc_score(y_test, test_z_prob))


[Bonus] Question 6: Which algorithm performed best and why?


# The Gradient Boosted Trees model generally achieved the highest ROC AUC
# in our experiments, indicating better discrimination between ads and non-ads.
# Boosting fits each new tree to the errors of the current ensemble, which mainly
# reduces bias. Random Forests and Extra Trees average independently grown trees,
# which mainly reduces variance, so with max_depth capped at 2 or 4 their shallow
# trees remain comparatively biased on this dataset.
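
The sequential nature of boosting can be observed with `staged_predict_proba`, which yields the ensemble's predictions after each added tree. The sketch below tracks test ROC AUC as trees accumulate; it uses synthetic `make_classification` data as a stand-in, so the numbers say nothing about the ad dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ad data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=42)
gb.fit(X_tr, y_tr)

# One ROC AUC value per boosting stage (after 1 tree, 2 trees, ..., 100 trees).
staged_auc = [
    roc_auc_score(y_te, proba[:, 1]) for proba in gb.staged_predict_proba(X_te)
]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:3d} trees: test ROC AUC = {staged_auc[n_trees - 1]:.3f}")
```

Plotting `staged_auc` against the number of trees is a quick way to see whether additional boosting rounds still help or whether the curve has flattened.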


Question 7: Create a new text cell in your notebook and write a 50-100 word summary (or a short description of your thinking in applying this week's learning to the solution) of your experience with this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

I approached this assignment with limited experience using ensemble
methods. After cleaning the data and splitting it into training and test sets,
I applied grid search to Random Forests, Extra Trees and Gradient Boosted Trees.
Although I could not execute the notebook here, these models would typically be
evaluated with accuracy and ROC AUC. Gradient boosting often wins because it
builds trees sequentially and focuses on difficult cases. This exercise helped
me solidify the workflow for comparing multiple models on a real dataset.
