<a href="https://colab.research.google.com/github/bu11ymaguire/Machin-Learning1/blob/main/Week11_Assignment_2023036299.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine Learning Programming**

# Week11_Ensemble


### **Note: the submitted code must run correctly when executed by the TA.**



---



## **Assignment** Breast Cancer Data (10 points)

In this assignment, you will apply the ensemble learning techniques we studied today (Random Forest, AdaBoost, and Gradient Boosting) to the **breast cancer dataset** from scikit-learn. Your goal is to build a classifier that predicts whether a tumor is malignant or benign.

- Choose one of the models used in class
- You may select and use a subset of features
- Split the data into train and test sets, and calculate the accuracy on the test set
- Achieve over 97% accuracy on the test set

### **Dataset Information**
The Breast Cancer Wisconsin (Diagnostic) dataset contains features computed from digitized images of a fine needle aspirate (FNA) of a breast mass. The dataset includes:

- 569 instances
- 30 numeric features
- Binary target: malignant (cancer) = 0, benign (not cancer) = 1
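
The figures listed above can be checked directly against the loaded dataset. The following is a minimal sketch; the variable names (`n_samples`, `n_features`, `class_counts`) are illustrative and not part of the assignment template.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Shape of the feature matrix: (instances, features)
n_samples, n_features = cancer.data.shape

# target_names is indexed by label, so index 0 is 'malignant' and index 1 is 'benign'
class_counts = dict(
    zip(cancer.target_names.tolist(), np.bincount(cancer.target).tolist())
)

print(f"instances={n_samples}, features={n_features}")
print(f"class counts: {class_counts}")
```

The classes are not perfectly balanced, which is one reason a stratified train/test split is a reasonable choice later on.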


### 1 Data Load and Preprocessing (2 points)

In [1]:
# Data Load and Preprocessing

from sklearn import datasets
import pandas as pd

cancer = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
cancer_df['target'] = cancer.target

cancer_df.head(5)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [3]:
# Count missing values in every column (features and target)
missing_values_count = cancer_df.isnull().sum()

print("Missing values per column:")
print(missing_values_count)

Missing values per column:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64


### 2 Train/Test Split
- Split the dataset into train and test sets.

In [5]:
# Split the dataset
from sklearn.model_selection import train_test_split

# Separate the features from the label column
X = cancer_df.drop('target', axis=1)
y = cancer_df['target']

# Stratify so both partitions keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [6]:
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)

X_train shape: (455, 30)
X_test shape: (114, 30)
y_train shape: (455,)
y_test shape: (114,)
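
Because the split passes `stratify=y`, the class ratio should be nearly identical in the full data, the training set, and the test set. The sketch below checks this under the same split settings; the `*_ratio` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The labels are 0/1, so the mean is the fraction of benign samples (label 1)
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"benign fraction  full={full_ratio:.4f}  train={train_ratio:.4f}  test={test_ratio:.4f}")
```

Without stratification the three fractions can drift apart, which matters more when the test set is small.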


### 3 Model Training (3 points)
- Train a classifier with the preprocessed data.

In [7]:
# Train a classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

rf_accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest : {rf_accuracy:.4f}")

Random Forest : 0.9561


In [8]:
from sklearn.ensemble import AdaBoostClassifier
ada_model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_model.fit(X_train, y_train)
ada_predictions = ada_model.predict(X_test)
ada_accuracy = accuracy_score(y_test, ada_predictions)
print(f"AdaBoost : {ada_accuracy:.4f}")

AdaBoost : 0.9561


In [9]:
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_predictions)
print(f"Gradient Boost : {gb_accuracy:.4f}")

Gradient Boost : 0.9561
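
All three baselines print 0.9561. On a 114-sample test set, each prediction is worth roughly 0.88 percentage points, so it can help to translate the 97% target into a number of allowed errors. The arithmetic below is a small sketch that uses only the numbers reported above; the variable names are illustrative.

```python
import math

n_test = 114    # test-set size reported by the split
target = 0.97   # accuracy the assignment asks to exceed

# Smallest number of correct predictions whose accuracy is strictly above the target
min_correct = math.floor(target * n_test) + 1
max_errors = n_test - min_correct

# Number of correct predictions implied by the reported baseline accuracy
baseline_correct = round(0.9561 * n_test)
baseline_errors = n_test - baseline_correct

print(f"need >= {min_correct} correct (<= {max_errors} errors): {min_correct / n_test:.4f}")
print(f"baseline: {baseline_correct} correct ({baseline_errors} errors): {baseline_correct / n_test:.4f}")
```

Under these numbers the target corresponds to at least 111 correct predictions (at most 3 errors), while 0.9561 corresponds to 109 correct predictions (5 errors).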


### 4 Update the model to achieve an accuracy of over 97% (5 points)

In [12]:
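# GridSearchCV must be imported before its first use in this cell
from sklearn.model_selection import GridSearchCV

param_grid = {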
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'max_depth': [2, 3, 4]
}

gb_model_base = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(estimator=gb_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Grid Search (Hyperparameter Tuning)...")
grid_search.fit(X_train, y_train)
print("Grid Search Complete!")

print("\nBest Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)

print("Evaluating on Test Set...")
best_gb_model = grid_search.best_estimator_
final_predictions = best_gb_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_predictions)

print(f"Final Test Set Accuracy with Best Parameters: {final_accuracy:.4f}")

Starting Grid Search (Hyperparameter Tuning)...
Grid Search Complete!

Best Hyperparameters: {'learning_rate': 1.0, 'max_depth': 2, 'n_estimators': 100}
Best Cross-Validation Score: 0.9758241758241759
Evaluating on Test Set...
Final Test Set Accuracy with Best Parameters: 0.9649


In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'n_estimators': [50, 100, 200, 300, 500],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.5],
    'max_depth': [2, 3, 4, 5]
}

gb_model_base = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(estimator=gb_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Expanded Grid Search (Hyperparameter Tuning)...")
grid_search.fit(X_train, y_train)
print("Expanded Grid Search Complete!")

print("\nBest Hyperparameters found with Expanded Search:", grid_search.best_params_)
print("Best Cross-Validation Score found with Expanded Search:", grid_search.best_score_)

print("Evaluating on Test Set with Best Model from Expanded Search...")
best_gb_model_tuned = grid_search.best_estimator_
final_predictions_tuned = best_gb_model_tuned.predict(X_test)
final_accuracy_tuned = accuracy_score(y_test, final_predictions_tuned)

print(f"Final Test Set Accuracy with Best Tuned Parameters: {final_accuracy_tuned:.4f}")

Starting Expanded Grid Search (Hyperparameter Tuning)...
Expanded Grid Search Complete!

Best Hyperparameters found with Expanded Search: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 200}
Best Cross-Validation Score found with Expanded Search: 0.9714285714285715
Evaluating on Test Set with Best Model from Expanded Search...
Final Test Set Accuracy with Best Tuned Parameters: 0.9474
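
The best cross-validation score (0.9714) and the test accuracy (0.9474) differ by a couple of points here. One way to put such a gap in context is to look at the fold-to-fold spread that `GridSearchCV` records in `cv_results_`. The sketch below uses a deliberately small grid, an assumption chosen to keep it quick rather than the grid used above, and the names `small_grid`, `search`, and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid: 4 candidates x 5 folds
small_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=small_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# One row per candidate: mean and standard deviation of accuracy across the folds
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

If the standard deviation across folds is comparable to the differences between candidates, the ranking of those candidates is not very stable, and a single test-set number can easily move by one or two samples.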


In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'n_estimators': [150, 200, 250, 300],
    'learning_rate': [0.05, 0.08, 0.1, 0.12, 0.15],
    'max_depth': [2, 3]
}

gb_model_base = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(estimator=gb_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Refined Grid Search (Hyperparameter Tuning)...")
grid_search.fit(X_train, y_train)
print("Refined Grid Search Complete!")

print("\nBest Hyperparameters found with Refined Search:", grid_search.best_params_)
print("Best Cross-Validation Score found with Refined Search:", grid_search.best_score_)

print("Evaluating on Test Set with Best Model from Refined Search...")
best_gb_model_tuned = grid_search.best_estimator_
final_predictions_tuned = best_gb_model_tuned.predict(X_test)
final_accuracy_tuned = accuracy_score(y_test, final_predictions_tuned)

print(f"Final Test Set Accuracy with Best Tuned Parameters: {final_accuracy_tuned:.4f}")

Starting Refined Grid Search (Hyperparameter Tuning)...
Refined Grid Search Complete!

Best Hyperparameters found with Refined Search: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 200}
Best Cross-Validation Score found with Refined Search: 0.9714285714285715
Evaluating on Test Set with Best Model from Refined Search...
Final Test Set Accuracy with Best Tuned Parameters: 0.9474


In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.5, 1.0]
}

ada_model_base = AdaBoostClassifier(random_state=42)

grid_search_ada = GridSearchCV(estimator=ada_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Grid Search for AdaBoost (Hyperparameter Tuning)...")
grid_search_ada.fit(X_train, y_train)
print("Grid Search for AdaBoost Complete!")

print("\nBest Hyperparameters found for AdaBoost:", grid_search_ada.best_params_)
print("Best Cross-Validation Score for AdaBoost:", grid_search_ada.best_score_)

print("Evaluating on Test Set with Best AdaBoost Model...")
best_ada_model_tuned = grid_search_ada.best_estimator_
final_predictions_ada_tuned = best_ada_model_tuned.predict(X_test)
final_accuracy_ada_tuned = accuracy_score(y_test, final_predictions_ada_tuned)

print(f"Final Test Set Accuracy with Best Tuned AdaBoost Parameters: {final_accuracy_ada_tuned:.4f}")

Starting Grid Search for AdaBoost (Hyperparameter Tuning)...
Grid Search for AdaBoost Complete!

Best Hyperparameters found for AdaBoost: {'learning_rate': 0.5, 'n_estimators': 100}
Best Cross-Validation Score for AdaBoost: 0.9780219780219781
Evaluating on Test Set with Best AdaBoost Model...
Final Test Set Accuracy with Best Tuned AdaBoost Parameters: 0.9561


In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'n_estimators': [80, 100, 120, 150, 200],
    'learning_rate': [0.3, 0.4, 0.5, 0.6, 0.7]
}

ada_model_base = AdaBoostClassifier(random_state=42)

grid_search_ada = GridSearchCV(estimator=ada_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Refined Grid Search for AdaBoost (Hyperparameter Tuning)...")
grid_search_ada.fit(X_train, y_train)
print("Refined Grid Search for AdaBoost Complete!")

print("\nBest Hyperparameters found for AdaBoost with Refined Search:", grid_search_ada.best_params_)
print("Best Cross-Validation Score for AdaBoost with Refined Search:", grid_search_ada.best_score_)

print("Evaluating on Test Set with Best AdaBoost Model from Refined Search...")
best_ada_model_tuned = grid_search_ada.best_estimator_
final_predictions_ada_tuned = best_ada_model_tuned.predict(X_test)
final_accuracy_ada_tuned = accuracy_score(y_test, final_predictions_ada_tuned)

print(f"Final Test Set Accuracy with Best Tuned AdaBoost Parameters: {final_accuracy_ada_tuned:.4f}")

Starting Refined Grid Search for AdaBoost (Hyperparameter Tuning)...
Refined Grid Search for AdaBoost Complete!

Best Hyperparameters found for AdaBoost with Refined Search: {'learning_rate': 0.6, 'n_estimators': 100}
Best Cross-Validation Score for AdaBoost with Refined Search: 0.9802197802197803
Evaluating on Test Set with Best AdaBoost Model from Refined Search...
Final Test Set Accuracy with Best Tuned AdaBoost Parameters: 0.9561


In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_model_base = RandomForestClassifier(random_state=42)

grid_search_rf = GridSearchCV(estimator=rf_model_base,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

print("Starting Grid Search for Random Forest (Hyperparameter Tuning)...")
grid_search_rf.fit(X_train, y_train)
print("Grid Search for Random Forest Complete!")

print("\nBest Hyperparameters found for Random Forest:", grid_search_rf.best_params_)
print("Best Cross-Validation Score for Random Forest:", grid_search_rf.best_score_)

print("Evaluating on Test Set with Best Random Forest Model...")
best_rf_model_tuned = grid_search_rf.best_estimator_
final_predictions_rf_tuned = best_rf_model_tuned.predict(X_test)
final_accuracy_rf_tuned = accuracy_score(y_test, final_predictions_rf_tuned)

print(f"Final Test Set Accuracy with Best Tuned Random Forest Parameters: {final_accuracy_rf_tuned:.4f}")

Starting Grid Search for Random Forest (Hyperparameter Tuning)...
Grid Search for Random Forest Complete!

Best Hyperparameters found for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Cross-Validation Score for Random Forest: 0.9604395604395606
Evaluating on Test Set with Best Random Forest Model...
Final Test Set Accuracy with Best Tuned Random Forest Parameters: 0.9561


In [18]:
feature_importances = best_rf_model_tuned.feature_importances_

feature_names = X.columns

importance_series = pd.Series(feature_importances, index=feature_names)

sorted_importance_series = importance_series.sort_values(ascending=False)

print("Feature Importances (Sorted):")
print(sorted_importance_series)

Feature Importances (Sorted):
worst perimeter            0.133100
worst area                 0.128052
worst concave points       0.108107
mean concave points        0.094414
worst radius               0.090639
mean radius                0.058662
mean perimeter             0.055242
mean area                  0.049938
mean concavity             0.046207
worst concavity            0.035357
area error                 0.034368
mean compactness           0.018094
worst texture              0.017869
worst compactness          0.014481
mean texture               0.014317
worst smoothness           0.014026
radius error               0.013454
worst symmetry             0.009570
perimeter error            0.008138
concavity error            0.007136
worst fractal dimension    0.006387
mean smoothness            0.006168
texture error              0.005441
mean symmetry              0.005004
compactness error          0.004630
symmetry error             0.004611
mean fractal dimension     0.00446

In [22]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

feature_importances = best_rf_model_tuned.feature_importances_
feature_names = X_train.columns

importance_series = pd.Series(feature_importances, index=feature_names)
sorted_importance_series = importance_series.sort_values(ascending=False)

importance_threshold = 0.005

selected_features = sorted_importance_series[sorted_importance_series >= importance_threshold].index.tolist()

print(f"Original number of features: {len(feature_names)}")
print(f"Number of selected features (importance >= {importance_threshold}): {len(selected_features)}")
print(f"Selected features: {selected_features}")

X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

best_params_rf = best_rf_model_tuned.get_params()

selected_rf_model = RandomForestClassifier(**best_params_rf)

print("\nTraining Random Forest model with selected features...")
selected_rf_model.fit(X_train_selected, y_train)
print("Training Complete!")

print("Evaluating model on test set with selected features...")
selected_predictions = selected_rf_model.predict(X_test_selected)
selected_accuracy = accuracy_score(y_test, selected_predictions)

print(f"Final Test Set Accuracy with Feature Selection ({len(selected_features)} features): {selected_accuracy:.4f}")

Original number of features: 30
Number of selected features (importance >= 0.005): 24
Selected features: ['worst perimeter', 'worst area', 'worst concave points', 'mean concave points', 'worst radius', 'mean radius', 'mean perimeter', 'mean area', 'mean concavity', 'worst concavity', 'area error', 'mean compactness', 'worst texture', 'worst compactness', 'mean texture', 'worst smoothness', 'radius error', 'worst symmetry', 'perimeter error', 'concavity error', 'worst fractal dimension', 'mean smoothness', 'texture error', 'mean symmetry']

Training Random Forest model with selected features...
Training Complete!
Evaluating model on test set with selected features...
Final Test Set Accuracy with Feature Selection (24 features): 0.9561
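
The manual importance threshold used above can also be expressed with scikit-learn's `SelectFromModel`, which keeps the features whose importance is greater than or equal to the threshold. This is a sketch under assumptions: it refits a forest with `n_estimators=200` instead of reusing `best_rf_model_tuned`, and the names `rf`, `selector`, `mask`, and `selected` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# prefit=True reuses the already fitted forest instead of fitting a new one
selector = SelectFromModel(rf, threshold=0.005, prefit=True)

# Boolean mask over the columns: True where importance >= threshold
mask = selector.get_support()
selected = X_train.columns[mask].tolist()

# Apply the same column mask to both partitions
X_train_sel = X_train.loc[:, mask]
X_test_sel = X_test.loc[:, mask]

print(f"kept {len(selected)} of {X_train.shape[1]} features")
```

The threshold is computed from the training data only, so the test set plays no part in deciding which columns are kept.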


In [25]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import numpy as np

param_dist = {
    'n_estimators': [50, 100, 150, 200, 300, 400, 500],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2],
    'max_depth': [2, 3, 4, 5, 6],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0]
}

gb_model_base = GradientBoostingClassifier(random_state=42)

random_search = RandomizedSearchCV(estimator=gb_model_base,
                                   param_distributions=param_dist,
                                   n_iter=200,
                                   scoring='accuracy',
                                   cv=5,
                                   random_state=42,
                                   n_jobs=-1)

print("Starting Randomized Search (Hyperparameter Tuning)...")
random_search.fit(X_train, y_train)
print("Randomized Search Complete!")

print("\nBest Hyperparameters found with Randomized Search:", random_search.best_params_)
print("Best Cross-Validation Score found with Randomized Search:", random_search.best_score_)

print("Evaluating model on test set with best parameters...")
best_gb_model_random = random_search.best_estimator_
final_predictions_random = best_gb_model_random.predict(X_test)
final_accuracy_random = accuracy_score(y_test, final_predictions_random)

# Report the test accuracy of the best model found by the randomized search
print("Final Test Set Accuracy with Best Tuned Parameters (Randomized Search): {:.4f}".format(final_accuracy_random))

Starting Randomized Search (Hyperparameter Tuning)...
Randomized Search Complete!

Best Hyperparameters found with Randomized Search: {'subsample': 0.7, 'n_estimators': 500, 'max_depth': 4, 'learning_rate': 0.05}
Best Cross-Validation Score found with Randomized Search: 0.9802197802197803
Evaluating model on test set with best parameters...
Final Test Set Accuracy with Best Tuned Parameters (Randomized Search): 0.9561
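
When scaling or univariate feature selection is combined with a hyperparameter search, one option is to wrap the steps in a `Pipeline` so that they are refit inside each cross-validation fold and the number of selected features can be tuned together with the model. The sketch below is one possible arrangement, not the approach used in this notebook; the step names (`scale`, `select`, `gb`), the small parameter ranges, and `n_iter=5` are assumptions chosen to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Every step is refit on the training part of each fold during the search
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("gb", GradientBoostingClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_dist = {
    "select__k": [10, 20, 30],
    "gb__n_estimators": [50, 100],
    "gb__learning_rate": [0.05, 0.1],
    "gb__max_depth": [2, 3],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist,
    n_iter=5,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_train, y_train)

# score() runs the whole fitted pipeline on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"test accuracy: {test_accuracy:.4f}")
```

Because the scaler and selector live inside the pipeline, the cross-validation score reflects the full preprocessing-plus-model procedure rather than the classifier alone.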


In [3]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
import numpy as np

cancer = load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
cancer_df['target'] = cancer.target

print("First 5 rows of the dataset:")
print(cancer_df.head(5))

X = cancer_df.drop('target', axis=1).values
y = cancer_df['target'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

selector = SelectKBest(score_func=f_classif, k=20)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

param_dist = {
    'n_estimators': [100, 150, 200, 250, 300],
    'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.1],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 0.9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

gb_model_base = GradientBoostingClassifier(random_state=42)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
random_search = RandomizedSearchCV(estimator=gb_model_base,
                                   param_distributions=param_dist,
                                   n_iter=100,
                                   scoring='accuracy',
                                   cv=skf,
                                   random_state=42,
                                   n_jobs=-1,
                                   verbose=2)

print("Starting Randomized Search (Hyperparameter Tuning)...")
random_search.fit(X_train_selected, y_train)
print("Randomized Search Complete!")

print("\nBest Hyperparameters found with Randomized Search:", random_search.best_params_)
print("Best Cross-Validation Score found with Randomized Search:", random_search.best_score_)

print("Evaluating model on test set with best parameters...")
best_gb_model_random = random_search.best_estimator_
final_predictions_random = best_gb_model_random.predict(X_test_selected)
final_accuracy_random = accuracy_score(y_test, final_predictions_random)

print("Final Test Set Accuracy with Best Tuned Parameters (Randomized Search): {:.4f}".format(final_accuracy_random))

First 5 rows of the dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter 