<a href="https://colab.research.google.com/github/alberthtan/ensemble-methods/blob/main/Ensemble_Method_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Method Analysis

Datasets:

https://drive.google.com/drive/folders/1NxCh4X7u7wVo5aHojxjLNs9wC7B7zJhb?usp=sharing


Importing libraries.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

## Comparing the accuracies of various Ensemble Methods (Bagging, RandomForests, Boosting, and Voting)

**Wisconsin Breast Cancer dataset (cancer.csv)**

The dataset contains 699 instances, each labeled as either benign or malignant and described by 10 attributes (a sample ID plus nine cytological features scored from 1 to 10). 16 feature values are missing. The dataset only contains numeric values.

Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant) (**target variable**)

For more information: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

### 1. Reading the dataset

In [None]:
pd.set_option('display.max_columns', 100)
data = pd.read_csv('cancer.csv')
print(data.head())

   Sample Code Number  Clump Thickness  Uniformity of Cell Size   \
0             1000025                5                         1   
1             1002945                5                         4   
2             1015425                3                         1   
3             1016277                6                         8   
4             1017023                4                         1   

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
0                         1                  1                            2   
1                         4                  5                            7   
2                         1                  1                            2   
3                         8                  1                            3   
4                         1                  3                            2   

  Bland Chromatin  Bare Nuclei  Normal Nucleoli  Mitosis  Class  
0               1            3                1   

### **Preprocessing**

#### Deleting unnecessary columns: The column "Sample Code Number" is only an identifier and carries no information for modeling, so let's drop it:


In [None]:
data.drop(columns=['Sample Code Number'], inplace=True)


#### Handling missing values:
As mentioned earlier, the dataset contains missing values. They are in the column named "Bare Nuclei" and are represented as "?".

Replace each "?" with 0, then apply mean imputation.

In [None]:
data['Bare Nuclei']

0       3
1       3
2       3
3       3
4       3
       ..
694     1
695     1
696     8
697    10
698    10
Name: Bare Nuclei, Length: 699, dtype: int64

In [None]:
data.replace('?',0, inplace=True)

In [None]:
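# Convert the DataFrame into a NumPy array before imputing
values = data.values
# Mean imputation of entries equal to np.nan.
# Note: placeholders that were replaced with 0 are not np.nan, so this call
# leaves them unchanged; missing_values=0 would be needed to impute them.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputedData = imputer.fit_transform(values)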
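
The following is a minimal sketch of the "?" handling described above. It assumes a hypothetical toy column rather than the real `cancer.csv`, and marks missing entries as `NaN` so that `SimpleImputer` can find them. The names `toy` and `imputed` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column standing in for "Bare Nuclei"; "?" marks a missing value.
toy = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "4"]})

# Convert to numbers; anything that cannot be parsed (the "?") becomes NaN.
toy["Bare Nuclei"] = pd.to_numeric(toy["Bare Nuclei"], errors="coerce")

# Mean imputation: NaN is replaced by the mean of the observed values, (1 + 10 + 4) / 3 = 5.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)
print(imputed.ravel())
```

Marking missing entries as `NaN` rather than 0 keeps the placeholder from being mistaken for a real measurement.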

#### Normalizing the data:
The features do not all span the same range, which can let features with larger ranges dominate scale-sensitive models such as the SVM and logistic regression used later. To address this, rescale every column to a common range, in this case 0 to 1.

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)
cols = [
    'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
    'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bland Chromatin',
    'Bare Nuclei', 'Normal Nucleoli', 'Mitosis', 'Class',
]
normalizedData = pd.DataFrame(normalizedData, columns=cols)
print(normalizedData.head())

   Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  \
0         0.444444                 0.000000                  0.000000   
1         0.444444                 0.333333                  0.333333   
2         0.222222                 0.000000                  0.000000   
3         0.555556                 0.777778                  0.777778   
4         0.333333                 0.000000                  0.000000   

   Marginal Adhesion  Single Epithelial Cell Size  Bland Chromatin  \
0           0.000000                     0.111111              0.1   
1           0.444444                     0.666667              1.0   
2           0.000000                     0.111111              0.2   
3           0.000000                     0.222222              0.4   
4           0.222222                     0.111111              0.1   

   Bare Nuclei  Normal Nucleoli  Mitosis  Class  
0     0.222222         0.000000      0.0    0.0  
1     0.222222         0.111111      0.0
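
As a small illustration of what min-max scaling does, the sketch below compares the formula (x - min) / (max - min) with `MinMaxScaler` on a hypothetical feature scored from 1 to 10. The array `feature` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature scored 1 to 10, like the cytological features above.
feature = np.array([[1.0], [3.0], [5.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (feature - feature.min()) / (feature.max() - feature.min())

# The same transform through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(feature)

print(scaled.ravel())
print(np.allclose(manual, scaled))
```

A raw score of 3 on a 1 to 10 scale maps to 2/9 ≈ 0.222, which is consistent with the scaled values printed above.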

### Data preprocessing is complete. The data is in **normalizedData**.

### 2. Splitting the data into training and test sets (test size 30%) and computing the baseline classification accuracy

In [None]:
X = normalizedData.iloc[:, :-1]
y = normalizedData.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train, y_train)
baseline_acc = dummy_classifier.score(X_test, y_test)

print("Baseline Accuracy =", baseline_acc)


Baseline Accuracy = 0.680952380952381
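
To make the baseline easier to interpret, here is a minimal sketch on hypothetical labels (not the real data). It shows that a `most_frequent` dummy classifier always predicts the majority class of the training set, so its test accuracy equals the share of that class among the test labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 0 = benign, 1 = malignant (not the real data).
y_train_toy = np.array([0, 0, 0, 1, 1])
y_test_toy = np.array([0, 0, 1, 0])
X_train_toy = np.zeros((len(y_train_toy), 1))  # features are ignored by the dummy model
X_test_toy = np.zeros((len(y_test_toy), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)

# The model always predicts the training majority class (0 here),
# so its test accuracy is the share of that class in the test labels.
baseline = dummy.score(X_test_toy, y_test_toy)
majority_share = np.mean(y_test_toy == 0)
print(baseline, majority_share)
```

Any ensemble below should beat this number to be considered useful.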


### 3.  Bagging : Building a generic Bagging ensemble and calculating the accuracy
---


Hyperparameters:

Base estimator = DecisionTreeClassifier

n_estimators = 10

random_state = 42

---


In [None]:
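# Generic Bagging model
dt = DecisionTreeClassifier()
# Pass the base estimator positionally: the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
model_bagging = BaggingClassifier(dt, n_estimators=10, random_state=42)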
model_bagging.fit(X_train, y_train)
pred_bagging = model_bagging.predict(X_test)
acc_bagging = accuracy_score(y_test, pred_bagging)

print('Accuracy Score:', acc_bagging)


Accuracy Score: 0.9571428571428572
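
The idea behind bagging can be sketched by hand: train each tree on a bootstrap sample (rows drawn with replacement) and combine the trees by vote. The sketch below uses synthetic data from `make_classification` as a stand-in for `cancer.csv` and a plain majority vote; scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them, so this is an approximation of the idea, not a reimplementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not cancer.csv).
X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Majority vote across the 10 trees (labels are 0/1, so the mean is the vote share).
votes = np.stack([tree.predict(X_toy) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the hand-rolled ensemble:", (bagged_pred == y_toy).mean())
```

Because each tree sees a different resample, their individual errors partly cancel out in the vote.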


### 4. RandomForest
#### a) Building a Random Forest model and calculating the accuracy
---

Constructor arguments: 


n_estimators = 100, max_features = 7 and random_state = 42 


---




In [None]:
# Random Forest model
model_rf = RandomForestClassifier(n_estimators=100, max_features=7, random_state=42)
model_rf.fit(X_train, y_train)
predict_rf = model_rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, predict_rf)

print("Accuracy Score:", accuracy_rf)



Accuracy Score: 0.9619047619047619


####  b) Calculating the top 3 important features for the above **RandomForest** model 

In [None]:
# Top 3 features for RandomForest

feature_importances = model_rf.feature_importances_
features = X_train.columns
df = pd.DataFrame({'features': features, 'importance': feature_importances}).nlargest(3, 'importance')
print(df)
# Uniformity of Cell Size, Uniformity of Cell Shape, and Bland Chromatin

                   features  importance
1   Uniformity of Cell Size    0.396144
2  Uniformity of Cell Shape    0.249673
5           Bland Chromatin    0.176869
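
For reading the importance column, it may help to know that `feature_importances_` is normalized to sum to 1, so each value is a share of the total. The sketch below checks this on synthetic data; the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (not cancer.csv).
X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=42)
names = [f"feature_{i}" for i in range(X_toy.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)
importances = pd.Series(forest.feature_importances_, index=names)

# Impurity-based importances are normalized, so they sum to 1.
print("Sum of importances:", importances.sum())
top3 = importances.nlargest(3)
print(top3)
```

By that reading, the three features listed above account for roughly 0.82 of the total importance in the random forest.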


### 5. Boosting: 
#### a) Building an AdaBoost model with the training data and printing the accuracy
---

Hyperparameters:

Base estimator = DecisionTreeClassifier, max_depth = 4

n_estimators = 200

random_state = 42

learning_rate = 0.05


---









In [None]:
# AdaBoost Classification
# Shallow trees (max_depth=4) serve as the base learners
base_est = DecisionTreeClassifier(max_depth=4)
ada_boost = AdaBoostClassifier(base_est, n_estimators=200, random_state=42, learning_rate=0.05)
ada_boost.fit(X_train, y_train)
ada_pred = ada_boost.predict(X_test)
print('Accuracy score:', accuracy_score(y_test, ada_pred))

Accuracy score: 0.9619047619047619
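
Boosting adds learners one round at a time, and `staged_score` exposes the accuracy after each round. The sketch below is an illustration on synthetic data with decision stumps; the settings (`max_depth=1`, 50 rounds, `learning_rate=0.5`) are assumptions chosen for the example, not the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not cancer.csv).
X_toy, y_toy = make_classification(n_samples=300, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Decision stumps as weak learners; settings are illustrative only.
booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=0.5,
    random_state=42,
)
booster.fit(X_tr, y_tr)

# Test accuracy after each boosting round.
staged = list(booster.staged_score(X_te, y_te))
print("Rounds fitted:", len(staged))
print("Accuracy after round 1:", staged[0], "and after the last round:", staged[-1])
```

Plotting `staged` against the round number is one way to judge whether 200 estimators are needed or fewer would do.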


#### b) Calculating the top 3 important features for the above **AdaBoost** model

In [None]:
# Top 3 features for AdaBoost
feature_importances = ada_boost.feature_importances_
features = X_train.columns
df = pd.DataFrame({'features': features, 'importance': feature_importances}).nlargest(3, 'importance')
print(df)
# Clump Thickness, Bare Nuclei, and Bland Chromatin


          features  importance
0  Clump Thickness    0.194414
6      Bare Nuclei    0.190095
5  Bland Chromatin    0.142675


### 6. Voting: Using a voting classifier to build an ensemble of RandomForestClassifier, DecisionTreeClassifier, Support Vector Machine and Logistic Regression


---


Hyperparameters: max_depth = 4 (decision tree), n_estimators = 200 (random forest), voting = soft

In [None]:
# Voting Ensemble for Classification

rfClf = RandomForestClassifier(n_estimators=200, random_state=42)
dtClf = DecisionTreeClassifier(max_depth=4, random_state=42)
svmClf = SVC(probability=True, random_state=42)
logClf = LogisticRegression(random_state=42)

clf = VotingClassifier(estimators = [('rf',rfClf), ('dt',dtClf), ('svm',svmClf), ('log', logClf)], voting='soft')
clf.fit(X_train, y_train)
clf_pred = clf.predict(X_test)
print('Accuracy score', accuracy_score(y_test, clf_pred))

Accuracy score 0.9666666666666667
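
Soft voting averages the class probabilities of the member models and picks the class with the highest mean, which is why the SVM above is created with `probability=True`. The sketch below uses made-up probabilities for a single sample to show how soft and hard voting can disagree.

```python
import numpy as np

# Hypothetical class probabilities [P(benign), P(malignant)] from four models for one sample.
probas = np.array([
    [0.55, 0.45],  # random forest
    [0.60, 0.40],  # decision tree
    [0.10, 0.90],  # SVM
    [0.55, 0.45],  # logistic regression
])

# Soft voting: average the probabilities, then pick the class with the highest mean.
mean_proba = probas.mean(axis=0)
soft_vote = int(np.argmax(mean_proba))

# Hard voting: each model votes for its own top class; the majority wins.
hard_votes = probas.argmax(axis=1)
hard_vote = int(np.bincount(hard_votes).argmax())

print(mean_proba, soft_vote, hard_vote)
```

In this example one confident model outweighs three lukewarm ones under soft voting, while hard voting follows the three-to-one majority.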


### 7. Finding the best model among the above 4 models

In [None]:
print("The best model is the Voting ensemble with an accuracy of", accuracy_score(y_test, clf_pred))



The best model is the Voting ensemble with an accuracy of 0.9666666666666667
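
Rather than hard-coding the winner, the comparison can be sketched with a dictionary of the test accuracies reported above. The test-set size of 210 is inferred from the 30% split of 699 rows and is an assumption of this sketch.

```python
# Test accuracies as reported in the outputs above.
accuracies = {
    "Bagging": 0.9571428571428572,
    "Random Forest": 0.9619047619047619,
    "AdaBoost": 0.9619047619047619,
    "Voting": 0.9666666666666667,
}

# Pick the model with the highest accuracy.
best_name = max(accuracies, key=accuracies.get)
print(f"Best model: {best_name} ({accuracies[best_name]:.4f})")

# Assumed test-set size: 30% of 699 rows, rounded up to 210.
n_test = 210
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
print(correct)
```

On this split the four ensembles differ by only one or two test predictions, so the ranking may not hold under a different `random_state`.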
