
The datasets used in this homework are available at the Google Drive link below:

https://drive.google.com/drive/folders/1NxCh4X7u7wVo5aHojxjLNs9wC7B7zJhb?usp=sharing


Import all the libraries you require in the cell below.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import seaborn as sns

## Implement the ensemble methods learnt in class and compare their accuracies.

The dataset you are going to be using for homework is the **Wisconsin Breast Cancer dataset (cancer.csv)**

The dataset contains 699 instances, each described by 10 attributes and labeled as either benign or malignant. Of the attribute values, 16 are missing. The dataset only contains numeric values.

Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant) (**target variable**)

For more information: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

### 1. Read the dataset into variable called '**data**' (1 mark)

In [None]:
# Enter your code here

pd.set_option('display.max_columns', 100)
data = pd.read_csv('cancer.csv')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample Code Number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bland Chromatin              699 non-null    object
 7   Bare Nuclei                  699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitosis                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
None


### **Preprocessing**: Data needs to be preprocessed before implementing ensemble methods. It is done for you here. 
### Run the below code first and then answer the questions from 2 - 7.

#### Deleting unnecessary columns: The column "Sample code number" is only an identifier and is of no use in the modeling, so drop it:


In [None]:
data.drop(['Sample Code Number'],axis = 1, inplace = True)

#### Handling missing values : 
As mentioned earlier, the dataset contains missing values. The column named "Bare Nuclei" contains them. The missing values are represented as "?".

Replace each "?" with 0, then fill those entries using mean imputation.
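`SimpleImputer` only fills entries that match its `missing_values` setting, so the placeholder and the imputer have to agree. The following is a minimal sketch on a hypothetical toy column (not taken from cancer.csv), which marks missing entries as NaN rather than 0:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column; "?" marks a missing entry.
toy = pd.DataFrame({"feature": ["1", "3", "?", "5"]})

# Turn "?" into NaN so the imputer can recognise it as missing.
toy["feature"] = pd.to_numeric(toy["feature"], errors="coerce")

# Replace each NaN with the mean of the observed values: (1 + 3 + 5) / 3 = 3.
toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imputer.fit_transform(toy[["feature"]])
print(toy_filled.ravel())  # [1. 3. 3. 5.]
```

The mean is computed over the observed entries only, so the placeholder itself never influences the imputed value.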

In [None]:
data['Bare Nuclei']

data[data['Bare Nuclei']=='?']

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Bare Nuclei,Normal Nucleoli,Mitosis,Class


In [None]:
data.replace('?',0, inplace=True)

In [None]:
# Convert the DataFrame object into NumPy array otherwise you will not be able to impute
values = data.values
#print(data)
#print(data.values)

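# Now impute it. The "?" placeholders were replaced with 0 above, so 0 is the
# marker SimpleImputer must treat as missing (np.nan would match nothing).
# No genuine value is 0: features range from 1 to 10 and Class is 2 or 4.
imputer = SimpleImputer(missing_values=0, strategy='mean')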
imputedData = imputer.fit_transform(values)

#### Normalizing the data:
The features of a dataset do not always share the same range, and scale-sensitive models can then give features with larger ranges undue weight. To address this, rescale every column to a common range, in this case 0 to 1. Because the scaler is applied to the whole array, the Class labels 2 and 4 are also mapped to 0 and 1.
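Min-max scaling maps each value x to (x - min) / (max - min). A small sketch on a hypothetical toy feature, checking the manual formula against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature with values between 1 and 10.
toy_feature = np.array([[1.0], [4.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (toy_feature - toy_feature.min()) / (toy_feature.max() - toy_feature.min())

# MinMaxScaler applies the same formula column by column.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy_feature)

print(manual.ravel())
print(np.allclose(manual, scaled))  # True
```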

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)
cols = ['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bland Chromatin', 'Bare Nuclei', 'Normal Nucleoli', 'Mitosis','Class']
normalizedData = pd.DataFrame(normalizedData, columns=cols)
#print(normalizedData.head())

### Data preprocessing is done and now you will answer the below questions using the **normalizedData**: 

### 2. Split the data into training and test sets with a test size of 30%. Compute the baseline classification accuracy for X_train. (3 marks)

In [None]:
# Enter your code here

X = normalizedData.iloc[:, 0:-1]
y = normalizedData.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
print(X_train.shape)
print(y_train.shape)


from sklearn.dummy import DummyClassifier
dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train,y_train)
baseline_acc = dummy_classifier.score(X_test,y_test)
print("Baseline Accuracy = ", baseline_acc)

(489, 9)
(489,)
Baseline Accuracy =  0.6571428571428571
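With `strategy='most_frequent'`, the baseline is simply the share of test labels that equal the majority class of the training labels. A minimal sketch with hypothetical labels (not drawn from cancer.csv) makes the arithmetic explicit:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 0 = benign, 1 = malignant.
toy_y_train = np.array([0, 0, 0, 1, 1])
toy_y_test = np.array([0, 0, 1, 1])
toy_X_train = np.zeros((len(toy_y_train), 1))  # features are ignored by the dummy model
toy_X_test = np.zeros((len(toy_y_test), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(toy_X_train, toy_y_train)

# The majority class is learned from the training labels...
majority = np.bincount(toy_y_train).argmax()
# ...and the baseline accuracy is how often that class occurs in the test labels.
manual_baseline = np.mean(toy_y_test == majority)

print(dummy.score(toy_X_test, toy_y_test), manual_baseline)  # 0.5 0.5
```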


### 3.  Bagging : Build a generic Bagging ensemble and print the accuracy (4 marks)
---


Hyperparameters:

Base estimator = DecisionTreeClassifier

n_estimators = 10

random_state = 42

---


In [None]:
# Generic Bagging model

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

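# The base estimator is passed positionally: its keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases.
bagged_model = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 10, random_state = 42)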
bagged_model.fit(X_train, y_train)
pred_bagged = bagged_model.predict(X_test)
acc_bagged = accuracy_score(y_test, pred_bagged)
print('Bagging Accuracy = ', acc_bagged)

Bagging Accuracy =  0.9476190476190476
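To show what bagging does internally, here is a simplified hand-rolled sketch: each tree is trained on a bootstrap sample and the predictions are combined by majority vote. It uses synthetic stand-in data from `make_classification` (an assumption, not cancer.csv). `BaggingClassifier` itself averages predicted probabilities when the base estimator provides them, so this is an illustration rather than a reimplementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not cancer.csv).
Xs, ys = make_classification(n_samples=300, n_features=9, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.3, random_state=42, stratify=ys
)

rng = np.random.default_rng(42)
n_trees = 10
votes = np.zeros((n_trees, len(Xs_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap: draw a training set of the same size, with replacement.
    idx = rng.integers(0, len(Xs_train), size=len(Xs_train))
    tree = DecisionTreeClassifier(random_state=i).fit(Xs_train[idx], ys_train[idx])
    votes[i] = tree.predict(Xs_test)

# Aggregate: majority vote across the trees (labels are 0/1; ties go to class 1).
manual_bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
manual_bagged_acc = np.mean(manual_bagged_pred == ys_test)
print("Manual bagging accuracy:", manual_bagged_acc)
```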


### 4. RandomForest : (7 marks)
#### a) Build a Random Forest model and print the accuracy (4 marks)
---

Constructor arguments: 


n_estimators = 100, max_features = 7 and random_state = 42 


---




In [None]:
# Random Forest model

from sklearn.ensemble import RandomForestClassifier

randomforest_model = RandomForestClassifier(n_estimators=100, max_features=7, random_state=42)
randomforest_model.fit(X_train, y_train)
pred_randomforest = randomforest_model.predict(X_test)
acc_randomforest = accuracy_score(y_test, pred_randomforest)
print('Random Forest Accuracy = ', acc_randomforest)

Random Forest Accuracy =  0.9523809523809523


####  b) Calculate the top 3 important features for the above **RandomForest** model and print them (3 marks)

In [None]:
# Top 3 features for RandomForest

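# Pair each feature name with its importance, then sort in descending order.
# (The result is not named `sorted`, to avoid shadowing the Python built-in.)
imp = pd.DataFrame(zip(X_train.columns, randomforest_model.feature_importances_))
sorted_imp = imp.sort_values(by = 1, ascending = False)
print('Top three most important features are:', sorted_imp[0:3])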

Top three most important features are:                           0         1
1   Uniformity of Cell Size  0.428607
2  Uniformity of Cell Shape  0.263046
5           Bland Chromatin  0.097426
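An alternative way to rank importances is a named `pandas.Series` with `nlargest`. The sketch below uses synthetic data and hypothetical feature names (`feature_0` … `feature_8`), not the cancer.csv columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (not cancer.csv).
Xf, yf = make_classification(n_samples=300, n_features=9, random_state=42)
feature_names = [f"feature_{i}" for i in range(Xf.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xf, yf)

# Index the importances by feature name, then keep the three largest.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top3 = importances.nlargest(3)
print(top3)
```

Labelling the values by feature name keeps the printed result readable without numeric column headers.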


### 5. Boosting: (7 marks)
#### a) Build an AdaBoost model with training data and print the accuracy (4 marks)
---

Hyperparameters:

Base estimator = DecisionTreeClassifier, max_depth = 4

n_estimators = 200

random_state = 42

learning_rate = 0.05


---









In [None]:
# AdaBoost Classification

#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

base_estimator = DecisionTreeClassifier(max_depth = 4)
adaboost = AdaBoostClassifier(base_estimator, n_estimators=200, random_state=42, learning_rate=0.05)
adaboost.fit(X_train, y_train)
pred_adaboost = adaboost.predict(X_test)
acc_adaboost = accuracy_score(y_test, pred_adaboost)
print('Adaboost Accuracy = ', acc_adaboost)

Adaboost Accuracy =  0.9238095238095239
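With AdaBoost, `learning_rate` and `n_estimators` trade off against each other: a smaller learning rate shrinks each learner's contribution, so more rounds are typically needed. One way to inspect this is `staged_score`, which yields the accuracy after each boosting round. The sketch below assumes synthetic stand-in data and uses 50 rounds for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not cancer.csv).
Xa, ya = make_classification(n_samples=300, n_features=9, random_state=42)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    Xa, ya, test_size=0.3, random_state=42, stratify=ya
)

booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=50,
    learning_rate=0.05,
    random_state=42,
)
booster.fit(Xa_train, ya_train)

# staged_score yields the test accuracy after each boosting round.
staged_acc = list(booster.staged_score(Xa_test, ya_test))
print("After 1 round:", staged_acc[0])
print("After all rounds:", staged_acc[-1])
```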


#### b) Calculate the top 3 important features for the above **AdaBoost** model and print them (3 marks)

In [None]:
# Top 3 features for AdaBoost

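# Pair each feature name with its importance, then sort in descending order.
# (The result is not named `sorted`, to avoid shadowing the Python built-in.)
imp_ad = pd.DataFrame(zip(X_train.columns, adaboost.feature_importances_))
sorted_imp_ad = imp_ad.sort_values(by = 1, ascending = False)
print('Top three most important features are:', sorted_imp_ad[0:3])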

Top three most important features are:                              0         1
0              Clump Thickness  0.346175
4  Single Epithelial Cell Size  0.197838
1      Uniformity of Cell Size  0.091346


### 6. Voting : Using a voting classifier, build an ensemble of RandomForestClassifier, DecisionTreeClassifier, Support Vector Machine and Logistic Regression. (7 marks)


---


Use max_depth = 4, n_estimators = 200, voting = soft

In [None]:
# Voting Ensemble for Classification

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm

#randomforest_model created above
decisiontree_model = DecisionTreeClassifier()
logisticregression_model = LogisticRegression()
supportvector_model = svm.SVC()

decisiontree_model.fit(X_train, y_train)
pred_decisiontree = decisiontree_model.predict(X_test)

logisticregression_model.fit(X_train, y_train)
pred_logisticregression = logisticregression_model.predict(X_test)

supportvector_model.fit(X_train, y_train)
pred_supportvector = supportvector_model.predict(X_test)

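# Labels are 0/1 after scaling, so floor-dividing the summed predictions by 4
# gives 1 only when all four models predict 1 (a unanimous vote, not a rounded mean).
averaged_prediction = (pred_randomforest + pred_decisiontree + pred_logisticregression + pred_supportvector )//4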
acc_average = accuracy_score(y_test, averaged_prediction)
print("Averaged Accuracy = ", acc_average)

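# Hard (majority) voting over the SVC, decision tree and logistic regression models;
# VotingClassifier refits clones of these estimators on the training data.
voting_model= VotingClassifier(estimators=[('SVC', supportvector_model), ('DTree', decisiontree_model), ('LogReg', logisticregression_model)], voting='hard')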
voting_model.fit(X_train, y_train)
pred_voting = voting_model.predict(X_test)
acc_voting = accuracy_score(y_test, pred_voting)
print("Voting Accuracy =", acc_voting)

Averaged Accuracy =  0.9380952380952381
Voting Accuracy = 0.9666666666666667
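For comparison, the configuration named in the task statement (four estimators, `max_depth = 4`, `n_estimators = 200`, soft voting) could look roughly like the sketch below. It assumes synthetic stand-in data and applies `max_depth=4` to both the random forest and the decision tree, which is one reading of the task. Soft voting averages class probabilities, so `SVC` needs `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not cancer.csv).
Xv, yv = make_classification(n_samples=300, n_features=9, random_state=42)
Xv_train, Xv_test, yv_train, yv_test = train_test_split(
    Xv, yv, test_size=0.3, random_state=42, stratify=yv
)

soft_voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)),
        ("dtree", DecisionTreeClassifier(max_depth=4, random_state=42)),
        # Soft voting averages predict_proba, which SVC only provides with probability=True.
        ("svc", SVC(probability=True, random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
soft_voter.fit(Xv_train, yv_train)
soft_acc = soft_voter.score(Xv_test, yv_test)
print("Soft voting accuracy:", soft_acc)
```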


### 7. Mention the best model among the above 4 models and its accuracy (1 mark)

In [None]:
# Write your answer here

print('Random Forest Accuracy = ', acc_randomforest)
print('Bagging Accuracy = ', acc_bagged)
print('Adaboost Accuracy = ', acc_adaboost)
print('Voting Accuracy = ', acc_voting)

dictionary = {"Random Forest":acc_randomforest, "Bagging":acc_bagged, "Adaboost":acc_adaboost, "Voting":acc_voting}
print('The model with the highest accuracy is:', max(dictionary, key=dictionary.get))

Random Forest Accuracy =  0.9523809523809523
Bagging Accuracy =  0.9476190476190476
Adaboost Accuracy =  0.9238095238095239
Voting Accuracy =  0.9666666666666667
The model with the highest accuracy is: Voting
