# Homework 7 (30 marks)
Create a copy of the notebook to start answering the questions. Name your notebook in the format HW7_lastname_firstname.ipynb to facilitate the grading process.

Answer all the questions and test your code to ensure there are no errors and the results are as expected. Once you have answered all the questions, save the final copy, then go to File and click Download .ipynb. Once the local copy has been downloaded, submit your file on Blackboard under the corresponding assignment section. Also provide a link to your notebook with your submission.

NOTE: Please give the TAs permission to access your notebook through the link you provide with your submission.

The due date of this homework is 04/23/2021 (Friday).

Please ensure you follow all the steps mentioned in the homework.

You can submit your solutions any number of times until the deadline.

The datasets used in this homework can be found at the Google Drive link below:

https://drive.google.com/drive/folders/1NxCh4X7u7wVo5aHojxjLNs9wC7B7zJhb?usp=sharing

Follow the necessary steps to import the data and test your code. You can use any method to read the data into the notebook; we will not grade the method you use. We will only grade the code from the point where the dataset is read into a pandas DataFrame (pd.read_csv('file_name')).

Import all the libraries you require in the cell below.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Enter your code here

## Implement the ensemble methods learnt in class and compare their accuracies.

The dataset you are going to be using for homework is the **Wisconsin Breast Cancer dataset (cancer.csv)**

The dataset has 699 instances, each described by 10 attributes (an id number and 9 cell features) and labeled as either benign or malignant. 16 feature values are missing. All recorded values are numeric.

Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant) (**target variable**)

For more information: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

### 1. Read the dataset into variable called '**data**' (1 mark)

In [None]:
pd.set_option('display.max_columns', 100)
data = pd.read_csv('cancer(3).csv')

data.head()



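|   | Sample Code Number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bland Chromatin | Bare Nuclei | Normal Nucleoli | Mitosis | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |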


### **Preprocessing**: Data needs to be preprocessed before implementing ensemble methods. It is done for you here. 
### Run the code below first, then answer questions 2 - 7.

#### Deleting unnecessary columns: The column "Sample code number" is just an indicator and it's of no use in the modeling. So, let's drop it:


In [None]:
data.drop(['Sample Code Number'],axis = 1, inplace = True)


#### Handling missing values : 
As mentioned earlier, the dataset contains missing values. They are in the column named "Bland Chromatin" and are represented as "?".

Replace each "?" with 0, then impute the missing entries with mean imputation.

In [None]:
data['Bland Chromatin']

0       1
1      10
2       2
3       4
4       1
       ..
694     2
695     1
696     3
697     4
698     5
Name: Bland Chromatin, Length: 699, dtype: object

In [None]:
data.replace('?',0, inplace=True)

In [None]:
# Convert the DataFrame object into NumPy array otherwise you will not be able to impute
values = data.values
# Now impute it
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputedData = imputer.fit_transform(values)
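
One thing worth checking in the cells above: once each "?" has been replaced with 0, those entries are no longer NaN, so an imputer configured with `missing_values=np.nan` finds nothing to fill. The sketch below shows the idea of mean imputation in plain Python on a made-up column; `raw_column` and `mean_impute` are hypothetical names used only for illustration. In scikit-learn, a comparable effect could be obtained by marking the "?" entries as `np.nan` before imputing, or by passing `missing_values=0` to `SimpleImputer`.

```python
# Hypothetical raw column: numeric strings, with "?" marking missing entries.
raw_column = ["1", "10", "?", "4", "1", "?"]

def mean_impute(column, missing_marker="?"):
    """Replace missing entries with the mean of the observed values."""
    observed = [float(v) for v in column if v != missing_marker]
    mean_value = sum(observed) / len(observed)
    return [mean_value if v == missing_marker else float(v) for v in column]

imputed = mean_impute(raw_column)
print(imputed)
```

Note that the mean is computed from the observed values only; including placeholder zeros in the average would pull it downward.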

#### Normalizing the data:
The features do not all span the same range, which can let features with larger values dominate models that are sensitive to scale. To address this, rescale every feature to a common range, in this case 0 - 1.
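
As a rough illustration of what min-max scaling does, each value x is mapped to (x - min) / (max - min). The sketch below applies this to the first five "Clump Thickness" values from the dataset, plus the endpoints 1 and 10 from the attribute description so that the full range is present; `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the smallest maps to low and the largest to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread; map every entry to low.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

# First five Clump Thickness values, followed by the range endpoints 1 and 10.
clump_thickness = [5, 5, 3, 6, 4, 1, 10]
scaled = min_max_scale(clump_thickness)
print([round(v, 6) for v in scaled])
```

A value of 5 becomes (5 - 1) / 9, roughly 0.444, which agrees with the first "Clump Thickness" entry in the normalized output.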

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)
cols = ['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bland Chromatin', 'Bare Nuclei', 'Normal Nucleoli', 'Mitosis','Class']
normalizedData = pd.DataFrame(normalizedData, columns=cols)
print(normalizedData.head())

   Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  \
0         0.444444                 0.000000                  0.000000   
1         0.444444                 0.333333                  0.333333   
2         0.222222                 0.000000                  0.000000   
3         0.555556                 0.777778                  0.777778   
4         0.333333                 0.000000                  0.000000   

   Marginal Adhesion  Single Epithelial Cell Size  Bland Chromatin  \
0           0.000000                     0.111111              0.1   
1           0.444444                     0.666667              1.0   
2           0.000000                     0.111111              0.2   
3           0.000000                     0.222222              0.4   
4           0.222222                     0.111111              0.1   

   Bare Nuclei  Normal Nucleoli  Mitosis  Class  
0     0.222222         0.000000      0.0    0.0  
1     0.222222         0.111111      0.0

### Data preprocessing is done and now you will answer the below questions using the **normalizedData**: 

### 2. Split the data into test and training data with test size - 30%. Compute the baseline classification accuracy for X_train. (3 marks)

In [None]:
# Enter your code here
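# Features are all columns except the last; the target is the last column
X = data.iloc[:, :-1]
y = data.iloc[:, -1]  # 1-D Series, so scikit-learn does not warn about a column-vector y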

print(X)
print(y)

print(data.shape)
print(X.shape)
print(y.shape)

# partition data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
                                                    X,
                                                    y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42
                                                    )

     Clump Thickness  Uniformity of Cell Size   Uniformity of Cell Shape  \
0                  5                         1                         1   
1                  5                         4                         4   
2                  3                         1                         1   
3                  6                         8                         8   
4                  4                         1                         1   
..               ...                       ...                       ...   
694                3                         1                         1   
695                2                         1                         1   
696                5                        10                        10   
697                4                         8                         6   
698                4                         8                         8   

     Marginal Adhesion  Single Epithelial Cell Size Bland Chromatin  \
0               

In [None]:
from sklearn.dummy import DummyClassifier
dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train,y_train)
baseline_acc = dummy_classifier.score(X_test,y_test)


### For verifying answer:
print("Baseline Accuracy = ", baseline_acc)

Baseline Accuracy =  0.6571428571428571
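
The most-frequent baseline always predicts the majority class, so its accuracy on a set is simply that class's share of the rows. A quick check, assuming the test split holds 138 benign and 72 malignant samples (the support counts shown in the classification report for question 4) and that the majority class is the same in the training split, which stratified splitting should ensure:

```python
from collections import Counter

# Test-set class counts, taken from the classification report's support column.
test_labels = [2] * 138 + [4] * 72

majority_class, majority_count = Counter(test_labels).most_common(1)[0]
baseline = majority_count / len(test_labels)
print(majority_class, baseline)
```

This gives 138 / 210, in line with the baseline accuracy printed above. Any ensemble worth keeping should beat this number.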


### 3.  Bagging : Build a generic Bagging ensemble and print the accuracy (4 marks)
---


Hyperparameters:

Base estimator = DecisionTreeClassifier

n_estimators = 10

random_state = 42

---


In [None]:
def create_bootstrap_sample(df):
    return df.sample(n= df.shape[0], replace = True)

bootstrap_sample = create_bootstrap_sample(X_train)

print('Number of rows should be the same:')
print('Number of rows in X_train:  ', X_train.shape[0])
print('Number of rows in bootstrap:', create_bootstrap_sample(X_train).shape[0])

print(bootstrap_sample)

Number of rows should be the same:
Number of rows in X_train:   489
Number of rows in bootstrap: 489
     Clump Thickness  Uniformity of Cell Size   Uniformity of Cell Shape  \
689                1                         1                         1   
260               10                         5                         8   
7                  2                         1                         2   
662                1                         1                         3   
23                 8                         4                         5   
..               ...                       ...                       ...   
105                7                         3                         4   
261                5                        10                        10   
303                1                         1                         1   
613                2                         3                         1   
389                5                         1                 
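
Because a bootstrap sample is drawn with replacement, it has as many rows as the original but repeats some rows and omits others. The sketch below, using only the standard library, draws row indices for a set of 489 rows (the size of X_train) and compares the fraction of distinct rows with the theoretical value 1 - (1 - 1/n)^n; `bootstrap_indices` and the seed are illustrative choices.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(42)  # seed chosen only to make the sketch repeatable
n_rows = 489             # number of rows in X_train
sample = bootstrap_indices(n_rows, rng)

unique_fraction = len(set(sample)) / n_rows
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows  # approaches 1 - 1/e as n grows
print(len(sample), round(unique_fraction, 3), round(expected_fraction, 3))
```

The rows left out of a given bootstrap sample are the "out-of-bag" rows, which is what the OOB score in question 4 is computed on.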

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

model_bagging = BaggingClassifier(n_estimators = 10, random_state = 42)
model_bagging.fit(X_train, y_train)
pred_bagging = model_bagging.predict(X_test)
acc_bagging = accuracy_score(y_test, pred_bagging)

print(' Accuracy = ', acc_bagging)

 Accuracy =  0.9571428571428572




### 4. RandomForest : (7 marks)
#### a) Build a Random Forest model and print the accuracy (4 marks)
---

Constructor arguments: 


n_estimators = 100, max_features = 7 and random_state = 42 


---




In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    recall_score,
    precision_score,
    accuracy_score
)

# Classification report for the bagging predictions from question 3
print('Classification Report:\n')
print(classification_report(y_test, pred_bagging))

# Random forest with the required constructor arguments
model_rf = RandomForestClassifier(n_estimators=100, max_features=7, random_state=42)
model_rf.fit(X_train, y_train)
predict_rf = model_rf.predict(X_test)
recall_rf = recall_score(y_test, predict_rf, average='macro')
precision_rf = precision_score(y_test, predict_rf, average='macro')

# Same forest with out-of-bag scoring enabled; compare the OOB estimate with the test accuracy
model_rf_oob = RandomForestClassifier(n_estimators=100, max_features=7, oob_score=True, random_state=42).fit(X_train, y_train)
oob_score = round(model_rf_oob.oob_score_, 4)
acc_oob = round(accuracy_score(y_test, model_rf_oob.predict(X_test)), 4)
diff_oob = round(abs(oob_score - acc_oob), 4)

print('OOB Score:\t\t\t', oob_score)
print('Testing Accuracy:\t\t', acc_oob)
print('Acc. Difference:\t\t', diff_oob)

Classification Report:

              precision    recall  f1-score   support

           2       0.97      0.96      0.97       138
           4       0.93      0.94      0.94        72

    accuracy                           0.96       210
   macro avg       0.95      0.95      0.95       210
weighted avg       0.96      0.96      0.96       210





OOB Score:			 0.9632
Testing Accuracy:		 0.9524
Acc. Difference:		 0.0108


####  b) Calculate the top 3 important features for the above **RandomForest** model and print them (3 marks)

In [None]:
# Top 3 features for RandomForest
# Enter your code here

feature_importances = model_rf.feature_importances_

# Pair each feature name with its importance and sort in descending order
df = pd.DataFrame(zip(X_train.columns, feature_importances)).sort_values(by=1, ascending=False)

print(df.iloc[0:3, :])

                          0         1
1  Uniformity of Cell Size   0.503253
5           Bland Chromatin  0.229520
2  Uniformity of Cell Shape  0.095969
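
Picking the top 3 features amounts to selecting the three largest (name, importance) pairs. Here is a minimal standard-library sketch; the importance values are made up for illustration and are not the model's. With pandas, `pd.Series(feature_importances, index=X_train.columns).nlargest(3)` should give an equivalent result.

```python
import heapq

# Hypothetical importances, for illustration only (not the fitted model's values).
importances = {
    "Clump Thickness": 0.05,
    "Uniformity of Cell Size": 0.40,
    "Uniformity of Cell Shape": 0.20,
    "Marginal Adhesion": 0.10,
    "Bare Nuclei": 0.25,
}

# Select the three pairs with the largest importance, in descending order.
top3 = heapq.nlargest(3, importances.items(), key=lambda item: item[1])
for name, score in top3:
    print(f"{name}: {score:.2f}")
```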


### 5. Boosting: (7 marks)
#### a) Build an AdaBoost model with training data and print the accuracy (4 marks)
---

Hyperparameters:

Base estimator = DecisionTreeClassifier, max_depth = 4

n_estimators = 200

random_state = 42

learning_rate = 0.05


---









In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Depth-limited decision tree as the base estimator
base_est = DecisionTreeClassifier(max_depth=4)
ada_boost1 = AdaBoostClassifier(base_est, n_estimators=200, random_state=42, learning_rate=0.05)
ada_boost1.fit(X_train, y_train)

predict_ada = ada_boost1.predict(X_test)
accuracy_ada = accuracy_score(y_test, predict_ada)
score = round(accuracy_ada, 10)
print("accuracy score: ", score)

print(ada_boost1)



accuracy score:  0.9523809524
AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='gini',
                                                         max_depth=4,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecate

#### b) Calculate the top 3 important features for the above **AdaBoost** model and print them (3 marks)

In [None]:
feature_importances = ada_boost1.feature_importances_

# Pair each feature name with its importance and sort in descending order
imp = pd.DataFrame(zip(X_train.columns, feature_importances)).sort_values(by=1, ascending=False)
print(imp.iloc[0:3, :])

                          0         1
0           Clump Thickness  0.323357
1  Uniformity of Cell Size   0.183736
7           Normal Nucleoli  0.138899


### 6. Voting : Using a voting classifier, build an ensemble of RandomForestClassifier, DecisionTreeClassifier, Support Vector Machine and Logistic Regression. (7 marks)


---


Use max_depth = 4, n_estimators = 200, voting = soft

In [None]:
# Voting Ensemble for Classification
# Enter your code here

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

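# Base learners; SVC needs probability=True so it can take part in soft voting
rfClf = RandomForestClassifier(n_estimators=500, random_state=0)
svmClf = SVC(probability=True, random_state=0)
logClf = LogisticRegression(random_state=0)

# Soft voting averages the predicted class probabilities of the base learners
clf2 = VotingClassifier(estimators=[('rf', rfClf), ('svm', svmClf), ('log', logClf)], voting='soft')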

clf2.fit(X_train, y_train)

clf2_pred = clf2.predict(X_test)
recall_voting = recall_score(y_test, clf2_pred, average='macro')
precision_voting = precision_score(y_test, clf2_pred, average='macro')
print('Accuracy score', accuracy_score(y_test, clf2_pred))



Accuracy score 0.9571428571428572
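
To see how soft voting differs from a simple majority vote, the sketch below averages per-class probabilities across three models for a single sample. The probabilities are hypothetical and `soft_vote` is an illustrative helper; it is meant to mirror the idea behind `voting='soft'`, not scikit-learn's implementation.

```python
def soft_vote(probabilities, classes):
    """Average per-class probabilities across models and pick the largest."""
    n_models = len(probabilities)
    averaged = [
        sum(model_probs[i] for model_probs in probabilities) / n_models
        for i in range(len(classes))
    ]
    best_index = max(range(len(classes)), key=lambda i: averaged[i])
    return classes[best_index], averaged

# Hypothetical predicted probabilities for one sample: [P(class 2), P(class 4)].
model_probs = [
    [0.60, 0.40],  # e.g. random forest
    [0.55, 0.45],  # e.g. SVM
    [0.10, 0.90],  # e.g. logistic regression
]
label, averaged = soft_vote(model_probs, classes=[2, 4])
print(label, [round(p, 3) for p in averaged])
```

In this example two of the three models lean towards class 2, so a hard vote would pick class 2, while the averaged probabilities favour class 4 because one model is very confident.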


### 7. Mention the best model among the above 4 models and its accuracy (1 mark)

In [None]:
# Write your answer here

print("The Voting and Bagging models tie for the highest accuracy, so both are the best models")
print("The accuracy is 0.9571428")

The Voting and Bagging models tie for the highest accuracy, so both are the best models
The accuracy is 0.9571428
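
Rather than reading the best model off by eye, the comparison can be done programmatically. A small sketch using the test accuracies printed in the cells above (the random forest entry is the rounded testing accuracy from the OOB cell); the dictionary layout is an illustrative choice:

```python
# Test accuracies as printed in the cells above.
accuracies = {
    "Bagging": 0.9571428571428572,
    "RandomForest": 0.9524,  # rounded testing accuracy from the OOB cell
    "AdaBoost": 0.9523809524,
    "Voting": 0.9571428571428572,
}

best_accuracy = max(accuracies.values())
# Keep every model that reaches the maximum, so ties are reported together.
best_models = [name for name, acc in accuracies.items() if acc == best_accuracy]
print(best_models, round(best_accuracy, 4))
```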
