# Module 5 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. In the menu bar, select Kernel → Restart & Run All. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menu bar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

# We do this to ignore warnings
import warnings
warnings.filterwarnings("ignore")

---
# Prepare Breast Cancer Data

This assignment will use the breast cancer dataset. Before we attempt to build a model, we first prepare the data.

Please run the next code cell before proceeding to Problem 1.


In [2]:
#Load breast cancer dataset
df = pd.read_csv('data/breast-cancer-wisconsin.csv')
label = df['class']
data = df[['clump thickness', 'uniformity cell size', 'uniformity cell shape', 'marginal adhesion', 'epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses']]
data.sample(2)

| | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 572 | 5 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 |
| 70 | 1 | 3 | 3 | 2 | 2 | 1 | 7 | 2 | 1 |


---
# Problem 1: Get Feature Variances by Variance Thresholding

Normalize features and get feature variances.

For this problem you will use the DataFrame **data** defined above.

To solve this problem do the following:
- Use `MinMaxScaler` to normalize the DataFrame **data** created above, and assign the normalized data to the variable __data_ss__.
- Create a `VarianceThreshold` feature selector from scikit-learn with the default `threshold` (which is 0).
- Fit and transform the selector on the normalized array **data_ss**.
- Retrieve the feature variances from the selector's `variances_` attribute and assign them to the variable **feature_variance**.

After this problem, there's a new variable **feature_variance** defined.
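
The steps above can be sketched on a tiny, made-up matrix rather than the breast cancer data. The array `X` and its values are hypothetical and chosen only so the result is easy to check by hand: the third column is constant, so it has zero variance and is dropped by the default threshold.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: 4 samples, 3 features (the third column is constant).
X = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 30.0, 5.0],
    [4.0, 40.0, 5.0],
])

# Rescale every column to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# threshold defaults to 0.0, so only zero-variance columns are removed.
selector = VarianceThreshold()
X_kept = selector.fit_transform(X_scaled)

# Per-feature variances computed during fit, one entry per input column.
variances = selector.variances_
print(variances)
print(X_kept.shape)
```

Because min-max scaling puts every feature on the same 0 to 1 scale, the variances become comparable across features, which is why the assignment normalizes before inspecting `variances_`.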

---

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# YOUR CODE HERE
minmax = MinMaxScaler()

In [4]:
data_ss = minmax.fit_transform(data)

In [5]:
data_ss

array([[0.44444444, 0.        , 0.        , ..., 0.22222222, 0.        ,
        0.        ],
       [0.44444444, 0.33333333, 0.33333333, ..., 0.22222222, 0.11111111,
        0.        ],
       [0.22222222, 0.        , 0.        , ..., 0.22222222, 0.        ,
        0.        ],
       ...,
       [0.44444444, 1.        , 1.        , ..., 0.77777778, 1.        ,
        0.11111111],
       [0.33333333, 0.77777778, 0.55555556, ..., 1.        , 0.55555556,
        0.        ],
       [0.33333333, 0.77777778, 0.77777778, ..., 1.        , 0.33333333,
        0.        ]])

In [6]:
vt = VarianceThreshold()

In [7]:
vt.fit_transform(data_ss)

array([[0.44444444, 0.        , 0.        , ..., 0.22222222, 0.        ,
        0.        ],
       [0.44444444, 0.33333333, 0.33333333, ..., 0.22222222, 0.11111111,
        0.        ],
       [0.22222222, 0.        , 0.        , ..., 0.22222222, 0.        ,
        0.        ],
       ...,
       [0.44444444, 1.        , 1.        , ..., 0.77777778, 1.        ,
        0.11111111],
       [0.33333333, 0.77777778, 0.55555556, ..., 1.        , 0.55555556,
        0.        ],
       [0.33333333, 0.77777778, 0.77777778, ..., 1.        , 0.33333333,
        0.        ]])

In [8]:
feature_variance = vt.variances_

In [9]:
assert_almost_equal(feature_variance[0], 0.09808697, msg='Feature variances are not correct')
assert_almost_equal(feature_variance[2], 0.11010541, msg='Feature variances are not correct')
print('Feature Variances:\n')
for feature, var in sorted(zip(data.columns, feature_variance), key=lambda x: x[1], reverse=True):
    print(f'{feature:>21} = {var:5.3f}')

Feature Variances:

          bare nuclei = 0.164
 uniformity cell size = 0.116
      normal nucleoli = 0.115
uniformity cell shape = 0.110
    marginal adhesion = 0.101
      clump thickness = 0.098
      bland chromatin = 0.074
 epithelial cell size = 0.061
              mitoses = 0.037


---
# Problem 2: Get Feature Ranking by Recursive Feature Elimination

Perform RFE on a Random Forest Classifier and retrieve feature rankings.

For this problem you will use **data** and __label__ created above.

To solve this problem do the following:
- Create a `RandomForestClassifier` estimator. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Create a Recursive Feature Elimination selector `RFE` using the Random Forest Classifier created in step 1 as the `estimator`, and set `n_features_to_select` to 1. Accept default values for other arguments.
- Fit the `RFE` selector using **data** and __label__.
- Retrieve feature rankings from the `RFE` selector's `ranking_` attribute and assign it to variable **feature_ranking**.

After this problem, there's a new variable **feature_ranking** defined.
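
As a minimal sketch of the same workflow, the example below runs `RFE` on synthetic data from `make_classification` instead of the breast cancer dataset. The sample and feature counts, the smaller `n_estimators`, and the variable names are assumptions made to keep the example quick. Only the `RFE` selector is fitted; it works on its own clone of the estimator, so fitting the bare classifier first is not needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 200 samples, 6 features.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# A small forest keeps the example fast; the assignment uses n_estimators=100.
forest = RandomForestClassifier(n_estimators=20, random_state=23)

# Eliminate one feature per round until a single feature remains.
selector = RFE(estimator=forest, n_features_to_select=1)
selector.fit(X, y)

# ranking_[i] is the rank of feature i; rank 1 marks the feature kept to the end.
ranking = selector.ranking_
print(ranking)
```

With `n_features_to_select=1` and the default `step` of 1, every feature receives a distinct rank, so `ranking_` is a permutation of 1 through the number of features.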

---

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# YOUR CODE HERE
rfe = RandomForestClassifier(n_estimators=100, random_state=23)

In [11]:
recursive = RFE(estimator=rfe, n_features_to_select=1)

In [12]:
rfe.fit(data,label)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=23, verbose=0,
                       warm_start=False)

In [13]:
recursive.fit_transform(data,label)

array([[ 1],
       [ 4],
       [ 1],
       ...])

In [14]:
feature_ranking = recursive.ranking_

In [15]:
assert_equal(feature_ranking.tolist(), [6, 2, 1, 8, 5, 3, 4, 7, 9])
# Display feature ranking
print('Feature Ranking:')
for var, name in sorted(zip(feature_ranking, data.columns), key=lambda x: x[0]):
    print(f'{name:>21} = {var}')

Feature Ranking:
uniformity cell shape = 1
 uniformity cell size = 2
          bare nuclei = 3
      bland chromatin = 4
 epithelial cell size = 5
      clump thickness = 6
      normal nucleoli = 7
    marginal adhesion = 8
              mitoses = 9


---

# Problem 3: Get Feature Importance from Random Forest Classifier

Get feature importance from a trained Random Forest Classifier.

For this problem you will use **data** and __label__ created above.

To solve this problem do the following:
- Create a `RandomForestClassifier` estimator. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Fit the `RandomForestClassifier` estimator using **data** and __label__.
- Retrieve feature importances from the estimator's `feature_importances_` attribute and assign it to variable **feature_importance**.

After this problem, there will be a new variable **feature_importance** defined.
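
A short sketch of reading importances from a fitted forest, again on synthetic data rather than the assignment's dataset. The feature names `feature_0` to `feature_4` and the data shape are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 200 samples, 5 features.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=20, random_state=23)
forest.fit(X, y)

# One impurity-based importance per feature, normalized to sum to 1.
importances = forest.feature_importances_

# Print features from most to least important.
for name, value in sorted(zip(names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name:>10}: {100.0 * value:5.2f}%")
```

Since the importances are normalized, each value can be read as that feature's share of the total, which is how the percentages in the output below are produced.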

-----

In [16]:
from sklearn.ensemble import RandomForestClassifier

# YOUR CODE HERE
rfc = RandomForestClassifier(n_estimators=100, random_state=23)

In [17]:
rfc.fit(data,label)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=23, verbose=0,
                       warm_start=False)

In [18]:
feature_importance = rfc.feature_importances_

In [19]:
assert_almost_equal(feature_importance[0], 0.04282269, msg="Feature importances are not correct")
assert_almost_equal(feature_importance[-1], 0.00794928, msg="Feature importances are not correct")
print("Feature Importance:\n")
for val, name in sorted(zip(feature_importance, data.columns), key=lambda x: x[0], reverse=True):
    print(f'{name:>21}: {100.0*val:05.2f}%')

Feature Importance:

 uniformity cell size: 22.64%
          bare nuclei: 20.62%
uniformity cell shape: 18.03%
 epithelial cell size: 16.84%
      bland chromatin: 07.73%
      normal nucleoli: 06.47%
      clump thickness: 04.28%
    marginal adhesion: 02.58%
              mitoses: 00.79%


---

# Problem 4: Get the Cross Validation Scores

Get the cross-validation scores for a Random Forest Classifier.

For this problem you will use **data** and __label__ created above.

To solve this problem do the following:
- Create a `RandomForestClassifier` estimator. Set `n_estimators` to 100, `random_state` to 23 and accept default values for all other hyperparameters.
- Create a `StratifiedKFold` iterator. Set `n_splits` to 5 and `random_state` to 23.
- Calculate cross-validation scores using the `cross_val_score` function with the random forest classifier, data, label and the `StratifiedKFold` iterator. Assign the scores to the variable **cv_scores**.

After this problem, there's a new variable **cv_scores** defined.
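
The sketch below shows the same pattern on synthetic data. One deliberate difference is an assumption on my part: it passes `shuffle=True`, because newer scikit-learn releases appear to reject a `random_state` on `StratifiedKFold` when shuffling is off. Shuffling changes which samples land in each fold, so this variant would not reproduce the scores the autograder expects for this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 6 features.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=20, random_state=23)

# shuffle=True is an addition for compatibility with newer scikit-learn
# versions; the assignment itself leaves shuffle at its default.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=23)

# One accuracy score per fold; each fold keeps the class proportions of y.
scores = cross_val_score(forest, X, y, cv=folds)
print(scores)
print(f"Average: {scores.mean() * 100:4.1f}%")
```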

-----

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# YOUR CODE HERE
rfc = RandomForestClassifier(n_estimators=100, random_state=23)

In [21]:
kfold = StratifiedKFold(n_splits=5, random_state=23)

In [22]:
cv_scores = cross_val_score(estimator=rfc, X=data, y=label, cv=kfold)
cv_scores

array([0.93430657, 0.94890511, 0.98540146, 0.97810219, 0.98518519])

In [23]:
assert_almost_equal(cv_scores[0], 0.93430657, msg='Cross validation scores are not correct')
assert_almost_equal(cv_scores[2], 0.98540146, msg='Cross validation scores are not correct')
print(f"Average Cross Validation Score: {np.mean(cv_scores)*100:4.1f}%")

Average Cross Validation Score: 96.6%
