# Module 3 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and restart the kernel and run all cells (Restart & Run all).
3. Do not change the title (i.e. file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you do use any code from another source (such as a course notebook or a website), you need to properly cite the source.

-----

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

-----

## Loading Breast Cancer Data

In this assignment, we will work with a breast cancer data set to build predictive models. Before we build a model, we first load the data into the assignment notebook and randomly sample several rows. The second Code cell splits the DataFrame into training and testing data sets, and then creates the features and labels to use for our classification task.
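To make the split concrete, here is a small sketch of how the row counts work out for `test_size=0.25`. It assumes the rounding rule that scikit-learn's `train_test_split` applies (the test set is rounded up, and the remainder goes to training). The helper name `split_sizes` is hypothetical and used only for illustration.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size."""
    # The test set gets ceil(test_size * n_samples) rows; the rest go to training.
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# The cleaned data set used in this notebook has 683 rows.
print(split_sizes(683))  # (512, 171)
```

This matches the 512 training rows reported by the shape check in the cells below.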

-----

In [2]:
# Load data
df = pd.read_csv('breast-cancer-wisconsin.data')
df.sample(5)

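|  | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 244 | 1017023 | 6 | 3 | 3 | 5 | 3 | 10 | 3 | 5 | 3 | 2 |
| 295 | 666942 | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 390 | 1223543 | 1 | 2 | 1 | 3 | 2 | 1 | 1 | 2 | 1 | 2 |
| 624 | 1285722 | 4 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 2 |
| 367 | 846423 | 10 | 6 | 3 | 6 | 4 | 10 | 7 | 8 | 4 | 4 |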


In [3]:
df.shape

(683, 11)

In [4]:
X = df.iloc[:,0:10]
y = df.iloc[:,10]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.25, random_state=0)

In [6]:
X_train.shape, y_train.shape

((512, 10), (512,))

In [None]:
# Load data
df = pd.read_csv('./breast-cancer-wisconsin.data')
df.sample(5)

In [None]:
# Split data into training and testing DataFrames
train, test = train_test_split(df, test_size=0.25, random_state=0)

# Create training features and label
y = train['class']
X = train.drop('class', axis=1)

# Create testing features and label
yTest = test['class']
XTest = test.drop('class', axis=1)

-----

## Problem 1: Creating a Random Forest Classifier

For this problem, you will complete the `classify` function, provided below, to complete the following tasks:
- Create a random forest classifier by using the `RandomForestClassifier` estimator in the scikit-learn library.
- When creating the `RandomForestClassifier` estimator, set the `n_estimators` hyperparameter to the value of the `ne` parameter, set the `random_state` hyperparameter to the value of the `rs` parameter, and leave the other hyperparameters at their default values.
- Fit this `RandomForestClassifier` estimator to the X (which are the features) and y (which are the labels) DataFrames.
- Return the RandomForestClassifier model.
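
The steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions, not the graded solution: the function name `fit_forest` and the `make_classification` data are hypothetical stand-ins for `classify` and the breast cancer features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

def fit_forest(features, labels, rs=0, ne=10):
    """Create a random forest with ne trees and seed rs, then fit it."""
    model = RandomForestClassifier(n_estimators=ne, random_state=rs)
    model.fit(features, labels)
    return model

forest = fit_forest(X_tr, y_tr, rs=0, ne=5)

# score() reports the mean accuracy on the held-out rows.
accuracy = forest.score(X_te, y_te)
print(len(forest.estimators_), accuracy)
```

Fixing `random_state` is what makes the scores reproducible from run to run, which is why the checks in this problem can compare against specific accuracy values.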

-----

In [7]:
from sklearn.ensemble import RandomForestClassifier
#from sklearn.metrics import 

In [8]:
rfcmodel = RandomForestClassifier(n_estimators=10, random_state=0)

In [9]:
rfcmodel.fit(X_train,y_train)

RandomForestClassifier(n_estimators=10, random_state=0)

In [10]:
rfcmodel.score(X_test,y_test)

0.9415204678362573

In [None]:
from sklearn.ensemble import RandomForestClassifier

def classify(X, y, rs=0, ne=10):
    '''
    Create and fit a RandomForestClassifier estimator to training data.
    
    Parameters
    ---------
    X : Pandas DataFrame
    y : Pandas DataFrame
    rs: seed for random number generator
    ne: int for number of estimators
    
    Returns
    -------
    The RandomForestClassifier estimator
    '''
    
    ### YOUR CODE HERE

In [None]:
# Classify data by using different numbers of trees in the forest
rfc1 = classify(X, y, rs=0, ne=1)
rfc5 = classify(X, y, rs=0, ne=5)
rfc10 = classify(X, y, rs=0, ne=10)
rfc20 = classify(X, y, rs=0, ne=20)

# Check solutions
assert_almost_equal(rfc1.score(XTest,yTest), 0.9356, places=2)
assert_almost_equal(rfc5.score(XTest,yTest), 0.9532, places=2)
assert_almost_equal(rfc10.score(XTest,yTest), 0.9415, places=2)
assert_almost_equal(rfc20.score(XTest,yTest), 0.9473, places=2)

-----

We now load a raw version of the breast cancer data set, which we will use in the remaining two problems. The first Code cell below loads these data and samples several instances. The second Code cell generates features and labels.

-----

In [11]:
df2 = pd.read_csv('breast-cancer-wisconsin-not-cleaned.data')
df2.sample(5)

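|  | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 1183246 | 1 | 1 | 1 | 1 | 1 |  | 2 | 1 | 1 | 2 |
| 419 | 1253505 | 2 | 3 | 1 | 1 | 5 | 1.0 | 1 | 1 | 1 | 2 |
| 487 | 1073960 | 10 | 10 | 10 | 10 | 6 | 10.0 | 8 | 1 | 5 | 4 |
| 635 | 1260659 | 3 | 1 | 4 | 1 | 2 | 1.0 | 1 | 1 | 1 | 2 |
| 495 | 1170945 | 3 | 1 | 1 | 1 | 1 | 1.0 | 2 | 1 | 1 | 2 |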


In [12]:
X2 = df2.iloc[:,0:10]
y2 = df2.iloc[:,10]

In [13]:
X2.shape, y2.shape

((699, 10), (699,))

In [None]:
# Load uncleaned data
df = pd.read_csv('./breast-cancer-wisconsin-not-cleaned.data')
df.sample(5)

In [None]:
# Create labels (y) and features (X)
y = df['class']
X = df.drop('class', axis=1)

-----

## Problem 2: Creating a Data Pre-Processing Pipeline

Previously, we have often cleaned data, such as the breast cancer dataset, by removing rows that contained a _NaN_. In this problem, however, we will instead create a pipeline that replaces, or imputes, missing values (i.e., _NaN_) with the mean value of the appropriate column. After this, the pipeline will apply a standard scaling to the columns. Thus, to complete this problem, you must explicitly:
- Create an `Imputer` by using the scikit-learn library. When creating this object, set the `missing_values` argument to `'NaN'`, the `strategy` argument to `'mean'`, the `axis` argument to `0`, and the `copy` argument to `False`.
- Create a `StandardScaler` by using the scikit-learn library, leaving the parameters at their default values.
- Create a pipeline. **Important:** The first item in the pipeline must be the `Imputer`, named `'imp'`, and the second item must be the `StandardScaler`, named `'ss'`.
- Apply the pipeline to fit and transform the features (X) and the labels (y).
- Return the preprocessing pipeline and the transformed features.
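
To show what the two pipeline steps compute, here is a minimal NumPy sketch of mean imputation followed by standard scaling on a tiny made-up array. It is written by hand for illustration rather than with the scikit-learn classes named above, and the array values are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Tiny made-up feature matrix with one missing value (illustration only).
X_raw = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

# Step 1 (imputation): replace each NaN with the mean of its column,
# where the mean is computed over the non-missing entries.
col_means = np.nanmean(X_raw, axis=0)
X_filled = np.where(np.isnan(X_raw), col_means, X_raw)

# Step 2 (standard scaling): subtract the column mean and divide by the
# column standard deviation (population form, ddof=0).
X_scaled = (X_filled - X_filled.mean(axis=0)) / X_filled.std(axis=0)

print(X_filled)
print(X_scaled)
```

Imputing before scaling matters: the scaler's mean and standard deviation are computed on the filled-in columns, so the order of the pipeline steps affects the result. Note that newer scikit-learn releases no longer ship `Imputer`; `sklearn.impute.SimpleImputer` is its replacement there, so the import in the template may need adjusting depending on the installed version.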

-----

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [15]:
X2.isnull().sum()

id                        0
clump thickness           0
uniformity cell size      0
uniformity cell shape     0
marginal adhesion         0
epithelial cell size      0
bare nuclei              16
bland chromatin           0
normal nucleoli           0
mitoses                   0
dtype: int64

In [16]:
X2 = X2.replace(to_replace=np.nan, value=X2["bare nuclei"].mean())

In [17]:
X2.isnull().sum()

id                       0
clump thickness          0
uniformity cell size     0
uniformity cell shape    0
marginal adhesion        0
epithelial cell size     0
bare nuclei              0
bland chromatin          0
normal nucleoli          0
mitoses                  0
dtype: int64

In [18]:
scalar = StandardScaler()

In [19]:
new_X = scalar.fit_transform(X2)

In [20]:
new_X[30][0], new_X[100][0], new_X[123][0], new_X[256][0], new_X[512][0]

(-0.0012472123989550703,
 0.15397593605209656,
 0.16658267824600242,
 0.17951699465179008,
 0.370207348984359)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

def preprocess(X, y):
    '''
    Create a preprocessing pipeline that imputes missing values and 
    scales the features, and apply it to the data.
    
    Parameters
    ---------
    X : Pandas DataFrame
    y : Pandas DataFrame

    Returns
    -------
    The pipeline used to clean the data and the cleaned features as a NumPy array
    '''

    ### YOUR CODE HERE

In [None]:
# Apply pipeline to uncleaned data
data_pipeline, new_X = preprocess(X, y)

# Test pipeline
assert_equal(type(data_pipeline), Pipeline, 
             msg='Please return a pipeline object as the first object')

assert_equal(new_X[30][0], -0.0012472123989550703)
assert_equal(new_X[100][0], 0.15397593605209656)
assert_equal(new_X[123][0], 0.16658267824600242)
assert_equal(new_X[256][0], 0.17951699465179008)
assert_equal(new_X[512][0], 0.37020734898435897)

assert_equal(type(data_pipeline.get_params()['imp']), type(Imputer()), 
             msg='You did not label the imputer "imp" or did not make an Imputer object')

assert_equal(type(data_pipeline.get_params()['steps'][0][1]), Imputer, 
             msg='You need to create an Imputer first')

assert_equal(data_pipeline.get_params()['imp__missing_values'], 'NaN', 
             msg='set missing values to NaN')

assert_equal(data_pipeline.get_params()['imp__strategy'], 'mean', 
             msg='Set strategy to mean')

assert_equal(type(data_pipeline.get_params()['ss']), type(StandardScaler()), 
             msg='You did not create a StandardScaler labeled "ss"')

-----

We now need to split the newly cleaned data into training and testing data sets in order to create and apply a classifier pipeline to these data.

-----

In [21]:
trainX, testX, trainY, testY = train_test_split(new_X, y2, random_state=0)

In [22]:
trainX , trainY

(array([[ 0.4417661 ,  1.27313768,  2.25152563, ...,  1.87236122,
         -0.28411186, -0.34391178],
        [ 0.00554428, -0.50386559, -0.69999505, ..., -0.5900668 ,
         -0.61182504, -0.34391178],
        [ 0.34904314,  1.27313768,  2.25152563, ...,  0.23074254,
          1.68216723,  3.15697661],
        ...,
        [ 0.23883387, -0.14846494, -0.69999505, ..., -1.00047147,
         -0.61182504, -0.34391178],
        [ 0.40578158,  0.20693572, -0.69999505, ..., -0.5900668 ,
         -0.61182504, -0.34391178],
        [-0.98077357, -1.2146669 , -0.69999505, ..., -1.00047147,
         -0.61182504, -0.34391178]]),
 570    4
 34     2
 506    4
 514    4
 567    2
       ..
 359    4
 192    2
 629    2
 559    2
 684    2
 Name: class, Length: 524, dtype: int64)

In [None]:
trainX, testX, trainY, testY = train_test_split(new_X, y, random_state=0)

-----

## Problem 3: Creating a Random Forest Pipeline

For this problem, you will finish the `rfcp` template function provided below by creating a random forest pipeline. This will require that you complete the following tasks:
- Create a random forest classifier by using the scikit-learn `RandomForestClassifier` estimator. Do not change any of the default hyperparameters for this estimator.
- Create a pipeline that includes this random forest classifier, and label the classifier `rfc`.
- Set the random_state of the random forest classifier in the pipeline to the value specified by the `rs` parameter.
- Fit the pipeline to the training data.
- Calculate the mean accuracy of the random forest classifier in the pipeline from the test data.
- Return the pipeline and the mean accuracy score.
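
The steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the graded solution: `forest_pipeline` and the `make_classification` data are hypothetical stand-ins for `rfcp` and the preprocessed breast cancer features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

def forest_pipeline(train_X, train_y, test_X, test_y, rs=0):
    """Build a one-step pipeline, seed its classifier, fit it, and score it."""
    # The step name 'rfc' is what makes 'rfc__random_state' addressable below.
    pipe = Pipeline([('rfc', RandomForestClassifier())])
    pipe.set_params(rfc__random_state=rs)
    pipe.fit(train_X, train_y)
    # Pipeline.score delegates to the final estimator: mean accuracy here.
    return pipe, pipe.score(test_X, test_y)

demo_pipe, demo_score = forest_pipeline(X_tr, y_tr, X_te, y_te, rs=0)
print(demo_pipe.get_params()['rfc__random_state'], demo_score)
```

The `step__parameter` naming is how a pipeline exposes the hyperparameters of its steps, so the seed can be set after the pipeline has been constructed.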

-----

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [24]:
rfcp = RandomForestClassifier()

In [25]:
rfcp.fit(trainX,trainY)

RandomForestClassifier()

In [26]:
predictY = rfcp.predict(testX)

In [27]:
rfcp.score(testX,testY)

0.9714285714285714

In [None]:
def rfcp(trainX, trainY, testX, testY, rs=0):
    '''
    Create and fit a RandomForestClassifier pipeline to training data, 
    and compute the mean accuracy score from the testing data.
    
    Parameters
    ---------
    trainX : NumPy array (features)
    trainY : NumPy array (labels)
    testX : NumPy array (features)
    testY : NumPy array (labels)
    rs: random state

    Returns
    -------
    The pipeline and a float containing the mean accuracy score
    '''

    ### YOUR CODE HERE

In [None]:
# Create and apply the RFC pipeline
ml_pipeline, score = rfcp(trainX, trainY, testX, testY, rs=0)

# Test the pipeline
assert_equal(type(ml_pipeline.get_params()['rfc']), type(RandomForestClassifier()))
assert_almost_equal(score, 0.9555, places=2)

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 