# Module 5 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Write your answer only where it says `# YOUR CODE HERE`; anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel → Restart & Run All to restart the kernel and run all cells.
3. Do not change the title (i.e., the file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you use any code from another source (such as a course notebook or a website), you need to properly cite the source.

-----

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from nose.tools import assert_is_instance, assert_equal, assert_almost_equal
import helpera5 as hp

-----

## Problem 1: Reading in Breast Cancer Dataset

For this problem, you will complete the `handle_data` function below to load the data into the assignment notebook and then create the features and labels used in later classification tasks. For testing purposes, this function will be used to load the breast cancer data set, whose labels are given by the `class` column. Specifically, you must complete the following tasks:
- Read in the dataset specified in the `file_name` parameter by using any method that you have learned in this course. The simplest (and recommended) approach is to use the Pandas module.
- Create a NumPy array that only contains the labels (e.g., the `class` column).
- Create a multidimensional NumPy array that contains all features except the label feature (e.g., all columns but the `class` column).
- Return the feature and label arrays in order.
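
The feature/label separation described above can be sketched on a tiny in-memory CSV. This is only an illustration: the three rows, the reduced column set, and the `sample_*` names are hypothetical and are not taken from the real data file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample; the real file has more rows and columns.
sample_csv = io.StringIO(
    "id,clump thickness,mitoses,class\n"
    "1,5,1,2\n"
    "2,5,1,2\n"
    "3,8,1,4\n"
)

sample_df = pd.read_csv(sample_csv)

# Features: every column except the label column, as a 2-D NumPy array.
sample_features = sample_df.drop('class', axis=1).values
# Labels: only the label column, as a 1-D NumPy array.
sample_labels = sample_df['class'].values

print(sample_features.shape, sample_labels.shape)  # (3, 3) (3,)
```

Note that dropping only the label column keeps every other column, including any identifier column, in the feature array.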

-----

In [2]:
def handle_data(file_name):
    '''
    Load data from indicated file, and create and return feature and label arrays.
    
    Inputs
    ---------
    file_name: path to data file

    Returns
    -------
    feature: A NumPy array containing the features
    label: A NumPy array containing the labels
    '''
    ### YOUR CODE HERE
    
    df = pd.read_csv(file_name)
    
    X = df.drop('class', axis=1).values
    y = df['class'].values
    
    return X,y

In [3]:
df = pd.read_csv('breast-cancer-wisconsin.data')
df.head()

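| | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |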


In [4]:
X = df.drop('class', axis=1).values
y = df['class'].values

In [5]:
X.shape, y.shape

((683, 10), (683,))

In [6]:
# Handle data
X, y = handle_data('breast-cancer-wisconsin.data')

# Checking type
assert_equal(type(X), np.ndarray, "Your features should be of type NumPy array")
assert_equal(type(y), np.ndarray, "Your labels should be of type NumPy array")

# Checking shape
assert_equal(X.shape[0], 683, "You should have 683 rows not %s"%X.shape[0])
assert_equal(X.shape[1], 10, "You should have 10 columns not %s"%X.shape[1])
assert_equal(y.shape[0], 683, "You should have 683 labels not %s"%y.shape[0])

# Checking actual data
#assert_equal(X.tolist(), hp.X, msg="")
#assert_equal(y.tolist(), hp.y, msg="")

-----

## Problem 2: Preparing Data

For this problem, you will extract feature and label arrays from their parent DataFrame, and then pre-process them by scaling the features and encoding the labels. Specifically, you must complete the following tasks:
- Make a new Pandas DataFrame for the features, which contains all columns in the parent DataFrame, except for the label column, whose name is provided by the `label_name` parameter.
- Make a Pandas Series that contains the label from the parent DataFrame, which is indicated by the column specified in the `label_name` parameter.
- Transform the features DataFrame by using the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) object from the scikit-learn library (use the default parameters).
- Transform the label Series by using the [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) object from the scikit-learn library (use the default parameters).
- Return the standardized features and encoded labels, in this order.
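
As a rough illustration of what the two transformers do, the sketch below applies them to a toy DataFrame. The column names and values are hypothetical and chosen only to make the effect easy to see: with default parameters, `StandardScaler` shifts each column to zero mean and unit variance, and `LabelEncoder` maps the sorted label values to the integers `0, 1, ...`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy frame; names and values are illustrative only.
toy = pd.DataFrame({
    'thickness': [1.0, 3.0, 5.0, 7.0],
    'mitoses': [1.0, 1.0, 2.0, 4.0],
    'class': [2, 4, 2, 4],
})

toy_features = toy.drop('class', axis=1)
toy_labels = toy['class']

# Scale each feature column independently.
toy_scaled = StandardScaler().fit_transform(toy_features)

# Encode the labels as consecutive integers.
encoder = LabelEncoder()
toy_encoded = encoder.fit_transform(toy_labels)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
print(encoder.classes_)         # [2 4]
print(toy_encoded)              # [0 1 0 1]
```

Both `fit_transform` calls return NumPy arrays, even though the inputs are a DataFrame and a Series.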

-----

In [7]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

def prep_data(df, label_name):
    """
    Prepares the features and labels contained in the DataFrame, df.
    
    Parameters
    ----------
    df: A DataFrame containing the data to process
    label_name: string containing the name of the label column in the DataFrame, df
    
    Returns
    -------
    xx, yy:  NumPy arrays containing the scaled features and encoded labels, respectively
    """

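    ### YOUR CODE HERE

    # Separate the feature columns from the label column
    X = df.drop(label_name, axis=1)
    y = df[label_name]

    # Create the scaler and encoder with default parameters
    scaler = StandardScaler()
    le = LabelEncoder()

    # Standardize the features and encode the labels
    xx = scaler.fit_transform(X)
    yy = le.fit_transform(y)

    return xx, yy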

In [8]:
data = pd.read_csv('./breast-cancer-wisconsin.data', encoding="utf-8")
data.head()

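| | id | clump thickness | uniformity cell size | uniformity cell shape | marginal adhesion | epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |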


In [9]:
X = data.drop('class', axis=1)
y = data['class']

In [10]:
X.shape, y.shape

((683, 10), (683,))

In [11]:
scaler = StandardScaler()

xx = scaler.fit_transform(X)

xx

array([[-0.12366418,  0.19790469, -0.70221201, ..., -0.18182716,
        -0.61292736, -0.34839971],
       [-0.11895594,  0.19790469,  0.27725185, ..., -0.18182716,
        -0.28510482, -0.34839971],
       [-0.09883306, -0.51164337, -0.70221201, ..., -0.18182716,
        -0.61292736, -0.34839971],
       ...,
       [-0.30297227,  0.19790469,  2.23617957, ...,  1.86073779,
         2.33747554,  0.22916583],
       [-0.2890233 , -0.15686934,  1.58320366, ...,  2.67776377,
         1.02618536, -0.34839971],
       [-0.2890233 , -0.15686934,  1.58320366, ...,  2.67776377,
         0.37054027, -0.34839971]])

In [12]:
le = LabelEncoder()

yy = le.fit_transform(y)

yy

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,

In [13]:
type(xx), type(yy)

(numpy.ndarray, numpy.ndarray)

In [14]:
# Load Breast Cancer data set
df = pd.read_csv('./breast-cancer-wisconsin.data')

# Process data
X, y = prep_data(df, label_name='class')

# Test Processing function
assert_equal(X.tolist(), hp.xx)
assert_equal(y.tolist(), hp.yy)

In [15]:
print(X, y)

[[-0.12366418  0.19790469 -0.70221201 ... -0.18182716 -0.61292736
  -0.34839971]
 [-0.11895594  0.19790469  0.27725185 ... -0.18182716 -0.28510482
  -0.34839971]
 [-0.09883306 -0.51164337 -0.70221201 ... -0.18182716 -0.61292736
  -0.34839971]
 ...
 [-0.30297227  0.19790469  2.23617957 ...  1.86073779  2.33747554
   0.22916583]
 [-0.2890233  -0.15686934  1.58320366 ...  2.67776377  1.02618536
  -0.34839971]
 [-0.2890233  -0.15686934  1.58320366 ...  2.67776377  0.37054027
  -0.34839971]] [0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 0 1 0 1 1 0
 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1
 1 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1
 1 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 0 1
 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1
 0 1 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0

-----

Now we prepare the data for classification by creating training and testing features and labels.
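
To get a feel for the split sizes, here is a small sketch that uses placeholder arrays with the same shape as the processed data (683 rows, 10 feature columns). The `demo_*` names and the array contents are assumptions made for illustration; only the shapes and the `train_test_split` arguments mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's features and labels.
demo_X = np.arange(683 * 10).reshape(683, 10)
demo_y = np.arange(683) % 2

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)

# A 25% test fraction of 683 rows is rounded up to 171 test rows.
print(demo_X_train.shape, demo_X_test.shape)  # (512, 10) (171, 10)

# Fixing random_state makes the shuffle, and hence the split, repeatable.
repeat_split = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)
print(np.array_equal(repeat_split[1], demo_X_test))  # True
```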

-----

In [16]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [17]:
X_train[:5], X_test[:5]

(array([[-1.50564978,  0.90745276,  0.93022775,  2.2718962 ,  0.75803177,
         -0.10545357,  1.77286724,  2.26925078,  2.33747554,  0.22916583],
        [ 0.07068609,  1.26222679, -0.0492361 ,  1.6021918 ,  0.05933312,
          0.34470136,  1.49823165,  1.86073779,  2.00965299,  3.69455903],
        [ 0.28671521,  1.26222679,  2.23617957,  2.2718962 ,  2.5047784 ,
          1.24501121,  1.77286724,  2.67776377,  2.33747554, -0.34839971],
        [-0.6267233 , -1.22119144, -0.70221201, -0.74177362, -0.63936553,
         -0.5556085 , -0.69885309, -0.99885314, -0.61292736, -0.34839971],
        [-1.04503897, -1.22119144, -0.70221201, -0.07206921, -0.63936553,
         -1.00576342, -0.69885309, -0.59034015, -0.61292736, -0.34839971]]),
 array([[ 0.15580201, -1.22119144, -0.70221201, -0.74177362, -0.63936553,
         -0.5556085 ,  0.39968928, -0.99885314, -0.61292736, -0.34839971],
        [ 0.12785894, -0.51164337, -0.70221201, -0.74177362, -0.63936553,
         -0.5556085 , -0.69885

-----

## Problem 3: Gaussian Process Classification

This problem requires that you complete the `gc` function below to perform Gaussian Process classification on the supplied training data, and return the score for this classifier when applied to the test data. Specifically, you must complete the following tasks:
- Create a [`GaussianProcessClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html) using the default hyperparameters, except for the `random_state` parameter, which should be set to the `rs` parameter supplied to the `gc` function.
- Fit your `GaussianProcessClassifier` estimator to the training features and labels.
- Compute the mean accuracy of this estimator on the test features and labels.
- Return the mean accuracy.
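
The mean accuracy can be obtained in two equivalent ways, sketched below on synthetic data. The data set from `make_classification` and the `syn_*` names are stand-ins introduced for this example, not part of the assignment: the classifier's own `score` method returns the mean accuracy, which is the same number `accuracy_score` gives when applied to the predicted labels.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the breast cancer data set.
syn_X, syn_y = make_classification(n_samples=120, n_features=5, random_state=0)
syn_X_train, syn_X_test, syn_y_train, syn_y_test = train_test_split(
    syn_X, syn_y, test_size=0.25, random_state=0
)

# Default hyperparameters except for random_state.
clf = GaussianProcessClassifier(random_state=0)
clf.fit(syn_X_train, syn_y_train)

# Option 1: the estimator's built-in mean accuracy.
score_from_method = clf.score(syn_X_test, syn_y_test)

# Option 2: predict first, then compare against the true labels.
score_from_metric = accuracy_score(syn_y_test, clf.predict(syn_X_test))

print(score_from_method == score_from_metric)
```

Either approach satisfies the task; the predict-then-score form is useful when the predicted labels are needed for other metrics as well.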

-----

In [18]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score

def gc(X_train, X_test, y_train, y_test, rs=0):
    '''
    Creates and applies a Gaussian Process Classifier to training data and
    returns the mean accuracy score for this estimator on testing data.
    
    Parameters
    ----------
    X_train: NumPy multi-dimensional array containing training features
    X_test: NumPy multi-dimensional array containing testing features
    y_train: NumPy array containing training labels
    y_test: NumPy array containing testing labels
    rs: random_state parameter for GPC estimator
    
    Returns
    -------
    floating point value containing the mean accuracy
    '''
    
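    ### YOUR CODE HERE

    # Create the classifier with default hyperparameters and the given seed
    gpc = GaussianProcessClassifier(random_state=rs)

    # Fit the classifier to the training data
    gpc.fit(X_train, y_train)

    # Predict the test labels and compute the mean accuracy
    predicted_labels = gpc.predict(X_test)
    score = accuracy_score(y_test, predicted_labels)

    return score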

In [19]:
# Compute Mean Accuracy
score = gc(X_train, X_test, y_train, y_test, rs=0)

# Test Mean Accuracy
assert_almost_equal(score, 0.95906, places=2)

In [20]:
score

0.9590643274853801

**© 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 