In [4]:
# Imports
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder




In [6]:
# Reading in the Wisconsin breast cancer dataset (the file has no header row)
df = pd.read_csv('wdbc.csv', header=None)

In [37]:
df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


In [9]:
# Preparing Data for the pipeline
X = df.loc[:,2:].values
y = df.loc[:,1].values
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
# LabelEncoder sorts the class labels, so 'B' (benign) maps to 0 and 'M' (malignant) maps to 1
le.classes_

array(['B', 'M'], dtype=object)

In [11]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)

# Dimensionality Reduction | PCA
PCA (principal component analysis) is an unsupervised linear transformation that helps us identify patterns in data based on the correlation between features.

PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with the same number of dimensions as the original one or fewer.

The orthogonal axes of the new subspace are the principal components. Each one can be interpreted as the direction of maximum remaining variance, given the constraint that the new feature axes are orthogonal to each other.
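The two properties described above (orthogonal axes, ordered by variance) can be checked directly on a fitted `PCA` object. The sketch below is only an illustration: it uses small synthetic data rather than the notebook's dataset, and the names `X_demo`, `pca_demo` and `gram` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 200 samples, 5 correlated features
# generated from 2 hidden factors plus a little noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

pca_demo = PCA(n_components=3)
X_proj = pca_demo.fit_transform(X_demo)  # shape: (200, 3)

# Each row of components_ is one principal axis. If the axes are orthonormal,
# their Gram matrix is the identity.
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(3)))

# The share of variance captured by each component, largest first.
print(pca_demo.explained_variance_ratio_)
```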

## Using PCA for dimensionality reduction with 10 PCs (try other numbers of PCs and compare the accuracies you get)

# Pipeline

The Pipeline object takes a list of tuples as input, where the first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline, and the second element in every tuple is a scikit-learn transformer or estimator.

The intermediate steps in a pipeline must be scikit-learn transformers, and the last step is an estimator.

![Pipeline](working_pipeline.png)
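As a small sketch of what the identifier strings are for: a fitted step can be looked up through `named_steps`, and a step's parameters can be changed with the `<identifier>__<parameter>` naming convention. The data here is a synthetic stand-in produced by `make_classification` (an assumption for illustration; the notebook itself uses `X_train` and `y_train`), and `demo_pipe` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 30 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=10, random_state=0)

demo_pipe = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=10)),
                      ('clf', LogisticRegression(random_state=1))])
demo_pipe.fit(X_demo, y_demo)

# Look up a fitted step by its identifier string.
n_before = demo_pipe.named_steps['pca'].n_components_
print(n_before)

# Change a step's parameter via <identifier>__<parameter>, then refit.
demo_pipe.set_params(pca__n_components=5)
demo_pipe.fit(X_demo, y_demo)
n_after = demo_pipe.named_steps['pca'].n_components_
print(n_after)
```

The same `pca__n_components` naming is what makes it convenient to try different numbers of principal components without rebuilding the pipeline.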

In [28]:
# Logistic Regression Pipeline

lr_pipe = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=10)),
                    ('clf', LogisticRegression(random_state=1))])

lr_pipe.fit(X_train, y_train)

print('Test Accuracy: %.3f' % lr_pipe.score(X_test, y_test))

Test Accuracy: 0.974


In [29]:
# SVM pipeline
svm_pipe = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=10)),
                     ('clf', SVC())])

svm_pipe.fit(X_train, y_train)

print('Test Accuracy: %.3f' % svm_pipe.score(X_test, y_test))

Test Accuracy: 0.956


# Cross Validation

In [42]:
# Performing k-fold cross-validation (k=25) to estimate model performance on unseen data.
lr_scores = cross_val_score(estimator=lr_pipe, X=X_train, y=y_train, cv=25, n_jobs=-1)
print('Logistic Regression Mean score: {}'.format(lr_scores.mean()))

Logistic Regression Mean score: 0.9799862401100792


In [41]:
svm_scores = cross_val_score(estimator=svm_pipe, X=X_train, y=y_train, cv=25, n_jobs=-1)
print('SVM Mean score: {}'.format(svm_scores.mean()))

SVM Mean score: 0.96640522875817


# Comments | Conclusion | Observations

## Dimensionality Reduction
    The original dataset has 30 features; the first 10 principal components retain most of the information (variance).
    We have reduced the problem from 30 dimensions to 10.
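One way to check how much of the variance a given number of principal components retains is to look at the cumulative explained variance ratio. The sketch below assumes that scikit-learn's built-in `load_breast_cancer` data matches the features in `wdbc.csv`, and it fits on all samples rather than on the training split only, so treat the numbers it prints as a rough guide.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the built-in copy of the dataset has the same 30 features as wdbc.csv.
X_bc = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X_bc)

# n_components=None keeps every component, so the ratios sum to 1.
pca_full = PCA(n_components=None).fit(X_std)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

for k in (2, 5, 10, 30):
    print('%2d PCs: %.3f of total variance' % (k, cum_var[k - 1]))
```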
    
## Pipeline

    The pipeline saved us several steps and made the process simpler.

    Without it, we would call fit_transform on the data for standardization, pass the
    result to PCA and call fit_transform again, and then fit the estimator on the PCA output.

    Instead, we instantiated a pipeline with the necessary transformers and an estimator, and it handles all of these steps.

    The fitted pipeline acts as a trained estimator just like any other, such as logistic regression
    or an SVM, but it also performs scaling and dimensionality reduction internally.
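The manual steps described above can be written out and compared with the pipeline. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the notebook's dataset); names such as `X_tr`, `manual_pred` and `pipe_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 30 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=3)

# Manual version: fit each transformer on the training data only,
# then reuse the fitted transformers on the test data.
scaler = StandardScaler()
pca = PCA(n_components=10)
clf = LogisticRegression(random_state=1)

X_tr_pca = pca.fit_transform(scaler.fit_transform(X_tr))
clf.fit(X_tr_pca, y_tr)
manual_pred = clf.predict(pca.transform(scaler.transform(X_te)))

# Pipeline version: the same three steps behind a single fit/predict.
pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=10)),
                 ('clf', LogisticRegression(random_state=1))])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Both routes should produce identical predictions.
print(np.array_equal(manual_pred, pipe_pred))
```

Note that the manual version only calls `transform` (not `fit_transform`) on the test data; the pipeline takes care of that distinction automatically.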

## Cross Validation

    Cross-validation gives a more reliable estimate of the model's generalization error,
    that is, how well the model performs on unseen data.

    In k-fold cross-validation, we split the training dataset into k folds
    without replacement, where k-1 folds are used for model training and one fold
    is used for testing. This procedure is repeated k times so that we obtain k models
    and k performance estimates.

    We then average the performance over the different folds to obtain an estimate
    that is less sensitive to how the training data happened to be partitioned.
    Here, with k=25, logistic regression reached a mean score of 0.980 and the SVM 0.966.
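The procedure described above can be spelled out as an explicit loop. The sketch below uses `StratifiedKFold`, which keeps the class proportions roughly equal in every fold, and compares the result with `cross_val_score` given the same splitter. The data is synthetic (an assumption for illustration), 5 folds are used instead of 25 to keep it quick, and the pipeline omits PCA for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 10 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(random_state=1))])

kfold = StratifiedKFold(n_splits=5)
scores = []
for train_idx, val_idx in kfold.split(X_demo, y_demo):
    model = clone(pipe)  # fresh, unfitted copy for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on k-1 folds
    scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))  # test on the held-out fold

print('Manual CV mean: %.3f' % np.mean(scores))

# cross_val_score with the same splitter runs the same procedure.
auto_scores = cross_val_score(estimator=pipe, X=X_demo, y=y_demo, cv=kfold)
print('cross_val_score mean: %.3f' % auto_scores.mean())
```

Because the whole pipeline is refitted inside each fold, the scaler only ever sees that fold's training portion, which avoids leaking information from the held-out fold.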