    Andrew Carr

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import scale
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier


import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import time
from math import floor
%matplotlib inline

# Problem 1
Apply PCA to the cancer dataset to reduce the dimension of the feature space to each of 15, 10, and 5.    Are there any features or combinations of features for which PCA is not a suitable method to use?  Explain.  WARNING: remember to center your data (subtract the mean) and also normalize it. 

In [2]:
#load data set
data = load_breast_cancer()
data, targets, names = data.data, data.target, data.feature_names
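# note: np.sqrt(np.var(...)) is the standard deviation, so the value labeled "var" is really the std
print("before mean {} var {}".format(np.mean(data), np.sqrt(np.var(data))))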
data = scale(data)
print("after mean {} var {}".format(np.mean(data), np.sqrt(np.var(data))))


# reduce the feature space to each target dimension (15, 10, 5)
for n in [15,10,5]:
    reducer = PCA(n_components=n)
    reducer.fit(data)
    new_data = reducer.transform(data)
    print("original data shape {}".format(data.shape))
    print("reduced data shape {}".format(new_data.shape))

before mean 61.890712339519624 var 228.29740508276657
after mean -6.118909323768877e-16 var 1.0
original data shape (569, 30)
reduced data shape (569, 15)
original data shape (569, 30)
reduced data shape (569, 10)
original data shape (569, 30)
reduced data shape (569, 5)


The feature names are listed below.

In [3]:
print(", ".join(names))

mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension


Yes. The features fall into three groups: means, errors, and worsts, with the same 10 measurements appearing in each group. Although the three groups share units, they measure fundamentally different things (an average, a standard error, and an extreme value), so PCA may not do well when it combines them into single components.

However, as the comparison with Mitch's homework shows, this has little effect on the final accuracy or on the running time.
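
To put a number on how much each reduction keeps, one option is to look at the cumulative explained variance of a full PCA fit. The sketch below is not part of the original homework; it assumes the same standardized breast cancer data and uses `StandardScaler` in place of `scale`, which performs the same centering and normalization.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize so every feature has zero mean and unit variance.
X = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit a full PCA once and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Report how much variance survives at each target dimension.
for n in (5, 10, 15):
    print("{:2d} components keep {:.1%} of the variance".format(n, cumulative[n - 1]))
```

If the means and worsts groups are strongly correlated with each other, the first few components should already account for most of the variance, which would be consistent with the small accuracy loss seen after reduction.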

# Problem 2
Apply three of your favorite classification methods to the full cancer data set and also to the PCA-reduced data.  Analyze and evaluate the performance (time and accuracy) for each combination.  

In [4]:
for n in [5,10,15]:
    print("n components {}".format(n))
    reducer = PCA(n_components=n)
    reducer.fit(data)
    new_data = reducer.transform(data)
    seed = 3264
    train_x, test_x, train_y, test_y = train_test_split(data, targets, train_size=0.7,random_state=seed)
    r_train_x, r_test_x, r_train_y, r_test_y = train_test_split(new_data, targets, train_size=0.7, random_state=seed)
    #favorite classification 1
    nb_clf = GaussianNB()
    start = time.time()
    nb_clf.fit(train_x,train_y)
    print("\tnaive bayes full data accuracy {}\t\ttime {}".format(nb_clf.score(test_x, test_y), time.time()-start))
    
    nb_clf = GaussianNB()
    start = time.time()
    nb_clf.fit(r_train_x, r_train_y)
    print("\tnaive bayes pca accuracy {}\t\t\ttime {}\n".format(nb_clf.score(r_test_x, r_test_y), time.time()-start))


    
    #favorite classification 2
    log_clf = LogisticRegression()
    start = time.time()
    log_clf.fit(train_x,train_y)
    print("\tlogistic regression full data accuracy {}\ttime {}".format(log_clf.score(test_x, test_y), time.time()-start))
    
    log_clf = LogisticRegression()
    start = time.time()
    log_clf.fit(r_train_x, r_train_y)
    print("\tlogistic regression pca accuracy {}\t\ttime {}\n".format(log_clf.score(r_test_x, r_test_y), time.time()-start))


    #favorite classification 3
    svm_clf = SVC()
    start = time.time()
    svm_clf.fit(train_x,train_y)
    print("\tsvm full data accuracy {}\t\t\ttime {}".format(svm_clf.score(test_x, test_y), time.time()-start))
    
    svm_clf = SVC()
    start = time.time()
    svm_clf.fit(r_train_x, r_train_y)
    print("\tsvm pca accuracy {}\t\t\t\ttime {}\n".format(svm_clf.score(r_test_x, r_test_y), time.time()-start))


n components 5
	naive bayes full data accuracy 0.9590643274853801		time 0.0012028217315673828
	naive bayes pca accuracy 0.9298245614035088			time 0.0010149478912353516

	logistic regression full data accuracy 0.9883040935672515	time 0.0024499893188476562
	logistic regression pca accuracy 0.9649122807017544		time 0.0011322498321533203

	svm full data accuracy 0.9883040935672515			time 0.004132986068725586
	svm pca accuracy 0.9473684210526315				time 0.0038750171661376953

n components 10
	naive bayes full data accuracy 0.9590643274853801		time 0.0009217262268066406
	naive bayes pca accuracy 0.9181286549707602			time 0.0010197162628173828

	logistic regression full data accuracy 0.9883040935672515	time 0.0022199153900146484
	logistic regression pca accuracy 0.9824561403508771		time 0.0014829635620117188

	svm full data accuracy 0.9883040935672515			time 0.0035812854766845703
	svm pca accuracy 0.9590643274853801				time 0.0033979415893554688

n components 15
	naive bayes full data accurac

Timing differences for Naive Bayes are negligible: in the output shown, the full data is slightly faster at 10 components and slightly slower at 5. Logistic regression, however, trains noticeably faster on the PCA data (roughly 1.1 to 1.5 ms versus 2.2 to 2.4 ms), at a small cost in accuracy. The train/test split affects test accuracy more than how the features are grouped, as the comparison with Mitchell Probst's homework shows. He split his data into the Mean, Error, and Worst groups and ran PCA on each group separately. Using the same seed, and trying several different seeds, our accuracies went back and forth, with my approach winning on some seeds and his on others. This suggests that, although some of the features may not be especially well suited to PCA, applying PCA to all of them together works fine (as we see here) without separating the groups as Mitch did.
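
Because a single train/test split moves the accuracy around this much, a cross-validated comparison gives a steadier picture. The following is a minimal sketch rather than the method used above: it assumes 5-fold cross-validation, logistic regression with `max_iter=1000`, and 10 PCA components, all of which are illustrative choices. Scaling and PCA sit inside the pipeline so that they are refit on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Full-data model: standardize, then classify.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# PCA model: standardize, reduce to 10 components, then classify.
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000)
)

# One accuracy per fold; the spread shows how much the split matters.
full_scores = cross_val_score(full_model, X, y, cv=5)
pca_scores = cross_val_score(pca_model, X, y, cv=5)

print("full data: mean {:.3f}, std {:.3f}".format(full_scores.mean(), full_scores.std()))
print("PCA (10):  mean {:.3f}, std {:.3f}".format(pca_scores.mean(), pca_scores.std()))
```

If the gap between the two means is smaller than the fold-to-fold standard deviation, the difference between full and reduced data is within split noise.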

# Problem 3
Find some aspect of your final project for which PCA is an appropriate dimension-reduction method.  Apply PCA and analyze the results and performance.  Compare to your results without PCA.  

In [5]:
# load the recipe data used in the final project
df = pd.read_csv("recipe_data.csv")

In [6]:
df = df.dropna()

# round each rating down to the nearest integer to use as the class label
df['rating_floor'] = np.floor(df['rating']).astype(int)

X = df[['calories', 'fat','protein', 'sodium']]
Y = df['rating_floor']

In [7]:
#compare results
start = time.time()
reducer = PCA(n_components=2)
reducer.fit(X)
new_X = reducer.transform(X)
print("time of pca decomp {}".format(time.time() - start))



train_x, test_x, train_y, test_y = train_test_split(X, Y, train_size=0.7, random_state=seed)
r_train_x, r_test_x, r_train_y, r_test_y = train_test_split(new_X, Y, train_size=0.7, random_state=seed)

rf_clf = RandomForestClassifier(n_estimators=100)
start = time.time()
rf_clf.fit(train_x, train_y)
print("full data score {}\t\ttime {}".format(rf_clf.score(test_x,test_y), time.time()-start))

rf_clf = RandomForestClassifier(n_estimators=100)
start = time.time()
rf_clf.fit(r_train_x, r_train_y)
print("PCA score {}\t\t\ttime {}".format(rf_clf.score(r_test_x,r_test_y), time.time()-start))


time of pca decomp 0.008898019790649414
full data score 0.4640276468740182		time 1.5698440074920654
PCA score 0.4436066603832862			time 1.7538659572601318
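
The warning in Problem 1 about normalizing applies here too: `PCA` centers the data but does not rescale it, so columns with large spreads can dominate the components. The sketch below illustrates this on synthetic data, since `recipe_data.csv` is not reproduced here. The column names match the ones used above, but every value is made up and chosen only so that the columns have very different spreads.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the recipe data (made-up values, not real recipes).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calories": rng.normal(500, 300, size=1000),
    "fat": rng.normal(30, 20, size=1000),
    "protein": rng.normal(20, 15, size=1000),
    "sodium": rng.normal(800, 1000, size=1000),
})

# PCA on the raw columns versus PCA on standardized columns.
raw_pca = PCA(n_components=2).fit(demo)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(demo))

# Column carrying the largest absolute weight in the first raw component.
raw_top = demo.columns[np.argmax(np.abs(raw_pca.components_[0]))]
print("unscaled: first component dominated by", raw_top)
print("unscaled variance ratios:", raw_pca.explained_variance_ratio_)
print("scaled variance ratios:  ", scaled_pca.explained_variance_ratio_)
```

On the real data, applying `StandardScaler` before `PCA` would be a reasonable variant to try; whether it changes the random forest score is something only a rerun can show.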


# Problem 4
Repeat what you did in the previous problem, but replacing PCA by a random projection. Try 5 different random projections and compare the results and performance. 

In [8]:
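# compare results over 5 runs
# note: svd_solver='randomized' selects a randomized PCA solver, which is not the same as a random projection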
accuracy = []
times = []
for i in range(5):
    start = time.time()
    reducer = PCA(n_components=2, svd_solver='randomized')
    reducer.fit(X)
    new_X = reducer.transform(X)
    proj_time = time.time() - start
    times.append(proj_time)
    print("time of random projection {}".format(proj_time))

    r_train_x, r_test_x, r_train_y, r_test_y = train_test_split(new_X, Y, train_size=0.7, random_state=seed)

    rf_clf = RandomForestClassifier(n_estimators=100)
    start = time.time()
    rf_clf.fit(r_train_x, r_train_y)
    acc = rf_clf.score(r_test_x,r_test_y)
    accuracy.append(acc)
    print("Projection score {}\t\t\ttime {}".format(acc, time.time()-start))

print("\nProjection average score {}".format(np.mean(accuracy)))
print("Projection average time {}".format(np.mean(times)))

time of random projection 0.010297775268554688
Projection score 0.43669494187873076			time 1.6062350273132324
time of random projection 0.009592771530151367
Projection score 0.43512409676405905			time 1.4497499465942383
time of random projection 0.009931087493896484
Projection score 0.4344957587181904			time 1.420179843902588
time of random projection 0.009760856628417969
Projection score 0.4448633364750236			time 1.422384262084961
time of random projection 0.00988006591796875
Projection score 0.4410933081998115			time 1.4695649147033691

Projection average score 0.43845428840716305
Projection average time 0.009892511367797851


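We see that these randomized runs are slightly less accurate than the PCA result above (0.438 on average versus 0.444) and that the decomposition is slightly slower (about 0.0099 s versus 0.0089 s). That is expected: the data is already very low-dimensional and the PCA solver is highly optimized, so there is little room for a difference. Note also that `PCA(svd_solver='randomized')` computes an approximate PCA rather than a true random projection, and part of the spread across the five runs comes from the unseeded random forest.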
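
For comparison, a true random projection can be built with `GaussianRandomProjection` from `sklearn.random_projection`. This is a hedged sketch, not a rerun of the experiment above: it uses the breast cancer data from Problem 1 as a stand-in for the recipe data, projects to 5 dimensions, and scores with cross-validated logistic regression. All of those choices are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale
from sklearn.random_projection import GaussianRandomProjection

X, y = load_breast_cancer(return_X_y=True)
X = scale(X)  # center and normalize, as in Problem 1

rp_scores = []
for i in range(5):
    # A different seed gives a different random projection matrix.
    projector = GaussianRandomProjection(n_components=5, random_state=i)
    X_proj = projector.fit_transform(X)

    # Mean 5-fold accuracy on the projected data.
    score = cross_val_score(LogisticRegression(max_iter=1000), X_proj, y, cv=5).mean()
    rp_scores.append(score)
    print("projection {}: shape {}, accuracy {:.3f}".format(i, X_proj.shape, score))
```

Unlike PCA, the projection matrix here does not depend on the data, so the variation across the five seeds reflects the projections themselves.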