# MachineLearning Working Group

### Python PCA - September 5, 2018

As with the [R walkthrough](https://github.com/dlab-berkeley/MachineLearningWG/blob/master/Fall2018/sep5-PCA/PCA-R.Rmd), let's begin by replicating [another great example](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60) of PCA in Python, and then look at a machine learning application.

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_openml  # replaces fetch_mldata, which was removed in scikit-learn 0.22
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the iris dataset

In [None]:
iris = pd.read_csv('./iris.csv')
print(type(iris))
iris.head()

# Define the numeric features

In [None]:
Features = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
x = iris.loc[:, Features].values
x

# Standardize the numeric features

In [None]:
x = StandardScaler().fit_transform(x)
x
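
As a rough illustration of what `StandardScaler` computes, the sketch below standardizes a small made-up array with NumPy: each column has its mean subtracted and is then divided by its population standard deviation (`ddof=0`). The array `toy` and the other variable names are illustrative assumptions, not part of the iris data.

```python
import numpy as np

# Hypothetical toy data: 4 observations, 2 features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

col_means = toy.mean(axis=0)
col_stds = toy.std(axis=0)  # population standard deviation (ddof=0)
toy_scaled = (toy - col_means) / col_stds

# After scaling, every column has mean 0 and standard deviation 1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

Both columns end up on the same scale, so neither dominates the principal components simply because of its units.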

# Extract the target variable

In [None]:
y = iris.loc[:,["Species"]].values
y

# Define the 2D PCA feature space

In [None]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
pca_df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])
pca_df
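
To make the projection step less of a black box, here is a minimal NumPy sketch of PCA through the singular value decomposition. It uses randomly generated stand-in data (an assumption, not the iris measurements), and the signs of the resulting columns may differ from what scikit-learn returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations of 4 correlated features
latent = rng.normal(size=(50, 2))
data = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(50, 4))
data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize first

# SVD of the standardized data: the rows of vt are the principal axes
u, s, vt = np.linalg.svd(data, full_matrices=False)
components = vt[:2]            # first two principal axes, shape (2, 4)
scores = data @ components.T   # data projected onto those axes, shape (50, 2)

print(scores[:5])
```

The two columns of `scores` play the same role as the two columns of `pca_df` above.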

# Concatenate the Species vector with the principal component arrays

In [None]:
iris_pca = pd.concat([iris[["Species"]], pca_df], axis = 1)
iris_pca.head()

# Construct the scatterplot

In [None]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel("Principal Component 1", fontsize = 15)
ax.set_ylabel("Principal Component 2", fontsize = 15)
ax.set_title("PCA iris scatterplot", fontsize = 20)
targets = ["setosa", "versicolor", "virginica"]
colors = ["r", "g", "b"]
for target, color in zip(targets,colors):
    indicesToKeep = iris_pca["Species"] == target
    ax.scatter(iris_pca.loc[indicesToKeep, "principal component 1"]
               , iris_pca.loc[indicesToKeep, "principal component 2"]
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

In [None]:
pca.explained_variance_ratio_

# Proportions of variance are similar to R!
The proportions of variance explained by the first two components are virtually identical to those we obtained in R:

- PC 1 = 0.7296245 
- PC 2 = 0.2285076
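
For intuition about where these proportions come from, the sketch below computes them by hand: each proportion is one eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The data are randomly generated placeholders (an assumption), so the printed numbers will not match the iris values above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 observations of 4 correlated features
data = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize first

cov = np.cov(data, rowvar=False)              # 4 x 4 covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh is ascending; reverse it
ratios = eigenvalues / eigenvalues.sum()      # proportion of variance per component

print(ratios)
```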

# Machine Learning example

Now, let's use PCA to reduce the dimensionality of the inputs to a logistic regression model, which speeds up training.

In [None]:
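# Load the MNIST dataset from OpenML
# (fetch_mldata was removed in scikit-learn 0.22; fetch_openml replaces it)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)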

In [None]:
# Hold out 1/7 of the data for testing (60,000 training and 10,000 test images)
# Define our training and test images and our training and test labels
# random_state is like setting the seed in R and ensures reproducible results
train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)

# Initialize the scaler to standardize the data (remember that PCA is strongly affected by scale!)
scaler = StandardScaler()
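
To see what a seeded split with a test fraction of 1/7 does, here is a small NumPy sketch on made-up data. It is only an illustration of the idea, not scikit-learn's implementation, and every name in it is hypothetical.

```python
import numpy as np

n_samples = 70                 # hypothetical small dataset
test_fraction = 1 / 7.0
features = np.arange(n_samples * 2).reshape(n_samples, 2)
labels = np.arange(n_samples) % 10

rng = np.random.default_rng(0)           # fixed seed -> same shuffle every run
order = rng.permutation(n_samples)       # shuffled row indices
n_test = int(round(n_samples * test_fraction))

test_idx, train_idx = order[:n_test], order[n_test:]
train_x, test_x = features[train_idx], features[test_idx]
train_y, test_y = labels[train_idx], labels[test_idx]

print(train_x.shape, test_x.shape)
```

With 70 rows this leaves 60 for training and 10 for testing, the same proportions as the 70,000 MNIST images.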

# Fit the scaler to the training set

In [None]:
scaler.fit(train_img)

train_img = scaler.transform(train_img)

test_img = scaler.transform(test_img)
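
The sketch below shows, on a tiny made-up array, why the scaler is fit on the training images only: the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. The constant column mimics always-blank border pixels; scikit-learn's `StandardScaler` guards against the resulting zero standard deviation in a similar way. All names and values here are illustrative assumptions.

```python
import numpy as np

# Hypothetical pixel-like data; the last column is constant, like blank border pixels
train = np.array([[0.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [4.0, 30.0, 0.0]])
test = np.array([[1.0, 40.0, 0.0]])

mean = train.mean(axis=0)
scale = train.std(axis=0)
scale[scale == 0] = 1.0   # avoid dividing by zero for constant columns

train_scaled = (train - mean) / scale
test_scaled = (test - mean) / scale   # reuse the training statistics

print(test_scaled)
```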

# Initialize the PCA model

In [None]:
# A value between 0 and 1 tells the model what fraction of the variance to retain.
# We want 95% of it, so we enter 0.95
mnist_pca = PCA(0.95)
mnist_pca.fit(train_img)
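
Conceptually, asking for 95% of the variance means keeping the smallest number of components whose explained-variance ratios add up to at least 0.95. The sketch below illustrates that rule with made-up ratios; it is a simplified picture, not scikit-learn's exact code.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = np.array([0.50, 0.30, 0.10, 0.06, 0.04])
target = 0.95

cumulative = np.cumsum(ratios)
# Position of the first component at which the running total reaches the target
n_keep = int(np.argmax(cumulative >= target)) + 1

print(n_keep)  # 4
```

After fitting, the number of components actually chosen is stored in `mnist_pca.n_components_`.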

# Do the transform on the training and test sets

In [None]:
train_img = mnist_pca.transform(train_img)
test_img = mnist_pca.transform(test_img)

# Initialize logistic regression
... with default settings, apart from the solver

In [None]:
# All parameters not specified are set to their defaults
# The solver is set to 'lbfgs' explicitly: older scikit-learn releases defaulted to the
# much slower 'liblinear' (since version 0.22, 'lbfgs' is the default)
logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(train_img, train_lbl)
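
The same scale, reduce, and classify sequence can also be chained into a single scikit-learn pipeline. The sketch below is one possible arrangement, not the original notebook's method. It assumes the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST so that it runs without a download, and it raises `max_iter` (an assumption) to give the solver room to converge.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (a stand-in for MNIST)
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1 / 7.0, random_state=0)

# Each step is fit on the training data only, then reused for the test data
model = make_pipeline(StandardScaler(),
                      PCA(0.95),
                      LogisticRegression(solver="lbfgs", max_iter=1000))
model.fit(x_train, y_train)

accuracy = model.score(x_test, y_test)
n_components = model.named_steps["pca"].n_components_
print(n_components, round(accuracy, 3))
```

Bundling the steps this way makes it harder to fit the scaler or the PCA on the test set by accident.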

In [None]:
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

In [None]:
# Predict for the first ten observations (images)
logisticRegr.predict(test_img[0:10])

In [None]:
logisticRegr.score(test_img, test_lbl)
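
For a classifier, `score` returns the mean accuracy: the fraction of test images whose predicted label matches the true label. The sketch below computes the same quantity by hand on made-up labels (the arrays are hypothetical).

```python
import numpy as np

# Hypothetical predicted and true labels for six test images
predicted = np.array([7, 2, 1, 0, 4, 1])
actual = np.array([7, 2, 1, 0, 9, 1])

accuracy = np.mean(predicted == actual)   # fraction of matching labels
print(accuracy)
```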

View [this webpage](https://plot.ly/ipython-notebooks/principal-component-analysis/) for another great iris example. 