## HW2 Unsupervised and Supervised Learning

**Deadline** 11:59 pm on November 3rd

In this assignment you'll gain hands-on experience with principal component analysis (PCA) and with supervised learning methods such as support vector machines (SVMs) and random forests.

You need to install the following libraries: tensorflow and pillow. If you use Google Colab, no additional installation is needed.

In the first problem, you will study how well different numbers of principal components represent the images visually. In the second problem, you will use scikit-learn's built-in functions to perform classification on the provided data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import math

### Problem 2.1: PCA for dimension reduction (3 Points)

In this problem you will approximately reconstruct images as linear combinations of a few principal components.
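As a rough illustration of the idea, the sketch below reconstructs one sample from the mean plus its projections onto the first $k$ principal directions. It uses synthetic data and NumPy's SVD rather than the MNIST images and sklearn's `PCA`; the names `data`, `reconstruct`, and `errors` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened images: 200 samples, 16 features,
# generated from 5 latent directions plus a little noise.
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 16))
data += 0.05 * rng.normal(size=(200, 16))

mean = data.mean(axis=0)
centered = data - mean
# Rows of vt are the principal directions, sorted by decreasing variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)


def reconstruct(sample, k):
    """Approximate `sample` by the mean plus its first k principal components."""
    components = vt[:k]                     # shape (k, n_features)
    scores = components @ (sample - mean)   # projection coefficients, shape (k,)
    return mean + scores @ components


sample = data[0]
ks = (0, 1, 3, 5, 16)
errors = [np.linalg.norm(sample - reconstruct(sample, k)) for k in ks]
for k, err in zip(ks, errors):
    print(f"k={k:2d}  reconstruction error={err:.4f}")
```

With $k = 0$ the reconstruction is just the mean, and the error can only shrink as more components are added; using all 16 directions recovers the sample up to floating-point error.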

Note: When you display the images, use the color map `cmap=plt.cm.gray.reversed()` for MNIST.

Pick a random seed in the next cell to select a random image of a handwritten $0$ from the MNIST data.

In [None]:
(x, y), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x = x.reshape([60000, 28*28])
zeros = np.where(y==0)[0]
x = x[zeros,:]
y = y[zeros]
np.random.seed(265) # put your seed here
my_image = np.random.randint(0, len(y), size=1)

plt.imshow(x[my_image,:].reshape((28,28)), cmap=plt.cm.gray.reversed())

For $k = 0, 10, 20, 30, 40, 50$, use the $k$-th principal components of the MNIST $0$'s to approximately reconstruct the image selected above. Note that we index from 0, so the 0-th principal component is the first one. Display the reconstruction for each value of $k$. To display the set of images compactly, you may want to use the `plot_images` function defined below.

In [None]:
def plot_images(images, titles, h, w, n_row=3, n_col=4, reversed=False):
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(min(n_row * n_col, len(images))):  # stop early if there are fewer images than grid cells
        plt.subplot(n_row, n_col, i + 1)
        if reversed:
            plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        else:
            plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray.reversed())
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

In [None]:
# Hint example code
from sklearn.decomposition import PCA

height = 28
width = 28


num_components = 50

#initialize and fit PCA
pca = PCA(num_components).fit(x)

# get the principal vectors (components) of the fitted PCA, shape (n_components, n_features); they are sorted by explained_variance_
principal_vectors = 

# reshape the principal vectors to the same size as the input images
principal_vectors = principal_vectors.reshape((num_components, height, width))

# fit the model with x and apply the dimensionality reduction on x.
pcs = 

# Transform data back to its original space
capprox = 

# plot
labels = ['principal vector %d' % (i+1) for i in np.arange(num_components)]

plot_images( #fill in# )

# show the obtained total variance
ratio = pca.explained_variance_ratio_.sum()
print('Variance explained by first %d principal vectors: %.2f%%' % (num_components, ratio*100))

### Problem 2.2: SVM for classification (7 Points)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for statistical data visualization
import sklearn.metrics as metrics
%matplotlib inline

In [None]:
# load data
train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')

# Set variables for the targets and features
y_train = train_data['price_range']
X_train = train_data.drop('price_range', axis=1)
y_test = test_data['price_range']
X_test = test_data.drop('price_range', axis=1)

In [None]:
# TODO: Normalize Data

### 2.2.a

The linear kernel is written as $\langle x, x' \rangle$.

The parameter $C$ trades off misclassification of training examples against simplicity of the decision surface. A low $C$ makes the decision surface smooth, while a high $C$ aims at classifying all training examples correctly.
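To make the trade-off concrete, here is a minimal sketch that evaluates the textbook soft-margin objective $\frac{1}{2}w^2 + C\sum_i \max(0, 1 - y_i(w x_i + b))$ for two hand-picked 1-D classifiers. This is an illustration only: the data and the two candidates are invented, and `LinearSVC` uses a squared hinge loss by default, so its exact objective differs.

```python
def soft_margin_objective(w, b, points, labels, C):
    """Textbook soft-margin objective for 1-D inputs: 0.5*w^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(points, labels))
    return 0.5 * w * w + C * hinge


# Toy, linearly separable 1-D data (hypothetical).
points = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, -1, 1, 1]

smooth = (0.5, 0.0)  # small weight: wide margin, two points fall inside it
strict = (1.0, 0.0)  # larger weight: every point satisfies the margin exactly

for C in (0.1, 10.0):
    j_smooth = soft_margin_objective(*smooth, points, labels, C)
    j_strict = soft_margin_objective(*strict, points, labels, C)
    preferred = "smooth" if j_smooth < j_strict else "strict"
    print(f"C={C}: smooth={j_smooth:.3f}, strict={j_strict:.3f}, preferred={preferred}")
```

With the small $C$ the penalty for margin violations is cheap, so the smoother candidate has the lower objective; with the large $C$ the violations dominate and the stricter candidate wins.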

In [None]:
from sklearn.svm import LinearSVC

# TODO: Train Linear kernel SVM for different values of C on train data
Cs = [   ] # TODO: fill in the hyper-parameter candidates
for c in Cs:
  lsvc = LinearSVC(random_state = 7, C = c)
  # TODO: Fit the model and get predictions and evaluation on testing data.

  # TODO: Save your results

# TODO: Plot accuracy on test data

### 2.2.b

The RBF kernel is expressed as $\exp(-\gamma \|x - x'\|^2)$.

$\gamma$ defines how much influence a single training example has. The larger $\gamma$ is, the closer other examples must be to be affected.
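The effect of $\gamma$ can be seen by evaluating the kernel $\exp(-\gamma \|x - x'\|^2)$ directly. The sketch below is plain Python, not sklearn's implementation; `rbf_kernel` here is a helper defined for this example, and the points are arbitrary.

```python
import math


def rbf_kernel(x, x_prime, gamma):
    """RBF kernel value exp(-gamma * ||x - x'||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


x = [0.0, 0.0]
near = [0.5, 0.0]  # squared distance 0.25 from x
far = [2.0, 0.0]   # squared distance 4.0 from x

for gamma in (0.1, 1.0, 10.0):
    k_near = rbf_kernel(x, near, gamma)
    k_far = rbf_kernel(x, far, gamma)
    print(f"gamma={gamma}: k(x, near)={k_near:.4f}, k(x, far)={k_far:.6f}")
```

As $\gamma$ grows, the kernel value for the far point drops toward zero much faster than for the near point, so only close neighbours still influence each other.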

In [None]:
from sklearn.svm import SVC

# TODO: Train RBF kernel SVM for different values of gamma on train data

gammas = [   ] # TODO: fill in the hyper-parameter candidates
for g in gammas:
  rsvc = SVC(random_state = 7, C=1.0, kernel='rbf', gamma = g)
  # TODO: Fit the model and get predictions and evaluation on testing data.

  # TODO: Save your results


# TODO: Plot accuracy on test data

### 2.2.c

In [None]:
from sklearn.ensemble import RandomForestClassifier

n_trees = [] # TODO: fill in the hyper-parameter candidates
# TODO: Train Random Forest for different values of number of estimators on train data
for n in n_trees:
  rf = RandomForestClassifier(random_state = 7, n_estimators=n)
  # TODO: Fit the model and get predictions and evaluation on testing data.

  # TODO: Save your results

# TODO: Plot accuracy on test data

### 2.2.d

In [None]:
from sklearn.model_selection import GridSearchCV, KFold  # KFold is used below to build the CV splitter

n_folds = 5

# configure the cross-validation procedure
cv = KFold(n_splits=n_folds, shuffle=True, random_state=1)


# TODO: define search space
space = dict()

# TODO: define the models for parts a, b, and c
lsvc = 
rsvc =
rf = 

# TODO: Perform a grid search and cross-validation to find the optimal hyperparameters of parts a, b, and c

# TODO: For each part, report the optimal value

# TODO: For each part, report the accuracy on test data for the best estimator


Hint: The cell below gives an example of how to use GridSearchCV. For more information, please refer to [Scikit Learn - GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [None]:
'''
This is only example code showing how to use GridSearchCV.
You don't need to do anything here.
You can refer to it when implementing Problem 2.2.d.
It assumes `model`, `space`, and `cv` have already been defined.
'''
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

# execute search
result = search.fit(X_train, y_train)

# get the best performing model fit on the whole training set
best_model = result.best_estimator_

# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)
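For intuition about what a grid search with K-fold cross-validation does, here is a small pure-Python sketch. It is not how `GridSearchCV` is implemented: the toy 1-D data, the k-nearest-neighbour rule, and the helper names (`k_fold_splits`, `knn_predict`, `cv_accuracy`) are all assumptions made for illustration.

```python
import random


def k_fold_splits(n_samples, n_splits, seed=1):
    """Yield (train_indices, validation_indices) for shuffled K-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (labels are 0/1)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


# Toy 1-D data: label is 1 when x > 2.0 (hypothetical).
xs = [i / 10 for i in range(40)]
ys = [1 if x > 2.0 else 0 for x in xs]


def cv_accuracy(k, n_splits=5):
    """Mean validation accuracy of the k-NN rule over all folds."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), n_splits):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)


grid = [1, 3, 5]  # candidate hyperparameter values
results = {k: cv_accuracy(k) for k in grid}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

Every candidate is scored on the same folds, and the candidate with the highest mean validation accuracy is kept; the test set is never touched during this selection.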