## <center> Principal Component Analysis and Clustering

In this notebook, we walk through the `sklearn` built-in implementations of dimensionality reduction and clustering methods.

## 1. Principal Component Analysis

First we'll import all required modules:

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns; sns.set(style='white')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.manifold import TSNE
from sklearn import datasets
from sklearn.model_selection import train_test_split

Use the given toy data set:

In [None]:
X = np.array([[2., 13.], [1., 3.], [6., 19.],
              [7., 18.], [5., 17.], [4., 9.],
              [5., 22.], [6., 11.], [8., 25.]])

In [None]:
plt.scatter(X[:,0], X[:, 1])
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$');

**With the following code we rescale the data using `StandardScaler`, then plot it against the $x_1$ and $x_2$ axes together with the vector corresponding to the first principal component.**

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
print(scaled_data)
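
As a quick check of what `StandardScaler` does, the sketch below reproduces it with plain NumPy: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The names `X_toy` and `manual_scaled` are illustrative only and do not appear elsewhere in the notebook.

```python
import numpy as np

# Same toy data as above.
X_toy = np.array([[2., 13.], [1., 3.], [6., 19.],
                  [7., 18.], [5., 17.], [4., 9.],
                  [5., 22.], [6., 11.], [8., 25.]])

# StandardScaler centers each column and divides by its population
# standard deviation (ddof=0), not the sample one (ddof=1).
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0, ddof=0)
manual_scaled = (X_toy - col_mean) / col_std

print(manual_scaled.mean(axis=0))  # approximately 0 in each column
print(manual_scaled.std(axis=0))   # approximately 1 in each column
```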

In [None]:
plt.scatter(scaled_data[:, 0], scaled_data[:, 1])
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$');

In [None]:
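# Direction of the first principal component of the rescaled data.
first_pc = PCA(n_components=1).fit(scaled_data).components_[0]
t = np.array([-3, 3])  # endpoints of a line through the origin
plt.scatter(scaled_data[:, 0], scaled_data[:, 1])
plt.plot(t * first_pc[0], t * first_pc[1], color='red')
plt.axis('equal');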

**With the following code we will find the eigenvalues of the $X^{\text{T}}X$ matrix, where $X$ is the rescaled matrix of the toy dataset.**

In [None]:
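# Use the rescaled data, as stated above; X itself is still the raw toy matrix.
z = np.dot(scaled_data.T, scaled_data)
c = np.linalg.eigvals(z)
c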
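
To connect these eigenvalues to PCA, the following sketch (an illustration, not part of the original notebook) applies `np.linalg.eigh` to the standardized toy data. For a standardized matrix with $n$ rows and $d$ columns, $X^{\text{T}}X$ is $n$ times the correlation matrix, so the eigenvalues sum to $n \cdot d$, and each eigenvalue divided by that sum is the share of variance along the corresponding component. The names `X_std`, `gram`, and `first_pc` are assumptions made for this example.

```python
import numpy as np

X_toy = np.array([[2., 13.], [1., 3.], [6., 19.],
                  [7., 18.], [5., 17.], [4., 9.],
                  [5., 22.], [6., 11.], [8., 25.]])
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

gram = X_std.T @ X_std
# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(gram)

# The eigenvector of the largest eigenvalue is the first principal direction.
first_pc = eigenvectors[:, -1]
explained_ratio = eigenvalues[::-1] / eigenvalues.sum()

print("eigenvalues (descending):", eigenvalues[::-1])
print("first principal direction:", first_pc)
print("explained variance ratio:", explained_ratio)
```

The sign of `first_pc` is arbitrary: an eigenvector and its negation describe the same direction.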

Let's load a dataset of people's faces and output their names.

In [None]:
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=50, 
                resize=0.4, data_home='../../data/faces')

print('%d objects, %d features, %d classes' % (lfw_people.data.shape[0],
      lfw_people.data.shape[1], len(lfw_people.target_names)))
print('\nPersons:')
for name in lfw_people.target_names:
    print(name)

Let's look at some faces. All images are stored in a handy `lfw_people.images` array.

In [None]:
fig = plt.figure(figsize=(8, 6))

for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap='gray')

**With the following code we will find the minimum number of principal components needed to explain 90% of the data variance.**

For this task, we will be using the [`svd_solver='randomized'`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) parameter, which computes an approximate PCA but is significantly faster on large data sets. We'll also use a fixed `random_state=1` for comparable results.

In [None]:
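random_state = 1
svd_solver = 'randomized'
# Flatten each 50 x 37 image into one row of 1850 features.
imgs = lfw_people.images.reshape(len(lfw_people.images), -1)
# A float n_components is rejected by the randomized solver, so fit all
# components and accumulate the explained-variance ratios instead.
pca = PCA(svd_solver=svd_solver, random_state=random_state).fit(imgs)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cumulative_variance >= 0.90)) + 1
print("components needed for 90% variance:", n_components_90)
X_pca = pca.transform(imgs)[:, :n_components_90]
print("original shape:   ", imgs.shape)
print("transformed shape:", X_pca.shape)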
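
The number of components needed for a given share of variance can be read from the cumulative sum of `explained_variance_ratio_`. The sketch below, on synthetic data rather than the faces, compares that manual count with the shortcut of passing a float `n_components` together with `svd_solver='full'`. The array `data` and its dimensions are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for an image matrix: 200 samples, 30 correlated features.
rng = np.random.RandomState(1)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 30))
data = latent @ mixing + 0.3 * rng.normal(size=(200, 30))

# Manual route: fit every component, then accumulate the variance ratios.
pca_all = PCA(svd_solver='full').fit(data)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.90)) + 1

# Shortcut: with svd_solver='full', a float in (0, 1) asks PCA to keep
# just enough components to reach that share of variance.
pca_90 = PCA(n_components=0.90, svd_solver='full').fit(data)

print("manual count:", n_manual)
print("PCA(n_components=0.90):", pca_90.n_components_)
```

The manual route is the one that carries over to the randomized solver, which expects an integer number of components.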

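Plot a picture showing the first 30 principal components. To create it, take the first 30 vectors from `pca.components_`, reshape each to the original image size (50 x 37), and tile them into a single 5 x 6 mosaic.

In [None]:
grid = pca.components_[:30].reshape(5, 6, 50, 37).transpose(0, 2, 1, 3).reshape(250, 222)
plt.imshow(grid, cmap='gray'); plt.axis('off');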

## 2. Clustering

For the next task in the notebook, we'll load the housing prices dataset:

In [None]:
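# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
boston = datasets.load_boston()
X = boston.data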

Using the elbow method (see [article 7](https://medium.com/@libfun/db7879568417) of the course), we'll find the number of clusters to set as a hyperparameter for the k-means algorithm.

**With the following code we will find the optimal number of clusters for the housing prices data set according to the elbow method.**

In this case, we are looking for the sharpest bend (the "elbow") in the `Cluster number vs Centroid distances` graph.

In [None]:
inertia = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(X)
    # Store the square root of the summed squared distances to the centroids.
    inertia.append(np.sqrt(kmeans.inertia_))
inertia

In [None]:
plt.plot(range(2, 10), inertia, marker='s');
plt.xlabel('$k$')
plt.ylabel('$J(C_k)$');
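
Reading the elbow by eye can be ambiguous. One common heuristic (an assumption here, not something this notebook prescribes) picks the $k$ that minimizes $D(k) = \frac{|J(C_k) - J(C_{k+1})|}{|J(C_{k-1}) - J(C_k)|}$, that is, the point after which the inertia stops dropping quickly. The sketch below applies it to synthetic blobs from `make_blobs` rather than to the housing data; names such as `elbow_ratio` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic clusters, so the elbow should sit at k=3.
centers = np.array([[0., 0.], [10., 0.], [5., 9.]])
X_blobs, _ = make_blobs(n_samples=300, centers=centers,
                        cluster_std=1.0, random_state=1)

ks = list(range(1, 9))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_blobs).inertia_
           for k in ks]

# D(k) is defined for interior k only, since it needs both neighbours.
elbow_ratio = {}
for i in range(1, len(ks) - 1):
    drop_after = abs(inertia[i] - inertia[i + 1])
    drop_before = abs(inertia[i - 1] - inertia[i])
    elbow_ratio[ks[i]] = drop_after / drop_before

best_k = min(elbow_ratio, key=elbow_ratio.get)
print("suggested number of clusters:", best_k)
```

On real data the minimum of $D(k)$ is often less pronounced, so it is worth checking the suggestion against the plot.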

Let's go back to the faces dataset. Imagine that we did not know whose face was in each photo, but knew that there were 12 different people. Let's compare clustering results from 4 algorithms: k-means, Agglomerative clustering, Affinity Propagation, and Spectral clustering. We'll use the same parameters as at the end of [this article](https://medium.com/@libfun/db7879568417), changing only the number of clusters to 12.

In [None]:
# KMeans is already imported at the top of the notebook.
kmeanss = KMeans(n_clusters=12, random_state=1).fit(imgs)
kmeanss.labels_

In [None]:
# AgglomerativeClustering is already imported at the top of the notebook.
# Apart from n_clusters the defaults are used (Ward linkage).
agglomerative = AgglomerativeClustering(n_clusters=12).fit(imgs)
agglomerative.labels_

In [None]:
from sklearn.cluster import AffinityPropagation

# Affinity Propagation chooses the number of clusters itself, so n_clusters is not set.
affinity_propagation = AffinityPropagation().fit(imgs)
affinity_propagation.labels_

In [None]:
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(n_clusters=12, assign_labels="discretize",
                              random_state=1).fit(imgs)
spectral.labels_
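
Because the true identities are available in `lfw_people.target`, the label arrays produced above can be scored against them. A minimal sketch of such a comparison follows, using synthetic blobs so that it runs without downloading the faces. The helper name `score_clustering` is hypothetical, while the metric functions come from `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with known labels, standing in for (imgs, lfw_people.target).
centers = np.array([[0., 0.], [10., 0.], [5., 9.]])
X_demo, y_true = make_blobs(n_samples=300, centers=centers,
                            cluster_std=1.0, random_state=1)


def score_clustering(y_true, y_pred):
    """Return agreement scores that ignore how cluster labels are numbered."""
    return {
        'ARI': metrics.adjusted_rand_score(y_true, y_pred),
        'AMI': metrics.adjusted_mutual_info_score(y_true, y_pred),
        'homogeneity': metrics.homogeneity_score(y_true, y_pred),
        'completeness': metrics.completeness_score(y_true, y_pred),
    }


models = {
    'k-means': KMeans(n_clusters=3, n_init=10, random_state=1),
    'agglomerative': AgglomerativeClustering(n_clusters=3),
}
scores = {name: score_clustering(y_true, model.fit_predict(X_demo))
          for name, model in models.items()}

for name, result in scores.items():
    print(name, {key: round(value, 3) for key, value in result.items()})
```

For the faces, the same helper could be called with `lfw_people.target` and each estimator's `labels_`. Scores on real images are typically far lower than on well-separated blobs.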