<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/PCA0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Clone the repo**

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

In [None]:
from IPython.display import Image
Image("images/pca0.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca1.png" , width=600)

# **Principal component analysis (PCA)** 
PCA is a multivariate technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize. It was invented in 1901 by Karl Pearson.

When many variables are studied at once, the information is easier to interpret if the variables are reduced to a few linear combinations of the originals. Each linear combination corresponds to a principal component (PC). The number of PCs is at most the smaller of the number of original variables and the number of observations.
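
As a rough illustration of the two claims above (a PC is a linear combination of the variables, and the number of PCs is bounded by the smaller data dimension), here is a minimal sketch on synthetic data. The array `X` and its 5 x 8 shape are assumptions made for the example and are not part of the movie data set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption for illustration): 5 observations, 8 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

# With no n_components given, scikit-learn keeps min(n_observations, n_variables) PCs
pca = PCA()
scores = pca.fit_transform(X)

# One row per PC; each row holds the weights of a linear combination of the variables
print(pca.components_.shape)
print(scores.shape)

# The PC scores are the centered data multiplied by those weights
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(scores, manual))
```

Here there are fewer observations than variables, so the observation count limits the number of components.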

In [None]:
from IPython.display import Image
Image("images/pca9.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca8.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca7.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca6.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca5.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca4.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca3.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca2.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca15.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca14.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca13.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca12.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca11.png" , width=600)

In [None]:
from IPython.display import Image
Image("images/pca10.png" , width=600)

**Import libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


# **Can we predict how a viewer will rate a movie?**

**Load and prepare the data**

In [None]:
#Load movie names and movie ratings
movies = pd.read_csv('movies_df.csv')

In [None]:
# read_csv already returns a DataFrame, so it can be displayed directly
movies

In [None]:
movies = movies.drop(['title','genres'],axis=1)

In [None]:
movies.head(40)

In [None]:
# Build a user x movie ratings matrix (rows: users, columns: movies)
M = movies.pivot_table(index='userId_x', columns='movieId', values='rating')
m = M.shape
print("shape = ", m)
# Movies a user has not rated are filled with 0
df1 = M.fillna(0)

In [None]:
df1.head()

**Scale the data**

In [None]:
X_std = StandardScaler().fit_transform(df1)

**Compute covariance matrix**

In [None]:
# Center the data, then form the sample covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)

**Compute the Principal Components of the data set**

In [None]:
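# Calculate eigenvectors and eigenvalues of the covariance matrix.
# The matrix is symmetric, so eigh is used: it returns real values in ascending order.
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
# Reverse so the largest eigenvalue (first principal component) comes first
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)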

**Compute the feature vector**

In [None]:
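# Pair each eigenvalue with its eigenvector and sort by decreasing eigenvalue
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for pair in eig_pairs:
    print(pair[0])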
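
The feature vector is usually taken to be the matrix whose columns are the top-k eigenvectors, and projecting the standardized data onto it gives the reduced representation. The sketch below shows one way this step might look. It runs on a small synthetic matrix rather than the ratings data, and the names `X_demo`, `k`, `feature_vector`, and `X_proj` are hypothetical, chosen only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ratings matrix (assumption): 50 users, 6 movies
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(50, 6)))

# Eigendecomposition of the symmetric covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_demo.T))

# Sort eigenpairs by decreasing eigenvalue
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Feature vector: the top-k eigenvectors stacked as columns (k is a hypothetical choice)
k = 2
feature_vector = eig_vecs[:, :k]

# Project the standardized data onto the k principal directions
X_proj = X_demo @ feature_vector
print(X_proj.shape)

# scikit-learn's PCA should give the same projection up to the sign of each column
X_sk = PCA(n_components=k).fit_transform(X_demo)
print(np.allclose(np.abs(X_proj), np.abs(X_sk)))
```

The comparison uses absolute values because an eigenvector is only defined up to its sign, so the two methods may flip a column.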

**Use the PCA() function to reduce the dimensionality of the data set**

In [None]:
pca = PCA(n_components=2)
# Fit on the standardized data and keep the 2-D projection
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)

**The scree plot shows how the cumulative explained variance grows with the number of principal components**

In [None]:
pca = PCA().fit(X_std)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

The scree plot indicates that the first 500 or so principal components capture most of the information (variance) in the data. The initial data set had approximately 1550 features, which can now be represented by about 500 components. Each component is a linear combination of the original features, so this step compresses redundant or low-variance directions rather than dropping individual variables. Further analysis is easier on the reduced data. This is the power of dimensionality reduction.
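
Instead of reading the cutoff off the plot by eye, the number of components can be computed from the cumulative explained variance. The following is a minimal sketch under stated assumptions: `X_demo` is synthetic low-rank data standing in for `X_std`, and the 0.90 `threshold` is a hypothetical target, not a value taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic low-rank data plus noise (assumption), standing in for X_std
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X_demo = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(200, 40))

# Fit all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance exceeds the target
threshold = 0.90  # hypothetical target
n_keep = int(np.argmax(cumulative > threshold)) + 1
print(n_keep, cumulative[n_keep - 1])

# Passing a float between 0 and 1 asks scikit-learn to pick the count itself
pca_auto = PCA(n_components=threshold).fit(X_demo)
print(pca_auto.n_components_)
```

On the movie data, the same approach could be tried by swapping `X_std` in for `X_demo`. Which threshold is appropriate depends on the downstream task.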