
## Fashion MNIST Exercise to study how PCA can achieve data compression

### By Ezhilarasan Kannaiyan 

## What is Fashion-MNIST Dataset?
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

### Scope
Images contain a lot of redundant data. For example, many adjoining pixels have the same or nearly the same intensity. PCA can therefore reduce the number of columns per image substantially. This exercise shows that even after reducing each image in this dataset from 1 x 784 to 1 x 187, the image is still preserved and remains clearly recognizable when plotted.
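As a rough back-of-the-envelope check of what this reduction means per image, the arithmetic can be sketched as follows. The figures 784 and 187 are taken from the text above; the variable names are only illustrative.

```python
# Per-image size before and after PCA, using the figures quoted in the text.
original_dims = 28 * 28   # 784 pixel values per image
reduced_dims = 187        # principal components kept (figure from the text)

ratio = reduced_dims / original_dims
print(f"Kept {reduced_dims} of {original_dims} values per image ({ratio:.1%})")
print(f"Reduction per image: {1 - ratio:.1%}")
```

This counts only the per-image values. The fitted PCA model (its components and mean) must also be kept to reconstruct images, but that cost is shared across the whole dataset.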

In [None]:
# Call libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

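# Check version of sklearn.
# There should not be any assertion error
import re
import sklearn
# Compare (major, minor) as integers; a plain string comparison would rank "0.3" above "0.20"
assert tuple(map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())) >= (0, 20)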

In [None]:
#Read dataset
X = pd.read_csv("/kaggle/input/fashionmnist/fashion-mnist_train.csv")

In [None]:
# Only the last expression in a cell is displayed, so print the shape explicitly
print(X.shape)
X.head()

In [None]:
#Copy the first column 'label' (target) to 'y' array and remove it
y = X.pop('label')

In [None]:
y.head()

In [None]:
# Print both shapes; a bare expression is shown only if it is the last line of the cell
print(X.shape, y.shape)

In [None]:
# Split dataset. Default split test-size is 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
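# Print all four shapes; a bare expression is shown only if it is the last line of the cell
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)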

In [None]:
# Train PCA on dataset
pca = PCA()
pca.fit(X_train)

In [None]:
# Get statistics from pca
# How much variance is explained by each principal component
print(pca.explained_variance_ratio_[:10])
# Cumulative sum of the explained variance ratios
cumsum = np.cumsum(pca.explained_variance_ratio_)
cumsum[:10]

In [None]:
# Get the number of principal components at which the
# cumulative explained variance first reaches 0.95
d = np.argmax(cumsum >= 0.95) + 1
d

In [None]:
# Let us also plot cumsum - saturation occurs at the elbow
abc = plt.figure(figsize=(6,4))
abc = plt.plot(cumsum, linewidth=3)
abc = plt.axis([0, 400, 0, 1])
abc = plt.xlabel("Dimensions")
abc = plt.ylabel("Explained Variance")
# Draw a (vertical) line from (d,0) to (d,0.95) - Should be black and dotted
abc = plt.plot([d, d], [0, 0.95], "k:")
# Draw another dotted (horizontal) line - from (0,0.95) to (d,0.95)
abc = plt.plot([0, d], [0.95, 0.95], "k:")
# Draw a point at (d,0.95)
abc = plt.plot(d, 0.95, "ko")
# Annotate graph
abc = plt.annotate("Elbow", xy=(40, 0.81), xytext=(60, 0.65), arrowprops=dict(arrowstyle="->"), fontsize=16)
plt.grid(True)
plt.show()

In [None]:
# Get the transformed dataset that retains
# up to 95% of the explained variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

In [None]:
# Number of components kept, then the shape of the reduced dataset
print(pca.n_components_)
X_reduced.shape

The number of columns per image is reduced from 784 to 187.
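Passing a float to `n_components` should select the same number of components as the manual `np.argmax(cumsum >= 0.95) + 1` computation used earlier. The sketch below checks this on a small synthetic matrix, which is an assumption made so the example runs without the Kaggle CSV; the names `X_demo`, `full` and `auto` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the image matrix (NOT Fashion-MNIST):
# 500 samples, 50 correlated features driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# Manual route: fit all components, then find where the cumulative
# explained variance first reaches 0.95.
full = PCA().fit(X_demo)
cumsum_demo = np.cumsum(full.explained_variance_ratio_)
d_manual = int(np.argmax(cumsum_demo >= 0.95)) + 1

# Automatic route: let PCA choose the count from the variance threshold.
auto = PCA(n_components=0.95).fit(X_demo)

print("manual:", d_manual, "automatic:", auto.n_components_)
```

The two routes could differ only in the edge case where the cumulative sum equals the threshold exactly, which is unlikely with floating-point data.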

In [None]:
# Recheck sum of explained variance
np.sum(pca.explained_variance_ratio_)

Next, we apply PCA's inverse transform to go from 187 columns back to 784 (the original shape) and check the quality of the images.
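To make the inverse transform less of a black box: with the default `whiten=False`, it amounts to projecting the reduced data back through the component matrix and adding the per-feature mean. The sketch below verifies this on synthetic data, an assumption made to keep the example self-contained; `X_demo`, `pca_demo` and `Z` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (NOT Fashion-MNIST): 200 samples, 20 correlated features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

# Keep 5 components and project the data onto them.
pca_demo = PCA(n_components=5).fit(X_demo)
Z = pca_demo.transform(X_demo)

# Manual reconstruction, valid for the default whiten=False:
# map back through the components, then restore the mean.
manual = Z @ pca_demo.components_ + pca_demo.mean_
builtin = pca_demo.inverse_transform(Z)

print("shapes:", Z.shape, "->", builtin.shape)
print("manual matches inverse_transform:", np.allclose(manual, builtin))
```

The recovered matrix has the original number of columns, but the information carried by the discarded components is lost, so it only approximates the original.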

In [None]:
# Use PCA's function inverse_transform() to get the original
# dimensions back from the reduced dimensionality
X_recovered = pca.inverse_transform(X_reduced)

In [None]:
X_recovered.shape     

In [None]:
# Plot a few images from the original dataset
# (Fashion-MNIST contains clothing items, not digits)
fig,axe = plt.subplots(2,5)
axe = axe.flatten()
for i in range(10):
    abc = axe[i].imshow(X_train.iloc[i,:].to_numpy().reshape(28,28))

# And a few images recovered from the compressed dataset,
# to compare with the originals
fig,axe = plt.subplots(2,5)
axe = axe.flatten()
for i in range(10):
    abc = axe[i].imshow(X_recovered[i,:].reshape(28,28))


## Conclusion: 

There is not much difference between the original images (first 10 images) and the images recovered after PCA compression (last 10 images).

Therefore, 
- PCA offers considerable scope to reduce the number of columns in an image.
- Even though the dimensionality of each image is reduced from 784 pixel values to 187 principal components, the image is still preserved.
- When the recovered image is plotted, it is clearly recognizable.
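The visual comparison can be complemented with a number. One possible measure, sketched here on synthetic data rather than the notebook's `X_train` (an assumption made to keep the example self-contained), is the relative reconstruction error. It should equal the fraction of variance that PCA discarded, so it stays below 5% at a 0.95 threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in (NOT Fashion-MNIST): 300 samples, 64 features
# driven by 8 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 8))
X_demo = latent @ rng.normal(size=(8, 64)) + 0.05 * rng.normal(size=(300, 64))

# Compress to 95% explained variance, then reconstruct.
pca_demo = PCA(n_components=0.95)
X_back = pca_demo.inverse_transform(pca_demo.fit_transform(X_demo))

# Relative error: squared reconstruction error over total centered variation.
centered = X_demo - X_demo.mean(axis=0)
rel_error = np.sum((X_demo - X_back) ** 2) / np.sum(centered ** 2)
retained = pca_demo.explained_variance_ratio_.sum()

print(f"relative reconstruction error: {rel_error:.4f}")
print(f"variance not retained:         {1 - retained:.4f}")
```

Applying the same calculation to `X_train` and `X_recovered` would give a figure for this dataset; no such figure is reported here.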
