# 4. Intermediate Python, PCA, and t-SNE
In this lab, we will go through intermediate Python, PCA, and t-SNE on the MNIST dataset.

Since you probably have some experience with Python, we will not go through the basics. Instead, we will focus on a few intermediate topics (although still Python 101) that come up often in everyday coding.


## Immutable vs Mutable
In Python, there are two types of objects: immutable and mutable. Immutable objects cannot be changed once created, while mutable objects can be changed in place. Immutable objects include `int`, `float`, `bool`, `str`, `tuple`, and `frozenset`. Mutable objects include `list`, `dict`, `set`, and also objects from third-party packages, such as `numpy.ndarray` and `torch.Tensor`.

The following code contrasts an immutable object (a tuple) with a mutable one (a list).


In [1]:
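# Immutable
a = (1, 2, 3)  # tuple
b = a
try:
    a[0] = 3  # Error: a tuple cannot be modified in place
except TypeError as e:
    print(f"TypeError: {e}")
a = (3, 2, 1)  # OK: rebinds the name `a` to a new tuple
print(b)  # (1, 2, 3)

# Mutable
a = [1, 2, 3]  # list
b = a  # `a` and `b` now refer to the same list object
a[0] = 3  # OK: modifies the shared list in place
print(b)  # [3, 2, 3]
a = [3, 2, 1]  # OK: rebinds `a` to a new list; `b` keeps the old one
print(b)  # [3, 2, 3]

TypeError: 'tuple' object does not support item assignment
(1, 2, 3)
[3, 2, 3]
[3, 2, 3]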
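
Because assigning a list to a second name does not copy it, it can help to see how an alias differs from a real copy. The following is a minimal sketch using the standard-library `copy` module; the variable names are illustrative only.

```python
import copy

original = [[1, 2], [3, 4]]       # a list containing mutable inner lists
alias = original                  # same object, nothing is copied
shallow = copy.copy(original)     # new outer list, inner lists are shared
deep = copy.deepcopy(original)    # fully independent copy

original[0][0] = 99               # mutate an inner list in place
original.append([5, 6])           # mutate the outer list in place

print(alias is original)          # True
print(alias)                      # [[99, 2], [3, 4], [5, 6]]
print(shallow)                    # [[99, 2], [3, 4]]
print(deep)                       # [[1, 2], [3, 4]]
```

The shallow copy sees the change to the inner list but not the appended element, while the deep copy sees neither.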

## List Comprehension
List comprehension is a concise way to create lists. It is especially useful when you want to build a list from another iterable, optionally filtering its elements. The following code shows how to use list comprehension.


In [None]:
# Create a list of squares
squares = [x**2 for x in range(10)]
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# Create a list with condition
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares) # [0, 4, 16, 36, 64]
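
To make the syntax less magical, the sketch below writes the conditional comprehension above as an explicit loop, and then shows that the same syntax also builds dictionaries and sets. The names `square_of` and `last_digits` are illustrative.

```python
# A list comprehension and its equivalent explicit loop
even_squares = [x**2 for x in range(10) if x % 2 == 0]

even_squares_loop = []
for x in range(10):
    if x % 2 == 0:
        even_squares_loop.append(x**2)

print(even_squares == even_squares_loop)  # True

# The same syntax builds dictionaries and sets
square_of = {x: x**2 for x in range(5)}
last_digits = {x**2 % 10 for x in range(10)}
print(square_of)            # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
print(sorted(last_digits))  # [0, 1, 4, 5, 6, 9]
```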

## Function Arguments and Parameters Type
Python functions accept two kinds of arguments: positional arguments, which are matched to parameters by position, and keyword arguments, which are passed as `name=value` and matched by name. A parameter with a default value (such as `c=1`) is optional. In a function definition, `*args` collects any extra positional arguments into a tuple, and `**kwargs` collects any extra keyword arguments into a dictionary. The following code shows both.


In [None]:
def func1(a, b, c=1, d=2):
    print(a, b, c, d)

func1(1, 2) # 1 2 1 2

def func2(a, b, *args, **kwargs):
    print(a, b, args, kwargs)

func2(1, 2, 3, 4, 5, c=6, d=7) # 1 2 (3, 4, 5) {'c': 6, 'd': 7}
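
The `*` and `**` operators also work in the opposite direction at the call site, where they unpack a sequence or a dictionary into arguments. The sketch below uses a hypothetical function `describe` to show this, along with the fact that parameters declared after `*args` can only be passed by keyword.

```python
def describe(a, b, *args, scale=1, **kwargs):
    """Return the received arguments so the binding rules can be inspected."""
    return a, b, args, scale, kwargs

values = [1, 2, 3]
options = {"scale": 10, "name": "demo"}

# * unpacks a sequence into positional arguments,
# ** unpacks a dict into keyword arguments
result = describe(*values, **options)
print(result)  # (1, 2, (3,), 10, {'name': 'demo'})

# `scale` comes after *args, so it is keyword-only:
# extra positional values go to args, not to scale
print(describe(1, 2, 3, 4))  # (1, 2, (3, 4), 1, {})
```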

## Boilerplate Code `if __name__ == '__main__'`
When you run a Python file as a script, all of its top-level code is executed. The same happens when the file is imported as a module, which is usually not what you want if you only need the functions it defines. Python sets the variable `__name__` to `'__main__'` only when the file is run directly, so code placed under `if __name__ == '__main__':` runs when the file is executed as a script but is skipped on import. The following code shows this pattern.


In [None]:
# script.py
def func1():
    print('func1')

if __name__ == '__main__':
    func1()
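
To see the guard in action without leaving the notebook, the sketch below writes the same `script.py` to a temporary folder, then imports it and runs it as a script while capturing what each prints. The helper names `captured_output` and `import_as_module` are made up for this illustration; `runpy.run_path(..., run_name="__main__")` is used here to mimic `python script.py`.

```python
import contextlib
import importlib.util
import io
import runpy
import tempfile
from pathlib import Path

SOURCE = '''
def func1():
    print('func1')

if __name__ == '__main__':
    func1()
'''

def captured_output(action):
    """Run `action()` and return whatever it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        action()
    return buffer.getvalue()

def import_as_module(path):
    # On import, __name__ is the module name ("script"), so the guard is skipped
    spec = importlib.util.spec_from_file_location("script", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "script.py"
    script.write_text(SOURCE)
    printed_on_import = captured_output(lambda: import_as_module(script))
    # run_name="__main__" makes the guard evaluate to True
    printed_on_run = captured_output(
        lambda: runpy.run_path(str(script), run_name="__main__"))

print(repr(printed_on_import))  # ''
print(repr(printed_on_run))     # 'func1\n'
```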

## Python Ternary Operator
The ternary operator is used for inline conditional expressions. It is best used in simple, concise operations that are easily read. See [Python Ternary Operator](https://book.pythontips.com/en/latest/ternary_operators.html).

In [None]:
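# ternary operator: value_if_true if condition else value_if_false
a, b = 10, 20
smaller = a if a < b else b  # avoid the name `min`, which would shadow the built-in
print(smaller) # 10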
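
As a further small illustration, a ternary expression can be used anywhere a value is expected, such as in a `return` statement or inside a comprehension. The function `sign_label` below is a hypothetical example.

```python
def sign_label(x):
    # value_if_true if condition else value_if_false
    return "non-negative" if x >= 0 else "negative"

print(sign_label(3))   # non-negative
print(sign_label(-1))  # negative

# Ternary expressions compose with list comprehensions
parity = ["even" if n % 2 == 0 else "odd" for n in range(4)]
print(parity)  # ['even', 'odd', 'even', 'odd']
```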

## MNIST dataset
The MNIST dataset is a collection of handwritten digits with 60,000 training samples and 10,000 test samples. Each image has 28x28 pixels, and each pixel holds a grayscale value from 0 to 255. MNIST is one of the most common datasets for image classification and is available from many different sources. See [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The `mnist` package (installed as `python-mnist`) provides a class that loads the dataset from the four files listed below.

In [2]:
# load from mnist dataset: python-mnist
# train-images-idx3-ubyte.gz:  training set images (9912422 bytes)
# train-labels-idx1-ubyte.gz:  training set labels (28881 bytes)
# t10k-images-idx3-ubyte.gz:   test set images (1648877 bytes)
# t10k-labels-idx1-ubyte.gz:   test set labels (4542 bytes)
from mnist import MNIST
# Initialize the dataset
mndata = MNIST('../datasets/MNIST/') # change the path to the dataset folder
# Load the dataset into memory (this will search the four files above)
training_images, training_labels = mndata.load_training()
testing_images, testing_labels = mndata.load_testing()
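
scikit-learn expects NumPy arrays, so the loaded data usually needs converting first. The sketch below assumes that `load_training()` returns the images as a list of flat 784-element pixel lists and the labels as an array of integers; check this against your installed version. It uses a small synthetic stand-in rather than the real files, and `to_numpy` is a hypothetical helper.

```python
from array import array

import numpy as np

def to_numpy(images, labels):
    """Convert flat pixel lists and integer labels to NumPy arrays."""
    X = np.asarray(images, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
    y = np.asarray(labels, dtype=np.int64)
    return X, y

# Synthetic stand-in for (training_images, training_labels)
fake_images = [[0] * 784, [255] * 784, [128] * 784]
fake_labels = array("B", [7, 2, 1])

X_train, y_train = to_numpy(fake_images, fake_labels)
print(X_train.shape, y_train.shape)  # (3, 784) (3,)
print(X_train.min(), X_train.max())  # 0.0 1.0
```

With the real data, the call would be `to_numpy(training_images, training_labels)`.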

You can also load the MNIST dataset from the `keras` package. Keras is a high-level neural networks API written in Python. Older releases ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports TensorFlow, JAX, and PyTorch backends. It was developed with a focus on fast experimentation, that is, going from idea to result with as little delay as possible.
``` python
from keras.datasets import mnist
# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```
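
The Keras loader returns the images as `uint8` arrays of shape `(n_samples, 28, 28)`, whereas PCA and t-SNE in scikit-learn expect a 2-D array of shape `(n_samples, n_features)`. The sketch below shows the flattening step on a random stand-in array with the same layout; the names `X_images` and `X_flat` are illustrative.

```python
import numpy as np

# Synthetic stand-in with the same layout as the Keras arrays:
# uint8 images of shape (n_samples, 28, 28)
rng = np.random.default_rng(0)
X_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-dimensional row vector
X_flat = X_images.reshape(len(X_images), -1).astype(np.float32) / 255.0
print(X_flat.shape)  # (5, 784)

# Reshaping loses nothing: the first row restores the first image
restored = (X_flat[0] * 255.0).round().astype(np.uint8).reshape(28, 28)
print(np.array_equal(restored, X_images[0]))  # True
```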

You can also load the MNIST dataset through `torchvision`, the computer-vision companion package of PyTorch.
PyTorch is an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It was originally developed by Facebook's AI Research lab (FAIR, now part of Meta AI) and is released under the Modified BSD license.
``` python
import torch
import torchvision
import torchvision.transforms as transforms
# Load the MNIST dataset
# MNIST images have a single (grayscale) channel, so use one mean and one std
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])
trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                      download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
testset = torchvision.datasets.MNIST(root='./data', train=False,
                                     download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
```

## Tasks
1. Load the MNIST dataset.
2. Apply PCA and t-SNE on the MNIST dataset. Try the test dataset. t-SNE is time-consuming, so you can use a subset of the dataset.
3. Visualize the results of PCA and t-SNE.
4. Compare the results of PCA and t-SNE.
5. Discuss the pros and cons of PCA and t-SNE.
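
For task 2, one simple way to take a subset is to draw random row indices without replacement and apply them to both the data and the labels. The following is a minimal sketch with NumPy; `subsample` is a hypothetical helper, and the arrays are synthetic stand-ins for the flattened MNIST data.

```python
import numpy as np

def subsample(X, y, n_samples, seed=0):
    """Return a random subset of rows from X with the matching labels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx], y[idx]

# Synthetic stand-in for a flattened image set of 1,000 samples
X = np.zeros((1000, 784), dtype=np.float32)
y = np.arange(1000) % 10
X[np.arange(1000), 0] = y  # tag each row with its label to check alignment

X_small, y_small = subsample(X, y, n_samples=200)
print(X_small.shape, y_small.shape)             # (200, 784) (200,)
print(np.array_equal(X_small[:, 0], y_small))   # True
```

Fixing the seed keeps the subset reproducible, which makes the PCA and t-SNE plots comparable across runs.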

## PCA
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects high-dimensional data onto a lower-dimensional subspace. It keeps the directions along which the data varies the most and discards the directions with little variance. <br>
NOTE: the following is template code. You need to define `X_train` yourself as a 2-D array of shape `(n_samples, n_features)`, for example by flattening each 28x28 image into a 784-element vector.

In [6]:
from sklearn.decomposition import PCA
# Create a Randomized PCA model that takes two components
randomized_pca = PCA(n_components=2, svd_solver='randomized')
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(X_train)
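
To show what the projection computes, here is a minimal sketch of PCA written with NumPy: center the data, take a singular value decomposition, and project onto the leading right singular vectors. It uses an exact SVD rather than the randomized solver selected above, so it illustrates the idea and is not a reproduction of scikit-learn's implementation. The function `pca_svd` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """Project X onto its top principal components using an exact SVD."""
    X_centered = X - X.mean(axis=0)          # PCA works on centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]           # directions of largest variance
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

# Synthetic data: large variance along the first axis, small along the others
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])

reduced, components, variance = pca_svd(X, n_components=2)
print(reduced.shape)  # (500, 2)
# The first component should align (up to sign) with the high-variance axis
print(np.abs(components[0]).argmax())  # 0
```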

Have a look at the PCA results. Save the plot using `plt.savefig()` to a folder `labs/lab4` and name it `MNIST_PCA.png`.

In [None]:
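import matplotlib.pyplot as plt
import numpy as np
# Plot the data, one color per digit class
labels = np.asarray(y_train)  # boolean masking below needs a NumPy array
# 'white' would be invisible on the default white background, so use 'brown'
colors = ['black', 'blue', 'purple', 'yellow', 'brown', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][labels == i]
    y = reduced_data_rpca[:, 1][labels == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(list(map(str, range(10))))
plt.show()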
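
Saving the figure could look like the sketch below. It creates the output folder if it is missing and calls `savefig` on the figure before closing it. The helper `save_embedding_plot`, the `tab10` colormap, and the synthetic points are choices made for this example; the `Agg` backend is selected only so that the snippet also runs without a display, and can be dropped inside a notebook.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

def save_embedding_plot(points, labels, out_path):
    """Scatter-plot 2-D points colored by digit label and save to out_path."""
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    fig, ax = plt.subplots(figsize=(6, 6))
    scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
    # num=None asks for one legend entry per unique label value
    ax.legend(*scatter.legend_elements(num=None), title="digit")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path

# Synthetic stand-in for reduced_data_rpca and y_train
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)
points = rng.normal(size=(200, 2)) + labels[:, None]

saved = save_embedding_plot(points, labels,
                            os.path.join("labs", "lab4", "MNIST_PCA.png"))
print(os.path.exists(saved))  # True
```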

## t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps high-dimensional data to two or three dimensions suitable for human observation.

In [None]:
from sklearn.manifold import TSNE
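# Create a t-SNE model
# NOTE: recent scikit-learn releases rename `n_iter` to `max_iter`; use that name if `n_iter` is rejected
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)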
# Fit and transform the data to the t-SNE model
tsne_results = tsne.fit_transform(X_train)
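
Since t-SNE is slow on the full dataset, it can be useful to try the call on something small first. The sketch below runs it on 300 samples of scikit-learn's bundled `load_digits` data (8x8 images, not MNIST), which needs no download. The parameter values are illustrative rather than tuned, and the iteration-count parameter is left at its default.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# load_digits ships with scikit-learn: 8x8 digit images, flattened to 64 features
digits = load_digits()
X_small, y_small = digits.data[:300], digits.target[:300]

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_small)
print(embedding.shape)  # (300, 2)
```

The resulting `embedding` can be plotted in the same way as the PCA output, using `y_small` for the colors.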

Have a look at the t-SNE results. Save the plot using `plt.savefig()` to a folder `labs/lab4` and name it `MNIST_tSNE.png`.

In [None]:
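import numpy as np
# Plot the data, one color per digit class
labels = np.asarray(y_train)  # boolean masking below needs a NumPy array
# 'white' would be invisible on the default white background, so use 'brown'
colors = ['black', 'blue', 'purple', 'yellow', 'brown', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = tsne_results[:, 0][labels == i]
    y = tsne_results[:, 1][labels == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(list(map(str, range(10))))
plt.show()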