# Notebook 1 - Introduction and Exploration of MNIST Data 
## Objectives :
- Learn about the MNIST dataset
- Load and visualize the MNIST dataset using Scikit-learn, PyTorch, and TensorFlow/Keras
- Compare the data handling process across these libraries
- Learn more about PyTorch and TensorFlow/Keras during this process

## Section 1 : Introduction
### About MNIST dataset
- MNIST is an acronym for the Modified National Institute of Standards and Technology dataset.

- It is a dataset of 70,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.

- The task is to classify a given image of a handwritten digit into one of 10 classes representing the integer values from 0 to 9, inclusive.

- I want to achieve this using different models, evaluate each of them, and learn more about different Machine Learning and Deep Learning techniques along the way.

## Section 2 : Import required libraries
### NumPy
- NumPy (Numerical Python) is a fundamental library for scientific computing in Python
- NumPy is essential for numerical computations in Python and serves as the foundation for many other scientific computing libraries

### Pandas
- Pandas is a powerful library for data analysis and manipulation 
- Pandas is widely used in data science workflows for preparing and analyzing datasets

### Matplotlib and Pyplot
- Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python
- Pyplot is a MATLAB-like interface built on top of Matplotlib, providing a simpler API for basic plotting tasks

### Scikit-learn (Sklearn)
- Scikit-learn is a machine learning library for Python 
- Key features include:
    - Provides a wide range of algorithms for classification, regression, clustering, and more
    - Offers tools for data preprocessing, feature selection, and model evaluation
    - Features cross-validation methods for assessing model performance
    - Designed to be easy to use, efficient, and robust
- Scikit-learn is a popular choice for implementing machine learning models in Python.

### PyTorch
- PyTorch is a deep learning framework for Python 
- Key characteristics:
    - Excels in tensor computation acceleration using GPU
    - Uses a dynamic computation graph, allowing for flexible experimentation
    - Well-suited for research-oriented projects, including natural language processing
    - Often regarded as more flexible than TensorFlow for experimentation, although the best choice between the two depends on the project
- PyTorch is particularly useful for neural network development and is favored by researchers.

### TensorFlow
- TensorFlow is another popular deep learning framework 
- Key points:
    - Developed by Google
    - Known for its production-ready features
    - Often considered well suited for large-scale deployment
    - Runs eagerly by default in TensorFlow 2 and can compile code into a static computation graph with `tf.function`, which can be beneficial for certain applications
- TensorFlow is often preferred for production environments and larger-scale projects.

### Keras
- Keras is a high-level neural networks API 
- Key features:
    - Written in Python
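    - Originally capable of running on top of either TensorFlow or Theano; it now ships with TensorFlow as `tf.keras`, and Keras 3 also supports JAX and PyTorch backends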
    - Designed to be easy to use and accessible to beginners
    - Offers a simple interface for building deep learning models
- Keras provides a convenient way to implement neural networks without worrying about the underlying computational engine.

These libraries form the core of many data science and machine learning pipelines in Python, each serving a specific purpose in data analysis, visualization, and modeling.

In [None]:
# import libraries that are useful for us
import numpy as np
import pandas as pd
# import sklearn
import torch
import tensorflow as tf
from tensorflow import keras
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## Section 3 : Loading MNIST data and Exploration
### Using Scikit-learn

In [None]:
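# Fetch the data from OpenML; as_frame=True returns pandas objects,
# which the later cells rely on when calling .to_numpy()
from sklearn.datasets import fetch_openml
mnist_skl = fetch_openml("mnist_784", version=1, as_frame=True)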

In [None]:
# Check contents of data
mnist_skl.keys()

In [None]:
# description of dataset
mnist_skl.DESCR

In [None]:
# Check the data and target labels
mnist_skl.data

In [None]:
mnist_skl.target

In [None]:
print(f"data shape: {mnist_skl.data.shape}")
print(f"target shape: {mnist_skl.target.shape}")

- `mnist_skl.data` is a DataFrame with one row per image; each row stores the 784 pixel values of a flattened 28x28 image.
- To display an image, we convert the DataFrame to a NumPy array and reshape a row into 28x28.
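
The reshape step can be sketched on synthetic data, without downloading anything. The array `flat_image` below is a hypothetical stand-in for one row of `mnist_skl.data`; the sketch assumes the 784 values are stored row by row.

```python
import numpy as np

# Hypothetical stand-in for one row of mnist_skl.data: 784 values, stored row by row.
flat_image = np.arange(784, dtype=np.float64)

# Reshape the flat vector into a 28x28 grid.
# The pixel at (row, col) comes from flat index row * 28 + col.
grid_image = flat_image.reshape(28, 28)

# A whole batch can be reshaped at once; -1 lets NumPy infer the number of images.
flat_batch = np.zeros((5, 784))
grid_batch = flat_batch.reshape(-1, 28, 28)

print(grid_image.shape, grid_batch.shape)
```

Reshaping only changes how the values are indexed, so no pixel data is lost or reordered.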

In [None]:
# Select six arbitrary image indices
rndm_img_idx = [3245, 36000, 5, 16459, 59999, 45001]

In [None]:
# Visualize the selected images in a 2x3 grid
images = mnist_skl.data.to_numpy()
for i, idx in enumerate(rndm_img_idx):
    plt.subplot(2, 3, i + 1)
    # Each row holds 784 pixel values; reshape it to 28x28 for display
    plt.imshow(images[idx].reshape(28, 28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

- From the images, the digits look like [6, 9, 2, 4, 8, 1].
- Let's verify it.


In [None]:
rndm_digits = [mnist_skl.target[index] for index in rndm_img_idx]
print(f"random digits: {rndm_digits}")

Looks like we are correct.
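
One detail worth noting: `fetch_openml` typically returns the MNIST targets as strings (for example `'6'`), not integers, so the printed list may show quoted digits. Below is a minimal sketch of the conversion, using a hypothetical NumPy array in place of `mnist_skl.target`.

```python
import numpy as np

# Hypothetical stand-in for the labels checked above; the real values come from mnist_skl.target.
string_labels = np.array(['6', '9', '2', '4', '8', '1'])

# Parse the digit strings into integers so they can be compared with integer labels.
int_labels = string_labels.astype(np.int64)

print(int_labels.tolist())
```

On the real data, something like `mnist_skl.target.astype(int)` should do the same job, assuming the targets are a pandas Series of digit strings.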

### Using PyTorch

In [None]:
# Import necessary libraries
from torchvision import datasets, transforms

In [None]:
# Define data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
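
To make the effect of this pipeline concrete, here is a NumPy-only sketch of the arithmetic. It assumes `ToTensor` rescales 8-bit pixel values from [0, 255] to [0.0, 1.0] and that `Normalize((0.5,), (0.5,))` then computes `(x - 0.5) / 0.5`. The helper names `to_unit_range` and `normalize` are hypothetical and are not part of torchvision.

```python
import numpy as np

def to_unit_range(pixels_uint8):
    """Mimic the scaling part of ToTensor: uint8 [0, 255] -> float32 [0.0, 1.0]."""
    return pixels_uint8.astype(np.float32) / 255.0

def normalize(x, mean=0.5, std=0.5):
    """Mimic Normalize for a single channel: subtract the mean, divide by the std."""
    return (x - mean) / std

# The darkest and brightest possible pixel values.
raw = np.array([0, 255], dtype=np.uint8)

scaled = to_unit_range(raw)      # values in [0, 1]
normalized = normalize(scaled)   # values in [-1, 1]

print(scaled, normalized)
```

With a mean and standard deviation of 0.5, the pixel range [0, 1] is mapped to [-1, 1]. `ToTensor` also adds a leading channel dimension, which is why each sample has shape `(1, 28, 28)`.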

In [None]:
# Load the dataset
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

In [None]:
train_dataset

In [None]:
test_dataset

In [None]:
# access individual samples using indexing
image, label = train_dataset[0]
print(image.shape, label)

In [None]:
# Run the same label check as in the Scikit-learn section
rndm_digits_pytrc = [train_dataset[index][1] for index in rndm_img_idx]
print(f"random digits: {rndm_digits_pytrc}")

In [None]:
# Function to plot image
def plot_image(image, label, ax=None):
    if ax is None:
        ax = plt.gca()
    ax.imshow(image.squeeze(), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Label: {label}', fontsize=8)
    ax.axis('off')

In [None]:
# Plot the images at our selected indices
fig, axs = plt.subplots(2, 3, figsize=(3, 2))
for i, idx in enumerate(rndm_img_idx):
    image, label = train_dataset[idx]
    plot_image(image, label, axs[i//3, i%3])
plt.tight_layout()
plt.show()

### Using TensorFlow/Keras

In [None]:
# Import necessary libraries
from keras.datasets import mnist

In [None]:
# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

In [None]:
# Run the same label check as in the Scikit-learn section
rndm_digits_tfk = [y_train[index] for index in rndm_img_idx]
print(f"random digits: {rndm_digits_tfk}")

In [None]:
# Plot the images at our selected indices, reusing plot_image from the PyTorch subsection
fig, axs = plt.subplots(2, 3, figsize=(3, 2))
for i, idx in enumerate(rndm_img_idx):
    image, label = X_train[idx], y_train[idx]
    plot_image(image, label, axs[i//3, i%3])
plt.tight_layout()
plt.show()

## Conclusion
In this brief investigation, we learned the following:
- We learned to load MNIST data using Scikit-learn, PyTorch, and TensorFlow/Keras
- The MNIST dataset is widely regarded as effectively solved, but it remains a useful starting point for developing and practicing a methodology for solving image classification tasks
- There are 60,000 examples in the training dataset and 10,000 in the test dataset, and the images are indeed square with 28×28 pixels (for the PyTorch and TensorFlow/Keras datasets)
- Similarly, the Scikit-learn dataset is a single set of 70,000 examples, with each image stored as a flat row of 784 pixel values
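
The three layouts hold the same pixels in different shapes. The sketch below illustrates the relationship on a small synthetic batch; `keras_style` is a hypothetical stand-in for a slice of `X_train`, and the PyTorch-style array assumes the `ToTensor` scaling to [0, 1] without the later normalization step.

```python
import numpy as np

# Hypothetical batch of four 28x28 uint8 images, shaped like a slice of Keras' X_train.
keras_style = (np.arange(4 * 28 * 28) % 256).astype(np.uint8).reshape(4, 28, 28)

# Scikit-learn layout: one flat row of 784 values per image.
sklearn_style = keras_style.reshape(len(keras_style), -1)

# PyTorch-style layout after ToTensor: float values in [0, 1] with a channel axis.
torch_style = keras_style[:, np.newaxis, :, :].astype(np.float32) / 255.0

# Flattening is lossless: reshaping back recovers the original images.
restored = sklearn_style.reshape(-1, 28, 28)

print(keras_style.shape, sklearn_style.shape, torch_style.shape)
```

Because the conversions are plain reshapes and rescalings, the choice of loader mainly affects the preprocessing code, not the underlying data.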

In the next notebook, I will apply basic ML classifiers such as Decision Tree and Random Forest with Scikit-learn and evaluate their performance.