
# Machine Learning with Python

Welcome to the **Machine Learning** course! This course is designed to give you hands-on experience with the foundational concepts and advanced techniques in machine learning. You will explore:

1. **Supervised Learning**
    - Regression algorithms
    - Classification algorithms
2. **Unsupervised Learning**
    - Clustering algorithms
    - Dimensionality reduction
3. **Fairness and Interpretability**
    - Interpretable methods
    - Bias evaluation
    
Throughout the course, you'll engage in projects to solidify your understanding and gain practical skills in implementing machine learning algorithms.  

Instructor: Dr. Adrien Dorise  
Contact: adrien.dorise@hotmail.com  

---


## Part 2.2: Unsupervised learning - Dimensionality reduction on the MNIST dataset
In this project, you will use dimensionality reduction to get a better representation of the MNIST dataset, and analyse its impact on classification. The tasks will include:  

1. **Import and Understand a Dataset**: Learn how to load, preprocess, and explore a dataset to prepare it for training.
2. **Perform dimensionality reduction on a dataset**: Learn to transform a high-dimensional dataset into a 2D dataset for better visualisation.
3. **Train a classification model**: Select and train a classification model using scikit-learn.
4. **Evaluate and plot the model performance**: Select a criterion against which you can evaluate the model, and plot the result.
5. **Compare multiple classification models, and get the best performance**: Compare multiple models, and find the one that best fits the data.

By the end of this project, you'll have a solid understanding of what dimensionality reduction can do, and how it can improve model performance.

---

## Dataset

This exercise will use the **MNIST dataset** (https://en.wikipedia.org/wiki/MNIST_database).  
MNIST is a benchmark dataset of 70,000 handwritten digits (0–9), each represented as a 28×28 grayscale image.  
It is widely used for training and testing image classification algorithms in machine learning and computer vision.  

<img src="doc/MNIST_example.png" alt="MNIST" width="1000"/>  

The code snippet below allows you to load the dataset.

In [None]:
from sklearn.datasets import fetch_openml


# Import MNIST
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target'].astype(int)

## Data visualisation

The visualisation of the MNIST dataset is already given to you.  

The code snippet below:
- Plots the first 10 samples of the dataset, with their class labels.
    - As the samples are 28x28 images, you can use `plt.imshow(X[0].reshape(28, 28), cmap='gray')`

**Your job**
- Print the number of samples in the dataset
- Print the number of samples per class


In [None]:
import matplotlib.pyplot as plt

# Plot the first 10 MNIST samples with their labels
plt.figure(figsize=(10, 2))
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(X[i].reshape(28, 28), cmap='gray')
    plt.title(f"{y[i]}")
    plt.axis('off')
plt.suptitle("First 10 MNIST Digits with Labels")
plt.tight_layout()
plt.show()

In [None]:
import numpy as np



## Data preparation and classification

As for the Iris dataset in part 1.2, you will have to train a classification model on the MNIST dataset.  
This time, I also want you to print the execution time of your model for both training and prediction.  

**Your job:**
- Split features and targets using the holdout method.
- Select a classifier.
- Train the classifier
    - **Record the training time**
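    - You can use the *time* package
        - `import time`
        - `start = time.time()` before the call, then `elapsed = time.time() - start` after it
        - Do not name the variable `time`, as this would shadow the module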
- Evaluate the model:
    - Make predictions on the test set
    - **Record the prediction time**
    - Print the confusion matrix
    - Print the accuracy of your model
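
The steps above can be sketched as follows. This is one possible outline, not the expected solution. It makes several assumptions: the small digits dataset bundled with scikit-learn (8x8 images, loaded with `load_digits`) stands in for MNIST so the example runs quickly, the classifier is a k-nearest-neighbours model chosen arbitrarily, and `time.perf_counter()` is used as a higher-resolution alternative to `time.time()`.

```python
import time

from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled 8x8 digits dataset stands in for MNIST
X_demo, y_demo = load_digits(return_X_y=True)

# Holdout method: 80% of the samples for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Arbitrary classifier choice for illustration
clf = KNeighborsClassifier(n_neighbors=3)

# Record the training time
start = time.perf_counter()
clf.fit(X_train, y_train)
train_time = time.perf_counter() - start

# Record the prediction time
start = time.perf_counter()
y_pred = clf.predict(X_test)
predict_time = time.perf_counter() - start

cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(f"Training time:   {train_time:.4f} s")
print(f"Prediction time: {predict_time:.4f} s")
print(f"Accuracy:        {accuracy:.3f}")
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes. `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` can be used instead if a plot is preferred.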

In [None]:
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import time



## Dimensionality reduction


**Your job:**
- Select a dimensionality reduction algorithm
- Apply a standard scaling to the data
- Apply the algorithm on the dataset
    - Note that here we aim to improve the classification, not the visualisation
    - *No need to force the dimension to 2D or 3D.*
- Print the new number of features.
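
One way to approach these steps is sketched below, under stated assumptions: PCA is picked as the reduction algorithm, the bundled digits dataset (`load_digits`) again stands in for MNIST, and the 95% explained-variance threshold is an arbitrary illustrative choice rather than a recommended value.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits dataset stands in for MNIST
X_demo, _ = load_digits(return_X_y=True)

# Standard scaling: zero mean and unit variance for each feature
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components in (0, 1) keeps as many components as needed
# to explain that fraction of the variance (95% is an arbitrary choice)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(f"Original number of features: {X_demo.shape[1]}")
print(f"Reduced number of features:  {X_reduced.shape[1]}")
print(f"Explained variance kept:     {explained:.3f}")
```

For a stricter evaluation, the scaler and the PCA could be fitted on the training split only and then applied to the test split, so that no information from the test set leaks into the transformation.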

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn.manifold import TSNE



## Classification on reduced dataset

We are now going to evaluate the performance of your classifier on the reduced dataset.

**Your job:**
- Redo the steps to train the classifier, but with the reduced dataset
    - Train/test split
    - Classifier fitting while recording time
    - Classifier prediction while recording time
    - Compute confusion matrix and accuracy
- Conclude on the results 
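
To support the conclusion, the two runs can be compared side by side. The sketch below wraps the split, fit, predict and timing steps in a hypothetical helper named `timed_evaluation` (not part of scikit-learn) and calls it on the original and the reduced features. As before, the bundled digits dataset, the k-nearest-neighbours classifier and the 95% variance threshold are assumptions made for illustration.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def timed_evaluation(X, y):
    """Hypothetical helper: holdout split, timed fit and predict, accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = KNeighborsClassifier(n_neighbors=3)

    start = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = clf.predict(X_test)
    predict_time = time.perf_counter() - start

    return {
        "n_features": X.shape[1],
        "train_time_s": train_time,
        "predict_time_s": predict_time,
        "accuracy": accuracy_score(y_test, y_pred),
    }


# Assumption: the bundled 8x8 digits dataset stands in for MNIST
X_demo, y_demo = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)
X_reduced = PCA(n_components=0.95, svd_solver="full").fit_transform(X_scaled)

results = {
    "original": timed_evaluation(X_demo, y_demo),
    "reduced": timed_evaluation(X_reduced, y_demo),
}

for name, metrics in results.items():
    print(
        f"{name:>8}: {metrics['n_features']} features, "
        f"fit {metrics['train_time_s']:.4f} s, "
        f"predict {metrics['predict_time_s']:.4f} s, "
        f"accuracy {metrics['accuracy']:.3f}"
    )
```

Timings on such a small dataset are noisy, so differences are likely to be clearer on the full 784-feature MNIST data. When concluding, consider both the change in accuracy and the change in execution time.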