# Assignment 8: Dimensionality Reduction for Supervised Learning

Previous version is from MLEARN 510 course materials in 2021, ML510-Assignment8-Solution.ipynb. <br>
Modified and Extended by Ernst Henle.<br>
Copyright © 2024 by Ernst Henle

# Learning Objectives
- Be able to decide how to apply principal component analysis to training and test data.
- Produce a dimensionality reduction model.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

## MNIST Data
We will use the MNIST ("Modified National Institute of Standards and Technology") dataset to demonstrate dimensionality reduction for supervised learning.
<br>
The [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) is a standard dataset of 70,000 images of hand-written digits.  Each image is 28-by-28 ($28 \times 28 = 784$) pixels and contains one hand-written digit.  Each image occupies one row of the CSV file or NumPy array.  The first 60,000 rows are training images.  The last 10,000 rows are test images.  
The dataset can be downloaded from many websites including Canvas as mnist_784.csv.  The most convenient source of the dataset is through `fetch_openml` in `sklearn.datasets`.

In [None]:
#Load the MNIST dataset
import time
t0 = time.time()
mnist = fetch_openml('mnist_784', parser='pandas')
print("Data loading took {:.2f}s".format(time.time() - t0))
X_byte = mnist['data'].to_numpy()
y = mnist['target'].to_numpy()
mnist = None

In [None]:
# import time
# import pandas as pd
# mnist_start_time = time.time()
# mnist = pd.read_csv('../data/mnist_784.csv') # 14 sec
# print("MNIST read elapsed time: ", time.time() - mnist_start_time)
# X_byte = mnist.drop(columns=['class'], inplace=False).to_numpy()
# y = mnist['class'].to_numpy()
# mnist = None

In [None]:
# Basic EDA

# Show the shapes and value range of the input features and the target
print('Shape of input features is:', X_byte.shape, ' Shape of target(digits) is:', y.shape)
print('Range of input features is from', X_byte.min(),'to', X_byte.max())

# Show the distribution of digits
print('###########\n Distribution of digits:')
labels, counts = np.unique(y, return_counts=True)
display(pd.DataFrame([counts], columns=labels, index=['counts']))

print('###########\n Sample of input features:')
display(X_byte[0:5,400:410])
print('###########\n')
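# Plot one randomly chosen image with its label as the title
import matplotlib.pyplot as plt
# Draw the index from the actual number of rows instead of a hard-coded 70000
rand_i = np.random.randint(low=0, high=X_byte.shape[0])
# Pass the colormap directly so that no empty extra figure is created
plt.matshow(X_byte[rand_i,:].reshape((28,28)).astype(float), cmap='gray')
plt.title(f'Digit: {y[rand_i]}')
plt.show()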

### Feature Scaling
[Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) for image data is different from feature scaling in many other machine learning applications.  Often, the best approach is simply to divide every feature by the maximum value across all features.  The following are explanations with examples: 
<br><br>
#### Argument for normalizing each feature individually
In many machine learning models the numeric input features are **not** of the same kind.  Given the description of a car, one may use the features weight, height, and age.  Although these three features are correlated, they mean different things and are on different scales.  For instance, it is meaningless to say that the age is bigger than the weight.  Also, when we combine these features, like we do in PCA, we will get a new kind of feature that is neither weight, height, nor age.  The purpose of normalization is to bring these very different features onto a similar scale prior to PCA.  Such normalization must be done individually, where the normalization factors are determined separately for each feature.
<br><br>
#### Argument against normalizing each feature individually
In some machine learning models the numeric input features **are** of the same kind.  The three input features for a box might be height, width and length.  All three input features are spatial dimensions and are on the same scale.  For instance, if we rotate the box we might switch the values of height and width.  When we combine spatial dimensions, as we do in PCA, then the result is still a spatial dimension.  If the features are already in the same units, then individual normalization may be counterproductive.  If we do normalize, then all related features should be normalized with the same normalization parameters to preserve the relative differences between features.
<br><br>
#### General conclusion
The conclusion is that in contrast to what we previously discussed about normalization, sometimes we should preserve the different ranges between features.
<br><br>
#### Feature scaling for our current dataset
In our current dataset, all the image features are pixel values in the range from 0 to 255.  We can directly compare one pixel value to another and a combination of pixel values will result in a composite pixel value.  In this situation, it is best to preserve the different ranges between features.  We can either not normalize at all or we can simply divide all features by the maximum pixel value in the whole dataset.  Thus all features are on the same 0 to 1 scale but any given feature may have a minimum higher than 0 or a maximum lower than 1.
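
The difference between the two scaling strategies can be sketched on a tiny made-up array. The array `X_toy` and all variable names below are hypothetical and are not part of the assignment data; the per-feature variant uses simple min-max scaling as one possible choice of individual normalization.

```python
import numpy as np

# Hypothetical toy "pixel" data: 4 samples, 3 features, all in the same units
X_toy = np.array([[0., 10., 200.],
                  [0., 20., 100.],
                  [0., 30., 250.],
                  [0., 40., 150.]])

# Global scaling: one factor for the whole array keeps the relative ranges
X_global = X_toy / X_toy.max()

# Per-feature min-max scaling: every non-constant column is stretched to [0, 1]
col_min = X_toy.min(axis=0)
col_range = X_toy.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
X_per_feature = (X_toy - col_min) / col_range

print('Column maxima after global scaling:     ', X_global.max(axis=0))
print('Column maxima after per-feature scaling:', X_per_feature.max(axis=0))
```

After global scaling the second column still has a much smaller maximum than the third, so the relative ranges survive. After per-feature scaling both columns reach 1 and that information is lost.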

In [None]:
# Scale the input features
X_max = X_byte.max()
X = X_byte/X_max

# Remove X_byte so that it is not accidentally used
X_byte = None

# Present a sample of the scaled input features
display(X[0:5,400:410])
print('Range of scaled input features is from', X.min(),'to', X.max())

## Question 1
Split the data into a training set and a test set
- the first 60,000 rows (images) are for training
- the last 10,000 rows (images) are for testing
- show the shapes of the training and test data sets
- show the distribution of digits in the test set.
   - Has the distribution changed from the original dataset?
   - What are the consequences for testing on an uneven distribution?

In [None]:
# The first 60000 rows are for training


# The last 10000 rows are for testing


# Show the shapes of the training and test data sets


# Show the distribution of digits in the test set



## Question 2
Train a Logistic Regression classifier on the dataset.
- The argument list must indicate that you want to do a multinomial logistic regression.
- Set `max_iter` to 1000. (Before you do, you may want to test your code with `max_iter` set to 100 for faster debugging.)
- Time the training using the `time` or `timeit` module and present the training time in seconds

There is no need to predict on the training data
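
As a rough illustration of the timing pattern only, the sketch below times a single function call with the standard `time` module. The function `slow_sum` is a hypothetical stand-in for an expensive call and has nothing to do with the classifier itself.

```python
import time

def slow_sum(n):
    """Hypothetical stand-in for an expensive call such as a model's .fit."""
    return sum(i * i for i in range(n))

# perf_counter is a monotonic clock intended for measuring short durations
t0 = time.perf_counter()
result = slow_sum(100_000)
elapsed = time.perf_counter() - t0

print(f"Elapsed time: {elapsed:.4f}s")
```

Only the call of interest sits between the two clock readings, so setup work such as creating objects is not included in the measurement.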

In [None]:
# Create multinomial logistic regression classifier


# Present the time it took for training (just ".fit")



## Question 3
Evaluate the resulting model on the test set.  Determine the accuracy.  For these purposes, accuracy is defined as <br><center>***correct predictions / all predictions***</center><br>  You can use the `.score` method of the logistic regression classifier, `metrics.accuracy_score` from sklearn, or some other method that calculates accuracy.
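
The definition above can be checked by hand on a few made-up labels. The arrays `y_true` and `y_pred` are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical true and predicted digit labels (5 predictions, 4 of them correct)
y_true = np.array(['3', '7', '1', '0', '7'])
y_pred = np.array(['3', '1', '1', '0', '7'])

# Element-wise comparison gives booleans; their mean is correct / all predictions
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.8
```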

In [None]:
# Get accuracy of model



## Question 4
Use PCA to analyze the data.  
- Train PCA on training data
- Present the explained variance (`.explained_variance_`) for each principal component in a scree plot
- Determine the minimum number of components to get 95% of the explained variance.
- Use the explained variance (`.explained_variance_`) to create a cumulative variance plot
- Create a lower dimensional dataset that has 95% of the explained variance and present the shape of the new dataset.
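
The mechanics of the cumulative-variance step can be sketched on a small synthetic dataset. Everything below (`X_demo`, `pca_demo`, `n_keep`, the 3 latent factors) is a made-up illustration under stated assumptions, not the MNIST solution, and other ways of finding the threshold are equally valid.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Fit PCA with all components so that every explained variance is available
pca_demo = PCA().fit(X_demo)

# Cumulative fraction of the total explained variance
cum_ratio = np.cumsum(pca_demo.explained_variance_) / np.sum(pca_demo.explained_variance_)

# Smallest number of components whose cumulative fraction reaches 95%
# (argmax returns the index of the first True; +1 converts index to count)
n_keep = int(np.argmax(cum_ratio >= 0.95) + 1)

# Keep only the leading n_keep columns of the transformed data
X_demo_reduced = pca_demo.transform(X_demo)[:, :n_keep]
print('Components kept:', n_keep, ' Reduced shape:', X_demo_reduced.shape)
```

Because the synthetic data has only 3 underlying factors, at most 3 components are needed here; real image data typically needs many more.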

In [None]:
# Train PCA on training data



In [None]:
# Show Scree Plot of explained variance



In [None]:
# Determine Cumulative Explained Variance


# Determine number of principal components necessary for 95% of explained variance
# Find the number of principal components necessary to achieve the minimum explained variance


# Plot Cumulative Explained Variance vs Principal Components



In [None]:
# Create reduced dataset that contains only the number of principal components necessary for 95% of explained variance


# Present shape of reduced dataset



## Question 5
Train a new Logistic Regression classifier on the reduced training dataset.  Use the same parameters (arguments) as before. 
- As before, time the training
- Was training much faster? Explain your results

In [None]:
# Create a multinomial logistic regression classifier on the reduced dataset


# Present the time it took for training (just ".fit")



## Question 6
1. Evaluate the new classifier 
  - Transform the test data using the PCA model that was trained on the training data
  - Remove the excess columns of the pca-transformed test data. 
  - Determine the accuracy of the logistic regression classifier trained on the principal components, using the test data and the same accuracy method as before.
2. Discuss how the accuracy compares to the previous classifier.  Discuss the speed vs. accuracy trade-off and in which cases you would accept a very slight drop in model performance in exchange for an x-fold speedup in training.
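
The key point of the first step, reusing the PCA fitted on the training data rather than fitting a new one on the test data, can be sketched on random toy arrays. The names `X_train_demo`, `X_test_demo`, and the choice `n_keep = 2` are hypothetical and unrelated to the MNIST results.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical train/test arrays with 6 features each
X_train_demo = rng.normal(size=(100, 6))
X_test_demo = rng.normal(size=(20, 6))

# PCA is fitted on the training rows only
pca_demo = PCA().fit(X_train_demo)

n_keep = 2  # hypothetical number of retained components

# Both sets are projected with the same fitted PCA, then truncated to n_keep columns
Z_train = pca_demo.transform(X_train_demo)[:, :n_keep]
Z_test = pca_demo.transform(X_test_demo)[:, :n_keep]

# The test projection uses the training mean and the training components
Z_test_manual = (X_test_demo - pca_demo.mean_) @ pca_demo.components_[:n_keep].T
print('Shapes:', Z_train.shape, Z_test.shape)
print('Manual projection matches:', np.allclose(Z_test, Z_test_manual))
```

Fitting a second PCA on the test rows would produce different axes, so a classifier trained on `Z_train` would receive inputs in a different coordinate system.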

In [None]:
# transform input features of test set according to training set PCA


# Remove the excess columns of the pca-transformed test data


# Use score method to get accuracy of model



## Question 7
Create a new text cell in your notebook and write a 50-100 word summary of your experience in this assignment (or a short description of your thinking in applying this week's learning to the solution). Include:
- What was your incoming experience with this model, if any?
- What steps did you take?
- What obstacles did you encounter?
- How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?)

<br> <br>
This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.