# Assignment 8: Dimensionality Reduction for Supervised Learning

Previous version is from MLEARN 510 course materials in 2021, ML510-Assignment8-Solution.ipynb. <br>
Modified and Extended by Ernst Henle.<br>
Copyright © 2024 by Ernst Henle

# Learning Objectives
- Be able to decide how to apply principal component analysis to training and test data
- Produce a dimensionality reduction model.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

## MNIST Data
We will use the MNIST ("Modified National Institute of Standards and Technology") dataset to demonstrate dimensionality reduction for supervised learning.
<br>
The [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) is a standard dataset of 70000 images of hand-written digits.  Each image is 28-by-28 ($28 \times 28 = 784$) pixels and contains one hand-written digit.  Each image occupies one row of the csv file or numpy array.  The first 60000 rows are training images.  The last 10000 rows are test images.  
The dataset can be downloaded from many websites including Canvas as mnist_784.csv.  The most convenient source of the dataset is through `fetch_openml` in `sklearn.datasets`.

In [None]:
#Load the MNIST dataset
import time
t0 = time.time()
mnist = fetch_openml('mnist_784', parser='pandas')
print("Data loading took {:.2f}s".format(time.time() - t0))
X_byte = mnist['data'].to_numpy()
y = mnist['target'].to_numpy()
mnist = None

In [None]:
# import time
# import pandas as pd
# mnist_start_time = time.time()
# mnist = pd.read_csv('../data/mnist_784.csv') # 14 sec
# print("MNIST read elapsed time: ", time.time() - mnist_start_time)
# X_byte = mnist.drop(columns=['class'], inplace=False).to_numpy()
# y = mnist['class'].to_numpy()
# mnist = None

In [None]:
# Basic EDA

# Show the shapes of the input features and the target (full dataset, before splitting)
print('Shape of input features is:', X_byte.shape, ' Shape of target(digits) is:', y.shape)
print('Range of input features is from', X_byte.min(),'to', X_byte.max())

# Show the distribution of digits
print('###########\n Distribution of digits:')
labels, counts = np.unique(y, return_counts=True)
display(pd.DataFrame([counts], columns=labels, index=['counts']))

print('###########\n Sample of input features:')
display(X_byte[0:5,400:410])
print('###########\n')
# Plot one of the images
import matplotlib.pyplot as plt
plt.gray()
rand_i = np.random.randint(low=0, high=X_byte.shape[0])  # pick a random row index
plt.matshow(X_byte[rand_i,:].reshape((28,28)).astype(float));    
plt.title(f'Digit: {y[rand_i]}')
plt.show();

### Feature Scaling
[Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) for image data is different than in many machine learning applications.  Often, the best scaling is simply to divide every feature by the maximum value found across all features.  The following are explanations with examples: 
<br><br>
#### Argument for normalizing each feature individually
In many machine learning models the numeric input features are **not** of the same kind.  Given the description of a car, one may use the features weight, height, and age.  Although these three features are correlated, they mean different things and are on different scales.  For instance, it is meaningless to say that the age is bigger than the weight.  Also, when we combine these features, like we do in PCA, we will get a new kind of feature that is neither weight, height, nor age.  The purpose of normalization is to bring these very different features onto a similar scale prior to PCA.  Such normalization must be done individually, where the normalization factors are determined separately for each feature.
<br><br>
#### Argument against normalizing each feature individually
In some machine learning models the numeric input features **are** of the same kind.  The three input features for a box might be height, width and length.  All three input features are spatial dimensions and are on the same scale.  For instance, if we rotate the box we might switch the values of height and width.  When we combine spatial dimensions, as we do in PCA, then the result is still a spatial dimension.  If the features are already in the same units, then individual normalization may be counterproductive.  If we do normalize, then all related features should be normalized with the same normalization parameters to preserve the relative differences between features.
<br><br>
#### General conclusion
The conclusion is that in contrast to what we previously discussed about normalization, sometimes we should preserve the different ranges between features.
<br><br>
#### Feature scaling for our current dataset
In our current dataset, all the image features are pixel values in the range from 0 to 255.  We can directly compare one pixel value to another and a combination of pixel values will result in a composite pixel value.  In this situation, it is best to preserve the different ranges between features.  We can either not normalize at all or we can simply divide all features by the maximum pixel value in the whole dataset.  Thus all features are on the same 0 to 1 scale but any given feature may have a minimum higher than 0 or a maximum lower than 1.
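
The difference between the two scaling strategies can be sketched on a tiny made-up array. The pixel values and the names `X_demo`, `global_scaled`, and `per_feature_scaled` are illustrative assumptions and are not part of the assignment data.

```python
import numpy as np

# Hypothetical 3 "images" with 3 "pixels" each (values in 0..255)
X_demo = np.array([[0, 100, 255],
                   [0,  50, 200],
                   [0,  10, 128]], dtype=float)

# Global scaling: one factor for the whole dataset, as done in this notebook
global_scaled = X_demo / X_demo.max()

# Per-feature min-max scaling: each column gets its own offset and range
col_min = X_demo.min(axis=0)
col_range = X_demo.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
per_feature_scaled = (X_demo - col_min) / col_range

print('Column maxima after global scaling:     ', global_scaled.max(axis=0))
print('Column maxima after per-feature scaling:', per_feature_scaled.max(axis=0))
```

With global scaling, the middle column keeps a smaller maximum than the last column, so the relative brightness of the pixels is preserved. Per-feature scaling stretches every non-constant column to the full 0 to 1 range and removes that information.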

In [None]:
# Scale the input features
X_max = X_byte.max()
X = X_byte/X_max

# Remove X_byte so that it is not accidentally used
X_byte = None

# Present a sample of the scaled input features
display(X[0:5,400:410])
print('Range of scaled input features is from', X.min(),'to', X.max())

## Question 1
Split the data into a training set and a test set
- the first 60,000 rows (images) are for training
- the last 10,000 rows (images) are for testing
- show the shapes of the training and test data sets
- show the distribution of digits in the test set.
   - Has the distribution changed from the original dataset?
   - What are the consequences for testing on an uneven distribution?

In [None]:
# The first 60000 rows are for training
X_train = X[:60000]
y_train = y[:60000]

# The last 10000 rows are for testing
X_test = X[60000:]
y_test = y[60000:]

# Show the shapes of the training and test data sets
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)
print("\n") # Add a newline for better readability

# Show the distribution of digits in the test set
print('###########\n Distribution of digits in the test set:')
test_labels, test_counts = np.unique(y_test, return_counts=True)
display(pd.DataFrame([test_counts], columns=test_labels, index=['counts']))

### Discussion for Question 1

The distribution of digits in the test set is relatively similar to the distribution in the overall dataset (shown in the EDA section). We can see that each digit from 0 to 9 has roughly 800 to 1200 instances in the test set of 10,000 images.

**Has the distribution changed from the original dataset?**
While there are slight variations in counts for each digit compared to the full dataset's distribution, the overall representation seems fairly consistent. No digit appears to be drastically over or underrepresented in the test set compared to its proportion in the combined training and test sets.
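
One way to make this comparison concrete is to put class proportions, rather than raw counts, side by side. The following is a minimal sketch on made-up labels; the helper `class_proportions` and the arrays `y_train_demo` and `y_test_demo` are hypothetical stand-ins for `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd

def class_proportions(labels):
    """Return the fraction of samples per class as a pandas Series."""
    values, counts = np.unique(labels, return_counts=True)
    return pd.Series(counts / counts.sum(), index=values)

# Hypothetical labels standing in for y_train and y_test
y_train_demo = np.array(['0', '1', '1', '2', '2', '2', '0', '1'])
y_test_demo = np.array(['0', '1', '2', '2'])

comparison = pd.DataFrame({
    'train': class_proportions(y_train_demo),
    'test': class_proportions(y_test_demo),
})
# Positive values mean the class is more frequent in the test set
comparison['difference'] = comparison['test'] - comparison['train']
print(comparison)
```

Differences close to zero for every class indicate that the test set has roughly the same class mix as the training set.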

**What are the consequences for testing on an uneven distribution?**
If the test set had a significantly different distribution from the training set, it could lead to misleading performance metrics. For example:
*   If a particular digit was much more frequent in the test set than in the training set, the model's performance on that digit would disproportionately affect the overall accuracy.
*   Conversely, if a digit was rare in the test set, the model's ability (or inability) to correctly classify that digit would have a smaller impact on the overall accuracy, potentially masking issues with classifying less frequent classes.
*   An uneven distribution can also make it harder to compare models if they are sensitive to class imbalances. Metrics like precision, recall, and F1-score per class become more important than overall accuracy in such scenarios.

In this specific case, the test set distribution appears reasonably balanced and reflective of the overall dataset, which is good for a general evaluation of the model.
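
The point about overall accuracy masking problems with rare classes can be sketched with a deliberately extreme, made-up example. The labels and predictions below are hypothetical and do not come from the MNIST model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical imbalanced test set: class '1' is rare and is never predicted
y_true_demo = ['0'] * 8 + ['1'] * 2
y_pred_demo = ['0'] * 10

overall_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Per-class precision, recall, and F1 as a nested dictionary
report = classification_report(
    y_true_demo, y_pred_demo, output_dict=True, zero_division=0
)

print(f"Overall accuracy: {overall_accuracy:.2f}")
for label in ['0', '1']:
    print(f"Recall for class {label}: {report[label]['recall']:.2f}")
```

The overall accuracy looks acceptable even though the rare class is missed entirely, which is why per-class metrics matter when the distribution is uneven.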

## Question 2
Train a Logistic Regression classifier on the dataset.
- The argument list must indicate that you want to do a multinomial logistic regression.
- Set  `max_iter` to 1000 (Before you set `max_iter` to 1000, you may want to test your code with `max_iter` set to 100 for faster debugging)  
- Time the training using the `time` or `timeit` module and present the training time in seconds

There is no need to predict on the training data
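
The two timing options mentioned above can be sketched as follows. The function `toy_workload` is a hypothetical stand-in for a call such as `model.fit(X_train, y_train)`; `time.perf_counter` is used here as one reasonable choice of clock from the `time` module.

```python
import time
import timeit

def toy_workload():
    """Stand-in for an expensive call such as model.fit(X_train, y_train)."""
    return sum(i * i for i in range(100_000))

# Option 1: wall-clock timing around a single call
start = time.perf_counter()
toy_workload()
elapsed_single = time.perf_counter() - start

# Option 2: timeit runs the callable `number` times and returns total seconds
elapsed_timeit = timeit.timeit(toy_workload, number=1)

print(f"time.perf_counter: {elapsed_single:.4f} s")
print(f"timeit.timeit:     {elapsed_timeit:.4f} s")
```

For a long-running fit, a single timed call is usually enough; repeated runs with `timeit` are more useful for short operations.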

In [None]:
from sklearn.linear_model import LogisticRegression
import time

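# Create a multinomial logistic regression classifier.
# solver='lbfgs' supports the multinomial loss with the default L2 penalty.
# Note: lbfgs does not use random_state (it only affects 'sag', 'saga', and 'liblinear'); it is kept for consistency.
# Note: scikit-learn 1.5 and later deprecate `multi_class`; multinomial is then used by default for 3 or more classes.
log_reg_full = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)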

# Time the training
print("Starting training of Logistic Regression on the full dataset...")
start_time_full = time.time()
log_reg_full.fit(X_train, y_train)
end_time_full = time.time()

training_time_full = end_time_full - start_time_full

# Present the time it took for training (just ".fit")
print(f"Training Logistic Regression on the full dataset took {training_time_full:.2f} seconds.")

## Question 3
Evaluate the resulting model on the test set.  Determine the accuracy.  For these purposes accuracy is defined as <br><center>***correct predictions / all predictions***</center><br>  You can use the `.score` method from logistic regression or the `metrics.accuracy_score` from sklearn or some other method that calculates accuracy.
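
As a minimal sketch of this definition, the following compares a manual calculation with `accuracy_score` on made-up labels. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the MNIST results.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
y_true_demo = np.array(['3', '7', '7', '1', '0'])
y_pred_demo = np.array(['3', '7', '1', '1', '0'])

# Accuracy by definition: correct predictions / all predictions
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

# The same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```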

In [None]:
# Get accuracy of model on the test set
accuracy_full = log_reg_full.score(X_test, y_test)
print(f"Accuracy of Logistic Regression on the full test dataset: {accuracy_full:.4f}")

## Question 4
Use PCA to analyze the data.  
- Train PCA on training data
- Present the explained variance (`.explained_variance_`) for each principal component in a scree plot
- Determine the minimum number of components to get 95% of the explained variance.
- Use the explained variance (`.explained_variance_`) to create a cumulative variance plot
- Create a lower dimensional dataset that has 95% of the explained variance and present the shape of the new dataset.
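
The 95% threshold step can be sketched on small synthetic data. The data, the names `X_demo` and `k_manual`, and the use of `svd_solver='full'` are assumptions made for illustration. As a cross-check, the sketch also passes a float to `n_components`, which asks scikit-learn's `PCA` to choose the number of components for that fraction of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with decreasing variance per column (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

# Fit PCA with all components and accumulate the explained variance ratio
pca_all = PCA(svd_solver='full').fit(X_demo)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
k_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Cross-check: a float n_components lets scikit-learn pick the count itself
pca_auto = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print(k_manual, pca_auto.n_components_)
```

The manual route is still worth doing in this assignment because it produces the scree and cumulative variance plots along the way.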

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt # Ensure matplotlib.pyplot is imported

# Train PCA on training data without initially specifying n_components
# This is to find out how many components are needed.
print("Starting PCA fitting to determine explained variance by all components...")
pca_full_analysis = PCA(random_state=42) # Added random_state for reproducibility
pca_full_analysis.fit(X_train)
print("PCA fitting complete.")

In [None]:
# Show Scree Plot of explained variance
plt.figure(figsize=(10, 6))
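# Number the principal components from 1 so the x-axis matches the component order
component_numbers = np.arange(1, len(pca_full_analysis.explained_variance_) + 1)
plt.plot(component_numbers, pca_full_analysis.explained_variance_, marker='o', linestyle='--')
plt.title('Scree Plot of Explained Variance for PCA Components')
plt.xlabel('Principal Component')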
plt.ylabel('Explained Variance (Eigenvalue)')
plt.grid(True)
plt.show()

In [None]:
# Determine Cumulative Explained Variance
cumulative_variance = np.cumsum(pca_full_analysis.explained_variance_ratio_)

# Determine number of principal components necessary for 95% of explained variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1 
# We add 1 because argmax returns a 0-based index

print(f"Number of principal components necessary for 95% explained variance: {n_components_95}")
print(f"Total explained variance with {n_components_95} components: {cumulative_variance[n_components_95-1]:.4f}")


# Plot Cumulative Explained Variance vs Number of Principal Components
# Use 1-based component counts on the x-axis so the axis matches its label
component_counts = np.arange(1, len(cumulative_variance) + 1)
plt.figure(figsize=(10, 6))
plt.plot(component_counts, cumulative_variance, marker='.')
plt.title('Cumulative Explained Variance vs. Number of Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% Explained Variance')
plt.axvline(x=n_components_95, color='g', linestyle='--', label=f'{n_components_95} Components for 95% Variance')
plt.legend(loc='best')
plt.grid(True)
plt.show()

In [None]:
# Create reduced dataset that contains only the number of principal components necessary for 95% of explained variance
print(f"Initializing PCA with {n_components_95} components...")
pca_95 = PCA(n_components=n_components_95, random_state=42) # Added random_state

print("Fitting PCA and transforming X_train...")
X_train_pca = pca_95.fit_transform(X_train)
print("Transformation complete.")

# Present shape of reduced dataset
print(f"Shape of original X_train: {X_train.shape}")
print(f"Shape of reduced X_train_pca (with {n_components_95} components): {X_train_pca.shape}")

## Question 5
Train a new Logistic Regression classifier on the reduced training dataset.  Use the same parameters (arguments) as before. 
- As before, time the training
- Was training much faster? Explain your results

In [None]:
# Create a multinomial logistic regression classifier on the reduced dataset
log_reg_pca = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000, random_state=42)

# Time the training
print(f"Starting training of Logistic Regression on the PCA-reduced dataset (X_train_pca with {X_train_pca.shape[1]} features)...")
start_time_pca = time.time()
log_reg_pca.fit(X_train_pca, y_train)
end_time_pca = time.time()

training_time_pca = end_time_pca - start_time_pca

# Present the time it took for training (just ".fit")
print(f"Training Logistic Regression on the PCA-reduced dataset took {training_time_pca:.2f} seconds.")

### Discussion for Question 5

**Was training much faster? Explain your results.**

Yes, training the Logistic Regression model on the PCA-reduced dataset was significantly faster than training on the original dataset. 

*   Training time on original data (784 features): [Refer to output from Q2, e.g., XX.YY seconds]
*   Training time on PCA-reduced data (~154 features for 95% variance): [Refer to output from Q5, e.g., AA.BB seconds]

The primary reason for this speedup is the reduction in the number of features (dimensionality). The original dataset had 784 features (pixels) per image. After applying PCA and retaining components that explain 95% of the variance, the number of features was reduced to a much smaller number (e.g., around 154, but this will depend on the exact result from Q4).

Logistic Regression, like many machine learning algorithms, has a computational complexity that depends on the number of samples and the number of features. By reducing the number of features, each iteration of the optimization algorithm within the logistic regression training process becomes computationally less expensive. This leads to a faster overall training time.
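
A back-of-the-envelope version of this argument is sketched below. The cost model (one gradient evaluation touches every sample, feature, and class once) is a simplifying assumption, and 154 is only the example value mentioned above; substitute the number of components from your own Question 4 result.

```python
# Rough per-iteration cost model (an assumption for illustration):
# one gradient evaluation touches every sample, feature, and class once.
n_samples = 60_000
n_classes = 10
n_features_full = 784
n_features_pca = 154  # example value from the discussion; use your own Q4 result

cost_full = n_samples * n_features_full * n_classes
cost_pca = n_samples * n_features_pca * n_classes

print(f"Weights to fit: {n_features_full * n_classes} vs {n_features_pca * n_classes}")
print(f"Estimated per-iteration cost ratio: {cost_full / cost_pca:.2f}x")
```

The measured speedup can differ from this ratio because the number of iterations needed for convergence may also change after the PCA transformation.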

## Question 6
1. Evaluate the new classifier 
  - Transform the test data using the PCA model that was trained on the training data
  - Remove the excess columns of the PCA-transformed test data. 
  - Determine the accuracy of the PCR (logistic regression) on the test data with the same accuracy method as before.
2. Discuss how the accuracy compares to the previous classifier.  Discuss the speed vs. accuracy trade-off and in which case you'd prefer a very slight drop in model performance for an x-fold speedup in training.
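
The transform-then-trim step can be sketched on synthetic data. The arrays, the value of `k`, and the choice of `svd_solver='full'` are assumptions for illustration. The sketch also shows that a PCA limited to `k` components gives the same result without a separate trimming step.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for X_train and X_test (illustrative only)
rng = np.random.default_rng(1)
X_train_demo = rng.normal(size=(200, 10))
X_test_demo = rng.normal(size=(50, 10))
k = 4  # hypothetical number of components to keep

# Fit on the training data only, keeping all components
pca_all = PCA(svd_solver='full').fit(X_train_demo)

# Transform the test data, then drop the excess columns
X_test_reduced = pca_all.transform(X_test_demo)[:, :k]

# Equivalent: fit a PCA limited to k components and transform directly
pca_k = PCA(n_components=k, svd_solver='full').fit(X_train_demo)
X_test_direct = pca_k.transform(X_test_demo)

print(X_test_reduced.shape, np.allclose(X_test_reduced, X_test_direct))
```

In both variants the PCA model is fitted on the training data only; the test data is never used to determine the components.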

In [None]:
# Transform input features of test set according to training set PCA (pca_95)
print(f"Transforming X_test using pca_95 ({pca_95.n_components_} components)...")
X_test_pca = pca_95.transform(X_test)
print("Transformation complete.")
print(f"Shape of X_test_pca: {X_test_pca.shape}")

# The step "Remove the excess columns of the pca-transformed test data" is already handled 
# because pca_95 was defined with n_components_95, so its transform method will only return that many components.

# Use score method to get accuracy of model on the transformed test set
accuracy_pca = log_reg_pca.score(X_test_pca, y_test)
print(f"Accuracy of Logistic Regression on the PCA-reduced test dataset: {accuracy_pca:.4f}")

### Discussion for Question 6

**1. Accuracy Comparison:**

*   Accuracy of Logistic Regression on full test data: [Refer to output from Q3, e.g., 0.YYYY]
*   Accuracy of Logistic Regression on PCA-reduced test data (95% variance): [Refer to output from Q6, e.g., 0.ZZZZ]

Typically, there is a slight drop in accuracy when using PCA-reduced data compared to the full dataset. This is because PCA, while preserving most of the variance, does discard some information (the variance associated with the dropped components). If this discarded information was relevant for classification, the model's performance might decrease. However, if the discarded variance was mostly noise or irrelevant to the digit classification task, the accuracy might remain very similar or, in some rare cases, even slightly improve (if PCA helps in regularizing the model or removing noise).
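
The information that PCA discards can be measured directly by mapping the reduced data back to the original space. The sketch below uses synthetic data; the names and the choice of 5 components are illustrative assumptions. It compares the fraction of variance lost in the reconstruction with the explained variance ratio reported by `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with decreasing variance per column (illustrative only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 12)) * np.linspace(4.0, 0.2, 12)

pca = PCA(n_components=5, svd_solver='full').fit(X_demo)

# Project to 5 components and map back to the original 12-dimensional space
X_reconstructed = pca.inverse_transform(pca.transform(X_demo))

# Fraction of the total variance that the reconstruction fails to recover
X_centered = X_demo - X_demo.mean(axis=0)
lost_fraction = np.sum((X_demo - X_reconstructed) ** 2) / np.sum(X_centered ** 2)

print(f"Variance retained: {pca.explained_variance_ratio_.sum():.4f}")
print(f"Variance lost:     {lost_fraction:.4f}")
```

The two printed values sum to 1 on the data used for fitting. Whether the lost part matters for classification depends on whether it carried signal or mostly noise.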

For the MNIST dataset, it's common to see a minor reduction in accuracy (e.g., from ~0.92 to ~0.91 or ~0.90) when using PCA to retain 95% of the variance. The exact numbers will depend on the run.

**2. Speed vs. Accuracy Trade-off:**

The trade-off is between faster training (and likely faster prediction, which was not measured here) and a potential decrease in accuracy.

*   **Speed Advantage:** As seen in Question 5, training on the reduced dataset is significantly faster. This is crucial for very large datasets or when models need to be retrained frequently. A reduction from, for example, 784 features to ~154 features makes a substantial difference.
*   **Accuracy Cost:** The drop in accuracy needs to be evaluated in the context of the application. If a 1-2% drop in accuracy is acceptable for a significant speedup (e.g., 5-10x faster training), then PCA is a valuable tool.

**In which case you'd prefer a very slight drop in model performance for an x-fold speedup in training:**

One might prefer a slight drop in performance for a significant speedup in several scenarios:
*   **Rapid Prototyping and Iteration:** When initially developing and testing many different models or hyperparameter settings, faster training allows for more experiments in a given timeframe.
*   **Large-Scale Datasets:** For datasets with millions of samples or thousands/millions of features, training on the full data might be computationally prohibitive or extremely time-consuming. Dimensionality reduction can make the problem tractable.
*   **Online Learning or Frequent Retraining:** If the model needs to be updated very frequently with new data, faster training cycles are essential.
*   **Resource-Constrained Environments:** If deploying models on devices with limited computational power or memory, a smaller model (due to fewer features) that is faster to run can be a requirement, even if it means a small sacrifice in accuracy.
*   **When "Good Enough" is Sufficient:** If the baseline accuracy is already very high and a slight drop still meets the business or application requirements, the computational savings might be prioritized.

The decision always depends on the specific problem, the acceptable performance threshold, and the available computational resources.
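
To put numbers on the trade-off, the measured times and accuracies can be condensed into a speedup factor and an accuracy drop. The helper `summarize_tradeoff` is hypothetical, and the values passed to it below are made up for illustration; they are not results from this notebook.

```python
def summarize_tradeoff(time_full, time_pca, acc_full, acc_pca):
    """Return the training speedup factor and the absolute accuracy drop."""
    speedup = time_full / time_pca
    accuracy_drop = acc_full - acc_pca
    return speedup, accuracy_drop

# Made-up measurements for illustration only. In this notebook you would pass
# training_time_full, training_time_pca, accuracy_full, and accuracy_pca.
speedup, accuracy_drop = summarize_tradeoff(
    time_full=40.0, time_pca=8.0, acc_full=0.90, acc_pca=0.89
)
print(f"Speedup: {speedup:.1f}x, accuracy drop: {accuracy_drop:.3f}")
```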

## My Summary for Assignment 8

*(Please replace the example text below with your own 50-100 word summary based on your experience with this assignment.)*

**Incoming Experience:** I had [some/limited/no] prior experience with PCA, primarily from [lectures/previous projects/self-study]. My understanding of its application for dimensionality reduction before supervised learning was [basic/theoretical/practical].

**Steps Taken:** I followed the notebook structure: loading and preprocessing the MNIST data, training a baseline Logistic Regression, then applying PCA to reduce dimensionality while retaining 95% variance. Subsequently, I trained another Logistic Regression on this reduced data and compared its performance and training speed to the baseline.

**Obstacles Encountered:** An initial challenge was [e.g., understanding the impact of feature scaling on PCA, or ensuring the PCA transformation was correctly applied to both training and test sets, or interpreting the cumulative variance plot to select the right number of components]. [Briefly mention how you overcame it, if applicable].

**Real-World Links & Further Learning:** This exercise directly mirrors real-world scenarios where high-dimensional data (like images or sensor readings) needs to be processed efficiently. PCA helps in reducing noise, training time, and computational cost. Missing steps for a full real-world problem might include more extensive hyperparameter tuning (e.g., for Logistic Regression or even for PCA components via cross-validation), exploring other dimensionality reduction techniques, and more rigorous error analysis per class. I'd like to learn more about [e.g., t-SNE for visualization, or non-linear dimensionality reduction methods].