# **Exploring PCA and Random Forest: Fashion MNIST Classification**

This project demonstrates the application of Principal Component Analysis (PCA) and Random Forest Classification on the Fashion MNIST dataset. The goal is to explore the effect of dimensionality reduction on model performance in terms of accuracy and computational time.


### **Importing Required Libraries**

This cell imports the necessary libraries for the project, including TensorFlow's Fashion MNIST dataset, visualization tools like Matplotlib, and machine learning utilities from Scikit-learn. These libraries will be used for data preprocessing, dimensionality reduction, and model evaluation.


In [1]:
from tensorflow.keras.datasets import fashion_mnist
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time

### **Loading the Dataset**

This cell loads the Fashion MNIST dataset, splitting it into training and testing sets. The shapes of the data arrays are printed to provide an overview of the dataset dimensions:

- `x_train` and `x_test` contain the image data.
- `y_train` and `y_test` contain the corresponding labels.


In [None]:
# Load the dataset
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

# Print the shape of the data
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
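
Beyond the shapes, it can help to confirm the pixel dtype, value range, and class balance before modeling. The sketch below is one way to do that. `summarize_dataset` is a hypothetical helper, not part of the notebook, and the random arrays are stand-ins (an assumption) so the snippet runs without downloading the data. With the real dataset, pass `x_train` and `y_train` instead.

```python
import numpy as np

def summarize_dataset(x, y, n_classes=10):
    """Return basic facts about an image array and its integer labels."""
    return {
        "n_samples": x.shape[0],
        "image_shape": x.shape[1:],
        "dtype": str(x.dtype),
        "min": int(x.min()),
        "max": int(x.max()),
        # Number of samples per class, indexed by label
        "class_counts": np.bincount(y, minlength=n_classes).tolist(),
    }

# Stand-in data (assumption): 100 random 28x28 uint8 images with labels 0-9
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=100)

summary = summarize_dataset(x_demo, y_demo)
print(summary)
```

A strongly uneven `class_counts` list would be a reason to look at per-class metrics rather than accuracy alone.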

### **Visualizing the Dataset**

This cell defines and uses a function `plot_sample_images` to display the first 25 training images in a 5x5 grid along with their labels. It provides a visual understanding of the data:

- Each image is shown in grayscale with its integer class label (0 to 9) displayed as the title.
- The function creates a subplot grid for better organization and layout.


In [None]:
# Display a grid of 5x5 images
def plot_sample_images(x, y, n=5):
    fig, axes = plt.subplots(n, n, figsize=(10, 10))
    axes = axes.ravel()  # Flatten the grid of axes
    for i in range(n * n):
        axes[i].imshow(x[i], cmap='gray')
        axes[i].set_title(f"Label: {y[i]}")
        axes[i].axis('off')
    plt.tight_layout()
    plt.show()

# Show sample images from the training set
plot_sample_images(x_train, y_train)
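
The titles in the grid are integer labels, which are hard to read at a glance. A small lookup table can turn them into class names. The sketch below follows the label order given in the Fashion MNIST documentation; `CLASS_NAMES` and `label_to_name` are hypothetical names introduced here for illustration.

```python
# Class names in label order (0-9), as listed in the Fashion MNIST documentation
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_to_name(label):
    """Map an integer label (Python int or NumPy integer) to its class name."""
    return CLASS_NAMES[int(label)]

# Example: build plot titles for a few labels
titles = [f"{lbl}: {label_to_name(lbl)}" for lbl in [9, 0, 3]]
print(titles)
```

Inside `plot_sample_images`, the title line could then use `label_to_name(y[i])` instead of the raw integer.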


### **Flattening the Images**

In this step, the images from the training and test sets are flattened from 28x28 pixel arrays into 1-dimensional vectors of 784 pixels. Flattening is necessary because scikit-learn estimators such as PCA and Random Forest expect a 2-dimensional input of shape `(n_samples, n_features)`.

- `x_train_flattened` and `x_test_flattened` store the reshaped arrays.
- The shapes of the reshaped arrays are printed to confirm the successful transformation.


In [None]:
# Flatten the images
x_train_flattened = x_train.reshape(x_train.shape[0], -1)
x_test_flattened = x_test.reshape(x_test.shape[0], -1)

# Check the shapes
print(f"x_train_flattened shape: {x_train_flattened.shape}")
print(f"x_test_flattened shape: {x_test_flattened.shape}")
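
To make the reshape concrete, the sketch below flattens a tiny stand-in batch (an assumption, used so the snippet runs without the dataset) and checks where a given pixel ends up. It relies on NumPy's default row-major layout.

```python
import numpy as np

# Stand-in batch (assumption): 3 "images" of 28x28 pixels with distinct values
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# Keep the sample axis and collapse the remaining axes into one
flat = images.reshape(images.shape[0], -1)

# Row-major order: pixel (row r, column c) lands at index r * 28 + c
r, c = 5, 7
same_pixel = bool(flat[1, r * 28 + c] == images[1, r, c])

# Reshaping back recovers the original images exactly
restored = flat.reshape(-1, 28, 28)

print(flat.shape, same_pixel, np.array_equal(restored, images))
```

Because the reshape is lossless, a flattened row can always be turned back into an image for plotting with `row.reshape(28, 28)`.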

### **Applying PCA for Dimensionality Reduction**

In this step, Principal Component Analysis (PCA) is used to reduce the dimensionality of the flattened images, keeping most of the variance in the data.

1. First, a PCA model that keeps all components is fit on the training data (`x_train_flattened`).
2. The cumulative explained variance ratio is plotted to show how much variance is captured by the first k principal components.
3. A second PCA is created with `n_components=0.95`, which tells scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds 95%, so the count does not have to be read off the plot by hand.
4. The second PCA is fit on the training data only, and the same projection is then applied to both the training and test data.
5. Finally, the new shapes of the transformed training and test data are printed, reflecting the reduced number of features after PCA.

Retaining 95% of the variance is a common heuristic for balancing dimensionality reduction against information loss; it is a starting point rather than a guarantee of good downstream accuracy.


In [None]:
# Initialize PCA and fit it on the training data
pca = PCA()
pca.fit(x_train_flattened)

# Plot the explained variance ratio
plt.figure(figsize=(8, 6))
# Use the fitted component count instead of hardcoding 784
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title('Explained Variance by Number of Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()

# You can choose the number of components based on the explained variance.
# For example, let's choose the number of components to retain 95% of the variance.
pca = PCA(n_components=0.95)
x_train_pca = pca.fit_transform(x_train_flattened)
x_test_pca = pca.transform(x_test_flattened)

# Print the new shape after applying PCA
print(f"x_train_pca shape: {x_train_pca.shape}")
print(f"x_test_pca shape: {x_test_pca.shape}")
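
The component count that `n_components=0.95` selects can also be derived by hand from the cumulative explained variance. The sketch below does this on synthetic data with decaying per-feature variance (an assumption, so it runs quickly without the dataset) and compares the result with scikit-learn's own choice. `svd_solver="full"` is set explicitly so both fits use the same solver.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (assumption): 300 samples, 30 features with decaying variance
rng = np.random.default_rng(0)
scales = np.linspace(10.0, 0.5, 30)
x_demo = rng.normal(size=(300, 30)) * scales

# Full PCA: cumulative explained variance for every possible component count
pca_full = PCA(svd_solver="full").fit(x_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance exceeds 95%
k_manual = int(np.searchsorted(cumulative, 0.95, side="right")) + 1

# Letting scikit-learn pick the count from the same threshold
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(x_demo)

print(k_manual, pca_95.n_components_)
```

After fitting with a float threshold, `pca.n_components_` reports how many components were actually kept, which is the number to quote alongside any accuracy or timing result.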


### **Evaluating the Effect of PCA on Model Performance and Training Time**

In this step, we evaluate the impact of different numbers of principal components on both the accuracy and the training time of the model.

1. A list of potential component sizes (`components_list`) is defined to test.
2. For each number of components in the list, the following steps are executed:
   - PCA is applied to reduce the dimensionality of the data.
   - A Random Forest classifier is trained using the transformed training data (`x_train_pca`).
   - The training time is measured using the `time` library to capture how long the model takes to fit. Prediction and the PCA step itself are outside the timed span.
   - The accuracy of the model is calculated using `accuracy_score`.
3. The accuracy and time taken for each number of components are recorded in `accuracy_list` and `time_list`.
4. Finally, two plots are generated:
   - **Accuracy vs. Number of Principal Components**: Shows how accuracy changes with the number of components.
   - **Time Taken vs. Number of Principal Components**: Shows how the model's training time changes as more components are included.

This comparison helps in selecting an optimal number of components based on a trade-off between model performance and computational time.


In [None]:
# Number of components to test
components_list = [50, 100, 150, 200, 300, 400, 500]

# Store the results
accuracy_list = []
time_list = []

for n_components in components_list:
    # Apply PCA with n_components
    pca = PCA(n_components=n_components)
    x_train_pca = pca.fit_transform(x_train_flattened)
    x_test_pca = pca.transform(x_test_flattened)
    
    # Train Random Forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Time the training only (PCA above and prediction below are not timed)
    start_time = time.time()
    model.fit(x_train_pca, y_train)
    end_time = time.time()
    
    # Get the accuracy
    y_pred = model.predict(x_test_pca)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Append results
    accuracy_list.append(accuracy)
    time_list.append(end_time - start_time)

# Plot the accuracy vs. number of components
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(components_list, accuracy_list, marker='o')
plt.title('Accuracy vs. Number of Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Accuracy')

# Plot the time taken vs. number of components
plt.subplot(1, 2, 2)
plt.plot(components_list, time_list, marker='o', color='r')
plt.title('Time Taken vs. Number of Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Time Taken (seconds)')

plt.tight_layout()
plt.show()
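
The loop above times only `model.fit`. If the cost of PCA itself and of prediction matters for the comparison, each phase can be timed separately. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, so that it runs in seconds). `evaluate_components` is a hypothetical helper, not part of the notebook, and `time.perf_counter` is used because it is intended for measuring short intervals.

```python
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_components(x_tr, y_tr, x_te, y_te, n_components, n_estimators=20):
    """Fit PCA + Random Forest and time each phase separately."""
    t0 = time.perf_counter()
    pca = PCA(n_components=n_components, random_state=0)
    x_tr_p = pca.fit_transform(x_tr)   # fit on training data only
    x_te_p = pca.transform(x_te)       # reuse the same projection
    t1 = time.perf_counter()

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(x_tr_p, y_tr)
    t2 = time.perf_counter()

    y_pred = model.predict(x_te_p)
    t3 = time.perf_counter()

    return {
        "n_components": n_components,
        "accuracy": accuracy_score(y_te, y_pred),
        "pca_seconds": t1 - t0,
        "fit_seconds": t2 - t1,
        "predict_seconds": t3 - t2,
    }

# Synthetic stand-in for the flattened images (assumption)
x, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)

results = [evaluate_components(x_tr, y_tr, x_te, y_te, k) for k in (5, 10, 20)]
for row in results:
    print(row)
```

Reporting the three durations side by side shows whether any time saved in training is offset by the cost of the projection.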


### **Model Training and Evaluation with PCA**

In this step, we train the Random Forest classifier on the PCA-transformed training data and evaluate its performance:

1. **Timer Start**: A timer is started with `time.time()` so that the measured interval covers both training and prediction.
2. **Model Initialization**: A Random Forest classifier (`rf`) is initialized with 50 estimators and a fixed random state for reproducibility.
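3. **Model Training**: The model is trained using the PCA-transformed training data (`x_train_pca`). At this point `x_train_pca` and `x_test_pca` hold the output of whichever PCA was fit most recently; if the component sweep in the previous cell has been run, that is the 500-component projection, not the 95%-variance one.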
4. **Prediction**: The model makes predictions on the PCA-transformed test data (`x_test_pca`).
5. **Accuracy Calculation**: The accuracy of the model is computed by comparing the predicted labels (`y_pred_pca`) with the true labels (`y_test`).
6. **Timer End**: The end time is recorded, and the total time for training and prediction is calculated by subtracting the start time from the end time.
7. **Results**: Finally, the accuracy and time taken for training and prediction are printed.

This step allows for the evaluation of the model's performance and the computational time required after applying PCA for dimensionality reduction.


In [None]:
# Start the timer
start_time = time.time()

# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_estimators=50, random_state=42)

# Train the model on the PCA-transformed training data
rf.fit(x_train_pca, y_train)

# Predict on the PCA-transformed test data
y_pred_pca = rf.predict(x_test_pca)

# Calculate accuracy
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# End the timer
end_time = time.time()

# Calculate the time taken for training and prediction
time_taken = end_time - start_time

# Print the results
print(f"Accuracy with PCA: {accuracy_pca:.4f}")
print(f"Time taken with PCA: {time_taken:.4f} seconds")
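
For reference, the accuracy reported here is simply the fraction of test samples whose predicted label equals the true label. The short sketch below checks this on a made-up pair of label arrays (illustrative values, not results from the notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 6 of the 8 predictions are correct
y_true = np.array([0, 1, 2, 2, 1, 0, 9, 9])
y_pred = np.array([0, 1, 2, 1, 1, 0, 9, 8])

# Fraction of positions where prediction and truth agree
acc_manual = float(np.mean(y_true == y_pred))

# The same quantity computed by scikit-learn
acc_sklearn = float(accuracy_score(y_true, y_pred))

print(acc_manual, acc_sklearn)
```

Because accuracy collapses all ten classes into one number, it does not show which classes are being confused with each other.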


### **Model Training and Evaluation without PCA**

In this step, we train the Random Forest classifier on the original data without applying PCA and evaluate its performance:

1. **Timer Start**: A timer is started with `time.time()` so that the measured interval covers both training and prediction without PCA.
2. **Model Initialization**: A Random Forest classifier (`rf_no_pca`) is initialized with 100 estimators and a fixed random state for reproducibility.
3. **Model Training**: The model is trained using the flattened training data (`x_train_flattened`), which has not undergone PCA transformation.
4. **Prediction**: The model makes predictions on the original test data (`x_test_flattened`), which is also not PCA-transformed.
5. **Accuracy Calculation**: The accuracy of the model is computed by comparing the predicted labels (`y_pred_no_pca`) with the true labels (`y_test`).
6. **Timer End**: The end time is recorded, and the total time for training and prediction is calculated by subtracting the start time from the end time.
7. **Results**: Finally, the accuracy and time taken for training and prediction without PCA are printed.

This step provides insights into how the model performs without dimensionality reduction and helps compare the impact of PCA on training time and accuracy.


In [None]:
# Start the timer for training and prediction without PCA
start_time_no_pca = time.time()

# Train the model on the original data (no PCA)
rf_no_pca = RandomForestClassifier(n_estimators=100, random_state=42)
rf_no_pca.fit(x_train_flattened, y_train)

# Predict on the test data (no PCA)
y_pred_no_pca = rf_no_pca.predict(x_test_flattened)

# Calculate accuracy
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)

# End the timer
end_time_no_pca = time.time()

# Calculate the time taken for training and prediction (no PCA)
time_taken_no_pca = end_time_no_pca - start_time_no_pca

# Print the results
print(f"Accuracy without PCA: {accuracy_no_pca:.4f}")
print(f"Time taken without PCA: {time_taken_no_pca:.4f} seconds")
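
The two cells above use different forest sizes (50 and 100 estimators), and the timed span of the PCA run does not include fitting PCA. One way to put both runs on equal footing is to share the forest settings and wrap PCA and the classifier in a scikit-learn pipeline, so the projection is timed as part of `fit`. The sketch below is written under those assumptions and uses synthetic data; `fit_and_score` and `forest_args` are hypothetical names.

```python
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def fit_and_score(estimator, x_tr, y_tr, x_te, y_te):
    """Return (accuracy, seconds) for fitting and predicting with an estimator."""
    start = time.perf_counter()
    estimator.fit(x_tr, y_tr)
    y_pred = estimator.predict(x_te)
    elapsed = time.perf_counter() - start
    return accuracy_score(y_te, y_pred), elapsed

# Synthetic stand-in for the flattened images (assumption)
x, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)

# Identical forest settings for both runs
forest_args = {"n_estimators": 30, "random_state": 42}

# The pipeline fits PCA on the training data, then the forest on the projection
with_pca = make_pipeline(PCA(n_components=0.95), RandomForestClassifier(**forest_args))
without_pca = RandomForestClassifier(**forest_args)

acc_with, sec_with = fit_and_score(with_pca, x_tr, y_tr, x_te, y_te)
acc_without, sec_without = fit_and_score(without_pca, x_tr, y_tr, x_te, y_te)

print(f"With PCA:    accuracy={acc_with:.4f}, seconds={sec_with:.4f}")
print(f"Without PCA: accuracy={acc_without:.4f}, seconds={sec_without:.4f}")
```

With this setup, the only difference between the two runs is the presence of the PCA step.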


### **Model Comparison: With and Without PCA**

This section compares the performance of the Random Forest model when trained with and without PCA applied to the data.

1. **Results Dictionary**: A dictionary is created to store the accuracy and time taken for both models (with and without PCA). The values are entered as literals taken from a previous run of the two preceding cells.
2. **DataFrame Creation**: The results dictionary is converted into a pandas DataFrame for easier presentation and comparison.
3. **Display Results**: The comparison DataFrame is printed for a tabular view of the results.
4. **Visualization**: Two bar plots are generated to visually compare the models:
   - **Accuracy Comparison**: The first bar plot shows the accuracy of both models. Its y-axis is limited to the range 0.8 to 0.9, which magnifies the difference.
   - **Time Comparison**: The second bar plot shows the time taken for training and prediction for both models. Its y-axis is limited to 100 to 160 seconds.

This step gives an overview of how accuracy and run time differ between the two setups. The two runs differ in forest size (50 vs. 100 estimators) as well as in input features, so the gap cannot be attributed to PCA alone.


In [None]:
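# Comparison of models
# The numbers below are literals from a previous run. To use the current
# session's results, substitute accuracy_pca, accuracy_no_pca, time_taken,
# and time_taken_no_pca.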
results = {
    "Model": ["With PCA", "Without PCA"],
    "Accuracy": [0.8529, 0.8760],
    "Time Taken (seconds)": [109.6082, 146.5678]
}

# Convert to a DataFrame for better presentation
import pandas as pd
comparison_df = pd.DataFrame(results)

# Display the results
print("Comparison of Models:")
print(comparison_df)

# Visualizing the comparison (matplotlib is already imported in the first cell)

# Plot accuracy comparison
plt.figure(figsize=(10, 5))

# Accuracy Bar Plot
plt.subplot(1, 2, 1)
plt.bar(results["Model"], results["Accuracy"], color=["blue", "green"])
plt.title("Accuracy Comparison")
plt.ylabel("Accuracy")
plt.ylim(0.8, 0.9)

# Time Taken Bar Plot
plt.subplot(1, 2, 2)
plt.bar(results["Model"], results["Time Taken (seconds)"], color=["blue", "green"])
plt.title("Time Comparison")
plt.ylabel("Time Taken (seconds)")
plt.ylim(100, 160)

plt.tight_layout()
plt.show()
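
The comparison can also be reduced to a few numbers: the change in accuracy, the time saved, and the speedup factor. The sketch below applies this to the four values listed in the results dictionary above; `compare_runs` is a hypothetical helper introduced for illustration.

```python
def compare_runs(acc_pca, sec_pca, acc_raw, sec_raw):
    """Summarize the difference between a PCA run and a run on raw features."""
    return {
        # Negative means the PCA run was less accurate
        "accuracy_change": acc_pca - acc_raw,
        # Positive means the PCA run was faster
        "time_saved_seconds": sec_raw - sec_pca,
        # Ratio of raw-feature time to PCA time
        "speedup": sec_raw / sec_pca,
    }

# Values taken from the results dictionary in the cell above
summary = compare_runs(0.8529, 109.6082, 0.8760, 146.5678)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

In a live session, the literals could be replaced by `accuracy_pca`, `time_taken`, `accuracy_no_pca`, and `time_taken_no_pca`.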
