In [1]:
#Mount your drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Code
1.   Implement and apply kNN on MNIST with k=1, 5, 10. Apply kNN on raw images, and 2, 7 dimensional eigenspaces, respectively. Show the accuracy scores for each run (you'd run the algorithm 9 times).

2.   Use and run the Random Forest algorithm for MNIST classification (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Show the accuracy scores.

If it takes too much time, reduce the number of samples for training/testing (based on random selection).

In [None]:
#code here

import numpy as np
import os
import urllib.request
import gzip
import pickle

# from Homework1
def load_data(dataset):
  if not os.path.isfile(dataset):
    origin = (
      'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
    )
    print('Downloading data from %s' % origin)
    urllib.request.urlretrieve(origin, dataset)
  print('Loading data...')

  with gzip.open(dataset, 'rb') as f:
    try:
      train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
    except:
      train_set, valid_set, test_set = pickle.load(f)
  print('... data has been loaded!')

  return train_set, valid_set, test_set

def KNN_classifier(X_train, y_train, X_test, n_neighbors):
  y_pred = []
  for x in X_test:
    distances = [np.linalg.norm(x - x_train) for x_train in X_train]
    k_indices = np.argsort(distances)[:n_neighbors]
    k_nearest_labels = [y_train[i] for i in k_indices]
    # Count occurrences of each class label
    label_counts = np.bincount(k_nearest_labels)
    most_common_labels = np.where(label_counts == label_counts.max())[0]
    if len(most_common_labels) == 1:
      y_pred.append(most_common_labels[0])
    else:
      # If there's a tie, return the label with the smallest distance
      tie_distances = [distances[i] for i in k_indices if np.isin(y_train[i], most_common_labels)]
      if not tie_distances:
        min_distance_index = np.argmin(distances)
        y_pred.append(y_train[min_distance_index])
      else:
        y_pred.append(most_common_labels[np.argmin(tie_distances)])
  return np.array(y_pred)

def pca(X, n_components):
  mean = np.mean(X, axis=0)
  centered_data = X - mean
  cov_matrix = np.cov(centered_data, rowvar=False)
  eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
  sorted_indices = np.argsort(eigenvalues)[::-1]
  eigenvectors = eigenvectors[:, sorted_indices]
  selected_eigenvectors = eigenvectors[:, :n_components]
  reduced_data = centered_data.dot(selected_eigenvectors)
  return reduced_data

# Process data
def calculate_accuracy(y_true, y_pred):
  correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
  total = len(y_true)
  accuracy = correct / total
  return accuracy

# 1) Load the MNIST dataset as in Homework7
dataset_file = 'mnist.pkl.gz'
train_set, _, test_set = load_data(dataset_file)
X_train, y_train = train_set # data, label
X_test, y_test = test_set

# # Select a subset of the data to speed up the computation (optional)
subset_size = 1000
X_train = X_train[:subset_size]
y_train = y_train[:subset_size]
X_text = X_test[:subset_size]
y_text = y_test[:subset_size]

# kNN on Raw Images
k_values = [1, 5, 10]
for k_value in k_values:
  y_pred = KNN_classifier(X_train, y_train, X_test, k_value)
  accuracy = calculate_accuracy(y_test, y_pred)
  print(f"Accuracy(k={k_value}) : {accuracy}")

# kNN on  Reduced-Dementional Spaces
reduced_dimensions = [2, 7]
for reduced_dimension in reduced_dimensions:
  X_train_reduced = pca(X_train, reduced_dimension)
  X_test_reduced = pca(X_test, reduced_dimension)

  for k_value in k_values:
    y_pred = KNN_classifier(X_train_reduced, y_train, X_test_reduced, k_value)
    accuracy = calculate_accuracy(y_test, y_pred)
    print(f"Accuracy(k={k_value}, dim={reduced_dimension}) : {accuracy}")

# Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy(Random Forest) : {accuracy}")

Loading data...
... data has been loaded!


# ChatGPT prompts

Now solve the same task with ChatGPT with the following prompts:


1.   Can you implement and apply kNN on MNIST (from scratch without using any specialized library) with k=1, 5, 10? Can you show the accuracies of the predictions? Ignore the computational cost of this code.
2.   Can you use the same provided knn algorithm and apply it to the MNIST data with reduced dimensions to 2 and 7, with k=1, 5, 10? Please provide the accuracies.
3.   Can you provide a code for the random forest algorithm applied to the MNIST dataset? Please use specialized libraries like sklearn. Please show the accuracies scores.


**Paste the answer provided and run the code below:**

In [None]:
#paste code here
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load MNIST dataset
mnist = fetch_openml("mnist_784")
X = mnist.data
y = mnist.target

# Preprocess data
X = X / 255.0  # Normalize data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# kNN on Raw Images
k_values = [1, 5, 10]
for k in k_values:
    kNN_classifier = KNeighborsClassifier(n_neighbors=k)
    kNN_classifier.fit(X_train, y_train)
    y_pred = kNN_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'kNN (k={k}) on Raw Images Accuracy: {accuracy}')

# kNN on Reduced-Dimensional Spaces
dimensions = [2, 7]
for dim in dimensions:
    pca = PCA(n_components=dim)
    X_train_reduced = pca.fit_transform(X_train)
    X_test_reduced = pca.transform(X_test)

    for k in k_values:
        kNN_classifier = KNeighborsClassifier(n_neighbors=k)
        kNN_classifier.fit(X_train_reduced, y_train)
        y_pred = kNN_classifier.predict(X_test_reduced)
        accuracy = accuracy_score(y_test, y_pred)
        print(f'kNN (k={k}) on {dim}-D Eigenspace Accuracy: {accuracy}')

# Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Random Forest Accuracy: {accuracy}')


# Report


1.   What are some differences you notice between your code and ChatGPT's? (Name at least 3)

1) Data Sourc
 In my code, the MNIST dataset is loaded from an external source using a URL (http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz).
But GPT suggest the MNIST dataset using scikit-learn's 'fetch_openml' function.

2) My code was executed by cutting only 1000 data into subsets to speed up calculation performance. On the other hand, GPT used the entire dataset even if it was a different source dataset. Maybe the difference in accuracy comes from here.

3) sklearn libirary
 My code includes a custom PCA (Principal Component Analysis) function and KNN_classifier function. However, GPT uses scikit-learn's 'PCA' module for dimensionality reduction, 'KNeighborsClassifier' module for KNN classifier and 'accuracy_score' module for metric calculation. But, yes, the GPT codes looks really simple and easy.

---

2.   Ask ChatGPT to compare kNN and Random Forest. Paste the answer here. Do you agree with ChatGPT's statement? Why yes or why not?

>> GPT Response:
Certainly! K-Nearest Neighbors (k-NN) and Random Forest are both machine learning algorithms, but they have different approaches and characteristics. Let's compare them in various aspects:
1. **Algorithm Type**:
   - **k-NN**: k-NN is a simple instance-based learning algorithm. It classifies data points based on their similarity to nearby neighbors.
   - **Random Forest**: Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.
2. **Supervised Learning**:
   - **k-NN**: k-NN is a supervised learning algorithm used for both classification and regression tasks.
   - **Random Forest**: Random Forest is primarily used for supervised classification and regression tasks.
3. **Model Complexity**:
   - **k-NN**: K-NN is a simple algorithm with no model training phase. It stores the entire training dataset and calculates distances during prediction.
   - **Random Forest**: Random Forest is more complex as it involves training multiple decision trees and combining their predictions.
4. **Training**:
   - **k-NN**: There is no explicit training in k-NN. The model stores the training data and uses it for prediction.
   - **Random Forest**: Random Forest trains an ensemble of decision trees using bootstrapped subsets of the data and random feature selection.
5. **Predictions**:
   - **k-NN**: k-NN predicts a data point's class based on the majority class among its k-nearest neighbors.
   - **Random Forest**: Random Forest combines the predictions of multiple decision trees and typically selects the majority class for classification tasks.
6. **Hyperparameters**:
   - **k-NN**: The main hyperparameter is 'k,' the number of neighbors to consider. The choice of 'k' impacts the model's bias-variance trade-off.
   - **Random Forest**: Random Forest has hyperparameters such as the number of trees, maximum depth of trees, and feature selection criteria.
7. **Performance**:
   - **k-NN**: k-NN is simple and can perform well on certain datasets, especially when the decision boundary is not too complex. However, it can be sensitive to the choice of 'k.'
   - **Random Forest**: Random Forest is known for its strong performance due to its ability to handle complex datasets, reduce overfitting, and provide feature importances.
8. **Scaling**:
   - **k-NN**: k-NN can be sensitive to the scale of features, so feature scaling is important.
   - **Random Forest**: Random Forest is less sensitive to feature scaling.
9. **Interpretability**:
   - **k-NN**: k-NN is relatively straightforward to interpret because it classifies based on the nearest neighbors.
   - **Random Forest**: Random Forest can be less interpretable due to the ensemble nature, but feature importances can be analyzed.
10. **Use Cases**:
    - **k-NN**: Suitable for simple classification and regression tasks where data has clear clusters or patterns.
    - **Random Forest**: Effective for a wide range of classification and regression tasks, including those with complex, high-dimensional data.<br><br>
In summary, the choice between k-NN and Random Forest depends on the nature of your data, the complexity of the problem, and your specific goals. k-NN is simple and can work well on certain datasets, while Random Forest is a versatile and powerful algorithm that often delivers robust performance in a variety of scenarios.

Yes, it talks well. I can summary and add some explanation.
k-NN is easy to understand and implement. It can work well when the decision boundary is smooth and data has clear clusters. But calculating distances to all training examples can be slow, especially with large datasets.
Random Forest typically provides robust and accurate results due to combining multiple decision trees. So it's suitable for high-dimensional and complex decision boundaries data.
In conclusion, Random Forest is a more versatile and powerful algorithm, making it a common choice, but k-NN can still be valuable in certain situations where simplicity and ease of implementation are important.

---

3.   In yours and ChatGPTs implementation, which algorithm had the highest accuracy?

In the result of both my code and ChatGPT's code, the Random Forest algorithm had the highest accuracy.
- My Implementation : 0.8791
- ChatGPT's Implementation : 0.9675

---

4.   Conclude with some thoughts and things you learned from this homework.

I experienced the implementations for solving a classification problem on the MNIST dataset using k-NN and Random Forest algorithms.
While the custom k-NN implementation in my code worked, it was less efficient and potentially less accurate than scikit-learn's KNeighborsClassifier. The accuracy results shows the Random Forest algorithm is better than KNN algorithm in both implementations.
Custom implementations was really educational in the aspect of studying mathmetical concepts, but using well-optimized libraries like scikit-learn must be happy.


