## Computer Vision

Let's do some very basic computer vision. We're going to import the MNIST handwritten digits data and use $k$NN to predict values (i.e. "see/read").

1. To load the data, run the following code in a chunk:
```
from keras.datasets import mnist
df = mnist.load_data('minst.db')
train,test = df
X_train, y_train = train
X_test, y_test = test
```
The `y_train` and `y_test` vectors tell you, for each index `i`, what number is written in the corresponding `X_train[i]` or `X_test[i]`. Each `X_train[i]` and `X_test[i]`, however, is a 28$\times$28 array whose entries are integers between 0 and 255. Each element of the matrix is essentially a "pixel" and the matrix encodes a representation of a number. To visualize this, run the following code to see the first five numbers in the test set:
```
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000)
for i in range(5):
    print(y_test[i],'\n') # Print the label
    print(X_test[i],'\n') # Print the matrix of values
    plt.contourf(np.rot90(X_test[i].transpose())) # Make a contour plot of the matrix values
    plt.show()
```
OK, those are the data: Labels attached to handwritten digits encoded as a matrix.
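
To make the "matrix of pixels" idea concrete without loading MNIST, here is a minimal sketch using a made-up 5$\times$5 image. The array `toy_digit` and the helper `to_ascii` are hypothetical names for illustration and are not part of the dataset or of Keras.

```python
import numpy as np

# A hypothetical 5x5 "image" of the digit 1: 0 is background, 255 is full ink
toy_digit = np.array([
    [0,   0, 255,   0, 0],
    [0, 255, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0, 255, 255, 255, 0],
], dtype=np.uint8)

def to_ascii(image, threshold=128):
    """Render a grayscale matrix as text: '#' for dark ink, '.' for background."""
    return "\n".join(
        "".join("#" if pixel >= threshold else "." for pixel in row)
        for row in image
    )

print(toy_digit.shape)  # (5, 5)
print(to_ascii(toy_digit))
```

An MNIST image works the same way, just with 28 rows and 28 columns and intermediate gray values between 0 and 255.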

2. What is the shape of `X_train` and `X_test`? What is the shape of `X_train[i]` and `X_test[i]` for each index `i`? What is the shape of `y_train` and `y_test`?
3. Use Numpy's `.reshape()` method to convert the training and testing data from matrices into vectors of features. So, `X_test[index].reshape((1,784))` will convert the $index$-th element of `X_test` into a $28\times 28=784$-length row vector of values, rather than a matrix. Turn `X_train` into an $N \times 784$ matrix $X$ that is suitable for scikit-learn's kNN classifier where $N$ is the number of observations and $784=28*28$ (you could use, for example, a `for` loop).
4. Use the reshaped `X_train` and `y_train` data to fit a $k$-nearest neighbor classifier of digits, and the reshaped `X_test` and `y_test` data to evaluate it. What is the optimal number of neighbors $k$? If you can't determine this, play around with different values of $k$ for your classifier.
5. For the optimal number of neighbors, how well does your predictor perform on the test set? Use a confusion matrix and compute accuracy.
6. For your confusion matrix, which mistakes are most likely? Do you find any interesting patterns?
7. So, this is how computers "see." They convert an image into a matrix of values, that matrix becomes a vector in a dataset, and then we deploy ML tools on it as if it were any other kind of tabular data. To make sure you follow this, invent a way to represent a color photo in matrix form, and then describe how you could convert it into tabular data. (Hint: RGB color codes provide a method of encoding a numeric value that represents a color.)
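
For question 7, one possible representation (an illustration, not the only valid answer) stores a color photo as a height $\times$ width $\times$ 3 array, with one layer each for red, green, and blue. The sketch below uses a made-up 2$\times$3 photo; the names `photo`, `row`, and `table` are hypothetical.

```python
import numpy as np

# A hypothetical 2x3 color photo: height x width x 3 channels (R, G, B), each 0-255
photo = np.zeros((2, 3, 3), dtype=np.uint8)
photo[0, 0] = [255, 0, 0]   # top-left pixel is pure red
photo[1, 2] = [0, 0, 255]   # bottom-right pixel is pure blue

# Flatten to one row of features: 2 * 3 * 3 = 18 columns
row = photo.reshape(1, -1)
print(row.shape)  # (1, 18)

# A stack of N same-sized photos becomes an N x 18 table
photos = np.stack([photo, photo])
table = photos.reshape(photos.shape[0], -1)
print(table.shape)  # (2, 18)
```

Each row of `table` is one photo and each column is one color channel of one pixel, which is the same tabular layout used for the grayscale digits.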

In [3]:
from keras.datasets import mnist
df = mnist.load_data('minst.db')
train,test = df
X_train, y_train = train
X_test, y_test = test

SystemError: <built-in function isinstance> returned a result with an exception set

In [None]:
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000)
for i in range(5):
    print(y_test[i],'\n') # Print the label
    print(X_test[i],'\n') # Print the matrix of values
    plt.contourf(np.rot90(X_test[i].transpose())) # Make a contour plot of the matrix values
    plt.show()

In [None]:
#q2
print(X_test.shape, '\n') # Dimensions of the test images
print(X_train.shape, '\n')
print(X_test[0].shape, '\n')
print(X_train[0].shape, '\n')
print(y_test.shape, '\n')
print(y_train.shape, '\n')

`X_test` contains 10,000 matrices of size 28x28 and `X_train` contains 60,000 matrices of size 28x28. The `y_test` vector has 10,000 entries, whereas the `y_train` vector has 60,000 entries. Each entry of `y_train` is the label for the matrix at the same index in `X_train`, and likewise for `y_test` and `X_test`.

3. Use Numpy's `.reshape()` method to convert the training and testing data from matrices into vectors of features. So, `X_test[index].reshape((1,784))` will convert the $index$-th element of `X_test` into a $28\times 28=784$-length row vector of values, rather than a matrix. Turn `X_train` into an $N \times 784$ matrix $X$ that is suitable for scikit-learn's kNN classifier where $N$ is the number of observations and $784=28*28$ (you could use, for example, a `for` loop).

In [None]:
# Import MNIST dataset
from keras.datasets import mnist

# Load the MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Check the original shape of the data
print(f'Original X_train shape: {X_train.shape}')  # Should be (60000, 28, 28)
print(f'Original X_test shape: {X_test.shape}')    # Should be (10000, 28, 28)

# Reshape the data to (N, 784) where N is the number of samples
X_train_reshaped = X_train.reshape(X_train.shape[0], 28*28)
X_test_reshaped = X_test.reshape(X_test.shape[0], 28*28)

# Check the new shape of the data
print(f'Reshaped X_train shape: {X_train_reshaped.shape}')  # Should be (60000, 784)
print(f'Reshaped X_test shape: {X_test_reshaped.shape}')    # Should be (10000, 784)

#USED CHAT GPT
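
To check what the reshape does to each image, here is a small sketch on a stand-in array of three 2$\times$2 "images" (a hypothetical substitute for the 60,000 MNIST images). It compares the single-call reshape with the `for`-loop approach suggested in the question.

```python
import numpy as np

# Hypothetical stand-in for X_train: 3 "images" of size 2x2 instead of 60000 of 28x28
images = np.arange(12).reshape(3, 2, 2)

# Vectorized: one call flattens every image into a row
flat = images.reshape(images.shape[0], -1)

# Loop version: flatten one image at a time, then stack the rows
rows = [images[i].reshape(1, 4) for i in range(images.shape[0])]
flat_loop = np.vstack(rows)

print(flat.shape)                       # (3, 4)
print(np.array_equal(flat, flat_loop))  # True
print(flat[1])                          # [4 5 6 7]
```

Both versions read each image row by row, so pixel $(r, c)$ of a 28$\times$28 image lands in column $28r + c$ of the flattened vector.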

4. Use the reshaped `X_train` and `y_train` data to fit a $k$-nearest neighbor classifier of digits, and the reshaped `X_test` and `y_test` data to evaluate it. What is the optimal number of neighbors $k$? If you can't determine this, play around with different values of $k$ for your classifier.

In [1]:
# Import necessary libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# List to store accuracy results for different k values
k_values = range(1, 11)  # Testing k values from 1 to 10
accuracies = []

# Loop through different values of k
for k in k_values:
    # Create k-NN classifier with current k
    knn = KNeighborsClassifier(n_neighbors=k)

    # Train the classifier on the training data
    knn.fit(X_train_reshaped, y_train)

    # Predict the labels for the test data
    y_pred_test = knn.predict(X_test_reshaped)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred_test)
    accuracies.append(accuracy)

    # Print the accuracy for the current k
    print(f'Accuracy for k={k}: {accuracy:.4f}')

# Find the optimal k with the highest accuracy
best_k = k_values[accuracies.index(max(accuracies))]
print(f'\nThe optimal value of k is: {best_k}')
print(f'Best accuracy: {max(accuracies):.4f}')

KeyboardInterrupt: 
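
The loop above was interrupted before finishing, which is plausible because every kNN prediction has to be compared against a large number of the 60,000 training rows. One common workaround, which is an assumption here and not part of the original notebook, is to search for $k$ on a random subsample first. The sketch uses random stand-in arrays and a hypothetical helper `subsample`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the subsample is reproducible

# Hypothetical stand-ins with the same layout as the flattened MNIST arrays
X_full = rng.integers(0, 256, size=(1000, 784))
y_full = rng.integers(0, 10, size=1000)

def subsample(X, y, n, rng):
    """Draw n rows without replacement, keeping features and labels aligned."""
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx], y[idx]

X_small, y_small = subsample(X_full, y_full, 200, rng)
print(X_small.shape, y_small.shape)  # (200, 784) (200,)
```

With the real data, the same helper could be applied to `X_train_reshaped` and `y_train` before the loop over $k$. The accuracy found on a subsample is only a rough guide to the accuracy on the full data.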

In [5]:
def knn_class(x_hat,gdf,K):
    # Compute distances between x_hat and the data:
    squared_differences = (x_hat - gdf.loc[:,['x1','x2']])**2
    distances = np.sum( squared_differences , axis = 1)
    # Find k smallest values in dist:
    neighbors = np.argsort(distances)[:K].tolist()
    # Find g values for the nearest neighbors:
    g_star = gdf['g'].iloc[neighbors]
    # Modal class:
    g_dist = g_star.value_counts()/K
    g_modal = g_dist.index[g_dist.argmax()]
    # Return a dictionary of computed values of interest:
    return({'neighbors':neighbors, 'g_star':g_star, 'g_dist':g_dist, 'g_modal':g_modal, })
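
The function above expects a DataFrame with columns `x1`, `x2`, and `g`. As a rough sketch of the same idea for plain arrays such as the flattened digits, the hypothetical function `knn_predict` below computes squared distances, picks the $k$ nearest rows, and returns the most common label. It is an illustration on toy data, not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(x_hat, X, y, k):
    """Predict the label of x_hat by majority vote among its k nearest rows of X."""
    # Squared Euclidean distance to every row (cast to float to avoid uint8 overflow)
    diffs = X.astype(float) - np.asarray(x_hat, dtype=float)
    distances = np.sum(diffs ** 2, axis=1)
    # Indices of the k smallest distances
    neighbors = np.argsort(distances)[:k]
    # Most common label among those neighbors (ties go to the smallest label)
    labels, counts = np.unique(y[neighbors], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two clusters in 2D with labels 0 and 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict([1, 1], X_toy, y_toy, k=3))  # 0
print(knn_predict([8, 9], X_toy, y_toy, k=3))  # 1
```

The cast to `float` matters for image data: MNIST pixels are stored as unsigned 8-bit integers, and subtracting them directly would wrap around instead of producing negative differences.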

In [4]:
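#Q3
import numpy as np
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Flatten each 28x28 image into a row of 784 features
X_train = train_images.reshape(train_images.shape[0], 28 * 28)
X_test = test_images.reshape(test_images.shape[0], 28 * 28)

# Keep the label names used in the later cells
y_train, y_test = train_labels, test_labels

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)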

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[1m11490434/11490434[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
Shape of X_train: (60000, 784)
Shape of X_test: (10000, 784)


4. Use the reshaped `X_train` and `y_train` data to fit a $k$-nearest neighbor classifier of digits, and the reshaped `X_test` and `y_test` data to evaluate it. What is the optimal number of neighbors $k$? If you can't determine this, play around with different values of $k$ for your classifier.

In [8]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

# Determine the optimal k:
k_bar = 50
k_grid = np.arange(1,k_bar+1) # The range of k's to consider: 1, ..., k_bar
accuracy = np.zeros(len(k_grid))

for j, k in enumerate(k_grid):
    knn = KNeighborsClassifier(n_neighbors=int(k))
    predictor = knn.fit(X_train,y_train) # X_train is the reshaped (N, 784) array
    accuracy[j] = predictor.score(X_test,y_test) # Share of test digits classified correctly

accuracy_max = np.max(accuracy) # highest recorded accuracy
max_index = np.where(accuracy==accuracy_max)
k_star = k_grid[max_index] # Find the optimal value(s) of k
print(k_star)

plt.plot(k_grid,accuracy) # Plot accuracy by k
plt.xlabel("k")
plt.title("optimal k:"+str(k_star))
plt.ylabel('Accuracy')
plt.show()
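
Questions 5 and 6 ask for a confusion matrix, accuracy, and the most likely mistakes. The sketch below shows how those quantities relate, using NumPy only and made-up labels for three classes. The numbers are for illustration and are not MNIST results, and `confusion_matrix_np` is a hypothetical helper. scikit-learn's `confusion_matrix` and `accuracy_score` in `sklearn.metrics` compute the same matrix and accuracy.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Count outcomes: rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for illustration only (not MNIST results)
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1, 2, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 1, 0])

cm = confusion_matrix_np(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal

# The largest off-diagonal cell is the most frequent mistake
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_label, pred_label = np.unravel_index(np.argmax(off_diag), off_diag.shape)

print(cm)
print(accuracy)                # 0.7
print(true_label, pred_label)  # 2 1
```

For the digits, `n_classes` would be 10, `y_true` would be `y_test`, and `y_pred` would be the predictions from the classifier fitted with the chosen $k$.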