## Assignment 3: $k$ Nearest Neighbor


**Q1.**
1. What is the difference between regression and classification?
2. What is a confusion table? What does it help us understand about a model's performance?
3. What does the SSE quantify about a particular model?
4. What are overfitting and underfitting?
5. Why does splitting the data into training and testing sets, and choosing $k$ by evaluating accuracy or SSE on the test set, improve model performance?
6. With classification, we can report a class label as a prediction or a probability distribution over class labels. Please explain the strengths and weaknesses of each approach.

1. Regression predicts a numeric outcome, while classification predicts a categorical outcome.
2. A confusion table cross-tabulates the predicted and true values, with the predicted values as columns along the top and the true values as rows along the side. It shows where the model succeeds and where it fails. For example, if the model predicts negative when the truth is positive, that is a false negative; if the model predicts positive when the truth is negative, that is a false positive. This tells us in which scenarios the model is making the wrong decisions.
3. The SSE quantifies the error of a specific model by summing the squared differences between the predicted values and the true values.
4. Overfitting is when your model is too complex to reliably explain the phenomenon you are interested in (in kNN, when $k$ is too small, so predictions chase noise in the training data). Underfitting is when your model is too simple to reliably explain the phenomenon you are interested in (in kNN, when $k$ is too large, so predictions are averaged over too many neighbors).
5. Splitting the data into training and testing sets improves the model because you first fit the model on the training data and then evaluate its performance on data it has never seen, which shows whether it generalizes or needs adjusting. Choosing the $k$ with the lowest test SSE (or highest test accuracy) picks the value at which the model performs best on new data, instead of picking $k$ arbitrarily.
6. When making predictions with classification models, we can output either a single class label or a probability distribution over all classes. Class labels are simple and actionable, providing a clear decision that's easy to evaluate and communicate. However, they omit information about the model's confidence. This makes it impossible to adjust decisions for different contexts or identify borderline cases that might need further review.
Probability distributions keep the model's uncertainty, showing how confident it is in each possible class. This allows you to tune decision thresholds based on which errors you are willing to risk. However, they require more interpretation and decision making after the model is done.
The best choice depends on your application: class labels work well for simple, actionable decisions, and probabilities work well when you want to incorporate uncertainty for further processing.
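
As a small illustration of answers 2, 3, and 6, the sketch below builds a confusion table with `pd.crosstab`, shows how a decision threshold turns probabilities into labels, and computes an SSE. All numbers and the helper `to_labels` are made up for illustration and do not come from the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels and predicted probabilities of the positive class
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
p_hat = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3])

def to_labels(p, threshold=0.5):
    """Convert positive-class probabilities into 0/1 labels (illustrative helper)."""
    return pd.Series((p >= threshold).astype(int), name="predicted")

# Confusion table: true values along the side, predictions along the top
table = pd.crosstab(y_true, to_labels(p_hat))
print(table)

# A lower threshold labels more cases as positive, which changes the error mix
table_low = pd.crosstab(y_true, to_labels(p_hat, threshold=0.35))
print(table_low)

# SSE for a toy regression: sum of squared prediction errors
y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
sse = np.sum((y - y_pred) ** 2)
print(sse)
```

With only the labels, the borderline case with probability 0.4 looks the same as any other negative prediction; with the probabilities, it can be flagged or reclassified by moving the threshold.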


**Q2.** This question is a case study for $k$ nearest neighbor regression, using the `USA_cars_datasets.csv` data.

The target variable `y` is `price` and the features are `year` and `mileage`.

1. Load the `./data/USA_cars_datasets.csv`. Keep the following variables and drop the rest: `price`, `year`, `mileage`. Are there any `NA`'s to handle? Look at the head and dimensions of the data.
2. Maxmin normalize `year` and `mileage`.
3. Split the sample into ~80% for training and ~20% for evaluation.
4. Use the $k$NN algorithm and the training data to predict `price` using `year` and `mileage` for the test set for $k=3,10,25,50,100,300$. For each value of $k$, compute the mean squared error and print a scatterplot showing the test value plotted against the predicted value. What patterns do you notice as you increase $k$?
5. Determine the optimal $k$ for these data.
6. Describe what happened in the plots of predicted versus actual prices as $k$ varied, taking your answer to part 5 into account. (Hint: Use the words "underfitting" and "overfitting".)
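
Steps 4 and 5 could be sketched roughly as follows. This is a minimal sketch, not the intended solution: it uses synthetic data as a stand-in for `USA_cars_datasets.csv` (an assumption, so the numbers it prints say nothing about the real cars data), and the names `k_grid`, `mse`, and `k_star` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the normalized car data (assumption: not the real CSV)
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({"year": rng.uniform(0, 1, n), "mileage": rng.uniform(0, 1, n)})
y = 30000 * X["year"] - 20000 * X["mileage"] + rng.normal(0, 3000, n)

# ~80% training, ~20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=65
)

k_grid = [3, 10, 25, 50, 100, 300]
mse = {}
for k in k_grid:
    # Fit on the training data only, then predict the held-out test set
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_hat = model.predict(X_test)
    mse[k] = np.mean((y_test - y_hat) ** 2)
    print(f"k={k:>3}: test MSE = {mse[k]:,.0f}")

# The k with the lowest test MSE among the candidates
k_star = min(mse, key=mse.get)
print("Lowest test MSE at k =", k_star)
```

A scatterplot of `y_test` against `y_hat` could be drawn inside the loop to compare the fits visually for each $k$.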

In [18]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
import seaborn as sns

df = pd.read_csv('/content/USA_cars_datasets.csv', low_memory = False)
print(df.dtypes, '\n') # Inspect the column types
# Keep only the variables you need
df = df[['price', 'year', 'mileage']]

# Check for NAs
print(df.isna().sum())

# Or for total count
print(f"\nTotal NAs: {df.isna().sum().sum()}")

# Look at the head and dimensions
print(f"\nShape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

# Max-min normalize
# Create a function first
def maxmin(z):
    w = (z - np.min(z)) / (np.max(z) - np.min(z))
    return w

# Apply the function to the columns
df['year'] = maxmin(df['year'])
df['mileage'] = maxmin(df['mileage'])

# Check the result
print(df.head())
X = df[['year', 'mileage']]  # Features
y = df['price']  # Target
#split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, # Feature and target variables
                                                    test_size=.2, # Split the sample 80 train/ 20 test
                                                    random_state=65) # For replication purposes

k = 5
model = KNeighborsRegressor(n_neighbors=k) # Create an sklearn model for k=5
fitted_model = model.fit(X_train, y_train) # Train the model on the training data only
N_x = 100 # Coarseness of the year grid
N_y = 100 # Coarseness of the mileage grid
total = N_x*N_y # Total number of points to plot

grid_x = np.linspace(0,1,N_x) # Grid of normalized year values
grid_y = np.linspace(0,1,N_y) # Grid of normalized mileage values

xs, ys = np.meshgrid(grid_x,grid_y) # Explode grids to all possible pairs
grid_year = xs.reshape(total) # Turn pairs into vectors (new names so X and y are not overwritten)
grid_mileage = ys.reshape(total) # Turn pairs into vectors

x_hat = pd.DataFrame({'year': grid_year, 'mileage': grid_mileage}) # Create a dataframe of points to plot
y_hat = fitted_model.predict(x_hat) # Predict price at each grid point
x_hat['Predicted Price'] = y_hat # Add new variable to the dataframe
# Create seaborn plot:
this_plot = sns.scatterplot(data=x_hat, x='year', y='mileage',
                            hue='Predicted Price', palette = 'crest', linewidth=0)
sns.move_legend(this_plot, "upper left", bbox_to_anchor=(1, 1)) # Move legend off the plot canvas

Unnamed: 0       int64
price            int64
brand           object
model           object
year             int64
title_status    object
mileage          int64
color           object
vin             object
lot              int64
state           object
country         object
condition       object
dtype: object 

price      0
year       0
mileage    0
dtype: int64

Total NAs: 0

Shape: (2499, 3)

First few rows:
   price  year  mileage
0   6300  2008   274117
1   2899  2011   190552
2   5350  2018    39590
3  25000  2014    64146
4  27700  2018     6654
   price      year   mileage
0   6300  0.744681  0.269287
1   2899  0.808511  0.187194
2   5350  0.957447  0.038892
3  25000  0.872340  0.063016
4  27700  0.957447  0.006537

1. After keeping only `price`, `year`, and `mileage`, the data have 2,499 rows and 3 columns. I did not find any NAs.
2.

**Q3.** This is a case study on $k$ nearest neighbor classification, using the `animals.csv` data.

The data consist of a label, `class`, taking integer values 1 to 7, the name of the species, `animal`, and 16 characteristics of the animal, including `hair`, `feathers`, `milk`, `eggs`, `airborne`, and so on.

1. Load the data. For each of the seven class labels, print the values in the class and get a sense of what is included in that group. Perform some other EDA: How big are the classes? How much variation is there in each of the features/covariates? Which variables do you think will best predict which class?
2. Split the data 50/50 into training and test/validation sets. (The smaller the data are, the more equal the split should be, in my experience: Otherwise, all of the members of one class end up in the training or test data, and the model falls apart.)
3. Using all of the variables, build a $k$-NN classifier. Explain how you select $k$.
4. Print a confusion table for the optimal model, comparing predicted and actual class label on the test set. How accurate is it? Can you interpret why mistakes are made across groups?
5. Use only `milk`, `aquatic`, and `airborne` to train a new $k$-NN classifier. Print your confusion table. Mine does not predict all of the classes, only a subset of them. To see the underlying probabilities, use `model.predict_proba(X_test.values)` to predict probabilities rather than labels for your `X_test` test data for your fitted `model`. Are all of the classes represented? Explain your results.
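
The workflow in steps 2 to 5 might look something like the sketch below. It runs on synthetic binary features with three made-up classes rather than `animals.csv` (an assumption, so the output does not describe the real data), and the search range for $k$ is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the animals data (assumption: not the real CSV)
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame(rng.integers(0, 2, size=(n, 3)),
                 columns=["milk", "aquatic", "airborne"])
# Hypothetical labels 1-3 driven by the features, with 10% random noise
labels = np.where(X["milk"] == 1, 1, np.where(X["airborne"] == 1, 2, 3))
flip = rng.random(n) < 0.1
labels[flip] = rng.integers(1, 4, size=flip.sum())
y = pd.Series(labels, name="class")

# 50/50 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=65
)

# Choose k by test-set accuracy over a small grid (illustrative range)
acc = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train.values, y_train)
    acc[k] = model.score(X_test.values, y_test)
k_star = max(acc, key=acc.get)

# Refit at the chosen k and print the confusion table
model = KNeighborsClassifier(n_neighbors=k_star).fit(X_train.values, y_train)
y_hat = pd.Series(model.predict(X_test.values), index=y_test.index, name="predicted")
print(pd.crosstab(y_test.rename("actual"), y_hat))

# Probabilities have one column per class seen in training (model.classes_),
# even when a class is never the single most likely label
proba = pd.DataFrame(model.predict_proba(X_test.values), columns=model.classes_)
print(proba.head())
```

Comparing the columns of `proba` with the columns of the confusion table shows which classes receive positive probability without ever being predicted as a label.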

**Q4.** Write your own function to make a kernel density plot.

- The user should pass in a Pandas series or Numpy array.
- The default kernel should be Gaussian, but include the uniform/bump and Epanechnikov as alternatives.
- The default bandwidth should be the Silverman plug-in, but allow the user to specify an alternative.
- You can use Matplotlib or Seaborn's `.lineplot`, but not an existing function that creates kernel density plots.

You will have to make a lot of choices and experiment with getting errors. Embrace the challenge and track your choices in the comments in your code.

Use a data set from class to show that your function works, and compare it with the Seaborn `kdeplot`.

We covered the Gaussian,
$$
k(z) = \dfrac{1}{\sqrt{2\pi}}e^{-z^2/2}
$$
and uniform
$$
k(z) = \begin{cases}
\frac{1}{2}, & |z| \le 1 \\
0, & |z|>1
\end{cases}
$$
kernels in class, but the Epanechnikov kernel is
$$
k(z) = \begin{cases}
\frac{3}{4} (1-z^2), & |z| \le 1 \\
0, & |z|>1.
\end{cases}
$$

In order to make your code run reasonably quickly, consider using the `pdist` or `cdist` functions from SciPy to make distance calculations for arrays of points. The other leading alternative is to thoughtfully use NumPy's broadcasting features. Writing `for` loops will be slow, but that's fine.
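
One possible starting point, using NumPy broadcasting, is sketched below. It is a minimal sketch under stated assumptions rather than a reference solution: the function name `kde`, the grid padding of three bandwidths, and the default bandwidth $1.06\,\hat\sigma\,n^{-1/5}$ (a common rule of thumb that may differ from the exact plug-in form used in class) are all choices made here for illustration.

```python
import numpy as np

def kde(x, kernel="gaussian", bandwidth=None, n_grid=200):
    """Return (grid, density) for a kernel density estimate of x."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # Drop missing values
    n = x.size
    if bandwidth is None:
        # Assumed rule-of-thumb bandwidth: 1.06 * sd * n^(-1/5)
        bandwidth = 1.06 * np.std(x, ddof=1) * n ** (-1 / 5)
    # Evaluate the density on a grid padded beyond the data range
    grid = np.linspace(x.min() - 3 * bandwidth, x.max() + 3 * bandwidth, n_grid)
    # Broadcasting: z[i, j] = (grid[i] - x[j]) / h, shape (n_grid, n)
    z = (grid[:, None] - x[None, :]) / bandwidth
    if kernel == "gaussian":
        k = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    elif kernel == "uniform":
        k = 0.5 * (np.abs(z) <= 1)
    elif kernel == "epanechnikov":
        k = 0.75 * (1 - z ** 2) * (np.abs(z) <= 1)
    else:
        raise ValueError(f"Unknown kernel: {kernel}")
    # Average the kernel contributions and rescale by the bandwidth
    density = k.sum(axis=1) / (n * bandwidth)
    return grid, density

# Sanity check on simulated data: each estimate should integrate to about 1
rng = np.random.default_rng(0)
sample = rng.normal(size=500)
areas = {}
for name in ("gaussian", "uniform", "epanechnikov"):
    grid, density = kde(sample, kernel=name)
    areas[name] = np.sum(density) * (grid[1] - grid[0])
    print(name, round(areas[name], 3))
```

The returned `grid` and `density` can be passed to Matplotlib's `plt.plot` or Seaborn's `sns.lineplot` to draw the curve and overlay it on `sns.kdeplot` for comparison.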