This notebook performs k-nearest-neighbor (k-NN) linear regression on our combined honey production and air quality dataset, which is read in the second cell.<br>
Nearest neighbors are found using only the geographical features (latitude and longitude), so this is not k-NN in the classical sense.<br>
For each query, we fit a linear model to its neighbors using the closed-form matrix solution that minimizes the MSE.<br>
Cross-validation is performed and the MSE is printed near the end. Everything is written from scratch with NumPy.
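
The matrix solution mentioned above is the ordinary least-squares normal equation, applied to the neighbors of each query. As a sketch, with symbols chosen here for illustration only: $X_N$ is the matrix whose rows are the neighbors' feature vectors (with a leading 1 for the bias), $y_N$ holds their responses, and $q$ is the query's feature vector.

```latex
\hat{\theta} = \arg\min_{\theta} \lVert X_N \theta - y_N \rVert_2^2
             = (X_N^\top X_N)^{-1} X_N^\top y_N,
\qquad
\hat{y}(q) = q^\top \hat{\theta}
```

Minimizing the sum of squared residuals is equivalent to minimizing the MSE, since the two differ only by the constant factor $1/|N|$. The closed form assumes $X_N^\top X_N$ is invertible.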

In [1]:
import pandas as pd
from scipy.stats import norm
import numpy as np
import math
np.set_printoptions(suppress=True) # disable scientific notation when printing

In [2]:
data = pd.read_csv("data/completeFeatureVectors2.csv")

X = data[['o3','co','so2','no2','pm25_frm', 'pressure', 'temperature', 'wind', 'year', 'latitude', 'longitude']].to_numpy()
# subtract 1998 from the year so that it starts at zero
X[:,8] = X[:,8]-1998
# Append ones to the start of X for the bias term
X = np.append(np.ones((X.shape[0],1)), X, axis=1)
y = data[['yield_per_col']].to_numpy()
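
Later cells slice X by position rather than by name, so it may help to spell out where each feature lands once the bias column is prepended. The sketch below uses a toy array in place of the real data; `columns`, `index_of`, and `X_toy` are names introduced here for illustration.

```python
import numpy as np

# Feature names in the order used to build X in the cell above.
feature_names = ['o3', 'co', 'so2', 'no2', 'pm25_frm', 'pressure',
                 'temperature', 'wind', 'year', 'latitude', 'longitude']

# Prepending the column of ones shifts every feature right by one position.
columns = ['bias'] + feature_names
index_of = {name: i for i, name in enumerate(columns)}

# Toy stand-in for the real data: 3 rows, 11 features (values are arbitrary).
X_toy = np.arange(33, dtype=float).reshape(3, 11)
X_toy = np.append(np.ones((X_toy.shape[0], 1)), X_toy, axis=1)

print(index_of['year'], index_of['latitude'], index_of['longitude'])
print(X_toy[:, :10].shape)  # bias plus the nine non-geographic features
print(X_toy[:, 10:].shape)  # latitude and longitude only
```

Under this layout, `[:, :10]` selects the bias and the features through year, and `[:, 10:]` selects latitude and longitude.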

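The predict function predicts the response for a query point: it finds the query's k geographically nearest rows in the training data X, fits a linear model to those rows and their responses y, and evaluates that model at the query.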

In [3]:
def predict(query, X, y, k=50):
    # Euclidean distance from the query to every training row, using only
    # latitude and longitude (columns 10 and 11 once the bias column is prepended)
    distances = np.linalg.norm(X[:, 10:] - query[10:], axis=1)
    
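    # determine the cutoff for being one of the k nearest neighbors:
    # the (k+1)-th smallest distance
    # np.sort returns a sorted copy, so distances itself isn't affected
    cutoff = np.sort(distances)[k]
    # keep the rows strictly closer than the cutoff; ties at the cutoff
    # can leave fewer than k neighbors
    # columns 0-9 are the bias plus the non-geographic features
    X_neighbors = X[distances<cutoff][:,:10]
    y_neighbors = y[distances<cutoff]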

    # find theta with the matrix formula that minimizes MSE:
    # theta = (X^T X)^-1 X^T y
    theta = np.linalg.inv(X_neighbors.T @ X_neighbors) @ X_neighbors.T @ y_neighbors

    # return prediction found via theta
    return np.matmul(query[:10].T, theta)
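
As an illustration of the same idea, here is a hedged sketch of a variant that always selects exactly k neighbors and solves the least-squares problem without forming an explicit inverse. `predict_lstsq` and the synthetic data are hypothetical and are not part of the notebook's pipeline. It assumes the same column layout as above: bias in column 0, latitude and longitude in columns 10 and 11.

```python
import numpy as np

def predict_lstsq(query, X, y, k=50):
    """Hypothetical variant of predict: exactly k neighbors, least-squares solve."""
    # Euclidean distance on the last two columns (latitude, longitude)
    distances = np.linalg.norm(X[:, 10:] - query[10:], axis=1)
    # indices of the k smallest distances (ties are broken arbitrarily)
    nearest = np.argpartition(distances, k - 1)[:k]
    X_neighbors = X[nearest, :10]
    y_neighbors = y[nearest]
    # least-squares solution; also works when X^T X is singular
    theta, *_ = np.linalg.lstsq(X_neighbors, y_neighbors, rcond=None)
    return query[:10] @ theta

# Synthetic check: responses are an exact linear function of the first 10 columns,
# so the local fit should recover the true value up to floating-point error.
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 9)), rng.uniform(size=(n, 2))])
true_theta = rng.normal(size=10)
y_demo = X_demo[:, :10] @ true_theta

query = X_demo[0]
pred = predict_lstsq(query, X_demo, y_demo, k=50)
expected = query[:10] @ true_theta
print(pred, expected)
```

`np.linalg.lstsq` returns the minimum-norm solution when the neighbor matrix is rank deficient, which can happen when many neighbors share identical feature values.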

Perform k-fold cross-validation to estimate the mean squared error (MSE). Here k is the number of folds, not the number of neighbors.

In [4]:
# number of folds (the number of neighbors is passed to predict separately below)
k = 10

# the size of the testing set for each fold
chunk_size = X.shape[0] // k

# shuffle X and y together
Xy_shuffled = np.append(X, y, axis=1)
np.random.shuffle(Xy_shuffled)

sq_errors = []

# iterate through k folds
for i in range(k):

    # split out testing and training data
    X_k_test = Xy_shuffled[chunk_size*i:chunk_size*(i+1),:12]
    y_k_test = Xy_shuffled[chunk_size*i:chunk_size*(i+1),12]

    # training data is every row outside the test chunk; the slice before or
    # after the chunk may be empty, which np.append handles
    X_k_train = np.append(Xy_shuffled[:chunk_size*i,:12], Xy_shuffled[chunk_size*(i+1):,:12], axis=0)
    y_k_train = np.append(Xy_shuffled[:chunk_size*i,12], Xy_shuffled[chunk_size*(i+1):,12], axis=0)

    for j in range(X_k_test.shape[0]):
        y_pred = predict(X_k_test[j,:], X_k_train, y_k_train, k=150)
        sq_error = (y_pred - y_k_test[j])**2
        # report unusually large errors as they occur
        if sq_error > 1000:
            print("predicted:", y_pred, "actual", y_k_test[j], "se:", sq_error)
        sq_errors.append(sq_error)
    
    print("i:", i)
    
mean_sq_error = np.mean(sq_errors)
print("median sq error:", np.median(sq_errors))
print("mean sq error:", mean_sq_error)

    

i: 0
predicted: 83.54396455129451 actual 124.0 se: 1636.6908042269154
i: 1
predicted: 52.00222060504069 actual 85.0 se: 1088.853444998401
predicted: 109.47566870959473 actual 77.0 se: 1054.66905813535
i: 2
i: 3
predicted: 49.50984184809639 actual 92.0 se: 1805.4135397737805
i: 4
predicted: 97.06205548021076 actual 136.0 se: 1516.163523426185
i: 5
predicted: 65.22877491383885 actual 98.0 se: 1073.9531936478381
predicted: 59.6116311699281 actual 26.0 se: 1129.7417499032824
i: 6
i: 7
i: 8
predicted: 94.82833897170909 actual 131.0 se: 1308.3890615455796
i: 9
median sq error: 85.55832769394857
mean sq error: 173.93412078789424
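
Because the chunk size above is computed with integer division, any rows left over after the last full chunk are never used for testing. One possible way to cover every row is to split shuffled indices with `np.array_split`, which allows folds of slightly unequal size. The sketch below is an assumption-laden alternative and not the code that produced the output above; `kfold_indices` and its `seed` parameter are hypothetical names.

```python
import numpy as np

def kfold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # array_split tolerates n_rows not being divisible by n_folds
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        # all other folds together form the training set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# Example: 23 rows in 10 folds gives test folds of size 3 or 2.
for train_idx, test_idx in kfold_indices(23, 10):
    print(len(train_idx), len(test_idx))
```

The index arrays can then be used to select rows, for example `X[train_idx]` and `y[test_idx]`, in place of the manual slicing.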
