This notebook fits a linear regression to our combined bee and air-quality dataset, which is loaded in the second cell.<br>
We minimize the mean squared error (MSE) using the closed-form matrix solution (the normal equations).

In [1]:
import pandas as pd
from scipy.stats import norm
import numpy as np
import math

In [2]:
data = pd.read_csv("data/completeFeatureVectors.csv")

X = data[['o3','co','so2','no2','pm25_frm', 'pressure', 'temperature', 'wind', 'year']].to_numpy()
# subtract 1998 from the year so that it starts at zero
X[:,8] = X[:,8]-1998
# Append ones to the start of X for the bias term
X = np.append(np.ones((X.shape[0],1)), X, axis=1)
y = data[['yield_per_col']].to_numpy()

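Minimize the MSE by using the matrix formula to find theta.<br>
We also need the residual variance sigma^2 (sigma is its square root, the standard deviation), estimated in the prediction cell further down, so that we can compute confidence intervals for our predictions.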
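
For reference, the quantities used in the following cells can be written out as below. This is a sketch of the standard ordinary least squares formulas, not something specific to this dataset. Here $n$ is the number of rows of $X$, $p$ is the number of columns (including the column of ones), and $x$ is the feature vector of the point being predicted. The $n - p$ divisor is an assumption (the usual unbiased choice); dividing by $n$ instead gives the maximum-likelihood estimate.

$$
\hat{\theta} = (X^\top X)^{-1} X^\top y,
\qquad
\hat{\sigma}^2 = \frac{\lVert y - X\hat{\theta} \rVert^2}{n - p}
$$

$$
x^\top \hat{\theta} \;\pm\; z_{0.975} \sqrt{\hat{\sigma}^2 \, x^\top (X^\top X)^{-1} x}
$$

The second line is an approximate 95% confidence interval for the mean response at $x$, with $z_{0.975} \approx 1.96$ the standard normal quantile.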

In [3]:
np.set_printoptions(suppress=True) # disable scientific notation when printing

# closed-form least-squares solution: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# print out theta with labels
for label, theta_i in zip(['intercept','o3','co','so2','no2','pm25_frm','pressure','temperature','wind','year'], theta):
    print(f"{label}: {theta_i[0]:f}")

intercept: 120.190591
o3: -544.230626
co: -20.476653
so2: 0.998524
no2: 0.152943
pm25_frm: -0.686405
pressure: 0.002974
temperature: 0.317392
wind: -0.375663
year: -1.848450
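
The explicit inverse above works when `X.T @ X` is well conditioned, but features on very different scales (for example pressure near 1000 next to ozone near 0.03) can make it numerically fragile. The sketch below compares the explicit-inverse formula with `np.linalg.lstsq`, which finds the same minimizer without forming the inverse. It runs on synthetic stand-in data, and the names `gen`, `X_demo`, `y_demo`, and `true_theta` are hypothetical, not part of the notebook.

```python
import numpy as np

# Synthetic stand-in data (hypothetical; not the bee dataset)
gen = np.random.default_rng(0)
n = 200
features = gen.normal(size=(n, 3))
X_demo = np.hstack([np.ones((n, 1)), features])  # leading column of ones for the bias term
true_theta = np.array([[2.0], [0.5], [-1.0], [3.0]])
y_demo = X_demo @ true_theta + gen.normal(scale=0.1, size=(n, 1))

# Normal equations with an explicit inverse, as in the cell above
theta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver: same minimizer, no explicit inverse
theta_lstsq, _, rank, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("solutions agree:", np.allclose(theta_inv, theta_lstsq))
print("rank of X:", rank)
print("condition number of X:", np.linalg.cond(X_demo))
```

A large condition number on the real `X` would be a hint that the coefficients from the explicit inverse deserve a second look.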


Make a prediction and show a 95% confidence interval for that prediction

In [10]:
# Feature vector in the same column order as X:
# intercept, o3, co, so2, no2, pm25_frm, pressure, temperature, wind, year (years since 1998)
x = np.array([[1],[0.0312],[0.759],[1.575],[14.757],[43.9],[984.],[59.17],[97.27], [10]])
prediction = np.matmul(x.T, theta)[0][0]
print("prediction:", prediction)

# estimate the residual variance sigma^2 = RSS / (n - p), where p counts the intercept column
residuals = y - np.matmul(X, theta)
n, p = X.shape
sigma_sq = (np.matmul(residuals.T, residuals) / (n - p))[0][0]
sigma = math.sqrt(sigma_sq)
# variance of the fitted mean at x: sigma^2 * x^T (X^T X)^-1 x
dev = sigma_sq*np.matmul(np.matmul(x.T,np.linalg.inv(np.matmul(X.T,X))),x)[0][0]
print(dev)
# two-sided 95% critical value: norm.ppf is the inverse CDF (norm.cdf maps a z value to a probability)
ppf = norm.ppf(0.975, loc=0, scale=1)
print(ppf)
# half-width of the interval for the mean response: critical value times the standard error
# (for a prediction interval on a new observation, use math.sqrt(sigma_sq + dev) instead)
rng = ppf*math.sqrt(dev)

print("range:", rng)
print("95% conf. Lower bound:", prediction-rng)
print("95% conf. Upper bound:", prediction+rng)

prediction: 28.046243398850937
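
The interval calculation can be checked on data where the true answer is known. The following is a minimal sketch, assuming the `n - p` variance estimate and a normal critical value. The helper `interval_for_point` and the synthetic data are hypothetical and only illustrate the difference between a confidence interval for the mean response and a wider prediction interval for a single new observation.

```python
import math
import numpy as np
from scipy.stats import norm

def interval_for_point(X, y, x_new, level=0.95):
    """Return (prediction, ci_half_width, pi_half_width) for an OLS fit.

    X: (n, p) design matrix including the column of ones
    y: (n, 1) targets
    x_new: (p, 1) feature vector of the point to predict
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    theta = xtx_inv @ X.T @ y
    residuals = y - X @ theta
    sigma_sq = (residuals.T @ residuals).item() / (n - p)  # unbiased residual variance
    leverage = (x_new.T @ xtx_inv @ x_new).item()          # x^T (X^T X)^-1 x
    z = norm.ppf(0.5 + level / 2)                          # 1.96 for level=0.95
    prediction = (x_new.T @ theta).item()
    ci_half = z * math.sqrt(sigma_sq * leverage)           # uncertainty of the mean response
    pi_half = z * math.sqrt(sigma_sq * (1 + leverage))     # adds the noise of a new observation
    return prediction, ci_half, pi_half

# Synthetic check: y = 1 + 2*x + noise with unit variance (hypothetical data)
gen = np.random.default_rng(0)
n_demo = 500
X_demo = np.hstack([np.ones((n_demo, 1)), gen.normal(size=(n_demo, 1))])
y_demo = X_demo @ np.array([[1.0], [2.0]]) + gen.normal(size=(n_demo, 1))
x_new = np.array([[1.0], [0.5]])  # true mean response is 1 + 2*0.5 = 2

pred, ci_half, pi_half = interval_for_point(X_demo, y_demo, x_new)
print("prediction:", pred)
print("95% confidence interval:", pred - ci_half, pred + ci_half)
print("95% prediction interval:", pred - pi_half, pred + pi_half)
```

With few rows, replacing `norm.ppf` by `scipy.stats.t.ppf` with `n - p` degrees of freedom gives slightly wider and more accurate intervals.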


In [5]:
# covariance matrix of the estimated coefficients: sigma^2 * (X^T X)^-1
cov_matrix = sigma_sq*np.linalg.inv(np.matmul(X.T,X))
# Wald test per coefficient: theta_i^2 / Var(theta_i) is approximately chi-squared with 1 degree of freedom
threshold = norm.ppf(0.975)**2 # 95% critical value of chi-squared(1), about 3.84
features = ['intercept','o3','co','so2','no2','pm25_frm','pressure','temperature','wind','year']
for i in range(theta.shape[0]):
    # the diagonal entry is a variance, so divide the squared coefficient by it
    sig = theta[i,0]**2 / cov_matrix[i,i]
    if sig > threshold:
        print(str(i) + ": " + features[i] + ": significant")
    else:
        print(str(i) + ": " + features[i] + ": not significant")
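
The significance test can also be expressed with z scores and p-values, which are easier to read than a bare threshold. The sketch below runs on synthetic data with hypothetical names (`gen`, `X_demo`, `theta_true`, `names`), where the last coefficient is zero by construction. It assumes the normal approximation, so for small samples a t distribution would be the safer choice.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data: the coefficient of x2 is truly zero (hypothetical example)
gen = np.random.default_rng(1)
n = 300
X_demo = np.hstack([np.ones((n, 1)), gen.normal(size=(n, 2))])
theta_true = np.array([[1.0], [2.0], [0.0]])
y_demo = X_demo @ theta_true + gen.normal(size=(n, 1))

# Fit and estimate the residual variance with n - p degrees of freedom
xtx_inv = np.linalg.inv(X_demo.T @ X_demo)
theta_hat = xtx_inv @ X_demo.T @ y_demo
residuals = y_demo - X_demo @ theta_hat
sigma_sq_hat = (residuals.T @ residuals).item() / (n - X_demo.shape[1])

# Standard errors are the square roots of the diagonal of the covariance matrix
std_err = np.sqrt(sigma_sq_hat * np.diag(xtx_inv))
z_scores = theta_hat[:, 0] / std_err
p_values = 2 * norm.sf(np.abs(z_scores))  # two-sided p-value
significant = p_values < 0.05

names = ["intercept", "x1", "x2"]
for name, z, pv, flag in zip(names, z_scores, p_values, significant):
    label = "significant" if flag else "not significant"
    print(f"{name}: z = {z:.2f}, p = {pv:.4f}, {label}")
```

Comparing `z_scores**2` against `norm.ppf(0.975)**2` gives the same decisions as the p-value rule, since the square of a standard normal variable is chi-squared with 1 degree of freedom.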
