In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from scipy.stats import norm
import numpy as np
import math

In [2]:
data = pd.read_csv("data/completeFeatureVectors.csv")

X = data[['o3','co','so2','no2','pm25_frm', 'pressure', 'temperature', 'wind']].to_numpy()
y = data[['yield_per_col']].to_numpy()

Use sklearn to run a simple regression model. A bias (intercept) term is included by default.<br>
X is a 2D numpy array of features where each column is a feature and each row is one data point (state-year)<br>
y is a numpy column array of responses (avg. honey per hive) to correspond with X

In [3]:
# normalize=False was the old default; the parameter was removed in scikit-learn 1.2
model = LinearRegression(fit_intercept=True).fit(X, y)
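
The shape of y affects how the fitted model is indexed later on. The sketch below uses a toy DataFrame with made-up values (not the real data) to show that selecting with a list of column names gives a 2D column array, while selecting a single name gives a 1D array.

```python
import pandas as pd

# Toy frame with hypothetical values; only the resulting shapes matter here.
df = pd.DataFrame({"o3": [0.03, 0.04, 0.05],
                   "yield_per_col": [70.0, 65.0, 80.0]})

y_column = df[["yield_per_col"]].to_numpy()  # list of columns -> 2D array
y_flat = df["yield_per_col"].to_numpy()      # single column   -> 1D array

print(y_column.shape)  # (3, 1)
print(y_flat.shape)    # (3,)
```

Because this notebook fits on a 2D y, `model.coef_` has shape (1, n_features) and `model.predict` returns one column per target, which is why the cells below index with `model.coef_[0]` and `[0][0]`.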

We can view the coefficients (theta) of the regression model.

In [4]:
np.set_printoptions(suppress=True) # don't use scientific notation
print("coefficients:", model.coef_[0])
print("intercept:", model.intercept_)

coefficients: [-667.47195265   -1.05147514    1.03059897    0.22588529    0.51598921
   -0.00016845    0.17898027   -0.48249963]
intercept: [106.36707261]
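
The printed array is easier to read when each coefficient is paired with its feature name. A minimal sketch, assuming the coefficients come out in the same order as the column list used to build X; the numbers are copied from the output above rather than recomputed.

```python
import numpy as np

# Feature order taken from the column list used to build X.
features = ['o3', 'co', 'so2', 'no2', 'pm25_frm', 'pressure', 'temperature', 'wind']
# Coefficients copied from the printed output above.
coef = np.array([-667.47195265, -1.05147514, 1.03059897, 0.22588529,
                 0.51598921, -0.00016845, 0.17898027, -0.48249963])

by_name = dict(zip(features, coef))
for name, value in by_name.items():
    print(f"{name:>12}: {value: .5f}")
```

Each coefficient is expressed per unit of its own feature, so the magnitudes are not directly comparable across features measured on different scales.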


We can perform 5-fold cross-validation like so. You can ignore the fit times and score times. The test and train scores tell us how well the model performed on the held-out and training portions of each fold; for a regressor, the default score is R².

In [5]:
cross_validate(model, X, y=y, return_train_score=True, cv=5)

{'fit_time': array([0.00086451, 0.00092244, 0.00052714, 0.00045156, 0.00043964]),
 'score_time': array([0.00060511, 0.0004673 , 0.00040245, 0.00026941, 0.00026608]),
 'test_score': array([ 0.19562148,  0.37253433,  0.48585462,  0.1235356 , -0.09997231]),
 'train_score': array([0.3745242 , 0.36156911, 0.35909169, 0.42102716, 0.42431873])}
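
The per-fold scores are easier to compare once averaged. The sketch below summarises the values printed above (copied by hand, not recomputed from the data).

```python
import numpy as np

# Scores copied from the cross_validate output above.
test_score = np.array([0.19562148, 0.37253433, 0.48585462, 0.1235356, -0.09997231])
train_score = np.array([0.3745242, 0.36156911, 0.35909169, 0.42102716, 0.42431873])

print("mean test R^2: ", round(test_score.mean(), 3))   # 0.216
print("mean train R^2:", round(train_score.mean(), 3))  # 0.388
print("test R^2 std:  ", round(test_score.std(), 3))
```

A negative test score, as in the last fold, means the model did worse on that fold than simply predicting the fold's mean response. With an integer `cv`, scikit-learn splits a regressor's data into consecutive folds without shuffling, so if the CSV happens to be ordered (for example by state or year), the folds may not be representative of each other.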

pred_X is a 2D numpy array of feature vectors that you would like to predict. Each row should correspond to one data point and each column to one feature, in the same order as the columns of X.

In [6]:
pred_X = np.array([[0.0312,0.759,1.575,14.757,43.9,984.,59.17,97.27]]) # fill in with values to estimate

prediction = model.predict(pred_X)[0][0]
print("Prediction:", prediction)

Prediction: 75.84415972663882
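
For a linear model the prediction is just the intercept plus the dot product of the coefficients with the feature vector. As a sanity check, the sketch below redoes that arithmetic with the rounded coefficients printed earlier, so it agrees with the model's output only up to rounding.

```python
import numpy as np

# Rounded values copied from the printed coefficients and intercept.
coef = np.array([-667.47195265, -1.05147514, 1.03059897, 0.22588529,
                 0.51598921, -0.00016845, 0.17898027, -0.48249963])
intercept = 106.36707261

# Same feature vector as pred_X above.
x = np.array([0.0312, 0.759, 1.575, 14.757, 43.9, 984., 59.17, 97.27])

manual_prediction = intercept + coef @ x
print("Manual prediction:", manual_prediction)  # about 75.844
```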


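sklearn's LinearRegression does not provide a confidence interval for its predictions, so we can compute an approximate one manually: take theta from the fitted model, estimate sigma as the root mean squared residual on the training data, and report prediction ± z·sigma, where z is the 97.5th percentile of the standard normal (about 1.96).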

In [7]:
theta = np.append(model.intercept_[0], model.coef_[0])[..., None]  # column vector: intercept, then coefficients
new_X = np.append(np.ones((X.shape[0], 1)), X, axis=1)  # prepend a column of ones for the intercept
residuals = y - np.matmul(new_X, theta)
sigma_sq = (np.matmul(residuals.T, residuals) / X.shape[0])[0][0]  # mean squared residual (divides by n)
sigma = math.sqrt(sigma_sq)
ppf = norm.ppf(0.975, loc=0, scale=1)  # two-sided 95% critical value, about 1.96
margin = ppf * sigma  # named margin rather than range to avoid shadowing the built-in
print("95% conf. Lower bound:", prediction - margin)
print("95% conf. Upper bound:", prediction + margin)
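
The same calculation can be wrapped in a small helper. This is only a sketch under stated assumptions: `residual_interval` is a hypothetical name, the fit uses NumPy's least squares instead of sklearn, the normal quantile comes from the standard library's `statistics.NormalDist` instead of scipy, and the data is synthetic. Like the cell above, it ignores the uncertainty in theta itself, so the interval is approximate.

```python
import numpy as np
from statistics import NormalDist

def residual_interval(X, y, x_new, level=0.95):
    """Return (prediction, lower, upper) for one new feature row x_new."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    design = np.column_stack([np.ones(len(X)), X])      # intercept column first
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)  # ordinary least squares
    residuals = y - design @ theta
    sigma = np.sqrt(np.mean(residuals ** 2))            # divides by n, as in the cell above
    z = NormalDist().inv_cdf(0.5 + level / 2)           # about 1.96 for level=0.95
    prediction = float(np.concatenate(([1.0], x_new)) @ theta)
    return prediction, prediction - z * sigma, prediction + z * sigma

# Synthetic data, purely illustrative: y = 2 + 3*x0 - x1 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2 + 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

pred, lo, hi = residual_interval(X_demo, y_demo, np.array([1.0, 1.0]))
print(pred, lo, hi)
```

Dividing the squared residuals by n minus the number of fitted parameters, and using a t quantile instead of the normal one, would give a slightly wider interval on small data sets.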