**Machine Learning  
Robert Knox**

Assignment details:

Use linear regression, solved with the normal equation, to predict water temperature T_degC

1) Only use 'Salnty', 'STheta' for predictors

2) Remove NaN / NA values from dataset (prior to building train/test sets). 

3) Compute RMSE, variance explained, and R-squared.
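
For reference, the quantities named above can be written out as follows. The notation is an assumption made for illustration: $X$ is the data matrix (a column of ones plus the 'Salnty' and 'STheta' columns), $y$ is the vector of T_degC values, $\bar{y}$ is its mean, and $n$ is the number of rows left after removing missing values.

$$
\hat{\theta} = (X^\top X)^{-1} X^\top y, \qquad \hat{y} = X\hat{\theta}
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad
\text{variance explained} = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}
$$

When the model includes an intercept, the least-squares residuals have zero mean, so variance explained and $R^2$ evaluate to the same number.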

## Step 1: Import

In [1]:
import numpy as np
from numpy import linalg as LA
import pandas as pd
import os
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics


## Step 2: Read in Data

In [2]:
with open('bottle.csv') as f:
    line = f.readline()
line_column_names = line.split(',')

# Columns 47 (MeanAq) and 73 (pH1) have mixed types, so I read only the desired columns

desired_cols = ['T_degC','Salnty','STheta']
df = pd.read_csv('bottle.csv',sep=',',header='infer',index_col=None,usecols=desired_cols)

In [3]:
df.head()

|   | T_degC | Salnty | STheta |
|---|--------|--------|--------|
| 0 | 10.5   | 33.44  | 25.649 |
| 1 | 10.46  | 33.44  | 25.656 |
| 2 | 10.46  | 33.437 | 25.654 |
| 3 | 10.45  | 33.42  | 25.643 |
| 4 | 10.45  | 33.421 | 25.643 |


## Step 3: Handle Nulls & NaN

In [4]:
df.dropna(axis=0,inplace=True)

In [5]:
#make sure it worked
df.isna().any()

T_degC    False
Salnty    False
STheta    False
dtype: bool
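
As a small illustration of what the drop step does, here is a sketch on a toy frame. The values and the names `toy`, `clean`, and `rows_dropped` are made up for the example; it also uses the non-inplace form of `dropna`, which is a stylistic choice rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in different columns
toy = pd.DataFrame({
    "T_degC": [10.5, np.nan, 10.46, 10.45],
    "Salnty": [33.44, 33.44, np.nan, 33.42],
    "STheta": [25.649, 25.656, 25.654, 25.643],
})

rows_before = len(toy)

# Non-inplace variant: returns a new frame and leaves `toy` untouched.
# Any row with at least one missing value is removed.
clean = toy.dropna(axis=0).reset_index(drop=True)
rows_dropped = rows_before - len(clean)

print(f"dropped {rows_dropped} of {rows_before} rows")
print(clean.isna().any())
```

Counting the dropped rows is a cheap way to see how much data the cleaning step costs before fitting anything.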

In [6]:
df["Intercept"] = np.ones(len(df))

## Step 4: Convert to arrays & solve using the Normal Equation


In [7]:
T_degC = df['T_degC'].values
Intercept = df['Intercept'].values
Salnty = df['Salnty'].values
STheta = df['STheta'].values

#build our data matrix X
X = np.column_stack((Intercept,Salnty,STheta))

#create the transpose of the data matrix for ease of calculation
Xtran = np.transpose(X)

#define our y_true as T_degC
y_true = T_degC.copy()
#print("y_true: ",y_true)

theta_best = LA.inv((Xtran.dot(X))).dot(Xtran).dot(y_true)
yhat = X.dot(theta_best)

#Make the output look nice with array_str
print("theta_best: ",np.array_str(theta_best,precision=4,suppress_small=True))

ybar = sum(y_true)/len(y_true)
#print("ybar",ybar)

#alternative calculations
#SST = sum((y_true-ybar)**2)
#SSM = sum((yhat-ybar)**2)
#SSE = sum((y_true-yhat)**2)

#Calculate Variance Explained
var_explained = 1-np.cov(y_true-yhat)/np.cov(y_true)

#Calculate root mean square error
rmse = (sum((yhat-y_true)**2)/len(y_true))**0.5

#Calculate R-Squared (not adjusted)
r_squared = 1-((y_true-yhat)**2).sum()/(len(y_true)*(y_true.std()**2))

print("\nRMSE:\t\t\t",np.round(rmse,4))
print("Variance Explained:\t",np.round(var_explained,4))
print("R-Squared:\t\t",np.round(r_squared,4))

theta_best:  [89.7647 -0.0555 -2.9838]

RMSE:			 2.3595
Variance Explained:	 0.6875
R-Squared:		 0.6875
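
The cell above forms the inverse of XᵀX explicitly. A common alternative is to solve the same system without computing the inverse, which tends to be better behaved numerically when predictors are strongly correlated. The sketch below compares three routes on synthetic stand-in data; the names `X_demo`, `y_demo`, and `true_theta` and all values are assumptions for illustration, not the bottle.csv data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): a column of ones plus two predictors
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X_demo = np.column_stack((np.ones(n), x1, x2))
true_theta = np.array([2.0, -0.5, 3.0])
y_demo = X_demo @ true_theta + rng.normal(scale=0.1, size=n)

# Route 1: normal equation with an explicit inverse, as in the notebook cell
theta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Route 2: same normal equation, solved as a linear system (no inverse formed)
theta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Route 3: least-squares solver that works on X directly
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# The same three metrics as above, computed from the residuals
resid = y_demo - X_demo @ theta_solve
rmse_demo = np.sqrt(np.mean(resid ** 2))
r2_demo = 1 - np.sum(resid ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)
var_expl_demo = 1 - np.var(resid, ddof=1) / np.var(y_demo, ddof=1)

print("inverse:", np.round(theta_inv, 4))
print("solve:  ", np.round(theta_solve, 4))
print("lstsq:  ", np.round(theta_lstsq, 4))
print("R2 equals variance explained:", np.isclose(r2_demo, var_expl_demo))
```

On well-conditioned data all three routes agree to floating-point precision, so the choice matters mainly when XᵀX is close to singular.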


## Step 5: Check Using scikit-learn's Linear Model

In [8]:
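lin_reg = LinearRegression()
# X already contains the column of ones, and LinearRegression fits its own
# intercept by default, so the first coefficient below comes back as 0.
reg = lin_reg.fit(X, y_true)
print("Intercept: ",reg.intercept_,"\nThetas: ", reg.coef_)
print("R-Squared:",reg.score(X,y_true))
yhat_sk = reg.predict(X)
# RMSE is the square root of the mean squared error
print("RMSE",metrics.mean_squared_error(y_true,yhat_sk)**0.5)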

Intercept:  89.76473480839081 
Thetas:  [ 0.         -0.05547897 -2.98377603]
R-Squared: 0.6875216833872659
RMSE 2.3595303631129534
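
The assignment text mentions train/test sets, while the cells above fit and score on the full cleaned dataset. A held-out evaluation could look roughly like the sketch below. It runs on synthetic stand-in data, and the 80/20 split, the seed, and every name in it (`X_demo`, `train_idx`, `rmse_test`, and so on) are assumptions for illustration rather than part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (hypothetical): a column of ones plus two predictors
n = 500
X_demo = np.column_stack((np.ones(n), rng.normal(size=n), rng.normal(size=n)))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Shuffle the row indices, then cut them 80/20 (assumed ratio)
idx = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X_demo[train_idx], y_demo[train_idx]
X_test, y_test = X_demo[test_idx], y_demo[test_idx]

# Fit on the training rows only, using the normal equation
theta = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Score on rows the fit never saw
y_pred = X_test @ theta
rmse_test = np.sqrt(np.mean((y_pred - y_test) ** 2))
r2_test = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("test RMSE:", np.round(rmse_test, 4))
print("test R-Squared:", np.round(r2_test, 4))
```

Scoring on held-out rows gives an estimate of how the fitted thetas perform on unseen data, which the in-sample numbers above cannot show.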
