<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Evaluating Model Performance

Continuing on from the last example, let's load the tips dataset and evaluate the performance of our model.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

tips_df = pd.read_csv('../../Data/tips.csv')

X = tips_df['tip'].values.reshape(-1,1)  # equivalent to: tips_df[['tip']].values
y = tips_df['total_bill'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

model = LinearRegression()
model.fit(X_train, y_train)  # Train the model
model.score(X_test, y_test)  # Score the model on the test set - for a regressor this returns the R^2 value


0.38686327140864907

As you can see, the score of this model isn't great.

A model that predicted every value perfectly would have a score of 1. This model has an $R^2$ of approximately 0.39.

But what is $R^2$? 

# Regression Mechanics

Using a single feature for X is called "simple linear regression". It's the simplest form of linear regression, so that's what we're going to use for this explanation; however, the same principles apply as you add more features.

The line of best fit follows the equation:

$y = ax + b$

So finding the line of best fit is about finding suitable values for the slope a and the intercept b.

How do we do this?

The strategy is to define an Error Function, and then try to minimise this error.

The following terms are used interchangeably for this function:
* Error Function
* Loss Function 
* Cost Function

![Calculate Error](../../Images/error-from-line.png)

For each observation, we calculate the "residual", i.e. the vertical distance from each point to the line.

If we simply summed up the residuals, the positive values would cancel out the negative values.

To counter this, we square the residuals. 

The Residual Sum of Squares (RSS) is given by the equation: 

![RSS Formula](../../Images/rss-formula.png)
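
As a minimal sketch of the calculation, the snippet below computes the residuals and the RSS for one candidate line. The four data points and the values of `a` and `b` are made up for illustration; they are not taken from the tips dataset.

```python
import numpy as np

# Hypothetical toy data (not from the tips dataset)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.5, 5.5, 8.0])

# A candidate line y = a*x + b, with values picked by hand
a, b = 2.0, 0.0

y_hat = a * x + b            # predicted values on the line
residuals = y - y_hat        # vertical distance from each point to the line

sum_of_residuals = residuals.sum()   # positives and negatives cancel out
rss = np.sum(residuals ** 2)         # squaring stops the cancellation

print(residuals)
print(sum_of_residuals)
print(rss)
```

Here the plain sum of the residuals is 0 even though two of the points are off the line, while the RSS is 0.5.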


This type of Linear Regression is called **Ordinary Least Squares**, where we try to minimise the RSS by varying the values of a and b.
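
For a single feature, the values of a and b that minimise the RSS can be written down directly. The sketch below uses the standard closed-form solution on made-up data and compares it with what `LinearRegression` finds. The helper `rss` and the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def rss(a, b):
    """Residual Sum of Squares for the line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)


# Closed-form least-squares solution for one feature
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

# scikit-learn should arrive at the same slope and intercept
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(a_hat, model.coef_[0])
print(b_hat, model.intercept_)

# Nudging either value away from the solution increases the RSS
print(rss(a_hat, b_hat), rss(a_hat + 0.1, b_hat), rss(a_hat, b_hat + 0.1))
```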


# Multiple Linear Regression
Where you have more dimensions (or features) the formula for the line of best fit is:

$y = a_1x_1 + a_2x_2 + \dots + a_nx_n + b$

But the same principles apply. 
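
As a rough illustration, the sketch below fits `LinearRegression` to two made-up features. The data is generated from known coefficients (3, -2 and an intercept of 5, all chosen arbitrarily) with no noise, so the fitted values should match them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two features, generated without noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))      # columns are x_1 and x_2
y = 3 * X[:, 0] - 2 * X[:, 1] + 5         # y = a_1*x_1 + a_2*x_2 + b

model = LinearRegression().fit(X, y)

print(model.coef_)       # one coefficient per feature (a_1, a_2)
print(model.intercept_)  # the intercept b
```

The model exposes one entry in `coef_` per feature and a single `intercept_`, mirroring the formula above.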


# Fit

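$R^2$ quantifies the proportion of the variance in the target values that is explained by the model.

$R^2$ is at most 1, and how close it gets to 1 depends on how close the points are to the line. On the training data of a least-squares fit it lies between 0 and 1, but on unseen data it can fall below 0 if the model predicts worse than always guessing the mean.

$R^2$ = 1 means that all points lie on the line.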

We calculate the $R^2$ by scoring our model. 

In [3]:
model.score(X_test, y_test) # yes, we did this earlier - this is a recap

0.38686327140864907
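
To make the score less of a black box, the sketch below computes $R^2$ by hand as $1 - RSS / TSS$, where TSS is the total sum of squared differences between each target value and the mean target value. It uses synthetic data (an assumption for illustration, not the tips dataset) and checks the result against `model.score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy data around the line y = 2x + 1
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 3, size=40)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

rss = np.sum((y - y_pred) ** 2)      # variation the model fails to explain
tss = np.sum((y - y.mean()) ** 2)    # total variation in the target
r2_manual = 1 - rss / tss

print(r2_manual)
print(model.score(X, y))     # same value
print(r2_score(y, y_pred))   # same value
```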

# Mean Squared Error

Another way to measure the error is the Mean Squared Error (MSE), the average of the squared residuals.

![Mean Squared Error](../../Images/mean-squared-error-formula.png)

MSE is measured in units of our target value squared, i.e. if the target value is in £, then the MSE will be in $£^2$.

In [8]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred, squared=True)  # squared=True (the default) returns the MSE
mse

45.4310606551747
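
The same calculation can be written out by hand. This sketch uses three made-up target values and predictions (not the tips data) and confirms that the manual result matches `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical target values and predictions
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

squared_errors = (y_true - y_hat) ** 2   # 4, 4 and 9
mse_manual = squared_errors.mean()       # (4 + 4 + 9) / 3

print(mse_manual)
print(mean_squared_error(y_true, y_hat))  # same value
```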

To get the error in £, we can take the square root. This gives the Root Mean Squared Error (RMSE).

In [9]:
from sklearn.metrics import mean_squared_error
# Note: newer scikit-learn releases deprecate `squared`; np.sqrt(mse) gives the same value
rmse = mean_squared_error(y_test, y_pred, squared=False) # by setting squared to false, we get the RMSE value
rmse

6.74025672027221
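
As a quick check that does not depend on the `squared` parameter, the RMSE can also be obtained by taking the square root of the MSE with NumPy. The sketch below reuses the MSE value printed earlier in this notebook; the variable names are only illustrative.

```python
import numpy as np

# MSE value copied from the output shown earlier
mse_from_notebook = 45.4310606551747

# RMSE is the square root of the MSE, so it is back in the target's own units
rmse_from_mse = np.sqrt(mse_from_notebook)

print(rmse_from_mse)
```

The result agrees with the RMSE reported above, approximately 6.74.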