In [143]:
from IPython.display import display, HTML
display(HTML('<img style="width:100%" src="./assets/head.jpg"><h1 style="text-align:center">Module 1: Python for Machine Learning</h1><h1 style="text-align:center;"><em>Linear Regression</em></h1>'))


#### The Supervised Machine Learning Structure:
*assumptions and definitions of regression and supervised machine learning*

* $y$ -- target
* $x$ -- feature
* $X$ -- features
* $y = f(x)$ -- relationship
* $\hat{y}$ -- predicted $y$
* $\hat{f}(x)$ -- predicted relationship
* $y \in \mathbb{R}$ -- y is a real number

#### Example Problem:
If $y$ is the rating of a film and $x$ is the age of the viewer, use linear regression to explore the relationship between a viewer's age and how highly they rate the film.


Let's say that the function of $y$ given $x$ ($y = f(x)$) is taken to be $0.08x + 0.5$

In [4]:
def f_rating(x):
    return 0.08*x + 0.5

In [32]:
{
    0: f_rating(0), 
    80: f_rating(80), 
    10: f_rating(10)}

{0: 0.5, 80: 6.9, 10: 1.3}

Defining our function means we can now say that `y = f_rating(x)`.

From here we can also define our function for $\hat{f}$, the predicted relationship.

In [33]:
def fhat_rating(x):
    return 0.07*x + 0.6

In [34]:
{
    0: fhat_rating(0),
    80: fhat_rating(80),
    10: fhat_rating(10)
}

{0: 0.6, 80: 6.2, 10: 1.3}

To calculate our error, we can simply subtract the predicted rating from the true rating, $f(x) - \hat{f}(x)$, for any given value of $x$

In [36]:
f_rating(21) - fhat_rating(21)

0.10999999999999943

From here we can square the error, then take the mean of several squared errors to calculate our mean squared error -- our loss function

$loss(\hat{y}, y) = (\hat{y} - y)^2$ -- how incorrect a given point is

$L = \overline{(\hat{y} - y)^2} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$ -- how incorrect the entire model is

In [49]:
def error(x):
    return f_rating(x) - fhat_rating(x)


In [50]:
round(error(21), 3)

0.11

In [51]:
def loss(x):
    return error(x)**2

In [52]:
round(loss(21), 3)

0.012
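
The single-point loss above extends naturally to a mean squared error over several ages. The sketch below redefines `f_rating` and `fhat_rating` from the earlier cells so that it runs on its own; the helper name `mean_squared_error` and the sample ages are assumptions chosen for illustration, not part of the original notebook.

```python
def f_rating(x):
    """True relationship assumed in this example."""
    return 0.08 * x + 0.5


def fhat_rating(x):
    """Estimated relationship assumed in this example."""
    return 0.07 * x + 0.6


def mean_squared_error(ages):
    """Average the squared difference between f and f-hat over the given ages."""
    squared_errors = [(f_rating(age) - fhat_rating(age)) ** 2 for age in ages]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical sample of viewer ages, for illustration only.
sample_ages = [0, 10, 21, 80]
mse = mean_squared_error(sample_ages)
print(round(mse, 4))
```

Averaging keeps the result comparable between samples of different sizes, which a plain sum of squared errors would not.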

##### By using a CSV file together with pandas and NumPy, we can instead calculate the loss for several data points, comparing each prediction to the given $y$ value in the CSV, like so:

1. Import pandas and NumPy, have pandas read the CSV file, and then turn it into a NumPy array

In [132]:
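import pandas as pd
import numpy as np

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
df = pd.read_csv("./assets/movie_age.csv")

# Convert to a NumPy array: column 0 is age, column 1 is rating.
ndf = df.to_numpy()
ndf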

array([[10,  7],
       [ 8,  7],
       [33,  5],
       [84,  8],
       [36,  6],
       [23,  7],
       [12,  3],
       [74,  8],
       [12, 10],
       [47,  2],
       [23,  6],
       [34,  6],
       [23,  8],
       [75,  8],
       [23,  8],
       [21,  2],
       [85, 10]], dtype=int64)

2. Split the NumPy array so the first column becomes its own data array: this is our $x$ value, age

In [133]:
ageData = ndf[:, 0]

3. Create a new list and, for every datum in the age array, use $\hat{f}(x)$ from earlier to calculate $\hat{y}$ -- this can then be inserted as a new column in the original pandas DataFrame

In [134]:
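# Predict a rating (y-hat) for every age in the dataset.
# Note: despite its name, `loss` holds predictions, and it replaces the loss() function defined earlier.
loss = [fhat_rating(age) for age in ageData]

# Add the predictions as a third column (the final argument is allow_duplicates).
df.insert(2, "PredY", loss, True)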

In [135]:
df

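|  | age | rating | PredY |
|---|---|---|---|
| 0 | 10 | 7 | 1.3 |
| 1 | 8 | 7 | 1.16 |
| 2 | 33 | 5 | 2.91 |
| 3 | 84 | 8 | 6.48 |
| 4 | 36 | 6 | 3.12 |
| 5 | 23 | 7 | 2.21 |
| 6 | 12 | 3 | 1.44 |
| 7 | 74 | 8 | 5.78 |
| 8 | 12 | 10 | 1.44 |
| 9 | 47 | 2 | 3.89 |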
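
Because `fhat_rating` uses only multiplication and addition, the loop in step 3 could also be replaced by a single call on a whole NumPy array. This is a sketch of that alternative using the first three ages from the dataset; `sample_ages` and `pred_y` are illustrative names, not part of the original notebook.

```python
import numpy as np


def fhat_rating(x):
    """Estimated relationship from the earlier cell."""
    return 0.07 * x + 0.6


# First three ages from the dataset shown above.
sample_ages = np.array([10, 8, 33])

# NumPy applies the arithmetic element-wise, so one call predicts every rating.
pred_y = fhat_rating(sample_ages)
print(pred_y)
```

The predictions should match the first three `PredY` values in the table, and the same call works unchanged on the full age array.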


4. Now, take the second column from the NumPy array: rating. For every value of $y$, subtract the calculated $\hat{y}$ and square the result to find the squared error. We will also add this as a column to our pandas DataFrame so it is clear what we have done so far

In [136]:
rating = ndf[:, 1]
rating

array([ 7,  7,  5,  8,  6,  7,  3,  8, 10,  2,  6,  6,  8,  8,  8,  2, 10],
      dtype=int64)

In [137]:
# `loss` is the list of predicted ratings built in step 3; NumPy subtracts it element-wise.
sqerror = (rating - loss)**2
df.insert(3, "sqError", sqerror, True)

In [138]:
df

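|  | age | rating | PredY | sqError |
|---|---|---|---|---|
| 0 | 10 | 7 | 1.3 | 32.49 |
| 1 | 8 | 7 | 1.16 | 34.1056 |
| 2 | 33 | 5 | 2.91 | 4.3681 |
| 3 | 84 | 8 | 6.48 | 2.3104 |
| 4 | 36 | 6 | 3.12 | 8.2944 |
| 5 | 23 | 7 | 2.21 | 22.9441 |
| 6 | 12 | 3 | 1.44 | 2.4336 |
| 7 | 74 | 8 | 5.78 | 4.9284 |
| 8 | 12 | 10 | 1.44 | 73.2736 |
| 9 | 47 | 2 | 3.89 | 3.5721 |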


5. Finally, we can use the `mean` function from the `statistics` module to calculate the mean of all of our squared errors -- $\overline{(y-\hat{y})^2}$ -- to get our final value of $L$

In [139]:
from statistics import mean
mean(sqerror)

17.398994117647057
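
Steps 1 to 5 can also be condensed into a few vectorised pandas operations. The sketch below types in the ages and ratings from the array printed in step 1 rather than reading the CSV, so that it runs on its own; the name `data` is an assumption made for illustration.

```python
import pandas as pd

# Ages and ratings copied from the array printed in step 1.
data = pd.DataFrame({
    "age": [10, 8, 33, 84, 36, 23, 12, 74, 12, 47, 23, 34, 23, 75, 23, 21, 85],
    "rating": [7, 7, 5, 8, 6, 7, 3, 8, 10, 2, 6, 6, 8, 8, 8, 2, 10],
})

# Predicted rating from the estimated relationship 0.07 * age + 0.6.
data["PredY"] = 0.07 * data["age"] + 0.6

# Squared error per row, then the mean over all rows.
data["sqError"] = (data["rating"] - data["PredY"]) ** 2
mse = data["sqError"].mean()
print(mse)
```

The printed value should agree with the result of `mean(sqerror)` above, up to floating-point rounding.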

*Whilst admittedly this could be done much quicker without Python for a small dataset such as this, when working with much larger datasets it makes sense to have a simple script which is capable of processing all the data - and without making errors or getting bored!*