In [144]:
from IPython.display import display, HTML
display(HTML('<img style="width:100%" src="./assets/head.jpg"><h1 style="text-align:center">Module 1: Python for Machine Learning</h1><h1 style="text-align:center;"><em>Linear Regression</em></h1>'))

### Part One: The Basic Pythonic Statistics
---
*Understanding the setup of machine learning with python - what are the functions used, how might they interact, and what basic calculations are needed to begin to structure a linear regression model and calculate loss*

**NOTATIONS FOR PART ONE**

* $y$ -- target
* $x$ -- feature
* $X$ -- features
* $\mathcal{n}$ -- the total number of datapoints in a sample
* $\mathcal{N}$ -- the total number of datapoints in the population^1
* $y = f(x)$ -- relationship
* $\hat{y}$ -- predicted $y$
* $\hat{f}(x)$ -- predicted relationship
* $err = (\hat{y} - y)$ -- error (distance between predicted and actual)
* $loss = err^2 = (\hat{y} - y)^2$ -- loss (squared error): how wrong a specific predicted $\hat{y}$ is
* $L$ (or $mse$) $= \bar{err^2} = \frac{\sum(err^2)}{\mathcal{n}} = \frac{\sum(\hat{y}-y)^2}{\mathcal{n}}$ -- Loss (mean of the squared errors): how wrong the whole model is

###### ^1 Whilst this is valuable to know, we will only ever work with a sample rather than the whole population, so $\mathcal{N}$ is unlikely to appear in our linear regression work

1. The first step is to define the relationship $y = f(x)$ as a formula

In [4]:
def f_rating(x):
    return 0.08*x + 0.5

In [32]:
{
    0: f_rating(0), 
    80: f_rating(80), 
    10: f_rating(10)}

{0: 0.5, 80: 6.9, 10: 1.3}

2. In this example, $\hat{f}(x)$ also needs to be defined, as we are treating the previous $f(x)$ as the real relationship rather than a predicted one -- in future samples there will be real values of $y$, but when working without them, defining both $f$ and $\hat{f}$ makes a comparison possible

In [33]:
def fhat_rating(x):
    return 0.07*x + 0.6

In [34]:
{
    0: fhat_rating(0),
    80: fhat_rating(80),
    10: fhat_rating(10)
}

{0: 0.6, 80: 6.2, 10: 1.3}

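3. Now we can begin to calculate our loss, starting with the error. As the error is just the difference between $f(x)$ and $\hat{f}(x)$ (that is, between $y$ and $\hat{y}$), this is simple enough; the sign does not matter because the error is squared in the next step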

In [36]:
f_rating(21) - fhat_rating(21)

0.10999999999999943

4. And naturally from here we can square it, and then take the mean of multiple squared errors to calculate our mean squared error -- our loss function

In [49]:
def error(x):
    return f_rating(x) - fhat_rating(x)


In [50]:
round(error(21), 3)

0.11

In [51]:
def loss(x):
    return error(x)**2

In [52]:
round(loss(21), 3)

0.012
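
Steps 3 and 4 can also be chained into a single mean squared error over several inputs. The following is a minimal sketch, assuming the `f_rating` and `fhat_rating` definitions above; the helper name `mse_over` and the sample values in `xs` are made up for illustration and do not come from a dataset.

```python
from statistics import mean

def f_rating(x):
    # "real" relationship from step 1
    return 0.08*x + 0.5

def fhat_rating(x):
    # predicted relationship from step 2
    return 0.07*x + 0.6

def mse_over(xs):
    # hypothetical helper: mean of the squared errors for every x in xs
    return mean((fhat_rating(x) - f_rating(x))**2 for x in xs)

xs = [0, 10, 21, 80]   # illustrative inputs only
print(round(mse_over(xs), 4))
```

Each term is the same squared error that `loss(x)` returns for a single value; the mean simply summarises them as one number for the whole set of inputs.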

**IT IS SIMPLE TO APPLY THE SAME APPROACH TO READING AND PROCESSING DATA IN A CSV FILE AS WELL**

5. Import Pandas and Numpy, have Pandas read the CSV file and then turn it into a Numpy array

In [132]:
import pandas as pd
import numpy as np

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("./assets/movie_age.csv")
ndf = df.to_numpy()   # 2-D NumPy array: column 0 is age, column 1 is rating
ndf

array([[10,  7],
       [ 8,  7],
       [33,  5],
       [84,  8],
       [36,  6],
       [23,  7],
       [12,  3],
       [74,  8],
       [12, 10],
       [47,  2],
       [23,  6],
       [34,  6],
       [23,  8],
       [75,  8],
       [23,  8],
       [21,  2],
       [85, 10]], dtype=int64)

6. Split the Numpy array so the first column becomes its own data array: this is our $x$ value, age

In [133]:
ageData = ndf[:, 0]

7. Create a new list and, for every datum in the age array, use the $\hat{f}(x)$ from earlier to calculate $\hat{y}$ -- this can then be inserted as a new column in the original Pandas DataFrame

In [134]:
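# NOTE: despite its name, this list holds the predictions (y-hat), not losses,
# and it replaces the loss() function defined earlier in this part
loss = []
for age in ageData:
    loss.append(fhat_rating(age))

# insert the predictions as the third column (position 2)
df.insert(2, "PredY", loss, True)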
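
The loop in step 7 can also be written as a single column operation, because arithmetic on a pandas Series is applied element by element. This is a minimal sketch, assuming a DataFrame with an `age` column; the three rows are typed in from the start of the array above rather than read from the CSV, and `demo` is a name made up for this example.

```python
import pandas as pd

def fhat_rating(x):
    return 0.07*x + 0.6

# first three rows of the data shown above, typed in by hand
demo = pd.DataFrame({"age": [10, 8, 33], "rating": [7, 7, 5]})

# passing the whole column applies the formula to every row at once
demo["PredY"] = fhat_rating(demo["age"])
print(demo)
```

The explicit loop and the column operation should produce the same predictions; the second form is usually shorter and faster on large datasets.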

In [135]:
df

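|   | age | rating | PredY |
|---|-----|--------|-------|
| 0 | 10 | 7 | 1.3 |
| 1 | 8 | 7 | 1.16 |
| 2 | 33 | 5 | 2.91 |
| 3 | 84 | 8 | 6.48 |
| 4 | 36 | 6 | 3.12 |
| 5 | 23 | 7 | 2.21 |
| 6 | 12 | 3 | 1.44 |
| 7 | 74 | 8 | 5.78 |
| 8 | 12 | 10 | 1.44 |
| 9 | 47 | 2 | 3.89 |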


8. Now, take the second column from the numpy array - rating. For every value of $y$, subtract the calculated $\hat{y}$ and square the result to find the squared error. We will also add this as a column to our Pandas DataFrame so it is clear what we have done so far

In [136]:
rating = ndf[:, 1]
rating

array([ 7,  7,  5,  8,  6,  7,  3,  8, 10,  2,  6,  6,  8,  8,  8,  2, 10],
      dtype=int64)

In [137]:
sqerror = (rating-loss)**2
df.insert(3, "sqError", sqerror, True)

In [138]:
df

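|   | age | rating | PredY | sqError |
|---|-----|--------|-------|---------|
| 0 | 10 | 7 | 1.3 | 32.49 |
| 1 | 8 | 7 | 1.16 | 34.1056 |
| 2 | 33 | 5 | 2.91 | 4.3681 |
| 3 | 84 | 8 | 6.48 | 2.3104 |
| 4 | 36 | 6 | 3.12 | 8.2944 |
| 5 | 23 | 7 | 2.21 | 22.9441 |
| 6 | 12 | 3 | 1.44 | 2.4336 |
| 7 | 74 | 8 | 5.78 | 4.9284 |
| 8 | 12 | 10 | 1.44 | 73.2736 |
| 9 | 47 | 2 | 3.89 | 3.5721 |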


9. Finally, we can use the mean function from the statistics module to calculate the mean of all of our squared errors -- $\overline{(y-\hat{y})^2}$ -- to get our final value of $L$

In [139]:
from statistics import mean
mean(sqerror)

17.398994117647057
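
The same value of $L$ can be reached in a couple of lines with NumPy, since the subtraction, squaring and mean all work on whole arrays. This is a sketch, assuming the ages and ratings are typed in from the array printed in step 5 and that the predictor is the `fhat_rating` formula from earlier; `pred` and `mse` are names chosen for this example.

```python
import numpy as np

# columns of the array shown in step 5, typed in by hand
age = np.array([10, 8, 33, 84, 36, 23, 12, 74, 12, 47, 23, 34, 23, 75, 23, 21, 85])
rating = np.array([7, 7, 5, 8, 6, 7, 3, 8, 10, 2, 6, 6, 8, 8, 8, 2, 10])

pred = 0.07*age + 0.6                 # same formula as fhat_rating
mse = np.mean((pred - rating)**2)     # mean of the squared errors
print(round(float(mse), 3))
```

The result should agree with the value above up to floating-point rounding.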

*Whilst admittedly this could be done much quicker without python for a small dataset such as this, when working with much larger datasets it makes sense to have a simple script which is capable of processing all the data - and without making errors or getting bored!*

### Part Two: Iteration for Loss
---
*Use of a small sample of training data along with previous functions to incorporate a for-loop for loss, demonstrating how simple and effective it is to calculate loss when taking advantage of automation such as iteration*

**NOTATIONS FOR PART TWO**
* $\mathcal{D} = \{X, y \}$ -- the dataset with $X$ being the features and $y$ the target
* $x_0$ -- the first feature
* $x_n$ -- the $n^{th}$ feature
* $\mathcal{D}_{train}$ -- the dataset for training
* $\mathcal{D}_{test}$ -- the dataset for testing
* $\mathcal{D} = \{\{x_0^0, x_1^0, \dots x_n^0, y^0\}, \{x_0^1, x_1^1, \dots x_n^1, y^1\}, \dots \{x_0^n, x_1^n, \dots x_n^n, y^n\}\}$ -- the dataset defined by a total $^n$ rows and $_n$ features
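
One way to read the index notation is to map it onto nested Python lists, where the superscript picks the row and the subscript picks the feature. This is only a sketch with made-up values; the names `D`, `X` and `y` follow the notation above, and the two-feature layout is an assumption for illustration.

```python
# hypothetical dataset: two features per row, followed by the target y
D = [
    [10, 120, 8.5],   # row 0: x_0^0, x_1^0, y^0
    [45,  95, 9.1],   # row 1: x_0^1, x_1^1, y^1
    [23, 150, 9.9],   # row 2: x_0^2, x_1^2, y^2
]

X = [row[:-1] for row in D]   # features: every column except the last
y = [row[-1] for row in D]    # target: the last column

i, j = 1, 0
print(X[i][j])   # feature j of row i, written x_j^i in the notation above
print(y[i])      # target of row i, written y^i
```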

1. Start by defining the training dataset

In [155]:
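# Note: the braces make Dtrain a set of (x, y) tuples, so the rows are not
# guaranteed to be visited in the order written here
Dtrain = {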
    (10, 8.5),
    (45, 9.1),
    (23, 9.9),
    (97, 6.4),
    (17, 8.6),
    (13, 1.2),
    (64, 2.0),
    (17, 9.0),
    (75, 8.4),
    (43, 4.9),
    (43, 4.1),
    (56, 3.5)
}

2. We then define our $f(x)$ function, `f(x)`, and our $loss(\hat{y}, y)$ function, `loss(yhat, y)`

In [156]:
def f(x):
    return 0.08*x + 0.6

In [179]:
def loss(yhat, y):
    return (yhat-y)**2

3. We can then use a for-loop to run through the data in our dataset, and compute the following
    * Calculate $\hat{y}$ given $x$
    * Calculate $loss$ given $\hat{y}$ and $y$
    * Add $\hat{y}$ to the list 'predicted'
    * Add $loss$ to the list 'sq_error'
    * Print out our values for $\hat{y}$, $y$, and $loss$

In [185]:
predicted = []
sq_error = []
print(" y^  |   y   |  loss")
print("-----+-------+-------")
for (x, y) in Dtrain:
    yhat = f(x)               # prediction for this x
    l = loss(yhat, y)         # squared error for this row
    predicted.append(yhat)
    sq_error.append(l)
    print(round(yhat, 1), " | ", y, " | ", round(l, 3))

 y^  |   y   |  loss
-----+-------+-------
1.4  |  8.5  |  50.41
4.0  |  4.9  |  0.74
4.0  |  4.1  |  0.004
1.6  |  1.2  |  0.194
2.4  |  9.9  |  55.652
4.2  |  9.1  |  24.01
2.0  |  9.0  |  49.562
6.6  |  8.4  |  3.24
5.1  |  3.5  |  2.496
8.4  |  6.4  |  3.842
2.0  |  8.6  |  44.09
5.7  |  2.0  |  13.838


In [186]:
print(predicted, "\n\n", sq_error)

[1.4, 4.04, 4.04, 1.6400000000000001, 2.44, 4.2, 1.96, 6.6, 5.08, 8.36, 1.96, 5.72] 

 [50.41, 0.7396000000000006, 0.003599999999999953, 0.19360000000000016, 55.651600000000016, 24.009999999999994, 49.5616, 3.2400000000000024, 2.4964000000000004, 3.8415999999999966, 44.0896, 13.838399999999998]
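
The two lists above hold one value per row; averaging the squared errors gives the Loss $L$ for the whole training set. This is a sketch, assuming the same `f` and `loss` as above. `Dtrain_list` is a hypothetical list version of the training data, used here so that the rows keep the order in which they are written.

```python
from statistics import mean

def f(x):
    return 0.08*x + 0.6

def loss(yhat, y):
    return (yhat - y)**2

# same twelve points as Dtrain, stored in a list so the order is fixed
Dtrain_list = [
    (10, 8.5), (45, 9.1), (23, 9.9), (97, 6.4), (17, 8.6), (13, 1.2),
    (64, 2.0), (17, 9.0), (75, 8.4), (43, 4.9), (43, 4.1), (56, 3.5),
]

sq_error = [loss(f(x), y) for (x, y) in Dtrain_list]
L = mean(sq_error)   # mean squared error over the training set
print(round(L, 3))
```

Using a list rather than a set also keeps both `(43, 4.9)` and `(43, 4.1)` in a predictable position, which makes the printed rows easier to compare with the source data.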


4. By applying this same logic to the CSV from earlier, it quickly becomes apparent how much time this saves when calculating our predictions and loss values, as shown below:

In [193]:
import pandas as pd
import numpy as np
Dframe = pd.read_csv("./assets/movie_age.csv")  # read_csv already returns a DataFrame
Darray = Dframe.to_numpy()
x_csv = []
y_csv = []
predicted_csv = []
sq_error_csv = []

for (x, y) in Darray:
    x_csv.append(x)
    y_csv.append(y)
    yhat = f(x)
    l = loss(yhat, y)
    predicted_csv.append(yhat)
    sq_error_csv.append(l)
    
Dataset = pd.DataFrame({"x": x_csv, "y": y_csv, "yhat": predicted_csv, "squared_error": sq_error_csv})
Dataset

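|   | x | y | yhat | squared_error |
|---|---|---|------|---------------|
| 0 | 10 | 8.5 | 1.4 | 50.41 |
| 1 | 43 | 4.9 | 4.04 | 0.7396 |
| 2 | 43 | 4.1 | 4.04 | 0.0036 |
| 3 | 13 | 1.2 | 1.64 | 0.1936 |
| 4 | 23 | 9.9 | 2.44 | 55.6516 |
| 5 | 45 | 9.1 | 4.2 | 24.01 |
| 6 | 17 | 9.0 | 1.96 | 49.5616 |
| 7 | 75 | 8.4 | 6.6 | 3.24 |
| 8 | 56 | 3.5 | 5.08 | 2.4964 |
| 9 | 97 | 6.4 | 8.36 | 3.8416 |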


### Part Three: Regression and Classification
---
*Whilst regression requires $y$ to be a real number, classification focuses on either binary outcomes (e.g., 0 and 1; -1 and 1) or multiclass outcomes (e.g., blonde, brunette, ginger). These can also be computed in python, although any non-numerical categories will need to be dummy-coded with numbers for computational purposes*

**NOTATIONS FOR PART THREE**

* $ y \in \mathbb{R}$ -- regression outcome; $y$ is a real number
* $y \in \{ -1, +1 \}$ -- binary classification outcome; $y$ can be of two options
* $y \in \{a_{lpha}, b_{ravo}, c_{harlie} \dots \}$ -- multiclass outcome; $y$ can be of many options
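
Dummy coding a non-numerical category can be as simple as giving each distinct label an integer. This is a minimal sketch using plain Python; the hair-colour labels and the names `codes`, `encoded` and `decode` are made up for illustration.

```python
# hypothetical hair-colour labels, used only to illustrate dummy coding
labels = ["blonde", "brunette", "ginger", "brunette", "blonde"]

# give each distinct category an integer code, in order of first appearance
codes = {}
for label in labels:
    if label not in codes:
        codes[label] = len(codes)

encoded = [codes[label] for label in labels]
decode = {code: label for label, code in codes.items()}

print(encoded)                       # integer codes, one per label
print([decode[c] for c in encoded])  # back to the original categories
```

The integers here are only identifiers: they carry no order or size, so a code of 2 does not mean "more" than a code of 1.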

1. We start by defining our function of x ( $f(x)$ ) as we have done previously. This time, however, rather than running a linear equation to calculate our $\hat{y}$ a simple if/else statement can be used. For example, if 1 = like, -1 = dislike, and $x$ is a movie's length, we could create a function as follows:

In [206]:
def classify(x):
    if x > 200:
        return -1
    else:
        return +1

In [207]:
classify(180)

1

2. We can even use dictionaries to decode our dummy variables and return the actual category

In [208]:
decode = {
    1: "like",
    -1: "dislike"
}

In [209]:
def classify(x):
    if x > 200:
        yhat = -1
        return decode[yhat]
    else:
        yhat = 1
        return decode[yhat]

In [210]:
classify(209), classify(158), classify(92)

('dislike', 'like', 'like')
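
The same if/else pattern extends to a multiclass outcome by adding more branches. This is a sketch only: the three length categories, the cut-offs of 90 and 150 minutes, and the names `decode_length` and `classify_length` are all assumptions chosen for illustration.

```python
# hypothetical codes and cut-offs (in minutes), chosen only for illustration
decode_length = {0: "short", 1: "medium", 2: "long"}

def classify_length(x):
    if x < 90:
        yhat = 0
    elif x <= 150:
        yhat = 1
    else:
        yhat = 2
    return decode_length[yhat]

print(classify_length(75), classify_length(120), classify_length(209))
```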

### Part Four: Summary and Machine Learning Setup in Python
---
*Ensuring familiarity with previous topics, as well as setting the groundwork for exploring further machine learning setup within python prior to exploring more complex tasks and algorithms*

**SET NOTATIONS FOR DIFFERENT TYPES OF PROBLEM**
* $ y \in \mathbb{R}$ -- regression outcome: $y$ is in the set of all real numbers
* $y \in \{ -1, +1 \}$ -- binary classification outcome: $y$ is in the set of {$-1, 1$}
* $y \in \{a, b, c \dots \}$ -- multiclass outcome: $y$ is in the set of {$a$, $b$, $c$, $\dots$}

$\mathcal{Task}$ $\mathcal{One}$

*Given the age of a viewer and the length of a film, predict if the viewer will like or dislike the film*

1. Start by creating the training dataset and defining your f(x) and loss(yhat, y) functions*

###### *normally, you would calculate a regression function, and this will be explored later: for now, simply make one up of your choosing

In [427]:
Dtrain = {
    'Age': [54,72,36,45,32,6,85,2,16,48,42,32,36,1,92,12,8,43,58,10],
    'Length': [87,160,250,132,184,96,234,216,190,87,145,69,152,167,273,97,164,189,151,163],
    'Liked': [1,1,1,-1,-1,1,-1,-1,1,1,-1,1,1,1,-1,-1,-1,1,1,-1]
}

x1 = Dtrain['Age']
x2 = Dtrain['Length']
y = Dtrain['Liked']

In [394]:
def f(x1, x2):
    if 2*(x1 - 40) + (x2-135) > 0:
        return 1
    else:
        return -1

In [429]:
def err(yh, y):
    # 0-1 loss: 1 for a wrong prediction, 0 for a correct one
    if yh != y:
        return 1
    else:
        return 0

2. Create a for-loop to run through the dataset, computing predictions and loss for each value. Save these to a pandas DataFrame for the sake of visual ease

In [430]:
predictions = []
squared_error = []
actual = []
for age, length, liked in zip(x1, x2, y):
    actual.append(liked)
    yh = f(age, length)
    predictions.append(yh)
    error = err(yh, liked)   # arguments in the order err(yh, y) expects
    squared_error.append(error)

In [397]:
D = pd.DataFrame({"age": x1, "length": x2, "liked": y, "y^": predictions, "e^2": squared_error})
D

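|   | age | length | liked | y^ | e^2 |
|---|-----|--------|-------|----|-----|
| 0 | 54 | 87 | 1 | -1 | 1 |
| 1 | 72 | 160 | 1 | 1 | 0 |
| 2 | 36 | 250 | 1 | 1 | 0 |
| 3 | 45 | 132 | -1 | 1 | 1 |
| 4 | 32 | 184 | -1 | 1 | 1 |
| 5 | 6 | 96 | 1 | -1 | 1 |
| 6 | 85 | 234 | -1 | 1 | 1 |
| 7 | 2 | 216 | -1 | 1 | 1 |
| 8 | 16 | 190 | 1 | 1 | 0 |
| 9 | 48 | 87 | 1 | -1 | 1 |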


In [398]:
from statistics import *
mse = mean(squared_error)

print(mse)

0.55
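
Because `err` returns 1 for a wrong prediction and 0 for a correct one, the mean computed above is the fraction of misclassified rows, and one minus that value is the accuracy. This is a sketch, assuming the `Dtrain` lists and the first version of `f` from above; `wrong`, `error_rate` and `accuracy` are names introduced here.

```python
age =    [54,72,36,45,32,6,85,2,16,48,42,32,36,1,92,12,8,43,58,10]
length = [87,160,250,132,184,96,234,216,190,87,145,69,152,167,273,97,164,189,151,163]
liked =  [1,1,1,-1,-1,1,-1,-1,1,1,-1,1,1,1,-1,-1,-1,1,1,-1]

def f(x1, x2):
    # first decision rule from above
    return 1 if 2*(x1 - 40) + (x2 - 135) > 0 else -1

# count the rows where the prediction and the actual label disagree
wrong = sum(1 for a, l, y in zip(age, length, liked) if f(a, l) != y)
error_rate = wrong / len(liked)
accuracy = 1 - error_rate
print(wrong, error_rate, round(accuracy, 2))
```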


3. Now that we've calculated an $mse$ of 0.55, we can see that our formula is really far out: it gets more than half of the viewers wrong! As a result, we want to start to tweak it slightly, and see if we can get that error closer to zero

In [434]:
def f(x1, x2):
    if (x1**2) + (x2-1) > 0:
        return 1
    else:
        return -1

In [435]:
from statistics import *
import pandas as pd

def err(yh, y):
    if yh - y != 0:
        return 1
    else: 
        return yh-y

predictions = []
squared_error = []
actual = []
for age, length, liked in zip(x1, x2, y):
    actual.append(liked)
    yh = f(age, length)
    predictions.append(yh)
    error = err(yh, liked)   # arguments in the order err(yh, y) expects
    squared_error.append(error)

In [436]:
mse2 = mean(squared_error)
mse2

0.45

4. Even after trying some pretty drastic changes, there doesn't seem to be any significant improvement in our mse. This suggests either:
* There is no real relationship between age, length of film, and enjoyment within our sample
* There is a relationship, but it's not linear, and we'd need to use a different approach to figure it out
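
A useful reference point when judging these numbers is a rule that ignores the features entirely and always predicts the most common label. This is a sketch, assuming the `Liked` column from `Dtrain`; `majority`, `count` and `baseline_error` are names introduced here.

```python
from collections import Counter

liked = [1,1,1,-1,-1,1,-1,-1,1,1,-1,1,1,1,-1,-1,-1,1,1,-1]

# most common label and how often it occurs
majority, count = Counter(liked).most_common(1)[0]

# error rate of a rule that always predicts the majority label
baseline_error = 1 - count / len(liked)
print(majority, round(baseline_error, 2))
```

A formula only offers evidence of a relationship if its error falls clearly below this baseline.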

### Appendix: Formulae and Notation
##### $all$ $notations$ $used$ $in$ $this$ $module$ $are$ $detailed$ $below$

**NOTATIONS FOR INPUT DATA**

* $x$ -- feature
* $X$ -- features
* $y$ -- target
* $\mathcal{n}$ -- the total number of datapoints in a sample
* $\hat{y}$ -- predicted $y$

**NOTATIONS FOR FUNCTIONS OF X**

* $f(x)$ -- the function of x (in respect to y): relationship between $y$ and $x$
* $\hat{f}(x)$ -- the predicted function of x (in respect to y): predicted relationship between $y$ and $x$

**NOTATIONS FOR FORMULAE**

* *error ($err$)* => $(\hat{y} - y)$ -- distance between predicted y and actual y
* *squared error ($err^2$) / loss ($loss$)* => $(\hat{y} - y)^2$ -- how wrong a specific predicted $\hat{y}$ is
* *mean squared error ($mse$) / Loss ($L$)* => $\bar{err^2} = \frac{\sum(err^2)}{\mathcal{n}} = \frac{\sum(\hat{y}-y)^2}{\mathcal{n}}$ -- how wrong the whole model is

**NOTATIONS FOR DATASETS**
* $\mathcal{D} = \{X, y \}$ -- the dataset
* $\mathcal{D}_{train}$ -- the dataset for training
* $\mathcal{D}_{test}$ -- the dataset for testing
* $x_0$ -- the first feature: $x$ at index 0
* $x_n$ -- the $n^{th}$ feature: $x$ at index $\mathcal{n}$

*nb: below is the detailed formula for $\mathcal{D}$, where each feature ($x$) is denoted as $x_a^b$:*
* $x_n$ = the $n^{th}$ feature (column) in the dataset
* $x^n$ = the $n^{th}$ instance (row) in the dataset
* $\mathcal{D} = \{\{ x_0^0, x_1^0, \dots x_n^0, y^0 \}, \{ x_0^1, x_1^1, \dots x_n^1, y^1 \}, \dots \{ x_0^n, x_1^n, \dots x_n^n, y^n \} \}$

**SET NOTATIONS FOR DIFFERENT TYPES OF PROBLEM**
* $ y \in \mathbb{R}$ -- regression outcome: $y$ is in the set of all real numbers
* $y \in \{ -1, +1 \}$ -- binary classification outcome: $y$ is in the set of {$-1, 1$}
* $y \in \{a, b, c \dots \}$ -- multiclass outcome: $y$ is in the set of {$a$, $b$, $c$, $\dots$}