<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Simple Linear Regression

## What is Linear Regression? 

**Regression** is about finding the line of best fit through many points on a chart. 

When we use a **Linear Regression** algorithm we're looking for the best **straight line** through the points. If the relationship in your data is not a straight line, you may need a different algorithm, such as polynomial regression or another form of nonlinear regression. 

Linear Regression should be familiar to you. You'll probably recall gathering data and plotting charts in GCSE Science experiments. You might have collected 10 readings in a double period, but by drawing a line of best fit you could infer what the data at other points might be, i.e. you could predict other values along the line.

![Best Fit](../../Images/best-fit.png)

In Machine Learning, we refer to the data along the X axis as a "feature" and the data along the Y axis as the "target variable".

In GCSE Science you'd have called the variable on the X axis the "independent variable", although we'll call it a feature. And, you'd have called the variable on the Y axis the "dependent variable", but we'll call it the "target variable". 

In essence, this means you were generally dealing with 2 columns of data at GCSE. 

As we progress through this course, you'll eventually use more complex data that has additional columns of data. 

In Machine Learning you can have as many columns of input data (features) as you like, but in this course you'll only have one output variable, i.e. one target variable.

This tutorial focuses on **Simple Linear Regression** i.e. we'll find a straight line of best fit for 2 columns of data (one feature and a target variable). 
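To make "line of best fit" a little more concrete, here is a minimal sketch that works out the slope and intercept of a straight line using the standard least-squares formulas. The readings are invented purely for illustration, and the variable names (`slope`, `intercept`, `estimate_at_6`) are assumptions rather than anything used later in this tutorial.

```python
import numpy as np

# Hypothetical readings, invented purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # feature
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # target variable

# Least-squares slope and intercept of the line y = slope * x + intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Use the line to estimate a value we never measured
estimate_at_6 = slope * 6.0 + intercept
print(slope, intercept, estimate_at_6)
```

In the rest of the tutorial SciKit Learn does this calculation for us, so there's no need to write the formulas by hand.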

# Example Data - Happiness index

Money can't buy happiness? Right? 

In this example we're going to put this to the test, by loading data that compares yearly income to a happiness index.

> This example was created using data from an R tutorial that's available here:
> https://www.scribbr.com/statistics/simple-linear-regression/


## Load the data


In [None]:
import pandas as pd
happiness_df = pd.read_csv('../../Data/income_data.csv')
happiness_df

## Plot the data on a scatter to see if it's suitable for Linear Regression

In [None]:
happiness_df.plot('income', 'happiness', kind='scatter')


According to the data there is a strong correlation between salary and happiness. 
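If you'd like a number to go with the picture, one option is the Pearson correlation coefficient, which pandas can calculate with `Series.corr`. The sketch below uses a small made-up dataframe (not the real dataset), so the column values and the name `example_df` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values, invented for illustration (not the real dataset)
example_df = pd.DataFrame({
    "income": [2.0, 3.0, 4.0, 5.0, 6.0],
    "happiness": [1.8, 2.9, 3.5, 4.6, 5.1],
})

# Pearson correlation: values near 1 suggest a strong positive linear relationship
r = example_df["income"].corr(example_df["happiness"])
print(round(r, 3))
```

A value close to 1 or -1 suggests a straight line is a reasonable fit; a value near 0 suggests it probably isn't.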

While I'm not sure of the provenance of this particular dataset, there have been other studies that concur (see datasets on https://data.world). 

Nevertheless, this is a nice, simple dataset for your first attempt at linear regression. 

Let's see if we can get a line of best fit through the points. 

### Steps

Do you remember the four steps for machine learning with SciKit Learn? 

```
1. from sklearn.module import Model
2. model = Model()
3. model.fit(X,y)
4. predictions = model.predict(new_X) 
```
Note: the code above won't run but it does show the steps!
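For comparison, a runnable sketch of the same four steps might look like the following. It uses `LinearRegression` as the model and a tiny, hypothetical dataset that follows $y = 2x + 1$ exactly; the data and variable values are assumptions chosen so the result is easy to check by eye.

```python
# 1. Import the model class
from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical training data that follows y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature, one column
y = np.array([3.0, 5.0, 7.0, 9.0])          # target variable

# 2. Create the model object
model = LinearRegression()

# 3. Train the model
model.fit(X, y)

# 4. Make predictions for new feature values
new_X = np.array([[5.0], [6.0]])
predictions = model.predict(new_X)
print(predictions)
```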

### Rules

Do you remember the rules of Machine Learning with SciKit Learn?

SciKit Learn requires:
1. numerical data
2. no missing data
3. a numpy array of data

Let's quickly test if our dataset meets these criteria.

### Checking the dataset for missing values and numerical data

In [None]:
happiness_df.info()

Good Result!
* both columns contain numbers
* both columns are "non-null" (there are no empty cells in the table of data we imported)

Just remember to convert the dataframe columns to numpy arrays when passing them to the model.
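If you'd rather check the first two rules explicitly, something like the sketch below could work. It uses a small hypothetical dataframe with one deliberately missing value, so `example_df`, `all_numeric` and `missing_counts` are illustrative names, not part of the real dataset.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the real dataset, with one missing value
example_df = pd.DataFrame({
    "income": [2.5, 4.0, np.nan, 6.1],
    "happiness": [2.0, 3.1, 4.2, 5.0],
})

# Rule 1: is every column numerical?
all_numeric = all(is_numeric_dtype(example_df[col]) for col in example_df.columns)

# Rule 2: how many missing values are there in each column?
missing_counts = example_df.isna().sum()

print(all_numeric)
print(missing_counts)
```

Any column with a count above zero would need cleaning before it's passed to a model.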

## Training the Model 

### 1. Import the Model

In [None]:
from sklearn.linear_model import LinearRegression

### 2. Create a Linear Regression Model Object

In [None]:
model = LinearRegression()

### 3. Train the Model (i.e. fit the data to the model)

The code is `model.fit(X,y)`; however, there are a couple of things we need to do first. 

Remember - SciKit Learn requires:
1. numerical data
2. no missing data
3. a numpy array of data

In this case the data is all numerical, and doesn't have missing values. However, we do need to make it into a numpy array. 

#### 3.1 Getting the features - Method 1

In [None]:
X = happiness_df['income'].values  # get the values from the income column as a numpy array
X

In [None]:
# Let's check the shape of X
X.shape


This tells us that X is a one-dimensional array of 498 items; it isn't organised into rows and columns.

We need to reshape the feature data so that it's a single column with lots of rows. 



In [None]:
X = happiness_df['income'].values.reshape(-1,1)  
X
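To see what `reshape(-1,1)` does in isolation, here is a small sketch on a hypothetical four-item array (the values and the names `values` and `column` are made up for illustration). The `-1` asks numpy to work out the number of rows for itself, and the `1` fixes the number of columns at one.

```python
import numpy as np

# Hypothetical income values, invented for illustration
values = np.array([1.5, 2.5, 3.5, 4.5])

# A one-dimensional array: 4 items, no rows or columns
print(values.shape)

# -1 lets numpy work out the number of rows; 1 fixes a single column
column = values.reshape(-1, 1)
print(column.shape)
```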

#### 3.2 Getting the features from a dataframe with 1-column

It's worth noting that a single column of a table is a Series, whereas selecting with double square brackets returns a one-column DataFrame.

In [None]:
happiness_df['income']

In [None]:
type(happiness_df['income'])

In [None]:
happiness_df[['income']]

In [None]:
type(happiness_df[['income']])

In [None]:
X = happiness_df[['income']].values
X

The target value (y) is expected to be a one-dimensional array (a single list of values), so we don't have to reshape it.
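The difference between single and double square brackets can be sketched side by side. The dataframe below is a hypothetical stand-in for the real data, and `X_example` and `y_example` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
example_df = pd.DataFrame({
    "income": [1.5, 2.5, 3.5],
    "happiness": [1.0, 2.0, 3.0],
})

# Single brackets give a Series, whose values form a 1-D array
y_example = example_df["happiness"].values

# Double brackets give a one-column DataFrame, whose values form a 2-D array
X_example = example_df[["income"]].values

print(X_example.shape, y_example.shape)
```

So double brackets give the features the rows-and-columns shape they need, while single brackets are fine for the target.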

### 4. Get the Target Value as a Numpy Array

In [None]:
y = happiness_df['happiness'].values

In [None]:
model.fit(X,y)

### 5. Let's make predictions

Make predictions for each of the following salaries:

* £20000
* £40000
* £60000

Exercise
1. Modify the code below, to make the predictions one at a time (run the cell 3 times with different values)
2. Modify the code below to make all 3 predictions in one step (run the cell 1 time to make 3 predictions) 

In [None]:
import numpy as np

salary_in_thousands = [20]
new_X = np.array(salary_in_thousands).reshape(-1,1)


In [None]:
model.predict(new_X)

# Visualise the model & plot the line of best fit

**1) Let's visualise our data one more time**

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.show()

**2) Generate points along the line of best fit by making predictions**

In [None]:
new_X = np.arange(1, 80).reshape(-1,1)


In [None]:
new_y = model.predict(new_X)

**3) Make a new plot with the line of best fit overlaid**

In [None]:
plt.cla() # clear the current axes before re-plotting
plt.scatter(X, y)
plt.plot(new_X, new_y, color='red')
plt.show()

# Summary 

Linear Regression finds a line of best fit through your data.

This tutorial was an example of Simple Linear Regression, with only 2 variables that can easily be plotted on an X and Y axis. 

At GCSE you'd have said that the data follows the pattern $ y=mx+c $, however, in Data Science we tend to write: 

$ y=ax+b $
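A fitted `LinearRegression` model stores these two numbers in its `coef_` and `intercept_` attributes. The sketch below fits a model to hypothetical data that follows $y = 2x + 1$ exactly and checks that a prediction matches $ax + b$ worked out by hand; the data and the names `demo_model`, `a` and `b` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2x + 1 exactly
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

a = demo_model.coef_[0]    # gradient of the line
b = demo_model.intercept_  # where the line crosses the Y axis

# A prediction is just a * x + b
manual = a * 10.0 + b
predicted = demo_model.predict(np.array([[10.0]]))[0]
print(a, b, manual, predicted)
```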

In the next tutorial we'll look at Multiple Linear Regression, and learn how to deal with more columns of input data (i.e. more than one feature).


