<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Train Test Split

In this tutorial, we're going to load the income_data.csv dataset, and consider how we're going to train then test the model. 

In an earlier example, we created a model for this dataset, but we really have little idea as to how well that model performs. 

This tutorial is concerned with creating a new testing dataset that we can use to test the performance of the model.

Let's start by loading income_data.csv:

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [26]:
happiness_df = pd.read_csv('../../Data/income_data.csv')
happiness_df

Unnamed: 0,income,happiness
0,38.63,2.314489
1,49.79,3.433490
2,49.24,4.599373
3,32.14,2.791114
4,71.96,5.596398
...,...,...
493,52.49,4.568705
494,34.72,2.535002
495,60.88,4.397451
496,34.41,2.070664


In [57]:
X = happiness_df['income'].values.reshape(-1,1)  # this could have been done: happiness_df[['income']].values
y = happiness_df['happiness'].values

In [54]:
model = LinearRegression()
model.fit(X, y)

At this point, we can make predictions for new values. For example, if someone makes £40,000 per year (an income value of 40, since the income column appears to be recorded in thousands), how happy will they be?

In [55]:
new_X = np.array([40]).reshape(-1,1)
new_X

array([[40]])
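To turn a value like this into a prediction, it can be passed to ```model.predict()```. The following is a minimal sketch rather than the tutorial's own cell: it uses made-up stand-in data (the ```demo_``` names and the numbers are assumptions for illustration, not income_data.csv) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data (an assumption for illustration, NOT income_data.csv):
# income in thousands, happiness roughly linear in income plus noise.
rng = np.random.default_rng(0)
demo_income = rng.uniform(15, 75, size=200)
demo_happiness = 0.07 * demo_income + 0.2 + rng.normal(0, 0.7, size=200)

# scikit-learn expects a 2-D feature array: one row per sample.
demo_X = demo_income.reshape(-1, 1)
demo_model = LinearRegression().fit(demo_X, demo_happiness)

# Same shape as the tutorial's new_X: a single row with a single feature.
new_X = np.array([40]).reshape(-1, 1)

# predict() returns a 1-D array with one predicted value per row of new_X.
prediction = demo_model.predict(new_X)
print(prediction)
```

With the real data, the same call on the fitted ```model``` would return the predicted happiness for an income of 40.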

The problem we have now is knowing how good the prediction is. 

Even if we were to find someone who had a salary of exactly £40,000, that wouldn't help us to evaluate the model. Perhaps that person has an unusually happy disposition? Maybe they're just back from the holiday of a lifetime? Or perhaps they've been bereaved this year? 

What we need is a range of values for income and happiness so that we can see how well our model performs on aggregate. 

And we can't use any of the values from our training dataset. They were used to create the model, so scoring the model on them would give an over-optimistic picture. We want to know how well our model performs on NEW data.

The easiest way to achieve this is to withhold some of the data from income_data.csv i.e. we can use some of the data for building a model, and withhold some data to test it with.

Thankfully, scikit-learn has a function for exactly this. It's called **train_test_split**



In [62]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

## Explanation

```test_size=0.3``` 

> This sets aside 30% of our data as the test set, i.e. the remaining 70% will be used to train our model

```random_state=21```

> This is a seed for a random number generator that splits the data i.e. test rows are selected at random based on the seed. 
>  Choosing the same number will guarantee the same rows are selected for the split each time, making our program repeatable, with the same rows in the training and test set each time.
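Both parameters can be checked with a small experiment. This is a sketch on a toy array of 100 rows (the ```demo_``` names and the second seed, 99, are assumptions chosen purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, so the 70/30 split is easy to verify by eye.
demo_X = np.arange(100).reshape(-1, 1)
demo_y = np.arange(100)

# Two splits with the same seed, and one with a different seed.
first = train_test_split(demo_X, demo_y, test_size=0.3, random_state=21)
second = train_test_split(demo_X, demo_y, test_size=0.3, random_state=21)
other = train_test_split(demo_X, demo_y, test_size=0.3, random_state=99)

# Number of training rows and test rows.
n_train, n_test = len(first[0]), len(first[1])
print(n_train, n_test)

# The same seed gives exactly the same four arrays.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

# A different seed will normally pick different test rows.
print(np.array_equal(first[1], other[1]))
```

Leaving ```random_state``` unset gives a fresh shuffle on every run, which is why results can change between runs of the same notebook.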


**RETURNED VALUES**

The function returns four arrays, which we unpack into 4 variables

```X_train``` 
> the training data features (in this case the income)

```X_test```
> the test data features

```y_train```
> the training data labels (in this case the happiness)

```y_test```
> the test data labels
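One detail worth seeing in action is that each row of features stays paired with its own label after the shuffle. A small sketch, assuming toy data in which every label is simply 10 times its feature (the ```demo_``` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, where each label is 10x its feature,
# so any mismatch between features and labels would be obvious.
demo_X = np.arange(10).reshape(-1, 1)
demo_y = demo_X.ravel() * 10

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=21
)

# Features stay 2-D, labels stay 1-D.
print(demo_X_train.shape, demo_X_test.shape)
print(demo_y_train.shape, demo_y_test.shape)

# Rows are shuffled, but features and labels are shuffled together.
train_aligned = np.array_equal(demo_y_train, demo_X_train.ravel() * 10)
test_aligned = np.array_equal(demo_y_test, demo_X_test.ravel() * 10)
print(train_aligned, test_aligned)
```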

## Train the model

In [63]:
model.fit(X_train, y_train)


## Calculating the Score

In [64]:
model.score(X_test, y_test)

0.7749620717403137

Our model therefore scored about 0.77 (77%) on the test set. But what does this really mean?

## Simple Linear Regression in Excel

Many people are aware that it's possible to perform Simple Linear Regression using Excel:
1. Make a scatter chart
2. Add Chart Element > Trendline

![Analysing Income and Happiness in Excel](../../Images/income-happiness-in-excel.png)

Notice the $R^2$ value at the top of the chart I produced in Excel.

In Excel, the $R^2$ value indicates how well the trendline fits the data, and the number that is shown is very similar to the score for our model using scikit-learn. That's because our ```model.score()``` is using $R^2$ as well. 

## What is model.score()?

In the case of Linear Regression, the score gives the $R^2$ value; however, the metric that ```model.score()``` uses varies with the type of model. For example, scikit-learn's regressors report $R^2$, while its classifiers report mean accuracy. 
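One way to confirm this for Linear Regression is to compare ```model.score()``` with scikit-learn's ```r2_score``` function computed on the model's own predictions. The sketch below again uses made-up stand-in data (an assumption for illustration, not income_data.csv), so the printed score will not match the 0.77 above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption for illustration, NOT income_data.csv).
rng = np.random.default_rng(0)
demo_X = rng.uniform(15, 75, size=(200, 1))
demo_y = 0.07 * demo_X.ravel() + 0.2 + rng.normal(0, 0.7, size=200)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=21
)

demo_model = LinearRegression().fit(demo_X_train, demo_y_train)

# Score reported by the model on the held-out test set.
score = demo_model.score(demo_X_test, demo_y_test)

# R^2 computed explicitly from the true labels and the predictions.
manual_r2 = r2_score(demo_y_test, demo_model.predict(demo_X_test))

print(score, manual_r2)
print(np.isclose(score, manual_r2))
```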

In the next tutorial we'll look at what $R^2$ means, and look at alternatives that can be used for scoring a Linear Regression model. 