# Detecting Bias and Variance in a Supervised Learning Model

This exercise is for learning purposes and is taken from https://www.dataquest.io/blog/learning-curves-machine-learning/

## The Data

For this exercise, we will use the "Combined Cycle Power Plant" data set from Pınar Tüfekci and Heysem Kaya. It contains the hourly electrical energy output of a power plant and was sourced from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).


In [7]:
import pandas as pd
electricity = pd.read_excel('data/Folds5x2_pp.xlsx')
# info() prints its summary directly and returns None, so no print() is needed
electricity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
AT    9568 non-null float64
V     9568 non-null float64
AP    9568 non-null float64
RH    9568 non-null float64
PE    9568 non-null float64
dtypes: float64(5)
memory usage: 373.8 KB


In [8]:
electricity.sample(3)

|      | AT | V | AP | RH | PE |
| --- | --- | --- | --- | --- | --- |
| 8339 | 24.38 | 57.17 | 1009.85 | 74.46 | 444.59 |
| 5123 | 10.32 | 40.35 | 1011.64 | 84.05 | 481.51 |
| 5179 | 7.99 | 36.08 | 1020.08 | 83.2 | 481.35 |


Here is a brief description of what each column means:

| Column | Description |
| --- | --- |
| AT | Ambient Temperature |
| V | Exhaust Vacuum |
| AP | Ambient Pressure |
| RH | Relative Humidity |
| PE | Electrical Energy Output |

Since we are trying to predict the energy output from the other columns, **PE** will be our target variable (our "Y"), and the remaining four columns will be the features.
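As a minimal sketch of that split, the snippet below separates features from target on the three sampled rows shown earlier, re-typed by hand so it runs without the Excel file. The names `features`, `target`, `X` and `y` are assumptions made for illustration and do not come from the notebook.

```python
import pandas as pd

# The three rows printed by electricity.sample(3), copied by hand
# so that this snippet does not need data/Folds5x2_pp.xlsx.
sample = pd.DataFrame(
    {
        "AT": [24.38, 10.32, 7.99],
        "V": [57.17, 40.35, 36.08],
        "AP": [1009.85, 1011.64, 1020.08],
        "RH": [74.46, 84.05, 83.20],
        "PE": [444.59, 481.51, 481.35],
    },
    index=[8339, 5123, 5179],
)

# Hypothetical variable names: four input columns and one target column.
features = ["AT", "V", "AP", "RH"]
target = "PE"

X = sample[features]  # DataFrame holding the inputs
y = sample[target]    # Series holding the value to predict
print(X.shape, y.shape)
```

The same two selections should work unchanged on the full `electricity` DataFrame.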

## Learning Curves

We will train the model using different training set sizes and generate a **learning curve**. We will use an 80:20 training/validation split, where 80% of the data will be used for training and 20% will be used for validation.

In [20]:
size = electricity.shape[0]
print('size=',size)
max_train_size = int(size*0.8)
print('max_train_size=',max_train_size)

size= 9568
max_train_size= 7654
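One common way to get an 80:20 split is 5-fold cross-validation, where each split holds out one fifth of the rows. The notebook has not said which splitting strategy it uses at this point, so treat the use of `KFold(n_splits=5)` below as an assumption. The sketch only checks how many training rows such a scheme would leave for a data set of 9568 rows.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 9568  # row count reported for the electricity DataFrame

# KFold only looks at the number of rows, so a placeholder array is enough.
placeholder = np.zeros((n_rows, 1))

# Assumption: 5 folds, i.e. roughly 80% training and 20% validation per split.
train_lens = [
    len(train_idx) for train_idx, _ in KFold(n_splits=5).split(placeholder)
]
print(train_lens)
print("smallest training set:", min(train_lens))
```

Because 9568 is not divisible by 5, the folds differ in size by one row; the smallest training set is the safe upper bound for any requested training size.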


The maximum training set size we can use is 7654 training instances. We now pick a range of training set sizes. For this exercise, we pick the following sizes:

In [22]:
train_sizes = [1, 100, 500, 2000, 5000, max_train_size]
print('train_sizes=',train_sizes)

train_sizes= [1, 100, 500, 2000, 5000, 7654]


To generate a **learning curve**, we use the `learning_curve()` function in scikit-learn. First, we import what we need from `sklearn`. We will use the **LinearRegression** learner.

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
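To show the shape of a `learning_curve()` call, here is a minimal sketch on synthetic data rather than the power plant file. The data, the sizes in `demo_sizes`, the choice of `cv=5`, and the `neg_mean_squared_error` scoring are all assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data: 200 rows, 4 features, noisy linear target.
# These numbers are made up and unrelated to the power plant data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# With cv=5 each split trains on 160 of the 200 rows,
# so no requested size may exceed 160.
demo_sizes = [1, 20, 80, 160]

sizes, train_scores, validation_scores = learning_curve(
    estimator=LinearRegression(),
    X=X,
    y=y,
    train_sizes=demo_sizes,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE: flip the sign and average over the 5 splits.
train_mse = -train_scores.mean(axis=1)
validation_mse = -validation_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mse, validation_mse):
    print(f"{n:>4} rows  train MSE={tr:.4f}  validation MSE={va:.4f}")
```

Each score array has one row per training size and one column per cross-validation split, which is why the sketch averages along `axis=1` before printing.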