In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

----
## Train | Test Split Procedure 

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train and X Test Data (using the scaler fitted in step 2)
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6

In [2]:
df = pd.read_csv('../DATA/Advertising.csv')

In [3]:
df.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


Split the dataset into X & y, removing the sales data to create X...

In [4]:
X = df.drop('sales',axis=1)

...And using the sales data as y -

In [5]:
y = df['sales']

Import the train_test_split function from scikit-learn -

In [6]:
from sklearn.model_selection import train_test_split

The call below can be copied from the 'train_test_split' help text: press 'Shift+Tab' inside the function call to open the help, then scroll down to the example -

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
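As a rough illustration of what `test_size` and `random_state` do, the sketch below repeats the split on a synthetic stand-in for the advertising data (the random `X` and `y` are assumptions, not the real CSV). It suggests that 30% of the rows are held out and that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data: 200 rows, 3 features (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
y = X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=200)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Repeating the call with the same random_state gives the same rows again.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```

Changing `random_state` (or leaving it out) would shuffle the rows differently, so the error figures later in the notebook would change slightly.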

Import a scaling function to apply to the data -

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()

Remember in order to prevent data leakage, ***we only fit to the training data -***

In [10]:
scaler.fit(X_train)

StandardScaler()

Apply the scaler to both the training and test sets -

In [11]:
X_train = scaler.transform(X_train)

In [12]:
X_test = scaler.transform(X_test)
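To make the data leakage point concrete, here is a minimal sketch on synthetic data (the random `X` is an assumption standing in for the real features). It shows that the fitted scaler only holds statistics from the training rows, and that the test rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for TV, radio and newspaper (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=101)

scaler = StandardScaler()
scaler.fit(X_train)  # mean and std are learned from the training rows only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

# The stored mean is the training mean, not the mean of the full data set.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
# Scaled training columns are centred on 0; scaled test columns are only close to 0.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```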

Now that the data is scaled, we can create the model -

In [13]:
from sklearn.linear_model import Ridge

In [14]:
model = Ridge(alpha=100)

^ Remember 'alpha' is the regularization strength: the larger it is, the more the model's coefficients are shrunk towards zero, and the less closely the model fits the training data.
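The effect of 'alpha' can be sketched by comparing coefficient sizes for two alpha values. The data below is synthetic (an assumption), so the exact numbers will differ from the advertising data, but the shrinking pattern should be the same.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the scaled advertising features (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(140, 3))
y = X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=140)
X_scaled = StandardScaler().fit_transform(X)

# Fit the same model with a small and a large alpha and record the coefficient norm.
coef_norms = {}
for alpha in (1, 100):
    model = Ridge(alpha=alpha).fit(X_scaled, y)
    coef_norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, model.coef_.round(3))

# The larger alpha gives the smaller overall coefficient size.
print(coef_norms[100] < coef_norms[1])
```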

Now fit the model to the training data -

In [15]:
model.fit(X_train,y_train)

Ridge(alpha=100)

With the fitted model we can now generate predictions, so here we are saying 'Predict on the X_test' -

In [16]:
y_pred = model.predict(X_test)

Now import our metric to assess our results -

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
mse = mean_squared_error

Apply mean_squared_error to the actual test values and the predictions to measure how far the predictions deviate from the actual values -

In [19]:
mse(y_test, y_pred)

7.341775789034129
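For reference, the mean squared error is just the average of the squared differences between actual and predicted values. The sketch below checks that by hand on a handful of values; the 'actual' values are the first five sales figures from `df.head()`, while the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])  # first five sales values shown by df.head()
y_hat = np.array([20.0, 11.0, 10.0, 17.5, 13.5])  # made-up predictions (an assumption)

# Average of the squared residuals, computed by hand.
manual_mse = np.mean((y_true - y_hat) ** 2)

# The library function should agree with the manual calculation.
library_mse = mean_squared_error(y_true, y_hat)
print(manual_mse, library_mse)

# Taking the square root puts the error back in the units of 'sales'.
print(np.sqrt(manual_mse))
```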

Now, to see the effect of the hyperparameter 'alpha', we will create a second model with a different alpha value and compare the results -

In [20]:
model_two = Ridge(alpha=1)

Fit the training data now to this new model so we can compare the results -

In [21]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

Create another prediction -

In [22]:
y_pred2 = model_two.predict(X_test)

Again run the mean_squared_error function to compare that figure with the initial mse figure with alpha=100.

In [23]:
mse(y_test, y_pred2)

2.3190215794287514

^ Notice the better performance (lower deviation of the predictions from the actual test data = lower mse) with the updated alpha value.

----

#### Whilst a Train/Test split works fine in many applications, below we will split further and add a validation set, so that the final test is performed on purely unseen data.
The 3 sets are:
- **Training** - to create and fit the model to,
- **Validation** (also called evaluation) - to gauge the performance of the model on, with the ability to return to the training set and adjust hyperparameters as required, &
- **Test** - a purely unseen data set used for the FINAL check of the model's performance. Once the Test set has been passed in, there should be **NO** alteration to the model.

Using a held-out Test set, we know that the model was not fit to this data, **AND** the model's hyperparameters were not adjusted off that data set.

----
## Train | Test | Validation Split

The final test set is often called a "hold-out" set, since we should not adjust parameters based on it; instead we use it *only* for reporting final expected performance.

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train, X Eval and X Test Data (using the scaler fitted in step 2)
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

First we will recreate our X & y sets -

In [24]:
X = df.drop('sales',axis=1)
y = df['sales']

Now we will import the train_test_split function and call it ***TWICE***: first to separate the training set from the rest, and *then* to split the remainder into **evaluation** and **test** sets (named accordingly) -

In [25]:
from sklearn.model_selection import train_test_split

# 70% of the data is training data, the other 30% is set aside for evaluation and test (hence is named 'OTHER')
X_train, X_OTHER, y_train, y_OTHER = train_test_split(X, y, test_size=0.3, random_state=101)

# The remaining 30% is split into evaluation and test sets
# Each is 15% of the original data size
X_eval, X_test, y_eval, y_test = train_test_split(X_OTHER, y_OTHER, test_size=0.5, random_state=101)

In [26]:
len(df)

200

In [27]:
len(X_train)

140

In [28]:
len(X_eval)

30

In [29]:
len(X_test)

30

Check the lengths of each, notice that the lengths of **X_train**, **X_eval** & **X_test** all sum to the length of the original dataframe.
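The two-step split can be wrapped in a small helper so the proportions are easy to check. This is only a sketch: `train_eval_test_split` is a hypothetical name, not a scikit-learn function, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_eval_test_split(X, y, other_size=0.3, random_state=101):
    """Hypothetical helper: split into train, evaluation and test sets.

    other_size is the fraction held back from training; that remainder
    is then halved into evaluation and test sets.
    """
    X_train, X_other, y_train, y_other = train_test_split(
        X, y, test_size=other_size, random_state=random_state
    )
    X_eval, X_test, y_eval, y_test = train_test_split(
        X_other, y_other, test_size=0.5, random_state=random_state
    )
    return X_train, X_eval, X_test, y_train, y_eval, y_test

# Synthetic stand-in for the 200-row advertising data (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
y = rng.normal(size=200)

X_train, X_eval, X_test, y_train, y_eval, y_test = train_eval_test_split(X, y)
print(len(X_train), len(X_eval), len(X_test))
print(len(X_train) + len(X_eval) + len(X_test) == len(X))
```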

Now to import and apply scaling to the data -

In [30]:
from sklearn.preprocessing import StandardScaler

In [31]:
scaler = StandardScaler()

Again we **fit** to ONLY the training data

In [32]:
scaler.fit(X_train)

StandardScaler()

Apply the scaler to the X sets -

In [33]:
X_train = scaler.transform(X_train)

In [34]:
X_test = scaler.transform(X_test)

In [35]:
X_eval = scaler.transform(X_eval)

Now import and create our *first* instance of the model -

In [36]:
from sklearn.linear_model import Ridge

...with alpha value 100

In [37]:
model_one = Ridge(alpha=100)

Pass in the training data to fit the model -
- X_train: The contributing factors
- y_train: The associated outcomes

In [38]:
model_one.fit(X_train,y_train)

Ridge(alpha=100)

Now pass in the X_eval dataset to the created model to gauge performance -
- X_eval: is like the original 'test' set, and is a data set of the factors used to generate an estimated output by passing it to the 'predict' function.

In [39]:
y_eval_pred = model_one.predict(X_eval)

To evaluate, import the mean_squared_error function -

In [40]:
from sklearn.metrics import mean_squared_error

In [41]:
mse = mean_squared_error

Pass in the 'y_eval' (actual evaluation figures) and the predicted 'y_eval_pred' to compare via the mean_squared_error -

In [42]:
mse(y_eval,y_eval_pred)

7.320101458823872

Notice the above figure is *close* to, but not the same as, the earlier mean_squared_error figure for alpha=100, because we are now evaluating on only *half* of the held-out data.

Now create the second model, with alpha figure 1 -

In [43]:
model_two = Ridge(alpha=1)

As before, fit that model to the X_train & y_train sets, but with a different alpha figure -

In [44]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

Again, pass the 'X_eval' to the new model to gauge performance -

In [45]:
new_pred = model_two.predict(X_eval)

...then apply the mean_squared_error to compare the two -

In [46]:
mse(y_eval,new_pred)

2.3837830750569866

The alpha value was chosen by comparing the two models on the evaluation set, so the evaluation error is no longer a fully unbiased estimate. The test set has not been used so far, so now we bring it into play to report the final performance -

In [47]:
y_final_test_predictions = model_two.predict(X_test)

In [48]:
mse(y_test,y_final_test_predictions)

2.254260083800517
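The whole Train | Validation | Test workflow could be sketched as a loop: compare several alpha values on the evaluation set, keep the best one, and touch the test set exactly once at the end. The data is synthetic and the list of alpha values is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the advertising data (an assumption, not the real CSV).
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
y = X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=200)

X_train, X_other, y_train, y_other = train_test_split(X, y, test_size=0.3, random_state=101)
X_eval, X_test, y_eval, y_test = train_test_split(X_other, y_other, test_size=0.5, random_state=101)

# Fit the scaler on the training rows only, then transform all three sets.
scaler = StandardScaler().fit(X_train)
X_train, X_eval, X_test = (scaler.transform(part) for part in (X_train, X_eval, X_test))

# Steps 5 to 7: fit each candidate and score it on the evaluation set.
eval_errors = {}
for alpha in (0.1, 1, 10, 100):
    candidate = Ridge(alpha=alpha).fit(X_train, y_train)
    eval_errors[alpha] = mean_squared_error(y_eval, candidate.predict(X_eval))

best_alpha = min(eval_errors, key=eval_errors.get)

# Step 8: a single, final measurement on the untouched test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_mse)
```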

----
## Cross Validation

With **K-Fold** Cross Validation, starting with a complete dataset, we will split it into a larger training portion and a smaller test portion.

The **test** set will be removed to be used for final model evaluation, thus it cannot be used to train the model for fear that it could bias the fitting of the model.

For the **training** data, we will choose what is known as the *K-Value*, which is the number of divisions (folds) we will split the data into in order to perform our cross validation. Remember though that whilst the largest possible K-Value (equal to the number of rows, known as leave-one-out cross validation) may seem ideal, it requires fitting the model once per row and so greatly increases computation time.

Cross validation is the practice of training a model whilst leaving a certain portion of the data aside for validation. Iterating through the dataset, the portion of data used for validation changes with every iteration and the error is noted each time. This allows us to train AND validate our model on ALL portions of the data.

After iterating through all the data, the errors are averaged into a single score. We can then adjust hyperparameters and run cross validation again to compare.

Following cross validation, we then test the model on our segregated test dataset as normal.
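The iteration described above can be written out by hand to see what happens in each fold. This is a minimal sketch on synthetic data (an assumption); it uses scikit-learn's `KFold` to generate the row indices, and scaling is left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the 140-row training portion (an assumption).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 300, size=(140, 3))
y_train = X_train @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=140)

model = Ridge(alpha=1)
kfold = KFold(n_splits=5)  # K = 5 divisions of the training data

fold_errors = []
for fit_idx, val_idx in kfold.split(X_train):
    # Train on four of the five folds...
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # ...and note the error on the one fold that was left out.
    predictions = model.predict(X_train[val_idx])
    fold_errors.append(mean_squared_error(y_train[val_idx], predictions))

# The averaged error is the single figure used to compare hyperparameters.
print(fold_errors)
print(np.mean(fold_errors))
```

Every row is used for validation exactly once and for training in the other four iterations.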


<img src="grid_search_cross_validation.png">

----

#### scikit-learn's *'cross_val_score'* function allows us to perform this procedure automatically.

In [49]:
X = df.drop('sales',axis=1)
y = df['sales']

First we will perform our train test split as per normal to remove our test data -

In [50]:
from sklearn.model_selection import train_test_split

When removing the final test data, with cross validation our test size doesn't need to be *too* large (15-20%, for example). We will leave it at 30% for this example to keep our results comparable to the previous model outcomes -

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [52]:
from sklearn.preprocessing import StandardScaler

In [53]:
scaler = StandardScaler()

In [54]:
scaler.fit(X_train)

StandardScaler()

In [55]:
X_train = scaler.transform(X_train)

In [56]:
X_test = scaler.transform(X_test)

In [57]:
model = Ridge(alpha=100)

In [58]:
from sklearn.model_selection import cross_val_score

Now that we have the 'cross_val_score' function, we must pass in:

- The instance of the model we are using (in this case it's 'model'),
- Our training datasets X & y,
- Our scoring function, &
- The K-fold value as 'cv'


In [59]:
scores = cross_val_score(model,X_train,y_train,scoring="neg_mean_squared_error",cv=5)

Now we can see our scores -

In [60]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

Remembering that as we're looking at the negative mse, the higher number (that is, the value closest to zero) here is the better score.

So from these scores we can now generate the average (as an absolute value), and compare that to our values from our previous models -

As seen below, this is not such a good value compared to the other methods performed above.

In [61]:
abs(scores.mean())

8.215396464543607

Now we will run the same functions, but adjusting our alpha values to gauge performance and improvements -

In [62]:
model_a = Ridge(alpha=1)

In [63]:
scores = cross_val_score(model_a,X_train,y_train,scoring="neg_mean_squared_error",cv=5)

Seen below, a much better score than the original, but still with room for improvement -

In [64]:
abs(scores.mean())

3.344839296530695
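Rather than creating one model per alpha by hand, the comparison could be run in a loop. The sketch below uses synthetic data and an arbitrary list of alpha values (both assumptions), and negates the `neg_mean_squared_error` scores so that lower is better again.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (an assumption).
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 300, size=(140, 3))
y_train = X_raw @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=140)
X_train = StandardScaler().fit_transform(X_raw)

cv_mse = {}
for alpha in (0.1, 1, 10, 100):
    scores = cross_val_score(
        Ridge(alpha=alpha), X_train, y_train,
        scoring="neg_mean_squared_error", cv=5,
    )
    # Scores are negative MSE values, so flip the sign of the mean.
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print(best_alpha)
```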

In [65]:
model_a.fit(X_train,y_train)

Ridge(alpha=1)

In [66]:
y_final_test_pred = model_a.predict(X_test)

Now here we have our final mean_squared_error from the cross validation model fit -

In [67]:
mse(y_test,y_final_test_pred)

2.3190215794287514
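One refinement worth considering: above, the scaler is fitted on the whole training set before cross validation, so each validation fold has already influenced the scaling. A possible alternative, sketched here on synthetic data (an assumption), is to wrap the scaler and model in a scikit-learn `Pipeline` via `make_pipeline`, so the scaler is refitted inside every fold. This is an optional variation, not what the notebook above does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in for the training portion (an assumption).
rng = np.random.default_rng(0)
X_train_raw = rng.uniform(0, 300, size=(140, 3))
y_train = X_train_raw @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=140)

# The pipeline fits the scaler and the model together on each fold's training rows.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1))

scores = cross_val_score(
    pipeline, X_train_raw, y_train,
    scoring="neg_mean_squared_error", cv=5,
)
print(-scores.mean())
```

With a data set this small and a simple scaler the difference is likely to be minor, but the pattern avoids leakage between folds by construction.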

----

## Cross Validation with the cross_validate function

**cross_validate** allows us to view numerous performance metrics on a model and provides feedback on times taken to both fit and test models.



In [68]:
X = df.drop('sales',axis=1)
y = df['sales']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
# Our set used for cross validation -
X_train = scaler.transform(X_train)
# Our set used for final performance metrics -
X_test = scaler.transform(X_test)

In [69]:
model = Ridge(alpha=100)

Import the **cross_validate** function -

In [70]:
from sklearn.model_selection import cross_validate

Using cross_validate we can pass in a list of methods to score off of, which will return a dictionary of scores -

In [71]:
scores = cross_validate(model,X_train,y_train,scoring=['neg_mean_absolute_error','neg_mean_squared_error','max_error'],cv=10)

In [72]:
scores

{'fit_time': array([0.00199246, 0.00199413, 0.00099802, 0.00199533, 0.00099826,
        0.00099754, 0.00099707, 0.00099778, 0.00099707, 0.00099778]),
 'score_time': array([0.00199389, 0.0009973 , 0.00099611, 0.00099874, 0.00099683,
        0.        , 0.00099754, 0.0009973 , 0.00099754, 0.0019927 ]),
 'test_neg_mean_absolute_error': array([-1.8102116 , -2.54195751, -1.46959386, -1.86276886, -2.52069737,
        -2.45999491, -1.45197069, -2.37739501, -2.44334397, -1.89979708]),
 'test_neg_mean_squared_error': array([ -6.06067062, -10.62703078,  -3.99342608,  -5.00949402,
         -9.14179955, -13.08625636,  -3.83940454,  -9.05878567,
         -9.05545685,  -5.77888211]),
 'test_max_error': array([ -5.89305462,  -6.06524931,  -5.26424161,  -4.29881153,
         -5.61586024, -10.60207363,  -4.73613382,  -6.520936  ,
         -7.37049532,  -5.41946249])}

Now we can pass this convoluted data into a Pandas DataFrame to make it more readable -

In [75]:
scores = pd.DataFrame(scores)

In [76]:
scores

,fit_time,score_time,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_max_error
0,0.001992,0.001994,-1.810212,-6.060671,-5.893055
1,0.001994,0.000997,-2.541958,-10.627031,-6.065249
2,0.000998,0.000996,-1.469594,-3.993426,-5.264242
3,0.001995,0.000999,-1.862769,-5.009494,-4.298812
4,0.000998,0.000997,-2.520697,-9.1418,-5.61586
5,0.000998,0.0,-2.459995,-13.086256,-10.602074
6,0.000997,0.000998,-1.451971,-3.839405,-4.736134
7,0.000998,0.000997,-2.377395,-9.058786,-6.520936
8,0.000997,0.000998,-2.443344,-9.055457,-7.370495
9,0.000998,0.001993,-1.899797,-5.778882,-5.419462


Then take the mean values to see the averages....

In [78]:
scores.mean()

fit_time                        0.001297
score_time                      0.001097
test_neg_mean_absolute_error   -2.083773
test_neg_mean_squared_error    -7.565121
test_max_error                 -6.178632
dtype: float64

Now let's try it with our better-performing alpha value, then run the same code again -

In [92]:
model = Ridge(alpha=1)

In [93]:
scores = cross_validate(model,X_train,y_train,scoring=['neg_mean_absolute_error','neg_mean_squared_error','max_error'],cv=10)

In [94]:
scores = pd.DataFrame(scores)

In [95]:
scores

,fit_time,score_time,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_max_error
0,0.000998,0.000997,-1.457174,-2.962508,-2.680446
1,0.000996,0.0,-1.555308,-3.057378,-3.251285
2,0.0,0.001005,-1.23877,-2.17374,-2.708825
3,0.000991,0.000959,-0.768938,-0.833034,-1.495801
4,0.000997,0.000997,-1.434489,-3.464018,-4.427072
5,0.000998,0.0,-1.494316,-8.232647,-9.637507
6,0.0,0.000998,-1.081362,-1.905864,-3.006887
7,0.000997,0.001036,-1.250011,-2.765048,-4.27949
8,0.000997,0.0,-1.580971,-4.989505,-6.052545
9,0.0,0.000994,-1.223326,-2.846438,-3.732716


And see our better performing scores -

In [97]:
scores.mean()

fit_time                        0.000697
score_time                      0.000699
test_neg_mean_absolute_error   -1.308467
test_neg_mean_squared_error    -3.323018
test_max_error                 -4.127257
dtype: float64
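Because the error metrics come back negated, it can help to flip the signs and rename the columns before reading the averages. A small sketch of that step, assuming synthetic data and only the two `neg_` scorers (the column handling is illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training portion (an assumption).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 300, size=(140, 3))
y_train = X_train @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, size=140)

results = pd.DataFrame(
    cross_validate(
        Ridge(alpha=1), X_train, y_train,
        scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
        cv=10,
    )
)

# Keep only the negated metrics, flip their sign and drop the 'test_neg_' prefix.
neg_cols = [col for col in results.columns if col.startswith("test_neg_")]
errors = -results[neg_cols]
errors.columns = [col.replace("test_neg_", "") for col in neg_cols]

# One row per fold; the mean of each column is the cross-validated error.
print(errors.mean())
```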

Now if we're happy with our results, we can fit this model to the whole training set (all folds together) -

In [98]:
model.fit(X_train,y_train)

Ridge(alpha=1)

In [100]:
y_final_prediction = model.predict(X_test)

Yet again, let's run a mean_squared_error comparison of the actual outputs against our predicted to gauge performance -

In [102]:
mse(y_test,y_final_prediction)

2.3190215794287514