In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

----
## Train | Test Split Procedure 

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train and X Test Data (using the scaler fitted in step 2)
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
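The steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own run: it assumes synthetic stand-in data (`X_demo`, `y_demo`) instead of the Advertising dataset, and all `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data (200 rows, 3 features) standing in for the real dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

# 1. Split into train and test sets
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# 2. Fit the scaler on the training features only
demo_scaler = StandardScaler().fit(Xtr)

# 3. Scale both sets with the training statistics
Xtr_s = demo_scaler.transform(Xtr)
Xte_s = demo_scaler.transform(Xte)

# 4-5. Create the model and fit it on the scaled training data
demo_model = Ridge(alpha=1.0).fit(Xtr_s, ytr)

# 6. Evaluate on the scaled test data
demo_mse = mean_squared_error(yte, demo_model.predict(Xte_s))
print(demo_mse)
```

Step 7 would then repeat steps 4 to 6 with a different `alpha`.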

In [2]:
df = pd.read_csv('../DATA/Advertising.csv')

In [3]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


Split the dataset into X & y, removing the sales data to create X...

In [4]:
X = df.drop('sales',axis=1)

...And using the sales data as y -

In [5]:
y = df['sales']

Import the train_test_split function from scikit-learn -

In [6]:
from sklearn.model_selection import train_test_split

The line below follows the example in the 'train_test_split' help text: press 'Shift+Tab' to open the help text and scroll down to the example -

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
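As a side note on the `random_state` argument used above: fixing it makes the shuffle reproducible, so the same rows land in the test set on every run. A minimal sketch, assuming small made-up arrays (`X_demo`, `y_demo`) rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: tiny illustrative arrays, 10 rows with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

same_split = np.array_equal(a_test, b_test)
print(same_split)
print(a_train.shape, a_test.shape)
```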

Import a scaling function to apply to the data -

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()

Remember: to prevent data leakage, ***we fit the scaler only to the training data -***

In [10]:
scaler.fit(X_train)

StandardScaler()

Apply the fitted scaler to both the training and test sets -

In [11]:
X_train = scaler.transform(X_train)

In [12]:
X_test = scaler.transform(X_test)
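To make the leakage point concrete, the sketch below checks that a fitted `StandardScaler` stores only training-set statistics. It assumes synthetic data; `X_demo`, `sc`, and the other names are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic features with a non-zero mean and non-unit spread
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=101)

sc = StandardScaler().fit(X_tr)

# The learned statistics are the column means and standard deviations of the training set
print(np.allclose(sc.mean_, X_tr.mean(axis=0)))
print(np.allclose(sc.scale_, X_tr.std(axis=0)))

X_tr_s = sc.transform(X_tr)
X_te_s = sc.transform(X_te)

# Training columns end up with mean 0 and std 1 (up to rounding);
# test columns are only approximately centred, since their statistics were never used
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_te_s.mean(axis=0), X_te_s.std(axis=0))
```

If the scaler were fitted on all of `X_demo` instead, the test rows would influence the means and standard deviations used during training, which is the leakage the text warns about.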

Now that the data is scaled, we can create the model -

In [13]:
from sklearn.linear_model import Ridge

In [14]:
model = Ridge(alpha=100)

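^ Remember that 'alpha' is the regularization strength: Ridge adds a penalty of alpha times the sum of the squared coefficients to the least-squares loss. A larger alpha shrinks the coefficients more, so the model fits the training data less closely; a smaller alpha lets it fit the training data more closely.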
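The effect of `alpha` can be seen directly in the fitted coefficients. The following is a minimal sketch on synthetic data (an assumption, as are the names `X_demo`, `y_demo`, and `coef_norms`): as `alpha` grows, the coefficient vector shrinks toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumption: synthetic data with known true coefficients [4.0, 3.0, 0.1]
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(140, 3))
y_demo = X_demo @ np.array([4.0, 3.0, 0.1]) + rng.normal(scale=1.0, size=140)

coef_norms = {}
for a in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=a).fit(X_demo, y_demo)
    # Record the length (L2 norm) of the coefficient vector for each alpha
    coef_norms[a] = float(np.linalg.norm(ridge.coef_))
    print(a, ridge.coef_.round(3), round(coef_norms[a], 3))
```

With a very small `alpha` the fit is close to ordinary least squares; with `alpha=100` the coefficients are pulled noticeably toward zero.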

Now fit the model to the training data -

In [15]:
model.fit(X_train,y_train)

Ridge(alpha=100)

With the model fitted, we can generate predictions; here we are saying 'Predict on the X_test' -

In [16]:
y_pred = model.predict(X_test)

Now import our metric to assess the results -

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
mse = mean_squared_error

Apply mean_squared_error to the actual test values (y_test) and the predictions (y_pred) to measure how far the predictions are from the actual values -

In [19]:
mse(y_test, y_pred)

7.341775789034129
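For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand and compares it with `mean_squared_error`; the four predicted values are made up for illustration and are not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual values borrowed from the first rows of the sales column;
# the predictions are invented purely for illustration (assumption)
y_true_demo = np.array([22.1, 10.4, 9.3, 18.5])
y_pred_demo = np.array([20.0, 11.0, 10.0, 17.0])

# MSE = mean of squared residuals
manual_mse = np.mean((y_true_demo - y_pred_demo) ** 2)
sk_mse = mean_squared_error(y_true_demo, y_pred_demo)

# The square root (RMSE) is in the same units as the target
rmse = np.sqrt(manual_mse)

print(manual_mse, sk_mse, rmse)
```

Because the errors are squared, the MSE is in squared units of the target; the RMSE is often easier to interpret next to the sales figures themselves.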

To see how the hyperparameter 'alpha' affects the results, we will create a second model with a different alpha value and compare -

In [20]:
model_two = Ridge(alpha=1)

Fit the training data now to this new model so we can compare the results -

In [21]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

Create another prediction -

In [22]:
y_pred2 = model_two.predict(X_test)

Again run the mean_squared_error function to compare that figure with the initial mse figure with alpha=100.

In [23]:
mse(y_test, y_pred2)

2.3190215794287514

^ Notice the better performance (lower deviation of the predictions from the actual test data = lower mse) with the updated alpha value.

----

## Train | Test | Validation Split

The final test set is often called a "hold-out" set: we should not adjust parameters based on it, and instead use it *only* for reporting final expected performance. Parameters are tuned on a separate validation (evaluation) set.

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Train, X Eval and X Test Data (using the scaler fitted in step 2)
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

First we will recreate our X & y sets -

In [24]:
X = df.drop('sales',axis=1)
y = df['sales']

Now we will import the train_test_split function and call it ***TWICE***: first to separate the training set from the rest of the data, and *then* to split the remainder into **evaluation** and **test** sets (named accordingly) -

In [25]:
from sklearn.model_selection import train_test_split

# 70% of the data is training data, the other 30% is set aside for evaluation and testing (hence the name 'OTHER')
X_train, X_OTHER, y_train, y_OTHER = train_test_split(X, y, test_size=0.3, random_state=101)

# The remaining 30% is split into evaluation and test sets
# Each is 15% of the original data size
X_eval, X_test, y_eval, y_test = train_test_split(X_OTHER, y_OTHER, test_size=0.5, random_state=101)

In [26]:
len(df)

200

In [27]:
len(X_train)

140

In [28]:
len(X_eval)

30

In [29]:
len(X_test)

30

Check the lengths of each: notice that the lengths of **X_train**, **X_eval** & **X_test** sum to the length of the original dataframe.
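The two-step split can be wrapped in a small helper. This is a sketch only: `train_eval_test_split` is a hypothetical function name, not part of scikit-learn, and the demo arrays are placeholders for the real X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_eval_test_split(X, y, eval_size=0.15, test_size=0.15, random_state=101):
    """Hypothetical helper: split into train/eval/test by calling train_test_split twice."""
    holdout = eval_size + test_size
    # First call: separate the training set from the held-out portion
    X_tr, X_other, y_tr, y_other = train_test_split(
        X, y, test_size=holdout, random_state=random_state
    )
    # Second call: divide the held-out portion into evaluation and test sets
    X_ev, X_te, y_ev, y_te = train_test_split(
        X_other, y_other, test_size=test_size / holdout, random_state=random_state
    )
    return X_tr, X_ev, X_te, y_tr, y_ev, y_te

# Assumption: placeholder arrays with 200 rows, matching the length of the dataframe
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.arange(200)

parts = train_eval_test_split(X_demo, y_demo)
sizes = [len(p) for p in parts[:3]]
print(sizes, sum(sizes))
```

Note that the second call's `test_size` is a fraction of the held-out portion, not of the original data, which is why the notebook passes `0.5` to obtain two sets of 15% each.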

Now to import and apply scaling to the data -

In [30]:
from sklearn.preprocessing import StandardScaler

In [31]:
scaler = StandardScaler()

Again we **fit** to ONLY the training data

In [32]:
scaler.fit(X_train)

StandardScaler()

Apply the scaler to the X sets -

In [33]:
X_train = scaler.transform(X_train)

In [34]:
X_test = scaler.transform(X_test)

In [35]:
X_eval = scaler.transform(X_eval)

Now import and create our instance of the model -

In [36]:
from sklearn.linear_model import Ridge

...with alpha value 100

In [37]:
model_one = Ridge(alpha=100)

Pass in the training data to fit the model -

In [38]:
model_one.fit(X_train,y_train)

Ridge(alpha=100)

Now pass in the X_eval dataset to the created model to gauge performance -

In [48]:
y_eval_pred = model_one.predict(X_eval)

To evaluate, import the mean_squared_error function -

In [49]:
from sklearn.metrics import mean_squared_error

In [50]:
mse = mean_squared_error

Pass in y_eval (the actual evaluation values) and y_eval_pred (the predicted values) to compare them via mean_squared_error -

In [51]:
mse(y_eval,y_eval_pred)

7.320101458823872

Notice the above figure is *close* to, but not the same as, the mean_squared_error obtained earlier for alpha=100, because the evaluation set contains only *half* of the held-out data (30 rows instead of 60).

Now create the second model, with alpha figure 1 -

In [46]:
model_two = Ridge(alpha=1)

In [47]:
model_two.fit(X_train,y_train)

Ridge(alpha=1)

In [52]:
new_pred = model_two.predict(X_eval)

In [53]:
mse(y_eval,new_pred)

2.3837830750569866

The two models above were compared using only the evaluation set, so the test set has not influenced the choice of alpha. Now that the alpha=1 model has been chosen, we bring the test set into play once to report the final performance -

In [54]:
y_final_test_predictions = model_two.predict(X_test)

In [56]:
mse(y_test,y_final_test_predictions)

2.254260083800517
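The whole validation workflow can be sketched as a loop over candidate alpha values. This is a minimal sketch under stated assumptions: the data is synthetic, the candidate list `(0.1, 1.0, 10.0, 100.0)` is arbitrary, and names such as `eval_mse` and `best_alpha` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with 200 rows and 3 features
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([4.0, 3.0, 0.0]) + rng.normal(scale=1.0, size=200)

# Split twice: 70% train, 15% evaluation, 15% test
X_tr, X_other, y_tr, y_other = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
X_ev, X_te, y_ev, y_te = train_test_split(X_other, y_other, test_size=0.5, random_state=101)

# Fit the scaler on the training set only, then transform all three sets
sc = StandardScaler().fit(X_tr)
X_tr, X_ev, X_te = sc.transform(X_tr), sc.transform(X_ev), sc.transform(X_te)

# Compare candidate alphas on the evaluation set only
eval_mse = {}
for a in (0.1, 1.0, 10.0, 100.0):
    candidate = Ridge(alpha=a).fit(X_tr, y_tr)
    eval_mse[a] = mean_squared_error(y_ev, candidate.predict(X_ev))

best_alpha = min(eval_mse, key=eval_mse.get)

# Touch the test set once, after alpha has been chosen
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))

print(eval_mse)
print(best_alpha, test_mse)
```

The test MSE is reported once and not used to revisit the choice of alpha, which keeps it an honest estimate of performance on unseen data.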