<a href="https://colab.research.google.com/github/cagBRT/IntroToDNNwKeras/blob/master/Train_Test_sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to split a dataset into training, validation, and test sets.<br>

We also compare linear and polynomial regression using cross-validation.

**Example of Dataset split and cross-validation**

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Generate some random data: features `X` and noisy linear targets `y`
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 + 3 * X + np.random.randn(100, 1)

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)


In [None]:
# Split the training set again into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(
  X_train,
  y_train,
  test_size=0.2,
  random_state=42)

print("X_train: ",X_train.shape)
print("y_train: ",y_train.shape)
print("X_val: ",X_val.shape)
print("y_val: ",y_val.shape)
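
Two successive 80/20 splits leave 64% of the data for training, 16% for validation, and 20% for testing. The following is a minimal sketch that checks these sizes on 100 samples; the names `features` and `targets` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, one feature
features = np.arange(100).reshape(-1, 1)
targets = np.arange(100)

# First split: 80% train+validation, 20% test
f_trainval, f_test, t_trainval, t_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)

# Second split: 80% of the remainder for training, 20% for validation
f_train, f_val, t_train, t_val = train_test_split(
    f_trainval, t_trainval, test_size=0.2, random_state=42)

print(len(f_train), len(f_val), len(f_test))  # 64 16 20
```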



---






An example of using the training and test sets with linear and polynomial regression models.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import sklearn

np.random.seed(42)

**Generate and plot the dataset**

In [None]:
# Generate data and plot
N = 300
x = np.linspace(0, 7*np.pi, N)
smooth = 1 + 0.5*np.sin(x)
y = smooth + 0.2*np.random.randn(N)
plt.plot(x, y)
plt.plot(x, smooth)
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(0,2)
plt.legend(["noisy dataset", "smooth signal"], loc="lower right")
plt.show()

**Train-test split, intentionally use shuffle=False**<br>

**shuffle: bool, default=True**<br>
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.<br>

**stratify: array-like, default=None**<br>
If not None, data is split in a stratified fashion, using this as the class labels.<br>
<br>
Stratified sampling divides the data into subgroups (here, the class labels) and draws from each subgroup in proportion to its size, so the training and test sets keep the same class proportions as the full dataset.<br>
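
As a rough illustration of stratification, the sketch below splits an imbalanced, made-up label array (90 samples of class 0, 10 of class 1) with `stratify` set to the labels. The data and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 of class 0, 10 of class 1
labels = np.array([0] * 90 + [1] * 10)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# Both subsets keep the 9:1 class ratio of the full dataset
print(np.bincount(l_train))  # [72  8]
print(np.bincount(l_test))   # [18  2]
```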

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), ***shuffling the data first may be essential to get a meaningful cross-validation result***. <br>

***However, the opposite may be true if the samples are not independently and identically distributed***. <br>

For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
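
To make the effect of ordering concrete, here is a small sketch on ten time-ordered samples (the array `timestamps` is hypothetical). With `shuffle=False` the test set is simply the last 20% of the data, so every test sample comes after all training samples; with shuffling, test samples are interleaved with training samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered samples: index 0 is the oldest, 9 the newest
timestamps = np.arange(10).reshape(-1, 1)
values = np.arange(10)

# No shuffling: the test set is the final block of the data
t_train, t_test, _, _ = train_test_split(
    timestamps, values, test_size=0.2, shuffle=False)
print(t_test.ravel())  # [8 9]

# Shuffling: test samples are drawn from anywhere in the sequence
s_train, s_test, _, _ = train_test_split(
    timestamps, values, test_size=0.2, shuffle=True, random_state=0)
print(sorted(s_test.ravel()))
```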

In [None]:
# Train-test split, intentionally use shuffle=False
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
print("X_train ", X_train.shape)
print("X_test ", X_test.shape)

**Create two models: Polynomial and linear regression**

In [None]:
degree = 2  # try other values; a different degree may give a better model

In [None]:
polyreg = make_pipeline(PolynomialFeatures(degree),
                        LinearRegression(fit_intercept=False))
linreg = LinearRegression()
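
`PolynomialFeatures` expands each input into its powers and, by default, includes a constant column of ones, which is why the pipeline's `LinearRegression` can be created with `fit_intercept=False`. A minimal sketch on two made-up samples:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up one-feature samples: x = 2 and x = 3
samples = np.array([[2.0], [3.0]])

# Degree 2 yields the columns [1, x, x^2]
expanded = PolynomialFeatures(2).fit_transform(samples)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```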

**Use the training set to do cross-validation**

In [None]:
# Cross-validation (5 folds by default); return_estimator=True keeps each fold's fitted model
scoring = "neg_root_mean_squared_error"
polyscores = cross_validate(polyreg, X_train, y_train, scoring=scoring,
                            return_estimator=True)
linscores = cross_validate(linreg, X_train, y_train, scoring=scoring,
                           return_estimator=True)
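
When no `cv` argument is given, `cross_validate` uses 5-fold cross-validation, and for a regressor the folds are contiguous, unshuffled blocks, equivalent to `KFold(n_splits=5)`. The sketch below lists which indices each fold holds out, assuming 240 training samples (80% of the 300 generated above); the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 240  # assumed: 80% of the 300 samples
indices = np.arange(n_train)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(indices)):
    # Each validation fold is one contiguous block of the training data
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: validate on indices {val_idx[0]}..{val_idx[-1]}")

print(fold_sizes)  # [48, 48, 48, 48, 48]
```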

**Compare the scores for the models**

In [None]:
# Which is better, linear or polynomial? Scores are negative RMSE, so closer to 0 is better
print("Linear regression score:", linscores["test_score"].mean())
print("Polynomial regression score:", polyscores["test_score"].mean())
print("Difference:", linscores["test_score"].mean() - polyscores["test_score"].mean())
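
Because scikit-learn scorers follow a "higher is better" convention, `neg_root_mean_squared_error` returns the RMSE with its sign flipped. A small sketch with hypothetical fold scores (not results from this notebook) showing how to recover the RMSE:

```python
import numpy as np

# Hypothetical per-fold scores, as returned under "test_score"
neg_rmse_scores = np.array([-0.40, -0.35, -0.45])

# Flip the sign to get the RMSE of each fold, then average
rmse_per_fold = -neg_rmse_scores
mean_rmse = rmse_per_fold.mean()
print("Mean RMSE:", round(mean_rmse, 3))
```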

**Get the coefficients of the two regression models**

In [None]:
print("Coefficients of the models fitted on the first cross-validation fold")
# Coefficients of the first fitted polynomial regression,
# from the constant term upward in ascending order of powers
print("Polyreg: ", polyscores["estimator"][0].steps[1][1].coef_)
# Intercept and coefficient of the first fitted linear regression
print("Linearreg: ", linscores["estimator"][0].intercept_,
      linscores["estimator"][0].coef_)

**Polynomial Regression** (approximate):<br>
y = 1.2 - 0.05x + 0.002x^2

**Linear Regression** (approximate):<br>
y = 0.99 - 0.001x
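
The ordering of the polynomial coefficients (constant term first, then ascending powers) can be checked on noise-free data. The sketch below fits the same kind of pipeline to a made-up quadratic, y = 1 + 2x + 3x^2, and reads the coefficients by step name instead of by position; `x_demo`, `y_demo`, and `demo_model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noise-free quadratic: y = 1 + 2x + 3x^2
x_demo = np.linspace(-2, 2, 50).reshape(-1, 1)
y_demo = 1 + 2 * x_demo.ravel() + 3 * x_demo.ravel() ** 2

demo_model = make_pipeline(PolynomialFeatures(2),
                           LinearRegression(fit_intercept=False))
demo_model.fit(x_demo, y_demo)

# make_pipeline names each step after its lowercased class name
coefs = demo_model.named_steps["linearregression"].coef_
print(np.round(coefs, 3))  # approximately [1. 2. 3.]
```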

**Plot the models**

In [None]:
# Plot and compare
plt.plot(x, y)#blue
plt.plot(x, smooth)#orange
plt.plot(x, polyscores["estimator"][0].predict(X))#green
plt.plot(x, linscores["estimator"][0].predict(X))#red
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.legend(["noisy dataset", "smooth signal", "polyreg", "linearreg"], loc="lower right")
plt.show()

**Train the model and test it with the test set**

Having chosen the linear model for regression, we retrain it on the full training set and test it on our held-out test data.<br>

`sklearn.base.clone` constructs a new, unfitted estimator with the same parameters.
<br>
It deep-copies the estimator's parameters without copying any attached data, so the clone has not been fitted on any data.
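
A short sketch of this behavior on a toy model (the data here is made up for illustration): the clone keeps the constructor parameters but none of the fitted attributes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

# Fit a toy model on made-up data (y = 2x)
fitted = LinearRegression(fit_intercept=False)
fitted.fit(np.array([[1.0], [2.0]]), np.array([2.0, 4.0]))

fresh = clone(fitted)

print(hasattr(fitted, "coef_"))                    # True: learned during fit
print(hasattr(fresh, "coef_"))                     # False: the clone is unfitted
print(fresh.get_params() == fitted.get_params())   # True: same parameters
```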

In [None]:
# Retrain the model and evaluate
linregclone = sklearn.base.clone(linreg)
linregclone.fit(X_train, y_train)
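# Take the square root of the MSE; the `squared=False` argument was removed in recent scikit-learn releases
print("Test set RMSE:", np.sqrt(mean_squared_error(y_test, linregclone.predict(X_test))))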
print("Mean validation RMSE:", -linscores["test_score"].mean())
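
RMSE is the square root of the mean squared error. As a sanity check, the sketch below computes it both by hand and from `mean_squared_error` on three made-up values; the arrays are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# By hand: square the errors, average, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Using scikit-learn's MSE and taking the square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(rmse_manual, 4), round(rmse_sklearn, 4))
```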

**Assignment**<br>

1. Change shuffle in the train-test split to True.
What happens to the models?

2. Change the degree of the polynomial regression model. Is there a degree that gives a better model?