# Exercise: Model Training and Evaluation

Now that we have covered the fundamentals of creating, cleaning, and modifying datasets, we can train and evaluate a model. In this exercise, the model is a linear regression.

Your tasks for this exercise are:
1. Create a dataframe from the regression dataset that holds both the features and the target.
2. Split the data into 60% train / 20% validation / 20% test sets using the `train_test_split` function.
3. Fit the `LinearRegression` model on the training set.
4. Evaluate the model on the validation set.
5. Evaluate the model on the test set.

In [2]:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [3]:
regression_dataset = make_regression(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    bias=0,
    noise=40,
    n_targets=1,
    random_state=0,
)

In [5]:
# Create the dataframe using the dataset
df = pd.DataFrame(regression_dataset[0])
df["target"] = regression_dataset[1]

In [6]:
# `.head()` to view what the dataset looks like
df.head()

,0,1,2,3,4,5,6,7,8,9,target
0,-1.039309,-0.533254,0.006352,-0.130216,-0.672371,-1.227693,-1.605115,0.313087,1.709311,1.486217,-190.336109
1,0.906268,1.112101,-0.8165,0.461619,0.883569,1.125719,-0.993897,0.999854,-1.919401,-1.137031,33.264389
2,0.334137,0.320004,-0.248267,-0.317444,0.834343,1.381073,0.901058,-0.655725,0.340868,-1.481551,120.287805
3,0.250441,-1.21511,-1.56245,0.162566,-1.630155,-0.449801,-1.033361,-0.67175,-1.331549,-0.979638,-472.599566
4,-1.440993,-0.388298,-0.431737,0.51842,-0.405904,-0.785488,1.00809,-0.695019,1.885108,-0.913755,42.355214


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       10000 non-null  float64
 1   1       10000 non-null  float64
 2   2       10000 non-null  float64
 3   3       10000 non-null  float64
 4   4       10000 non-null  float64
 5   5       10000 non-null  float64
 6   6       10000 non-null  float64
 7   7       10000 non-null  float64
 8   8       10000 non-null  float64
 9   9       10000 non-null  float64
 10  target  10000 non-null  float64
dtypes: float64(11)
memory usage: 859.5 KB


In [9]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
0,10000.0,0.001308,0.987169,-3.745536,-0.646897,0.002266,0.668527,3.49155
1,10000.0,-0.011299,0.999013,-4.659953,-0.691134,-0.013958,0.657367,3.844825
2,10000.0,0.003908,1.000586,-3.979925,-0.664979,0.015803,0.67331,3.85202
3,10000.0,-0.002088,0.997873,-3.694285,-0.679852,-0.006015,0.669624,3.687019
4,10000.0,0.000451,1.003074,-3.740101,-0.684338,-0.003275,0.677439,4.019774
5,10000.0,0.004861,0.991729,-4.446632,-0.666371,0.013982,0.67257,3.83179
6,10000.0,0.000872,0.999249,-3.666662,-0.678239,0.000501,0.676898,4.241772
7,10000.0,-0.001579,0.992302,-3.532992,-0.667808,-0.001851,0.660881,3.803844
8,10000.0,0.017416,0.991444,-3.581046,-0.628837,0.017455,0.684594,3.766942
9,10000.0,0.001917,1.011011,-4.852118,-0.681952,-0.013882,0.692497,3.483755


In [11]:
# First split: train + validation: 0.8 | test: 0.2
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% is 20% of the full dataset,
# which leaves train: 0.6 | validation: 0.2
df_train, df_val = train_test_split(df_train, test_size=0.25, random_state=0)

# Final dataset sizes: train: 0.6, validation: 0.2, test: 0.2
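
The `test_size=0.25` in the second split follows from simple arithmetic: the validation set should be 20% of the full data, but the split is applied to the 80% that remains after the test set is removed, so the fraction is 0.2 / 0.8 = 0.25. Below is a minimal sketch of that calculation. `second_split_fraction` is a hypothetical helper (it is not part of scikit-learn), and the row counts assume the 10,000-sample dataset used in this exercise.

```python
def second_split_fraction(val_fraction, test_fraction):
    """Return the test_size to pass to the second split.

    Hypothetical helper: the first split removes `test_fraction` of the data,
    so the validation share has to be rescaled to the remainder.
    """
    remaining = 1.0 - test_fraction
    return val_fraction / remaining


n_total = 10_000  # assumed dataset size, matching n_samples in this exercise
fraction = second_split_fraction(val_fraction=0.2, test_fraction=0.2)

# Expected row counts after the two splits
n_test = round(n_total * 0.2)
n_val = round((n_total - n_test) * fraction)
n_train = n_total - n_test - n_val

print(f"second test_size: {fraction:.2f}")
print(f"train={n_train}, validation={n_val}, test={n_test}")
```

The same helper would give a different fraction for other layouts, for example 0.1 / 0.9 for an 80/10/10 split.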

In [12]:
# Output each shape to confirm the size of train/validation/test
print(f"Train: {df_train.shape}")
print(f"Validation: {df_val.shape}")
print(f"Test: {df_test.shape}")

Train: (6000, 11)
Validation: (2000, 11)
Test: (2000, 11)


In [14]:
# Separate the training features from the target column
features = df_train.drop("target", axis=1)
target = df_train["target"]

# Train the linear model by fitting it on the training features and target
reg = LinearRegression().fit(features, target)
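
Older scikit-learn releases accepted `LinearRegression(normalize=True)`, but that argument was removed in version 1.2. If feature scaling is wanted, one option is to put a scaler in front of the model with a pipeline. The sketch below is an illustration under assumptions, not the exercise's required solution: it uses `StandardScaler`, which is not numerically identical to the old `normalize` behaviour, and a small synthetic dataset so that it runs on its own. For ordinary least squares with an intercept, scaling the features should not change R², which the comparison is meant to show.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-alone dataset (assumed parameters, chosen only for illustration)
X, y = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=5,
    noise=40,
    random_state=0,
)

# Plain linear regression without any scaling
plain = LinearRegression().fit(X, y)

# Same model with standardised features, expressed as a pipeline
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

plain_r2 = plain.score(X, y)
scaled_r2 = scaled.score(X, y)
print(f"plain R²:  {plain_r2:.6f}")
print(f"scaled R²: {scaled_r2:.6f}")
```

Scaling matters more for regularised models such as ridge or lasso, where the penalty depends on the magnitude of the coefficients.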

In [15]:
# Evaluate the model on the training set; `score` returns R²
# (the coefficient of determination) by default
reg.score(features, target)

0.9377324850713262
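
To make the number returned by `score` concrete: R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The sketch below computes it in plain Python. The function name `r2` and the four sample values are illustrative assumptions and are not taken from this dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the targets
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

toy_r2 = r2(y_true, y_pred)
perfect_r2 = r2(y_true, y_true)
baseline_r2 = r2(y_true, [sum(y_true) / len(y_true)] * len(y_true))

print(f"toy predictions: {toy_r2:.4f}")
print(f"perfect predictions: {perfect_r2:.1f}")
print(f"always predicting the mean: {baseline_r2:.1f}")
```

A score of 1.0 means perfect predictions, 0.0 means the model does no better than predicting the mean, and negative values are possible when it does worse.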

In [16]:
# Evaluate the model on the validation set
features_val = df_val.drop("target", axis=1)
target_val = df_val["target"]

reg.score(features_val, target_val)

0.9349344900971387

In [17]:
# Once you are done optimizing the model using the validation set,
# evaluate the final model by scoring it on the test set.
features_test = df_test.drop("target", axis=1)
target_test = df_test["target"]

reg.score(features_test, target_test)

0.9323863267980969
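
One way to read the three scores together is to look at the gap between the training score and each held-out score. A small gap suggests the model generalizes about as well as it fits the training data. The sketch below copies the R² values from the outputs in this exercise; the dictionary names are illustrative.

```python
# R² scores copied from the outputs in this exercise
scores = {
    "train": 0.9377324850713262,
    "validation": 0.9349344900971387,
    "test": 0.9323863267980969,
}

# Gap between the training score and each held-out score
gaps = {
    name: scores["train"] - value
    for name, value in scores.items()
    if name != "train"
}

for name, gap in gaps.items():
    print(f"train - {name}: {gap:.4f}")
```

Here both gaps are well under 0.01, which is consistent with a simple linear model fitted on 6,000 rows.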