
# Linear Regression: Further Practice

In this exercise we return to the bikes dataset from last time, this time following better practice: a held-out test set and cross-validation.

Read in the bikes dataset.

In [1]:
import pandas as pd

bikes = pd.read_csv("../assets/data/bikeshare.csv")
bikes.rename(columns={"count": "total_rentals"}, inplace=True)

#### 1: Choose 3 features to predict `total_rentals`, and store the features in `X` and the target in `y`

In [2]:
X = bikes[["temp", "windspeed", "holiday"]]
y = bikes["total_rentals"]

#### 2: Create a training and test set

We'll use the training set for cross-validation and keep the test set aside until step 5.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#### 3: Measure the performance of your model across 7 folds

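Get a feel for your model's performance. Try both Root Mean Squared Error (RMSE), using `'neg_mean_squared_error'`, and Mean Absolute Error (MAE), using `'neg_mean_absolute_error'`, as scoring metrics and compare them. scikit-learn returns these scores negated (so that higher is always better), so flip the sign, and take the square root of the MSE to get the RMSE.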

In [4]:
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()

mse_scores = cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=7)
mae_scores = cross_val_score(model, X_train, y_train, scoring="neg_mean_absolute_error", cv=7)

avg_rmse = np.mean(np.sqrt(-mse_scores))
avg_mae = np.mean(-mae_scores)

print(f"Average RMSE: {avg_rmse}\tAverage MAE: {avg_mae}")

Average RMSE: 165.98840264655772	Average MAE: 126.5131019365265
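
To see why the RMSE above comes out higher than the MAE, here is a minimal sketch on a handful of made-up prediction errors (the numbers are illustrative assumptions, not taken from the bikes data). Squaring gives large errors more weight, so RMSE is never smaller than MAE, and the gap widens when a few predictions are far off.

```python
import numpy as np

# Hypothetical prediction errors (actual - predicted); illustrative values only
errors = np.array([10.0, -20.0, 30.0, -200.0])

mae = np.mean(np.abs(errors))          # average size of an error
rmse = np.sqrt(np.mean(errors ** 2))   # squaring weights large errors more heavily

print(f"MAE: {mae:.1f}\tRMSE: {rmse:.1f}")
```

A large gap between the two metrics suggests that a few big misses dominate the error, which is worth knowing before choosing which one to report.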


#### 4: We'll try two more models

First, do a new train-test split on the entire bikes data.

Previously, we only used 3 features for our split, but we want access to all of them now, so we can train different models that use different combinations of features.

We will do both the cross-validated training and the model comparison on the **training** set; the test set stays untouched until step 5.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(bikes.drop("total_rentals", axis=1),
                                                    bikes["total_rentals"],
                                                    test_size=0.3,
                                                    random_state=42)

Now try **two** more models with different combinations of features and compare their performance.

*As bonus practice, you could put the code you've written so far into a function so you can easily try different combinations of features!*

*For example, it could take as parameters a feature matrix X (your training set) and y (your training targets) and a list of columns to use for the model, and could print/return the cross-validated scores.*

```python
def evaluate_features(features, X, y):
    ...
```

In [6]:
def evaluate_features(features, X, y, k=7):
    """Print the average k-fold cross-validated RMSE for the given feature columns."""
    model = LinearRegression()

    # Restrict the feature matrix to the requested columns
    X_local = X[features]

    mse_scores = cross_val_score(model, X_local, y, scoring="neg_mean_squared_error", cv=k)

    # Scores are negated MSEs: flip the sign, then take the root of each fold's MSE
    avg_rmse = np.mean(np.sqrt(-mse_scores))

    print(f"Average RMSE for {features} across {k} folds:\t{avg_rmse}")

evaluate_features(["humidity", "holiday"], X_train, y_train)
evaluate_features(["humidity", "holiday", "atemp"], X_train, y_train)

Average RMSE for ['humidity', 'holiday'] across 7 folds:	172.0394358437882
Average RMSE for ['humidity', 'holiday', 'atemp'] across 7 folds:	158.36817203339197
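
If you want to compare many feature sets, it can help to return the score instead of printing it. The following is a rough sketch of that idea, not part of the original solution: `cv_rmse` is a hypothetical helper name, and the data is a small synthetic stand-in (columns `a` and `b`) rather than the bikes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def cv_rmse(features, X, y, k=7):
    """Hypothetical helper: return the mean k-fold cross-validated RMSE."""
    mse_scores = cross_val_score(LinearRegression(), X[features], y,
                                 scoring="neg_mean_squared_error", cv=k)
    return np.mean(np.sqrt(-mse_scores))


# Synthetic stand-in for a training set (assumption: not the bikes data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=210), "b": rng.normal(size=210)})
y_demo = 3 * X_demo["a"] + rng.normal(scale=0.5, size=210)

# Score several candidate feature sets and list them from best to worst
candidates = [["a"], ["b"], ["a", "b"]]
results = {tuple(f): cv_rmse(f, X_demo, y_demo) for f in candidates}
for feats, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{list(feats)}: {score:.3f}")
```

With the real data you would pass `X_train` and `y_train` and your own lists of column names in place of the synthetic frame.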


#### 5: Take the best of your three trained models and evaluate it on the *test* set.

This gives an estimate of how well your model performs in the real world, and it is the error you would report as your final result. After this step, you should **not** go back and tune or re-select models using the test set, because the reported error would then be optimistically biased (in effect, you would be overfitting to the test set).

For this question, first use your best model to predict values on the test inputs (`X_test`), then compare the predictions to the actual values (`y_test`).

(*Note: if you have a model object from step 4, but used `cross_val_score` to evaluate performance, you will need to fit the model on the training set yourself, because `cross_val_score` fits copies of the model internally and leaves the object you passed in unfitted.*)

In [7]:
from sklearn.metrics import mean_squared_error

# Best feature set from step 4 (lowest cross-validated RMSE)
best_features = ["humidity", "holiday", "atemp"]

# Fit on the full training set, then score once on the held-out test set
humidity_holiday_atemp_model = LinearRegression().fit(X_train[best_features], y_train)

y_pred = humidity_holiday_atemp_model.predict(X_test[best_features])
print(np.sqrt(mean_squared_error(y_test, y_pred)))

156.2316178502663
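
The note above about refitting can be checked directly. This is a minimal sketch on synthetic data (an assumption, not the bikes dataset): after `cross_val_score` runs, the estimator that was passed in still has no `coef_` attribute, and it only gains one after an explicit call to `fit`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(70, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=70)

model = LinearRegression()
cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=7)

# cross_val_score fits clones of the estimator, so `model` itself is still unfitted
fitted_after_cv = hasattr(model, "coef_")

# An explicit fit is what makes the model usable for predict()
model.fit(X_demo, y_demo)
fitted_after_fit = hasattr(model, "coef_")

print(fitted_after_cv, fitted_after_fit)
```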


#### 6: How did your model do?

If your test set error is similar to the cross-validated error from training, it means the cross-validated error was representative of the model's real-world performance.

Overfitting shows up as a test error that is much higher than your training error, i.e. your model hasn't generalised.

Look at the output from **5** - how well did your model do "in the real world"?

#### Answer

The test set error (RMSE of about 156.2) is very similar to the cross-validated error (RMSE of about 158.4), which means the model performs about as well "in the real world", outside of the training phase, as it did during training.

This is a good sign that it's a stable model!
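
One simple way to put a number on "very similar" is the relative gap between the two errors. The sketch below copies the RMSE values from the outputs of steps 4 and 5; the variable names are only illustrative, and there is no universal threshold for what counts as an acceptable gap.

```python
# Values copied from the outputs of steps 4 and 5 above
cv_error = 158.36817203339197    # cross-validated RMSE on the training set
test_error = 156.2316178502663   # RMSE on the held-out test set

# Relative gap: positive means the test error is worse than the cross-validated error
relative_gap = (test_error - cv_error) / cv_error

print(f"Test RMSE differs from cross-validated RMSE by {relative_gap:.1%}")
```

Here the test error is a little over 1% lower than the cross-validated error, so there is no sign of overfitting in this comparison.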