# Homework 5: Feature Engineering
DATA 202 @ Calvin, FA19


In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

# Set some display settings.
sns.set(context='notebook')

In [None]:
# Here is a convenient function for you to use.
def show_coefficients(model, feature_names):
    '''
    Show the coefficients and intercept of a linear model in a nice format.
    
    model: a LinearRegression object
    feature_names: a sequence of column names (e.g., train_X.columns)
    '''
    coefs_with_intercept = np.r_[[model.intercept_], model.coef_]
    feature_names_with_intercept = ["intercept"] + list(feature_names)
    display(
        pd.Series(coefs_with_intercept, index=feature_names_with_intercept)
        .to_frame('coefficient')
        .rename_axis(index="feature")
        .style.bar(align='mid'))

# Setup

We started working with the Capital Bikeshare dataset in Homeworks 2 and 4. In this exercise we will use *multiple features* and *feature engineering* to dramatically improve our prediction performance. But as we saw in Lab 5, these tools also run the risk of *overfitting*, so be careful...

As before, our basic goal will be to try to predict ridership in 2012 based on ridership data in 2011. So the 2011 data will be our *training set* and the 2012 data will be our *test set* (aka *held-out* data or sometimes *validation* data).

First, we'll load up the data.

In [None]:
hourly_counts_orig = pd.read_csv('data/hour.csv')
print(len(hourly_counts_orig), "observations")
hourly_counts_orig.head()

## Renaming
Some of those column names are pretty awful. Let's fix a few of them up.

In [None]:
hourly_counts = hourly_counts_orig.rename(columns={
    'dteday': 'date',
    'hr': "hour",
    'mnth': "month",
    'weekday': "day_of_week",
    'holiday': "is_holiday",
    'workingday': "is_workingday",

    'weathersit': "precip_type",
    "hum": "humidity",
    'cnt': "rides"
}).drop(["instant", "casual", "registered"], axis=1)
hourly_counts['day_of_year'] = pd.to_datetime(hourly_counts['date']).dt.dayofyear

# Rearrange the column order
hourly_counts = hourly_counts[['date', 'day_of_year', 'season', 'yr', 'day_of_week', 'month',
       'hour', 'is_holiday', 'is_workingday', 'precip_type', 'temp', 'atemp',
       'humidity', 'windspeed', 'rides']]
hourly_counts.head()

In [None]:
plt.plot(
    hourly_counts.groupby(
        pd.to_datetime(hourly_counts['date'])
    ).rides.sum(),
    '.')
plt.xlabel("Date")
plt.ylabel("# Rides");

# HACK!
*(You may safely ignore this section.)*

Since the bike share program was overall more popular in 2012 than in 2011, predicting based on 2011 data will systematically under-predict ridership in 2012. More advanced modeling and validation techniques can handle this shift directly, but for now we'll do this little hack to make our current tools work. (Only do this in real life if you have a *really* good explanation for why, and be *totally* transparent about it if you do.)

First, we notice that there were many more rides in 2012 than in 2011.

In [None]:
hourly_counts.groupby('yr').rides.sum()

To make things comparable, let's normalize by popularity. Of course we wouldn't actually *know* the total popularity of 2012 during that year, which is why this particular approach is labeled "HACK!".

In [None]:
year_counts = hourly_counts.groupby('yr').rides.sum()
ratio = year_counts[1] / year_counts[0]
print("Scaling test set by", ratio)

hourly_counts['rides'] = np.where(hourly_counts.yr == 0, hourly_counts.rides, hourly_counts.rides / ratio)
print("New ridership totals by year:", hourly_counts.groupby('yr').rides.sum())

## Train-Test Split
We're going to use 2011 as the training set and 2012 as the test set.

In [None]:
train = hourly_counts[hourly_counts.yr == 0]
test = hourly_counts[hourly_counts.yr == 1]

In [None]:
assert train['date'].iloc[0] == '2011-01-01'
assert test['date'].iloc[0] == '2012-01-01'
assert all(train['date'].str.startswith('2011'))
assert all(test['date'].str.startswith('2012'))
assert len(train) + len(test) == len(hourly_counts)

In [None]:
train.drop(["yr"], axis=1).head()

# A Single Feature

Does apparent temperature ("feels like") affect ridership? Let's make a quick plot (we already did this in the last homework).

In [None]:
plt.scatter(train['atemp'], train['rides'], s=.1)

It looks like it does.

# Exercise 1: Fit a linear regression predicting `rides` from `atemp`

## 1a: Make a model called `temp_only_model`.

In [None]:
def transform(data):
    ...
    ...
    return X, y

...

## 1b: Show the model's MSE and R2 for the training and test set
Use a function to compute and display the scores so you don't have to repeat quite so much code.

In [None]:
def show_scores(train_y, train_y_pred, test_y, test_y_pred):
    ...
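
As a reminder of what the two scores in 1b measure, here is a minimal sketch that computes MSE and R2 with `sklearn.metrics` on a handful of made-up numbers. The arrays `y_true` and `y_pred` are assumptions for illustration only; they are not the bike share data, and this is not meant as the body of `show_scores`.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions (assumed values, not from hour.csv).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Mean squared error: the average of the squared residuals.
mse = metrics.mean_squared_error(y_true, y_pred)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
r2 = metrics.r2_score(y_true, y_pred)

print(f"MSE = {mse:.3f}, R2 = {r2:.3f}")
```

Both functions take the true values first and the predictions second, so the same two calls work for the training set and for the test set.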


# Exercise 2: Add `month` as a continuous feature
## 2a: Make a model called `temp_and_month_model`

In [4]:
# your code here

## 2b: Show the model's MSE and R2 for the training and test set
Use a function to compute and display the scores so you don't have to repeat quite so much code.

Does `temp_and_month_model` predict better or worse than `temp_only_model`? How can you tell?

*answer*

For each of the following hypothetical kinds of relationships, can `temp_and_month_model` model them? Answer just *yes* or *no*:

1. There were more rides in summer months than winter months.
2. The number of rides increased overall throughout the year.
3. A 1-degree increase in temperature has a larger effect on ridership in winter months than in summer months.

*Answer*:

1. __
2. __
3. __

# Exercise 3: Add `month` as a one-hot-encoded feature instead

In [5]:
# your code here
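
If the mechanics of one-hot encoding are unfamiliar, the sketch below shows one possible way to build the indicator columns with `pd.get_dummies` on a tiny made-up frame. The `toy` DataFrame and its values are assumptions for illustration; other approaches (for example, building the columns by hand in a loop) work just as well.

```python
import pandas as pd

# Tiny made-up frame (assumed values, not the bike share data).
toy = pd.DataFrame({
    "month": [1, 2, 3, 1],
    "atemp": [0.20, 0.30, 0.50, 0.25],
})

# One indicator column per distinct month: month_1, month_2, month_3.
dummies = pd.get_dummies(toy["month"], prefix="month", dtype=float)

# Leave out the first month's column, as the exercise does with month_1.
dummies = dummies.drop(columns=["month_1"])

# Combine the continuous feature with the indicator columns.
X = pd.concat([toy[["atemp"]], dummies], axis=1)
print(X)
```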

Why would it have been redundant to include `month_1`?

*answer*

Does `temp_and_month_model` predict better or worse than `temp_only_model`? How can you tell?

*answer*

For each of the following hypothetical kinds of relationships, can `temp_and_month_model` (now with the one-hot encoding) model them? Answer just *yes* or *no*:

1. There were more rides in summer months than winter months.
2. The number of rides increased overall throughout the year.
3. A 1-degree increase in temperature has a larger effect on ridership in winter months than in summer months.

*Answer*:

1. __
2. __
3. __

# Exercise 4: Instead of `month`, one-hot encode `date`.

In [6]:
dates = sorted(set(train['date']))

What happened? Why?

*answer*

# Exercise 5: Add a one-hot encoding for `hour`


In [7]:
# your code here

How does the performance of this new model compare to the best previous model?

*your answer here*

Why did adding features for `month` and `hour` improve performance on the test set, while adding `date` did not?

*your answer here*

# Exercise 6: Change the scale of the indicator variables

When we did one-hot encoding, we arbitrarily picked 1.0 to be the value that would indicate the active month. What if we used 10.0 instead of 1.0? **Copy your code for `temp_and_month_model` here** and **change the 1.0 to 10.0** for the indicator scale.
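
If your encoding came from `pd.get_dummies` rather than from an explicit 1.0 in your own code, one possible way to change the scale is to multiply the indicator columns by a constant. This is only a sketch on made-up data; `toy` and `indicator_scale` are hypothetical names, not part of the assignment.

```python
import pandas as pd

# Tiny made-up frame (assumed values, not the bike share data).
toy = pd.DataFrame({"month": [1, 2, 3, 2]})

# Hypothetical knob: 1.0 reproduces the usual one-hot encoding.
indicator_scale = 10.0

# get_dummies yields 0.0/1.0 columns; multiplying rescales the "active" value.
scaled = indicator_scale * pd.get_dummies(toy["month"], prefix="month", dtype=float)
print(scaled)
```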

How does the **accuracy** of the 10.0 model on the training and test set compare with the 1.0 model?

*answer*

How do the **coefficients** of the 10.0 model compare with the coefficients of the 1.0 model?

*answer*

**True or False**: The larger the coefficient of a feature is, the more important that feature is to the model.

*answer*

# Exercise 7: Add a 2nd-degree polynomial term for `atemp`
The model should now include the following features:
* `atemp`
* `atemp ** 2`
* `month` (one-hot encoded)
* `hour` (one-hot encoded)
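
Adding a polynomial term just means adding a new column whose values are the square of an existing one; the model itself is still a `LinearRegression`. The sketch below illustrates this on synthetic, noise-free data generated from a known quadratic. The data and the column name `atemp_squared` are assumptions for illustration, and the month and hour columns are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption): y = 1 + 2*x - 3*x**2
toy = pd.DataFrame({"atemp": np.linspace(0.0, 1.0, 11)})
toy["y"] = 1 + 2 * toy["atemp"] - 3 * toy["atemp"] ** 2

# The 2nd-degree term is just one more feature column.
X = toy[["atemp"]].copy()
X["atemp_squared"] = X["atemp"] ** 2

quad_model = LinearRegression().fit(X, toy["y"])

# With noise-free data the fit recovers the generating coefficients.
print("intercept:", quad_model.intercept_)
print("coefficients:", dict(zip(X.columns, quad_model.coef_)))
```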

How does the performance of this new model compare to the best previous model?

*your answer here*

# Exercise 8: Use a k-NN model

Try using a `KNeighborsRegressor` instead of the `LinearRegression`. Note that you'll need to set a value for `n_neighbors`. What do you notice?
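
For orientation, here is a minimal sketch of the `KNeighborsRegressor` interface on five made-up points. The arrays and the query point are assumptions for illustration; the fit/predict calls are the same ones you would use with your own `X` and `y`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Five made-up training points (assumed values): y = x**2
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

# Predict at x = 1.1 with two different neighborhood sizes.
preds = {}
for k in (1, 3):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
    preds[k] = knn.predict(np.array([[1.1]]))[0]
    print(f"n_neighbors={k}: prediction = {preds[k]:.3f}")
```

With the default uniform weighting, each prediction is the plain average of the targets of the `n_neighbors` closest training points, so the choice of `n_neighbors` directly changes the output.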