### Lesson 12 Bonus Lab:  Linear Regression With the Bikeshare Dataset

Welcome! This notebook provides additional practice for people who want to get more familiar with gradient boosting and scikit-learn.

The topic of the notebook is using Gradient Boosting to forecast demand for bikeshare stations.

The dataset has the following columns:  

  - **datetime:** a timestamp collected hourly.
  - **season:** a categorical column that lists the current season for that observation
  - **holiday:** a column (0 or 1) that indicates whether or not it was a holiday
  - **workingday:** a column (0 or 1) that indicates whether or not it was a workday
  - **weather:** a categorical column that lists a light weather description for the observation
  - **temp:** the temperature outside
  - **atemp:** the temperature it feels like outside
  - **humidity:** the humidity outside
  - **windspeed:** the windspeed, in mph
  - **count:** the number of bikes checked out during that hour
  
Your job is to build a regression model that appropriately captures the information available to make the most accurate predictions.

### Step 1:  Load in the Dataset

 - It's called `bikeshare.csv`
 - Make sure to make `datetime` a time column
 - It's not a bad idea to use it as an index column, although this isn't necessary.

In [39]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

In [40]:
df = pd.read_csv('../data/bikeshare.csv', parse_dates=['datetime'], index_col='datetime')

In [41]:
# initialize the model
gbm = GradientBoostingRegressor()

In [42]:
#check if any value is missing
df.isnull().sum()

season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
count         0
dtype: int64

In [43]:
#check data types
df.dtypes

season         object
holiday         int64
workingday      int64
weather        object
temp          float64
atemp         float64
humidity        int64
windspeed     float64
count           int64
dtype: object

### Step 2: Transform Your Categorical Variables (If Necessary)

This dataset has two categorical columns -- `weather` and `season`.  Decide how you might encode these -- One Hot encoding, ordinal encoding, or categorical encoding (or try several if you have time).

In [44]:
# the categorical columns are:
df.select_dtypes(include='object')  # np.object was removed in NumPy 1.24

datetime,season,weather
2011-01-01 00:00:00,Spring,Clear Skies
2011-01-01 01:00:00,Spring,Clear Skies
2011-01-01 02:00:00,Spring,Clear Skies
2011-01-01 03:00:00,Spring,Clear Skies
2011-01-01 04:00:00,Spring,Clear Skies
...,...,...
2012-12-19 19:00:00,Winter,Clear Skies
2012-12-19 20:00:00,Winter,Clear Skies
2012-12-19 21:00:00,Winter,Clear Skies
2012-12-19 22:00:00,Winter,Clear Skies


In [45]:
#onehot (dummies) encoding
df_dummies=pd.get_dummies(df).copy()

In [46]:
#categorical encoding
cat_cols = df.select_dtypes(include='object').columns.tolist()  # np.object was removed in NumPy 1.24
df[cat_cols] = df[cat_cols].astype('category')
for col in cat_cols:
    df[col] = df[col].cat.codes
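
The ordinal option mentioned above isn't shown in the cells, so here is a minimal sketch of it on a toy frame. It assumes the season labels are `Spring`, `Summer`, `Fall` and `Winter` (only `Spring` and `Winter` are visible in the output above), and the ordering is an arbitrary choice made for illustration. The names `toy` and `season_order` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the bikeshare data (labels are assumed).
toy = pd.DataFrame({"season": ["Spring", "Summer", "Fall", "Winter", "Spring"]})

# Spell out the order explicitly so the integer codes follow an order we chose,
# rather than the alphabetical order that .astype('category') falls back on.
season_order = ["Spring", "Summer", "Fall", "Winter"]
toy["season_ordinal"] = pd.Categorical(
    toy["season"], categories=season_order, ordered=True
).codes

print(toy)
```

Whether an explicit order helps a tree-based model is something to check on your validation set rather than assume.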

### Step 3: Create a Training & Test Set

Given that there's a time based column, make the most recent values your test set.  Do a 20% split.  (You can use `train_test_split` for this, but it's not necessary.  You could also just sort by `datetime` and take the bottom 20% of rows for your test set).

**Note:** You can use the argument `shuffle=False` if you want to use `train_test_split` without shuffling the data.

In [47]:
X = df.drop('count', axis=1)
y = df['count']

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
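
If you'd rather not use `train_test_split`, the sort-and-slice approach described above might look like the sketch below. It runs on a toy hourly frame instead of the real data, and the names `toy`, `split_at`, `train` and `test` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy hourly frame standing in for the bikeshare data.
idx = pd.date_range("2011-01-01", periods=100, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"count": np.arange(100)}, index=idx)

toy = toy.sort_index()            # make sure rows are in time order
split_at = int(len(toy) * 0.8)    # the first 80% of rows go to training
train, test = toy.iloc[:split_at], toy.iloc[split_at:]

print(len(train), len(test))
print(train.index.max(), "<", test.index.min())
```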

### Step 4: Create a Validation Set From Your Training Set

Remember: this is your test set within the training set. Make it 20% of your training set.

In [49]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)

### Step 5:  Do An Initial Fitting And Scoring of Your Model

 - Remember, fit on the training set, and score on the validation set.
 - How much is your model overfitting (if at all)?

In [50]:
gbm.fit(X_train,y_train).score(X_val,y_val)

0.16157226477301956
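
To answer the overfitting question, one option is to score the same fitted model on both the training and the validation set and compare. The sketch below does this on synthetic data, so the numbers say nothing about the bikeshare results; every variable name in it is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: one informative feature plus noise (not the bikeshare data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Keep the last 20% of rows for validation, without shuffling.
X_tr, X_va = X[:240], X[240:]
y_tr, y_va = y[:240], y[240:]

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)   # R^2 on data the model has seen
val_r2 = model.score(X_va, y_va)     # R^2 on held-out data
gap = train_r2 - val_r2              # a large gap suggests overfitting

print(f"train R^2={train_r2:.3f}  val R^2={val_r2:.3f}  gap={gap:.3f}")
```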

### Step 6: Look At Your Feature Importance Scores

What seems to be having the most impact?

In [51]:
feat_importance = pd.DataFrame({
    'feature':X.columns,
    'importance' : gbm.feature_importances_
})

In [52]:
feat_importance.set_index('feature').sort_values(by='importance', ascending = False)

feature,importance
humidity,0.365932
atemp,0.283644
temp,0.217744
workingday,0.045799
season,0.045406
windspeed,0.027583
weather,0.013429
holiday,0.000464


### Step 7: Build New Features (ie, Add New Columns To Your Dataset)

This is your chance to think about ways to better capture the value and impact of time and other variables on the target variable (`count`).

What you should do here is add a new feature to your training and validation set, re-run your model on the  training set, and score it on the validation set to see if it made an improvement.  

A good place to start with this is extracting out different date parts to see if they improve your validation score.

You can find information about the different dateparts in pandas here:  https://pandas.pydata.org/pandas-docs/version/0.24/reference/series.html#time-series-related

Or if you're using the `datetime` column as an index:  https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html

Keep features if they improve your validation score, discard them if they don't.

A few other ideas:  

 - you can create a column called `Daytime` that tests whether or not it's light outside (i.e., `False` between 7 PM and 7 AM, `True` otherwise).  You could also get fancier and adjust the daylight hours depending on season.  

 - you can also create a variable that tracks the passage of time.  This can be done by finding the earliest date in the dataset, subtracting it from each observed date, and extracting the number of days.  This way, if there is an upward or downward trend throughout the dataset, you'd be able to capture it. So something like `X_train['time'] = (X_train.index - earliest_date).days` 
 - You could also try multiplying different columns together.  Maybe it being `Daytime`, `Sunny` and low humidity has a multiplicative effect that isn't totally captured by any of the variables by themselves.

**Note:** Dateparts, despite being numbers, are probably best thought of as **categorical** variables. Think about it -- the 11 AM hour is something distinct from the 11 PM hour. They are better interpreted as separate categories than as one continuous column.
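
The ideas above can be sketched on a small hourly frame as follows. This is only an illustration under assumptions: the 7 AM to 7 PM daylight window comes from the suggestion above, while the toy data and the column names `dayofweek`, `month` and `daytime_x_humidity` are made up for the example.

```python
import pandas as pd

# Three days of hourly timestamps standing in for the real index.
idx = pd.date_range("2011-01-01", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"humidity": [50] * 72}, index=idx)

# Date parts pulled straight from the DatetimeIndex.
toy["hour"] = toy.index.hour
toy["dayofweek"] = toy.index.dayofweek
toy["month"] = toy.index.month

# Daytime flag: True from 07:00 up to (but not including) 19:00.
toy["Daytime"] = (toy["hour"] >= 7) & (toy["hour"] < 19)

# Passage of time: whole days elapsed since the earliest timestamp.
earliest_date = toy.index.min()
toy["time"] = (toy.index - earliest_date).days

# A simple interaction term between two existing columns.
toy["daytime_x_humidity"] = toy["Daytime"].astype(int) * toy["humidity"]

print(toy.head())
```

To treat a date part as categorical, you could pass it through the same encoding you chose in Step 2, and keep it only if the validation score improves.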

In [53]:
df['hour'] = df.index.hour

In [54]:
# display the frame with the new 'hour' column (no 'Daytime' column has been created yet)
df

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,hour
2011-01-01 00:00:00,1,0,0,0,9.84,14.395,81,0.0000,16,0
2011-01-01 01:00:00,1,0,0,0,9.02,13.635,80,0.0000,40,1
2011-01-01 02:00:00,1,0,0,0,9.02,13.635,80,0.0000,32,2
2011-01-01 03:00:00,1,0,0,0,9.84,14.395,75,0.0000,13,3
2011-01-01 04:00:00,1,0,0,0,9.84,14.395,75,0.0000,1,4
...,...,...,...,...,...,...,...,...,...,...
2012-12-19 19:00:00,3,0,1,0,15.58,19.695,50,26.0027,336,19
2012-12-19 20:00:00,3,0,1,0,14.76,17.425,57,15.0013,241,20
2012-12-19 21:00:00,3,0,1,0,13.94,15.910,61,15.0013,168,21
2012-12-19 22:00:00,3,0,1,0,13.94,17.425,61,6.0032,129,22


### Step 8: Fit Your Model On ALL of your training data

An important step here -- now that you've figured out what columns to include, and what ones to exclude, concatenate your training and validation sets, and fit your model on ALL of your training data.

The idea is that, now that you've found the features that help, you should give your model more samples to learn from.

You can use the `pd.concat()` function here.

Also -- for good measure, standardize all of your training data before fitting it if you haven't done so already.
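
As a rough sketch of the concatenation step, here is what it might look like on random stand-in data. The frames reuse the names `X_train`, `X_val`, `y_train` and `y_val` from earlier steps, but the values are synthetic, and `X_full`, `y_full` and `final_model` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Random stand-ins for the training and validation splits.
rng = np.random.default_rng(0)
cols = ["temp", "humidity"]
X_train = pd.DataFrame(rng.normal(size=(80, 2)), columns=cols)
X_val = pd.DataFrame(rng.normal(size=(20, 2)), columns=cols)
y_train = pd.Series(rng.normal(size=80))
y_val = pd.Series(rng.normal(size=20))

# Stack training and validation rows back together, keeping their order.
X_full = pd.concat([X_train, X_val])
y_full = pd.concat([y_train, y_val])

# Refit a fresh model on all of the training data.
final_model = GradientBoostingRegressor(random_state=0).fit(X_full, y_full)
print(X_full.shape, y_full.shape)
```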

In [8]:
# your answer here

### Step 9: Score Your Model on the Test Set

Once you've found the best version of your model on your validation set, transform your test set so that it is set up the same way.

I.e., if you added a column that improved your validation score, add that same column to your test set.  

Remember to standardize your test set using the values from your training set.

How close were your validation scores to your test set scores?
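
Standardizing the test set "using the values from your training set" means fitting the scaler on training rows only and reusing it for the test rows. Here is a minimal sketch with scikit-learn's `StandardScaler` on synthetic arrays; it's one way to do it, not the only one, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10, scale=3, size=(80, 2))
X_test = rng.normal(loc=10, scale=3, size=(20, 2))

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuses the training mean and std

print(X_train_std.mean(axis=0).round(6), X_test_std.mean(axis=0).round(3))
```

The test set's mean won't be exactly zero after the transform, and that's expected, because its statistics were never used.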

In [9]:
# your answer here

### Diagnostics

Now we'll look at a few different areas of your model to see if there's anything causing our results to be skewed.

### Step 10:  Make a prediction on your test set, and calculate the error

The error in this case is just the difference between the value for `count` and the value of your prediction.
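
As a small worked example of that definition, with made-up numbers rather than real model output (in the notebook, `predictions` would come from calling `.predict()` on your test features):

```python
import pandas as pd

# Made-up actual counts and predictions, just to show the arithmetic.
y_test = pd.Series([16, 40, 32, 13], name="count")
predictions = pd.Series([20.0, 35.0, 30.0, 15.0])

errors = y_test - predictions   # positive means the model under-predicted
mae = errors.abs().mean()       # mean absolute error as a one-number summary

print(errors.tolist())  # [-4.0, 5.0, 2.0, -2.0]
print(mae)              # 3.25
```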

In [11]:
# your answer here