
# Bike Share Rental Predictions
- Andrea Cohen
- 01.31.23

## Task
  - Predict the total number of bike share rentals for a given hour of the day.

## Load the data

In [None]:
#mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [96]:
#import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn import set_config
set_config(display='diagram')

In [97]:
#load the data
df = pd.read_csv('/content/bikeshare_train - bikeshare_train.csv')
display(df.head())
display(df.info())

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


None

In [98]:
#drop unnecessary columns
df = df.drop(columns = ['casual', 'registered'])
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,1


## Inspect the data

In [99]:
df.shape

(10886, 10)

  - There are 10886 rows and 10 columns.

In [100]:
#check datatypes
df.dtypes

datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
count           int64
dtype: object

  - datetime is datatype object.
  - season, holiday, workingday, weather, humidity, and count are all datatype int64.
  - temp, atemp, and windspeed are all datatype float64.

In [101]:
#check for outliers and obvious errors
df.describe()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,2.506614,0.028569,0.680875,1.418427,20.23086,23.655084,61.88646,12.799395,191.574132
std,1.116174,0.166599,0.466159,0.633839,7.79159,8.474601,19.245033,8.164537,181.144454
min,1.0,0.0,0.0,1.0,0.82,0.76,0.0,0.0,1.0
25%,2.0,0.0,0.0,1.0,13.94,16.665,47.0,7.0015,42.0
50%,3.0,0.0,1.0,1.0,20.5,24.24,62.0,12.998,145.0
75%,4.0,0.0,1.0,2.0,26.24,31.06,77.0,16.9979,284.0
max,4.0,1.0,1.0,4.0,41.0,45.455,100.0,56.9969,977.0


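  - There are no obvious outliers or errors in the data: every column's minimum and maximum fall within a plausible range (for example, humidity runs from 0 to 100 and temp from 0.82 to 41.0).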
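
The describe() summary does not apply a formal outlier rule. As a rough sketch, the common 1.5 * IQR rule of thumb (an assumption here, not something this notebook uses) can be applied to the quartiles printed above. The helper name `upper_fence` is hypothetical. Values beyond the fence are candidates for inspection, not necessarily errors.

```python
# Quartiles and maxima copied from the df.describe() output above.
summary = {
    "count": {"q1": 42.0, "q3": 284.0, "max": 977.0},
    "windspeed": {"q1": 7.0015, "q3": 16.9979, "max": 56.9969},
    "humidity": {"q1": 47.0, "q3": 77.0, "max": 100.0},
}


def upper_fence(q1, q3, k=1.5):
    """Upper fence of the k * IQR rule of thumb (hypothetical helper)."""
    return q3 + k * (q3 - q1)


flagged = {}
for column, stats in summary.items():
    fence = upper_fence(stats["q1"], stats["q3"])
    # True means the column's maximum lies beyond the fence.
    flagged[column] = stats["max"] > fence
    print(f"{column}: upper fence = {fence:.4f}, max = {stats['max']}, beyond fence = {flagged[column]}")
```

Under this rule the maxima of count and windspeed lie beyond their fences while humidity does not, which is what a right-skewed distribution tends to produce.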

In [102]:
#check for duplicated rows
df.duplicated().sum()

0

  - There are 0 duplicates.

In [103]:
#check for missing values
df.isna().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
count         0
dtype: int64

  - There are 0 missing values.

## Prepare the data for modeling

### Make copies of the dataframe for modeling

In [104]:
orig_df = df.copy()
fe_df = df.copy()

### Feature engineering

In [105]:
#convert the datetime column from object (string) to datetime64
fe_df['datetime'] = pd.to_datetime(fe_df['datetime'])
fe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB


In [106]:
#create a column containing the name of the month
fe_df['month (name)'] = fe_df['datetime'].dt.month_name()
#create a column containing the name of the day of the week
fe_df['day of week (name)'] = fe_df['datetime'].dt.day_name()
#create a column containing the hour of the day
fe_df['hour'] = fe_df['datetime'].dt.hour
fe_df['hour'] = fe_df['hour'].astype(object)
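#drop the original datetime and season columns now that month, day of week, and hour have been extracted
fe_df = fe_df.drop(columns = ['datetime', 'season'])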
display(fe_df.head())
display(fe_df.info())

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month (name),day of week (name),hour
0,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   holiday             10886 non-null  int64  
 1   workingday          10886 non-null  int64  
 2   weather             10886 non-null  int64  
 3   temp                10886 non-null  float64
 4   atemp               10886 non-null  float64
 5   humidity            10886 non-null  int64  
 6   windspeed           10886 non-null  float64
 7   count               10886 non-null  int64  
 8   month (name)        10886 non-null  object 
 9   day of week (name)  10886 non-null  object 
 10  hour                10886 non-null  object 
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB


None
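
As a sanity check on the extracted features, the first timestamp can be decomposed with the standard library alone. This is only a sketch for a single value, and the `features` dictionary is a hypothetical stand-in for one row of fe_df.

```python
import calendar
from datetime import datetime

# First timestamp in the dataset, as it appears in the raw CSV.
raw = "2011-01-01 0:00:00"
stamp = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Mirror the three engineered columns for this one timestamp.
features = {
    "month (name)": calendar.month_name[stamp.month],
    "day of week (name)": calendar.day_name[stamp.weekday()],  # Monday is 0
    "hour": stamp.hour,
}
print(features)
```

With the default locale this should agree with the first row shown above: January, Saturday, hour 0.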

In [107]:
#convert temp and atemp columns from Celsius to Fahrenheit (vectorized)
fe_df['temp'] = fe_df['temp'] * 1.8 + 32
fe_df['atemp'] = fe_df['atemp'] * 1.8 + 32
fe_df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,month (name),day of week (name),hour
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4


In [108]:
#create a temp_variance column (temp minus atemp), then drop atemp
fe_df['temp_variance'] = fe_df['temp'] - fe_df['atemp']
fe_df = fe_df.drop(columns = 'atemp')
fe_df.head()

Unnamed: 0,holiday,workingday,weather,temp,humidity,windspeed,count,month (name),day of week (name),hour,temp_variance
0,0,0,1,49.712,81,0.0,16,January,Saturday,0,-8.199
1,0,0,1,48.236,80,0.0,40,January,Saturday,1,-8.307
2,0,0,1,48.236,80,0.0,32,January,Saturday,2,-8.307
3,0,0,1,49.712,75,0.0,13,January,Saturday,3,-8.199
4,0,0,1,49.712,75,0.0,1,January,Saturday,4,-8.199
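
The conversion and the temp_variance value for the first row can be reproduced by hand. The sketch below uses a hypothetical helper, `celsius_to_fahrenheit`, and the Celsius values from the first row of the raw data.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 1.8 + 32


# temp and atemp for the first row of the raw data (Celsius).
temp_c, atemp_c = 9.84, 14.395

temp_f = celsius_to_fahrenheit(temp_c)
atemp_f = celsius_to_fahrenheit(atemp_c)
temp_variance = temp_f - atemp_f

print(round(temp_f, 3), round(atemp_f, 3), round(temp_variance, 3))

# The +32 offset cancels in the subtraction, so the difference is just
# the Celsius difference scaled by 1.8.
scaled_celsius_difference = 1.8 * (temp_c - atemp_c)
```

Because the offset cancels, temp_variance carries the same information whether it is computed before or after the unit conversion; only its scale changes.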


## Predictive modeling

### Decision Tree Regression with the original data

#### Split the data

  - Assign the count column as the target and the rest of the columns as the features matrix

In [109]:
#assign X and y
y = orig_df['count']
X = orig_df.drop(columns = ['count'])

#### Train test split (model validation)

In [110]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
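
No test_size is passed, so scikit-learn's default of 0.25 applies. The split sizes can be worked out by hand; the sketch below assumes the test count is rounded up, which is how the library is understood to handle a fractional test size.

```python
import math

n_rows = 10886                # rows in the dataframe
default_test_fraction = 0.25  # train_test_split default when test_size is omitted

# Assumption: the test count is rounded up and the remainder goes to training.
n_test = math.ceil(default_test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption roughly 8,164 rows are used for training and 2,722 for testing.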

#### Create a preprocessing object

  - Numeric data will be scaled.
  - Categorical (object) columns will be one-hot encoded.
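
To make the two transformations concrete, here is a minimal plain-Python sketch of what standard scaling and one-hot encoding compute. The function names are hypothetical and this is not scikit-learn's implementation. It assumes the population standard deviation for scaling and an all-zero vector for categories not seen when fitting, which is the behaviour requested with handle_unknown='ignore'.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def fit_one_hot(values):
    """Learn the sorted list of categories from the training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; an unseen category becomes all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


humidity = [81, 80, 80, 75, 75]  # first five humidity values
scaled = standard_scale(humidity)

categories = fit_one_hot(["Saturday", "Sunday", "Saturday"])
print(categories)
print(one_hot("Sunday", categories))
print(one_hot("Monday", categories))  # never seen when fitting
```

Decision trees split on thresholds, so scaling should not change which splits are available; it is harmless here but matters more for distance-based or linear models.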

In [111]:
#create columnselectors for the numeric and categorical data
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
#create tuples for the columntransformer
#instantiate the OneHotEncoder and the StandardScaler within the tuples
#scikit-learn 1.2+ names this argument sparse_output; older versions call it sparse
cat_tuple = (OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'), cat_selector)
num_tuple = (StandardScaler(), num_selector)
#create the preprocessing columntransformer
preprocessor1 = make_column_transformer(cat_tuple, num_tuple, remainder='drop')
preprocessor1

#### Create a model pipeline

In [112]:
#instantiate the decision tree regressor
dt = DecisionTreeRegressor(random_state = 42)
#combine the preprocessing columntransformer and the decision tree regressor in a pipeline
dt_pipe = make_pipeline(preprocessor1, dt)
dt_pipe

#### Fit the model pipeline on the training data and make predictions

In [113]:
#fit the model pipeline on the training data
dt_pipe.fit(X_train, y_train)
#make predictions using the training and testing data
training_predictions = dt_pipe.predict(X_train)
test_predictions = dt_pipe.predict(X_test)

#### Evaluate the default model

In [114]:
train_score = dt_pipe.score(X_train, y_train)
test_score = dt_pipe.score(X_test, y_test)
print(train_score)
print(test_score)

1.0
0.03505162953277119


  - The default decision tree scored an R^2 of 1.0 on the training data but only 0.0351 on the test data, so the model is severely overfitting.
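
One likely contributor to the perfect training score is that X still contains the raw datetime column as an object (string) dtype, so the categorical selector sends it to the OneHotEncoder. Hourly timestamps are presumably unique, so each training row gets its own indicator column, while test timestamps were never seen during fitting and encode to all zeros. The toy example below sketches that behaviour with a made-up split of the first few timestamps, not the notebook's actual split.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up split of the first four timestamps (illustration only).
train_stamps = [["2011-01-01 0:00:00"], ["2011-01-01 2:00:00"], ["2011-01-01 3:00:00"]]
test_stamps = [["2011-01-01 1:00:00"]]

encoder = OneHotEncoder(handle_unknown="ignore")

# One indicator column per distinct training timestamp.
train_encoded = encoder.fit_transform(train_stamps).toarray()

# The test timestamp was not seen during fit, so every indicator is zero.
test_encoded = encoder.transform(test_stamps).toarray()

print(train_encoded)
print(test_encoded)
```

A tree can use those per-row indicators to memorize the training targets, but they carry no signal for unseen hours.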

#### Tune the model to optimize performance on the test set

In [115]:
#determine the depth of the default tree
dt.get_depth()

207

  - The default tree had a depth of 207.

In [116]:
#fit one tree per max_depth from 2 to 206 and record the train and test R^2 scores
depths = list(range(2, 207))
scores = pd.DataFrame(index=depths, columns = ['Test Score', 'Train Score'])
for depth in depths:
  dt = DecisionTreeRegressor(max_depth=depth, random_state=42)
  dt_pipe = make_pipeline(preprocessor1, dt)
  dt_pipe.fit(X_train, y_train)
  train_score = dt_pipe.score(X_train, y_train)
  test_score = dt_pipe.score(X_test, y_test)
  scores.loc[depth, 'Train Score'] = train_score
  scores.loc[depth, 'Test Score'] = test_score
scores.head()

Unnamed: 0,Test Score,Train Score
2,0.206995,0.193698
3,0.26074,0.249816
4,0.281225,0.282055
5,0.297731,0.312186
6,0.308381,0.34491


In [117]:
#Check best score for the model by sorting dataframe to find the depth for the best score
#looking for the index of the best test score
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
9,0.319598,0.428692
10,0.315926,0.458161
8,0.31491,0.401343
11,0.313585,0.484523
7,0.313269,0.373902


  - The optimal max_depth is 9, which gives the highest test R^2 (0.3196).
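
Sorting the whole table works, but the best depth can also be read off directly as the key with the highest test score. The sketch below uses the five best test scores printed above, copied into a plain dictionary named `test_scores` for illustration.

```python
# Test R^2 by max_depth, copied from the sorted output above.
test_scores = {7: 0.313269, 8: 0.31491, 9: 0.319598, 10: 0.315926, 11: 0.313585}

# Pick the depth whose test score is largest.
best_depth = max(test_scores, key=test_scores.get)
best_score = test_scores[best_depth]

print(best_depth, best_score)
```

On the scores dataframe itself, `scores['Test Score'].astype(float).idxmax()` should give the same answer.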

In [118]:
# run the model with the optimized value for max_depth
dt_9 = DecisionTreeRegressor(max_depth = 9, random_state = 42)
dt_9_pipe = make_pipeline(preprocessor1, dt_9)
dt_9_pipe.fit(X_train, y_train)
train_preds = dt_9_pipe.predict(X_train)
test_preds = dt_9_pipe.predict(X_test)
train_9_score = dt_9_pipe.score(X_train, y_train)
test_9_score = dt_9_pipe.score(X_test, y_test)
print(train_9_score)
print(test_9_score)

0.42869218456536107
0.3195978380009208


  - The final model has an R^2 of 0.4287 on the training set and 0.3196 on the test set.
  - The training and test scores have moved closer to each other (a sign that overfitting was reduced). Most importantly, the test score rose from 0.0351 with the default tree to 0.3196.
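
The depth above was chosen by scoring each candidate on the test set, so the same data both selects and evaluates the model. A common alternative, which this notebook does not use, is to choose max_depth by cross-validation on the training data and score the test set once at the end. The sketch below shows that pattern on synthetic data. The data, the depth grid, and every variable name are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic "hour of day" feature with a noisy daily cycle (illustration only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 24, size=(400, 1))
y_demo = 100 + 80 * np.sin(X_demo[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 20, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# 5-fold cross-validation over max_depth, using the training split only.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": list(range(2, 11))},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
# The test split is scored once, after the depth has been chosen.
held_out_r2 = search.score(X_te, y_te)
print(best_depth, held_out_r2)
```

The same search could wrap the full pipeline, in which case the parameter would be addressed as `decisiontreeregressor__max_depth`.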

### Decision Tree Regression with feature engineering

#### Split the data

  - Assign the count column as the target and the rest of the columns as the features matrix

In [119]:
#assign X and y
yy = fe_df['count']
XX = fe_df.drop(columns = ['count'])

#### Train test split (model validation)

In [120]:
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, yy, random_state=42)

#### Create a preprocessing object

  - Numeric data will be scaled.
  - Categorical (object) columns will be one-hot encoded.

In [121]:
#create columnselectors for the numeric and categorical data
cat_selector2 = make_column_selector(dtype_include='object')
num_selector2 = make_column_selector(dtype_include='number')
#create tuples for the columntransformer
#instantiate the OneHotEncoder and the StandardScaler within the tuples
#scikit-learn 1.2+ names this argument sparse_output; older versions call it sparse
cat_tuple2 = (OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'), cat_selector2)
num_tuple2 = (StandardScaler(), num_selector2)
#create the preprocessing columntransformer
preprocessor2 = make_column_transformer(cat_tuple2, num_tuple2, remainder='drop')
preprocessor2

#### Create a model pipeline

In [122]:
#instantiate the decision tree regressor
dt_fe = DecisionTreeRegressor(random_state = 42)
#combine the preprocessing columntransformer and the decision tree regressor in a pipeline
dt_fe_pipe = make_pipeline(preprocessor2, dt_fe)
dt_fe_pipe

#### Fit the model pipeline on the training data and make predictions

In [123]:
#fit the model pipeline on the training data
dt_fe_pipe.fit(XX_train, yy_train)
#make predictions using the training and testing data
training_predictions2 = dt_fe_pipe.predict(XX_train)
test_predictions2 = dt_fe_pipe.predict(XX_test)

#### Evaluate the default model

In [124]:
train_score2 = dt_fe_pipe.score(XX_train, yy_train)
test_score2 = dt_fe_pipe.score(XX_test, yy_test)
print(train_score2)
print(test_score2)

0.9999543627384183
0.7105917132657692


  - The default decision tree again scored much higher on the training data (R^2 of 0.99995) than on the test data (0.7106), so this model is also overfitting.

#### Tune the model to optimize performance on the test set

In [125]:
#determine the depth of the default tree
dt_fe.get_depth()

49

  - The default tree had a depth of 49.

In [126]:
#fit one tree per max_depth from 2 to 48 and record the train and test R^2 scores
depths2 = list(range(2, 49))
scores2 = pd.DataFrame(index=depths2, columns = ['Test Score', 'Train Score'])
for depth2 in depths2:
  dt_fe = DecisionTreeRegressor(max_depth=depth2, random_state=42)
  dt_fe_pipe = make_pipeline(preprocessor2, dt_fe)
  dt_fe_pipe.fit(XX_train, yy_train)
  train_score2 = dt_fe_pipe.score(XX_train, yy_train)
  test_score2 = dt_fe_pipe.score(XX_test, yy_test)
  scores2.loc[depth2, 'Train Score'] = train_score2
  scores2.loc[depth2, 'Test Score'] = test_score2
scores2.head()

Unnamed: 0,Test Score,Train Score
2,0.183919,0.186007
3,0.262318,0.28198
4,0.336173,0.36568
5,0.410911,0.452166
6,0.480203,0.527552


In [127]:
#Check best score for the model by sorting dataframe to find the depth for the best score
#looking for the index of the best test score
sorted_scores2 = scores2.sort_values(by='Test Score', ascending=False)
sorted_scores2.head()

Unnamed: 0,Test Score,Train Score
24,0.723036,0.960913
30,0.721153,0.992226
26,0.719685,0.975765
27,0.718899,0.980258
21,0.716158,0.933812


  - The optimal max_depth is 24, which gives the highest test R^2 (0.7230).

In [128]:
# run the model with the optimized value for max_depth
dt_fe_24 = DecisionTreeRegressor(max_depth = 24, random_state = 42)
dt_fe_24_pipe = make_pipeline(preprocessor2, dt_fe_24)
dt_fe_24_pipe.fit(XX_train, yy_train)
train_preds2 = dt_fe_24_pipe.predict(XX_train)
test_preds2 = dt_fe_24_pipe.predict(XX_test)
train_fe24_score = dt_fe_24_pipe.score(XX_train, yy_train)
test_fe24_score = dt_fe_24_pipe.score(XX_test, yy_test)
print(train_fe24_score)
print(test_fe24_score)

0.9609132261197659
0.7230362576274019


  - The final model has an R^2 of 0.9609 on the training set and 0.7230 on the test set.
  - The training and test scores have moved slightly closer to each other (a sign that overfitting was reduced), although a sizable gap remains. Most importantly, the test score rose from 0.7106 with the default tree to 0.7230.

### Evaluate the models with the original data and with feature engineering

#### Evaluate the performance of the models based on R^2

In [129]:
print(f'R^2 for Decision Tree Regression with the Original Data: {test_9_score}')
print(f'R^2 for Decision Tree Regression with Feature Engineering: {test_fe24_score}')

R^2 for Decision Tree Regression with the Original Data: 0.3195978380009208
R^2 for Decision Tree Regression with Feature Engineering: 0.7230362576274019


  - The Decision Tree Model trained on the original data explained 31.96% of the variation in the target on the test set.
  - The Decision Tree Model trained on the feature-engineered data explained 72.30% of the variation in the target on the test set.
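
For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below computes it by hand on a small made-up example. `r2_score_manual` is a hypothetical helper, not the scikit-learn function.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from plain lists."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions (illustration only).
y_true = [3, 5, 7, 9]
good_r2 = r2_score_manual(y_true, [2.5, 5.0, 7.5, 9.0])

# Always predicting the mean of the target explains none of the variation.
baseline_r2 = r2_score_manual(y_true, [6, 6, 6, 6])

print(good_r2, baseline_r2)
```

An R^2 of 0.7230 therefore means the squared error on the test set is about 27.7% of what a constant prediction at the test mean would produce.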

#### Evaluate the performance of the models based on RMSE

In [130]:
test_MAE1 = mean_absolute_error(y_test, test_preds)
test_MAE2 = mean_absolute_error(yy_test, test_preds2)
test_MSE1 = mean_squared_error(y_test, test_preds)
test_MSE2 = mean_squared_error(yy_test, test_preds2)
test_RMSE1 = np.sqrt(test_MSE1)
test_RMSE2 = np.sqrt(test_MSE2)
print(f'RMSE for Decision Tree Regression with the Original Data: {test_RMSE1}')
print(f'RMSE for Decision Tree Regression with Feature Engineering: {test_RMSE2}')

RMSE for Decision Tree Regression with the Original Data: 149.4213633487508
RMSE for Decision Tree Regression with Feature Engineering: 95.33257233104298


  - The Decision Tree Model with feature engineering had a lower test RMSE (95.33) than the Decision Tree Model with the original data (149.42), a reduction of about 54 rentals per hour.
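
RMSE is the square root of the mean squared error, so it is expressed in rentals per hour. Both models are scored on the same test rows (same random_state and the same number of rows), so the two RMSE values and the two R^2 values should be tied together: their ratio should equal sqrt((1 - R^2 original) / (1 - R^2 engineered)). The sketch below defines a hypothetical `rmse` helper and checks that relationship with the numbers printed above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error from plain lists."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: squared errors are 0.25, 0, 0.25, 0.
example_rmse = rmse([3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0])

# Test-set metrics copied from the outputs above.
r2_original, r2_engineered = 0.3195978380009208, 0.7230362576274019
rmse_original, rmse_engineered = 149.4213633487508, 95.33257233104298

ratio_from_rmse = rmse_original / rmse_engineered
ratio_from_r2 = math.sqrt((1 - r2_original) / (1 - r2_engineered))

print(example_rmse, ratio_from_rmse, ratio_from_r2)
```

The two ratios agree at about 1.567, so the R^2 and RMSE comparisons tell the same story.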

Did the feature engineering choices improve the ability to predict the 'count'?
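- Yes. With feature engineering, the test R^2 rose from 0.3196 to 0.7230 and the test RMSE fell from 149.42 to 95.33.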