## Predicting bike-share program usage from weather conditions
by Elaine Chu, Dhruv Garg, Shawn Xiao Hu, Lukman Lateef, Eugene You

In [18]:
import pandas as pd
import numpy as np

## Used for preprocesing and modelling
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV

## Used for plotting
import altair as alt
# import altair_ally as aly
alt.data_transformers.enable('vegafusion')

df = pd.read_csv('https://archive.ics.uci.edu/static/public/560/seoul+bike+sharing+demand.zip', encoding = 'latin-1')
print(df.head())

         Date  Rented Bike Count  Hour  Temperature(°C)  Humidity(%)  \
0  01/12/2017                254     0             -5.2           37   
1  01/12/2017                204     1             -5.5           38   
2  01/12/2017                173     2             -6.0           39   
3  01/12/2017                107     3             -6.2           40   
4  01/12/2017                 78     4             -6.0           36   

   Wind speed (m/s)  Visibility (10m)  Dew point temperature(°C)  \
0               2.2              2000                      -17.6   
1               0.8              2000                      -17.6   
2               1.0              2000                      -17.7   
3               0.9              2000                      -17.6   
4               2.3              2000                      -18.6   

   Solar Radiation (MJ/m2)  Rainfall(mm)  Snowfall (cm) Seasons     Holiday  \
0                      0.0           0.0            0.0  Winter  No Holiday   


# Summary

In this analysis, we developed a XXX model using the XXX algorithm to predict the usage of bike share at each hour of the day based on weather conditions. The final XXX performed XXX 

# Introduction

Over the past 2 decades, a growing number of countries worldwide have introduced bike-sharing programs as an integral part of their urban transportation systems (Shaheen et al. 2013). These initiatives are often designed to address the “last mile” problem – a common challenge in public transit to get passengers from a transportation hub, like train stations and bus stops, to their final destination. By providing a sustainable, accessible, and cost-effective mode of transportation for short trips, bike-share programs have become a popular solution to close this gap (Shaheen et al. 2013).  

The demand and usage of bike-share programs are known to be heavily influenced by the weather conditions (Eren and Uz 2020). Factors such as temperature, precipitation, humidity, and wind speed all have an affect the number of bikes being used at any given time. Understanding these relationships is crucial for the effective management of bike-share systems.

In this study, we explore whether a machine learning algorithm can predict the usage of bike-share program. It is important to accurately predict usage of the bikes as it gives organizers the ability to plan ahead and make sure there is a stable supply of bikes to match the fluctuating demands. This ensures an efficient allocation of resources and ultimately improve the overall performance of the bike-share programs.

# Methods

## Data

The data set used in this project is the Seoul bike sharing demand data set sourced from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand). Each row in the data set represents the number of bikes being rented at a specific hour of a day, along with corresponding weather conditions (e.g. temperature, humidity, and rainfall), whether the day was a holiday, and the season in which the rentals occurred.

## Analysis

The decision regressor algorithm was used to develop a regression model to predict the number of bikes being rented out for a specific hour of the day. All variables, except for `dew point temp`, from the original data set was used to fit the model. The data was split into training and test set at a 70:30 ratio. The hyperparameters `tree depth`, `minimum samples per split`, and `minimum sample per leaf` were optimized based on the training accuracy score through a 5-fold cross-validation. The `seasons` and `hour` feature were processed by one-hot encoding and all the other features were standardized just before model fitting. The Python programming language (Van Rossum and Drake 2009) and the following Python packages were used to perform the analysis: numpy (Harris et al. 2020), Pandas (McKinney 2010), altair (VanderPlas, 2018), vegafusion(Kruchten et al. 2022), scikit-learn (Pedregrosa et al. 2011). The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/DSCI522-2425-28-five_guys 

# Results and Discussion
XXX

# EDA Analysis

# 1. Check missing data

# Check for missing values
missing_values = df.isnull().sum()

# Summary statistics
summary_stats = df.describe()

# Prepare missing values and summary statistics as separate tables
missing_values_table = pd.DataFrame({"Missing Values": missing_values})
summary_stats_table = summary_stats.loc['mean':'max']

# Display both tables
print("Missing Values:")
print(missing_values_table)

print("\nSummary Statistics:")
print(summary_stats_table)

## Visualization 1: Distribution of Rented Bike Count

In [2]:
rented_bike_hist = alt.Chart(df).mark_bar().encode(
    alt.X('Rented Bike Count:Q', bin=True, title='Rented Bike Count'),
    alt.Y('count()', title='Frequency'),
    tooltip=['count()']
).properties(
    title='Distribution of Rented Bike Count',
    width=700,
    height=400
)
rented_bike_hist

## Visualization 2: Average Rented Bike Count by Hour

In [3]:
hourly_avg_chart = alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('Hour:O', title='Hour of Day'),
    y=alt.Y('mean(Rented Bike Count):Q', title='Average Rented Bike Count'),
    tooltip=['Hour', 'mean(Rented Bike Count)']
).properties(
    title='Average Rented Bike Count by Hour',
    width=700,
    height=400
)
hourly_avg_chart

## Visualization 3: Average Rented Bike Count by Season & Temperature

In [20]:

season_avg_chart = alt.Chart(df).mark_bar().encode(
    x=alt.X('Seasons:O', title='Season'),
    y=alt.Y('mean(Rented Bike Count):Q', title='Average Rented Bike Count'),
    color='Seasons',
    tooltip=['Seasons', 'mean(Rented Bike Count)']
).properties(
    title='Average Rented Bike Count by Season',
    width=700,
    height=400
)
season_avg_chart

In [21]:
alt.Chart(df).mark_circle().encode(
    alt.X('Temperature(°C):Q', title='Temperature (°C)'),
    alt.Y('Rented Bike Count:Q', title='Rented Bike Count'),
    color='Seasons:N',
    tooltip=['Temperature(°C)', 'Rented Bike Count', 'Seasons']
).properties(
    title='Bike Rentals vs. Temperature',
    width=600,
    height=400
)

## Visualization 4: Rentals on Holidays vs. Non-Holidays

In [22]:
alt.Chart(df).mark_boxplot().encode(
    alt.X('Holiday:N', title='Holiday'),
    alt.Y('Rented Bike Count:Q', title='Rented Bike Count'),
    color='Holiday:N',
    tooltip=['Holiday', 'Rented Bike Count']
).properties(
    title='Bike Rentals on Holidays vs. Non-Holidays',
    width=600,
    height=400
)

# 2. Data Preprocessing 

In [5]:
## Renaming the colums for the simplicity
df=df.rename(columns={
    'Temperature(°C)':'Temperature',
    'Humidity(%)':'Humidity',
    'Rainfall(mm)':'Rainfall',
    'Snowfall (cm)':'Snowfall',
    'Wind speed (m/s)':'Wind speed',
    'Visibility (10m)':'Visibility',
    'Solar Radiation (MJ/m2)':'Radiation',
    'Dew point temperature(°C)':'Dew point temperature'})

In [6]:
#Convert the Date column in Datetime Dtype
df['Date']=pd.to_datetime(df['Date'], format = 'mixed')

# Extract features from the Date column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.weekday
df = df.drop(['Date'], axis=1)  # Exclude unwanted columns

# Convert to categorical
#df['Hour'] = df['Hour'].astype(str) 
df['Seasons'] = df['Seasons'].astype(str)

## Converting to binary for EDA and for values to feed into model
df['Holiday'] = df['Holiday'].apply(lambda x: 1 if x == "Holiday" else 0)
df['Functioning Day'] = df['Functioning Day'].apply(lambda x: 1 if x == "Yes" else 0)



1. Hour, Seasons should be OHE (Should be converted to Object/categorical first)
2. Holiday  and Functioning day should be binary encoded

In [7]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
train_df

Unnamed: 0,Rented Bike Count,Hour,Temperature,Humidity,Wind speed,Visibility,Dew point temperature,Radiation,Rainfall,Snowfall,Seasons,Holiday,Functioning Day,Year,Month,Day,Weekday
3850,0,10,17.3,52,2.3,1235,7.3,2.38,0.0,0.0,Spring,0,0,2018,10,5,4
4491,562,3,19.6,68,1.7,1260,13.5,0.00,0.0,0.0,Summer,1,1,2018,6,6,2
3305,1632,17,18.3,29,4.3,1626,0.0,1.61,0.0,0.0,Spring,0,1,2018,4,17,1
2511,329,15,11.6,97,3.4,117,11.1,0.26,0.0,0.0,Spring,0,1,2018,3,15,3
2487,1025,15,21.7,39,3.3,1979,7.1,2.09,0.0,0.0,Spring,0,1,2018,3,14,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7382,0,14,24.3,39,1.7,2000,9.4,2.18,0.0,0.0,Autumn,0,0,2018,4,10,1
7763,1175,11,16.2,39,1.6,1580,2.2,2.05,0.0,0.0,Autumn,0,1,2018,10,20,5
5218,998,10,23.7,59,1.7,2000,15.2,0.89,0.0,0.0,Summer,0,1,2018,6,7,3
1346,54,2,-15.6,33,2.2,2000,-28.2,0.00,0.0,0.5,Winter,0,1,2018,1,26,4


In [11]:
##Temp - rental bike graph 
alt.Chart(train_df[train_df['Rented Bike Count']!=0]).mark_bar().encode(
    x = alt.X('Rented Bike Count', bin=alt.Bin(maxbins=30)),
    y = alt.Y('count()', title='Distribution of rental bikes'),
    tooltip = ['Rented Bike Count', 'count()']
)


In [12]:
##Hourly bike graph for seasons
alt.Chart(train_df).mark_line().encode(
    x = 'Hour',
    y = 'mean(Rented Bike Count)',
    color = 'Seasons',
    tooltip = ['Hour', 'mean(Rented Bike Count)']
)

In [14]:
## Correlation graph
aly.corr(train_df)

NameError: name 'aly' is not defined

Here we can see that Temp and Dew point temp have a high correlation. For now we are dropping dew point temperature as most of the dew point temperature values are associated with temperature. Other values are not so highly correlated so not dropping them

# 3.Modeling

In [16]:
## Model separation
X_train, y_train = train_df.drop("Rented Bike Count", axis = 1), train_df["Rented Bike Count"]
X_test, y_test = test_df.drop("Rented Bike Count", axis = 1), test_df["Rented Bike Count"]

In [17]:
# Define column transformer for preprocessing
column_transformer = make_column_transformer(
    (OneHotEncoder(), ['Hour', 'Seasons']),  # One-hot encode Hour and Seasons
    ("drop", ['Dew point temperature']),
    remainder='passthrough'  # Leave other columns as they are
)

# Ridge Regression Pipeline
ridge_pipeline = make_pipeline(
    column_transformer,
    StandardScaler(),
    Ridge()
)

# Decision Tree Pipeline
tree_pipeline = make_pipeline(
    column_transformer,
    StandardScaler(),
    DecisionTreeRegressor(random_state=42)
)


# Define parameter grids for RandomizedSearchCV
ridge_param_grid = {
    'ridge__alpha': np.logspace(-3, 3, 10)
}

tree_param_grid = {
    'decisiontreeregressor__max_depth': [None, 10, 20, 30, 40],
    'decisiontreeregressor__min_samples_split': [2, 5, 10],
    'decisiontreeregressor__min_samples_leaf': [1, 2, 4]
}

# Apply RandomizedSearchCV
ridge_search = RandomizedSearchCV(ridge_pipeline, ridge_param_grid, cv=5, n_iter=10, random_state=42)
tree_search = RandomizedSearchCV(tree_pipeline, tree_param_grid, cv=5, n_iter=10, random_state=42)

# Fit models
ridge_search.fit(X_train, y_train)
tree_search.fit(X_train, y_train)

## Best params
ridge_best_params = ridge_search.best_params_
tree_best_params = tree_search.best_params_

## Predicting accuracy on test set
ridge_score = ridge_search.score(X_test, y_test)
tree_score = tree_search.score(X_test, y_test)

ridge_best_params, ridge_score, tree_best_params, tree_score


({'ridge__alpha': np.float64(10.0)},
 0.6560417726146159,
 {'decisiontreeregressor__min_samples_split': 10,
  'decisiontreeregressor__min_samples_leaf': 4,
  'decisiontreeregressor__max_depth': 20},
 0.7936150627629828)

# References

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Eren, E., Uz, V.E., 2020. A review on bike-sharing: The factors affecting bike-sharing demand. Sustainable Cities and Society 54, 101882.

Harris, C.R. et al., 2020. Array programming with NumPy. Nature, 585, pp.357–362.

Kruchten, N., Mease, J., and Moritz, D. (2022) VegaFusion: Automatic Server-Side Scaling for Interactive 
Vega Visualizations. 2022 IEEE Visualization and Visual Analytics (VIS), Oklahoma City, OK, USA. pp. 11-15, doi: 10.1109/VIS54862.2022.00011.

McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Seoul Bike Sharing Demand [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5F62R .

Shaheen, S. A., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia: Past, Present, and Future. Transportation Research Record, 2143(1), 159-167. https://doi.org/10.3141/2143-20 

VanderPlas, J. et al., 2018. Altair: Interactive statistical visualizations for python. Journal of open source software, 3(32), p.1057.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.
