<a name="top_notebook"></a>
[__Home__](../README.md) | [__Data Cleaning >>__](./02_Divvy_data_cleaning.ipynb)


# Divvy: Bike Sharing Forecast
## Initial Data Exploration

__Dataset:__ [Divvy Data](https://divvybikes.com/system-data) \
__Author:__ [Dmitry Luchkin](https://www.linkedin.com/in/dmitry-luchkin/) \
__Date:__ March 2025


### Introduction

![alt text](https://images.ctfassets.net/p6ae3zqfb1e3/1owS8BnG7K4KqL42WIik7H/c5e5e09496ccd5d308f2402440919cb9/Divvy_Plans_pricing_Hero_2x.jpg?w=2500&q=60&fm=webp)
Image Source: Official Divvy Website

The Divvy bicycle sharing service in Chicago offers an environmentally friendly and convenient transportation option, supported by a public-private partnership with Lyft. With a broad network of docking stations and a fleet of standard and electric bicycles, Divvy serves a diverse range of users. This project focuses on analyzing ride data to forecast the number of rides and to understand the impact of external factors such as weather conditions and public holidays. By examining these variables, the project aims to identify patterns in usage, determine peak times, and evaluate the effects of different conditions on ride frequency. The insights gained will assist in optimizing operations, planning for demand fluctuations, and enhancing overall service delivery, contributing to the promotion of sustainable urban mobility in Chicago.

#### Scope of the Project

The scope of this project encompasses the following key activities and analyses related to the Divvy bicycle sharing service in Chicago:

- __Data Collection and Preparation__

    - Gather historical ride data from Divvy, including details such as trip duration, start and end locations, and time of rides.
    - Collect additional data on external factors such as weather conditions (temperature, precipitation, etc.) and public holidays.

- __Exploratory Data Analysis (EDA)__

    - Analyze the collected data to identify patterns and trends in ride usage.
    - Determine peak usage times, popular routes, and any seasonal variations in the data.

- __Impact Analysis__

    - Assess the impact of weather conditions on the number of rides, identifying which weather factors most significantly influence usage.
    - Evaluate the effect of public holidays on ride frequency, including variations in usage patterns during these periods.

- __Forecasting Ride Numbers__

    - Develop and implement predictive models to forecast the number of rides, using historical data and identified influencing factors.
    - Test and validate the accuracy of these models using appropriate statistical and machine learning techniques.


- __Operational Insights and Optimization__

    - Provide actionable insights for optimizing the distribution of bicycles and managing docking stations based on predicted demand.
    - Recommend strategies for handling fluctuations in demand, particularly during adverse weather conditions or public holidays.


- __Reporting and Visualization__

    - Create detailed reports and visualizations to present findings, including interactive dashboards or visual aids for stakeholders.
    - Document the methodology, results, and conclusions drawn from the analyses.

#### Out of Scope
- Analysis of user demographic data, as this information is not included in the available dataset.
- Development of new infrastructure or direct operational changes, as the project is limited to data analysis and recommendations.
- Real-time monitoring and adjustment of services, as the focus is on historical data and predictive modeling rather than live system management.

This scope outlines the boundaries and focus areas of the project, ensuring that all activities are aligned with the objectives and expected outcomes. It also clarifies what is not included, providing a clear understanding of the project's limitations.

#### Goals of the Project

- Analyze ride data to forecast the number of rides.
- Examine the impact of weather conditions on ride frequency.
- Assess the effect of public holidays on the number of rides.
- Identify patterns in usage, including peak times and popular routes.
- Provide insights for optimizing the operational aspects of the service.
- Support planning for demand fluctuations based on predictive analysis.
- Enhance service delivery and user experience through data-driven decisions.

#### Dataset  <a  class="anchor" name='dataset'></a>

The dataset consists of 58 CSV files with ride records from the bike-sharing service, covering **January 2020** to **December 2024**.

Each trip is anonymized and includes:

- Trip start day and time
- Trip end day and time
- Trip start station
- Trip end station
- Rider type (Member, Single Ride, and Day Pass)

The data has been processed to remove trips taken by staff as they service and inspect the system, as well as any trips shorter than 60 seconds (potentially false starts or users re-docking a bike to make sure it was secure).
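
The 60-second rule described above can be checked once the data is loaded. The following is a minimal sketch on toy rows, not the real dataset; the names `trips`, `duration_s`, and `short_trips` are illustrative assumptions.

```python
import pandas as pd

# Toy rows standing in for the real trip data (illustrative values only).
trips = pd.DataFrame({
    'started_at': ['2020-11-01 13:36:00', '2020-11-01 10:03:26', '2020-11-01 00:34:05'],
    'ended_at':   ['2020-11-01 13:45:40', '2020-11-01 10:03:56', '2020-11-01 01:03:06'],
})

# Parse the timestamps and compute each trip's duration in seconds.
started = pd.to_datetime(trips['started_at'])
ended = pd.to_datetime(trips['ended_at'])
duration_s = (ended - started).dt.total_seconds()

# Trips shorter than 60 seconds should already be absent from the published data,
# so any rows found here would deserve a closer look during cleaning.
short_trips = trips[duration_s < 60]
print(f'Trips under 60 seconds: {len(short_trips)}')
```

On the real data, the same check would run on the `started_at` and `ended_at` columns of the combined DataFrame.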


##### Description of Attributes

| Attribute            | Description                                                      |
|----------------------|------------------------------------------------------------------|
| `ride_id`            | Ride ID                                                          |
| `rideable_type`      | Type of a bike                                                   |
| `started_at`         | Trip start day and time                                          |
| `ended_at`           | Trip end day and time                                            |
| `start_station_name` | Trip start station name                                          |
| `start_station_id`   | Trip start station ID                                            |
| `end_station_name`   | Trip end station name                                            |
| `end_station_id`     | Trip end station ID                                              |
| `start_lat`          | Latitude of a start point                                        |
| `start_lng`          | Longitude of a start point                                       |
| `end_lat`            | Latitude of an end point                                         |
| `end_lng`            | Longitude of an end point                                        |
| `member_casual`      | Rider type                                                       |


##### License
This data is provided according to the [Divvy Data License Agreement](https://www.divvybikes.com/data-license-agreement) and released on a monthly schedule.

---

> **Attention**
>
> The data utilized in this project were sourced from the official public data provided by Divvy. All analysis, results, and insights presented in this notebook were conducted and formulated by Dmitry Luchkin. I declare that I am not affiliated with Divvy, Lyft, the City of Chicago, or any associated organizations. The views and interpretations expressed herein are solely my own and do not represent the positions or policies of these entities.

### Objectives <a name="objectives"></a>

- Gather data from various sources such as CSV files, databases, APIs, or data warehouses.
- Load data into the analysis environment.
- Review the dataset’s structure, including columns and data types.
- Identify unique, missing, and duplicated values.
- For categorical data, determine the frequency of each category.

### Notebooks <a class="anchor" name='notebooks'></a>

+ __[Initial Data Exploration](./01_Divvy_data_exploration.ipynb)__
+ [Data Cleaning](./02_Divvy_data_cleaning.ipynb)
+ [Exploratory Data Analysis](./03_Divvy_exploratory_data_analysis.ipynb)
+ [Feature Engineering](./04_Divvy_feature_engineering.ipynb)
+ [Modeling & Validation](./05_Divvy_forecasting.ipynb)

### Import Libraries <a name='import-libraries'></a>

In [1]:
import datetime
import sys
import re
import os

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sys.path.append('../02_scripts/')

%matplotlib inline

In [2]:
from IPython.display import Markdown
from IPython.core.magic import register_cell_magic


@register_cell_magic
def markdown(line, cell):
    return Markdown(cell.format(**globals()))

### Notebook Setup <a name='notebook-setup'></a>

In [3]:
# Pandas settings
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 60
pd.options.display.float_format = '{:,.4f}'.format

# Visualization settings
from matplotlib import rcParams
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = (16, 5)   
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['font.size'] = 12
rcParams['savefig.dpi'] = 300
plt.rc('xtick', labelsize=11)
plt.rc('ytick', labelsize=11)
%config InlineBackend.figure_format = 'retina'

### Loading Data <a name='loading-data'></a>

Have a quick look at the data.

In [4]:
%ls ../00_data/00_raw/*.csv

../00_data/00_raw/202004-divvy-tripdata.csv
../00_data/00_raw/202005-divvy-tripdata.csv
../00_data/00_raw/202006-divvy-tripdata.csv
../00_data/00_raw/202007-divvy-tripdata.csv
../00_data/00_raw/202008-divvy-tripdata.csv
../00_data/00_raw/202009-divvy-tripdata.csv
../00_data/00_raw/202010-divvy-tripdata.csv
../00_data/00_raw/202011-divvy-tripdata.csv
../00_data/00_raw/202012-divvy-tripdata.csv
../00_data/00_raw/202101-divvy-tripdata.csv
../00_data/00_raw/202102-divvy-tripdata.csv
../00_data/00_raw/202103-divvy-tripdata.csv
../00_data/00_raw/202104-divvy-tripdata.csv
../00_data/00_raw/202105-divvy-tripdata.csv
../00_data/00_raw/202106-divvy-tripdata.csv
../00_data/00_raw/202107-divvy-tripdata.csv
../00_data/00_raw/202108-divvy-tripdata.csv
../00_data/00_raw/202109-divvy-tripdata.csv
../00_data/00_raw/202110-divvy-tripdata.csv
../00_data/00_raw/202111-divvy-tripdata.csv
../00_data/00_raw/202112-divvy-tripdata.csv
../00_data/00_raw/202201-divvy-tripdata.csv
../00_data/00_raw/202202-divvy-t

In [5]:
[file for file in os.listdir('../00_data/00_raw/') if file.endswith('.csv')]

['202011-divvy-tripdata.csv',
 '202208-divvy-tripdata.csv',
 '202205-divvy-tripdata.csv',
 '202310-divvy-tripdata.csv',
 '202109-divvy-tripdata.csv',
 '202104-divvy-tripdata.csv',
 '202404-divvy-tripdata.csv',
 '202409-divvy-tripdata.csv',
 '202303-divvy-tripdata.csv',
 '202407-divvy-tripdata.csv',
 '202107-divvy-tripdata.csv',
 '202206-divvy-tripdata.csv',
 '202012-divvy-tripdata.csv',
 'Divvy_Trips_2020_Q1.csv',
 '202004-divvy-tripdata.csv',
 '202009-divvy-tripdata.csv',
 '202210-divvy-tripdata.csv',
 '202305-divvy-tripdata.csv',
 '202308-divvy-tripdata.csv',
 '202111-divvy-tripdata.csv',
 '202401-divvy-tripdata.csv',
 '202101-divvy-tripdata.csv',
 '202411-divvy-tripdata.csv',
 '202412-divvy-tripdata.csv',
 '202102-divvy-tripdata.csv',
 '202402-divvy-tripdata.csv',
 '202112-divvy-tripdata.csv',
 '202306-divvy-tripdata.csv',
 '202203-divvy-tripdata.csv',
 '202007-divvy-tripdata.csv',
 '202403-divvy-tripdata.csv',
 '202307-divvy-tripdata.csv',
 '202103-divvy-tripdata.csv',
 '202212-div

In [6]:
files = ['../00_data/00_raw/' + file for file in os.listdir('../00_data/00_raw/') if file.endswith('.csv')]

In [7]:
len(files)

58
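
The count of 58 is consistent with one quarterly file (`Divvy_Trips_2020_Q1.csv`) plus one monthly file for each month from April 2020 through December 2024. The sketch below works through that arithmetic; it assumes every month in the range has exactly one file named like the ones listed above, which the truncated listings do not fully confirm. The helper `expected_monthly_files` is hypothetical.

```python
def expected_monthly_files(start=(2020, 4), end=(2024, 12)):
    """Return the monthly file names expected between start and end (inclusive)."""
    (year, month), names = start, []
    while (year, month) <= end:
        names.append(f'{year}{month:02d}-divvy-tripdata.csv')
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# 57 monthly files plus the single quarterly file for Q1 2020.
expected = expected_monthly_files() + ['Divvy_Trips_2020_Q1.csv']
print(len(expected))
```

Comparing `set(expected)` with the names returned by `os.listdir` would show any missing or unexpected files.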

In [8]:
def load(x):
    print(f'{x} loading...')
    return pd.read_csv(x)

data = pd.concat(map(load, files), ignore_index=True)

../00_data/00_raw/202011-divvy-tripdata.csv loading...
../00_data/00_raw/202208-divvy-tripdata.csv loading...
../00_data/00_raw/202205-divvy-tripdata.csv loading...
../00_data/00_raw/202310-divvy-tripdata.csv loading...
../00_data/00_raw/202109-divvy-tripdata.csv loading...
../00_data/00_raw/202104-divvy-tripdata.csv loading...
../00_data/00_raw/202404-divvy-tripdata.csv loading...
../00_data/00_raw/202409-divvy-tripdata.csv loading...
../00_data/00_raw/202303-divvy-tripdata.csv loading...
../00_data/00_raw/202407-divvy-tripdata.csv loading...
../00_data/00_raw/202107-divvy-tripdata.csv loading...
../00_data/00_raw/202206-divvy-tripdata.csv loading...
../00_data/00_raw/202012-divvy-tripdata.csv loading...
../00_data/00_raw/Divvy_Trips_2020_Q1.csv loading...
../00_data/00_raw/202004-divvy-tripdata.csv loading...
../00_data/00_raw/202009-divvy-tripdata.csv loading...
../00_data/00_raw/202210-divvy-tripdata.csv loading...
../00_data/00_raw/202305-divvy-tripdata.csv loading...
../00_data/0

### Data Exploration

In [9]:
print(f'Rows count: {data.shape[0]:_}, Columns count: {data.shape[1]}')

Rows count: 26_384_908, Columns count: 13


In [10]:
data.columns

Index(['ride_id', 'rideable_type', 'started_at', 'ended_at',
       'start_station_name', 'start_station_id', 'end_station_name',
       'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual'],
      dtype='object')

In [11]:
data.head()

|   | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---------|---------------|------------|----------|--------------------|------------------|------------------|----------------|-----------|-----------|---------|---------|---------------|
| 0 | BD0A6FF6FFF9B921 | electric_bike | 2020-11-01 13:36:00 | 2020-11-01 13:45:40 | Dearborn St & Erie St | 110.0 | St. Clair St & Erie St | 211.0 | 41.8942 | -87.6291 | 41.8944 | -87.6234 | casual |
| 1 | 96A7A7A4BDE4F82D | electric_bike | 2020-11-01 10:03:26 | 2020-11-01 10:14:45 | Franklin St & Illinois St | 672.0 | Noble St & Milwaukee Ave | 29.0 | 41.891 | -87.6353 | 41.9007 | -87.6625 | casual |
| 2 | C61526D06582BDC5 | electric_bike | 2020-11-01 00:34:05 | 2020-11-01 01:03:06 | Lake Shore Dr & Monroe St | 76.0 | Federal St & Polk St | 41.0 | 41.881 | -87.6168 | 41.8721 | -87.6296 | casual |
| 3 | E533E89C32080B9E | electric_bike | 2020-11-01 00:45:16 | 2020-11-01 00:54:31 | Leavitt St & Chicago Ave | 659.0 | Stave St & Armitage Ave | 185.0 | 41.8955 | -87.682 | 41.9177 | -87.6914 | casual |
| 4 | 1C9F4EF18C168C60 | electric_bike | 2020-11-01 15:43:25 | 2020-11-01 16:16:52 | Buckingham Fountain | 2.0 | Buckingham Fountain | 2.0 | 41.8765 | -87.6204 | 41.8764 | -87.6203 | casual |


In [12]:
data.tail()

|   | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---------|---------------|------------|----------|--------------------|------------------|------------------|----------------|-----------|-----------|---------|---------|---------------|
| 26384903 | EF56D7D1D612AC11 | electric_bike | 2021-05-20 16:32:14 | 2021-05-20 16:35:39 | Blackstone Ave & Hyde Park Blvd | 13398 | NaN | NaN | 41.8026 | -87.5902 | 41.8 | -87.6 | member |
| 26384904 | 745191CB9F21DE3C | classic_bike | 2021-05-29 16:40:37 | 2021-05-29 17:22:37 | Sheridan Rd & Montrose Ave | TA1307000107 | Michigan Ave & Oak St | 13042 | 41.9617 | -87.6546 | 41.901 | -87.6238 | casual |
| 26384905 | 428575BAA5356BFF | electric_bike | 2021-05-31 14:24:54 | 2021-05-31 14:31:38 | Sheridan Rd & Montrose Ave | TA1307000107 | NaN | NaN | 41.9615 | -87.6547 | 41.95 | -87.65 | member |
| 26384906 | FC8A4A7AB7249662 | electric_bike | 2021-05-25 16:01:33 | 2021-05-25 16:07:37 | Sheridan Rd & Montrose Ave | TA1307000107 | NaN | NaN | 41.9617 | -87.6547 | 41.98 | -87.66 | member |
| 26384907 | E873B8AA3EE84678 | docked_bike | 2021-05-12 12:22:14 | 2021-05-12 12:30:27 | Sheridan Rd & Montrose Ave | TA1307000107 | Clark St & Grace St | TA1307000127 | 41.9617 | -87.6546 | 41.9508 | -87.6592 | casual |


#### Check Data Types <a name='check-data-type'></a>

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26384908 entries, 0 to 26384907
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 2.6+ GB


#### Check uniqueness of data <a name='check-uniqueness-data'></a>

In [14]:
num_unique = data.nunique().sort_values()
num_unique.map('{:_}'.format)

member_casual                  2
rideable_type                  4
start_station_name         2_302
end_station_name           2_308
start_station_id           2_501
end_station_id             2_503
end_lng                  460_643
end_lat                  506_283
start_lng              1_068_640
start_lat              1_140_579
started_at            22_985_964
ended_at              22_994_930
ride_id               26_384_488
dtype: object

In [15]:
data.shape[0] - num_unique['ride_id']

420

__Observations:__
- The row count is 26_384_908, but there are only 26_384_488 unique `ride_id` values, so 420 rows carry a `ride_id` that already appears elsewhere in the data.

<span style="color: blue;">_# TODO: Analyze uniqueness of ride_id._</span>
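
One possible way to approach this TODO is to pull out every row whose `ride_id` occurs more than once and see which columns differ within each group. This is a minimal sketch on toy data, not the author's cleaning step; the frame `df` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy data: ride 'A1' appears twice with different end times (illustrative values only).
df = pd.DataFrame({
    'ride_id': ['A1', 'B2', 'A1', 'C3'],
    'started_at': ['2020-11-25 10:00:00', '2020-11-25 11:00:00',
                   '2020-11-25 10:00:00', '2020-11-26 09:00:00'],
    'ended_at': ['2020-11-25 10:15:00', '2020-11-25 11:20:00',
                 '2020-11-26 10:15:00', '2020-11-26 09:30:00'],
})

# Number of surplus rows: total rows minus unique ride_id values.
n_extra = df['ride_id'].duplicated().sum()

# keep=False marks every occurrence, so both copies of a repeated ride_id are returned.
dup_rides = df[df['ride_id'].duplicated(keep=False)].sort_values('ride_id')

# For each repeated ride_id, count distinct values per column; a count above 1
# shows which fields disagree between the copies.
differing = dup_rides.groupby('ride_id').nunique()

print(n_extra)
print(differing)
```

Since the notebook reports no entirely duplicated rows, the repeated `ride_id` values in the real data must differ in at least one other column, which is what `differing` is meant to reveal.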

In [16]:
print('--- Percentage Similarity of Values (%) ---')
print(100/num_unique)

--- Percentage Similarity of Values (%) ---
member_casual        50.0000
rideable_type        25.0000
start_station_name    0.0434
end_station_name      0.0433
start_station_id      0.0400
end_station_id        0.0400
end_lng               0.0002
end_lat               0.0002
start_lng             0.0001
start_lat             0.0001
started_at            0.0000
ended_at              0.0000
ride_id               0.0000
dtype: float64


__Observations__:

+ `ride_id`, `rideable_type`, `started_at`, `ended_at`, `start_station_name`, `start_station_id`, `end_station_name`, `end_station_id`, `member_casual` are __string__
+ `start_lat`, `start_lng`, `end_lat`, `end_lng` are __float__

`started_at` and `ended_at` should be a __datetime__ type instead.\
`rideable_type` and `member_casual` should be __categorical__ type instead.

<span style="color: blue;">_# TODO: Convert started_at and ended_at to datetime._</span>\
<span style="color: blue;">_# TODO: Convert rideable_type and member_casual to category._</span>
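
The two conversions could look roughly like the sketch below. It runs on a toy frame whose timestamps all share one format; whether every source file uses the same timestamp format is not verified here, so the real data may need extra handling. The frame `df` is an illustrative assumption.

```python
import pandas as pd

# Toy frame with the four columns flagged above (illustrative values only).
df = pd.DataFrame({
    'rideable_type': ['electric_bike', 'classic_bike', 'docked_bike', 'electric_bike'],
    'started_at': ['2021-05-20 16:32:14', '2021-05-29 16:40:37',
                   '2021-05-12 12:22:14', '2021-05-25 16:01:33'],
    'ended_at': ['2021-05-20 16:35:39', '2021-05-29 17:22:37',
                 '2021-05-12 12:30:27', '2021-05-25 16:07:37'],
    'member_casual': ['member', 'casual', 'casual', 'member'],
})

# Timestamps stored as strings become datetime64 values.
for col in ['started_at', 'ended_at']:
    df[col] = pd.to_datetime(df[col])

# Low-cardinality string columns become categoricals, which also reduces memory use.
for col in ['rideable_type', 'member_casual']:
    df[col] = df[col].astype('category')

print(df.dtypes)
```

With only 2 and 4 distinct values respectively, `member_casual` and `rideable_type` are natural candidates for the `category` dtype.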

#### Check Missing Values <a name='check-missing-values'></a>

In [17]:
missing_values = data.isna().sum().sort_values(ascending=False)
missing_values.map('{:_}'.format)

end_station_id        3_777_250
end_station_name      3_776_648
start_station_id      3_568_951
start_station_name    3_568_196
end_lat                  29_106
end_lng                  29_106
ride_id                       0
rideable_type                 0
started_at                    0
ended_at                      0
start_lat                     0
start_lng                     0
member_casual                 0
dtype: object

In [18]:
missing_percentage = data.isna().mean().sort_values(ascending=False)
print('--- Percentage of missing values (%) ---')
if missing_percentage.sum():
    print(missing_percentage[missing_percentage > 0] * 100)
else:
    print('There are NO missing values')

--- Percentage of missing values (%) ---
end_station_id       14.3159
end_station_name     14.3137
start_station_id     13.5265
start_station_name   13.5236
end_lat               0.1103
end_lng               0.1103
dtype: float64


__Observations__:
+ `end_station_id` has ~14% missing values in the column.
+ `end_station_name` has ~14% missing values in the column.
+ `start_station_id` has ~13.5% missing values in the column.
+ `start_station_name` has ~13.5% missing values in the column.
+ `end_lat` has ~0.1% missing values in the column.
+ `end_lng` has ~0.1% missing values in the column.

<span style="color: blue;">_# TODO: Handle missing values in end_station_id, end_station_name, start_station_id, start_station_name, end_lat, end_lng._</span>
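
Before choosing how to handle the gaps, it may help to see whether missing station fields concentrate in particular groups. The sketch below breaks the missing share down by `rideable_type` on toy data; that the gaps depend on bike type is a hypothesis to test, not a finding of this notebook, and the frame `df` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the station and coordinate columns (illustrative values only).
df = pd.DataFrame({
    'rideable_type': ['electric_bike', 'electric_bike', 'classic_bike', 'docked_bike'],
    'start_station_name': [np.nan, 'Station A', 'Station B', 'Station C'],
    'end_station_name': [np.nan, np.nan, 'Station B', 'Station C'],
    'end_lat': [41.90, 41.80, np.nan, 41.70],
})

station_cols = ['start_station_name', 'end_station_name']

# Share of missing station names per bike type, in percent.
missing_by_type = df[station_cols].isna().groupby(df['rideable_type']).mean() * 100

# Rows without end coordinates, counted separately because they are far rarer.
n_missing_end_coords = df['end_lat'].isna().sum()

print(missing_by_type)
print(n_missing_end_coords)
```

If the real data shows a strong pattern, the cleaning step could treat those rows as a separate case instead of dropping or imputing them uniformly.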

#### Check for duplicated rows <a name='check-duplicated-rows'></a>

In [19]:
print(f'No. of entirely duplicated rows: {data.duplicated().sum()}')

No. of entirely duplicated rows: 0


### Store the Data

In [20]:
data_divvy = data

# store the data
%store data_divvy

Stored 'data_divvy' (DataFrame)


In [21]:
%load_ext watermark

In [22]:
%watermark -d -t -v -p numpy,pandas,matplotlib,seaborn,sklearn,statsmodels

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

numpy      : 1.26.4
pandas     : 2.2.2
matplotlib : 3.9.1
seaborn    : 0.13.2
sklearn    : 1.5.1
statsmodels: 0.14.2




---
\
[__Home__](../README.md) | [__Data Cleaning >>__](./02_Divvy_data_cleaning.ipynb)
\
\
Divvy: Bike Sharing Forecast, _March 2025_