# Goal: Given data about a bike share service in London, let's try to predict the number of bikes used in a given hour.
The data is acquired from 3 sources:

Https://cycling.data.tfl.gov.uk/ 'Contains OS data © Crown copyright and database rights 2016' and Geomni UK Map data © and database rights [2019] 'Powered by TfL Open Data'
freemeteo.com - weather data
https://www.gov.uk/bank-holidays
From 1/1/2015 to 31/12/2016
The data from cycling dataset is grouped by "Start time", this represent the count of new bike shares grouped by hour. The long duration shares are not taken in the count.

# Metadata:
"timestamp" - timestamp field for grouping the data
"cnt" - the count of a new bike shares
"t1" - real temperature in C
"t2" - temperature in C "feels like"
"hum" - humidity in percentage
"wind_speed" - wind speed in km/h
"weather_code" - category of the weather
"is_holiday" - boolean field - 1 holiday / 0 non holiday
"is_weekend" - boolean field - 1 if the day is weekend
"season" - category field meteorological seasons: 0-spring ; 1-summer; 2-fall; 3-winter.

"weathe_code" category description:
1 = Clear ; mostly clear but have some values with haze/fog/patches of fog/ fog in vicinity 2 = scattered clouds / few clouds 3 = Broken clouds 4 = Cloudy 7 = Rain/ light Rain shower/ Light rain 10 = rain with thunderstorm 26 = snowfall 94 = Freezing Fog

# We are going to create visualizations in Tableau and proceed with ML model here so for to create some visualizations I have rendered the dataset like creating categories of weather in the same column and used it for the tableau and then used the old dataset with dummy variables for ML model.

Link for Tableau Dashboard: https://public.tableau.com/app/profile/ayushi.walia/viz/LondonBikeSharingEDA/Dashboard1

# Part1 - Data Cleaning for Tableau Visualizations

In [2]:
# installing libraries
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.0.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
     ---------------------------------------- 0.0/60.9 kB ? eta -:--:--
     ---------------------------------------- 60.9/60.9 kB 3.4 MB/s eta 0:00:00
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl (11.5 MB)
   ---------------------------------------- 0.0/11.5 MB ? eta -:--:--
    --------------------------------------- 0.3/11.5 MB 5.2 MB/s eta 0:00:03
   -- ------------------------------------- 0.7/11.5 MB 6.9 MB/s eta 0:00:02
   ---- ----------------------------------- 1.3/11.5 MB 9.1 MB/s eta 0:00:02
   ------- -------------------------------- 2.2/11.5 MB 10.8 MB/s eta 0:00:01
   

ERROR: Could not find a version that satisfies the requirement zipfile (from versions: none)
ERROR: No matching distribution found for zipfile


Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
     ---------------------------------------- 0.0/82.7 kB ? eta -:--:--
     ------------------- -------------------- 41.0/82.7 kB ? eta -:--:--
     ---------------------------------------- 82.7/82.7 kB 1.2 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
     ---------------------------------------- 0.0/57.6 kB ? eta -:--:--
     ---------------------------------------- 57.6/57.6 kB 3.0 MB/s eta 0:00:00
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1

In [5]:
import pandas as pd
import zipfile

In [7]:
#extracting the file from the downloaded zip file
zipfile_name = 'londonbikesharingdataset.zip'
with zipfile.ZipFile(zipfile_name, 'r') as file:
    file.extractall()

In [79]:
bikes = pd.read_csv("london_merged.csv")

In [80]:
#exploring the data
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17414 entries, 0 to 17413
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   timestamp     17414 non-null  object 
 1   cnt           17414 non-null  int64  
 2   t1            17414 non-null  float64
 3   t2            17414 non-null  float64
 4   hum           17414 non-null  float64
 5   wind_speed    17414 non-null  float64
 6   weather_code  17414 non-null  float64
 7   is_holiday    17414 non-null  float64
 8   is_weekend    17414 non-null  float64
 9   season        17414 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 1.3+ MB


In [12]:
bikes.shape

(17414, 10)

In [13]:
bikes

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0
...,...,...,...,...,...,...,...,...,...,...
17409,2017-01-03 19:00:00,1042,5.0,1.0,81.0,19.0,3.0,0.0,0.0,3.0
17410,2017-01-03 20:00:00,541,5.0,1.0,81.0,21.0,4.0,0.0,0.0,3.0
17411,2017-01-03 21:00:00,337,5.5,1.5,78.5,24.0,4.0,0.0,0.0,3.0
17412,2017-01-03 22:00:00,224,5.5,1.5,76.0,23.0,4.0,0.0,0.0,3.0


In [14]:
# count unique values in the weather_code column
bikes.weather_code.value_counts()

weather_code
1.0     6150
2.0     4034
3.0     3551
7.0     2141
4.0     1464
26.0      60
10.0      14
Name: count, dtype: int64

In [15]:
bikes.season.value_counts()

season
0.0    4394
1.0    4387
3.0    4330
2.0    4303
Name: count, dtype: int64

In [71]:
# specifiying the columns I want to use
new_cols = {
    'timestamp' : 'time',
    'cnt' : 'count',
    't1' : 'temp_in_celsius',
    't2' : 'temp_in_celsius_feels_like',
    'hum' : 'humidity_percent',
    'wind_speed' : 'wind_speed_in_kph',
    'weather_code' : 'weather_category',
    'is_holiday' : 'is_holiday',
    'is_weekend' : 'is_weekend',
    'season' : 'season'
}
#renaming the columns
bikes.rename(new_cols, axis=1, inplace=True)

In [21]:
# changing the humidity values to percentage (i.e. boolean value)
bikes.humidity = bikes.humidity/100

In [76]:
#creating dictionary for season and weather code to better understand the data
season_dict = {
    '0.0':'spring',
    '1.0':'summer',
    '2.0':'autumn',
    '3.0':'winter'
}

#weather code
weather_dict = {
    '1.0':'Clear',
    '2.0':'Scattered clouds',
    '3.0':'Broken clouds',
    '4.0':'Cloudy',
    '7.0':'Rain',
    '10.0':'Rain with thunderstorm',
    '26.0':'Snowfall'
}

#Also changing the seasons column data type to string
bikes.season = bikes.season.astype('str')

#now we will map the values 0-3 to the actual written seasons
bikes.season = bikes.season.map(season_dict)

#Also changing the weather column data type to string
bikes.weather_category = bikes.weather_category.astype('str')

#now we will map the values 0-3 to the actual written seasons
bikes.weather_category = bikes.weather_category.map(weather_dict)

#ignore the errors, this code works fine.

ValueError: cannot convert float NaN to integer

In [75]:
#checking the dataframe to see if mapping has worked
bikes.head()

Unnamed: 0,time,count,temp_in_celsius,temp_in_celsius_feels_like,humidity_percent,wind_speed_in_kph,weather_category,is_holiday,is_weekend,season
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,,0.0,1.0,
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,,0.0,1.0,
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,,0.0,1.0,
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,,0.0,1.0,
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,,0.0,1.0,


In [25]:
!pip install openpyxl
# writing the final dataframe to an excel file that we will use in our Tableau Visualisations.
bikes.to_excel('london_bikes_final.xlsx', sheet_name='Data')

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl.metadata (1.8 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
   ---------------------------------------- 0.0/250.9 kB ? eta -:--:--
   ------------- -------------------------- 81.9/250.9 kB 1.5 MB/s eta 0:00:01
   ---------------------------------------- 250.9/250.9 kB 2.6 MB/s eta 0:00:00
Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.5


# Link for Tableau Dashboard: https://public.tableau.com/app/profile/ayushi.walia/viz/LondonBikeSharingEDA/Dashboard1
# Part2 Model building in XGBoost

In [82]:
import numpy as np
import datetime as dt
bikes['timestamp'] = pd.to_datetime(bikes['timestamp'])

In [34]:
#bike share usage prediction
#!pip install plotly
!pip install scikit-learn
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.0-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.0-py3-none-win_amd64.whl (124.9 MB)
   ---------------------------------------- 0.0/124.9 MB ? eta -:--:--
   ---------------------------------------- 0.0/124.9 MB ? eta -:--:--
   ---------------------------------------- 0.1/124.9 MB 648.1 kB/s eta 0:03:13
   ---------------------------------------- 0.1/124.9 MB 983.0 kB/s eta 0:02:07
   ---------------------------------------- 0.3/124.9 MB 2.0 MB/s eta 0:01:04
   ---------------------------------------- 0.5/124.9 MB 2.3 MB/s eta 0:00:55
   ---------------------------------------- 0.8/124.9 MB 2.7 MB/s eta 0:00:47
   ---------------------------------------- 1.0/124.9 MB 3.0 MB/s eta 0:00:42
   ---------------------------------------- 1.2/124.9 MB 3.3 MB/s eta 0:00:38
   ---------------------------------------- 1.5/124.9 MB 3.6 MB/s eta 0:00:35
    --------------------------------------- 1.8/124.9 MB 3.6 MB/s eta 0:00:35
    

In [36]:
#visualization
import plotly.express as px

#for pre processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#for predictions
from xgboost import XGBRegressor

# Pre processing

In [93]:
#we are going to extract features out of time column and make their own column
# we do not want to use year as model makes predictions by using the latest data
#i want to build a model that predicts the bike used in the future, and by using th eyear the model would have not seen data from the future.
#As year is the measure of recency, and since the model is always going to be used on the most recent information, knowing the year would not help
# if we want to build a model that could predict the number of bikes used some time in the past then we can use year 
# but we are building this model under the assumption of predicting number of bikes in the future so we will not use year, 
#however month, day and hour are useful so we are going to extract that
def preprocess_inputs(df):
    df = df.copy()
    
    #extract month day hour features from time column
    #lambda function allows us to go into time column and map every x that is every timestamp to every month and x.month will extract month in int form
    df['month'] = df['timestamp'] .apply(lambda x: x.month)
    df['day'] = df['timestamp'] .apply(lambda x: x.day)
    df['hour'] = df['timestamp'] .apply(lambda x: x.hour)
    #since we already created an excel file for visualisation in tableau earlier we can drop this column
    df = df.drop('timestamp', axis = 1)

    #one-hot-encoding weather_category column
    weather_dummies = pd.get_dummies(df['weather_code'], prefix='weather')
    df = pd.concat([df, weather_dummies], axis=1)
    df = df.drop('weather_code', axis=1)

    #split df into X and Y
    y = df['cnt']
    X = df.drop('cnt', axis=1)

    #Train-test split
    #We are going to keep shuffle = true so that will shuffle the data before split, and random state to ensure the shuffle is always done in the same way
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, shuffle = True, random_state = 1)
    
    #Scale X: scale the data
    #Reason to scale the data is becasue the range of values in each column is different and other models except Tree based models (as they are already scaled)
    #other modes improve their performance substantially, when every column has same range of values
    #So standard Scaler gives a shift in scale to every column a mean of 0 and variance of 1
    #XGBoost already scales the data as it is a tree based model, but we are doing it so that we can replace it with other model in future if necessary.
    scaler = StandardScaler()
    scaler.fit(X_train)#as we want to pretend we do have access to test set at the time of pre processing
    
    #we will overwrite x_train and x_test with a scaled version of it so; using pd.DF because transform function returns numpy array
    X_train = pd.DataFrame(scaler.transform(X_train), index = X_train.index, columns=X_train.columns) 
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns) 
    
    return X_train, X_test, y_train, y_test

In [94]:
X_train, X_test, y_train, y_test = preprocess_inputs(bikes)
X_train

Unnamed: 0,t1,t2,hum,wind_speed,is_holiday,is_weekend,season,month,day,hour,weather_1.0,weather_2.0,weather_3.0,weather_4.0,weather_7.0,weather_10.0,weather_26.0
1930,-0.793215,-0.900317,-0.592599,-0.372328,-0.149651,-0.632510,-1.336593,-1.021919,1.055657,0.068510,-0.738682,-0.54635,1.983638,-0.307663,-0.375177,-0.028655,-0.061548
14312,1.536335,1.439022,0.739356,-0.941867,-0.149651,-0.632510,-0.441295,0.424858,1.169583,-1.231778,-0.738682,1.83033,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548
2542,-1.330803,-1.051242,1.019767,-1.384842,-0.149651,-0.632510,-1.336593,-0.732564,0.486026,-0.798349,1.353763,-0.54635,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548
16732,-0.434823,-0.221154,1.089870,-1.637971,-0.149651,-0.632510,1.349300,1.582280,-1.108942,0.357464,-0.738682,-0.54635,1.983638,-0.307663,-0.375177,-0.028655,-0.061548
5815,0.281962,0.382546,-0.662702,-0.372328,-0.149651,-0.632510,0.454002,0.714214,-1.450721,1.513276,1.353763,-0.54635,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10955,-1.062009,-1.353093,0.073378,1.273008,-0.149651,-0.632510,-1.336593,-0.732564,-0.995016,-0.653872,-0.738682,-0.54635,-0.504124,-0.307663,2.665406,-0.028655,-0.061548
17289,-1.241205,-1.126705,0.809458,-1.194996,-0.149651,-0.632510,1.349300,1.582280,1.511362,1.079846,1.353763,-0.54635,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548
5192,1.177942,1.137172,-0.592599,-1.005149,-0.149651,1.581003,-0.441295,0.424858,-0.881089,1.657752,1.353763,-0.54635,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548
12172,0.013168,0.156159,0.879561,-0.878585,-0.149651,1.581003,-1.336593,-0.443208,1.397436,-1.520731,1.353763,-0.54635,-0.504124,-0.307663,-0.375177,-0.028655,-0.061548


In [95]:
# Training
#create the model
model = XGBRegressor()
model.fit(X_train, y_train)

In [97]:
#Results
y_pred = model.predict(X_test)
y_pred

array([ -61.11291,  688.3381 , 1388.496  , ...,  278.1223 ,  140.87457,
       2237.812  ], dtype=float32)

In [99]:
#we want a metric to understand how bad are our predictions by taking error ypred-ytest
#as the errors will be negative and positive so we cannot do average they will cancel out so we will take squareroot of ypred-ytest
(y_test - y_pred)**2

14999     68703.178231
5504      31920.083898
10259      2162.624650
15150    213124.013572
345       75134.142600
             ...      
11357       941.372435
9217       3021.717810
3733       5346.872871
2248       9048.846909
14506     29863.991394
Name: cnt, Length: 5225, dtype: float64

In [103]:
#mean squared error
np.mean((y_test - y_pred)**2)

np.float64(42783.84928421039)

In [104]:
#our units of the resukt is in squared number of bytes, so we take the square root to return units number of bike: rmse
#rmse: measure of how our model is doing in absolute; in te context of the task we are performing
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
rmse

np.float64(206.8425712570079)

In [105]:
# we are 206 bikes off every prediction
# now we will find r squared score: which is a measure of how much better our model is than the baseline model
#baseline model is mean of y_test that is only predictions based on target value test set
y_test.mean()

np.float64(1151.3703349282296)

In [106]:
#sum of squared error for the baseline model
np.sum((y_test - y_test.mean())**2)

np.float64(6161196246.401914)

In [107]:
#first we will find sum of squared error for our model by using ytest and pred
np.sum((y_test - y_pred)**2)

np.float64(223545612.50999928)

In [112]:
#second will be comparing model y dividing our model to baseline model: numerator our model and denominator baseline model
# this will have effect of becoming 0 when we have zero error or when our error is smaller than the baseline error
#when our error is larger than the baseline error then it will go to positive infinity
# so to not get positive inifinty or zero we will do 1-r2 score as it can be from 0 to 1
#r2 score will compar eour model to baseline model and tells how much better is our model
r2 = 1-(np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2))

In [114]:
# here r2score is 96.37% which says our model makes 96% predictions right
#print the values 
print("RMSE: {:.2f}".format(rmse))
print("R2 score: {:.4f}".format(r2))

RMSE: 206.84
R2 score: 0.9637


In [124]:
#plot - scatter plot using plotly express: visualize the predicted values against actual values
#!pip install plotly
import plotly.express as px
from plotly.offline import plot, init_notebook_mode
init_notebook_mode(connected=True)

fig = px.scatter(
    x=y_pred,
    y=y_test,
    labels={'x':"Predicted", 'y': "Actual"},
    title = "Actual v/s Predicted values",
    width = 700,
    height = 700
)

fig.show()

In [None]:
# The accuracy score is 96% that means our predicted values that means predicted bikes used per hour rate is 96%