<a href="https://colab.research.google.com/github/Wasimds/MLOps_Training/blob/main/MLOps_Evidently_Bicycle_Demand_Monitoring_WithEDA_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bike Rental Prediction and MLOps Demo**

### **Dataset Details**

The dataset under study is a two-year hourly usage log of a bike
sharing system, Capital Bike Sharing (CBS), in Washington, D.C., USA. It is a good fit for
our learning goals for two reasons. **Firstly**, it covers two full years of data (2011 and 2012),
which makes it suitable for supervised and semi-supervised learning. **Secondly**, the dataset contains historical external factors such as weather conditions, weekdays and official holidays of Washington, D.C.
Refer to the section below for the detailed data dictionary.



1.   dteday    : date of observation
2.   instant   : record index
3.   season    : season (1:Spring, 2:Summer, 3:Fall, 4:Winter)
4.   yr        : year of observation (0: 2011, 1: 2012)
5.   mnth      : 1 = January, ..., 12 = December
6.   hr        : hour (0 to 23)
7.   holiday   : whether the day was a holiday (1 = Yes, 0 = No)
8.   weekday   : 0 = Sunday, ..., 6 = Saturday
9.   workingday: whether the day was a work day (i.e., not a weekend or holiday) (1 = Yes; 0 = No)
10.   weathersit: type of weather
      *   1 = clear, few clouds, partly cloudy
      *   2 = mist & cloudy, mist & broken clouds, mist & few clouds, mist
      *   3 = light snow, light rain & thunderstorm & scattered clouds, light rain & scattered clouds
      *   4 = heavy rain & ice pellets & thunderstorm & mist, snow & fog
11.   temp      : normalized temperature in Celsius; the values are divided by 41 (max)
12.   atemp     : normalized feeling temperature in Celsius; the values are divided by 50 (max)
13.   hum       : normalized relative humidity; the values are divided by 100 (max)
14.   windspeed : normalized wind speed; the values are divided by 67 (max)
15.   casual    : number of casual bike users
16.   registered: number of registered bike users
17.   cnt       : count of total rental bikes including both casual and registered
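
The normalized temperature columns can be mapped back to approximate Celsius values. The snippet below is a minimal sketch, assuming the normalization is a plain division by the maximum as the data dictionary states; the sample rows and the names `sample`, `TEMP_MAX_C` and `ATEMP_MAX_C` are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows that mimic the normalized columns (not read from the dataset).
sample = pd.DataFrame({"temp": [0.24, 0.50, 1.00], "atemp": [0.2879, 0.4848, 1.00]})

# Scale factors taken from the data dictionary (assumed to be simple divisors).
TEMP_MAX_C = 41
ATEMP_MAX_C = 50

sample["temp_c"] = sample["temp"] * TEMP_MAX_C
sample["atemp_c"] = sample["atemp"] * ATEMP_MAX_C
print(sample)
```

Keeping the normalized columns for modelling and adding the rescaled ones only for plots and reports avoids changing the model inputs.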

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

Saving Bike_Rental_Data_Dictionary.jpg to Bike_Rental_Data_Dictionary (1).jpg
User uploaded file "Bike_Rental_Data_Dictionary.jpg" with length 161771 bytes


In [None]:
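import cv2
from google.colab.patches import cv2_imshow  # cv2.imshow() is not supported in Colab

# Path kept as in the original run; cv2.imread() returns None if the file does not exist.
img = cv2.imread("/content/Bike_Rental_Data_Dictionary.jpg.jpg")
cv2_imshow(img)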

In [None]:
!pip install evidently

In [12]:
import pandas as pd
import numpy as np
import requests
import zipfile
import io

from datetime import datetime
from sklearn import datasets, ensemble

from evidently.dashboard import Dashboard
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.tabs import DataDriftTab, NumTargetDriftTab, RegressionPerformanceTab

import seaborn as sns
import matplotlib.pyplot as plt



## **Reading Hourly Bike Rental Data**

In [17]:
content = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
    raw_data = pd.read_csv(arc.open("hour.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')

## **Data Exploration**

In [36]:
raw_data.shape

(17379, 16)

In [39]:
list(raw_data.columns)

['instant',
 'season',
 'yr',
 'mnth',
 'hr',
 'holiday',
 'weekday',
 'workingday',
 'weathersit',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [29]:
print("Raw data information")
raw_data.info()  # info() prints its report directly and returns None

Raw data information
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 17379 entries, 2011-01-01 to 2012-12-31
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   season      17379 non-null  int64  
 2   yr          17379 non-null  int64  
 3   mnth        17379 non-null  int64  
 4   hr          17379 non-null  int64  
 5   holiday     17379 non-null  int64  
 6   weekday     17379 non-null  int64  
 7   workingday  17379 non-null  int64  
 8   weathersit  17379 non-null  int64  
 9   temp        17379 non-null  float64
 10  atemp       17379 non-null  float64
 11  hum         17379 non-null  float64
 12  windspeed   17379 non-null  float64
 13  casual      17379 non-null  int64  
 14  registered  17379 non-null  int64  
 15  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12)
memory usage: 2.3 MB


In [32]:
raw_data.head(100)

dteday,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
2011-01-01,1,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0000,3,13,16
2011-01-01,2,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0000,8,32,40
2011-01-01,3,1,0,1,2,0,6,0,1,0.22,0.2727,0.80,0.0000,5,27,32
2011-01-01,4,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0000,3,10,13
2011-01-01,5,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2011-01-05,96,1,0,1,4,0,3,1,1,0.24,0.2273,0.48,0.2239,0,2,2
2011-01-05,97,1,0,1,5,0,3,1,1,0.22,0.2273,0.47,0.1642,0,3,3
2011-01-05,98,1,0,1,6,0,3,1,1,0.20,0.1970,0.47,0.2239,0,33,33
2011-01-05,99,1,0,1,7,0,3,1,1,0.18,0.1818,0.43,0.1940,1,87,88


In [22]:
raw_data.describe()

,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


### Checking if there are any null values

In [34]:
raw_data.isnull().sum()

instant       0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [None]:
plt.figure(figsize = (10,10))
sns.heatmap(raw_data.corr(), annot = True)
plt.show()

### **The heatmap above shows a clear and strong correlation between month (`mnth`) and season.**

In [49]:
print("Record Count Summary By Each Season")
print(str(raw_data['season'].value_counts()).split('\n'))   #season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
print('#'* 80)
print("Record Count Summary By Each Day Of The Week")
print(str(raw_data['weekday'].value_counts()).split('\n'))  # 0 = Sunday, .., 6 = Saturday
print('#'* 80)
print("Record Count Summary By Each Month")
print(str(raw_data['mnth'].value_counts()).split('\n'))  # 1 = January, ..., 12 = December

Record Count Summary By Each Season
['3    4496', '2    4409', '1    4242', '4    4232', 'Name: season, dtype: int64']
################################################################################
Record Count Summary By Each Day Of The Week
['6    2512', '0    2502', '5    2487', '1    2479', '3    2475', '4    2471', '2    2453', 'Name: weekday, dtype: int64']
################################################################################
Record Count Summary By Each Month
['7     1488', '5     1488', '12    1483', '8     1475', '3     1473', '10    1451', '6     1440', '11    1437', '9     1437', '4     1437', '1     1429', '2     1341', 'Name: mnth, dtype: int64']


### **The summary above shows slightly more hourly records in Summer and Fall. Note that these are counts of logged hours, not of rentals; rental volume is given by `cnt`.**
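
`value_counts()` counts rows (hourly records), so it can help to aggregate `cnt` itself when comparing rental volume across seasons. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration); the same `groupby` could be applied to `raw_data`.

```python
import pandas as pd

# Hypothetical hourly records: season 1 has more rows but far fewer rentals per row.
toy = pd.DataFrame({
    "season": [1, 1, 1, 3, 3],
    "cnt":    [10, 20, 30, 200, 300],
})

# Record count, total rentals and mean rentals per hourly record, by season.
summary = toy.groupby("season")["cnt"].agg(
    records="count", total_rentals="sum", mean_per_hour="mean"
)
print(summary)
```

In this toy sample, the season with the most records is not the season with the most rentals, which is why both views are worth checking.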

In [None]:

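def bargraphs(columns):
    """Draw one count plot (number of hourly records per category) for each column."""
    sns.set_palette("RdBu")
    for col in columns:
        plt.figure(figsize=(8, 6))
        sns.countplot(x=col, data=raw_data)
        plt.show()  # call show(); the bare attribute `plt.show` does nothing

bargraphs(['season', 'mnth', 'weathersit'])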



*   **The season and month graphs above show slightly more hourly records in Summer and Fall. These plots count records, not rentals.**
*   **The weather graph shows that most hourly records were logged in good weather (`weathersit` = 1). Confirming that rentals are higher in good weather requires aggregating `cnt`.**



In [None]:
def distributionplot(columns):
    """Plot a histogram with a KDE overlay for each column."""
    for col in columns:
        plt.figure(figsize=(14, 6))
        sns.histplot(raw_data[col], kde=True)  # replaces the deprecated sns.distplot
        plt.show()

distributionplot(['casual', 'registered', 'cnt'])

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='weekday',y='cnt', data=raw_data)
plt.show()

### **The box plot above shows that `cnt` is positively skewed on each day of the week. Let's explore the presence of outliers in the data.**
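
The points drawn beyond the whiskers of a default box plot are those farther than 1.5 times the interquartile range (IQR) from the quartiles. The snippet below is a minimal sketch of that rule on hypothetical counts; the array `counts` is invented for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical right-skewed hourly counts (illustrative values only).
counts = np.array([1, 5, 16, 40, 88, 142, 189, 281, 350, 977])

q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # upper whisker limit under the 1.5 * IQR rule
lower = q1 - 1.5 * iqr  # lower whisker limit under the 1.5 * IQR rule

outliers = counts[(counts > upper) | (counts < lower)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, lower={lower}, upper={upper}")
print("Outliers:", outliers)
```

For a count variable with a long right tail, this rule flags only high values, since the lower limit falls below zero.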

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='season', y='cnt', data=raw_data)
plt.xlabel("Seasons (1= spring,  2= summer,  3= fall,  4= winter)", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='hr',y='cnt', data=raw_data) 
plt.show()

### **We can see that a large number of people rent bikes during the morning and evening hours.**
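
The hourly pattern can also be read as numbers by averaging `cnt` per hour. The snippet below is a minimal sketch on a hypothetical toy frame; `toy` and its values are invented for illustration, and the same `groupby` could be run on `raw_data`.

```python
import pandas as pd

# Hypothetical hourly rows (hr, cnt); values are illustrative only.
toy = pd.DataFrame({
    "hr":  [3, 3, 8, 8, 12, 12, 17, 17],
    "cnt": [5, 7, 300, 340, 150, 170, 420, 460],
})

# Mean rentals per hour of day, busiest hours first.
mean_by_hour = toy.groupby("hr")["cnt"].mean().sort_values(ascending=False)
print(mean_by_hour.head(2))  # the two busiest hours in this toy sample
```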

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='yr',y='cnt', data=raw_data) 
plt.xlabel("Year (0 = 2011,  1 = 2012)", fontsize=16)
plt.show()

In [89]:
(raw_data == 0).sum()

instant           0
season            0
yr             8645
mnth              0
hr              726
holiday       16879
weekday        2502
workingday     5514
weathersit        0
temp              0
atemp             2
hum              22
windspeed      2180
casual         1581
registered       24
cnt               0
dtype: int64

Let's treat the zero values in the `windspeed` column as missing and fill them in. First, replace the zeros with NaN.

In [90]:
raw_data['windspeed'] = raw_data['windspeed'].replace(0, np.nan)

In [91]:
(raw_data == 0).sum()

instant           0
season            0
yr             8645
mnth              0
hr              726
holiday       16879
weekday        2502
workingday     5514
weathersit        0
temp              0
atemp             2
hum              22
windspeed         0
casual         1581
registered       24
cnt               0
dtype: int64

In [92]:
raw_data['windspeed'].isnull().sum()

2180

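Let's now fill the NaN values. The cell below back-fills first, so each NaN takes the next valid reading; `interpolate()`, which is commonly used to fill gaps in time series data, then fills any NaN values that remain at the end of the series.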

In [93]:
raw_data['windspeed'] = raw_data['windspeed'].bfill()        # each NaN takes the next valid value
raw_data['windspeed'] = raw_data['windspeed'].interpolate()  # fills any NaNs left at the end

In [94]:
raw_data['windspeed'].isnull().sum()

0
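
Back-filling and linear interpolation give different results on the same gaps. The snippet below is a minimal sketch on a hypothetical series (the values in `s` are invented for illustration) that shows both side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical wind speed readings with gaps, including one at the end.
s = pd.Series([0.10, np.nan, np.nan, 0.40, np.nan])

backfilled = s.bfill()          # each NaN takes the next valid value; a trailing NaN stays NaN
interpolated = s.interpolate()  # linear between neighbors; a trailing NaN repeats the last value

print(pd.DataFrame({"original": s, "bfill": backfilled, "interpolate": interpolated}))
```

Running `bfill()` before `interpolate()` means that only trailing gaps are left for the interpolation step; swapping the order would fill interior gaps linearly instead.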

# **ML Model Training and Prediction**

In [None]:
# raw_data['month'] = raw_data.index.map(lambda x : x.month)
# raw_data['hour'] = raw_data.index.map(lambda x : x.hour)
# raw_data['weekday2'] = raw_data.index.map(lambda x : x.weekday() + 1)

In [99]:
target = 'cnt'
prediction = 'prediction'
numerical_features = ['temp', 'atemp', 'hum', 'windspeed', 'hr', 'weekday']
categorical_features = ['season', 'holiday', 'workingday']

In [122]:
reference = raw_data.loc['2011-01-01 00:00:00':'2012-08-31 23:00:00'].copy()  # Train the model on 20 months of data (Jan'11 to Aug'12)
current = raw_data.loc['2012-09-01 00:00:00':'2012-11-30 23:00:00'].copy()    # Validate the model on 3 months of data (Sep'12, Oct'12 and Nov'12)
test = raw_data.loc['2012-12-01 00:00:00':'2012-12-31 23:00:00'].copy()       # Test the model on Dec'12 data
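
A quick way to check how many calendar months each split spans is to count monthly periods between its boundaries. The snippet below is a minimal sketch; the dictionary `splits` is a hypothetical helper that copies the date boundaries from the cell above.

```python
import pandas as pd

# Split boundaries copied from the slicing cell (dates only).
splits = {
    "reference": ("2011-01-01", "2012-08-31"),
    "current":   ("2012-09-01", "2012-11-30"),
    "test":      ("2012-12-01", "2012-12-31"),
}

# Number of calendar months covered by each split.
months = {
    name: len(pd.period_range(start, end, freq="M"))
    for name, (start, end) in splits.items()
}
print(months)
```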

In [None]:
print("Reference data info")
reference.info()  # info() prints its report directly and returns None
print('#' * 100)
print("Current data info")
current.info()

In [125]:
regressor = ensemble.RandomForestRegressor(random_state = 0, n_estimators = 50)
model = regressor.fit(reference[numerical_features + categorical_features], reference[target])

In [126]:
ref_prediction = model.predict(reference[numerical_features + categorical_features])
current_prediction = model.predict(current[numerical_features + categorical_features])
test_prediction = model.predict(test[numerical_features + categorical_features])

In [127]:
reference['prediction'] = ref_prediction
current['prediction'] = current_prediction
test['prediction'] = test_prediction

In [None]:
reference.head(100)

In [None]:
current.head(100)

In [128]:
column_mapping = ColumnMapping()
column_mapping.target = target
column_mapping.prediction = prediction
column_mapping.numerical_features = numerical_features
column_mapping.categorical_features = categorical_features

In [129]:
column_mapping

ColumnMapping(target='cnt', prediction='prediction', datetime='datetime', id=None, numerical_features=['temp', 'atemp', 'hum', 'windspeed', 'hr', 'weekday'], categorical_features=['season', 'holiday', 'workingday'], target_names=None)

In [108]:
regression_perfomance_dashboard = Dashboard(tabs=[RegressionPerformanceTab()])
regression_perfomance_dashboard.calculate(reference, None, column_mapping=column_mapping)
regression_perfomance_dashboard.show()
# regression_perfomance_dashboard.save('/content/drive/My Drive//regression_performance_at_training.html')

In [111]:
regression_perfomance_dashboard = Dashboard(tabs=[RegressionPerformanceTab()])
regression_perfomance_dashboard.calculate(reference, current, column_mapping=column_mapping)
regression_perfomance_dashboard.show()
# regression_perfomance_dashboard.save('/content/drive/My Drive//regression_performance_train_vs_test.html')
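
Regression quality is typically summarized with error metrics such as mean error (ME), mean absolute error (MAE) and mean absolute percentage error (MAPE). The snippet below is a minimal sketch that computes them by hand with NumPy on hypothetical arrays; it is not Evidently's implementation, and the sign convention for ME (prediction minus target) is an assumption.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 50.0, 400.0])
y_pred = np.array([110.0, 180.0, 60.0, 400.0])

error = y_pred - y_true                       # assumed convention: prediction minus target
me = error.mean()                             # mean error (bias)
mae = np.abs(error).mean()                    # mean absolute error
mape = (np.abs(error) / y_true).mean() * 100  # mean absolute percentage error, in percent

print(f"ME={me:.2f}, MAE={mae:.2f}, MAPE={mape:.2f}%")
```

Computing the same metrics separately on `reference` and `current` gives a plain-number comparison of training and validation error.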

# **Week 1**

In [132]:
regression_perfomance_dashboard.calculate(reference,test.loc['2012-12-01 00:00:00':'2012-12-07 23:00:00'],column_mapping=column_mapping)

In [None]:
regression_perfomance_dashboard.show()
#regression_perfomance_dashboard.save('reports/regression_performance_after_week1.html')

In [None]:
target_drift_dashboard = Dashboard(tabs=[NumTargetDriftTab()])
target_drift_dashboard.calculate(reference, test.loc['2012-12-01 00:00:00':'2012-12-07 23:00:00'],column_mapping=column_mapping)

target_drift_dashboard.show()
#target_drift_dashboard.save('reports/target_drift_after_week1.html')

# **Data Drift For Week 1 (For Reference Purpose Only)**

In [134]:
column_mapping = ColumnMapping()

column_mapping.numerical_features = numerical_features

In [None]:
data_drift_dashboard = Dashboard(tabs=[DataDriftTab()])
data_drift_dashboard.calculate(reference, test.loc['2012-12-01 00:00:00':'2012-12-07 23:00:00'],column_mapping=column_mapping)

data_drift_dashboard.show()
# data_drift_dashboard.save("reports/data_drift_dashboard_after_week1.html")
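
One common way to test a numerical feature for drift is a two-sample Kolmogorov-Smirnov test between the reference and current values. The snippet below is a minimal sketch using `scipy.stats.ks_2samp` on hypothetical synthetic samples; the 0.05 threshold is an assumption, and the test Evidently actually applies may differ by version and configuration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values: the "current" sample is shifted relative to the reference.
reference_sample = rng.normal(loc=0.5, scale=0.1, size=1000)
current_sample = rng.normal(loc=0.6, scale=0.1, size=1000)

result = ks_2samp(reference_sample, current_sample)
drift_detected = result.pvalue < 0.05  # 0.05 is an assumed significance threshold

print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}, drift={drift_detected}")
```

With only one week of current data, a small sample can make such tests less sensitive, so the result is best read alongside the distribution plots.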