<a href="https://colab.research.google.com/github/aisha-partha/AIMLOps-MiniProjects/blob/mp_5/M3_NB2_MiniProject_1_PartA_Regression_and_Modularization_Aishwarya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization (Pipeline Building)

#### (Notebook-2)

## Problem Statement

Predict the bike rental count per hour based on the environmental and seasonal settings (such as weather, day, time, humidity, wind speed, season etc).

## Learning Objectives

At the end of the mini-project, you will be able to:

* create custom classes required for data processing
* implement a pipeline and train the model
* save the model/pipeline
* make predictions using the saved model/pipeline

## Dataset Description

The dataset chosen for this mini-project is a modified version of the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. It consists of 17,379 instances, each with 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike from one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:**
    * spring
    * summer
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **hum:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered [Target column]

In [2]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

bike-sharing-dataset.csv
Dataset downloaded successfully!


### Import Required Packages

In [3]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [4]:
# ========== NEW IMPORTS FOR PIPELINE BUILDING ========

# to create pipeline
from sklearn.pipeline import Pipeline

# for including custom preprocessors within pipeline
from sklearn.base import BaseEstimator, TransformerMixin

## **1. Pre-Pipeline-Steps:**

### 1.1 Load, Explore, and Prepare the Data Set

* Load the dataset
* Understand different features in the training dataset
* Understand the data type of each column
* Note which columns have missing values
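
As a rough sketch of the last two checks, the snippet below inspects column data types and counts missing values. It uses a small made-up frame (`toy`) rather than the real dataset, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike-sharing data (values are made up)
toy = pd.DataFrame({
    "weekday": ["Mon", np.nan, "Wed", "Thu"],
    "weathersit": ["Mist", "Clear", np.nan, np.nan],
    "temp": [6.1, 26.78, 3.28, 14.56],
})

# Data type of each column
print(toy.dtypes)

# Number of missing values per column, keeping only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

The same two calls can be applied to the loaded `bikeshare` frame.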

In [6]:
# YOUR CODE HERE
# Reading Our Dataset
bikeshare = pd.read_csv('bike-sharing-dataset.csv')
print('The shape of the dataset:', bikeshare.shape)

print('The col info on the dataset:')
bikeshare.info()

The shape of the dataset: (17379, 14)
The col info on the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     16504 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  16121 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(3), object(7)
memory usage: 1.9+ MB


### 1.2 Working on `dteday` column to extract year and month

- Create a function to extract the year and month from the date column and store them in two new columns

In [7]:
# YOUR CODE HERE
def get_year_and_month(dataframe):

    df = dataframe.copy()
    # convert 'dteday' column to Datetime datatype
    df['dteday'] = pd.to_datetime(df['dteday'], format='%Y-%m-%d')
    # Add new features 'yr' and 'mnth'
    df['yr'] = df['dteday'].dt.year
    df['mnth'] = df['dteday'].dt.month_name()

    return df

In [9]:
bikeshare = get_year_and_month(bikeshare)
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   dteday      17379 non-null  datetime64[ns]
 1   season      17379 non-null  object        
 2   hr          17379 non-null  object        
 3   holiday     17379 non-null  object        
 4   weekday     16504 non-null  object        
 5   workingday  17379 non-null  object        
 6   weathersit  16121 non-null  object        
 7   temp        17379 non-null  float64       
 8   atemp       17379 non-null  float64       
 9   hum         17379 non-null  float64       
 10  windspeed   17379 non-null  float64       
 11  casual      17379 non-null  int64         
 12  registered  17379 non-null  int64         
 13  cnt         17379 non-null  int64         
 14  yr          17379 non-null  int32         
 15  mnth        17379 non-null  object        
dtypes: datetime64[ns](1), 

### 1.3 Find numerical and categorical variables

In [10]:
# YOUR CODE HERE
unused_colms = ['dteday', 'casual', 'registered']
target_col = ['cnt']

numerical_features = []
categorical_features = []

for col in bikeshare.columns:
    if col not in target_col + unused_colms:
        if bikeshare[col].dtypes == 'float64':
            numerical_features.append(col)
        else:
            categorical_features.append(col)


print('Number of numerical variables: {}'.format(len(numerical_features)),":" , numerical_features)

print('Number of categorical variables: {}'.format(len(categorical_features)),":" , categorical_features)

Number of numerical variables: 4 : ['temp', 'atemp', 'hum', 'windspeed']
Number of categorical variables: 8 : ['season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'yr', 'mnth']


## **2. Pipeline-Steps:**

Build custom classes that are compatible with the Sklearn pipeline for imputation, feature mapping, and any column-specific operation.

### **A. Imputation**

#### Build a custom Imputation class compatible with Sklearn for handling missing values in `weekday` column.

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names will contain full names (eg. 'Monday'), and the `weekday` column contains only first three letters (eg. 'Mon').
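
The note above can be illustrated with a minimal sketch: `dt.day_name()` returns full day names, and slicing with `.str[:3]` shortens them to the three-letter form used in `weekday`. The three dates are taken from the sample rows of this dataset; the variable names are illustrative.

```python
import pandas as pd

# Three dates from the sample rows of the dataset
dates = pd.to_datetime(pd.Series(["2012-11-05", "2011-07-13", "2012-02-09"]))

# Full day names, e.g. 'Monday'
full_names = dates.dt.day_name()

# First three letters only, matching the format of the 'weekday' column
short_names = full_names.str[:3]
print(short_names.tolist())
```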

In [14]:
class WeekdayImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weekday' column by extracting dayname from 'dteday' column """

    def __init__(self, col='weekday'):
        self.col = col

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # nothing to learn; fit exists to accommodate the sklearn pipeline
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        X = df.copy()
        # row indices where the weekday is missing
        null_idx = X[X[self.col].isnull()].index
        # day_name() gives full names ('Monday'); keep the first three letters ('Mon')
        X.loc[null_idx, self.col] = X.loc[null_idx, 'dteday'].dt.day_name().str[:3]
        return X







In [15]:
bikeshare['weekday'].unique()

array(['Mon', 'Wed', 'Thu', 'Tue', nan, 'Fri', 'Sun', 'Sat'], dtype=object)

In [16]:
# Apply weekday imputer

# YOUR CODE HERE
cwt = WeekdayImputer(col='weekday')
data1 = cwt.fit_transform(bikeshare)

In [17]:
data1['weekday'].unique()

array(['Mon', 'Wed', 'Thu', 'Tue', 'Sun', 'Fri', 'Sat'], dtype=object)

#### Build another custom Imputation class compatible with Sklearn for handling missing values in `weathersit` column.

- Fill in the missing rows in this column with the most frequent category
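
A minimal sketch of most-frequent-category imputation on a made-up series (not the real column) might look like this:

```python
import numpy as np
import pandas as pd

# Made-up weather categories with two missing entries
weather = pd.Series(["Mist", "Clear", np.nan, "Clear", "Light Rain", np.nan])

# mode() ignores NaN and returns the most frequent value(s); take the first
most_frequent = weather.mode()[0]

# Replace the missing entries with that category
filled = weather.fillna(most_frequent)
print(filled.tolist())
```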

In [18]:
bikeshare['weathersit'].unique()

array(['Mist', 'Clear', nan, 'Light Rain', 'Heavy Rain'], dtype=object)

In [19]:

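class WeathersitImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weathersit' column by replacing them with the most frequent category value """

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the most frequent category of each column from the data passed to fit
        self.fill_values_ = {col: X[col].value_counts().index[0] for col in self.cols}
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        X = df.copy()
        for col in self.cols:
            # assign the result back instead of calling fillna(inplace=True) on a column selection
            X[col] = X[col].fillna(self.fill_values_[col])
        return X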


In [20]:
# Apply weathersit imputer

# YOUR CODE HERE
wsi = WeathersitImputer(cols=['weathersit'])
data2 = wsi.fit_transform(data1)


In [21]:
data2['weathersit'].unique()

array(['Mist', 'Clear', 'Light Rain', 'Heavy Rain'], dtype=object)

### **B. Mapping**

#### Build a Mapper class for mapping `yr`, `mnth`, `season`, `weathersit`, `holiday`, `workingday`, and `hr` columns.

In [22]:


class Mapper(BaseEstimator, TransformerMixin):
    """
    Ordinal categorical variable mapper:
    Treat column as Ordinal categorical variable, and assign values accordingly
    """

    def __init__(self, variables: str, mappings: dict):

        if not isinstance(variables, str):
            raise ValueError("variables should be a str")

        self.variables = variables
        self.mappings = mappings

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # we need the fit statement to accommodate the sklearn pipeline
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        X[self.variables] = X[self.variables].map(self.mappings).astype(int)

        return X


In [33]:
# Instantiate mapper for all ordinal categorical features

# YOUR CODE HERE

col_mapping = {
"yr" : {2011: 0, 2012: 1},
"mnth" : {'January': 0, 'February': 1, 'December': 11, 'March': 2, 'November': 10, 'April': 3,
                'October': 9, 'May': 4, 'September': 8, 'June': 5, 'July': 6, 'August': 7},
"season" : {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3},
"weathersit" : {'Heavy Rain': 0, 'Light Rain': 1, 'Mist': 2, 'Clear': 3},
"holiday" : {'Yes': 0, 'No': 1},
"workingday" : {'No': 0, 'Yes': 1},
"hr" : {'4am': 0, '3am': 1, '5am': 2, '2am': 3, '1am': 4, '12am': 5, '6am': 6, '11pm': 7, '10pm': 8,
                '10am': 9, '9pm': 10, '11am': 11, '7am': 12, '9am': 13, '8pm': 14, '2pm': 15, '1pm': 16,
                '12pm': 17, '3pm': 18, '4pm': 19, '7pm': 20, '8am': 21, '6pm': 22, '5pm': 23}

}

In [29]:
data2.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,November
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,July
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99,2012,February
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361,2012,March
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203,2011,November


In [46]:


data3 = data2.copy()
# YOUR CODE HERE
for key, value in col_mapping.items():
    print(key, value)
    col_mapper = Mapper(key, value)
    # fit and transform the same frame so each mapping builds on the previous one
    data3 = col_mapper.fit_transform(data3)


yr {2011: 0, 2012: 1}
mnth {'January': 0, 'February': 1, 'December': 11, 'March': 2, 'November': 10, 'April': 3, 'October': 9, 'May': 4, 'September': 8, 'June': 5, 'July': 6, 'August': 7}
season {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3}
weathersit {'Heavy Rain': 0, 'Light Rain': 1, 'Mist': 2, 'Clear': 3}
holiday {'Yes': 0, 'No': 1}
workingday {'No': 0, 'Yes': 1}
hr {'4am': 0, '3am': 1, '5am': 2, '2am': 3, '1am': 4, '12am': 5, '6am': 6, '11pm': 7, '10pm': 8, '10am': 9, '9pm': 10, '11am': 11, '7am': 12, '9am': 13, '8pm': 14, '2pm': 15, '1pm': 16, '12pm': 17, '3pm': 18, '4pm': 19, '7pm': 20, '8am': 21, '6pm': 22, '5pm': 23}


In [47]:
data3.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,1,6,1,Mon,1,2,6.1,3.0014,49.0,19.0012,4,135,139,1,10
1,2011-07-13,3,0,1,Wed,1,3,26.78,28.9988,58.0,16.9979,0,5,5,0,6
2,2012-02-09,0,11,1,Thu,1,3,3.28,-0.9982,52.0,15.0013,4,95,99,1,1
3,2012-03-22,2,12,1,Thu,1,2,14.56,15.0002,100.0,6.0032,29,332,361,1,2
4,2011-11-08,1,17,1,Tue,1,3,16.44,17.0,52.0,8.9981,28,175,203,0,10


### **C. Class for Specific operation**

#### Build a Class for handling outliers in numerical columns

- Instead of removing the outliers, change their values
    - to the upper bound, if the value is higher than the upper bound, or
    - to the lower bound, if the value is lower than the lower bound.
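
The clipping rule above can be sketched on a made-up series. This assumes the common IQR-based bounds (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR); the factor 1.5 and the sample values are illustrative.

```python
import pandas as pd

# Made-up values with one very high and one very low entry
values = pd.Series([10, 12, 11, 13, 12, 100, -50, 11])

# IQR-based bounds (assumed rule: 1.5 times the interquartile range)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values beyond a bound are replaced by that bound; the rest are unchanged
clipped = values.clip(lower=lower_bound, upper=upper_bound)
print(lower_bound, upper_bound)
print(clipped.tolist())
```

Using `clip` avoids a Python-level loop over the rows, which matters for a frame with 17,379 rows.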

In [51]:



class OutlierHandler(BaseEstimator, TransformerMixin):
    """ Clip outliers in a numerical column to IQR-based lower and upper bounds """

    def __init__(self, colm: str, factor=1.5):
        self.factor = factor
        self.colm = colm

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the bounds from the data passed to fit
        q1 = X[self.colm].quantile(0.25)
        q3 = X[self.colm].quantile(0.75)
        iqr = q3 - q1
        self.lower_bound_ = q1 - (self.factor * iqr)
        self.upper_bound_ = q3 + (self.factor * iqr)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()
        # values beyond a bound are replaced by that bound
        df[self.colm] = df[self.colm].clip(lower=self.lower_bound_, upper=self.upper_bound_)
        return df



In [52]:
data3.index

RangeIndex(start=0, stop=17379, step=1)

In [53]:
# Instantiate outlier handler for all numerical features
# Handle outliers for all numerical columns
data4 = data3.copy()
for col in numerical_features:
    outhandler = OutlierHandler(col)
    # feed the result back in so the clipping of every column is kept
    data4 = outhandler.fit_transform(data4)

In [None]:


# YOUR CODE HERE
data4[numerical_features].boxplot()
plt.xticks(rotation= 60)
plt.show()

#### Build a Class to One-hot Encode `weekday` column

In [None]:

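class WeekdayOneHotEncoder(BaseEstimator, TransformerMixin):
    """ One-hot encode weekday column """

    def __init__(self, col='weekday'):
        self.col = col

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # learn the weekday categories from the data passed to fit
        self.encoder_ = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.encoder_.fit(X[[self.col]])
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()
        names = self.encoder_.get_feature_names_out([self.col])
        encoded = pd.DataFrame(self.encoder_.transform(df[[self.col]]), columns=names, index=df.index)
        # replace the original column with its one-hot columns
        return pd.concat([df.drop(columns=[self.col]), encoded], axis=1)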


In [55]:
# Treat 'weekday' column as a Categorical variable, perform one-hot encoding

# YOUR CODE HERE
encoder = OneHotEncoder(sparse_output=False)
# fit_transform returns the encoded array; fit alone returns the encoder itself
data_enc = encoder.fit_transform(data3[['weekday']])

## **3. Build Pipeline**

Build a pipeline and implement all the above class transformers inside the pipeline along with the regressor.

In [None]:
# YOUR CODE HERE

from sklearn.preprocessing import FunctionTransformer

df_pipe = Pipeline([

    ##==========Imputation======##
    ('weekday_imputation', WeekdayImputer(col='weekday')),
    ('weathersit_imputation', WeathersitImputer(cols=['weathersit'])),

    ##==========Mapper======##
    # reuse the mappings defined in col_mapping above
    ('map_yr', Mapper('yr', col_mapping['yr'])),
    ('map_mnth', Mapper('mnth', col_mapping['mnth'])),
    ('map_season', Mapper('season', col_mapping['season'])),
    ('map_weathersit', Mapper('weathersit', col_mapping['weathersit'])),
    ('map_holiday', Mapper('holiday', col_mapping['holiday'])),
    ('map_workingday', Mapper('workingday', col_mapping['workingday'])),
    ('map_hr', Mapper('hr', col_mapping['hr'])),

    ##==========Outlier handling (one step per numerical column)======##
    ('outlier_temp', OutlierHandler('temp')),
    ('outlier_atemp', OutlierHandler('atemp')),
    ('outlier_hum', OutlierHandler('hum')),
    ('outlier_windspeed', OutlierHandler('windspeed')),

    # One-hot encode the 'weekday' column
    ('onehot_encoder', WeekdayOneHotEncoder()),

    # Drop the unused columns ('dteday' is needed by the weekday imputer, so it is dropped only here)
    ('drop_unused', FunctionTransformer(pd.DataFrame.drop, kw_args={'columns': unused_colms})),

    # scale
    ('scaler', StandardScaler()),

    # Model fit (regression target, so a regressor rather than a classifier)
    ('model_rf', RandomForestRegressor(n_estimators=150, max_depth=5, random_state=42))
])

## **4. Fit Pipeline**

- Separate target and prediction features
- Split data into train and test set
- Fit pipeline on train set
- Get prediction on test set
- Calculate the mse and r2_score
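
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the solution for this dataset: the frame `toy`, its made-up relationship between features and target, and the two-step `toy_pipe` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up): the target depends on two numeric features
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "temp": rng.uniform(0, 35, 500),
    "hum": rng.uniform(20, 100, 500),
})
toy["cnt"] = 5 * toy["temp"] - 0.5 * toy["hum"] + rng.normal(0, 5, 500)

# Separate the target from the prediction features
X = toy.drop(columns=["cnt"])
y = toy["cnt"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a small pipeline on the train set only
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model_rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
toy_pipe.fit(X_train, y_train)

# Predict on the test set and score the predictions
y_pred = toy_pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R2 score:", r2)
```

Fitting on the train split only keeps statistics learned by the transformers (for example scaling parameters) from leaking information out of the test split.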

In [None]:
# YOUR CODE HERE

### Check package versions (may be used for the requirements.txt file)

In [None]:
!pip -qq install pydantic
!pip -qq install strictyaml
!pip -qq install ruamel.yaml


In [None]:
import numpy as np
import pandas as pd
import sklearn
import pydantic
import strictyaml
import ruamel.yaml
import joblib

In [None]:
# YOUR CODE HERE
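
One possible way to collect the installed versions is sketched below, using the standard-library `importlib.metadata`. The package list mirrors the imports in the cell above (with `scikit-learn` as the distribution name for `sklearn`); packages that are not installed are skipped.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the packages imported above
packages = ["numpy", "pandas", "scikit-learn", "pydantic", "strictyaml", "ruamel.yaml", "joblib"]

requirement_lines = []
for name in packages:
    try:
        # Pin the exact installed version, in requirements.txt format
        requirement_lines.append(f"{name}=={version(name)}")
    except PackageNotFoundError:
        # Package is not installed in this environment; skip it
        continue

print("\n".join(requirement_lines))
```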

## **5. Modularize the application**

- Convert the above regression application to a production-style format (.py files) inside VS Code.

- Create different modules specific to functionality:
    - requirements
    - configuration
    - data manager
    - feature engineering
    - pipeline building
    - pipeline training
    - predict
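
For the pipeline training and predict modules, a minimal sketch of saving a fitted pipeline with `joblib` and loading it back for prediction could look like the following. The toy data, the two-step pipeline and the file name `toy_pipeline.pkl` are hypothetical stand-ins, not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

toy_pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
toy_pipe.fit(X, y)

# Training side: persist the fitted pipeline (hypothetical file name, temporary directory)
model_path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(toy_pipe, model_path)

# Prediction side: load the saved pipeline and predict on new input
loaded_pipe = joblib.load(model_path)
preds = loaded_pipe.predict(np.array([[4.0]]))
print(preds)
```

Saving the whole pipeline rather than only the model keeps the preprocessing steps and the regressor together, so the predict module can pass in raw features.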
