<a href="https://colab.research.google.com/github/desankha88/desankha88/blob/main/M5_NB2_MiniProject_1_PartA_Regression_and_Modularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization (Pipeline Building)

#### (Notebook-2)

## Problem Statement

Predict the bike rental count per hour based on the environmental and seasonal settings (such as weather, day, time, humidity, wind speed, season etc).

## Learning Objectives

At the end of the mini-project, you will be able to :

* create custom classes required for data processing
* implement pipeline and train the model
* save the model/pipeline
* make prediction using the saved model/pipeline

## Dataset Description

The dataset chosen for this mini-project is a modified version of [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).  This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the capital bike share system with the corresponding weather and seasonal information. This dataset consists of 17379 instances of each 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return has become automatic. Through these systems, the user can easily rent a bike from a particular position and return to another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. As opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position are explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected via monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:**
    * spring
    * summer
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog<br>   
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **humidity:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered [Target column]

In [71]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

bike-sharing-dataset.csv
bike-sharing-dataset.csv.1
bike-sharing-dataset.csv.2
Dataset downloaded successfully!


### Import Required Packages

In [72]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [73]:
# ========== NEW IMPORTS FOR PIPELINE BUILDING ========

# to create pipeline
from sklearn.pipeline import Pipeline

# for including custom preprocessors within pipeline
from sklearn.base import BaseEstimator, TransformerMixin

## **1. Pre-Pipeline-Steps:**

### 1.1 Load, Explore, and Prepare the Data Set

* Load the dataset
* Understand different features in the training dataset
* Understand the data types of each columns
* Notice the columns of missing values

In [74]:
# YOUR CODE HERE
df = pd.read_csv('/content/bike-sharing-dataset.csv')
df.head()


Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203


In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     16504 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  16121 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(3), object(7)
memory usage: 1.9+ MB


### 1.2 Working on `dteday` column to extract year and month

- Create a function to extract year and month from the date column and create two another columns
  

In [76]:
# YOUR CODE HERE
def extract_year_month_from_dteday(df):
  df['date'] = pd.to_datetime(df['dteday'])
  df['yr'], df['mnth'] = df['date'].dt.year, df['date'].dt.month
  df = df.drop(columns=['date'])
  return df


In [77]:
df = extract_year_month_from_dteday(df)
df.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,11
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,7
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99,2012,2
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361,2012,3
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203,2011,11


### 1.3 Find numerical and categorical variables

In [78]:
# YOUR CODE HERE
numerical_cols = df.select_dtypes(include=['number']).columns.to_list()
categorical_cols = df.select_dtypes(include=['object']).columns.to_list()
print(numerical_cols)
print(categorical_cols)

['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt', 'yr', 'mnth']
['dteday', 'season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']


In [79]:
from datetime import datetime
nan_weekday_indices = df[df['weekday'].isnull()].index.to_list()
for x in nan_weekday_indices :
    if x <= 20 :
        print(df.iloc[x, df.columns.get_loc("weekday")])
        print(datetime.strptime(df.iloc[x, df.columns.get_loc('dteday')],'%Y-%m-%d').strftime("%a"))

nan
Sun
nan
Tue


In [80]:
df.head(20)

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,11
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,7
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99,2012,2
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361,2012,3
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203,2011,11
5,2012-06-17,summer,9am,No,,No,Clear,18.32,18.9998,68.0,11.0014,91,183,274,2012,6
6,2011-06-03,summer,2am,No,Fri,Yes,Clear,18.32,18.9998,43.0,15.0013,0,12,12,2011,6
7,2012-11-25,winter,2am,No,Sun,No,,2.34,-2.998,44.0,23.9994,1,27,28,2012,11
8,2011-01-14,spring,4pm,No,Fri,Yes,Clear,2.34,-0.0016,41.0,7.0015,3,87,90,2011,1
9,2012-12-23,spring,8am,No,Sun,No,Clear,-1.42,-4.9978,69.0,7.0015,5,43,48,2012,12


## **2. Pipeline-Steps:**

Build custom classes which are compatible with Skearn pipeline for imputation, feature mapping, and any column specific operation.

### **A. Imputation**

#### Build a custom Imputation class compatible with Sklearn for handling missing values in `weekday` column.

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names will contain full names (eg. 'Monday'), and the `weekday` column contains only first three letters (eg. 'Mon').

In [81]:
import numpy as np
from datetime import datetime

In [82]:
class WeekdayImputer(BaseEstimator, TransformerMixin):
    """Custom imputer to fill missing 'weekday' values using 'dteday' column."""

    def __init__(self, date_column="dteday", weekday_column="weekday"):
        self.date_column = date_column
        self.weekday_column = weekday_column

    def fit(self, X: pd.DataFrame, y=None):
        """
        The fit method does not need to store any statistics.
        Just verifies that the required columns exist.
        """
        if self.date_column not in X.columns or self.weekday_column not in X.columns:
            raise ValueError(f"Data must contain '{self.date_column}' and '{self.weekday_column}' columns")

        print('Number of Nan values found in weekday ', len(X[X['weekday'].isnull()].index.to_list()))

        return self  # Returning self to allow method chaining

    def transform(self, X: pd.DataFrame):
        """
        Replaces missing values in the 'weekday' column by extracting the
        corresponding weekday from the 'dteday' column.
        """
        X = X.copy()  # Avoid modifying the original dataframe

        # Identify NaN indices in weekday column
        nan_indices = X[X[self.weekday_column].isnull()].index

        # Convert 'dteday' column to datetime and extract full weekday name
        X.loc[nan_indices, self.weekday_column] = (
            X.loc[nan_indices, self.date_column]
            .apply(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%a'))  # Extract short weekday name
        )

        return X  # Return the transformed dataframe



In [83]:
# Apply weekday imputer

# YOUR CODE HERE

wkd_imputr=WeekdayImputer(date_column='dteday',weekday_column="weekday")
wkd_imputr.fit(df)
#print('Nan value indices found for weekday column', df[df['weekday'].isnull()].index.to_list())
data1=wkd_imputr.transform(df)
print('After transform calculating Null values ', len(data1[data1['weekday'].isnull()].index.to_list()) )



Number of Nan values found in weekday  875
After transform calculating Null values  0


#### Build another custom Imputation class compatible with Sklearn for handling missing values in `weathersit` column.

- Fill in the missing rows in this column with the most frequent category

In [84]:

class WeathersitImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weathersit' column by replacing them with the most frequent category value """

    def __init__(self, weathersit_column_name = 'weathersit'):
        # YOUR CODE HERE
        if not isinstance(weathersit_column_name, str):
            ValueError('weathersit_column_name is not defined or not a string!!')
        else:
            self.weathersit_column_name = weathersit_column_name

    def fit(self, X : pd.DataFrame ):
        # YOUR CODE HERE
        if self.weathersit_column_name not in X.columns:
            raise ValueError(f"Data must contain '{self.weathersit_column_name}'")
        else :
            print('Number of Nan values found ', X[self.weathersit_column_name].isnull().sum())
            self.fill_value = X[self.weathersit_column_name].mode()[0]


    def transform(self, X : pd.DataFrame ):
        # YOUR CODE HERE
        X = X.copy()
        X[self.weathersit_column_name]=X[self.weathersit_column_name].fillna(self.fill_value)
        return X


In [85]:
# Apply weathersit imputer

# YOUR CODE HERE

wi = WeathersitImputer(weathersit_column_name = 'weathersit')
wi.fit(data1)
data1 = wi.transform(data1)
print('After Imputation on weathersit Null value counts : ',data1['weathersit'].isnull().sum())

Number of Nan values found  1258
After Imputation on weathersit Null value counts :  0


In [86]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     17379 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  17379 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
 14  yr          17379 non-null  int32  
 15  mnth        17379 non-null  int32  
dtypes: float64(4), int32(2), int64(3), object(7)
memory usage: 2.0+ MB


In [87]:
data1.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,11
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,7
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99,2012,2
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361,2012,3
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203,2011,11


### **B. Mapping**

#### Build a Mapper class for mapping `yr`, `mnth`, `season`, `weathersit`, `holday`, `workingday`, and `hr` columns.

In [88]:

class Mapper(BaseEstimator, TransformerMixin):
    """
    Ordinal categorical variable mapper:
    Treat column as Ordinal categorical variable, and assign values accordingly
    """

    def __init__(self , column_name : str, mappings: dict):

        if not isinstance(column_name, str):
            raise ValueError("variables should be a str")

        # YOUR CODE HERE
        self.column_name = column_name
        self.mappings = mappings

    def fit(self,X : pd.DataFrame):
        # YOUR CODE HERE
        if self.column_name not in X.columns:
            ValueError('{column_name} not present in DataFrame ')
        return self

    def transform(self, X:pd.DataFrame):
        # YOUR CODE HERE
        X = X.copy()
        X[self.column_name] = X[self.column_name].map(self.mappings).astype(int)
        return X


In [89]:
map_hr = data1['hr'].unique()
map_dict1 = {str(x)+'am':x for x in range(1,12)}
map_dict2 = {str(x-12)+'pm':x for x in range(13,24)}
print(map_dict1)
print(map_dict2)
map_hr_dict = {'12am':0} | map_dict1 | {'12pm':12} | map_dict2
print(map_hr_dict)

{'1am': 1, '2am': 2, '3am': 3, '4am': 4, '5am': 5, '6am': 6, '7am': 7, '8am': 8, '9am': 9, '10am': 10, '11am': 11}
{'1pm': 13, '2pm': 14, '3pm': 15, '4pm': 16, '5pm': 17, '6pm': 18, '7pm': 19, '8pm': 20, '9pm': 21, '10pm': 22, '11pm': 23}
{'12am': 0, '1am': 1, '2am': 2, '3am': 3, '4am': 4, '5am': 5, '6am': 6, '7am': 7, '8am': 8, '9am': 9, '10am': 10, '11am': 11, '12pm': 12, '1pm': 13, '2pm': 14, '3pm': 15, '4pm': 16, '5pm': 17, '6pm': 18, '7pm': 19, '8pm': 20, '9pm': 21, '10pm': 22, '11pm': 23}


In [90]:
# Instantiate mapper for all ordinal categorical features

map_season = {'winter' : 0 , 'fall' : 1 ,'spring' : 2, 'summer' : 3}
map_hr = data1['hr'].unique()
map_dict1 = {str(x)+'am':x for x in range(1,12)}
map_dict2 = {str(x-12)+'pm':x for x in range(13,24)}

map_hr = {'12am':0} | map_dict1 | {'12pm':12} | map_dict2
print(map_hr)
map_holiday = { 'No': 0, 'Yes' : 1 }
#map_weekday = {'Mon': 1, 'Wed':3 , 'Thu': 4, 'Tue':2, 'Sun':0 ,'Fri': 5, 'Sat' : 6 }
map_workingday = {'Yes':1, 'No': 0}
map_weathersit = {'Mist':1 ,'Clear':0, 'Light Rain':2 ,'Heavy Rain':3}

# YOUR CODE HERE
for c in categorical_cols :
    if (c != 'dteday') and (c != 'weekday'):
        print('column_name -', c, '|| unique values = ',data1[c].unique())
        map_obj = Mapper(c,eval('map_' + c))
        map_obj.fit(data1)
        data1 = map_obj.transform(data1)
        print('After Transformation :column_name -', c, '|| unique values = ',data1[c].unique())
        print('********************')


{'12am': 0, '1am': 1, '2am': 2, '3am': 3, '4am': 4, '5am': 5, '6am': 6, '7am': 7, '8am': 8, '9am': 9, '10am': 10, '11am': 11, '12pm': 12, '1pm': 13, '2pm': 14, '3pm': 15, '4pm': 16, '5pm': 17, '6pm': 18, '7pm': 19, '8pm': 20, '9pm': 21, '10pm': 22, '11pm': 23}
column_name - season || unique values =  ['winter' 'fall' 'spring' 'summer']
After Transformation :column_name - season || unique values =  [0 1 2 3]
********************
column_name - hr || unique values =  ['6am' '4am' '11am' '7am' '12pm' '9am' '2am' '4pm' '8am' '1am' '3am' '1pm'
 '10pm' '7pm' '8pm' '2pm' '5pm' '5am' '3pm' '9pm' '10am' '6pm' '12am'
 '11pm']
After Transformation :column_name - hr || unique values =  [ 6  4 11  7 12  9  2 16  8  1  3 13 22 19 20 14 17  5 15 21 10 18  0 23]
********************
column_name - holiday || unique values =  ['No' 'Yes']
After Transformation :column_name - holiday || unique values =  [0 1]
********************
column_name - workingday || unique values =  ['Yes' 'No']
After Transformatio

In [91]:
# Map values for all ordinal categorical features

# YOUR CODE HERE



### **C. Class for Specific operation**

#### Build a Class for handling outliers in numerical columns

- Instead of removing the outliers, change their values
    - to upper-bound, if the value is higher than upper-bound, or
    - to lower-bound, if the value is lower than lower-bound respectively.

In [92]:

class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Change the outlier values:
        - to upper-bound, if the value is higher than upper-bound, or
        - to lower-bound, if the value is lower than lower-bound respectively.
    """

    def __init__(self, column_name : str ):
        # YOUR CODE HERE
        if not isinstance(column_name, str):
            ValueError("variables should be a str")

        self.colm = column_name

    def fit(self, X: pd.DataFrame):
        # YOUR CODE HERE
        return self

    def transform(self , X : pd.DataFrame ):
        # YOUR CODE HERE
        X = X.copy()
        q1 = X.describe()[self.colm].loc['25%']
        q3 = X.describe()[self.colm].loc['75%']
        iqr = q3 - q1
        lower_bound = q1 - (1.5 * iqr)
        upper_bound = q3 + (1.5 * iqr)
        for i in X.index:
            if X.loc[i,self.colm] > upper_bound:
                X.loc[i,self.colm]= upper_bound
            if X.loc[i,self.colm] < lower_bound:
                X.loc[i,self.colm]= lower_bound
        return X



In [93]:
# Instantiate outlier handler for all numerical features

# YOUR CODE HERE
for c in numerical_cols :
    outlr = OutlierHandler(c)
    outlr.fit(data1)
    data1 = outlr.transform(data1)

  X.loc[i,self.colm]= upper_bound


In [111]:
numerical_cols

['temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt',
 'yr',
 'mnth']

In [94]:
# Handle outliers for all numerical columns

# YOUR CODE HERE

In [95]:
data1.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth
0,2012-11-05,0,6,0,Mon,1,1,6.1,3.0014,49.0,19.0012,4,135,139.0,2012,11
1,2011-07-13,1,4,0,Wed,1,0,26.78,28.9988,58.0,16.9979,0,5,5.0,2011,7
2,2012-02-09,2,11,0,Thu,1,0,3.28,-0.9982,52.0,15.0013,4,95,99.0,2012,2
3,2012-03-22,3,7,0,Thu,1,1,14.56,15.0002,100.0,6.0032,29,332,361.0,2012,3
4,2011-11-08,0,12,0,Tue,1,0,16.44,17.0,52.0,8.9981,28,175,203.0,2011,11


In [106]:
X1 = data1
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X[['weekday']])
encoded_weekday = encoder.transform(X[['weekday']])
enc_wkday_features = encoder.get_feature_names_out(['weekday'])
enc_wkday_features


array(['weekday_0.0', 'weekday_1.0'], dtype=object)

In [107]:
X1[enc_wkday_features] = encoded_weekday
X1.shape
X1.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth,weekday_0.0,weekday_1.0
0,2012-11-05,0,6,0,1.0,1,1,6.1,3.0014,49.0,19.0012,4,135,139.0,2012,11,0.0,1.0
1,2011-07-13,1,4,0,1.0,1,0,26.78,28.9988,58.0,16.9979,0,5,5.0,2011,7,0.0,1.0
2,2012-02-09,2,11,0,1.0,1,0,3.28,-0.9982,52.0,15.0013,4,95,99.0,2012,2,0.0,1.0
3,2012-03-22,3,7,0,1.0,1,1,14.56,15.0002,100.0,6.0032,29,332,361.0,2012,3,0.0,1.0
4,2011-11-08,0,12,0,1.0,1,0,16.44,17.0,52.0,8.9981,28,175,203.0,2011,11,0.0,1.0


#### Build a Class to One-hot Encode `weekday` column

In [108]:

class WeekdayOneHotEncoder(BaseEstimator, TransformerMixin):
    """ One-hot encode weekday column """

    def __init__(self, column_name : str):
        # YOUR CODE HERE
        if not isinstance(column_name,str):
            ValueError("Column_name should be a str")
        self.column_name = column_name
        self.encoder = OneHotEncoder(sparse_output=False)


    def fit(self, X : pd.DataFrame):
        # YOUR CODE HERE
        if self.column_name not in X.columns :
            ValueError("{self.column_name} not present in dataframe")
        self.encoder.fit(X[[self.column_name]])
        return self

    def transform(self, X : pd.DataFrame):
        # YOUR CODE HERE
        encoded_weekday = self.encoder.transform(X[[self.column_name]])
        enc_wkday_features = encoder.get_feature_names_out([self.column_name])
        X[enc_wkday_features] = encoded_weekday
        return X


In [109]:
# Treat 'weekday' column as a Categorical variable, perform one-hot encoding

# YOUR CODE HERE
wkdOHE = WeekdayOneHotEncoder('weekday')
wkdOHE.fit(data1)
data1 = wkdOHE.transform(data1)


In [110]:
data1.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,yr,mnth,weekday_0.0,weekday_1.0
0,2012-11-05,0,6,0,1.0,1,1,6.1,3.0014,49.0,19.0012,4,135,139.0,2012,11,0.0,1.0
1,2011-07-13,1,4,0,1.0,1,0,26.78,28.9988,58.0,16.9979,0,5,5.0,2011,7,0.0,1.0
2,2012-02-09,2,11,0,1.0,1,0,3.28,-0.9982,52.0,15.0013,4,95,99.0,2012,2,0.0,1.0
3,2012-03-22,3,7,0,1.0,1,1,14.56,15.0002,100.0,6.0032,29,332,361.0,2012,3,0.0,1.0
4,2011-11-08,0,12,0,1.0,1,0,16.44,17.0,52.0,8.9981,28,175,203.0,2011,11,0.0,1.0


## **3. Build Pipeline**

Build a pipeline and implement all the above class transformers inside the pipeline along with the regressor.

In [115]:
for c in ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']:
    print(f"('{c} Outliers' ,OutlierHandler({c}),")

('temp Outliers' ,OutlierHandler(temp),
('atemp Outliers' ,OutlierHandler(atemp),
('hum Outliers' ,OutlierHandler(hum),
('windspeed Outliers' ,OutlierHandler(windspeed),
('casual Outliers' ,OutlierHandler(casual),
('registered Outliers' ,OutlierHandler(registered),


In [117]:
# YOUR CODE HERE

bike_rental_pipe = Pipeline([

    ('Weekday_imputation', WeekdayImputer(date_column='dteday',weekday_column="weekday")),
    ('Weathersit_imputation', WeathersitImputer(weathersit_column_name = 'weathersit')),

    ##==========Mapper======##
    ('map_season',{'winter' : 0 , 'fall' : 1 ,'spring' : 2, 'summer' : 3}),
    ('map_hr', {'12am': 0, '1am': 1, '2am': 2, '3am': 3, '4am': 4, '5am': 5, '6am': 6, '7am': 7, '8am': 8, '9am': 9, '10am': 10, '11am': 11, '12pm': 12, '1pm': 13, '2pm': 14, '3pm': 15, '4pm': 16, '5pm': 17, '6pm': 18, '7pm': 19, '8pm': 20, '9pm': 21, '10pm': 22, '11pm': 23}),
    ('map_holiday', { 'No': 0, 'Yes' : 1 }),
    ('map_workingday', {'Yes':1, 'No': 0}),
    ('map_weathersit', {'Mist':1 ,'Clear':0, 'Light Rain':2 ,'Heavy Rain':3}),

    # Transformation of age column
    ('temp Outliers' ,OutlierHandler('temp')),
    ('atemp Outliers' ,OutlierHandler('atemp')),
    ('hum Outliers' ,OutlierHandler('hum')),
    ('windspeed Outliers' ,OutlierHandler('windspeed')),
    ('casual Outliers' ,OutlierHandler('casual')),
    ('registered Outliers' ,OutlierHandler('registered')),

    # scale
    ('scaler', StandardScaler()),

    # Model fit
    ('model_rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42))
])

## **4. Fit Pipeline**

- Separate target and prediction features
- Split data into train and test set
- Fit pipeline on train set
- Get prediction on test set
- Calculate the mse and r2_score

In [120]:
# YOUR CODE HERE
x=data1.drop('cnt', axis=1)
y=data1['cnt']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
X_train.shape, X_test.shape

((13903, 17), (3476, 17))

### Check for package versions may be used for requirements.txt file

In [121]:
!pip -qq install pydantic
!pip -qq install strictyaml
!pip -qq install ruamel.yaml

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m123.9/123.9 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.7/117.7 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m739.1/739.1 kB[0m [31m19.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [122]:
import numpy as np
import pandas as pd
import sklearn
import pydantic
import strictyaml
import ruamel.yaml
import joblib

In [None]:
# YOUR CODE HERE

## **5. Modularize the application**

- Convert the above regression application to a production environment format (.py files) inside VS code.

- Create different modules specific to functionality:
    - requirements
    - configuration
    - data manager
    - feature engineering
    - pipeline building
    - pipeline training
    - predict
