<a href="https://colab.research.google.com/github/amitchug/ALMlops/blob/main/M5_NB1_MiniProject_1_PartA_Regression_and_Modularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization

#### (Notebook-1)

## Problem Statement

Predict the bike rental count per hour based on the environmental and seasonal settings (such as weather, day, time, humidity, wind speed, season etc).

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform data exploration and visualization
* perform data preprocessing
* apply ML algorithms on the **Bike Sharing** dataset
* calculate the MSE value of regression techniques

## Dataset Description

The dataset chosen for this mini-project is a modified version of the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). It contains the hourly count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. The dataset consists of 17,379 instances, each with 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:**
    * spring
    * summer
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **humidity:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered [Target column]

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

### Import Required Packages

In [None]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

## **1. Load, Explore and Prepare the Data Set**

* Understand different features in the training dataset
* Understand the data type of each column
* Note which columns have missing values

In [None]:
# Reading Our Dataset
bikeshare = pd.read_csv('bike-sharing-dataset.csv')
bikeshare.shape

In [None]:
bikeshare.head(5)

In [None]:
# Getting information about the dataset
bikeshare.info()

From the output above, it can be seen that there are missing values in the `weekday` and `weathersit` columns.

## **2. Data Processing**


### 2.1 Working on `dteday` column to extract year and month

Extract the year and month from the date column and create two new columns.


In [None]:
def get_year_and_month(dataframe):

    df = dataframe.copy()
    # convert 'dteday' column to Datetime datatype
    df['dteday'] = pd.to_datetime(df['dteday'], format='%Y-%m-%d')
    # Add new features 'yr' and 'mnth'
    df['yr'] = df['dteday'].dt.year
    df['mnth'] = df['dteday'].dt.month_name()

    return df
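
As a quick check of what the function above produces, the sketch below runs the same `.dt` accessor calls on a two-row toy frame. The frame `toy` and its dates are made up for illustration; only the `dteday` string format is taken from this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) using the same 'YYYY-MM-DD' strings as 'dteday'
toy = pd.DataFrame({'dteday': ['2011-01-01', '2012-07-15']})

# Parse the strings, then read the year and the month name via the .dt accessor
toy['dteday'] = pd.to_datetime(toy['dteday'], format='%Y-%m-%d')
toy['yr'] = toy['dteday'].dt.year
toy['mnth'] = toy['dteday'].dt.month_name()

print(toy)
```

Note that `mnth` holds month names (strings), which is why it is later treated as a categorical feature.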

In [None]:
bikeshare = get_year_and_month(bikeshare)
bikeshare.info()

In [None]:
bikeshare.head()

## **3. Data Exploration**

### 3.1 Find numerical and categorical variables

In [None]:
unused_colms = ['dteday', 'casual', 'registered']   # unused columns will be removed at later stage
target_col = ['cnt']

numerical_features = []
categorical_features = []

for col in bikeshare.columns:
    if col not in target_col + unused_colms:
        if bikeshare[col].dtypes == 'float64':
            numerical_features.append(col)
        else:
            categorical_features.append(col)


print('Number of numerical variables: {}'.format(len(numerical_features)),":" , numerical_features)

print('Number of categorical variables: {}'.format(len(categorical_features)),":" , categorical_features)
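
The loop above splits features by dtype. A possible alternative, sketched here on a toy frame with made-up values, is `DataFrame.select_dtypes`, which applies the same rule (float64 columns are numerical, everything else is categorical) without an explicit loop.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking a few of the dataset's columns
toy = pd.DataFrame({
    'temp': [9.8, 14.2],            # float  -> numerical
    'windspeed': [0.0, 19.0],       # float  -> numerical
    'season': ['spring', 'fall'],   # object -> categorical
    'yr': [2011, 2012],             # int    -> categorical under this rule
    'cnt': [16, 40],                # target, excluded below
})

features = toy.drop(columns=['cnt'])
numerical = features.select_dtypes(include='float64').columns.tolist()
categorical = features.select_dtypes(exclude='float64').columns.tolist()

print(numerical)
print(categorical)
```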

### 3.2 Find missing values in variables

In [None]:
# First in numerical variables
bikeshare[numerical_features].isnull().sum()

In [None]:
# Now in categorical variables
bikeshare[categorical_features].isnull().sum()

### 3.3 Determine cardinality of categorical variables

In [None]:
# Count of unique values
bikeshare[categorical_features].nunique()

### 3.4 Determine the distribution of numerical variables

In [None]:
# Visualize distribution using histplot

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(ax = ax[0, 0], x = bikeshare[numerical_features[0]], kde=True)
sns.histplot(ax = ax[0, 1], x = bikeshare[numerical_features[1]], kde=True)
sns.histplot(ax = ax[1, 0], x = bikeshare[numerical_features[2]], kde=True)
sns.histplot(ax = ax[1, 1], x = bikeshare[numerical_features[3]], kde=True)
plt.show()

### 3.5 Check for any outliers in numerical variables

Hint: [Boxplot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html)

In [None]:
# Check for outliers in continuous features
bikeshare[numerical_features].boxplot()
plt.xticks(rotation= 60)
plt.show()

Outliers are present in some numerical columns.

### 3.6 Visualize the hour (`hr`) column with an appropriate plot, and find the busy hours of bike sharing

In [None]:
bikeshare.head(2)

In [None]:
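# Group the dataset w.r.t hour and total the rental count
# (select 'cnt' explicitly; the first argument of sum() is numeric_only, not a column name)
grouped_by_hr = bikeshare.groupby('hr')[['cnt']].sum()
grouped_by_hr.head()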

In [None]:
# Visualize total bike rental count per hour

hour_sequence = ['12am', '1am', '2am', '3am', '4am', '5am', '6am', '7am', '8am', '9am', '10am', '11am',
                 '12pm', '1pm', '2pm', '3pm', '4pm', '5pm', '6pm', '7pm', '8pm', '9pm', '10pm', '11pm']

sns.barplot(x = hour_sequence, y = grouped_by_hr.loc[hour_sequence, 'cnt'], hue = hour_sequence)
plt.xticks(rotation=90)
plt.show()

Bike rental counts are highest in the morning (\~8am) and evening (\~5pm) hours.
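
The busy hours can also be read off numerically rather than from the plot. The sketch below uses a toy frame with made-up counts; `hourly_total` and `busiest` are illustrative names, and the hour labels follow the string format used in `hour_sequence`.

```python
import pandas as pd

# Toy data (hypothetical counts); 'hr' uses the same labels as hour_sequence
toy = pd.DataFrame({
    'hr':  ['8am', '8am', '5pm', '5pm', '3am', '12pm'],
    'cnt': [300,   350,   400,   420,   5,     150],
})

# Total rentals per hour, then the two hours with the largest totals
hourly_total = toy.groupby('hr')['cnt'].sum()
busiest = hourly_total.nlargest(2)

print(busiest)
```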

### 3.7 Visualize the distribution of count, casual and registered variables

In [None]:
# distribution of casual
sns.histplot(bikeshare['casual'], kde=True);
plt.show()

In [None]:
# distribution of registered
sns.histplot(bikeshare['registered'], kde=True);

In [None]:
# distribution of count
sns.histplot(bikeshare['cnt'], kde=True);

### 3.8 Describe the relation of weekday, holiday and working day

In [None]:
# Unique values of 'workingday'
bikeshare['workingday'].unique()

In [None]:
# Check which weekdays are working days (Mon - Fri)
bikeshare[bikeshare.workingday=='Yes'].weekday.unique()

In [None]:
# Check on which weekdays, holiday is possible
bikeshare[bikeshare.holiday=='Yes'].weekday.unique()

In [None]:
# Not a holiday, not a working day (Sat, Sun)
bikeshare[(bikeshare.holiday=='No') & (bikeshare.workingday=='No')].weekday.unique()
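
The three checks above suggest that `workingday` is 'Yes' exactly when the day is neither a weekend nor a holiday. A minimal sketch of testing that rule directly, on a toy frame with made-up rows (the names `derived` and `consistent` are illustrative):

```python
import pandas as pd

# Toy rows (hypothetical) using the same 'Yes'/'No' and three-letter day labels
toy = pd.DataFrame({
    'weekday':    ['Mon', 'Sat', 'Sun', 'Mon', 'Fri'],
    'holiday':    ['No',  'No',  'No',  'Yes', 'No'],
    'workingday': ['Yes', 'No',  'No',  'No',  'Yes'],
})

is_weekend = toy['weekday'].isin(['Sat', 'Sun'])
is_holiday = toy['holiday'] == 'Yes'

# A working day is a day that is neither a weekend day nor a holiday
derived = (~is_weekend & ~is_holiday).map({True: 'Yes', False: 'No'})

# True if the recorded column agrees with the rule on every row
consistent = (derived == toy['workingday']).all()
print(consistent)
```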

### 3.9 Visualize the month-wise count of both casual and registered rentals for the years 2011 and 2012 separately

Hint: [Stacked barchart](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)

In [None]:
# stacked bar chart for year 2011
mnth_sequence = ['January', 'February', 'March', 'April', 'May', 'June',
                 'July', 'August', 'September', 'October', 'November', 'December']

grouped_by_mnth = bikeshare[bikeshare.yr==2011].groupby('mnth')[['casual','registered']].sum()

grouped_by_mnth.loc[mnth_sequence, ['casual','registered']].plot.bar(stacked=True);
plt.title("Casual and Registered in 2011")
plt.show()

In [None]:
# stacked bar chart for year 2012
mnth_sequence = ['January', 'February', 'March', 'April', 'May', 'June',
                 'July', 'August', 'September', 'October', 'November', 'December']

grouped_by_mnth = bikeshare[bikeshare.yr==2012].groupby('mnth')[['casual','registered']].sum()

grouped_by_mnth.loc[mnth_sequence, ['casual','registered']].plot.bar(stacked=True);
plt.title("Casual and Registered in 2012")
plt.show()

## **4. Split the data into train and test set**

**Note:** Apply all your data preprocessing steps in the train set first (to avoid any data leakage), and keep the test set aside.

In [None]:
# Separate target and prediction features
X = bikeshare.drop(target_col, axis=1)
y = bikeshare[target_col]

X.shape, y.shape

In [None]:
# Apply train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

X_train.shape, X_test.shape
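
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on toy arrays (`X_toy` and `y_toy` are made up). It assumes nothing beyond the `train_test_split` call already used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.20 holds out 2 of the 10 rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```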

## **5. Feature Engineering**

### 5.1 Handling missing values in `weekday` column:

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names are full names (e.g. 'Monday'), whereas the `weekday` column contains only the first three letters (e.g. 'Mon').

In [None]:
# Values present in 'weekday' column
X_train['weekday'].unique()

In [None]:
# Function to impute weekday by extracting day name from the date column

def impute_weekday(dataframe):

    df = dataframe.copy()
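    # Row indices where 'weekday' is missing
    wkday_null_idx = df[df['weekday'].isnull()].index
    # print(len(wkday_null_idx))
    # Take the day name from 'dteday' and keep its first three letters (e.g. 'Monday' -> 'Mon')
    df.loc[wkday_null_idx, 'weekday'] = df.loc[wkday_null_idx, 'dteday'].dt.day_name().str[:3]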

    return df

In [None]:
# Impute weekday
X_train = impute_weekday(X_train)

X_train['weekday'].unique()

In [None]:
# Recheck missing values
X_train.isnull().sum()

### 5.2 Handling missing values in `weathersit` column:

- Fill in the missing rows in this column with the most frequent category
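
A minimal sketch of this step on toy data, assuming the most frequent category is learned from the training split and then reused for any other split. The frames `train` and `test` and the name `most_frequent` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the train and test splits (hypothetical values)
train = pd.DataFrame({'weathersit': ['Clear', 'Mist', 'Clear', None, 'Light Rain']})
test = pd.DataFrame({'weathersit': [None, 'Mist']})

# Learn the most frequent category from the training split only
# (mode() ignores missing values by default)
most_frequent = train['weathersit'].mode()[0]

# Assign the filled column back instead of filling in place on a column selection
train['weathersit'] = train['weathersit'].fillna(most_frequent)
test['weathersit'] = test['weathersit'].fillna(most_frequent)

print(most_frequent)
print(test)
```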

In [None]:
# Values present in 'weathersit' column
X_train['weathersit'].unique()

In [None]:
# Unique values and their counts
X_train['weathersit'].value_counts()

In [None]:
# Most frequent category
X_train['weathersit'].mode()[0]

In [None]:
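# Fill missing values in weathersit
# (assign back; inplace=True on a column selection may not update X_train in newer pandas)
X_train['weathersit'] = X_train['weathersit'].fillna('Clear')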

X_train['weathersit'].unique()

In [None]:
X_train.isnull().sum()

### 5.3 Handling outliers in numerical columns:

- Instead of removing the outliers, change their values
    - to upper-bound, if the value is higher than upper-bound, or
    - to lower-bound, if the value is lower than lower-bound respectively.
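
The capping rule above can be worked through on a small series. This is a sketch that assumes the usual 1.5 × IQR bounds; the values are made up, and the second series only illustrates that bounds computed once can be reused on other data.

```python
import pandas as pd

# Toy column (hypothetical values) with one obvious outlier
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0], name='windspeed')

q1 = train.quantile(0.25)          # 3.0
q3 = train.quantile(0.75)          # 7.0
iqr = q3 - q1                      # 4.0
lower_bound = q1 - 1.5 * iqr       # -3.0
upper_bound = q3 + 1.5 * iqr       # 13.0

# clip() replaces values outside the bounds with the nearest bound
capped = train.clip(lower=lower_bound, upper=upper_bound)

# The same bounds can be applied to another series
new_values = pd.Series([-10.0, 5.0, 50.0]).clip(lower=lower_bound, upper=upper_bound)

print(capped.tolist())
print(new_values.tolist())
```

Here the outlier 100.0 becomes 13.0 while the other values are unchanged.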

In [None]:
# Function to handle outliers for a single column

def handle_outliers(dataframe, colm):

    df = dataframe.copy()
    # Quartiles and interquartile range of the column
    q1 = df[colm].quantile(0.25)
    q3 = df[colm].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Cap values outside [lower_bound, upper_bound] at the nearest bound
    df[colm] = df[colm].clip(lower=lower_bound, upper=upper_bound)

    return df

In [None]:
# Handle outliers for all numerical columns

for col in numerical_features:
    X_train = handle_outliers(X_train, col)

In [None]:
# Re-check for outliers in continuous features
X_train[numerical_features].boxplot()
plt.xticks(rotation= 60)
plt.show()

### 5.4 Map `yr` (year) column


In [None]:
# Create a temporary copy of X_train, and add target column to it, for exploration
tmp_df = X_train.copy()
tmp_df['cnt'] = y_train
tmp_df.head(2)

In [None]:
# Visualize the total bike rental count per year
feature = 'yr'
grouped_by_yr = tmp_df.groupby(feature)[['cnt']].sum()
sns.barplot(x = grouped_by_yr.index, y = grouped_by_yr['cnt'], hue = grouped_by_yr.index)
plt.show()

In [None]:
# Treating 'yr' column as Ordinal categorical variable, assign higher value to 2012

yr_mapping = {2011: 0, 2012: 1}
X_train['yr'] = X_train['yr'].apply(lambda x: yr_mapping[x])

In [None]:
X_train.head(2)

### 5.5 Map `mnth` (month) column


In [None]:
# Visualize the total bike rental count per month
feature = 'mnth'
grouped_by_mnth = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_mnth.index, y = grouped_by_mnth['cnt'], hue = grouped_by_mnth.index)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Treat 'mnth' column as Ordinal categorical variable, and assign values accordingly

mnth_mapping = {'January': 0, 'February': 1, 'December': 2, 'March': 3, 'November': 4, 'April': 5,
                'October': 6, 'May': 7, 'September': 8, 'June': 9, 'July': 10, 'August': 11}

X_train['mnth'] = X_train['mnth'].apply(lambda x: mnth_mapping[x])
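
The hand-written mapping above appears to follow the ascending order of total rentals in the bar plot. As a sketch of how such a mapping could be derived in code rather than typed out, the example below ranks categories by their total count on a toy frame. The data and the name `derived_mapping` are illustrative; in this notebook the totals would come from the training split only.

```python
import pandas as pd

# Toy training data (hypothetical counts)
toy = pd.DataFrame({
    'mnth': ['January', 'June', 'June', 'March', 'January', 'March'],
    'cnt':  [10,        200,    220,    80,      15,        90],
})

# Total rentals per month, sorted from lowest to highest
totals = toy.groupby('mnth')['cnt'].sum().sort_values()

# Rank 0 goes to the month with the fewest rentals
derived_mapping = {month: rank for rank, month in enumerate(totals.index)}

toy['mnth_encoded'] = toy['mnth'].map(derived_mapping)

print(derived_mapping)
```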

In [None]:
X_train.head(2)

### 5.6 Map `season` column

In [None]:
# Visualize the total bike rental count per season
feature = 'season'
grouped_by_season = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_season.index, y = grouped_by_season['cnt'], hue = grouped_by_season.index)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Treat 'season' column as Ordinal categorical variable, and assign values accordingly

season_mapping = {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3}

X_train['season'] = X_train['season'].apply(lambda x: season_mapping[x])

In [None]:
X_train.head(2)

### 5.7 Map `weathersit` column

In [None]:
# Visualize the total bike rental count per weather situation
feature = 'weathersit'
grouped_by_weather = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_weather.index, y = grouped_by_weather['cnt'], hue = grouped_by_weather.index)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Map weather situation

weather_mapping = {'Heavy Rain': 0, 'Light Rain': 1, 'Mist': 2, 'Clear': 3}

X_train['weathersit'] = X_train['weathersit'].apply(lambda x: weather_mapping[x])

In [None]:
X_train.head(2)

### 5.8 Map `holiday` column

In [None]:
# Visualize the total bike rental count based on whether the day is holiday
feature = 'holiday'
grouped_by_holiday = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_holiday.index, y = grouped_by_holiday['cnt'], hue = grouped_by_holiday.index)
plt.show()

In [None]:
# Map holiday

holiday_mapping = {'Yes': 0, 'No': 1}
X_train['holiday'] = X_train['holiday'].apply(lambda x: holiday_mapping[x])

In [None]:
X_train.head(2)

### 5.9 Map `workingday` column

In [None]:
# Visualize the total bike rental count based on whether the day is a workingday
feature = 'workingday'
grouped_by_wrkday = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_wrkday.index, y = grouped_by_wrkday['cnt'], hue = grouped_by_wrkday.index)
plt.show()

In [None]:
# Map workingday

workingday_mapping = {'No': 0, 'Yes': 1}

X_train['workingday'] = X_train['workingday'].apply(lambda x: workingday_mapping[x])

In [None]:
X_train.head(2)

### 5.10 Map `hr` (hour) column

In [None]:
# Visualize the total bike rental count per hour
feature = 'hr'
grouped_by_hr = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_hr.index, y = grouped_by_hr['cnt'], hue = grouped_by_hr.index)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Map hour

hour_mapping = {'4am': 0, '3am': 1, '5am': 2, '2am': 3, '1am': 4, '12am': 5, '6am': 6, '11pm': 7, '10pm': 8,
                '10am': 9, '9pm': 10, '11am': 11, '7am': 12, '9am': 13, '8pm': 14, '2pm': 15, '1pm': 16,
                '12pm': 17, '3pm': 18, '4pm': 19, '7pm': 20, '8am': 21, '6pm': 22, '5pm': 23}

X_train['hr'] = X_train['hr'].apply(lambda x: hour_mapping[x])

In [None]:
X_train.head(2)

### 5.11 One-hot Encode `weekday` column

In [None]:
# Visualize the total bike rental count per weekday
feature = 'weekday'
grouped_by_wkday = tmp_df.groupby(feature)[['cnt']].sum().sort_values('cnt')
sns.barplot(x = grouped_by_wkday.index, y = grouped_by_wkday['cnt'], hue = grouped_by_wkday.index)
plt.show()

In [None]:
# Treating 'weekday' column as a Nominal categorical variable, perform one-hot encoding

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X_train[['weekday']])

In [None]:
encoded_weekday = encoder.transform(X_train[['weekday']])
encoded_weekday.shape

In [None]:
# Get encoded feature names
enc_wkday_features = encoder.get_feature_names_out(['weekday'])
enc_wkday_features

In [None]:
# Append encoded weekday features to X_train
X_train[enc_wkday_features] = encoded_weekday
X_train.shape

In [None]:
X_train.head(2)

### 5.12 Remove unnecessary columns

In [None]:
# List of unused columns
unused_colms.append('weekday')
unused_colms

In [None]:
X_train.shape

In [None]:
# Drop columns from X_train
X_train.drop(labels = unused_colms, axis = 1, inplace = True)
X_train.shape

In [None]:
X_train.head(2)

#### Analyze the correlation between features with heatmap

In [None]:
sns.heatmap(X_train.corr(numeric_only=True), cmap='RdBu')
plt.show()

When two features are highly correlated with each other, keeping only one of them is usually enough for model training.

For now, consider all the features.
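
If a list of highly correlated feature pairs is wanted in addition to the heatmap, one possible sketch is shown below on a toy frame. The values are made up, and the cut-off `threshold = 0.9` is an arbitrary assumption rather than a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical values); 'atemp' is nearly a copy of 'temp'
toy = pd.DataFrame({
    'temp':      [0.2, 0.4, 0.6, 0.8, 1.0],
    'atemp':     [0.25, 0.42, 0.61, 0.79, 0.98],
    'windspeed': [0.3, 0.1, 0.4, 0.1, 0.2],
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9   # hypothetical cut-off
high_pairs = [(row, col) for row in upper.index for col in upper.columns
              if upper.loc[row, col] > threshold]

print(high_pairs)
```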

### 5.13 Apply StandardScaler

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train_scaled[0,:]
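
For intuition, `StandardScaler` subtracts the column mean and divides by the column standard deviation, both learned during `fit`. The sketch below checks this by hand on toy arrays (made-up values) and shows the training statistics being reused for new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# fit() learns the mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(train)

# Manual equivalent: (x - mean) / std, using the population std (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(toy_scaler.transform(test))
print(manual)
```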

### 5.14 Create a single function for preprocessing the test set (X_test) and apply it

**Note**: All the pre-processing steps that were applied on the train set before ML Modelling are also applied on the test set before passing through the predict function.

In [None]:
## Create a function for pre-processing test set

def pre_process(dataframe):

    df = dataframe.copy()
    df = impute_weekday(df)
    # Assign back; inplace=True on a column selection may not update df in newer pandas
    df['weathersit'] = df['weathersit'].fillna('Clear')

    for col in numerical_features:
        df = handle_outliers(df, col)

    df['yr'] = df['yr'].apply(lambda x: yr_mapping[x])
    df['mnth'] = df['mnth'].apply(lambda x: mnth_mapping[x])
    df['season'] = df['season'].apply(lambda x: season_mapping[x])
    df['weathersit'] = df['weathersit'].apply(lambda x: weather_mapping[x])
    df['holiday'] = df['holiday'].apply(lambda x: holiday_mapping[x])
    df['workingday'] = df['workingday'].apply(lambda x: workingday_mapping[x])
    df['hr'] = df['hr'].apply(lambda x: hour_mapping[x])

    encoded_weekday_test = encoder.transform(df[['weekday']])
    df[enc_wkday_features] = encoded_weekday_test

    df.drop(labels = unused_colms, axis = 1, inplace = True)

    return df


In [None]:
# Applying above function on X_test
x_test = pre_process(X_test)
x_test.info()
x_test.head()

### 5.15 Apply StandardScaler transformation to x_test

In [None]:
x_test_scaled = scaler.transform(x_test)
x_test_scaled[0,:]

## **6.** Apply multiple ML algorithms and display the Mean squared error and $R^2$ score

### 6.1 LinearRegression

Hint: [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
regr_linear = linear_model.LinearRegression()
regr_linear.fit(X_train_scaled, y_train.values.ravel())

In [None]:
# Prediction for test set
y_pred_lr = regr_linear.predict(x_test_scaled)

# Calculate the score/error
print("R2 score:", r2_score(y_test, y_pred_lr))
print("Mean squared error:", mean_squared_error(y_test, y_pred_lr))
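
To make the two metrics concrete, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. MSE is the mean of the squared residuals, and $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true values and predictions (hypothetical)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat                        # [1, 0, -1, -2]
mse_manual = np.mean(residuals ** 2)              # (1 + 0 + 1 + 4) / 4 = 1.5

ss_res = np.sum(residuals ** 2)                   # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_res / ss_tot                   # 1 - 6/20 = 0.7

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```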

### 6.2 SGD Regressor

Hint: [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)

In [None]:
sgd = linear_model.SGDRegressor()
sgd = sgd.fit(X_train_scaled, y_train.values.ravel())

In [None]:
# Prediction for test set
y_pred_sgd = sgd.predict(x_test_scaled)

# Calculate the score/error
print("R2 score:", r2_score(y_test, y_pred_sgd))
print("Mean squared error:", mean_squared_error(y_test, y_pred_sgd))

### 6.3 Random Forest

Hint: [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [None]:
model_rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

In [None]:
# Fit the model
model_rf.fit(X_train_scaled, y_train.values.ravel())

In [None]:
# Prediction for test set
y_pred = model_rf.predict(x_test_scaled)

# Calculate the score/error
print("R2 score:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))