This notebook explains how to explore and prepare data for model building. It is structured as follows:

 - Problem Definition
 - Data Gathering and Import
 - Data Wrangling/Cleaning
 - Exploratory Data Analysis
 - Data Modeling
 - Prediction

 References/Source:
    - https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
    - https://github.com/usm-cos422-522/courseMaterials/blob/main/Labs/titanic-workflow.ipynb
    - https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile
    - https://www.kaggle.com/miteshyadav/comprehensive-eda-with-xgboost-top-10-percentile

## Problem Definition

#### Goal

To forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. by combining historical usage patterns with weather data.

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

#### **Data Fields**

* dteday - date
* season -  1 = spring, 2 = summer, 3 = fall, 4 = winter
* yr - year
* mnth - month
* hr - hour
* holiday - whether the day is considered a holiday
* weekday - day of the week
* workingday - whether the day is neither a weekend nor holiday
* weathersit -
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
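* temp - temperature in Celsius (normalized to the 0-1 range in the UCI file)
* atemp - "feels like" temperature in Celsius (normalized to the 0-1 range in the UCI file)
* hum - relative humidity (normalized to the 0-1 range)
* windspeed - wind speed (normalized to the 0-1 range)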
* casual - number of non-registered user rentals initiated
* registered - number of registered user rentals initiated
* cnt - number of total rentals (Dependent Variable)



##  Data Gathering and Import

In [None]:
import pylab
import calendar
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from datetime import datetime
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# libraries for reading url based files
import os
import tarfile
import urllib.request

# libraries for recoding fields and pipeline construction
from sklearn.impute import SimpleImputer 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder

# libraries for model building
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import mean_squared_error

pd.options.mode.chained_assignment = None
%matplotlib inline

### **Read In The Dataset from the UCI data repository**

The data file is located at https://archive.ics.uci.edu/ml/machine-learning-databases/00275/. We download the zip file and then extract the hourly data.

In [None]:

DOWNLOAD_ROOT = "https://archive.ics.uci.edu/ml/machine-learning-databases/00275/"
LOCAL_DATA_PATH = os.path.join("datasets", "bikeshare") + "/"
FILE_NAME = "Bike-Sharing-Dataset.zip"

def fetch_bikeshare_data(file_name=FILE_NAME, bikeshare_url=DOWNLOAD_ROOT, bikeshare_path=LOCAL_DATA_PATH):
    """Download the zipped dataset into bikeshare_path."""
    os.makedirs(bikeshare_path, exist_ok=True)
    zip_path = os.path.join(bikeshare_path, file_name)
    url = bikeshare_url + file_name
    urllib.request.urlretrieve(url, zip_path)

In [None]:
fetch_bikeshare_data()

In [None]:
!unzip -o datasets/bikeshare/Bike-Sharing-Dataset -d ./datasets/bikeshare
!ls datasets/bikeshare

In [None]:
df = pd.read_csv('./datasets/bikeshare/hour.csv',parse_dates=['dteday'])
df.head()

### Exploring Data Structure and Features


As a first step, let's check three simple things about the dataset:

 - The size of the dataset
 - A glimpse of the data, by printing a few rows
 - The types of the variables that make up our data

#### **Shape Of The Dataset**

In [None]:
df.shape

#### **Sample Of First Few Rows**

In [None]:
df.head(10)

#### **Variables Data Type**

In [None]:
df.dtypes

#### Do we have missing values?
Next we check whether the data contains any missing values. Luckily, the dataset has none.

In [None]:
# df is the full dataset at this point (the train/test split happens later)
pd.DataFrame({'Number of Missing Values': df.isna().sum(),
              '% of Missing Values': (df.isna().sum()/df.shape[0] * 100).round(2)})

### Visualize Distribution Of Data
The figures below show that the "cnt" variable is skewed to the right. A roughly symmetric target is often easier to model, and linear regression in particular tends to behave better when its residuals are close to Normal. One possible remedy, which we return to after removing outliers, is a log transformation of "cnt".

In [None]:
df.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Histogram of cnt, our dependent variable, for a closer look
sns.set_style('darkgrid')
sns.histplot(df['cnt'], bins = 100, color = 'green')
plt.show()

In [None]:
# Boxplot for count
# The whiskers extend from the box by 1.5x the inter-quartile range (IQR)
sns.boxplot(x = 'cnt', data = df, color = 'mediumpurple')
plt.show()

The three charts above tell us a lot about our target variable.

 - Our target variable, cnt, is not normally distributed.
 - There are multiple outliers in the variable. We could remove the points that lie beyond 1.5x the IQR, or those more than 3 standard deviations from the mean. We choose the latter.
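
The two outlier rules mentioned above can be compared side by side. The sketch below is only an illustration: it runs on a synthetic right-skewed sample rather than the bike data, and the helper names `iqr_outlier_mask` and `zscore_outlier_mask` are hypothetical, not part of this notebook.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def zscore_outlier_mask(s: pd.Series, n_std: float = 3.0) -> pd.Series:
    """True where a value lies more than n_std standard deviations from the mean."""
    return (s - s.mean()).abs() > n_std * s.std()


# Synthetic, right-skewed stand-in for an hourly rental count (assumption).
rng = np.random.default_rng(0)
demo = pd.Series(rng.gamma(shape=1.2, scale=150.0, size=5000).round())

iqr_mask = iqr_outlier_mask(demo)
z_mask = zscore_outlier_mask(demo)
print("flagged by 1.5 x IQR:", int(iqr_mask.sum()))
print("flagged by 3 std    :", int(z_mask.sum()))
```

On right-skewed data like this, the IQR rule tends to flag more points than the 3 standard deviation rule, so the latter is the more conservative choice.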

### Create training and test dataframes

Strategy: use the first 24 days of each month as training data and the remaining days as test data.


In [None]:
cutoff_day = 24
train_df = df[df.dteday.dt.day <=cutoff_day]
test_df = df[df.dteday.dt.day>cutoff_day]
print("training rows", train_df.shape[0])
print("test rows", test_df.shape[0])
print("training ratio", train_df.shape[0]/df.shape[0])

#### **Let's Remove Outliers In The Count Column**

In [None]:
outliers = train_df[np.abs(train_df["cnt"]-train_df["cnt"].mean())>(3*train_df["cnt"].std())]
print((len(outliers)/len(train_df))*100)                                                  

In [None]:
print(outliers.shape)
print(train_df.shape)

In [None]:
# Data without the outliers in count: drop the flagged rows by index label,
# which also keeps the original column dtypes intact
train_df = train_df.drop(index=outliers.index)
train_df.shape

#### Visualizing Distribution Of Count Data after removing outliers
Even after the outliers are removed, the figures below show that "cnt" is still skewed to the right. One possible remedy is a log transformation of "cnt". After the transformation the distribution looks a lot better, although it still does not follow a Normal distribution closely.

In [None]:
fig,axes = plt.subplots(ncols=2,nrows=1)
fig.set_size_inches(6, 5)
sns.histplot(train_df["cnt"],ax=axes[0])
sns.histplot(np.log(train_df["cnt"]),ax=axes[1])
axes[0].set(xlabel='number of rentals', ylabel='Count',title="cnt histogram")
axes[1].set(xlabel='log number of rentals',title="log cnt histogram")
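
The effect of the log transformation can also be put into numbers. The sketch below is a minimal illustration on synthetic right-skewed counts (an assumption, not the bike data): it compares the sample skewness before and after the transform using pandas' `Series.skew()`. It uses `np.log1p` rather than the `np.log` call above, a swapped-in variant that stays defined for zero counts, and `np.expm1` to map values back to the original scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed counts standing in for train_df["cnt"] (assumption).
counts = pd.Series(rng.gamma(shape=1.2, scale=150.0, size=5000).round())

log_counts = np.log1p(counts)  # log(1 + x), defined even when x == 0
print("skew before:", round(float(counts.skew()), 3))
print("skew after :", round(float(log_counts.skew()), 3))

# Predictions made on the log scale must be mapped back to rental counts.
restored = np.expm1(log_counts)
print("round trip ok:", bool(np.allclose(restored, counts)))
```

A skewness near zero indicates a roughly symmetric distribution. The log transform can overshoot and leave some skew in the opposite direction, which would match the remark that the result is better but still not Normal.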

#### Correlation Analysis
One common way to understand how a dependent variable is influenced by features (numerical) is to build a correlation matrix. 


In [None]:
# select a subset of variables we are interested in 

corr = train_df[['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed','cnt']].corr()
corr['cnt'].sort_values(ascending=False)

 - "temp" and "hum" are positively and negatively correlated with "cnt", respectively. Neither correlation is very strong, but both features still carry some information about "cnt".
 - "windspeed" is not a very useful numerical feature, as its weak correlation with "cnt" shows.
 - "temp" and "atemp" are strongly correlated with each other, so keeping both would introduce multicollinearity. One of them has to be dropped during model building; the feature lists later in this notebook drop "temp" and keep "atemp".
 - "casual" and "registered" are not taken into account either: "cnt" is their sum, so they are leakage variables and need to be dropped during model building.
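
The multicollinearity and leakage points above can each be checked with one line of pandas. The sketch below is a hedged illustration on a synthetic frame: the column names mirror the dataset, but the values are made up so that "atemp" is nearly a copy of "temp" and "cnt" is an exact sum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
temp = rng.uniform(0.0, 1.0, size=n)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0.0, 0.03, size=n),  # nearly a copy of temp
    "casual": rng.poisson(30, size=n),
    "registered": rng.poisson(120, size=n),
})
demo["cnt"] = demo["casual"] + demo["registered"]  # target is an exact sum

# Multicollinearity: a pairwise correlation close to 1 means one column is redundant.
temp_atemp_corr = demo["temp"].corr(demo["atemp"])
print("corr(temp, atemp):", round(float(temp_atemp_corr), 3))

# Leakage: the two user-type columns reproduce the target exactly.
is_exact_sum = bool((demo["casual"] + demo["registered"]).eq(demo["cnt"]).all())
print("cnt == casual + registered:", is_exact_sum)
```

To run the same checks on the real data, replace `demo` with `train_df`.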


#### Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)

 - People tend to rent bikes mostly during the summer, when conditions are most conducive to riding. June, July and August therefore show relatively higher demand.
 - On weekdays more people rent bicycles around 7AM-8AM and 5PM-6PM, which can be attributed to regular school and office commuters.
 - This pattern is not observed on Saturday and Sunday, when more people rent bicycles between 10AM and 4PM.
 - The peaks around 7AM-8AM and 5PM-6PM are driven by registered users.

In [None]:
fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4)
fig.set_size_inches(12,20)

# Average rentals per month
monthAggregated = pd.DataFrame(train_df.groupby("mnth")["cnt"].mean()).reset_index()
sns.barplot(data=monthAggregated,x="mnth",y="cnt",ax=ax1)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")

# Average rentals per hour, one line per season
hourAggregated = pd.DataFrame(train_df.groupby(["hr","season"],sort=True)["cnt"].mean()).reset_index()
sns.pointplot(data=hourAggregated,x="hr",y="cnt",hue="season",ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season")

# Average rentals per hour, one line per weekday
hourAggregated = pd.DataFrame(train_df.groupby(["hr","weekday"],sort=True)["cnt"].mean()).reset_index()
sns.pointplot(data=hourAggregated,x="hr",y="cnt",hue="weekday",ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays")

# Average rentals per hour, split by user type (casual vs registered)
hourTransformed = pd.melt(train_df[["hr","casual","registered"]], id_vars=['hr'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hr","variable"],sort=True)["value"].mean()).reset_index()
sns.pointplot(data=hourAggregated,x="hr",y="value",hue="variable",hue_order=["casual","registered"],ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type")

**We have now explored the data visually in some depth. Let's build some models and see how closely we can predict the results.**

###  Drop, recode, and normalize columns

In [None]:
# Columns to one-hot encode, columns to scale, and columns to drop

categoricalFeatures = ["weathersit","holiday","season","workingday","weekday","mnth","yr","hr"]
numericalFeatures = ["hum","windspeed","atemp"]
dropFeatures = ['instant','casual',"dteday","registered","temp"]

In [None]:
df_sub = train_df.drop(dropFeatures, axis=1)
df_num = df_sub[numericalFeatures]
df_cat = df_sub[categoricalFeatures]

In [None]:
df_sub.head()

In [None]:
df_num.head()

In [None]:
df_cat.head()

In [None]:
bike_y = train_df['cnt']
bike_y.shape

In [None]:
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
        ])

In [None]:
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, numericalFeatures),
        ("cat", OneHotEncoder(), categoricalFeatures),
        ])
bike_X = full_pipeline.fit_transform(train_df)

In [None]:
# the pipeline returns a sparse matrix; let's look at the first two rows
bike_X[:2].toarray()

In [None]:
bike_X.shape
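
The width of the transformed matrix follows directly from the feature lists: one column per numerical feature plus one column per distinct level of each categorical feature. The sketch below illustrates this on a small synthetic frame (the values are made up, and only a few of the notebook's columns are included), assuming the same `ColumnTransformer` and `OneHotEncoder` setup as above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(3)
n = 200
toy = pd.DataFrame({
    "atemp": rng.uniform(0.0, 1.0, size=n),
    "hum": rng.uniform(0.0, 1.0, size=n),
    "season": rng.integers(1, 5, size=n),    # up to 4 levels
    "weekday": rng.integers(0, 7, size=n),   # up to 7 levels
})
num_cols, cat_cols = ["atemp", "hum"], ["season", "weekday"]

ct = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])
X = ct.fit_transform(toy)

# One column per numeric feature plus one per observed category level.
expected_width = len(num_cols) + sum(toy[c].nunique() for c in cat_cols)
print("transformed shape:", X.shape)
print("expected columns :", expected_width)
```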

## Data Modeling

### **Linear Regression Model** ##

In [None]:
# Initialize linear regression model
lModel = LinearRegression()

# Train the model
lModel.fit(X = bike_X,y = bike_y)

# Make predictions on the training data and compute the root mean squared error (RMSE)
count_preds = lModel.predict(X=bike_X)
lin_mse = mean_squared_error(bike_y, count_preds)
lin_rmse = np.sqrt(lin_mse)
print("RMSE Value For Linear Regression: ", lin_rmse)
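
RMSE and RMSLE are easy to confuse. The cell above takes the square root of the mean squared error, which is the RMSE, measured in rentals. The sketch below shows both metrics side by side on three made-up values. The helper names `rmse` and `rmsle` are hypothetical, and clipping negative predictions to zero before taking the logarithm is an assumption made here because a linear model can predict negative counts.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))


y_true = [10, 100, 400]
y_pred = [20, 110, 390]  # every prediction is off by exactly 10 rentals
print("RMSE :", rmse(y_true, y_pred))
print("RMSLE:", rmsle(y_true, y_pred))
```

RMSE weights all three errors equally because they have the same absolute size, while RMSLE is dominated by the first pair, where the prediction is off by a factor of about two.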

### Better Evaluation Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lModel, bike_X, bike_y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
rmse_scores

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:

display_scores(rmse_scores)

## Prediction
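
A minimal sketch of how the prediction step could look, under stated assumptions: the preprocessing and the regressor are wrapped in a single `Pipeline` so that the held-out rows are transformed with statistics learned from the training rows only. The data is synthetic, the target formula is hypothetical, and `handle_unknown="ignore"` is an added safeguard for category levels that appear only in the test rows. On the real data, `train_df` and `test_df` would take the place of `train` and `test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(7)
n = 600
demo = pd.DataFrame({
    "day": rng.integers(1, 31, size=n),   # day of month, used only for the split
    "hr": rng.integers(0, 24, size=n),
    "atemp": rng.uniform(0.0, 1.0, size=n),
})
# Hypothetical target: busier at 8:00 and 17:00, and in warmer hours.
demo["cnt"] = (50 + 200 * demo["atemp"] + 150 * demo["hr"].isin([8, 17])
               + rng.normal(0, 10, size=n)).clip(lower=1).round()

# Same split rule as above: first 24 days train, remaining days test.
train, test = demo[demo["day"] <= 24], demo[demo["day"] > 24]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["atemp"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["hr"]),
])
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Fit on training rows only; test rows reuse the fitted scaler and encoder.
model.fit(train[["hr", "atemp"]], train["cnt"])
test_preds = model.predict(test[["hr", "atemp"]])

test_rmse = float(np.sqrt(mean_squared_error(test["cnt"], test_preds)))
print("held-out RMSE:", round(test_rmse, 2))
```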