# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 23: Final Project I. Regression Problem: Can you predict how many riders there will be on one path given how many are on another? 

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com). 

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### 0. Setup

In [1]:
### --- Setup - importing the libraries

# - suppress those annoying 'Future Warning'
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# - data
import numpy as np
import pandas as pd
from datetime import datetime

# - os
import os

# - ml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - parameters
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'
sns.set_theme()

# - rng
rng = np.random.default_rng(1234)

# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

### 1. The dataset

In this exercise you will use `sklearn.ensemble.RandomForestRegressor` (the Random Forest ensemble model for regression) to train a model that predicts the number of riders on a given path from the numbers of riders present on other cycling paths and some time-stamped data.

The data set for this exercise is provided in your `_data` directory as `dss2023_finalProject_01.csv`.

It is based on the **Montreal bike lanes: Use of bike lanes in Montreal city in 2015** data from Kaggle ([source](https://www.kaggle.com/datasets/pablomonleon/montreal-bike-lanes)). We did some data preparation, but you will have to take care of the rest. All necessary steps will be formulated precisely: the rest is you and Python!  

#### 1.1 Load the dataset

Load the `dss2023_finalProject_01.csv`; make sure to use `index_col=[0]` in your call to `pd.read_csv()`.
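To see what `index_col=[0]` does, here is a minimal sketch on a tiny in-memory CSV. The column names `path_A` and `path_B` and all values are made up for illustration; the real file will look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: the unnamed first column is a saved index.
csv_text = """,Date,path_A,path_B
0,2015-01-01,12,30
1,2015-01-02,40,55
2,2015-01-03,7,21
"""

# index_col=[0] turns the first CSV column into the DataFrame index
# instead of loading it as an extra 'Unnamed: 0' column.
data_set = pd.read_csv(io.StringIO(csv_text), index_col=[0])

# For the real file the call would be along these lines (data_dir comes from the Setup cell):
# data_set = pd.read_csv(os.path.join(data_dir, 'dss2023_finalProject_01.csv'), index_col=[0])

print(data_set.head())
print(data_set.dtypes)
```

Checking `data_set.dtypes` right after loading is a quick way to find out whether `Date` arrived as text or as a datetime.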

In [1]:
### << YOUR CODE HERE >>

#### 1.2 Produce new categorical predictors from the `Date` variable.

All values from `Date` are from 2015, so we can disregard the year information safely. However, we want to extract two categorical features from the `Date` column:

- the day of week (e.g. "Monday", "Thursday", "Sunday", etc)
- month (e.g. "January", "February", etc).

If we do not do this, our data will be treated as a pure time series, and we do not want to use linear models (such as Poisson regression, for example) on auto-correlated data! The model will need some information on at least the month and the day of the week in order to figure out the time dependencies.

So:

- use all your available knowledge to figure out how to extract the values for two new columns in `data_set` from the `Date` column:
   - the first new column will be `dayOfWeek` (e.g. "Monday", "Thursday", "Sunday", etc)
   - the second new column will be `month` (e.g. "January", "February", etc);
- Google (!) to find out how to turn dates from a Pandas DataFrame column into these values - you can also ask ChatGPT to help you if you prefer, as long as **you are able to understand the code that it suggests, test it, and figure out if it really does what is required**!
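One possible route goes through the pandas `.dt` accessor. The sketch below uses a toy `Date` column in ISO format, which is an assumption: check how dates are actually written in your file, since `pd.to_datetime()` may need a `format=` or `dayfirst=` argument to parse them correctly.

```python
import pandas as pd

# Toy data: the ISO date format used here is an assumption, not the real file's format.
data_set = pd.DataFrame({'Date': ['2015-01-01', '2015-02-14', '2015-07-05']})

# Parse the text into datetimes first; the .dt accessor only works on datetime columns.
dates = pd.to_datetime(data_set['Date'])

# Full English names of the weekday and of the month.
data_set['dayOfWeek'] = dates.dt.day_name()
data_set['month'] = dates.dt.month_name()

print(data_set)
```

Always verify a few rows against a calendar: a day/month mix-up in parsing produces valid-looking but wrong values.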

In [2]:
### << YOUR CODE HERE >>

#### 1.3 Dummy Coding: produce dummy coding for `dayOfWeek` and `month`.

**N.B.** We want to use `dayOfWeek_Monday` and `month_January` as references (baselines) in the respective categorical predictors. How would you approach this problem with `pd.get_dummies()`? 
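One way to control the baseline, sketched on toy data: `drop_first=True` drops the first level of each variable, so if the columns are turned into categoricals with "Monday" and "January" listed first, those become the references. The `riders` column and its values are hypothetical. An alternative is to create all dummies and then drop `dayOfWeek_Monday` and `month_January` by name.

```python
import pandas as pd

# Toy data; 'riders' is a hypothetical outcome column.
data_set = pd.DataFrame({
    'dayOfWeek': ['Monday', 'Tuesday', 'Sunday', 'Monday'],
    'month': ['January', 'March', 'January', 'December'],
    'riders': [10, 25, 7, 3],
})

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Explicit category order: the first category is the one drop_first removes.
data_set['dayOfWeek'] = pd.Categorical(data_set['dayOfWeek'], categories=day_order)
data_set['month'] = pd.Categorical(data_set['month'], categories=month_order)

# dtype=int gives 0/1 columns instead of booleans.
data_set = pd.get_dummies(data_set, columns=['dayOfWeek', 'month'],
                          drop_first=True, dtype=int)

print(data_set.columns.tolist())
```

Note that with categoricals a dummy column is created for every listed category, even one that never occurs in the data; such a column would be all zeros.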

In [3]:
### << YOUR CODE HERE >>

#### 1.4 Split into 20% validation and 80% training data

Notice the following from the Setup section: 

`from sklearn.model_selection import train_test_split`

Now, it is extremely easy to make an 80/20 data split with `sklearn`: Google and figure out how to do it. You need to produce two new DataFrames, `train_set` (80% of data) and `validation_set` (20% of data). Do it:
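A minimal sketch of such a split on a toy DataFrame follows. The columns `x` and `y` are invented, and the `random_state` value is an arbitrary choice that only makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared data_set.
rng = np.random.default_rng(1234)
data_set = pd.DataFrame({'x': rng.poisson(50, size=100),
                         'y': rng.poisson(40, size=100)})

# test_size=0.2 sends 20% of the rows to the second DataFrame; rows are shuffled by default.
train_set, validation_set = train_test_split(data_set, test_size=0.2,
                                             random_state=1234)

print(train_set.shape, validation_set.shape)
```

Passing a whole DataFrame returns two DataFrames that keep the original index labels, so no row ends up in both sets.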

In [4]:
### << YOUR CODE HERE >>

#### 1.5 Perform a 5-Fold CV of a Random Forest Regressor for the problem at hand

In order to solve this task you will need to combine your understanding of 

- the `sklearn.ensemble.RandomForestRegressor` (Session 22)
- and of [`sklearn` pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that we have used to cross-validate the Poisson Regressor in Session 19.

You need to perform the following steps:

- use only `train_set` for this:
- break down `train_set` into `X` (your feature matrix) and `y` (your outcome)
- create a pipeline with one step, `regressor`: a `RandomForestRegressor` with the `criterion` argument set to `'poisson'` (cf. the [sklearn.ensemble.RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) documentation)
- Define the cross-validation grid in the following way:
   - vary `n_estimators` as [50, 100, 150, 200, 300, 500],
   - vary `max_depth` as [3, 4, 5, 6]
   - vary `min_samples_leaf` as [5, 10, 15, 30]
   - vary `max_features` as [5, 10, 15];
   - all these hyperparameters are well-documented in scikit-learn, so read through the documentation thoroughly!
- define your `GridSearchCV()` object and fit it;
- print out the best parameters and the best model score obtained!
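The steps above can be sketched as follows, on synthetic data and with a deliberately reduced grid so that it runs in a few seconds. The feature names `path_0` ... `path_5`, the way `y` is generated, and the grid values are all assumptions for illustration; in the exercise you would use the grid listed above. Inside a pipeline, each hyperparameter name is prefixed with the step name and a double underscore.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic count data (hypothetical): six "paths" as features, one count outcome.
rng = np.random.default_rng(1234)
X = pd.DataFrame(rng.poisson(50, size=(200, 6)),
                 columns=[f'path_{i}' for i in range(6)])
y = rng.poisson(0.5 * X['path_0'].to_numpy() + 0.3 * X['path_1'].to_numpy())

# One-step pipeline; criterion='poisson' requires non-negative outcomes.
pipeline = Pipeline([
    ('regressor', RandomForestRegressor(criterion='poisson', random_state=1234)),
])

# Reduced grid for illustration only: '<step name>__<hyperparameter>'.
param_grid = {
    'regressor__n_estimators': [20, 50],
    'regressor__max_depth': [3, 4],
    'regressor__min_samples_leaf': [5, 10],
    'regressor__max_features': [2, 4],
}

# 5-fold cross-validation with shuffling.
cv = KFold(n_splits=5, shuffle=True, random_state=1234)
grid_search = GridSearchCV(pipeline, param_grid, cv=cv)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

Since no `scoring` argument is given, `best_score_` is the mean cross-validated value of the estimator's own `score()` method.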

In [5]:
### << YOUR CODE HERE >>

#### 1.6 Now refit the best model on the whole `train_set` w/o cross-validation!

Define your `RandomForestRegressor()` with the best obtained hyperparameters from CV and re-train it on the whole training set w/o cross-validation.
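A sketch of the refit, assuming the best parameters come back with the pipeline prefix `regressor__`. The values in `best_params` below are hypothetical placeholders, not results from the real data, and the training data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (hypothetical).
rng = np.random.default_rng(1234)
X = pd.DataFrame(rng.poisson(50, size=(200, 6)),
                 columns=[f'path_{i}' for i in range(6)])
y = rng.poisson(0.5 * X['path_0'].to_numpy() + 0.3 * X['path_1'].to_numpy())

# Placeholder for grid_search.best_params_ (values are made up).
best_params = {
    'regressor__n_estimators': 50,
    'regressor__max_depth': 4,
    'regressor__min_samples_leaf': 5,
    'regressor__max_features': 4,
}

# Strip the pipeline step prefix so the names match the estimator's arguments.
rf_params = {name.replace('regressor__', ''): value
             for name, value in best_params.items()}

best_model = RandomForestRegressor(criterion='poisson', random_state=1234,
                                   **rf_params)
best_model.fit(X, y)

print(best_model.get_params())
```

With the default `refit=True`, `GridSearchCV` also stores a model refit on all the data it was given as `best_estimator_`; refitting by hand, as here, makes that step explicit.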

In [6]:
### << YOUR CODE HERE >>

What is the model score?
**BTW**, what is the default score used in `RandomForestRegressor()`?
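One way to check your answer empirically is to compare the value returned by `score()` with a metric computed by hand. The sketch below does this for R2 on synthetic data; the model settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data (hypothetical).
rng = np.random.default_rng(1234)
X = rng.poisson(50, size=(200, 4))
y = rng.poisson(0.5 * X[:, 0] + 0.3 * X[:, 1])

model = RandomForestRegressor(n_estimators=20, random_state=1234).fit(X, y)

# score() of a scikit-learn regressor vs. R2 computed from the predictions.
score = model.score(X, y)
r2 = r2_score(y, model.predict(X))

print(score, r2, np.isclose(score, r2))
```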

In [7]:
### << YOUR CODE HERE >>

If you take a look at the Setup section of this notebook, you might notice the following:

`from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score`

Read through the relevant documentation (Google!), then compute 

- MSE
- MAE, and
- R2

for the best model (hint: you will need to use `predict()` first).
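A minimal sketch of the three metrics, using a small forest fit on synthetic data as a stand-in for your best model. All three functions take the observed values first and the predicted values second.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data and a stand-in model (hypothetical settings).
rng = np.random.default_rng(1234)
X = rng.poisson(50, size=(200, 4))
y = rng.poisson(0.5 * X[:, 0] + 0.3 * X[:, 1])
best_model = RandomForestRegressor(n_estimators=50, max_depth=4,
                                   criterion='poisson',
                                   random_state=1234).fit(X, y)

# Predictions come first; the metrics compare them with the observed values.
predicted = best_model.predict(X)

mse = mean_squared_error(y, predicted)
mae = mean_absolute_error(y, predicted)
r2 = r2_score(y, predicted)

print(f'MSE: {mse:.3f}  MAE: {mae:.3f}  R2: {r2:.3f}')
```

MSE is in squared units of the outcome, so its square root (RMSE) is often easier to compare with MAE.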

In [8]:
### << YOUR CODE HERE >>

Plot the `observed` vs. `predicted` values from the best obtained model:
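One common layout is a scatter plot with the identity line added, so that perfect predictions would fall exactly on the line. In the sketch below `observed` and `predicted` are simulated stand-ins; in the exercise they would be your outcome and the output of `predict()`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for the observed outcome and the model predictions.
rng = np.random.default_rng(1234)
observed = rng.poisson(40, size=100)
predicted = observed + rng.normal(0, 5, size=100)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(observed, predicted, alpha=0.6)

# Identity line: points on it are predicted perfectly.
low = min(observed.min(), predicted.min())
high = max(observed.max(), predicted.max())
ax.plot([low, high], [low, high], color='red', linestyle='--')

ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
ax.set_title('Observed vs. Predicted')
```

Points systematically above or below the line in some range of the outcome suggest that the model over- or under-predicts there.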

In [9]:
### << YOUR CODE HERE >>

### 2. And what about the `validation_set`..?

#### 2.1 Provide the best model MSE, MAE, and R2 for the `validation_set`:
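The key point is that the model is fit on `train_set` only and then merely asked to predict on `validation_set`. A sketch under assumptions: the data is synthetic, the column names (including `target`) are invented, `report()` is a hypothetical helper, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical column names).
rng = np.random.default_rng(1234)
data_set = pd.DataFrame(rng.poisson(50, size=(300, 4)),
                        columns=['path_0', 'path_1', 'path_2', 'path_3'])
data_set['target'] = rng.poisson(0.5 * data_set['path_0'].to_numpy()
                                 + 0.3 * data_set['path_1'].to_numpy())

train_set, validation_set = train_test_split(data_set, test_size=0.2,
                                             random_state=1234)
features = [col for col in data_set.columns if col != 'target']

# Fit on the training data only.
model = RandomForestRegressor(n_estimators=50, max_depth=4, min_samples_leaf=5,
                              criterion='poisson', random_state=1234)
model.fit(train_set[features], train_set['target'])

def report(frame):
    """Hypothetical helper: MSE, MAE and R2 of the model on one DataFrame."""
    predicted = model.predict(frame[features])
    return {
        'MSE': mean_squared_error(frame['target'], predicted),
        'MAE': mean_absolute_error(frame['target'], predicted),
        'R2': r2_score(frame['target'], predicted),
    }

train_metrics = report(train_set)
validation_metrics = report(validation_set)
print('train:     ', train_metrics)
print('validation:', validation_metrics)
```

Comparing the two sets of numbers side by side shows how much of the training performance carries over to data the model has never seen.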

In [10]:
### << YOUR CODE HERE >>

#### 2.2 Plot the `observed` vs. `predicted` values from the best obtained model for the `validation_set`

In [11]:
### << YOUR CODE HERE >>

***

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>