# TPM034A Machine Learning for socio-technical systems 
## `Lab session 04:  Explainable AI and energy prediction`

**Delft University of Technology**<br>
**Q2 2023**<br>
**Instructor:** Giacomo Marangoni <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Lab sessions aim to:**<br>
* Show and reinforce how models and ideas presented in class are used in practice.<br>
* Help you gather hands-on machine learning skills.<br>

**Lab sessions are:**<br>
* Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.<br> 
* Not graded and do not have to be submitted. 
### `Use of AI tools`
AI tools, such as ChatGPT and Copilot, are great tools to assist with programming. Moreover, in your later careers you will work in a world where such tools are widely available. As such, we **encourage** you to use AI tools **effectively** (both in the lab sessions and assignments). However, be careful not to overestimate the capacity of AI tools! AI tools cannot replace you: you still have to conceptualise the problem, dissect it and structure it, to conduct proper analysis and modelling. We recommend being especially **reluctant** to use AI tools for the more conceptual and reflection-oriented questions.
### `Workspace set-up`

**Option 1: Local environment**<br>
Uncomment the following cell if you are running this notebook in your local environment. This will install all dependencies into your Python environment.

In [None]:
#!pip install -r requirements_lab04.txt

**Option 2: Google Colab**<br>
Uncomment the code lines in the following cell if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2023
#!pip install -r Q2_2023/requirements_lab04.txt
#!mv "/content/Q2_2023/Lab_sessions/lab_session_04/data" /content/data

### `Application: Explaining the prediction of the next 24-hour electricity load`

#### **Introduction**
In this lab session we will train a linear regression (LR) model and a random forest (RF) regressor to predict the **next 24-hour electricity load** of a household. We will then use the tools of Explainable AI, and in particular SHAP, to better understand how specific predictions are influenced by different features.

#### **Data**

1. `data/load.pkl`: A pickle file with a pandas.DataFrame of the overall hourly energy consumption (in Wh) of a household from the REFIT Electrical Load Measurements dataset, collected over a period of about two years.
2. `data/weather.pkl`: A pickle file with a pandas.DataFrame of normalized weather variables: `dwpt` is dew point (related to moisture), `rhum` is relative humidity, `temp` is temperature, `wdir` is wind direction, `wspd` is wind speed.


### Load and preprocess the datasets

In [None]:
# Use pd.read_pickle to load 'data/load.pkl'

# Check that you get a similar pd.Series:
# Time
# 2013-09-25 19:00:00    410.766578
# 2013-09-25 20:00:00    417.421053
# 2013-09-25 21:00:00    508.165821
# ...

In [None]:
# How many observations does the dataset have? What is the overall time range?

In [None]:
# Use pd.read_pickle to load 'data/weather.pkl'

# Check that you have a pd.DataFrame with columns
# ['temp', 'dwpt', 'rhum', 'wdir', 'wspd']
# and index equal to the load index

In [None]:
# Add an "hour" column to the load df, corresponding to the integer hour of the index
# e.g. 2013-09-25 19:00:00 --> 19
# Hint: convert index into a pd.Series and use the .hour accessor

In [None]:
# Add 3 columns to the load df
# load_lag_24, load_lag_48 and load_lag_72
# corresponding to the value of load 24, 48 and 72 hours before, respectively
# Hint: use the .shift function

# check as example the first five 21:00 samples to make sure you computed the lags correctly
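
The hour and lag features described in the two cells above can be sketched on a small synthetic series. This is only an illustration: the load values are made up, and the sketch assumes a regular hourly index with no missing timestamps, so that shifting by 24 rows equals shifting by 24 hours.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the household load (values are made up)
idx = pd.date_range("2013-09-25 19:00", periods=100, freq="h")
df = pd.DataFrame({"load": np.arange(100, dtype=float)}, index=idx)
df.index.name = "Time"

# Integer hour of each timestamp, e.g. 2013-09-25 19:00:00 --> 19
df["hour"] = df.index.hour

# Lagged copies of the load; the first `lag` rows have no history and become NaN
for lag in (24, 48, 72):
    df[f"load_lag_{lag}"] = df["load"].shift(lag)

# Inspect the first few 21:00 samples to check the lags by eye
print(df[df["hour"] == 21].head())
```

If the real index has gaps, shifting by rows no longer matches shifting by hours, so it is worth checking the index spacing first.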

In [None]:
# Add 6 dummy columns, each equal to 1 if the index falls on the corresponding day of the week and 0 otherwise,
# named "day_name_monday", "day_name_tuesday", ... , "day_name_saturday" (skipping Sunday)
# Hint: use index.dayofweek and pd.get_dummies

# check as example the first seven 21:00 samples to make sure you computed the day_name columns correctly
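
One possible way to build the day-of-week dummies is sketched below on a synthetic hourly index. The `day_names` list and the use of a categorical with a fixed category order are choices made for this illustration, not something prescribed by the lab.

```python
import pandas as pd

# Synthetic hourly index covering two weeks
idx = pd.date_range("2013-09-25 19:00", periods=24 * 14, freq="h")

# dayofweek uses 0 = Monday ... 6 = Sunday
day_names = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]

# A categorical with fixed categories keeps the column order stable
days = pd.Series(
    pd.Categorical([day_names[d] for d in idx.dayofweek], categories=day_names),
    index=idx,
)

# One 0/1 column per day, then drop Sunday so it acts as the reference day
dummies = pd.get_dummies(days, prefix="day_name", dtype=int)
dummies = dummies.drop(columns="day_name_sunday")

print(dummies[idx.hour == 21].head(7))
```

Dropping one category avoids perfectly collinear columns when the dummies are used together with an intercept in the linear regression.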

In [None]:
# Merge the weather and the load df
# Hint: use df.join

In [None]:
# Plot the load of a single day, e.g. 2013-12-12
# Does the shape make sense?

In [None]:
# Plot the average and confidence interval of hourly load for weekdays vs weekends from '2013-11-01' onwards
# Hint: use seaborn.lineplot
# Do you observe any difference?

In [None]:
# Add dummies "hour_1", "hour_2", ... "hour_23" (skipping "hour_0"), equal to 1 for the corresponding index hour
# Hint: use pd.get_dummies

### Train a linear regression model

In [None]:
# Train a linear regression model to predict the load of 2014-12-12 based on the features computed above.
# Use the period 2013-11-01 to 2014-12-11 for training.

In [None]:
# Compute the root mean squared error
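
The date-based train/test split and the root mean squared error can be sketched as follows. The data here is synthetic and the feature names `x1` and `x2` are placeholders; in the lab, the features would be the columns built in the preprocessing cells.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic hourly data with two placeholder features (not the lab data)
rng = np.random.default_rng(0)
idx = pd.date_range("2014-12-01", "2014-12-12 23:00", freq="h")
df = pd.DataFrame(
    {"x1": rng.normal(size=len(idx)), "x2": rng.normal(size=len(idx))},
    index=idx,
)
df["load"] = 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.1, size=len(idx))

features = ["x1", "x2"]

# Label-based slicing on a DatetimeIndex includes the whole end day
train = df.loc["2014-12-01":"2014-12-11"]
test = df.loc["2014-12-12"]

model = LinearRegression().fit(train[features], train["load"])
pred = model.predict(test[features])

# RMSE: square root of the mean squared prediction error
rmse = float(np.sqrt(np.mean((test["load"].to_numpy() - pred) ** 2)))
print(f"RMSE on the test day: {rmse:.3f}")
```

Rows with NaN lag values would have to be dropped before fitting, since `LinearRegression` does not accept missing values.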

### Explain a LR model with SHAP

In [None]:
# Plot the SHAP values for the test set (load values for the hours of the day 2014-12-12)
# Hint: use shap.Explainer and shap.plots.beeswarm
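
For intuition about what the beeswarm plot shows for a linear model: if features are treated as independent, the SHAP value of feature j for a sample x reduces to `coef_j * (x_j - mean(x_j))`, with the mean taken over the background data. The NumPy sketch below computes this by hand on made-up coefficients and data; it does not call the `shap` library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linear model: y = X @ coef + intercept
coef = np.array([2.0, -1.0, 0.5])
intercept = 4.0

def predict(X):
    return X @ coef + intercept

X_train = rng.normal(size=(200, 3))  # background data
X_test = rng.normal(size=(5, 3))     # samples to explain

# Base value: average prediction over the background data
base_value = predict(X_train).mean()

# SHAP values of a linear model under the feature-independence assumption
shap_values = coef * (X_test - X_train.mean(axis=0))

# Additivity: base value plus the per-feature contributions gives the prediction
reconstructed = base_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, predict(X_test)))
```

The additivity check in the last lines is the property that waterfall plots visualise for a single prediction.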

### Train a Random Forest Regressor

In [None]:
# Train a random forest regressor model to predict the load of 2014-12-12.
# Use the period 2013-11-01 to 2014-12-11 for training.
# Do not use the dummies computed above. Use the following features instead:
# ['hour', 'load_lag_24', 'load_lag_48', 'load_lag_72', 'temp',
#    'dwpt', 'rhum', 'wdir', 'wspd', 'weekday']
# where hour and weekday are integers (weekday can be derived from index.dayofweek)

In [None]:
# What is the root mean squared error for the test set? Does the accuracy improve with respect to the linear model?

### Explain a RF model with SHAP

In [None]:
# What is the expected energy consumption knowing the weekday?
# What is the weekday with the highest/lowest expected consumption?
# 
# Hint: use shap.partial_dependence_plot
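
The quantity behind a partial dependence plot can also be computed by hand: fix the feature of interest at a given value for every row, predict, and average. The sketch below does this for a random forest trained on synthetic data in which weekends add a made-up 100 units of load; the helper `partial_dependence` is written for this illustration and is not a library function.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: column 0 is weekday (0 = Monday ... 6 = Sunday), column 1 is temp
rng = np.random.default_rng(0)
n = 500
weekday = rng.integers(0, 7, size=n)
temp = rng.normal(size=n)
y = 300 + 100 * (weekday >= 5) - 20 * temp + rng.normal(scale=5, size=n)
X = np.column_stack([weekday, temp]).astype(float)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction when column `feature` is set to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

pd_weekday = partial_dependence(model, X, feature=0, grid=np.arange(7))
print(pd_weekday.round(1))
```

Because the other features keep their observed values, this average is taken over their overall distribution, not over the rows that actually fall on that weekday.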

In [None]:
# Compare with the mean load of the train set grouped by weekday: are they different? If so, why?

In [None]:
# What is the expected energy consumption knowing the hour? What about temperature? Does the model behavior make sense?

In [None]:
# Plot the SHAP values of each feature for each of the test samples.
# Hint: use shap.plots.beeswarm

In [None]:
# What is the most predictive feature? What is the least predictive? How do you interpret it?
# Hint: use shap.plots.bar
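
To my understanding, the default bar plot ranks features by their mean absolute SHAP value across samples. The sketch below reproduces that ranking with NumPy on a small, made-up SHAP matrix; the feature names and numbers are purely illustrative.

```python
import numpy as np

feature_names = ["hour", "load_lag_24", "temp"]

# Hypothetical SHAP matrix: one row per sample, one column per feature
shap_values = np.array([
    [40.0, -10.0, 2.0],
    [-60.0, 20.0, -1.0],
    [50.0, -30.0, 3.0],
])

# Global importance: mean absolute contribution of each feature
importance = np.abs(shap_values).mean(axis=0)

# Sort features from most to least important
order = np.argsort(importance)[::-1]
ranking = [(feature_names[i], float(importance[i])) for i in order]
print(ranking)
```

Taking absolute values matters: a feature with large positive and negative contributions that cancel on average can still be highly predictive.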

In [None]:
# Explain the 15:00 and 20:00 predictions of the test set. What observations can you make?
# Hint: use shap.plots.waterfall

### Reflections

In [None]:
# Reflect on how the information above could be useful for environmental reasons.