# TPM034A Machine Learning for socio-technical systems 
## `Assignment 04: Explainable AI and appliance usage prediction`

**Delft University of Technology**<br>
**Q2 2023**<br>
**Instructor:** Giacomo Marangoni <br>
**TAs:**  Francisco Garrido-Valenzuela & Lucas Spierenburg <br>

## `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instructions below).

### `Workspace set-up`

**Option 1: Local environment**<br>
Uncomment the following cell if you are running this notebook in your local environment. This will install all dependencies into your current Python environment.

In [None]:
#!pip install -r requirements_lab04.txt

**Option 2: Google Colab**<br>
Uncomment the following lines of code if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2023
#!pip install -r Q2_2023/requirements_lab04.txt
#!mv "/content/Q2_2023/Assignments/assignment_04/data" /content/data

# `Application: Explainable AI and appliance usage prediction` <br>

#### **Introduction**

In this notebook you are going to train and explain a Random Forest Classifier model to predict the probability of using a given appliance in the next 24 hours.

#### **Data**

1. `data/devices.pkl`: A pickle file with a pandas.DataFrame of hourly energy consumption (in Wh) by appliance within a household of the REFIT dataset, over a period of about two years.
2. `data/weather.pkl`: A pickle file with a pandas.DataFrame of normalized weather variables: `dwpt` is dew point (related to moisture), `rhum` is relative humidity, `temp` is temperature, `wdir` is wind direction, `wspd` is wind speed.
3. `data/price.pkl`: A pickle file with a pandas.Series with electricity day-ahead prices in GBP/MWh.


### **Tasks and grading**

Your assignment is divided into 5 subtasks: (1) Data preparation, (2) Data exploration, (3) Model training, (4) Model explanation and (5) Reflections. In total, 10 points can be earned in this assignment. The weight per subtask is indicated below. 

1. **Data preparation** [1.0 pnt]<br>

2. **Data exploration** [2.0 pnt]<br>
    
3. **Model training** [2.0 pnt]<br>
    
4. **Model explanation** [3.0 pnt]<br>

5. **Reflections** [2.0 pnt]<br>

### **Submission**
- The deadline for this assignment is **Monday 18/12/2023 23:59** 
- Use **Python 3.10 or above**
- Submit your work as a zip file containing the **fully executed** `.ipynb` notebook

### Data preparation

In [None]:
# Load 'data/devices.pkl', 'data/weather.pkl' and 'data/price.pkl'.
# Weather is the same as in the lab session.
# Devices contains Wh consumed by given devices at each timestamp.
# Price contains electricity prices for each timestamp.

In [None]:
# Add a column "Load" to devices as the sum of all appliances' consumption

In [None]:
# Merge all the datasets into one DataFrame
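
The two preparation steps above can be sketched on toy data. This is only an illustration: the appliance names, values, and the `price` Series name are made up here and are not taken from the assignment files, and it assumes all three objects share the same hourly timestamp index.

```python
import pandas as pd

# Hypothetical toy data standing in for the three pickled objects.
idx = pd.date_range("2014-03-10", periods=4, freq=pd.Timedelta(hours=1))
devices = pd.DataFrame(
    {"Television": [0, 50, 60, 0], "Kettle": [100, 0, 0, 20]}, index=idx
)
weather = pd.DataFrame({"temp": [0.1, 0.2, 0.3, 0.4]}, index=idx)
price = pd.Series([40.0, 42.0, 41.0, 39.0], index=idx, name="price")

# Row-wise sum over the appliance columns gives the total household load.
devices["Load"] = devices.sum(axis=1)

# Column-wise concatenation aligns the objects on their timestamp index.
data = pd.concat([devices, weather, price], axis=1)
```

If the real price Series has no name, giving it one (for example with `rename`) before concatenating keeps the resulting column label readable.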

In [None]:
# Consider zeros in temperature ("temp" column) as NAs, and interpolate the resulting missing values linearly
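
A minimal sketch of this cleaning step, using a made-up temperature series rather than the assignment data: zeros are first replaced by `NaN`, then filled by linear interpolation between the neighbouring valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized temperatures; the two zeros are treated as missing.
temp = pd.Series([0.5, 0.0, 0.0, 0.2, 0.3])

# Mark zeros as NaN, then fill the gaps linearly between valid neighbours.
temp_clean = temp.replace(0, np.nan).interpolate(method="linear")
```

Note that `method="linear"` treats the rows as equally spaced, which matches a regular hourly index.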

### Data exploration

In [None]:
# Plot the NA count per day over the whole time range. Are there any evident missing periods?

# Hint: use isna(), groupby(), index.date and sum()
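
The hint can be sketched as follows on a small hypothetical frame (the column names and the positions of the missing values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Two days of hypothetical hourly data with a few missing entries.
idx = pd.date_range("2014-03-10", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"temp": 1.0, "price": 40.0}, index=idx)
df.loc[idx[:3], "temp"] = np.nan   # 3 missing hours on day one
df.loc[idx[30], "price"] = np.nan  # 1 missing hour on day two

# Boolean NA mask, grouped by calendar date, summed per column.
na_per_day = df.isna().groupby(df.index.date).sum()

# Collapse the columns to a single NA count per day.
total_na_per_day = na_per_day.sum(axis=1)
```

Calling `total_na_per_day.plot()` would then show the daily counts over time, with longer gaps appearing as sustained plateaus.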

In [None]:
# In 1 year, from 2014-03-10 to 2015-03-09, which appliance cumulatively consumed the most energy?

In [None]:
# Which appliance was turned on for the highest number of hours (i.e. consumption > 1Wh)?

In [None]:
# Which appliance has the highest consumption per hour?
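
The three questions above reduce to column-wise aggregations over a date slice. The sketch below uses invented appliances and values. It also assumes that "per hour" means the mean consumption over the hours in which the appliance is on, which is only one possible reading of the question.

```python
import pandas as pd

# Hypothetical hourly consumption in Wh for two appliances.
idx = pd.date_range("2014-03-10", periods=6, freq=pd.Timedelta(hours=1))
devices = pd.DataFrame(
    {"Television": [60, 60, 60, 60, 0, 0], "Kettle": [0, 500, 0, 0, 0, 0]},
    index=idx,
)

# Label-based slicing on a DatetimeIndex includes both end dates.
year = devices.loc["2014-03-10":"2015-03-09"]
is_on = year > 1  # "on" means more than 1 Wh in that hour

total_energy = year.sum()                # cumulative Wh per appliance
hours_on = is_on.sum()                   # number of hours switched on
mean_when_on = year.where(is_on).mean()  # average Wh over "on" hours only
```

`idxmax()` on each resulting Series returns the name of the top appliance for that criterion.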

In [None]:
# For each hour of the day (x-axis), plot the fraction of days in the given year (y-axis) on which each appliance (color) is used
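
One way to obtain the plotted quantity, sketched on invented data: threshold the consumption, group by hour of day, and take the mean of the boolean mask. This assumes every day contributes exactly one row per hour, so that the mean over rows equals the fraction of days.

```python
import pandas as pd

# Two days of hypothetical hourly consumption in Wh.
idx = pd.date_range("2014-03-10", periods=48, freq=pd.Timedelta(hours=1))
devices = pd.DataFrame({"Television": 0, "Kettle": 0}, index=idx)
devices.loc[idx.hour == 20, "Television"] = 80  # on at 20:00 on both days
devices.loc[idx[7], "Kettle"] = 300             # on at 07:00 on day one only

# Mean of a boolean mask per hour of day = share of days with usage.
is_on = devices > 1
frac_by_hour = is_on.groupby(is_on.index.hour).mean()
```

`frac_by_hour.plot()` draws one line per appliance, with the hour on the x-axis.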

### Model training

In [None]:
# Prepare train and test datasets with the following characteristics:
# Train data period: from 2014-03-10 to 2015-03-09
# Test data period: 2015-03-10
# y target: usage of television (i.e. consumption > 1 Wh)
# X features:
# - hour (int)
# - weekday (int)
# - weather variables
# - usage 24h before (1 if television was used at the same hour the day before)
# - activity 24h before (1 if any appliance was used at the same hour the day before)
# - usage yesterday (1 if television was used at least 1 hour during the whole day before)
# - price
# Drop NAs

# Hint for computing "usage yesterday": group usage by date, take the max, shift, then reindex to hourly using forward fill 
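
The lagged features can be sketched as below. The usage series is invented, and `shift(24)` assumes a gap-free hourly index (otherwise 24 rows back is not the same hour of the previous day).

```python
import pandas as pd

# Three days of hypothetical hourly television usage (1 = used).
idx = pd.date_range("2014-03-10", periods=72, freq=pd.Timedelta(hours=1))
usage = pd.Series(0, index=idx, name="usage")
usage.loc[pd.Timestamp("2014-03-10 20:00")] = 1  # used once, on day one

# Usage at the same hour of the previous day.
usage_24h_before = usage.shift(24)

# Usage yesterday: daily max, shifted by one day, spread back over the hours.
daily_max = usage.groupby(usage.index.date).max()
daily_max.index = pd.to_datetime(daily_max.index)  # dates -> midnight stamps
usage_yesterday = daily_max.shift(1).reindex(usage.index, method="ffill")
```

The first day has no "yesterday", so its 24 rows are `NaN`, which the later "Drop NAs" step would remove.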

In [None]:
# Train a Random Forest Classifier according to the directions given above.
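
A minimal training sketch with scikit-learn, on synthetic data with only two features. The hyperparameters (`n_estimators=100`, `random_state=0`) and the rule generating `y` are assumptions made for illustration; the assignment does not prescribe them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training set: usage depends only on the hour.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
price = rng.normal(40, 5, size=500)
X = np.column_stack([hour, price])
y = ((hour >= 18) & (hour <= 22)).astype(int)  # "on" during the evening

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Column 1 of predict_proba is the probability of class 1 (appliance used).
X_new = np.array([[14, 40.0], [20, 40.0]])
proba_on = model.predict_proba(X_new)[:, 1]
```

Using `predict_proba` rather than `predict` keeps the output as a probability, which is what the introduction asks the model to deliver.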

In [None]:
# Plot the test vs the predicted Usage

### Model explanation

In [None]:
# Compute the SHAP values for the 24 hours of the test dataset

In [None]:
# Plot the SHAP values for each feature and test sample

In [None]:
# What are the 3 most predictive features?

In [None]:
# Explain the 14:00 and 20:00 predictions of the test day. How could you interpret the difference?

In [None]:
# In what hours is the expected probability of watching television highest, across the train dataset? 

In [None]:
# Compare the partial dependence plot of expected probability of watching television (y-axis) by hour (x-axis) with
# a scatter plot of SHAP values (y-axis) for the 24 hours test samples, by hour (x-axis).
# Comment on their similarity/difference.

# Hint: use shap.plots.scatter for the latter
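
For the partial dependence side of this comparison, the quantity can be computed by hand: fix the feature at each grid value for all rows and average the predicted probability. The sketch below uses a synthetic model and data, and the helper `partial_dependence_1d` is a hypothetical name introduced here. scikit-learn's `sklearn.inspection.PartialDependenceDisplay` offers a ready-made alternative. The SHAP scatter plot is not covered by this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data and fitted model.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
price = rng.normal(40, 5, size=500)
X = np.column_stack([hour, price]).astype(float)
y = ((hour >= 18) & (hour <= 22)).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)


def partial_dependence_1d(model, X, col, grid):
    """Average P(class 1) when column `col` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value  # overwrite the feature for every sample
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


grid = np.arange(24)
pd_hour = partial_dependence_1d(model, X, col=0, grid=grid)
```

Partial dependence averages over the whole dataset, whereas each SHAP value is a per-sample attribution relative to the model's base value, so the two curves need not coincide even when their shapes agree.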

In [None]:
# What is the expected probability of watching television given the electricity price throughout the train dataset?
# How could one interpret this relationship?

### Reflections

In [None]:
# What strategies could you use to improve the accuracy of predicting TV usage? What are the implications for interpretability?

In [None]:
# What could be the benefits to a user of an XAI-informed model for predicting appliance usage? What could be the risks?