# ML Test for Candidates (2024) - Regression

## Objective

Build a predictive model for electricity consumption.

## Task

An energy supplier provides electricity to its customers in the Canton of Zurich. The customers are mainly businesses, offices and small industries. To buy the energy on the wholesale electricity market on the day before delivery, the supplier needs to predict the total consumption of its customers.

The prediction takes place in the morning of every day and the next day's consumption needs to be forecasted.
For example, on **17.09.2023 08:00** we need to forecast the consumption from **18.09.2023 00:00 - 19.09.2023 00:00**.

The data available at the moment the prediction takes place are:
- the weather forecasts for the next day (you can assume the forecast is perfect for this exercise)
- the consumption of these customers up until midnight at the start of the current day (e.g. on **17.09.2023 08:00** the consumption up until **17.09.2023 00:00** is already known)
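
The timing rules above determine how old a consumption value must be before it can be used as a feature. The following is a minimal sketch, not part of the original task: the helper `forecast_window` is a hypothetical name, and it assumes naive timestamps with no daylight-saving transition between the prediction day and the delivery day.

```python
from datetime import datetime, timedelta


def forecast_window(prediction_time: datetime):
    """Return (known_until, window_start, window_end) for a forecast made at prediction_time.

    Hypothetical helper for illustration; assumes naive timestamps and no DST shift.
    """
    # Consumption is known up until midnight at the start of the prediction day.
    known_until = prediction_time.replace(hour=0, minute=0, second=0, microsecond=0)
    # The forecast covers the whole next day.
    window_start = known_until + timedelta(days=1)
    window_end = window_start + timedelta(days=1)
    return known_until, window_start, window_end


known_until, window_start, window_end = forecast_window(datetime(2023, 9, 17, 8, 0))

# Rows are stamped with the beginning of the hour, so the last known hour
# starts one hour before `known_until`, and the last hour to forecast
# starts one hour before `window_end`.
last_known_hour = known_until - timedelta(hours=1)
last_target_hour = window_end - timedelta(hours=1)

# Smallest lag that is available for every hour of the forecast window.
min_safe_lag = last_target_hour - last_known_hour
print(known_until, window_start, window_end, min_safe_lag)
```

Under these assumptions, a consumption lag shorter than `min_safe_lag` would use values that are not yet known for at least one hour of the forecast window.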

## Dataset

A dataset with the following columns:
- `datetime_utc_from`: Datetime in UTC (the beginning of the hour)
- `consumption_MWh`: The total electricity consumption of the energy supplier's customers (MWh)
- `temperature_celsius`: Average temperature in the Canton of Zurich (°C)
- `global_radiation_J`: Average solar radiation in the Canton of Zurich (J)

## Binder link

https://mybinder.org/v2/gh/VassilisSaplamidis/interview_tasks_quant/main

## Step 0: Load necessary libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt


## Step 1: Feature engineering

### What would be good features to use for this prediction?
- Does temperature affect the electricity consumption patterns?
- Does the amount of sun affect the electricity consumption patterns?
- The customers are mainly industries and offices. Is there some expectation about their consumption pattern over different days/times of year?
- Can you think of other features that might correlate with the electricity consumption of any given day?

You should create these features now to use them in the model afterwards.

### 1. Load the dataset and set the index to datetime

The dataset is loaded here. Note that the index is in UTC.
We also created a second datetime column in the local time zone of the customers, in case you find it useful.

In [None]:
# Load the dataset
data = pd.read_csv('data_raw_regression.csv', delimiter=',')
data.set_index('datetime_utc_from', inplace=True)
data.index = pd.to_datetime(data.index, utc=True)
data['datetime_local_from'] = data.index.tz_convert('Europe/Zurich')

### 2. Create features
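
As one possible starting point, the sketch below derives calendar features from the local time column and lagged consumption features that respect the data available at prediction time. It is an illustration under assumptions, not the expected solution: the data is a synthetic stand-in with the same column names, the new feature names are made up, and the row-based `shift` assumes a regular hourly index without gaps.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (same column names, made-up values).
idx = pd.date_range(
    "2023-01-01", periods=24 * 28, freq=pd.Timedelta(hours=1), tz="UTC",
    name="datetime_utc_from",
)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "consumption_MWh": rng.uniform(50, 150, len(idx)),
        "temperature_celsius": rng.normal(5, 3, len(idx)),
        "global_radiation_J": rng.uniform(0, 100, len(idx)),
    },
    index=idx,
)
data["datetime_local_from"] = data.index.tz_convert("Europe/Zurich")

# Calendar features in local time, since business activity follows the local clock.
local = data["datetime_local_from"]
data["hour"] = local.dt.hour
data["day_of_week"] = local.dt.dayofweek  # Monday=0, Sunday=6
data["is_weekend"] = (data["day_of_week"] >= 5).astype(int)
data["month"] = local.dt.month

# Lagged consumption, shifted by rows (1 row = 1 hour on a gap-free index).
# 48 hours and one week are chosen so the values are already known when
# the forecast is made (ignoring daylight-saving transitions).
data["consumption_lag_48h"] = data["consumption_MWh"].shift(48)
data["consumption_lag_168h"] = data["consumption_MWh"].shift(168)

print(data.tail())
```

The first rows of the lag columns are `NaN` because no earlier consumption exists for them; they need to be handled before training.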

## Step 2: Create the dataset for the model

1. Split the data into features (X) and target (y)<br>
The target column should be `y = data['consumption_MWh']`
2. One-hot encode categorical features (optional)<br>
Why might this be needed?
3. Apply any other preprocessing you want

### 1. Separate features (X) and target (y)

In [None]:
# Split the data into features (X) and target (y)
# The names below are placeholders: replace them with the features you created in Step 1
feature_cols = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', ...]
X = data[feature_cols]
y = data['consumption_MWh']

### 2. One-hot encode categorical features

In [None]:
# One-hot encode categorical features
categorical_features = ['feature4', 'feature6']
encoder = OneHotEncoder(sparse_output=False)
encoded_cats = encoder.fit_transform(X[categorical_features])
encoded_df = pd.DataFrame(encoded_cats, index=X.index, columns=encoder.get_feature_names_out(categorical_features))
X_encoded = pd.concat([X.drop(columns=categorical_features), encoded_df], axis=1)

### 3. Other preprocessing
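
Two preprocessing steps that often matter for time series are removing rows with missing feature values and splitting the data chronologically, so that the test period lies strictly after the training period. The sketch below shows one way to do this on synthetic stand-in data; the column `consumption_lag_168h` and the 80/20 split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the encoded feature matrix and target (made-up values).
idx = pd.date_range("2023-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1), tz="UTC")
rng = np.random.default_rng(0)
y = pd.Series(rng.uniform(50, 150, len(idx)), index=idx, name="consumption_MWh")
X_encoded = pd.DataFrame(
    {
        "temperature_celsius": rng.normal(10, 8, len(idx)),
        "consumption_lag_168h": y.shift(168),  # the first 168 rows are NaN
    },
    index=idx,
)

# Drop rows with missing feature values (here: the first week, where the lag is undefined).
valid = X_encoded.notna().all(axis=1)
X_clean, y_clean = X_encoded[valid], y[valid]

# Chronological split: the first 80% of rows for training, the last 20% for testing.
n_train = int(len(X_clean) * 0.8)
X_train, X_test = X_clean.iloc[:n_train], X_clean.iloc[n_train:]
y_train, y_test = y_clean.iloc[:n_train], y_clean.iloc[n_train:]

print(X_train.index.max(), X_test.index.min())
```

Calling `train_test_split` with `shuffle=False` would give a comparable chronological split; its default shuffling mixes future and past rows, which tends to make the test score look better than it would be on genuinely unseen days.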

## Step 3: Model Building

Remember that the goal is to have a *good predictive model that is robust and can be used to predict unseen data*.
You can play around with different models, methods, objectives and parameters.

- What type of model did you choose? Why?
- Does the model have any tunable parameters? How did you set their value?

If you need to scale columns, the `StandardScaler` might be helpful.
If you need to train models with different parameters, objectives, etc., the `GridSearchCV` class might be useful.
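
To illustrate how `StandardScaler` and `GridSearchCV` can be combined, here is a minimal sketch that wraps preprocessing and a `Ridge` model in a `Pipeline` and tunes the regularization strength. The synthetic data, the two feature names, the `alpha` grid and the use of `TimeSeriesSplit` for cross-validation are all assumptions chosen for the example, not requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, time-ordered training data (made-up values).
rng = np.random.default_rng(0)
n = 24 * 60
X_train = pd.DataFrame(
    {
        "temperature_celsius": rng.normal(10, 8, n),
        "hour": np.tile(np.arange(24), n // 24),
    }
)
y_train = (
    100
    - 0.5 * X_train["temperature_celsius"]
    + 10 * X_train["hour"].between(8, 17)  # higher load during office hours
    + rng.normal(0, 1, n)
)

# Scale the numeric column and one-hot encode the categorical one.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["temperature_celsius"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["hour"]),
    ]
)
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])

# TimeSeriesSplit always validates on data that comes after the training fold.
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

print(search.best_params_, -search.best_score_)
```

Keeping the scaler and encoder inside the pipeline means they are refitted on each training fold, so no information from the validation fold leaks into preprocessing.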

### Train the model

## Step 4: Evaluation

- Can you estimate how well your model performed?
- Do you think it can be used to predict unseen data? Why? Why not?
- What improvements would you make?
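
One way to approach these questions is to compute error metrics on a held-out period and compare them with a simple baseline. The sketch below uses synthetic series as stand-ins for the true consumption and the model predictions; the helper `report`, the choice of MAE and RMSE, and the "same hour one week earlier" baseline are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-ins for the held-out target and the model predictions.
idx = pd.date_range("2023-09-01", periods=24 * 28, freq=pd.Timedelta(hours=1), tz="UTC")
rng = np.random.default_rng(0)
hours = idx.hour.to_numpy()
y_true = pd.Series(
    100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, len(idx)), index=idx
)
y_pred = y_true + rng.normal(0, 1, len(idx))  # pretend model output

# Naive baseline: the consumption of the same hour one week earlier.
y_naive = y_true.shift(168)
mask = y_naive.notna()  # the first week has no baseline value


def report(truth, pred):
    """Return MAE and RMSE in MWh (hypothetical helper for illustration)."""
    return {
        "MAE": mean_absolute_error(truth, pred),
        "RMSE": float(np.sqrt(mean_squared_error(truth, pred))),
    }


# Compare both on the same rows so the numbers are directly comparable.
metrics_model = report(y_true[mask], y_pred[mask])
metrics_naive = report(y_true[mask], y_naive[mask])
print("model:", metrics_model)
print("naive:", metrics_naive)
```

A model that does not clearly beat such a baseline on a period it has never seen is unlikely to be useful in production, regardless of how well it fits the training data.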