# ML Test for Candidates (2024) - Classification

## Objective

Build a model that predicts price spikes.

## Task

A battery operator wants to profit from high prices in the aFRR+ energy market. The batteries, however, store only a limited amount of energy and cannot be used for more than a couple of hours before they need to be recharged.
Luckily, the operator has identified that every now and then there are "spikes" in the aFRR+ energy prices that the batteries can profit from. A spike is defined as an aFRR+ energy price above 350 €/MWh.

These spikes are seemingly random but the operator thinks they might be correlated with the wholesale electricity prices (spot) and maybe also the solar energy production.

Can you build a model that reliably predicts these spikes?

## Dataset

A dataset with the following columns:
- `datetime_utc_from`: Datetime in UTC (the beginning of the hour)
- `spot_ch_eurpmwh`: The wholesale electricity price in Switzerland (€/MWh)
- `global_radiation_J`: Average solar radiation in Switzerland (J)
- `activation_price_pos_eurpmwh`: aFRR+ energy price (€/MWh)

## Binder link

https://mybinder.org/v2/gh/VassilisSaplamidis/interview_tasks_quant/main

## Step 0: Load necessary libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Feature engineering

### What would be good features to use for this prediction?
- The operator thinks that the spot prices and the solar radiation play a role.
- Could there also be time-of-day or day-of-week dependencies?
- Can you think of other features that might correlate with the electricity consumption on any given day?

You should create these features now to use them in the model afterwards.

### 1. Load the dataset and set the index to datetime

The dataset is loaded here. Note that the index is in UTC.
We also create a second datetime column in the local time zone of the customers, in case you find it useful.

The "price spike" class is also defined here.

In [None]:
# Load the dataset
data = pd.read_csv('data_raw_classification.csv', delimiter=',')
data.set_index('datetime_utc_from', inplace=True)
data.index = pd.to_datetime(data.index, utc=True)

# add column with local time
data['datetime_local_from'] = data.index.tz_convert('Europe/Zurich')

# add column with the target "spike" price class
data['pos_act_price_spike'] = (data['activation_price_pos_eurpmwh'] > 350).astype(int)

### 2. Create features
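
As one possible starting point, calendar features can be derived from the local-time column. The sketch below runs on a small synthetic frame, not the real dataset. The helper `add_calendar_features` is hypothetical, and the column names `hour`, `weekday`, `month`, and `is_weekend` are assumptions (the first three match the categorical names used in Step 2).

```python
import pandas as pd


def add_calendar_features(df: pd.DataFrame, local_col: str = "datetime_local_from") -> pd.DataFrame:
    """Return a copy of df with calendar features taken from a tz-aware local-time column."""
    out = df.copy()
    local = out[local_col].dt
    out["hour"] = local.hour
    out["weekday"] = local.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = local.month
    out["is_weekend"] = (local.dayofweek >= 5).astype(int)
    return out


# Synthetic example: 48 hourly timestamps in UTC, starting on a Friday.
idx = pd.date_range("2024-01-05 00:00", periods=48, freq="h", tz="UTC")
demo = pd.DataFrame({"spot_ch_eurpmwh": [50.0 + i for i in range(48)]}, index=idx)
demo["datetime_local_from"] = demo.index.tz_convert("Europe/Zurich")
demo = add_calendar_features(demo)

print(demo[["datetime_local_from", "hour", "weekday", "is_weekend"]].head())
```

Using local rather than UTC time matters here: the row at 23:00 UTC on Friday is already Saturday 00:00 in Zurich, so it is flagged as a weekend hour.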

## Step 2: Create the dataset for the model

1. Split the data into features (X) and target (y)<br>
The target column should be `y = data['pos_act_price_spike']`
2. One-hot encode categorical features (optional)<br>
Why might this be needed?
3. Any other preprocessing you want

### 1. Separate features (X) and target (y)

In [None]:
# Placeholder names: replace them with the feature columns you created in Step 1
features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', ...]
target = 'pos_act_price_spike'
X = data[features]
y = data[target]


### 2. One-hot encode categorical features

In [None]:
# One-hot encode categorical features
categorical_features = ['hour', 'weekday', 'month']
encoder = OneHotEncoder(sparse_output=False)
encoded_cats = encoder.fit_transform(X[categorical_features])
encoded_df = pd.DataFrame(encoded_cats, index=X.index, columns=encoder.get_feature_names_out(categorical_features))
X_encoded = pd.concat([X.drop(columns=categorical_features), encoded_df], axis=1)

### 3. Other preprocessing
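
One preprocessing choice worth considering, since the rows form an hourly time series, is to split train and test chronologically rather than at random, so that the test set lies strictly after the training data. The following is a sketch on synthetic stand-in data; the 80/20 ratio and the names `X_demo` and `y_demo` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (not the real dataset).
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=1000, freq="h", tz="UTC")
X_demo = pd.DataFrame({"spot_ch_eurpmwh": rng.normal(80, 20, size=1000)}, index=idx)
y_demo = pd.Series((rng.random(1000) < 0.1).astype(int), index=idx, name="pos_act_price_spike")

# shuffle=False keeps the row order, so the last 20% of rows become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Spikes only occur "every now and then", so check the class balance in both parts.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```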

## Step 3: Model Building

Remember that the goal is a *good predictive model that is robust and can be used to predict unseen data*.
You can experiment with different models, methods, objectives, and parameters.

- What type of model did you choose? Why?
- Does the model have any tunable parameters? How did you set their value?

If you need to scale columns, `StandardScaler` might be helpful.
If you need to train models with different parameters, objectives, etc., `GridSearchCV` might be useful.
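
As an illustration of how these two pieces can fit together, the sketch below wraps a `StandardScaler` and a classifier in a `Pipeline` and tunes it with `GridSearchCV`. It runs on synthetic data and is not a reference solution: the choice of `SVC`, the parameter grid, the F1 scoring, and the use of `TimeSeriesSplit` for cross-validation are all assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, time-ordered stand-in data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

# Scaling lives inside the pipeline, so it is re-fit on each training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",                    # assumption: F1 score on the spike class
    cv=TimeSeriesSplit(n_splits=3),  # each validation fold comes after its training fold
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training, which would make the cross-validated score look better than it is.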

### Train the model

## Step 4: Evaluation

- Can you estimate how well your model performed?
- Do you think it can be used to predict unseen data? Why? Why not?
- What improvements would you make?
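
A possible starting point for these questions, sketched on synthetic data: evaluate on a held-out, later part of the series and look at per-class metrics rather than accuracy alone, since a model that never predicts a spike can still reach high accuracy when spikes are rare. The data, the random forest model, and the 80/20 split are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic, time-ordered stand-in data (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)

# Chronological hold-out: train on the first 80%, test on the last 20%.
split = int(0.8 * len(X_demo))
X_train, X_test = X_demo[:split], X_demo[split:]
y_train, y_test = y_demo[:split], y_demo[split:]

model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (0 = no spike, 1 = spike).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["no spike", "spike"], zero_division=0
))
```

For the battery use case, the precision and recall of the spike class are likely more informative than overall accuracy: missed spikes are lost revenue, while false alarms spend limited battery energy at ordinary prices.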