# Model Development 

In this Jupyter notebook, we will gradually develop our models. This entails:   
- **preparatory work**: transforming data as needed (one-hot encoding, scaling, etc.) and performing the train-test split  
- **model implementation**: creating baseline models  
- **model tuning**: hyperparameter tuning to get the best version of each model   
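
As a rough illustration of the preparatory step, the following sketch scales numeric columns and one-hot encodes a categorical column in one pass with `ColumnTransformer`. The toy frame and the `site` column are hypothetical and only serve the example. The real data contains no categorical feature.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up values; "site" is a hypothetical categorical column.
toy = pd.DataFrame({
    "wind_speed_avg_10": [49.7, 51.3, 51.0, 54.7],
    "humidity": [98.1, 98.2, 98.2, 98.0],
    "site": ["north", "south", "north", "south"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric columns to zero mean and unit variance
        ("num", StandardScaler(), ["wind_speed_avg_10", "humidity"]),
        # Expand the categorical column into one indicator column per category
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["site"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocessor.fit_transform(toy)
print(toy_prepared.shape)  # (4, 4): two scaled columns plus two indicator columns
```

In practice the transformer would be fitted on the training portion only and then applied to the test portion, so that no test-set statistics leak into the scaling.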

In [3]:
# Import the packages used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix, classification_report
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor

## SARIMAX

### Data Import

In [None]:
# Import the data
data_ons = pd.read_csv("data/final_onshore_data_2017_2025.csv")
data_ofs = pd.read_csv("data/final_offshore_data_2017_2025.csv")

# Quick inspection of the first rows
data_ons.head()


Unnamed: 0,year_mon_day,hour,wind_dir_avg_10,wind_speed_h_avg,wind_speed_avg_10,air_pressure,humidity,full_datetime,capacity,volume,percentage,emission,emissionfactor,correct_days
0,20170101,1,207.708194,49.666667,49.666667,10234.526316,98.076923,2017-01-01-01,679334,679334,0.78873,0,0,2017-01-01-01
1,20170101,2,205.010321,50.0,51.333333,10227.789474,98.153846,2017-01-01-02,677462,677462,0.786558,0,0,2017-01-01-02
2,20170101,3,202.701006,51.666667,51.0,10219.473684,98.230769,2017-01-01-03,653746,653746,0.759025,0,0,2017-01-01-03
3,20170101,4,201.007553,52.333333,54.666667,10211.368421,98.038462,2017-01-01-04,705882,705882,0.819552,0,0,2017-01-01-04
4,20170101,5,200.325015,52.666667,53.333333,10203.526316,97.461538,2017-01-01-05,716738,716738,0.832158,0,0,2017-01-01-05


### Train-Test-Split
Since we will presumably work with k-fold cross-validation, we do not create a dedicated validation set. Instead, we use a common 80/20 train-test split. Before splitting, we store the predictors and the label separately as `X` and `y`.

In [9]:
# Split data into X (features) and y (label)
X_ons = data_ons.drop("volume", axis=1)
y_ons = data_ons["volume"]

In [11]:
# perform train test split
# test_size=0.2 holds out 20% of the rows (an integer would be read as an absolute row count)
X_train_ons, X_test_ons, y_train_ons, y_test_ons = train_test_split(X_ons, y_ons, test_size=0.2, random_state=42)
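
Because the observations are hourly, it may be worth comparing the shuffled split with a chronological one, in which the last 20% of the time range forms the test set. The sketch below shows this on a synthetic frame. The `toy` data, the `timestamp` column, and the `_toy` variable names are assumptions made for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly frame standing in for the real data (all values are made up).
n = 100
timestamps = pd.Timestamp("2017-01-01 01:00") + pd.to_timedelta(np.arange(n), unit="h")
toy = pd.DataFrame({
    "timestamp": timestamps,
    "wind_speed_avg_10": np.linspace(40.0, 60.0, n),
    "volume": np.linspace(600_000.0, 700_000.0, n),
})

# Make sure the rows are in temporal order before cutting
toy = toy.sort_values("timestamp").reset_index(drop=True)
X_toy = toy.drop(columns=["volume"])
y_toy = toy["volume"]

# First 80% of the rows for training, last 20% for testing
split_at = int(len(toy) * 0.8)
X_train_toy, X_test_toy = X_toy.iloc[:split_at], X_toy.iloc[split_at:]
y_train_toy, y_test_toy = y_toy.iloc[:split_at], y_toy.iloc[split_at:]

print(len(X_train_toy), len(X_test_toy))  # 80 20
# Every training timestamp precedes every test timestamp
print(X_train_toy["timestamp"].max() < X_test_toy["timestamp"].min())  # True
```

Calling `train_test_split` with `shuffle=False` produces the same kind of ordered split on an already sorted frame.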

### Notes on Pre-Processing
For SARIMAX, categorical features need to be one-hot encoded, but since our data does not contain any such features, we can skip this step. The data does not need to be scaled either. However, we need to define the **seasonal frequency** before we can build a model.
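
One possible way to choose the seasonal frequency is to compare the autocorrelation of the series at a few candidate lags. The sketch below does this on a synthetic hourly series with a built-in 24-hour cycle. The candidate lags, the threshold, and the placeholder values for `P`, `D`, and `Q` are assumptions for illustration, not values derived from the wind data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series with a 24-hour cycle plus noise (made-up data).
hours = np.arange(24 * 28)
rng = np.random.default_rng(42)
series = pd.Series(
    10.0 + 3.0 * np.sin(2 * np.pi * hours / 24) + rng.normal(0.0, 0.5, hours.size)
)

# Candidate periods in hours: half a day, one day, one week
candidates = [12, 24, 168]
acf = {s: series.autocorr(lag=s) for s in candidates}

# Multiples of the true period also correlate strongly, so take the
# smallest candidate above an (arbitrary) threshold.
threshold = 0.5
seasonal_period = min(s for s, r in acf.items() if r >= threshold)

# statsmodels' SARIMAX takes the period as the last entry of
# seasonal_order=(P, D, Q, s); P, D, Q here are placeholders.
seasonal_order = (1, 0, 0, seasonal_period)
print(seasonal_order)  # (1, 0, 0, 24)
```

On real data the autocorrelation pattern is usually less clean, so inspecting a full autocorrelation plot alongside such a check is advisable.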

## LSTM

In [None]:
data_ons = pd.read_csv("data/final_onshore_data_2017_2025.csv")
data_ofs = pd.read_csv("data/final_offshore_data_2017_2025.csv")

## Random Forest

In [None]:
data_ons = pd.read_csv("data/final_onshore_data_2017_2025.csv")
data_ofs = pd.read_csv("data/final_offshore_data_2017_2025.csv")

## XGBoost

In [None]:
data_ons = pd.read_csv("data/final_onshore_data_2017_2025.csv")
data_ofs = pd.read_csv("data/final_offshore_data_2017_2025.csv")