In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime

from sklearn.cluster import KMeans

# Goal

The goal of this notebook is to detect lost data and replace the affected values.

## Sensor dataset

We have two types of outliers in our dataset: missing data (NaN) and data recorded as 0.
For both temperatures, it's pretty straightforward because the temperatures never go below roughly 20 degrees, so detecting missing values is easy.

For irradiation, however, 0 is a legitimate reading at night, so we have to detect where values at 0 shouldn't be there.

UPDATE 1: I was wrong: there are no real values at 0, but many readings are missing. They are not NaN; the rows themselves are absent, so the dataset needs to be rebuilt to add the dates and times at those points.
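Making absent rows explicit can be sketched with a pandas reindex. This is only an illustration on made-up data: the column names `DATE_TIME` and `IRRADIATION`, the helper `add_missing_timestamps` and the 15-minute cadence (guessed from `time_id` running from 0 to 95) are assumptions, not the code in `src.data_prep`.

```python
import pandas as pd

# Toy frame with an assumed 15-minute cadence and one absent row (00:30).
df = pd.DataFrame(
    {
        "DATE_TIME": pd.to_datetime(
            ["2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:45"]
        ),
        "IRRADIATION": [0.0, 0.0, 0.0],
    }
)


def add_missing_timestamps(frame, time_col="DATE_TIME", freq="15min"):
    """Reindex on a regular time grid so absent rows become explicit NaN rows."""
    indexed = frame.set_index(time_col).sort_index()
    full_index = pd.date_range(
        indexed.index.min(), indexed.index.max(), freq=freq, name=time_col
    )
    # Timestamps that were not in the original frame get NaN in every column.
    return indexed.reindex(full_index).reset_index()


completed = add_missing_timestamps(df)
print(completed)
```

Once the gaps are NaN rows, they can be handled like any other missing value. Note that `reindex` raises on duplicate timestamps, so those would need to be dropped or aggregated first.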

To detect those misplaced zeros, we'll "split" the dataset using KMeans clustering and flag all values at 0 in the middle of the day.
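One possible way to do such a split, shown on synthetic data: cluster the 96 daily time slots into "day" and "night" by their mean irradiation, then flag zero readings that fall in a day slot. How the notebook's own helpers actually use KMeans is not shown here, so the column names, the two-cluster choice and the synthetic curve are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: 3 days x 96 quarter-hour slots, daylight between slots 24 and 72.
time_id = np.tile(np.arange(96), 3)
irradiation = np.clip(np.sin((time_id - 24) / 48 * np.pi), 0, None)
df = pd.DataFrame({"time_id": time_id, "IRRADIATION": irradiation})

# Inject two zeros in the middle of the day to play the role of bad readings.
df.loc[[144, 242], "IRRADIATION"] = 0.0

# Cluster the time slots (not the rows) on their mean irradiation.
slot_mean = df.groupby("time_id")["IRRADIATION"].mean()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(slot_mean.to_numpy().reshape(-1, 1))

# The cluster with the higher centroid is treated as "day".
day_label = int(np.argmax(kmeans.cluster_centers_.ravel()))
day_slots = slot_mean.index[kmeans.labels_ == day_label]

# A zero inside a day slot is suspicious; a zero at night is expected.
suspicious = df[df["time_id"].isin(day_slots) & (df["IRRADIATION"] == 0)]
print(suspicious)
```

Clustering slots rather than rows keeps a single bad reading from dragging its own row into the "night" cluster.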

Afterwards, we'll use the "cleaned" set to apply feature engineering to each column, and finally choose, train and use a model to replace each missing or wrong value.

Finally, we'll save the new dataset as a parquet file.


In [5]:
from src.data_prep import *
from src.feature_engineering import *
# Load both the training set and the dataset to be completed at the end.
dfs = load_sensor("dataset/Plant_1_Weather_Sensor_Data.csv", save=False)
# Get the X and y for all expected values
irr_X, irr_y = sensor_for_irradiation(dfs[0])
atemp_X, atemp_y = sensor_for_ambient_temp(dfs[0])
mtemp_X, mtemp_y = sensor_for_module_temp(dfs[0])

# Now that we have all starting values we take a quick glance at one of them
irr_X

Unnamed: 0,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,time_id,relationshipWgoal,vClustering,hClustering,SumClusterings,PC1
0,25.184316,22.857507,0,-0.092391,1,1,2,48.322093
1,24.170166,21.631490,0,-0.105033,1,1,2,48.436286
2,21.155691,20.599400,0,-0.026295,1,1,2,48.623228
3,22.610982,20.609906,0,-0.088500,1,1,2,48.561452
4,22.465285,20.111123,0,-0.104791,1,1,2,48.596692
...,...,...,...,...,...,...,...,...
3177,22.205029,20.459212,95,-0.078623,1,0,1,-46.176893
3178,23.418154,22.447845,95,-0.041434,1,0,1,-46.343965
3179,23.641211,23.051286,95,-0.024953,1,0,1,-46.388565
3180,22.892004,21.216600,95,-0.073187,1,0,1,-46.249978


### IRRADIATION (Choosing best models)

Since I suspect non-linear relationships, I'll start with these models to get a first feel for the task:
- SVMs
- DecisionTree
- Random Forest
- Gradient Boosting

Additional models worth noting for further tests:
- Neural Nets
- XGBoost
- LightGBM

These metrics will be used to get a good overview of model performance on the task:
- R-Squared
- MAE
- MSE
- RMSE
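For reference, the four metrics can be computed directly from predictions as in the sketch below. It is not the implementation in `src.model`; the function name `regression_metrics` is hypothetical, and only the dictionary keys are chosen to match the ones printed in this notebook.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE, MSE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))   # mean squared error
    mae = float(np.mean(np.abs(residuals)))  # mean absolute error
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot             # share of variance explained

    return {"r2": r2, "MAE": mae, "MSE": mse, "RMSE": mse ** 0.5}


metrics = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.8])
print(metrics)
```

MAE and RMSE are in the units of the target, while MSE penalises large errors more heavily, which is why two models can rank differently depending on the metric.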


In [8]:
from src.model import * 
for m in ['SVM', 'DT', 'RF', 'GB']:
    model, metrics = model_testing(irr_X,irr_y, m)
    print(m)
    print(f"r2:{metrics['r2']}\nMAE:{metrics['MAE']}\nMSE:{metrics['MSE']}\nRMSE:{metrics['RMSE']}")  

SVM
r2:0.9215352482490099
MAE:0.06320630530062542
MSE:0.005547701619623761
RMSE:0.07448289481232426
DT
r2:0.9222200432771472
MAE:0.041455440150260744
MSE:0.006832252284035722
RMSE:0.08265743937502372
RF
r2:0.9587730486670925
MAE:0.02935808470839487
MSE:0.0034997885752967328
RMSE:0.05915901093913532
GB
r2:0.9593462646674881
MAE:0.03027432250291098
MSE:0.0034686670832589136
RMSE:0.058895391018813296


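Unsurprisingly, Gradient Boosting is one of the best models and goes toe to toe with Random Forest: it is slightly ahead on R², MSE and RMSE, while Random Forest has the lower MAE.
More surprisingly, SVM and Decision Tree are also toe to toe, but at the lower end: their R² values are nearly identical (about 0.92), with SVM ahead on MSE and RMSE and Decision Tree ahead on MAE.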

Now we'll try adding some shallow tuning to see if it makes any difference.
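As an illustration of what a shallow tuning pass can look like, here is a small grid search with scikit-learn's `GridSearchCV` on synthetic data. The grid values, the synthetic target and the choice of `GradientBoostingRegressor` are assumptions made for the example; the tuning done inside `model_testing` is not shown in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression problem standing in for the sensor features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

# A deliberately small grid: "shallow" tuning, not an exhaustive search.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, -search.best_score_)
```

`search.best_estimator_` is refit on the full data with the winning parameters, so it can be used directly for prediction afterwards.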

In [None]:
# In case this takes long, keep the models in a list so the next cell can retrieve and display them and their best params
models = []
for m in ['SVM', 'DT', 'RF', 'GB']:
    print(f"----------{m}----------")
    model, metrics = model_testing(irr_X,irr_y, m)
    models.append(model)
    print(f"r2:{metrics['r2']}\nMAE:{metrics['MAE']}\nMSE:{metrics['MSE']}\nRMSE:{metrics['RMSE']}")  

In [9]:
# TODO: Save params into files instead of awful printing