In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime


# Goal

The goal of this notebook is to detect any lost data and replace the missing values.

## Sensor dataset

We have two types of outliers in our dataset: missing data (NaN) and data recorded as 0.
For both temperatures, it's pretty straightforward because the temperatures never go below roughly 20 degrees, so detecting missing values is easy.
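
As a minimal sketch of that check (toy values, not the real sensor data), a temperature reading can be flagged as missing when it is NaN or exactly 0. The `is_missing` column is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy readings; the column name mirrors the sensor table, the values are made up.
temps = pd.DataFrame({"AMBIENT_TEMPERATURE": [25.2, 0.0, np.nan, 24.1, 22.6]})

# A reading counts as missing if it is NaN or recorded as 0, since real
# temperatures in this dataset stay at roughly 20 degrees or above.
col = temps["AMBIENT_TEMPERATURE"]
temps["is_missing"] = col.isna() | (col == 0)  # hypothetical helper column

n_missing = int(temps["is_missing"].sum())
print(n_missing)
```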

For irradiation, however, 0 is a normal value at night, so we have to detect the zeros that shouldn't be there.

UPDATE 1: I noticed that I was wrong: there are no real values at 0, but lots of missing values. They are not NaN; the rows themselves are absent, so the dataset needs to be rebuilt to add the missing dates and times.
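
One possible way to rebuild the absent rows is to reindex on a complete time grid. This is only a sketch on toy data: the `DATE_TIME` column name and the 15-minute sampling interval are assumptions, not something taken from `load_sensor`.

```python
import pandas as pd

# Toy frame with a gap: the 00:30 reading has no row at all (not even a NaN).
raw = pd.DataFrame(
    {
        "DATE_TIME": pd.to_datetime(
            ["2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:45"]
        ),
        "IRRADIATION": [0.0, 0.0, 0.0],
    }
).set_index("DATE_TIME")

# Build the full grid of expected timestamps (15-minute step is an assumption).
full_index = pd.date_range(
    raw.index.min(), raw.index.max(), freq="15min", name="DATE_TIME"
)

# Reindexing inserts the absent timestamps as rows filled with NaN.
complete = raw.reindex(full_index)
missing_times = complete.index[complete["IRRADIATION"].isna()]
print(list(missing_times))
```

After this step the gaps are ordinary NaN rows, so they can be handled like any other missing value.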

To do that, we'll "split" the dataset using KMeans clustering and detect all values at 0 in the middle of the day.
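
The idea can be sketched on synthetic data as follows. This is not necessarily how `src.data_prep` does it: here the time slots are clustered by their mean irradiation across days, which is one assumption among several possible choices, and all the data is generated for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: 3 days x 96 slots with a bell-shaped daytime irradiation.
time_id = np.tile(np.arange(96), 3)
irr = np.clip(np.sin(np.pi * (time_id - 24) / 48), 0, None)
df = pd.DataFrame({"time_id": time_id, "IRRADIATION": irr})
df.loc[48, "IRRADIATION"] = 0.0  # inject a fake dropout at midday on day 1

# Cluster the 96 time slots by their mean irradiation across days.
slot_mean = df.groupby("time_id")["IRRADIATION"].mean()
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(slot_mean.to_numpy().reshape(-1, 1))

# The cluster with the higher centre is treated as "daytime".
day_label = int(np.argmax(km.cluster_centers_.ravel()))
day_slots = slot_mean.index[labels == day_label]

# A zero during a daytime slot is suspicious; a zero at night is expected.
suspicious = df[(df["IRRADIATION"] == 0) & df["time_id"].isin(day_slots)]
print(suspicious)
```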

Afterwards, we'll use the "cleaned" set to apply feature engineering to each column, and finally choose, train and use a model to replace each missing or wrong value.

Finally, we'll save the new dataset as a parquet file.


In [2]:
from src.data_prep import *
from src.feature_engineering import *
# Load both the training set and the dataset used for the final completion.
dfs = load_sensor("dataset/Plant_1_Weather_Sensor_Data.csv", save=False)
# Get the X and y for all expected values
irr_X, irr_y = sensor_for_irradiation(dfs[0])
atemp_X, atemp_y = sensor_for_ambient_temp(dfs[0])
mtemp_X, mtemp_y = sensor_for_module_temp(dfs[0])

# Now that we have all starting values we take a quick glance at one of them
irr_X

Unnamed: 0,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,time_id,relationshipWgoal,vClustering,hClustering,SumClusterings,PC1
0,25.184316,22.857507,0,-0.092391,0,1,1,48.331454
1,24.170166,21.631490,0,-0.105033,0,1,1,48.445796
2,21.155691,20.599400,0,-0.026295,0,1,1,48.632922
3,22.610982,20.609906,0,-0.088500,0,1,1,48.571107
4,22.465285,20.111123,0,-0.104791,0,1,1,48.606401
...,...,...,...,...,...,...,...,...
3177,22.205029,20.459212,95,-0.078623,0,2,2,-46.165190
3178,23.418154,22.447845,95,-0.041434,0,2,2,-46.332494
3179,23.641211,23.051286,95,-0.024953,0,2,2,-46.377161
3180,22.892004,21.216600,95,-0.073187,0,2,2,-46.238370


### IRRADIATION (Choosing best models)

Since I suspect non-linear relationships, I'll start with these models to get a grasp of the problem:
- SVMs
- DecisionTree
- Random Forest
- Gradient Boosting

Additional models worth noting for further tests:
- Neural Nets
- XGBoost
- LightGBM

These metrics will be used to get a good overview of model performance on the task:
- R-Squared
- MAE
- MSE
- RMSE
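
For reference, the four metrics can be computed with scikit-learn as in the sketch below. The arrays are toy values, and `model_testing` may compute them differently internally.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions, for illustration only.
y_true = np.array([0.0, 0.2, 0.5, 0.9])
y_pred = np.array([0.1, 0.2, 0.4, 1.0])

metrics = {
    "r2": r2_score(y_true, y_pred),
    "MAE": mean_absolute_error(y_true, y_pred),
    "MSE": mean_squared_error(y_true, y_pred),
}
# RMSE is the square root of MSE, so it is in the same unit as the target.
metrics["RMSE"] = float(np.sqrt(metrics["MSE"]))
print(metrics)
```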


In [4]:
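from src.model import *

mods = []  # models returned by model_testing, in the same order as the names below
for m in ['SVM', 'DT', 'RF', 'GB']:
    model, metrics = model_testing(irr_X, irr_y, m)
    mods.append(model)  # store the model itself (the original appended the list to itself)
    print(f'-----------{m}-----------')
    print(f"r2:{metrics['r2']}\nMAE:{metrics['MAE']}\nMSE:{metrics['MSE']}\nRMSE:{metrics['RMSE']}")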

Fitting 2 folds for each of 40 candidates, totalling 80 fits
{'C': 16, 'gamma': 0.1, 'kernel': 'linear'}
-----------SVM-----------
r2:0.9570077536976176
MAE:0.0392994736688664
MSE:0.0036580574443085536
RMSE:0.060481876990620534
Fitting 2 folds for each of 16 candidates, totalling 32 fits
{'criterion': 'squared_error', 'max_depth': 11, 'min_samples_split': 12, 'splitter': 'random'}
-----------DT-----------
r2:0.9566498064182145
MAE:0.030151802033353967
MSE:0.003701326460104872
RMSE:0.06083852776082663
Fitting 2 folds for each of 12 candidates, totalling 24 fits
{'criterion': 'squared_error', 'max_depth': 2, 'max_features': 'log2', 'min_samples_split': 2, 'n_estimators': 85}
-----------RF-----------
r2:0.9378672041932105
MAE:0.045347806172349986
MSE:0.004697701540228066
RMSE:0.06853978071330595
Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'learning_rate': 0.1, 'max_depth': 1, 'min_samples_split': 2, 'n_estimators': 160, 'subsample': 0.6}
-----------GB-----------
r2:0.960

Unsurprisingly, Gradient Boosting is one of the best models and goes toe-to-toe with Random Forest (and is slightly better if we take all metrics into consideration).
Quite surprisingly, SVM and Decision Tree are also toe-to-toe, but on the lower end.

Now we'll try adding some shallow tuning to see if it makes any difference.

**UPDATE 1:**
It looks like the computation time for my grids is far too high: there are too many parameter combinations AND the values themselves are too high.

**Example with SVR**: Fitting with C at 100 takes two or even three times as long as with C at 50.

So I'm lowering my expectations considerably to get a first look at the parameters and the improvements I could make.
Some examples of computation times tested so far for SVR (sigh): two runs of 800 mins and one of 1200, each time after cutting the grid nearly in half.
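
To see how fast a grid grows before launching it, the number of candidates can be counted up front with `ParameterGrid`. The grid below is hypothetical and only illustrates the arithmetic; it is not the grid used in `src.model`.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical SVR grid; the values are illustrative only.
svr_grid = {
    "kernel": ["linear", "rbf"],
    "C": [1, 4, 16, 64],
    "gamma": [0.001, 0.01, 0.1, 1, 10],
}

# Candidates = product of the list lengths; fits = candidates x CV folds.
n_candidates = len(ParameterGrid(svr_grid))
n_folds = 2
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates, {n_fits} fits")
```

Adding one more value to any list multiplies the total, which is why halving a grid still leaves long runs when the individual fits are slow.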

**UPDATE 2:**
The SVR computational issue came from the fact that a poly kernel doesn't combine well with gamma (I still need to expand on exactly why).
It was taking ages for nothing.
It should be resolved now.
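
One way to keep kernel-specific parameters from multiplying the grid is to give each kernel its own sub-grid, which `GridSearchCV` and `ParameterGrid` both accept as a list of dicts. This is a sketch with made-up values, and it may not be the fix that was actually applied here. In scikit-learn's SVR, `gamma` is ignored by the linear kernel and `degree` only matters for the poly kernel, so a flat grid spends fits on combinations that cannot differ.

```python
from sklearn.model_selection import ParameterGrid

# Flat grid: every kernel is crossed with every gamma and degree (illustrative values).
flat_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [1, 10],
    "gamma": [0.01, 0.1, 1],
    "degree": [2, 3],
}

# Per-kernel grid: each kernel only sees the parameters it actually uses.
# Fixing gamma to "scale" for poly is an assumption made for this example.
per_kernel_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [1, 10], "gamma": ["scale"], "degree": [2, 3]},
]

n_flat = len(ParameterGrid(flat_grid))
n_per_kernel = len(ParameterGrid(per_kernel_grid))
print(n_flat, n_per_kernel)
```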

**UPDATE 3:**

After simple tuning, we can boost SVM & DT to a much better fit, bringing them closer to GB & RF.
On the other hand, the parameters used for fitting RF are insufficient or too shallow, hence giving an overall worse prediction.

Fitting 2 folds for each of 360 candidates, totalling 720 fits
{'C': 20, 'gamma': 0.1, 'kernel': 'linear'}
-----------SVM-----------
r2:0.9570638213108729
MAE:0.038579613991283235
MSE:0.0035631767338816845
RMSE:0.05969235071499266
Fitting 2 folds for each of 16 candidates, totalling 32 fits
{'criterion': 'absolute_error', 'max_depth': 11, 'min_samples_split': 12, 'splitter': 'random'}
-----------DT-----------
r2:0.9549085977688332
MAE:0.03099150040360254
MSE:0.003886872691172663
RMSE:0.06234478880526153

Fitting 2 folds for each of 360 candidates, totalling 720 fits
{'C': 20, 'gamma': 0.1, 'kernel': 'linear'}
-----------SVM-----------
r2:0.9570638213108729
MAE:0.038579613991283235
MSE:0.0035631767338816845
RMSE:0.05969235071499266
Fitting 2 folds for each of 16 candidates, totalling 32 fits
{'criterion': 'absolute_error', 'max_depth': 11, 'min_samples_split': 12, 'splitter': 'random'}
-----------DT-----------
r2:0.9550036802538894
MAE:0.03162794702825078
MSE:0.0039016627969264673
RMSE:0.06246329159535596
Fitting 2 folds for each of 548720 candidates, totalling 1097440 fits