# Goal

The goal of this notebook is to detect lost sensor data and replace it with predicted values.

## Sensor dataset

We have two types of outliers in our dataset: missing data (NaN) and data recorded as 0.
For both temperatures this is straightforward: the temperatures never go below roughly 20 degrees, so missing values are easy to detect.

For irradiation, however, 0 is a legitimate reading at night, so we have to detect which 0 values shouldn't be there.

**UPDATE 1:** I was wrong: no values are actually wrongly set to 0, but many rows are missing entirely (absent, not NaN), so the dataset needs to be rebuilt to add those dates and times.
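
One way to rebuild the missing rows is to reindex the frame on a complete 15-minute timeline, so that absent timestamps reappear as NaN rows. The sketch below assumes a `DATE_TIME` column and a 15-minute sampling step, as in the sensor output shown later in this notebook. The toy values are made up for illustration, and this is not necessarily how `load_sensor` handles it.

```python
import pandas as pd

# Toy frame (made-up values) where the 00:30 slot is absent from the file
raw = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:45"]
    ),
    "IRRADIATION": [0.0, 0.0, 0.0],
})

# Complete timeline between the first and last reading, one row every 15 minutes
full_index = pd.date_range(raw["DATE_TIME"].min(), raw["DATE_TIME"].max(), freq="15min")

# Reindexing adds the absent timestamps back as rows filled with NaN
filled = (
    raw.set_index("DATE_TIME")
    .reindex(full_index)
    .rename_axis("DATE_TIME")
    .reset_index()
)

# Rows that were missing from the original file
missing = filled[filled["IRRADIATION"].isna()]
print(missing)
```

Once the gaps are explicit NaN rows, they can be handled like any other missing value.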

To do that, we'll "split" the dataset using K-means clustering and detect all values at 0 in the middle of the day.
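
As a rough illustration of that idea, the sketch below clusters the mean irradiation of each time slot into two groups ("night" and "day") and flags zero readings that fall in a "day" slot. The synthetic curve, the column names `time_id` and `IRRADIATION`, and the choice to cluster per-slot means are all assumptions made for this example, not the exact procedure used in `src`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data (made up): 5 days x 96 quarter-hour slots with a bell-shaped daytime curve
days, slots = 5, 96
profile = np.clip(np.sin((np.arange(slots) - 24) / 48 * np.pi), 0, None)
df = pd.DataFrame({
    "time_id": np.tile(np.arange(slots), days),
    "IRRADIATION": np.tile(profile, days),
})
# Inject one suspicious zero at midday (slot 48) on the third day
df.loc[2 * slots + 48, "IRRADIATION"] = 0.0

# Cluster the per-slot mean irradiation into two groups: "night" and "day"
mean_by_slot = df.groupby("time_id")["IRRADIATION"].mean()
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(mean_by_slot.to_numpy().reshape(-1, 1))
day_label = int(np.argmax(km.cluster_centers_.ravel()))
day_slots = mean_by_slot.index[labels == day_label]

# A zero reading inside a "day" slot is a candidate for replacement
suspicious = df[(df["IRRADIATION"] == 0) & df["time_id"].isin(day_slots)]
print(suspicious)
```

Night-time zeros are left alone because their slots fall in the low-irradiation cluster.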

Afterwards, we'll use the "cleaned" set to apply feature engineering to each column, and finally choose, train and use a model to replace each missing or wrong value.

Finally, we'll save the new dataset as a Parquet file.


In [1]:
from src.data_prep import *
from src.feature_engineering import *
from sklearn.model_selection import train_test_split

# Load both the training set and the set to be completed at the end.
dfs = load_sensor("dataset/Plant_2_Weather_Sensor_Data.csv", save=False)
# Get X and y for each target column
irr_X, irr_y = lone_sensor_for_irradiation(dfs[0])
atemp_X, atemp_y = lone_sensor_for_ambient_temp(dfs[0])
mtemp_X, mtemp_y = lone_sensor_for_module_temp(dfs[0])

random_state_nb = 42
test_size = 0.2

irrX_train, irrX_dev, irry_train, irry_dev = train_test_split(irr_X, irr_y, test_size=test_size, random_state=random_state_nb)
atempX_train, atempX_dev, atempy_train, atempy_dev = train_test_split(atemp_X, atemp_y, test_size=test_size, random_state=random_state_nb)
mtempX_train, mtempX_dev, mtempy_train, mtempy_dev =  train_test_split(mtemp_X, mtemp_y, test_size=test_size, random_state=random_state_nb)


Xs = {
    'IRRADIATION': (irrX_train, irrX_dev),
    'AMBIENT_TEMPERATURE': (atempX_train, atempX_dev),
    'MODULE_TEMPERATURE': (mtempX_train, mtempX_dev),
}

ys = {
    'IRRADIATION': (irry_train, irry_dev) ,
    'AMBIENT_TEMPERATURE': (atempy_train, atempy_dev),
    'MODULE_TEMPERATURE': (mtempy_train, mtempy_dev) ,
}

### IRRADIATION (Choosing best models)

Since I suspect non-linear relationships, I'll start with these models to get a feel for the problem:
- SVMs
- DecisionTree
- Random Forest
- Gradient Boosting

Additional models worth noting for further tests:
- Neural Nets
- XGBoost
- LightGBM

These metrics will be used to get a good overview of model performance on the task:
- R-Squared
- MAE
- MSE
- RMSE
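
To make the comparison concrete, here is a minimal sketch that fits the four first-round models on a synthetic non-linear target and reports the four metrics above. The data, the `candidates` dictionary and the `evaluate` helper are hypothetical stand-ins; the notebook's own `model_testing` helper lives in `src.model` and is not shown here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target (made up), standing in for the real features
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 400)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "SVM": SVR(),
    "DT": DecisionTreeRegressor(random_state=42),
    "RF": RandomForestRegressor(n_estimators=50, random_state=42),
    "GB": GradientBoostingRegressor(random_state=42),
}


def evaluate(model):
    """Fit on the training split and score on the dev split."""
    pred = model.fit(X_train, y_train).predict(X_dev)
    mse = mean_squared_error(y_dev, pred)
    return {
        "r2": r2_score(y_dev, pred),
        "MAE": mean_absolute_error(y_dev, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of the MSE
    }


results = {name: evaluate(model) for name, model in candidates.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

MAE and RMSE are in the units of the target, which makes them easier to read than MSE when comparing irradiation and temperature models.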


In [2]:
# Define the model to optimize for each target column
from src.optimization import *
from sklearn.ensemble import GradientBoostingRegressor

irr_model = GradientBoostingRegressor()
atemp_model = GradientBoostingRegressor()
mtemp_model = GradientBoostingRegressor()

# Each key maps to the model meant for that column
model_d = {
    'IRRADIATION': irr_model,
    'AMBIENT_TEMPERATURE': atemp_model,
    'MODULE_TEMPERATURE': mtemp_model,
}


In [3]:
# Applying models: tune one model per target column on its training split

for k in model_d:
    params, model_d[k] = lightgb_opt(Xs[k][0], ys[k][0])



Fitting 2 folds for each of 50000 candidates, totalling 100000 fits


In [6]:
changed_df = write_df(model_d, dfs[1])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  na_df[column] = np.round(trained_model.predict(temp_df), 3)
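
The `SettingWithCopyWarning` above usually means a column was assigned on a frame obtained by slicing another one. A minimal sketch of the usual remedy is to assign through `.loc` on the original frame with a boolean mask. The toy frame and the `predicted` array (a stand-in for `trained_model.predict(...)`) are illustrative assumptions, since the body of `write_df` is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing irradiation reading
df = pd.DataFrame({
    "time_id": [40, 41, 42],
    "IRRADIATION": [0.1, np.nan, 0.3],
})

# Boolean mask of the rows that need a predicted value
na_mask = df["IRRADIATION"].isna()

# Stand-in for trained_model.predict(features_of_missing_rows)
predicted = np.array([0.2004])

# Assign on the original frame through .loc, not on a filtered copy
df.loc[na_mask, "IRRADIATION"] = np.round(predicted, 3)
print(df)
```

If a separate frame of missing rows is really needed, taking it with an explicit `.copy()` also avoids the warning.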


In [4]:
changed_df

Unnamed: 0,DATE_TIME,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION,date,time,time_id
0,2020-05-15 00:00:00,25.184316,22.857507,0.0,2020-05-15,00:00:00,0
1,2020-05-16 00:00:00,22.711000,20.912000,0.0,2020-05-16,00:00:00,0
2,2020-05-17 00:00:00,24.170166,21.631490,0.0,2020-05-17,00:00:00,0
3,2020-05-18 00:00:00,21.155691,20.599400,0.0,2020-05-18,00:00:00,0
4,2020-05-19 00:00:00,22.610982,20.609906,0.0,2020-05-19,00:00:00,0
...,...,...,...,...,...,...,...
3260,2020-06-13 23:45:00,22.205029,20.459212,0.0,2020-06-13,23:45:00,95
3261,2020-06-14 23:45:00,23.418154,22.447845,0.0,2020-06-14,23:45:00,95
3262,2020-06-15 23:45:00,23.641211,23.051286,0.0,2020-06-15,23:45:00,95
3263,2020-06-16 23:45:00,22.892004,21.216600,0.0,2020-06-16,23:45:00,95


In [7]:
changed_df.to_parquet("dataset/parquets/plant_2_updated_sensor")

In [5]:

from src.model import *
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mods = []

m = 'GB'

# Untuned Gradient Boosting baselines for irradiation and ambient temperature
mIrr = GradientBoostingRegressor()
mAt = GradientBoostingRegressor()

mIrr.fit(Xs['IRRADIATION'][0], ys['IRRADIATION'][0])
mAt.fit(Xs['AMBIENT_TEMPERATURE'][0], ys['AMBIENT_TEMPERATURE'][0])
I_p = mIrr.predict(Xs['IRRADIATION'][1])
a_p = mAt.predict(Xs['AMBIENT_TEMPERATURE'][1])

# MAE
print(mean_absolute_error(ys['IRRADIATION'][1], I_p))
print(mean_absolute_error(ys['AMBIENT_TEMPERATURE'][1], a_p))
print("-----------")
# MSE
print(mean_squared_error(ys['IRRADIATION'][1], I_p))
print(mean_squared_error(ys['AMBIENT_TEMPERATURE'][1], a_p))
print("-----------")
# RMSE, taken as the square root of the MSE
# (the `squared` argument is deprecated in recent scikit-learn releases)
print(np.sqrt(mean_squared_error(ys['IRRADIATION'][1], I_p)))
print(np.sqrt(mean_squared_error(ys['AMBIENT_TEMPERATURE'][1], a_p)))
print("-----------")
# R-squared
print(r2_score(ys['IRRADIATION'][1], I_p))
print(r2_score(ys['AMBIENT_TEMPERATURE'][1], a_p))
print("-----------")


# for m in ['SVM', 'DT', 'RF', 'GB']:
#     model, metrics = model_testing(irr_X,irr_y, m)
#     mods.append(model)
#     print(f'-----------{m}-----------')
#     print(f"r2:{metrics['r2']}\nMAE:{metrics['MAE']}\nMSE:{metrics['MSE']}\nRMSE:{metrics['RMSE']}")  

0.030222407385012785
0.645704153718081
-----------
0.0034274583141784866
0.7828788764509523
-----------
0.058544498581664245
0.8848044283631001
-----------
0.9599291133296247
0.933176971326146
-----------


Unsurprisingly, Gradient Boosting is one of the best models and goes toe to toe with Random Forest (and is slightly better if we take all metrics into consideration).
More surprisingly, SVM and Decision Tree are also neck and neck, but at the lower end.

Next, we'll add some shallow tuning to see if it makes any difference.

**UPDATE 1:**
It looks like the computation time for my grids is far too high, because the grids combine too many possibilities AND the parameter values are too large.

**Example with SVR**: Fitting with C = 100 takes two or even three times as long as with C = 50.

So I'm lowering my expectations considerably, to get a first look at the parameters and the improvements I could make.
Some computation times measured so far for SVR (sigh): two runs of 800 minutes and one of 1200, with the grid nearly halved between runs.

**UPDATE 2:**
The SVR computation issue came from the fact that a poly kernel doesn't work well with gamma (I still need to expand on why exactly).
It was taking ages for nothing.
This should now be resolved.
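
One way to keep an SVR grid small is to pass `GridSearchCV` a list of dictionaries, so that each kernel is only combined with the parameters searched for it. The sketch below runs on made-up data with illustrative values; it is not the grid used in `src.optimization`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Small synthetic regression problem (made up) so the search runs quickly
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(120, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.05, 120)

# One sub-grid per kernel: gamma is only searched for rbf,
# degree is only searched for poly (where gamma stays at its default)
param_grid = [
    {"kernel": ["rbf"], "C": [1, 10], "gamma": ["scale", 0.1]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3]},
]

search = GridSearchCV(SVR(), param_grid, cv=2, scoring="neg_mean_absolute_error")
search.fit(X, y)

print(len(search.cv_results_["params"]), "candidates")
print(search.best_params_)
```

This evaluates 4 + 4 = 8 candidates, where a single cross product of the same values would give 2 x 2 x 2 x 2 = 16.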

**UPDATE 3:**

After simple tuning, we can boost SVM & DT to a much better fit, bringing them closer to GB & RF.
On the other hand, the parameters used for fitting RF are insufficient or too shallow, which gives an overall worse prediction.

In [1]:
# Trying out other libraries
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from src.data_prep import *
from src.feature_engineering import *

# Reload the data so this cell also runs on a fresh kernel (same calls as the first cell)
dfs = load_sensor("dataset/Plant_2_Weather_Sensor_Data.csv", save=False)
irr_X, irr_y = lone_sensor_for_irradiation(dfs[0])

x_train, x_dev, y_train, y_dev = train_test_split(irr_X, irr_y, test_size=0.2, random_state=42)
extreme = xgb.XGBRegressor()
light = lgb.LGBMRegressor()
percep = MLPRegressor()

extreme.fit(x_train, y_train)
light.fit(x_train, y_train)
percep.fit(x_train, y_train)

# Predictions in order: XGBoost, LightGBM, MLP
preds = []
preds.append(extreme.predict(x_dev))
preds.append(light.predict(x_dev))
preds.append(percep.predict(x_dev))

for index, p in enumerate(preds):
    print(index)
    # scikit-learn metrics expect (y_true, y_pred); R-squared is not symmetric
    print(r2_score(y_dev, p))
    print(mean_absolute_error(y_dev, p))
    mse = mean_squared_error(y_dev, p)
    print(mse)           # MSE
    print(np.sqrt(mse))  # RMSE

# Finish by deeply tuning the best candidate.

Conclusion: plain LightGBM works nearly as well as shallow-tuned GB.

#### Final Tuning

To finish the tuning process, I'll use random search for the last round of parameter tuning, keeping in mind that it should complete within a night, plus a few extra hours at most.

I'll do this for standard Gradient Boosting and for LightGBM as well.
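
A minimal sketch of such a random search with scikit-learn's `RandomizedSearchCV`, where `n_iter` caps the number of candidates and therefore the run time. The data, the parameter ranges and the budget of 5 candidates are illustrative assumptions; the notebook's own `normgb_opt` helper is not shown here and may work differently.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (made up)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 200)

# Distributions to sample from, instead of an exhaustive grid
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 5),
    "learning_rate": uniform(0.01, 0.2),  # samples from [0.01, 0.21]
    "subsample": uniform(0.6, 0.4),       # samples from [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=5,  # the time budget is set here, not by the size of the search space
    cv=2,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best candidate
```

With `cv=2` and `n_iter=5` this runs 10 fits; scaling `n_iter` up is how the search would be sized to fit an overnight run.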


In [1]:
from sklearn.model_selection import train_test_split
from src.optimization import *
from src.data_prep import *
from src.feature_engineering import *

dfs = load_sensor("dataset/Plant_1_Weather_Sensor_Data.csv", save=False)
irr_X, irr_y = sensor_for_irradiation(dfs[0])

x_train, x_dev, y_train, y_dev = train_test_split(irr_X, irr_y, test_size=0.2, random_state=42)
params, model = normgb_opt(x_train, y_train)

# Getting dependency issues between numpy and skopt; might need to switch back to grid/random search

AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
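
The `np.int` alias was removed in NumPy 1.24, so any library that still refers to it raises this error. A commonly used stop-gap, sketched below, is to restore the alias before importing the offending library. That skopt is the library at fault here is taken from the comment in the cell above; pinning NumPy below 1.24 or upgrading the library is the cleaner fix.

```python
import numpy as np

# Stop-gap shim: restore the alias removed in NumPy 1.24.
# This must run before importing the library that still uses np.int.
if not hasattr(np, "int"):
    np.int = int

# Code written against the old alias works again
print(np.int(3.7))
```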

#### Testing & Displaying

## Testing whether we get the same values

After thinking about predicting the AC power output from irradiation or other variables: since we don't have clear and effective links with the modules, we could simplify this by taking the average irradiation curve and applying a coefficient to it to find the AC power.

We could eventually predict a day-to-day trend from the plant's irradiation and the previous days' yield, but I believe we don't have enough data to predict such things.

What's left to be done:
1. Fit the models and load the predicted datasets into a Parquet file
2. Check whether a specific model can calculate a single coefficient between two values (maybe linear regression, since its formula is y = mx + b)
3. If we can find a good model, create a quick & clean way to predict the coefficient.
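
For item 2, a minimal sketch of recovering a single coefficient with linear regression, following y = mx + b. The irradiation and AC power values are synthetic, and the slope of 1200 and intercept of 5 are made up for the example; nothing here is measured from the plant data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): AC power as a noisy linear function of irradiation
rng = np.random.default_rng(0)
irradiation = rng.uniform(0, 1, size=200)
ac_power = 1200.0 * irradiation + 5.0 + rng.normal(0, 1.0, 200)

# Fit y = m * x + b; scikit-learn expects a 2-D feature array
reg = LinearRegression().fit(irradiation.reshape(-1, 1), ac_power)
m, b = reg.coef_[0], reg.intercept_

print(f"m = {m:.1f}, b = {b:.1f}")
```

If the relation is expected to pass through the origin (no power without irradiation), `LinearRegression(fit_intercept=False)` would estimate the coefficient alone.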

Since I don't have any ways to generate more data and not enough data