In [1]:
import xarray as xr
import numpy as np
import os
import pycaret

The notebook EDA.ipynb showed that several variables in the dataset are highly correlated with each other, namely the surface mass concentration (SMASS) variables, so those are dropped before any further analysis. In addition, DUCMASS and DUCMASS25, as well as SSCMASS and SSCMASS25, were found to be highly correlated, so only DUCMASS and SSCMASS are kept.
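
As an illustration of this kind of screening, the sketch below lists column pairs whose absolute Pearson correlation exceeds a threshold and picks the second member of each pair for dropping. It is not the procedure used in EDA.ipynb: the helper `correlated_pairs`, the 0.9 threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for each column pair with |Pearson r| > threshold.

    Hypothetical helper; the threshold is an assumed value, not one from EDA.ipynb.
    """
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if r > threshold:
                pairs.append((a, b, r))
    return pairs


# Synthetic stand-in data (NOT the real dataset): DUCMASS25 is built to track DUCMASS.
rng = np.random.default_rng(0)
ducmass = rng.random(1000)
demo = pd.DataFrame({
    "DUCMASS": ducmass,
    "DUCMASS25": 0.3 * ducmass + 0.01 * rng.random(1000),
    "cape": rng.random(1000),
})

pairs = correlated_pairs(demo)
# Keep the first column of each pair and drop the second.
to_drop = sorted({b for _, b, _ in pairs})
print(pairs)
print(to_drop)
```

Keeping the first column of each pair is an arbitrary convention here; in practice the choice of which variable to keep would depend on its physical meaning.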

In [2]:
ds = xr.open_dataset('/home/giantstep5/rjones98/meteorology/ESS569/ai_ready/AI_ready_dataset.nc')
ds = ds.drop_vars(['BCSMASS', 'DUCMASS25', 'DUSMASS', 'DUSMASS25', 'OCSMASS', 'SO2SMASS', 'SO4SMASS', 'SSCMASS25', 'SSSMASS', 'SSSMASS25', 'energy'])
ds

To highlight differences in training time, this notebook uses data from January, February, and March, whereas the notebooks AutoML_Hyperparameter_Tuning.ipynb and Model_Training_Assessment.ipynb use January only.
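
The next cell selects the months by row count with `head`, which relies on the rows being sorted by time. Those row counts are consistent with 31 and 90 time steps of 39,104 grid points each, which suggests daily data, although that is an inference rather than something stated here. A sketch of an equivalent selection by date, which does not depend on row order, is shown below on a toy frame; the helper `select_months` and the toy data are hypothetical.

```python
import pandas as pd


def select_months(df, start, end, time_col="time"):
    """Keep rows whose timestamp falls in the half-open interval [start, end)."""
    t = pd.to_datetime(df[time_col])
    mask = (t >= pd.Timestamp(start)) & (t < pd.Timestamp(end))
    return df.loc[mask]


# Toy stand-in: 120 daily time steps with 4 grid points each (not the real grid).
times = pd.date_range("2023-01-01", periods=120, freq="D")
toy = pd.DataFrame({"time": times.repeat(4), "ltg": 0.0})

toy_jan = select_months(toy, "2023-01-01", "2023-02-01")
toy_jfm = select_months(toy, "2023-01-01", "2023-04-01")
print(len(toy_jan), len(toy_jfm))
```

On a frame that is already sorted by time, this returns the same rows as `head` with the matching row count.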

In [3]:
df = ds.to_dataframe().reset_index()
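# Rows are ordered by time, so head() selects the earliest time steps.
# 1,212,224 = 31 x 39,104 rows (January); 3,519,360 = 90 x 39,104 rows (January through March).
df_jan = df.head(1212224)
df_jfm = df.head(3519360)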
df_jfm.head()

Unnamed: 0,time,lat,lon,cape,precipitation,BCCMASS,DUCMASS,OCCMASS,SO2CMASS,SO4CMASS,SSCMASS,ltg
0,2023-01-01,24.5,-125.0,9.55014e-13,0.0,2.505434e-07,1e-06,9.709565e-07,1.464562e-07,1e-06,8e-06,0.0
1,2023-01-01,24.5,-124.375,0.1126244,0.0,2.501359e-07,1e-06,9.646701e-07,1.444383e-07,1e-06,7e-06,0.0
2,2023-01-01,24.5,-123.75,0.2252488,0.0,2.515717e-07,1e-06,9.645149e-07,1.418384e-07,1e-06,6e-06,0.0
3,2023-01-01,24.5,-123.125,0.2778068,0.0,2.527164e-07,1e-06,9.641268e-07,1.414309e-07,1e-06,5e-06,0.0
4,2023-01-01,24.5,-122.5,0.6719603,0.0,2.52212e-07,1e-06,9.604015e-07,1.408294e-07,1e-06,4e-06,0.0


In [4]:
# January only
from pycaret.regression import setup, compare_models

regression_setup_jan = setup(data=df_jan, target='ltg', session_id=123, 
                         normalize=True, use_gpu=False)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,ltg
2,Target type,Regression
3,Original data shape,"(1212224, 12)"
4,Transformed data shape,"(1212224, 14)"
5,Transformed train set shape,"(848556, 14)"
6,Transformed test set shape,"(363668, 14)"
7,Numeric features,10
8,Date features,1
9,Preprocess,True


In [5]:
# Compare all available regressors except Random Forest; exclude takes a list of model IDs
best_model = compare_models(exclude=['rf'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.0004,0.0001,0.0082,0.5551,0.0061,8.5615,7.149
lightgbm,Light Gradient Boosting Machine,0.0004,0.0001,0.0082,0.554,0.0061,7.2005,1.154
gbr,Gradient Boosting Regressor,0.0005,0.0001,0.009,0.4652,0.0068,10.1144,35.916
knn,K Neighbors Regressor,0.0004,0.0001,0.0092,0.4362,0.007,6.9901,2.375
lr,Linear Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.311
br,Bayesian Ridge,0.0015,0.0001,0.0115,0.1289,0.009,17.4168,0.139
ridge,Ridge Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.243
lar,Least Angle Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.107
omp,Orthogonal Matching Pursuit,0.0012,0.0001,0.0115,0.1167,0.009,12.491,0.106
dt,Decision Tree Regressor,0.0005,0.0001,0.0116,0.0975,0.0087,7.6535,2.168


In [6]:
# January, February, and March
from pycaret.regression import setup, compare_models

regression_setup_jfm = setup(data=df_jfm, target='ltg', session_id=123, 
                         normalize=True, use_gpu=False)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,ltg
2,Target type,Regression
3,Original data shape,"(3519360, 12)"
4,Transformed data shape,"(3519360, 14)"
5,Transformed train set shape,"(2463552, 14)"
6,Transformed test set shape,"(1055808, 14)"
7,Numeric features,10
8,Date features,1
9,Preprocess,True


In [7]:
# Same comparison on the three-month dataset; exclude takes a list of model IDs
best_model = compare_models(exclude=['rf'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.0004,0.0001,0.009,0.4805,0.0065,8.5354,33.738
lightgbm,Light Gradient Boosting Machine,0.0004,0.0001,0.0092,0.4581,0.0067,7.3899,2.661
knn,K Neighbors Regressor,0.0004,0.0001,0.0099,0.3791,0.0072,6.6043,8.785
gbr,Gradient Boosting Regressor,0.0005,0.0001,0.0099,0.379,0.0072,8.8275,109.068
lr,Linear Regression,0.0015,0.0001,0.0118,0.1144,0.0089,16.5806,0.495
br,Bayesian Ridge,0.0015,0.0001,0.0118,0.1144,0.0089,16.58,0.379
ridge,Ridge Regression,0.0015,0.0001,0.0118,0.1144,0.0089,16.5806,0.303
lar,Least Angle Regression,0.0015,0.0001,0.0118,0.1144,0.0089,16.5806,0.286
omp,Orthogonal Matching Pursuit,0.0011,0.0001,0.0119,0.1027,0.009,11.1558,0.283
en,Elastic Net,0.0009,0.0002,0.0125,-0.0,0.0096,1.2638,0.315


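With this expanded dataset, the Extra Trees Regressor and the Light Gradient Boosting Machine remain the top two choices. The error metrics differ only slightly, but, as the far-right column (TT (Sec)) shows, the training time has increased dramatically for the Extra Trees Regressor: from about 7 seconds on the January data to about 34 seconds, compared with an increase from about 1.2 to about 2.7 seconds for the Light Gradient Boosting Machine.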

While these are not long training times in this example (about 2.7 seconds for the Light Gradient Boosting Machine versus about 34 seconds for the Extra Trees Regressor), this is an extremely small subset of the data that would be used in real models. Other lightning models typically use multiple years of data and sometimes cover greater spatial extents than this notebook does. This points to the Light Gradient Boosting Machine being a much better choice than the Extra Trees Regressor: the difference in performance is minuscule (R2 of 0.4581 versus 0.4805), but the Extra Trees Regressor takes around 13 times longer to train.
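
The ratios behind this comparison can be checked with a few lines of arithmetic. The values below are copied from the TT (Sec) column of the two compare_models tables and from the dataset shapes reported by setup; the variable names are only illustrative.

```python
# TT (Sec) values copied from the two compare_models output tables
tt_jan = {"et": 7.149, "lightgbm": 1.154}
tt_jfm = {"et": 33.738, "lightgbm": 2.661}

# Row counts copied from the "Original data shape" entries in the setup output
rows_jan, rows_jfm = 1_212_224, 3_519_360

data_growth = rows_jfm / rows_jan                      # how much more data
et_growth = tt_jfm["et"] / tt_jan["et"]                # Extra Trees slowdown
lgbm_growth = tt_jfm["lightgbm"] / tt_jan["lightgbm"]  # LightGBM slowdown
et_vs_lgbm = tt_jfm["et"] / tt_jfm["lightgbm"]         # relative cost on Jan-Mar data

print(f"data grew {data_growth:.1f}x")
print(f"Extra Trees time grew {et_growth:.1f}x")
print(f"LightGBM time grew {lgbm_growth:.1f}x")
print(f"Extra Trees / LightGBM on Jan-Mar: {et_vs_lgbm:.1f}x")
```

With roughly 2.9 times as much data, the Extra Trees time grows by about 4.7 times while the Light Gradient Boosting Machine time grows by about 2.3 times, and the ratio between the two on the three-month data is about 12.7. Two data sizes are too few to establish a scaling law, so this is only indicative of how the gap might widen on multi-year datasets.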