In [1]:
import xarray as xr
import numpy as np
import os
import pycaret

In the notebook EDA.ipynb, several variables in the dataset were found to be highly correlated with each other, namely the surface mass concentration (SMASS) variables, so those are dropped before any further analysis. Additionally, DUCMASS and DUCMASS25, as well as SSCMASS and SSCMASS25, were found to be highly correlated, so only DUCMASS and SSCMASS are kept.
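As a rough illustration of the kind of screen described above, the following sketch flags column pairs whose absolute Pearson correlation exceeds a threshold. It runs on synthetic data, not the project's dataset. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic values are all assumptions for illustration; the criterion actually used in EDA.ipynb is not shown in this notebook.

```python
import numpy as np

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for column pairs with |Pearson r| >= threshold.

    Hypothetical helper; the 0.9 default is an assumed cutoff.
    """
    names = list(columns)
    # np.corrcoef treats each row as one variable
    data = np.vstack([np.asarray(columns[n], dtype=float) for n in names])
    corr = np.corrcoef(data)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic stand-ins for three variables (values are made up)
rng = np.random.default_rng(0)
ducmass = rng.random(1000)
synthetic = {
    "DUCMASS": ducmass,
    "DUCMASS25": 0.3 * ducmass + rng.normal(0, 0.005, 1000),  # near-linear copy
    "cape": rng.random(1000),                                  # independent
}

pairs = highly_correlated_pairs(synthetic)
for name_a, name_b, r in pairs:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

When a pair is flagged, keeping one member (here DUCMASS) and dropping the other removes the redundancy without losing much information.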

In [2]:
ds = xr.open_dataset('/home/giantstep5/rjones98/meteorology/ESS569/ai_ready/AI_ready_dataset.nc')
ds = ds.drop_vars(['BCSMASS', 'DUCMASS25', 'DUSMASS', 'DUSMASS25', 'OCSMASS', 'SO2SMASS', 'SO4SMASS', 'SSCMASS25', 'SSSMASS', 'SSSMASS25', 'energy'])
ds_jja = ds.sel(time=slice('2023-06-01', '2023-08-31'))
ds_jja

To highlight differences in training time, this notebook uses data from June, July and August, compared to just January in the notebooks AutoML_Hyperparameter_Tuning.ipynb and Model_Training_Assessment.ipynb. The summer months were chosen because lightning is more frequent over the continental United States during this period than in January.

In [3]:
df_jja = ds_jja.to_dataframe().reset_index()
df_jja.head()

Unnamed: 0,time,lat,lon,cape,precipitation,BCCMASS,DUCMASS,OCCMASS,SO2CMASS,SO4CMASS,SSCMASS,ltg
0,2023-06-01,24.5,-125.0,6.524974,0.0,3.14668e-07,4e-05,1e-06,1.58538e-07,2e-06,3e-06,0.0
1,2023-06-01,24.5,-124.375,11.108364,0.0,3.154829e-07,3.8e-05,1e-06,1.529694e-07,2e-06,4e-06,0.0
2,2023-06-01,24.5,-123.75,10.432619,3.3e-05,3.147456e-07,3.7e-05,1e-06,1.533769e-07,2e-06,4e-06,0.0
3,2023-06-01,24.5,-123.125,6.817513,0.000207,3.14474e-07,3.7e-05,1e-06,1.540366e-07,2e-06,5e-06,0.0
4,2023-06-01,24.5,-122.5,4.350946,0.0,3.147456e-07,3.7e-05,1e-06,1.583439e-07,2e-06,6e-06,0.0


In [4]:
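# June, July and August
from pycaret.regression import setup, compare_models

# Configure the regression experiment: 'ltg' is the target,
# features are normalized, and training runs on the CPU.
regression_setup_jja = setup(data=df_jja, target='ltg', session_id=123,
                             normalize=True, use_gpu=False)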

Unnamed: 0,Description,Value
0,Session id,123
1,Target,ltg
2,Target type,Regression
3,Original data shape,"(3597568, 12)"
4,Transformed data shape,"(3597568, 14)"
5,Transformed train set shape,"(2518297, 14)"
6,Transformed test set shape,"(1079271, 14)"
7,Numeric features,10
8,Date features,1
9,Preprocess,True
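The train and test shapes in the summary above are consistent with a 70/30 split. The short check below reproduces the row counts. It assumes PyCaret's default `train_size` of 0.7 was in effect, since the `setup` call does not override it; the variable names are illustrative only.

```python
n_rows = 3_597_568   # rows in "Original data shape" from the setup summary
train_size = 0.7     # assumed: PyCaret default, not set explicitly in setup()

n_train = int(n_rows * train_size)  # fractional row is truncated
n_test = n_rows - n_train           # the remainder goes to the test set

print(n_train, n_test)
```

These values match the 2,518,297 training rows and 1,079,271 test rows reported in the table.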


In [5]:
# Compare all available regressors except Random Forest ('rf')
best_model = compare_models(exclude=['rf'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.0012,0.0001,0.0099,0.4626,0.0081,6.5552,36.992
lightgbm,Light Gradient Boosting Machine,0.0012,0.0001,0.0102,0.4242,0.0084,5.7538,2.288
gbr,Gradient Boosting Regressor,0.0013,0.0001,0.0105,0.3876,0.0086,5.8868,113.032
knn,K Neighbors Regressor,0.0011,0.0001,0.0107,0.3622,0.0089,4.9784,8.175
lr,Linear Regression,0.0021,0.0001,0.0121,0.1895,0.01,7.9203,0.606
br,Bayesian Ridge,0.0021,0.0001,0.0121,0.1895,0.01,7.9201,0.386
ridge,Ridge Regression,0.0021,0.0001,0.0121,0.1895,0.01,7.9203,0.48
lar,Least Angle Regression,0.0021,0.0001,0.0121,0.1895,0.01,7.9203,0.3
omp,Orthogonal Matching Pursuit,0.0015,0.0001,0.0122,0.1801,0.0101,6.3863,0.309
huber,Huber Regressor,0.0011,0.0002,0.0132,0.0322,0.0111,1.2247,4.564


Using this expanded dataset, the Extra Trees Regressor and Light Gradient Boosting Machine remain the top two choices. The error metrics are slightly different, but, as can be seen in the far-right column (TT (Sec)), the training time has increased dramatically for the Extra Trees Regressor.

While these are not large periods of time for this example (2.3 seconds for the Light Gradient Boosting Machine versus 37 seconds for the Extra Trees Regressor), this is an extremely small subset of data compared to what would be used in real models. Other lightning models typically contain multiple years of data, and sometimes cover greater spatial extents than the one used in this notebook. As such, this points to the Light Gradient Boosting Machine being a much better choice than the Extra Trees Regressor: the difference in performance is minuscule (an R2 of 0.46 versus 0.42), but the Extra Trees Regressor takes around 16 times longer to train.
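The comparison above can be reproduced directly from the `compare_models` table. The snippet below copies the TT (Sec) and R2 values for the two leading models and computes the training-time ratio and the R2 gap. The dictionary names are illustrative only.

```python
# Values copied from the compare_models output table
tt_seconds = {"et": 36.992, "lightgbm": 2.288}  # TT (Sec)
r2 = {"et": 0.4626, "lightgbm": 0.4242}         # R2

ratio = tt_seconds["et"] / tt_seconds["lightgbm"]
r2_gap = r2["et"] - r2["lightgbm"]

print(f"Extra Trees takes {ratio:.1f}x as long to train")  # 16.2x
print(f"R2 gap (et - lightgbm): {r2_gap:.4f}")             # 0.0384
```

Whether this ratio holds as the dataset grows is not tested here; training time for the two algorithms need not scale identically with the number of rows.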