In [1]:
import xarray as xr
import numpy as np
import os
import pycaret

In the notebook EDA.ipynb, several variables in the dataset were found to be highly correlated with each other, namely the surface mass concentration (SMASS) variables, so those are dropped before any further analysis. Additionally, DUCMASS and DUCMASS25, as well as SSCMASS and SSCMASS25, were found to be highly correlated, so of those pairs only DUCMASS and SSCMASS are kept.
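
As a rough illustration of how highly correlated pairs can be flagged, here is a minimal sketch using NumPy. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic variables are assumptions made for this example; they are not taken from EDA.ipynb.

```python
import numpy as np

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair whose |Pearson r| exceeds threshold."""
    names = list(columns)
    # Each row of the stacked array is one variable, so np.corrcoef
    # returns a len(names) x len(names) correlation matrix.
    corr = np.corrcoef(np.vstack([columns[n] for n in names]))
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic stand-ins (hypothetical names): var_a25 is nearly a scaled copy of var_a.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
demo = {
    "var_a": base,
    "var_a25": 0.4 * base + rng.normal(scale=0.01, size=1000),
    "var_b": rng.normal(size=1000),
}
pairs = highly_correlated_pairs(demo)
print(pairs)
```

Only one member of each flagged pair would then need to be kept, which is the reasoning behind the `drop_vars` call in the next cell.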

In [2]:
ds = xr.open_dataset('/home/giantstep5/rjones98/meteorology/ESS569/ai_ready/AI_ready_dataset.nc')
ds = ds.drop_vars(['BCSMASS', 'DUCMASS25', 'DUSMASS', 'DUSMASS25', 'OCSMASS', 'SO2SMASS', 'SO4SMASS', 'SSCMASS25', 'SSSMASS', 'SSSMASS25', 'energy'])
ds

PyCaret requires pandas DataFrames, so the xarray Dataset must be converted into a pandas DataFrame. For computational efficiency, only the month of January is used in this notebook, as some of the models take a very long time to run on the entire dataset.

In [3]:
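# Flatten the Dataset into one row per (time, lat, lon) combination
df = ds.to_dataframe().reset_index()
# The first 1,212,224 rows span 2023-01-01 00:00 through 2023-01-31 21:00 (January)
df_jan = df.head(1212224)
df_jan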

Unnamed: 0,time,lat,lon,cape,precipitation,BCCMASS,DUCMASS,OCCMASS,SO2CMASS,SO4CMASS,SSCMASS,ltg
0,2023-01-01 00:00:00,24.5,-125.000,9.550140e-13,0.000000,2.505434e-07,0.000001,9.709565e-07,1.464562e-07,0.000001,0.000008,0.0
1,2023-01-01 00:00:00,24.5,-124.375,1.126244e-01,0.000000,2.501359e-07,0.000001,9.646701e-07,1.444383e-07,0.000001,0.000007,0.0
2,2023-01-01 00:00:00,24.5,-123.750,2.252488e-01,0.000000,2.515717e-07,0.000001,9.645149e-07,1.418384e-07,0.000001,0.000006,0.0
3,2023-01-01 00:00:00,24.5,-123.125,2.778068e-01,0.000000,2.527164e-07,0.000001,9.641268e-07,1.414309e-07,0.000001,0.000005,0.0
4,2023-01-01 00:00:00,24.5,-122.500,6.719603e-01,0.000000,2.522120e-07,0.000001,9.604015e-07,1.408294e-07,0.000001,0.000004,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1212219,2023-01-31 21:00:00,50.0,-69.375,1.365153e-12,0.000000,2.579516e-07,0.000006,8.143725e-07,1.394857e-06,0.000002,0.000001,0.0
1212220,2023-01-31 21:00:00,50.0,-68.750,1.365177e-12,0.000000,2.546919e-07,0.000006,8.076204e-07,1.254383e-06,0.000002,0.000001,0.0
1212221,2023-01-31 21:00:00,50.0,-68.125,3.003348e-02,0.000000,2.576993e-07,0.000006,8.182531e-07,1.291946e-06,0.000002,0.000002,0.0
1212222,2023-01-31 21:00:00,50.0,-67.500,1.365177e-12,0.000000,2.713587e-07,0.000006,8.732787e-07,1.734635e-06,0.000002,0.000002,0.0
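
Selecting January with `head` relies on the rows being ordered by time. As an alternative that does not depend on row order, the subset could be selected by timestamp. The sketch below assumes only that the `time` column holds datetimes; the miniature DataFrame `demo` is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the flattened dataset: two January rows, one February row.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 00:00", "2023-02-01 00:00", "2023-01-31 21:00"]),
    "ltg": [0.0, 1.0, 0.0],
})

# Keep rows whose timestamp falls in January 2023, regardless of where they sit in the frame.
jan_mask = (demo["time"] >= pd.Timestamp("2023-01-01")) & (demo["time"] < pd.Timestamp("2023-02-01"))
demo_jan = demo.loc[jan_mask].reset_index(drop=True)
print(len(demo_jan))
```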


In [4]:
# Set up the PyCaret environment
from pycaret.regression import *

regression_setup = setup(data=df_jan, target='ltg', session_id=123,
                         normalize=True, use_gpu=False)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,ltg
2,Target type,Regression
3,Original data shape,"(1212224, 12)"
4,Transformed data shape,"(1212224, 14)"
5,Transformed train set shape,"(848556, 14)"
6,Transformed test set shape,"(363668, 14)"
7,Numeric features,10
8,Date features,1
9,Preprocess,True
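
The setup output shows roughly a 70/30 train/test split, and `normalize=True` requests feature scaling. The sketch below shows the general idea with NumPy on synthetic data, assuming z-score scaling with statistics estimated on the training rows only. It is a simplified illustration, and PyCaret's own pipeline may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # synthetic features

# Shuffle row indices and hold out 30% of the rows as a test set.
idx = rng.permutation(len(X))
n_train = len(X) * 7 // 10
X_train, X_test = X[idx[:n_train]], X[idx[n_train:]]

# z-score: estimate mean and standard deviation on the training rows only,
# then apply the same transform to both sets so no test information leaks in.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_z = (X_train - mean) / std
X_test_z = (X_test - mean) / std

print(X_train_z.shape, X_test_z.shape)
```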


Here is the list of models available for testing.

In [5]:
models()

ID,Name,Reference,Turbo
lr,Linear Regression,sklearn.linear_model._base.LinearRegression,True
lasso,Lasso Regression,sklearn.linear_model._coordinate_descent.Lasso,True
ridge,Ridge Regression,sklearn.linear_model._ridge.Ridge,True
en,Elastic Net,sklearn.linear_model._coordinate_descent.Elast...,True
lar,Least Angle Regression,sklearn.linear_model._least_angle.Lars,True
llar,Lasso Least Angle Regression,sklearn.linear_model._least_angle.LassoLars,True
omp,Orthogonal Matching Pursuit,sklearn.linear_model._omp.OrthogonalMatchingPu...,True
br,Bayesian Ridge,sklearn.linear_model._bayes.BayesianRidge,True
ard,Automatic Relevance Determination,sklearn.linear_model._bayes.ARDRegression,False
par,Passive Aggressive Regressor,sklearn.linear_model._passive_aggressive.Passi...,True


A subset of these models is compared, as no GPU is available; in particular, the Random Forest (`rf`) is excluded. Even so, 17 models are tested, which covers a wide variety of approaches.

In [6]:
# Compare the available models with cross-validation, skipping Random Forest
best_model = compare_models(exclude=['rf'])

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.0004,0.0001,0.0082,0.5551,0.0061,8.5615,7.341
lightgbm,Light Gradient Boosting Machine,0.0004,0.0001,0.0082,0.554,0.0061,7.2005,1.174
gbr,Gradient Boosting Regressor,0.0005,0.0001,0.009,0.4652,0.0068,10.1144,35.34
knn,K Neighbors Regressor,0.0004,0.0001,0.0092,0.4362,0.007,6.9901,2.528
lr,Linear Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.315
br,Bayesian Ridge,0.0015,0.0001,0.0115,0.1289,0.009,17.4168,0.14
ridge,Ridge Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.239
lar,Least Angle Regression,0.0015,0.0001,0.0115,0.1289,0.009,17.4182,0.11
omp,Orthogonal Matching Pursuit,0.0012,0.0001,0.0115,0.1167,0.009,12.491,0.107
dt,Decision Tree Regressor,0.0005,0.0001,0.0116,0.0975,0.0087,7.6535,2.287
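
For reference, most of the metric columns above can be reproduced from their textbook definitions. The sketch below is a minimal NumPy version with a hypothetical helper name, `regression_metrics`. MAPE is left out because the handling of zero-valued targets varies between implementations, and many `ltg` values here are zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Textbook MAE, MSE, RMSE, R2 and RMSLE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(1.0 - ss_res / ss_tot),
        # RMSLE assumes non-negative targets and predictions.
        "RMSLE": float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))),
    }

scores = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 2.0])
print(scores)
```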


The Extra Trees Regressor was found to be the best model. It is a machine learning algorithm for regression problems that uses multiple randomized decision trees to make predictions. It is similar to a Random Forest model, but with a higher level of randomness, which comes from choosing split points at each node at random as opposed to searching for the optimal point. The Light Gradient Boosting Machine is almost as good, matching the Extra Trees Regressor in MAE, MSE, RMSE and RMSLE, with an R$^2$ value only about one thousandth lower (0.5540 versus 0.5551). LightGBM also trains much faster (1.174 s versus 7.341 s) and has a better MAPE (7.2005 versus 8.5615), so both the Extra Trees Regressor and the Light Gradient Boosting Machine will be used.
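
To make the difference in split selection concrete, here is a simplified single-feature sketch in NumPy. The function names and synthetic data are assumptions for illustration. Real implementations also sample features and combine many trees, so this shows only the idea of an optimal versus a random threshold.

```python
import numpy as np

def sse_after_split(x, y, threshold):
    """Sum of squared errors when each side of the split predicts its own mean."""
    left, right = y[x <= threshold], y[x > threshold]
    return sum(float(np.sum((part - part.mean()) ** 2)) for part in (left, right) if part.size)

def best_split(x, y):
    """Optimal choice: scan candidate thresholds and keep the one with the lowest SSE."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0  # midpoints between consecutive distinct values
    return min(candidates, key=lambda t: sse_after_split(x, y, t))

def random_split(x, rng):
    """Extra-Trees-style choice: draw the threshold uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

# Synthetic data with a step in y at x = 0.6.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = (x > 0.6).astype(float) + rng.normal(scale=0.05, size=200)

t_best = best_split(x, y)
t_rand = random_split(x, rng)
print("optimal threshold:", t_best, "SSE:", sse_after_split(x, y, t_best))
print("random threshold: ", t_rand, "SSE:", sse_after_split(x, y, t_rand))
```

A single random threshold is never better than the optimal one on the training data, but it is far cheaper to compute, and averaging many such randomized trees reduces variance.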

In [7]:
et = create_model('et')
lightgbm = create_model('lightgbm')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0004,0.0001,0.0085,0.5141,0.0063,8.5404
1,0.0004,0.0001,0.008,0.5389,0.0063,12.4512
2,0.0004,0.0001,0.0073,0.5846,0.0059,9.0657
3,0.0004,0.0001,0.0079,0.5687,0.0057,6.1768
4,0.0004,0.0001,0.0087,0.5907,0.0063,7.389
5,0.0004,0.0001,0.0087,0.4828,0.0064,9.4099
6,0.0004,0.0001,0.0086,0.549,0.0064,9.2486
7,0.0004,0.0001,0.0097,0.5434,0.0067,9.1494
8,0.0004,0.0001,0.0075,0.5372,0.0057,6.275
9,0.0004,0.0,0.0068,0.642,0.0055,7.9093


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0004,0.0001,0.0075,0.6179,0.0059,7.2422
1,0.0004,0.0001,0.008,0.5349,0.0063,9.2731
2,0.0004,0.0001,0.0079,0.5153,0.0061,7.2892
3,0.0004,0.0001,0.0079,0.5734,0.0057,5.3011
4,0.0004,0.0001,0.0088,0.5846,0.0064,7.0323
5,0.0004,0.0001,0.0085,0.5074,0.0063,7.4631
6,0.0004,0.0001,0.009,0.5077,0.0067,8.041
7,0.0004,0.0001,0.0094,0.5685,0.0065,8.418
8,0.0004,0.0001,0.0075,0.5446,0.0058,5.6358
9,0.0004,0.0001,0.0073,0.5854,0.0058,6.309


In [8]:
evaluate_model(et)

[Interactive widget: plot type selector]

In [9]:
evaluate_model(lightgbm)

[Interactive widget: plot type selector]

In [11]:
tuned_lightgbm = tune_model(lightgbm)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.0006,0.0001,0.0096,0.3871,0.0075,6.6842
1,0.0006,0.0001,0.0093,0.3681,0.0074,9.7474
2,0.0006,0.0001,0.0097,0.2734,0.0075,9.4868
3,0.0006,0.0001,0.0101,0.2999,0.0074,6.7642
4,0.0006,0.0001,0.0114,0.3009,0.0084,8.5861
5,0.0006,0.0001,0.01,0.3165,0.0076,8.2804
6,0.0006,0.0001,0.0103,0.3492,0.0078,11.2808
7,0.0006,0.0001,0.0116,0.3509,0.0081,7.8521
8,0.0006,0.0001,0.0088,0.3648,0.0068,5.4946
9,0.0006,0.0001,0.0091,0.3567,0.0073,6.7735


Fitting 10 folds for each of 10 candidates, totalling 100 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.056237 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2220
[LightGBM] [Info] Number of data points in the train set: 763700, number of used features: 11
[LightGBM] [Info] Start training from score 0.000467
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.556951 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2220
[LightGBM] [Info] Number of data points in the train set: 763701, number of used features: 11
[LightGBM] [Info] Start training from score 0.000468
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.091715 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enou

In [14]:
print(tuned_lightgbm.get_params())

{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}

A tuned version of the Light Gradient Boosting Machine was run, but it did not score better than the original model: its per-fold R$^2$ ranges from roughly 0.27 to 0.39, compared with 0.51 to 0.62 for the untuned model. Going forward, the original model will therefore continue to be used. Tuning the Extra Trees Regressor takes an excessive amount of time, so that model will not be tuned.
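
The comparison can be checked by averaging the per-fold R$^2$ values from the two LightGBM tables above. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# R2 per fold, copied from the untuned and tuned LightGBM tables.
r2_untuned = [0.6179, 0.5349, 0.5153, 0.5734, 0.5846, 0.5074, 0.5077, 0.5685, 0.5446, 0.5854]
r2_tuned = [0.3871, 0.3681, 0.2734, 0.2999, 0.3009, 0.3165, 0.3492, 0.3509, 0.3648, 0.3567]

for label, scores in (("untuned", r2_untuned), ("tuned", r2_tuned)):
    print(f"{label}: mean R2 = {mean(scores):.4f}, sample std = {stdev(scores):.4f}")
```

The means come out at about 0.55 for the untuned model and about 0.34 for the tuned candidate, a gap much larger than the fold-to-fold spread.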

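As for the hyperparameters, the values printed for `tuned_lightgbm` appear to be identical to those of the untuned model, so tuning did not actually change anything. This would be consistent with `tune_model` returning the original estimator when no tuned candidate scores better (its `choose_better` option), although that was not verified here.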
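
Rather than comparing the printed dictionaries by eye, the two parameter sets could be compared programmatically. The helper `changed_params` and the example dictionaries below are hypothetical; they only mimic the shape of `get_params()` output.

```python
def changed_params(before, after):
    """Return {name: (old, new)} for every hyperparameter whose value differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}

# Hypothetical parameter dictionaries, shaped like the output of get_params().
untuned = {"learning_rate": 0.1, "n_estimators": 100, "num_leaves": 31}
same = dict(untuned)
different = {"learning_rate": 0.05, "n_estimators": 100, "num_leaves": 63}

print(changed_params(untuned, same))       # empty dict: nothing was changed
print(changed_params(untuned, different))  # lists only the parameters that differ
```

In this notebook the call would be `changed_params(lightgbm.get_params(), tuned_lightgbm.get_params())`; an empty result would confirm that the two models share the same hyperparameters.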