# Auto Machine Learning Model Selection and Hyperparameter Tuning

This notebook uses the Python package `pycaret` to find the best-performing model for our regression task. After selecting the best model, we tune its hyperparameters to improve the model's performance and our physical understanding of the problem.

First, we need to load the data. If the data is still in `.zip` format, run the cell below to unzip it. Be sure to set `my_path` to the directory into which you cloned this repository.

In [1]:
# Replace with your path
my_path = '/global/u2/s/skygale/'

In [2]:
# Load the ready_sic_sst_data.zip file and extract the data
import zipfile
import os

zip_file_path = os.path.join(my_path, 'MLGEO2024_SeaIcePrediction/data/ready_sic_sst_data.zip')

if os.path.exists(zip_file_path):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall('.')
    print('Data extracted successfully!')
else:
    print(f'File not found: {zip_file_path}')

Data extracted successfully!
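
As a minimal, self-contained sketch of the same extraction step, the snippet below builds a throwaway archive in a temporary directory, lists its members with `namelist()`, and extracts it next to the archive rather than into the current working directory. The helper `extract_beside_archive` and the file names (`demo.zip`, `demo.nc`) are hypothetical placeholders, not part of this repository.

```python
import os
import tempfile
import zipfile


def extract_beside_archive(zip_path):
    """Extract a zip archive into the directory that contains it.

    Returns the list of member names found in the archive.
    """
    target_dir = os.path.dirname(os.path.abspath(zip_path))
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        members = zip_ref.namelist()  # inspect the contents before extracting
        zip_ref.extractall(target_dir)
    return members


# Demo with a throwaway archive (placeholder names, not the real data)
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('demo.nc', 'placeholder contents')

    members = extract_beside_archive(zip_path)
    extracted_ok = os.path.exists(os.path.join(tmp, 'demo.nc'))
    print(members, extracted_ok)
```

Extracting relative to the archive makes the result independent of where the notebook is launched. Whether that location matches the path used to open the `.nc` file depends on how the archive was built, so it is worth checking the member names first.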


In [3]:
# Load the .nc file
import xarray as xr

ds_path = os.path.join(my_path, 'MLGEO2024_SeaIcePrediction/data/ready_sic_sst_data.nc')
ds = xr.open_dataset(ds_path)
ds

For computational efficiency, we compare models and regress sea surface temperature (SST) over a single region of the Arctic. We choose the Barents-Kara Sea, which lies north of Norway and Siberia. Its longitude and latitude ranges are 30°–90°E and 65°–85°N, respectively.

In [4]:
# Select data for the Barents-Kara Sea region (numeric bounds; latitude is sliced north to south)
ds_sel = ds.sel(latitude=slice(85.01, 65), longitude=slice(30, 90.01))
ds_sel
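
Label-based slices in `xarray` follow the stored order of each coordinate, which is why the latitude bounds in the selection above run from high to low. The sketch below illustrates this on a small synthetic grid, assuming latitude is stored north to south. The grid values and the names `toy`, `box`, and `empty` are made up for the example.

```python
import numpy as np
import xarray as xr

# Toy grid (synthetic values): latitude stored north to south, longitude west to east
lat = np.arange(90.0, 59.0, -5.0)   # 90, 85, ..., 60
lon = np.arange(0.0, 121.0, 30.0)   # 0, 30, ..., 120
toy = xr.Dataset(
    {'sst': (('latitude', 'longitude'), np.zeros((lat.size, lon.size)))},
    coords={'latitude': lat, 'longitude': lon},
)

# Slice bounds follow the stored order of each coordinate
box = toy.sel(latitude=slice(85, 65), longitude=slice(30, 90))
print(box.latitude.values)
print(box.longitude.values)

# Reversing the bounds on a descending coordinate selects nothing
empty = toy.sel(latitude=slice(65, 85))
print(empty.latitude.size)
```

Printing `ds.latitude.values[:5]` on the real dataset is a quick way to confirm which order applies before choosing the slice bounds.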

In [5]:
# Flatten to a DataFrame, then drop rows with a missing target value ('sst'), just in case
df = ds_sel.to_dataframe().reset_index()
df = df.dropna(subset=['sst'])

Now we can use `pycaret` to determine the best model for our purposes.

In [6]:
# Use pycaret to train and select models
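from pycaret.regression import (
    setup, compare_models, evaluate_model, tune_model,
    plot_model, save_model, load_model, predict_model,
)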

# Setup the data
s = setup(data=df, target='sst')

# Compare models
best = compare_models()

| | Description | Value |
|---|---|---|
| 0 | Session id | 4851 |
| 1 | Target | sst |
| 2 | Target type | Regression |
| 3 | Original data shape | (1449736, 7) |
| 4 | Transformed data shape | (1449736, 9) |
| 5 | Transformed train set shape | (1014815, 9) |
| 6 | Transformed test set shape | (434921, 9) |
| 7 | Numeric features | 4 |
| 8 | Date features | 1 |
| 9 | Categorical features | 1 |


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| dt | Decision Tree Regressor | 0.037 | 0.0096 | 0.0982 | 0.9988 | 0.0004 | 0.0001 | 0.519 |
| knn | K Neighbors Regressor | 0.0428 | 0.0108 | 0.1038 | 0.9987 | 0.0004 | 0.0002 | 0.552 |
| lightgbm | Light Gradient Boosting Machine | 0.2372 | 0.176 | 0.4195 | 0.9784 | 0.0015 | 0.0009 | 187.187 |
| gbr | Gradient Boosting Regressor | 0.3869 | 0.4425 | 0.6652 | 0.9457 | 0.0024 | 0.0014 | 10.937 |
| ada | AdaBoost Regressor | 0.7918 | 1.111 | 1.053 | 0.8637 | 0.0038 | 0.0029 | 7.643 |
| lr | Linear Regression | 1.1561 | 2.0712 | 1.4392 | 0.7459 | 0.0052 | 0.0042 | 1.856 |
| ridge | Ridge Regression | 1.1561 | 2.0712 | 1.4392 | 0.7459 | 0.0052 | 0.0042 | 1.62 |
| lar | Least Angle Regression | 1.1561 | 2.0712 | 1.4392 | 0.7459 | 0.0052 | 0.0042 | 0.193 |
| br | Bayesian Ridge | 1.1561 | 2.0712 | 1.4392 | 0.7459 | 0.0052 | 0.0042 | 0.239 |
| en | Elastic Net | 1.1703 | 2.1425 | 1.4637 | 0.7372 | 0.0053 | 0.0043 | 1.062 |


The comparison ranks **Decision Tree Regressor** first: it has the lowest MAE (0.037) and RMSE (0.0982) and the highest R2 (0.9988), narrowly ahead of K Neighbors Regressor. Comparing all of the models took 50 minutes in total. We can now evaluate the best model using the built-in function `evaluate_model`.
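
For reference, the sketch below computes four of the metrics reported in the comparison table (MAE, MSE, RMSE, and R2) from their standard definitions in plain Python. The input values are made up for illustration and are not taken from this dataset, and `regression_metrics` is a hypothetical helper, not a `pycaret` function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error

    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}


# Made-up observations and predictions
y_true = [271.0, 272.0, 273.0, 274.0, 275.0]
y_pred = [271.5, 272.0, 272.5, 274.0, 275.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower is better for MAE, MSE, and RMSE, while R2 approaches 1 as the predictions explain more of the variance in the target.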

In [None]:
# Show model evaluation
evaluate_model(best)

We can tune the hyperparameters of this model using `tune_model`.

In [None]:
# Tune the best model
tuned_best = tune_model(best, fold=3)
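
`tune_model` tries candidate hyperparameter settings and scores each one with cross-validation. As a rough illustration of that idea only, the sketch below runs a randomized search over a decision tree with scikit-learn on synthetic data. The search space, `n_iter`, and the data are assumptions chosen for the example, not the grid that `pycaret` uses internally.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (a stand-in for the real features and SST target)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=300)

# Hypothetical search space, chosen for illustration
param_distributions = {
    'max_depth': [2, 4, 6, 8, None],
    'min_samples_leaf': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of randomly sampled settings
    cv=3,             # 3-fold cross-validation, mirroring fold=3 above
    scoring='r2',
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Fewer folds shorten the search at the cost of noisier score estimates, which is a reasonable trade-off on a dataset with over a million rows.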

Finally, we can plot the tuned model for a visual check and save it for future use.

In [None]:
# Plot the model
plot_model(tuned_best, plot='error')

# Save the model
save_model(tuned_best, 'sst_model')

As an example of reusing the saved model, let's load it and make predictions.

In [None]:
# Load the model
loaded_model = load_model('sst_model')

# Make predictions
predictions = predict_model(loaded_model, df)

print(predictions)