# Project: Kaggle Flood Competition

This project uses the AutoML framework AutoGluon to compete in the Kaggle competition [Regression with a Flood Prediction Dataset](https://www.kaggle.com/competitions/playground-series-s4e5/overview). The frameworks and platforms used are listed below.

- AutoGluon - An AutoML framework for prototyping ML ensemble models.
- SageMaker Studio - Jupyter Lab and Code Editor were used to develop this notebook and the accompanying scripts.
- GitHub - The code repository used for this project.
- SageMaker Sklearn Docker Image - A prebuilt Amazon SageMaker container used to run AutoGluon model training jobs on AWS spot instances. 

Results: Using a SageMaker training job reduced training costs by almost 50%. This is a regression task and the performance metric is $R^2$. My model's score was 0.86884; the top score was 0.86905.
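
For reference, $R^2$ compares the model's squared error with the squared error of always predicting the mean of the target. The sketch below is a minimal pure-Python illustration; the function name `r2_score_simple` and the sample values are made up here and are not part of the project code.

```python
def r2_score_simple(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not competition data.
y_true = [0.40, 0.50, 0.60, 0.55]
y_pred = [0.42, 0.48, 0.59, 0.56]
print(round(r2_score_simple(y_true, y_pred), 4))
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than predicting the mean.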


# Update Python Packages

The code below installs the latest version of AutoGluon. SageMaker uses Conda for package management.

In [2]:
!conda list autogluon

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
autogluon                 0.8.2              pyhd8ed1ab_4    conda-forge
autogluon.common          0.8.2              pyhd8ed1ab_4    conda-forge
autogluon.core            0.8.2           light_py310h2d11d36_6    conda-forge
autogluon.features        0.8.2              pyhd8ed1ab_3    conda-forge
autogluon.multimodal      0.8.2              pyha770c72_5    conda-forge
autogluon.tabular         0.8.2              pyha770c72_4    conda-forge
autogluon.timeseries      0.8.2              pyhd8ed1ab_4    conda-forge


In [3]:
!pip uninstall -y autogluon
!pip uninstall -y autogluon.multimodal
!pip uninstall -y autogluon.tabular
!pip uninstall -y autogluon.timeseries 
!pip install -qU bokeh==2.0.1 --progress-bar off
!pip install -qU autogluon.tabular[lightgbm,catboost,xgboost] --progress-bar off

Found existing installation: autogluon 0.8.2
Uninstalling autogluon-0.8.2:
  Successfully uninstalled autogluon-0.8.2
Found existing installation: autogluon.multimodal 0.8.2
Uninstalling autogluon.multimodal-0.8.2:
  Successfully uninstalled autogluon.multimodal-0.8.2
Found existing installation: autogluon.tabular 0.8.2
Uninstalling autogluon.tabular-0.8.2:
  Successfully uninstalled autogluon.tabular-0.8.2
Found existing installation: autogluon.timeseries 0.8.2
Uninstalling autogluon.timeseries-0.8.2:
  Successfully uninstalled autogluon.timeseries-0.8.2


AutoGluon 1.1.0 uses LightGBM 4.3.0, so the default LightGBM install is upgraded below.

In [4]:
!conda list lightgbm

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
lightgbm                  3.3.5           py310heca2aa9_0    conda-forge


In [5]:
!pip install -qU lightgbm==4.3.0 --progress-bar off
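
After the installs, it can be worth confirming which versions the running kernel will actually import. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip in the cells above.
for name in ("autogluon.tabular", "lightgbm"):
    print(name, installed_version(name))
```

If a package was already imported before the upgrade, the kernel may need a restart before the new version takes effect.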

# Setup

The necessary packages are imported and the project settings are defined below.

In [None]:
import pickle
import tarfile

import boto3
import pandas as pd
import sagemaker
from autogluon.tabular import TabularPredictor
from sagemaker import get_execution_role, image_uris
from sagemaker.sklearn import SKLearn
from util import DataUtil

project_bucket = '< YOUR S3 BUCKET HERE >'
train_bucket = 'train'
train_file = 'train.csv'
test_file = 'test.csv'
model_folder = 'model'
instance_type = 'ml.m5.12xlarge'
n_jobs = 48
target_variable = 'FloodProbability'

VERSION = 'v7'
TRAIN_FRACTION = 1.0

model_output = 's3://{}/{}'.format(project_bucket, model_folder)

Retrieve the AWS Scikit-learn Docker image that will be used to train the AutoGluon model.

In [7]:
image_uri = image_uris.retrieve(framework='sklearn', region='us-east-1',
                    version='1.2-1', py_version='py3',
                    image_scope='training',
                    instance_type=instance_type)
image_uri

'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3'

# Train AutoGluon Models

Submit a training job for the AutoGluon model. The training script can be found in the `container_scripts` directory that is part of this distribution.

The training script creates an AutoGluon `TabularPredictor` instance. By default, AutoGluon prototypes many algorithms, including LGBM, CatBoost, XGB, random forest, KNN, and linear models. For this project, I limited the prototyped models to LGBM, CatBoost, and XGB. Using the best performers, AutoGluon also creates bagged, stacked, and weighted ensembles. See the `train_v7.py` script in the `container_scripts` directory for details.
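
To give a rough idea of what such a training script might contain, here is a hedged sketch. It is not the contents of `train_v7.py`: the function names, the `best_quality` preset, the empty per-model settings, and `refit_full=True` are all assumptions made for illustration. What it relies on is that SageMaker passes the estimator's `hyperparameters` to the entry point as command-line flags, and that AutoGluon selects model families through the `hyperparameters` argument of `fit` (`GBM` for LightGBM, `CAT` for CatBoost, `XGB` for XGBoost).

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that SageMaker passes as command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_jobs", type=int, default=1)  # parsed but unused in this sketch
    parser.add_argument("--training_fraction", type=float, default=1.0)
    parser.add_argument("--time_limit", type=int, default=3600)
    parser.add_argument("--version", type=str, default="v0")
    return parser.parse_args(argv)


def build_fit_kwargs(args):
    """Assemble keyword arguments for TabularPredictor.fit (assumed settings)."""
    return {
        "time_limit": args.time_limit,
        "presets": "best_quality",  # assumption: turns on bagging and stacking
        # Restrict prototyping to the three gradient-boosting families.
        "hyperparameters": {"GBM": {}, "CAT": {}, "XGB": {}},
        "refit_full": True,  # assumption: yields the *_FULL models
    }


def train(train_data, label, args):
    """Fit a predictor; autogluon is imported lazily so the helpers run without it."""
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        eval_metric="root_mean_squared_error",
        # In a training job this folder would normally live under the
        # directory named by the SM_MODEL_DIR environment variable.
        path="AutoGluonBuild_" + args.version,
    )
    return predictor.fit(train_data, **build_fit_kwargs(args))


args = parse_hyperparameters(["--time_limit", "9000", "--version", "v7"])
print(build_fit_kwargs(args)["hyperparameters"])
```

Keeping argument parsing and the `fit` settings in small functions makes them easy to check locally before paying for a training job.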

Setting `use_spot_instances` to `True` ensures that AWS EC2 spot instances are used for training.
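
Two details of managed spot training are easy to get wrong, so here is a small sketch of both. SageMaker requires `max_wait` to be at least `max_run`, and the savings it reports are computed from the training seconds and the billable seconds of the job. The function names are hypothetical, the headroom check on `time_limit` is a suggestion rather than a SageMaker rule, and the durations in the last call are invented for illustration.

```python
def check_spot_limits(time_limit, max_run, max_wait):
    """Validate the time budgets (in seconds) for a managed spot training job."""
    if max_wait < max_run:
        raise ValueError("max_wait must be at least max_run for spot training")
    if time_limit >= max_run:
        # Suggested headroom: leave time for setup, refitting, and upload.
        raise ValueError("time_limit should leave headroom below max_run")


def spot_savings_pct(training_seconds, billable_seconds):
    """Percentage saved relative to paying for every training second."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return 100.0 * (1.0 - billable_seconds / training_seconds)


# Budgets taken from the estimator call in this notebook.
check_spot_limits(time_limit=1800 * 5, max_run=2000 * 5, max_wait=2000 * 5)

# Hypothetical durations, for illustration only.
print(spot_savings_pct(training_seconds=7000, billable_seconds=3600))
```

For a real job, the two durations are available as `TrainingTimeInSeconds` and `BillableTimeInSeconds` in the response of the boto3 SageMaker client's `describe_training_job` call.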

After training, SageMaker places the trained models in the S3 bucket you specify.
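
Within that bucket, the packaged model ends up under the output path, in a folder named after the training job. The sketch below builds the expected object key; the helper name `model_artifact_key` is hypothetical, and the job name is the one reported in the training log of this notebook.

```python
def model_artifact_key(model_folder, job_name):
    """S3 key where SageMaker writes the packaged model for a training job."""
    return "{}/{}/output/model.tar.gz".format(model_folder.strip("/"), job_name)


key = model_artifact_key("model", "sagemaker-scikit-learn-2024-05-28-23-41-26-071")
print(key)  # model/sagemaker-scikit-learn-2024-05-28-23-41-26-071/output/model.tar.gz
```

After a successful `fit()`, the estimator's `model_data` attribute should hold the full S3 URI of the same archive.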

Below the training job is submitted.

In [8]:
%%time
aws_role = get_execution_role()
sagemaker_session = sagemaker.Session()

env = {'SAGEMAKER_REQUIREMENTS': 'requirements.txt'}

model = SKLearn(
    role=aws_role,
    sagemaker_session=sagemaker_session,
    output_path=model_output,
    code_location=model_output,
    entry_point='train_' + VERSION + '.py',
    source_dir='./container_scripts',
    env=env,
    image_uri=image_uri,
    instance_count=1,
    instance_type=instance_type,
    hyperparameters={"n_jobs": n_jobs, 
                     'training_fraction': TRAIN_FRACTION, 
                     'time_limit': 1800*5,
                     'version': VERSION},
    use_spot_instances=True,
    max_run=2000*5, 
    max_wait=2000*5,
)

model.fit()

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2024-05-28-23-41-26-071


2024-05-28 23:42:05 Starting - Starting the training job...
2024-05-28 23:42:20 Starting - Preparing the instances for training...
2024-05-28 23:42:51 Downloading - Downloading input data...
2024-05-28 23:43:27 Training - Training image download completed. Training in progress......2024-05-28 23:44:06,416 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-05-28 23:44:06,419 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-05-28 23:44:06,421 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-05-28 23:44:06,436 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-05-28 23:44:08,284 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/miniconda3/bin/python -m pip install -r requirements.txt
Collecting ray==2.10.0 (from -r requirements.txt (line 2))
  Downloading ray-2.10.0-cp38-cp38-manylinux2014_x86_64.whl.me

Training has finished. As you can see, it took almost 2 hours to train all the models.

# Retrieve and Load Trained AutoGluon Model

Below, the trained model is retrieved from S3 and loaded, and the model leader board is displayed.

## Get archive from S3

In [9]:
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

contents = s3_client.list_objects_v2(Bucket=project_bucket, Prefix=model_folder).get('Contents', [])
last_sklearn_model = None
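# Keep the last matching key. Job names end in a timestamp, so with keys
# listed in ascending order this is the most recent training job.
for content in contents:
    key = content['Key']
    if 'sagemaker-scikit-learn' in key and 'model.tar.gz' in key:
        last_sklearn_model = key

print(last_sklearn_model)
s3_resource.meta.client.download_file(project_bucket,
                                      last_sklearn_model,
                                      './model.tar.gz')
# Extract the archive into the current directory and close the file afterwards.
with tarfile.open('./model.tar.gz', 'r:gz') as t:
    t.extractall()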

model/sagemaker-scikit-learn-2024-05-28-23-41-26-071/output/model.tar.gz
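
The cell above keeps the last matching key in the listing. An alternative sketch, shown here under the assumption that the newest archive is the one wanted, picks the match with the latest `LastModified` timestamp instead. The helper name `latest_model_key` and the sample entries are made up; the dictionaries mimic the `Contents` items returned by `list_objects_v2`.

```python
from datetime import datetime, timezone


def latest_model_key(contents, job_prefix="sagemaker-scikit-learn"):
    """Return the most recently modified model.tar.gz key, or None if none match."""
    matches = [
        obj for obj in contents
        if job_prefix in obj["Key"] and obj["Key"].endswith("model.tar.gz")
    ]
    if not matches:
        return None
    return max(matches, key=lambda obj: obj["LastModified"])["Key"]


# Stand-in for the 'Contents' list returned by list_objects_v2 (made-up entries).
sample = [
    {"Key": "model/sagemaker-scikit-learn-b/output/model.tar.gz",
     "LastModified": datetime(2024, 5, 27, tzinfo=timezone.utc)},
    {"Key": "model/sagemaker-scikit-learn-a/output/model.tar.gz",
     "LastModified": datetime(2024, 5, 28, tzinfo=timezone.utc)},
    {"Key": "model/sagemaker-scikit-learn-a/source/sourcedir.tar.gz",
     "LastModified": datetime(2024, 5, 29, tzinfo=timezone.utc)},
]
print(latest_model_key(sample))
```

Note that a single `list_objects_v2` call returns at most 1,000 keys, so a folder holding many jobs would need pagination before either approach sees every archive.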


## Load model

In [10]:
predictor = TabularPredictor.load('./AutoGluonBuild_' + VERSION, check_packages=False,
                                  require_py_version_match=False)

Found 1 mismatches between original and current metadata:


## View AutoGluon Model Leader Board

The best model was a weighted ensemble. In the training script, the best model was retrained on all the data, and it is the default model used when making predictions.

In [11]:
print(predictor.model_best)
print(predictor.model_names())

WeightedEnsemble_L3_FULL
['LightGBMXT_BAG_L1', 'LightGBM_BAG_L1', 'LightGBM_2_BAG_L1', 'CatBoost_BAG_L1', 'CatBoost_2_BAG_L1', 'XGBoost_BAG_L1', 'XGBoost_2_BAG_L1', 'LightGBMLarge_BAG_L1', 'WeightedEnsemble_L2', 'LightGBMXT_BAG_L2', 'LightGBM_BAG_L2', 'LightGBM_2_BAG_L2', 'CatBoost_BAG_L2', 'CatBoost_2_BAG_L2', 'XGBoost_BAG_L2', 'XGBoost_2_BAG_L2', 'LightGBMLarge_BAG_L2', 'WeightedEnsemble_L3', 'LightGBMXT_BAG_L1_FULL', 'LightGBM_BAG_L1_FULL', 'LightGBM_2_BAG_L1_FULL', 'CatBoost_BAG_L1_FULL', 'CatBoost_2_BAG_L1_FULL', 'XGBoost_BAG_L1_FULL', 'XGBoost_2_BAG_L1_FULL', 'LightGBMLarge_BAG_L1_FULL', 'WeightedEnsemble_L2_FULL', 'LightGBMXT_BAG_L2_FULL', 'LightGBM_BAG_L2_FULL', 'LightGBM_2_BAG_L2_FULL', 'CatBoost_BAG_L2_FULL', 'CatBoost_2_BAG_L2_FULL', 'XGBoost_BAG_L2_FULL', 'XGBoost_2_BAG_L2_FULL', 'LightGBMLarge_BAG_L2_FULL', 'WeightedEnsemble_L3_FULL']


In [12]:
predictor.leaderboard()

| | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -0.018298 | root_mean_squared_error | 258.863275 | 1155.066262 | 0.016854 | 2.228569 | 3 | True | 18 |
| 1 | WeightedEnsemble_L2 | -0.018299 | root_mean_squared_error | 69.108697 | 486.015297 | 0.016777 | 1.516208 | 2 | True | 9 |
| 2 | LightGBM_2_BAG_L1 | -0.018302 | root_mean_squared_error | 63.855961 | 120.923009 | 63.855961 | 120.923009 | 1 | True | 3 |
| 3 | LightGBM_BAG_L2 | -0.018303 | root_mean_squared_error | 256.432776 | 1151.194993 | 1.734181 | 17.237268 | 2 | True | 11 |
| 4 | LightGBM_2_BAG_L2 | -0.018303 | root_mean_squared_error | 283.059128 | 1255.822067 | 28.360533 | 121.864342 | 2 | True | 12 |
| 5 | XGBoost_BAG_L2 | -0.018304 | root_mean_squared_error | 258.846421 | 1152.837693 | 4.147826 | 18.879969 | 2 | True | 15 |
| 6 | LightGBMLarge_BAG_L1 | -0.018305 | root_mean_squared_error | 4.919061 | 33.932035 | 4.919061 | 33.932035 | 1 | True | 8 |
| 7 | LightGBMLarge_BAG_L2 | -0.018305 | root_mean_squared_error | 258.101056 | 1164.244439 | 3.402462 | 30.286715 | 2 | True | 17 |
| 8 | CatBoost_2_BAG_L2 | -0.01831 | root_mean_squared_error | 255.041685 | 1331.452339 | 0.343091 | 197.494615 | 2 | True | 14 |
| 9 | CatBoost_BAG_L1 | -0.018312 | root_mean_squared_error | 0.316898 | 329.644045 | 0.316898 | 329.644045 | 1 | True | 4 |

# Create Kaggle Submission

Below, the Kaggle submission data is loaded and predictions of flood probability are made.

In [8]:
util = DataUtil(project_bucket,
                train_bucket,
                train_file,
                target_variable,
                test_file)
ds = util.get_data_sagemaker()
train = ds['train']
if TRAIN_FRACTION < 1.0:
    train = train.sample(frac=TRAIN_FRACTION)
trainX = train.copy()
trainy = trainX.pop('FloodProbability')
testX = ds['test']
if TRAIN_FRACTION < 1.0:
    testX = testX.sample(frac=TRAIN_FRACTION)

Let's take a look at some of the test data.

In [10]:
testX.describe()

| | MonsoonIntensity | TopographyDrainage | RiverManagement | Deforestation | Urbanization | ClimateChange | DamsQuality | Siltation | AgriculturalPractices | Encroachments | ... | PopulationScore | WetlandLoss | InadequatePlanning | PoliticalFactors | autoFE_f_0 | autoFE_f_1 | autoFE_f_2 | autoFE_f_3 | autoFE_f_4 | autoFE_f_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | ... | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 | 745305.0 |
| mean | 4.91561 | 4.930288 | 4.960027 | 4.946084 | 4.938424 | 4.933524 | 4.958468 | 4.927651 | 4.945308 | 4.95062 | ... | 4.926957 | 4.948424 | 4.940204 | 4.943918 | 9.904552 | 24.408708 | 9.842568 | 0.500085 | 24.229858 | 9.853714 |
| std | 2.056295 | 2.094117 | 2.071722 | 2.052602 | 2.081816 | 2.059243 | 2.089312 | 2.06811 | 2.073404 | 2.08175 | ... | 2.073692 | 2.065891 | 2.079128 | 2.087387 | 2.913573 | 15.049644 | 2.906872 | 0.285499 | 14.924683 | 2.912735 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002194 | 0.0 | 0.0 |
| 25% | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 | 4.0 | ... | 3.0 | 4.0 | 3.0 | 3.0 | 8.0 | 14.0 | 8.0 | 0.189003 | 14.0 | 8.0 |
| 50% | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 10.0 | 21.0 | 10.0 | 0.541247 | 21.0 | 10.0 |
| 75% | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 6.0 | 6.0 | 6.0 | 6.0 | 12.0 | 32.0 | 12.0 | 0.712094 | 32.0 | 12.0 |
| max | 16.0 | 17.0 | 16.0 | 17.0 | 17.0 | 17.0 | 16.0 | 16.0 | 16.0 | 17.0 | ... | 19.0 | 22.0 | 16.0 | 16.0 | 27.0 | 180.0 | 27.0 | 1.0 | 156.0 | 26.0 |


In [11]:
testX.head()

| id | MonsoonIntensity | TopographyDrainage | RiverManagement | Deforestation | Urbanization | ClimateChange | DamsQuality | Siltation | AgriculturalPractices | Encroachments | ... | PopulationScore | WetlandLoss | InadequatePlanning | PoliticalFactors | autoFE_f_0 | autoFE_f_1 | autoFE_f_2 | autoFE_f_3 | autoFE_f_4 | autoFE_f_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1117957 | 4 | 6 | 3 | 5 | 6 | 7 | 8 | 7 | 8 | 4 | ... | 6 | 4 | 4 | 5 | 13.0 | 18.0 | 10.0 | 0.910823 | 15.0 | 10.0 |
| 1117958 | 4 | 4 | 2 | 9 | 5 | 5 | 4 | 7 | 5 | 4 | ... | 7 | 4 | 4 | 3 | 13.0 | 8.0 | 11.0 | 0.075057 | 4.0 | 8.0 |
| 1117959 | 1 | 3 | 6 | 5 | 7 | 2 | 4 | 6 | 4 | 2 | ... | 3 | 6 | 8 | 3 | 9.0 | 18.0 | 4.0 | 0.829648 | 10.0 | 8.0 |
| 1117960 | 2 | 4 | 4 | 6 | 4 | 5 | 4 | 3 | 4 | 4 | ... | 4 | 2 | 4 | 4 | 10.0 | 16.0 | 6.0 | 0.831339 | 36.0 | 9.0 |
| 1117961 | 6 | 3 | 2 | 4 | 6 | 4 | 5 | 5 | 3 | 7 | ... | 8 | 4 | 5 | 5 | 9.0 | 6.0 | 14.0 | 0.347882 | 36.0 | 11.0 |


The code below creates the Kaggle submission file, which can later be submitted to the competition.

In [14]:
%%time
print('Get predictions from test data...')
pred = predictor.predict(testX)
df_results = pd.DataFrame(data={target_variable:pred}, index=testX.index)
df_results.index.name = "id"
df_results.to_csv("flood_ag_v7.csv")
print('Done.')

Get predictions from test data...
Done.
CPU times: user 15min 51s, sys: 2.54 s, total: 15min 53s
Wall time: 2min 14s


Let's do a sanity check on the file.

In [2]:
flood_submit = pd.read_csv('./flood_ag_v7.csv')

In [3]:
flood_submit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 745305 entries, 0 to 745304
Data columns (total 2 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                745305 non-null  int64  
 1   FloodProbability  745305 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 11.4 MB


In [4]:
flood_submit.head()

| | id | FloodProbability |
|---|---|---|
| 0 | 1117957 | 0.577464 |
| 1 | 1117958 | 0.454134 |
| 2 | 1117959 | 0.449175 |
| 3 | 1117960 | 0.470078 |
| 4 | 1117961 | 0.468185 |


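This looks correct: the file has the expected 745,305 rows, two columns (`id` and `FloodProbability`), and no missing values.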
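
The same checks can be automated so they run before every upload. The sketch below uses only the standard library and works on the text of a submission file; the function name `check_submission` and the specific checks are assumptions for illustration, and the sample rows are the first two rows shown above.

```python
import csv
import io


def check_submission(csv_text, expected_rows=None):
    """Return a list of problems found in a submission file's text (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "FloodProbability"]:
        problems.append("unexpected columns: {}".format(reader.fieldnames))
        return problems
    ids = set()
    count = 0
    for row in reader:
        count += 1
        if row["id"] in ids:
            problems.append("duplicate id: {}".format(row["id"]))
        ids.add(row["id"])
        try:
            # Missing or empty predictions fail the float conversion.
            float(row["FloodProbability"])
        except (TypeError, ValueError):
            problems.append("non-numeric prediction for id {}".format(row["id"]))
    if expected_rows is not None and count != expected_rows:
        problems.append("expected {} rows, found {}".format(expected_rows, count))
    return problems


# First two rows of the submission shown above.
sample = "id,FloodProbability\n1117957,0.577464\n1117958,0.454134\n"
print(check_submission(sample, expected_rows=2))  # []
```

For the real file, the text could be read with `open('flood_ag_v7.csv').read()` and `expected_rows` set to 745305.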