<a href="https://www.kaggle.com/code/dalloliogm/drw-autogluon?scriptVersionId=242135164" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# AutoGluon approach

This notebook uses AutoGluon, an AutoML library from Amazon, to predict the `label` column of the DRW Crypto Market Prediction dataset.

In [1]:
import pandas as pd

In [2]:
%%capture

!pip install -q autogluon

## Read data

In [3]:
# Load the data
train_df = pd.read_parquet("/kaggle/input/drw-crypto-market-prediction/train.parquet").reset_index()
test_df = pd.read_parquet("/kaggle/input/drw-crypto-market-prediction/test.parquet").reset_index()
sample_submission = pd.read_csv('/kaggle/input/drw-crypto-market-prediction/sample_submission.csv')

## Drop low-variance features

Training on the full dataset fails with an out-of-memory error. To reduce memory use, we downcast the numeric columns to smaller data types and keep only the features whose variance exceeds a threshold.

In [4]:
def downcast_df(df):
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

train_df = downcast_df(train_df)
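
As a rough illustration of what downcasting saves, the sketch below applies the same `pd.to_numeric(..., downcast=...)` calls to a small made-up frame and compares memory use before and after. The frame `toy` and its column names are hypothetical and not part of the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one float64 column and one int64 column
toy = pd.DataFrame({
    "feature": np.linspace(0.0, 1.0, 1000),            # float64
    "count": np.arange(1000, dtype="int64") % 100,     # int64, values 0..99
})
before = toy.memory_usage(deep=True).sum()

# Same downcasting calls as in downcast_df above
toy["feature"] = pd.to_numeric(toy["feature"], downcast="float")
toy["count"] = pd.to_numeric(toy["count"], downcast="integer")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```

As far as we understand, pandas keeps the original dtype when the values would not survive the conversion closely enough, which may be why a few columns can still show up as `float64` after downcasting.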

In [5]:
# Separate the feature columns from the target
X = train_df.drop(columns=['timestamp', 'label'])
y = train_df['label']

In [6]:
X.dtypes

bid_qty     float32
ask_qty     float32
buy_qty     float64
sell_qty    float32
volume      float64
             ...   
X886        float32
X887        float32
X888        float32
X889        float32
X890        float32
Length: 895, dtype: object

In [8]:
import numpy as np

# Replace inf values with NaN
X = X.replace([np.inf, -np.inf], np.nan)

# Drop columns with all NaNs
X = X.dropna(axis=1, how='all')

# Fill the remaining NaNs with 0 (mean imputation would be an alternative)
X = X.fillna(0)

# Now apply VarianceThreshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_reduced = pd.DataFrame(selector.fit_transform(X), columns=X.columns[selector.get_support()])

# Combine with label
train_filtered = pd.concat([X_reduced, y.reset_index(drop=True)], axis=1)
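
One thing to keep in mind is that `VarianceThreshold` compares raw variances, so the result depends on the scale of each feature. The sketch below shows this on a hypothetical toy frame (the column names are made up for illustration): a feature that varies on a small scale is dropped together with the constant one.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy features, not taken from the competition data
toy = pd.DataFrame({
    "constant": np.ones(100),                   # variance 0
    "small_scale": np.tile([0.0, 0.1], 50),     # variance 0.0025
    "large_scale": np.tile([0.0, 10.0], 50),    # variance 25
})

# Same threshold as in the cell above
selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

kept = toy.columns[selector.get_support()].tolist()
print("kept features:", kept)
```

If the features live on very different scales, rescaling them before filtering, or choosing the threshold per feature group, may be worth considering.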


## Trigger Training
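
The training cell below passes `eval_metric='pearsonr'`, which refers to the Pearson correlation between predictions and true labels. As a minimal sketch of what that score measures (the helper `pearson_corr` and the sample values are our own illustration, not AutoGluon code), note that it is unchanged when predictions are rescaled or shifted:

```python
import numpy as np

def pearson_corr(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    denom = np.sqrt((yt ** 2).sum() * (yp ** 2).sum())
    # Undefined when either input is constant
    return float((yt * yp).sum() / denom) if denom > 0 else float("nan")

# Made-up labels for illustration
y_true = np.array([0.1, -0.4, 0.3, 1.2, -0.9])

scaled = pearson_corr(y_true, 3.0 * y_true + 5.0)   # rescaled and shifted predictions
flipped = pearson_corr(y_true, -y_true)             # sign-flipped predictions
print(scaled, flipped)
```

In other words, the metric rewards predictions that move together with the label, not predictions that match its magnitude.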

In [None]:
from autogluon.tabular import TabularPredictor

# Define the label column and the columns to exclude from the features
label = 'label'
ignore_cols = ['timestamp']  # add other non-feature columns (e.g. IDs) if needed

# Train AutoGluon models.
# Note: this fits on the full train_df (the log below reports 895 feature
# columns), so train_filtered from the previous cell is not used here.
predictor = TabularPredictor(
    label=label,
    eval_metric='pearsonr',
).fit(
    train_df.drop(columns=ignore_cols),
    # presets='best_quality',
    presets='medium_quality',
    excluded_model_types=['NN_TORCH', 'CAT'],  # skip memory-heavy models ('CAT' is the CatBoost key)
    time_limit=3600,  # 1 hour limit
)

# Predict on the test set; errors='ignore' skips any listed column the test set lacks
preds = predictor.predict(test_df.drop(columns=ignore_cols + [label], errors='ignore'))

# Create submission
submission = sample_submission.copy()
submission['label'] = preds
submission.to_csv('submission.csv', index=False)


No path specified. Models will be saved in: "AutogluonModels/ag-20250527_102512"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Nov 10 10:07:59 UTC 2024
CPU Count:          4
Memory Avail:       3.48 GB / 31.35 GB (11.1%)
Disk Space Avail:   19.50 GB / 19.52 GB (99.9%)
Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 3600s
AutoGluon will save models to "/kaggle/working/AutogluonModels/ag-20250527_102512"
Train Data Rows:    525887
Train Data Columns: 895
Label Column:       label
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (20.740270614624023, -24.416614532470703, 0.03612999990582466, 1.00977)
	If 'regression' is not the correct problem_type, please manually specify the problem_type pa