# AutoGluon Assistant - Quick Start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/autogluon/autogluon-assistant)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://github.com/autogluon/autogluon-assistant)

(Links above are still WIP)

In this tutorial, we will see how to use AutoGluon Assistant (AG-A) to solve machine learning problems **with zero lines of code**. AG-A combines AutoGluon's state-of-the-art AutoML capabilities with Large Language Models (LLMs) to automate the entire data science pipeline.

We will cover:
- Setting up AutoGluon Assistant
- Preparing Your Data
- Using AutoGluon Assistant (via Command Line Interface)
- Using AutoGluon Assistant (through Python Programming)
- Using AutoGluon Assistant (via Web UI)

By the end of this tutorial, you'll be able to build accurate ML solutions for your own data using only natural language instructions. Let's get started with the installation!

## Setting up AutoGluon Assistant
Getting started with AutoGluon Assistant is straightforward. Let's install it directly using pip:

In [None]:
!pip install autogluon-assistant

AutoGluon Assistant supports two LLM providers: AWS Bedrock (default) and OpenAI. Choose one of the following setups:

In [None]:
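# Note: `!export VAR=...` runs in a temporary subshell, so the variable is lost
# as soon as the command finishes. The %env magic sets it for the whole notebook session.
# Option A: AWS Bedrock (Recommended)
%env AWS_DEFAULT_REGION=<your-region>
%env AWS_ACCESS_KEY_ID=<your-access-key>
%env AWS_SECRET_ACCESS_KEY=<your-secret-key>
### OR ###
# Option B: OpenAI
%env OPENAI_API_KEY=sk-...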

*Note: If using OpenAI, we recommend a paid API key rather than a free-tier account to avoid rate limiting issues.*
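Before running anything that calls an LLM, it can help to confirm that the credentials are visible to the Python process. The sketch below is only an illustration: `configured_providers` is a hypothetical helper, not part of AutoGluon Assistant, and it checks only the environment variables named in this section. Bedrock credentials can also come from other sources (for example, an AWS profile or an IAM role), which this check does not detect.

```python
import os

# Environment variables named in the setup step of this tutorial.
BEDROCK_VARS = ("AWS_DEFAULT_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
OPENAI_VARS = ("OPENAI_API_KEY",)


def configured_providers(env):
    """Return the providers whose variables are all set and non-empty in `env`.

    Hypothetical helper for illustration; `env` is any mapping such as os.environ.
    """
    providers = []
    if all(env.get(name) for name in BEDROCK_VARS):
        providers.append("bedrock")
    if all(env.get(name) for name in OPENAI_VARS):
        providers.append("openai")
    return providers


found = configured_providers(os.environ)
print("Configured providers:", found or "none")
```

If the list is empty in a notebook, the variables were most likely set in a different shell than the one running the kernel.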

Let's verify the installation by importing the package:

In [None]:
import autogluon.assistant

print(autogluon.assistant.__version__)


Now that you have AutoGluon Assistant installed and configured, let's move on to preparing your data directory structure for your first ML project!

## Preparing Your Data

For this tutorial, we'll use a spaceship transportation dataset, which is well suited to getting started with machine learning. The goal is to predict whether an item was transported based on various numerical and categorical features in the dataset. We sampled 1,000 training and 1,000 test examples from the original data. The sampled dataset makes this tutorial run quickly, but AutoGluon Assistant can handle the full dataset if desired.

Let's download the example data:

In [None]:
%%bash
wget https://automl-mm-bench.s3.us-east-1.amazonaws.com/aga/data/aga_sample_data.zip
unzip aga_sample_data.zip

That's it! We now have (under `./toy_data`):

- `train.csv`: Training data with labeled examples
- `test.csv`: Test data for making predictions
- `descriptions.txt`: A description of the dataset and task
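A quick sanity check can confirm that the archive unpacked as expected. The following is a minimal sketch: `missing_files` is a hypothetical helper, and the file list simply mirrors the sample data in this tutorial. It is not a statement about which file names AutoGluon Assistant requires in general.

```python
from pathlib import Path

# File names taken from the sample data used in this tutorial.
EXPECTED_FILES = ("train.csv", "test.csv", "descriptions.txt")


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the names in `expected` that are not regular files under `data_dir`."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("./toy_data")
print("Missing files:", missing or "none")
```

If the directory does not exist at all, every expected name is reported as missing, so the same check also catches a failed download.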

Let's take a quick look at our training data and description file:

In [None]:
import pandas as pd
train_data = pd.read_csv("toy_data/train.csv")
train_data.head()

```
  PassengerId HomePlanet CryoSleep  Destination   Age    VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck                Name  Transported Deck  Cabin_num Side
0     5647_01     Europa      True  TRAPPIST-1e  35.0  False          0.0        NaN           0.0  0.0     NaN  Dyonevi Matoltuble         True    C      178.0    P
1     4061_02       Mars      True  TRAPPIST-1e   0.0  False          0.0        0.0           0.0  0.0     0.0          Graw Kashe         True    F      766.0    S
2     0691_03       Mars      True  TRAPPIST-1e  23.0  False          0.0        0.0           0.0  0.0     0.0          Moss Potte         True    E       49.0    S
3     1094_01      Earth     False  TRAPPIST-1e  60.0    NaN        437.0        2.0           0.0  0.0   365.0   Carona Webstenson        False    F      224.0    P
4     6394_01      Earth      True  TRAPPIST-1e  41.0  False          0.0        0.0           0.0  0.0     0.0   Pattie Lambleyoun         True    G     1037.0    S
```

In [None]:
with open('toy_data/descriptions.txt', 'r') as f:
    print(f.read())

```
You are solving this data science tasks of binary classification: 
The dataset presented here (the spaceship dataset) comprises a lot of features, including both numerical and categorical features. Some of the features are missing, with nan value. We have splitted the dataset into three parts of train, valid and test. Your task is to predict the Transported item, which is a binary label with True and False. The evaluation metric is the classification accuracy.
```

## Using AutoGluon Assistant (via Command Line Interface)

Now that we have our data ready, let's use AutoGluon Assistant to build our ML model. The simplest way to use AutoGluon Assistant is through the command line - no coding required! After installing the package, you can run it directly from your terminal:

In [None]:
%%bash
aga run ./toy_data \
    --presets medium_quality    # (Optional) Choose prediction quality level:
                                # Options: medium_quality, high_quality, best_quality (default)

```
INFO:root:Starting AutoGluon-Assistant
INFO:root:Presets: medium_quality
INFO:root:Loading default config from: /media/deephome/autogluon-assistant/src/autogluon.assistant/configs/medium_quality.yaml
INFO:root:Successfully loaded config
🤖  Welcome to AutoGluon-Assistant 
Will use task config:
{
    'infer_eval_metric': True,
    'detect_and_drop_id_column': False,
    'task_preprocessors_timeout': 3600,
    'save_artifacts': {'enabled': False, 'append_timestamp': True, 'path': './aga-artifacts'},
    'feature_transformers': None,
    'autogluon': {'predictor_init_kwargs': {}, 'predictor_fit_kwargs': {'presets': 'medium_quality', 'time_limit': 600}},
    'llm': {
        'provider': 'bedrock',
        'model': 'anthropic.claude-3-5-sonnet-20241022-v2:0',
        'max_tokens': 512,
        'proxy_url': None,
        'temperature': 0,
        'verbose': True
    }
}
Task path: /media/deephome/testdir/toy_data
Task loaded!
TabularPredictionTask(name=toy_data, description=, 3 datasets)
INFO:botocore.credentials:Found credentials in environment variables.
INFO:autogluon.assistant.llm.llm:AGA is using model anthropic.claude-3-5-sonnet-20241022-v2:0 from Bedrock to assist you with the task.
INFO:autogluon.assistant.assistant:Task understanding starts...
INFO:autogluon.assistant.task_inference.task_inference:description: data_description_file: You are solving this data science tasks of binary classification: \nThe dataset presented here (the spaceship dataset) comprises a lot of features, including both numerical and categorical features. Some of the features are missing, with nan value. We have splitted the dataset into three parts of train, valid and test. Your task is to predict the Transported item, which is a binary label with True and False. The evaluation metric is the classification accuracy.\n
INFO:autogluon.assistant.task_inference.task_inference:train_data: /media/deephome/testdir/toy_data/train.csv
Loaded data from: /media/deephome/testdir/toy_data/train.csv | Columns = 16 / 16 | Rows = 1000 -> 1000
INFO:autogluon.assistant.task_inference.task_inference:test_data: /media/deephome/testdir/toy_data/test.csv
Loaded data from: /media/deephome/testdir/toy_data/test.csv | Columns = 16 / 16 | Rows = 1000 -> 1000
INFO:autogluon.assistant.task_inference.task_inference:WARNING: Failed to identify the sample_submission_data of the task, it is set to None.
INFO:autogluon.assistant.task_inference.task_inference:label_column: Transported
INFO:autogluon.assistant.task_inference.task_inference:problem_type: binary
INFO:autogluon.assistant.task_inference.task_inference:eval_metric: accuracy
INFO:autogluon.assistant.assistant:Total number of prompt tokens: 1582
INFO:autogluon.assistant.assistant:Total number of completion tokens: 155
INFO:autogluon.assistant.assistant:Task understanding complete!
INFO:autogluon.assistant.assistant:Automatic feature generation is disabled. 
Model training starts...
INFO:autogluon.assistant.predictor:Fitting AutoGluon TabularPredictor
INFO:autogluon.assistant.predictor:predictor_init_kwargs: {'learner_kwargs': {'ignored_columns': []}, 'label': 'Transported', 'problem_type': 'binary', 'eval_metric': 'accuracy'}
INFO:autogluon.assistant.predictor:predictor_fit_kwargs: {'presets': 'medium_quality', 'time_limit': 600}
No path specified. Models will be saved in: "AutogluonModels/ag-20241111_055131"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.1.1
Python Version:     3.10.14
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #54~20.04.1-Ubuntu SMP Fri Oct 6 22:04:33 UTC 2023
CPU Count:          96
Memory Avail:       1030.28 GB / 1121.80 GB (91.8%)
Disk Space Avail:   64.75 GB / 860.63 GB (7.5%)
===================================================
Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20241111_055131"
Train Data Rows:    1000
Train Data Columns: 15
Label Column:       Transported
Problem Type:       binary
Preprocessing data ...
Selected class <--> label mapping:  class 1 = True, class 0 = False
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
        Available Memory:                    1055013.00 MB
        Train Data (Original)  Memory Usage: 0.48 MB (0.0% of available memory)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
        Stage 2 Generators:
                Fitting FillNaFeatureGenerator...
        Stage 3 Generators:
                Fitting IdentityFeatureGenerator...
                Fitting CategoryFeatureGenerator...
                        Fitting CategoryMemoryMinimizeFeatureGenerator...
        Stage 4 Generators:
                Fitting DropUniqueFeatureGenerator...
        Stage 5 Generators:
                Fitting DropDuplicatesFeatureGenerator...
        Unused Original Features (Count: 1): ['PassengerId']
                These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
                Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
                These features do not need to be present at inference time.
                ('object', []) : 1 | ['PassengerId']
        Types of features in original data (raw dtype, special dtypes):
                ('float', [])  : 7 | ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', ...]
                ('object', []) : 7 | ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', ...]
        Types of features in processed data (raw dtype, special dtypes):
                ('category', []) : 7 | ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', ...]
                ('float', [])    : 7 | ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', ...]
        0.1s = Fit runtime
        14 features in original data used to generate 14 features in processed data.
        Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.1s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
        To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
        'NN_TORCH': {},
        'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
        'CAT': {},
        'XGB': {},
        'FASTAI': {},
        'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
        'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
        'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 599.9s of the 599.9s of remaining time.
        0.805    = Validation score   (accuracy)
        0.04s    = Training   runtime
        0.04s    = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 599.82s of the 599.82s of remaining time.
        0.79     = Validation score   (accuracy)
        0.03s    = Training   runtime
        0.03s    = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 599.75s of the 599.75s of remaining time.
        0.83     = Validation score   (accuracy)
        0.87s    = Training   runtime
        0.01s    = Validation runtime

......

Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 581.72s of remaining time.
        Ensemble Weights: {'LightGBMLarge': 0.4, 'NeuralNetTorch': 0.25, 'NeuralNetFastAI': 0.2, 'CatBoost': 0.15}
        0.855    = Validation score   (accuracy)
        0.12s    = Training   runtime
        0.0s     = Validation runtime
AutoGluon training complete, total runtime = 18.41s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4025.3 rows/s (200 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20241111_055131")
Model training complete!
Prediction starts...
Prediction complete! Outputs written to aga-output-20241111_055149.csv
```
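The last log line names the prediction file, which carries a timestamp. When several runs have accumulated, a small helper can pick out the most recent one. This is a sketch under two assumptions drawn from the log above: outputs land in the current working directory, and they follow the `aga-output-<YYYYMMDD_HHMMSS>.csv` naming pattern. `latest_output` is a hypothetical helper, not an AG-A function.

```python
from pathlib import Path


def latest_output(directory=".", pattern="aga-output-*.csv"):
    """Return the newest matching file by name, or None if nothing matches.

    A YYYYMMDD_HHMMSS timestamp sorts lexicographically in chronological
    order, so the last name after sorting belongs to the most recent run.
    """
    candidates = sorted(Path(directory).glob(pattern))
    return candidates[-1] if candidates else None


print("Most recent output:", latest_output())
```

Sorting by name rather than by modification time keeps the result stable even if older files are copied or touched later.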

## Using AutoGluon Assistant (through Python Programming)

Let's also look at how to use AutoGluon Assistant programmatically in Python:

In [None]:
from autogluon.assistant import run_assistant

# Run the assistant
output_file = run_assistant(task_path="./toy_data", presets="medium_quality")

Let's examine the predictions:

In [None]:
import pandas as pd

predictions = pd.read_csv(output_file)
print("\nFirst few predictions:")
print(predictions.head())

```
First few predictions:
   Transported
0         True
1        False
2         True
3         True
4         True
```
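Beyond the first few rows, the balance of predicted classes is a useful sanity check: a file that predicts only one class often points to a problem upstream. The sketch below counts the values using only the standard library. `prediction_counts` is a hypothetical helper, and the example file name is the one from the CLI log earlier in this tutorial, so yours will differ.

```python
import csv
from collections import Counter
from pathlib import Path


def prediction_counts(csv_path, column="Transported"):
    """Count how often each value appears in `column` of a predictions CSV.

    Values are counted as the strings stored in the file (e.g. "True", "False").
    """
    with open(csv_path, newline="") as f:
        return Counter(row[column] for row in csv.DictReader(f))


# File name taken from the CLI log shown earlier; replace it with your own output.
example = Path("aga-output-20241111_055149.csv")
if example.is_file():
    print(prediction_counts(example))
```

With pandas already loaded, `predictions["Transported"].value_counts()` gives the same tally.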

## Using AutoGluon Assistant (via Web UI)

AutoGluon Assistant Web UI allows users to leverage the capabilities of AG-A through an intuitive web interface.

The web UI enables users to upload datasets, configure AG-A runs with customized settings, preview data, monitor execution progress, and view and download results. It also supports secure, isolated sessions for concurrent users.

### To run the AG-A Web UI:

In [None]:
%%bash

aga ui

# OR

# Launch Web-UI on specific port e.g. 8888
aga ui --port 8888

The AG-A Web UI should now be accessible in your web browser at http://localhost:8501, or at the port you specified.
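If the page does not load, a quick way to tell whether anything is listening on the port is to attempt a TCP connection. This is a generic sketch, not an AG-A feature: `is_port_open` is a hypothetical helper, and a successful connection only shows that some process accepts connections on that port, not that it is the AG-A Web UI.

```python
import socket


def is_port_open(host="localhost", port=8501, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


print("Web UI reachable:", is_port_open())
```

Pass `port=8888` (or whichever value you gave to `--port`) to check a non-default port.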

## Conclusion

In this quickstart tutorial, we saw how AutoGluon Assistant simplifies the ML pipeline by letting users solve machine learning problems with minimal effort. Given just a data directory, AutoGluon Assistant handles the entire process from data understanding to prediction generation. Check out the other tutorials (WIP) to learn more about customizing the configuration, using different LLM providers, and handling various types of ML tasks.

Want to dive deeper? Explore our GitHub repository for more advanced features and examples.