<a href="https://colab.research.google.com/github/dogsandpeonies/A04Colab_LogicLoopers_ITAI-1371/blob/main/A04_LogicLoopers_RESUBMISSION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="images/logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# ML through Application
## Module 1, Lab 3: Getting Started with AutoGluon

This notebook covers how to create a model to solve an ML problem by using [AutoGluon](https://auto.gluon.ai/stable/index.html#).

You will learn how to do the following:

- Import the AutoGluon library.
- Import data to a Pandas DataFrame.
- Train a model by using AutoGluon.

---

You will explore a dataset that contains information about books. The goal is to predict book prices by using features about the books.

__Business problem:__ Books from a large database with several features cannot be listed for sale because one critical piece of information is missing: the price.

__ML problem description:__ Predict book prices by using book features, such as genre, release data, ratings, and number of reviews.

This is a regression task (the training dataset has a book price column to use for labels).

----

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you can practice your coding skills.</p>

## Index
- [Importing AutoGluon](#Importing-AutoGluon)
- [Getting the data](#Getting-the-data)
- [Model training with AutoGluon](#Model-training-with-AutoGluon)

---
## Importing AutoGluon

Install and load the libraries that are needed to work with the tabular dataset.

In [1]:
%%capture
# Use pip to install libraries
!pip install autogluon
!pip install tabulate
!pip install pandas

In [2]:
!pip install numpy



In [3]:
!pip install pandas==2.2.2 --force-reinstall

Collecting pandas==2.2.2
  Using cached pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting numpy>=1.23.2 (from pandas==2.2.2)
  Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting python-dateutil>=2.8.2 (from pandas==2.2.2)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas==2.2.2)
  Using cached pytz-2025.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas==2.2.2)
  Using cached tzdata-2025.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas==2.2.2)
  Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Using cached pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Using cached python_dateutil-2.9.0.post0-

In [10]:
!pip install --upgrade numpy
!pip install --upgrade autogluon

Collecting numpy
  Downloading numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.0/62.0 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.4/16.4 MB[0m [31m56.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gluonts 0.16.0 requires numpy<2.2,>=1.16, but you have numpy 2.2.4 which is incompatible.
autogluon-timeseries 1.2 requires numpy<2.1.4,>=1.25.0, but you have numpy 2.2.

In [15]:
# Import the libraries that are needed for the notebook
import pandas as pd
import numpy as np

from autogluon.tabular import TabularPredictor

# Import utility functions and challenge questions
#from MLUMLA_EN_M1_Lab3_quiz_questions import



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
## Getting the data

Now get the data for the business problem.

__Note:__ You will use the [Amazon Product Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews) dataset. For more information about this dataset, see the following resources:

- Ruining He and Julian McAuley. "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering." Proceedings of the 25th International Conference on World Wide Web, Geneva, Switzerland, April 2016. https://doi.org/10.1145/2872427.2883037.

- Julian McAuley, Christopher Targett, Qinfeng Shi, Anton van den Hengel. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference on Research and Development in Information Retrieval, Santiago, Chile, August 2015. https://doi.org/10.1145/2766462.2767755.

In [3]:
from datasets import load_dataset

dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_All_Beauty", trust_remote_code=True)
print(dataset["full"][0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/30.3k [00:00<?, ?B/s]

Amazon-Reviews-2023.py:   0%|          | 0.00/39.6k [00:00<?, ?B/s]

All_Beauty.jsonl:   0%|          | 0.00/327M [00:00<?, ?B/s]

Generating full split: 0 examples [00:00, ? examples/s]

{'rating': 5.0, 'title': 'Such a lovely scent but not overpowering.', 'text': "This spray is really nice. It smells really good, goes on really fine, and does the trick. I will say it feels like you need a lot of it though to get the texture I want. I have a lot of hair, medium thickness. I am comparing to other brands with yucky chemicals so I'm gonna stick with this. Try it!", 'images': [], 'asin': 'B00YQ6X8EO', 'parent_asin': 'B00YQ6X8EO', 'user_id': 'AGKHLEW2SOWHNMFQIJGBECAF7INQ', 'timestamp': 1588687728923, 'helpful_vote': 0, 'verified_purchase': True}


To load the training and test data, and then show the first few rows of the training dataset, run the following cells.

In [9]:
# Assuming TabularDataset is a custom class or function
# You need to provide the correct way to load the data
# Example using pandas:
# import pandas as pd
# df_train = pd.read_csv("path/to/McAuley-Lab/Amazon-Reviews-2023/raw_review_All_Beauty")
# df_test = pd.read_csv("path/to/McAuley-Lab/Amazon-Reviews-2023/raw_review_All_Beauty")


In [11]:
# Convert the Hugging Face dataset to a Pandas DataFrame
df_train = pd.DataFrame(dataset["full"])  # Assuming 'full' split contains the data

In [12]:
df_train.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5.0,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923,0,True
1,4.0,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070,1,True
2,5.0,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,1589665266052,2,True
3,1.0,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220,0,True
4,5.0,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534,0,True


In [26]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [28]:
# Import TabularDataset from autogluon.tabular
from autogluon.tabular import TabularDataset

In [30]:
# Load the data using TabularDataset from autogluon
df_train = TabularDataset(data="/content/drive/MyDrive/my_dataset.csv")
df_test = TabularDataset(data="/content/drive/MyDrive/my_dataset.csv")

In [31]:
df_train.head()

Unnamed: 0,asin,imUrl,description,categories,title,price,salesRank,related,brand
0,132793040,http://ecx.images-amazon.com/images/I/31JIPhp%...,The Kelby Training DVD Mastering Blend Modes i...,"[['Electronics', 'Computers & Accessories', 'C...",Kelby Training DVD: Mastering Blend Modes in A...,,,,
1,321732944,http://ecx.images-amazon.com/images/I/31uogm6Y...,,"[['Electronics', 'Computers & Accessories', 'C...",Kelby Training DVD: Adobe Photoshop CS5 Crash ...,,,,
2,439886341,http://ecx.images-amazon.com/images/I/51k0qa8f...,Digital Organizer and Messenger,"[['Electronics', 'Computers & Accessories', 'P...",Digital Organizer and Messenger,8.15,{'Electronics': 144944},"{'also_viewed': ['0545016266', 'B009ECM8QY', '...",
3,511189877,http://ecx.images-amazon.com/images/I/41HaAhbv...,The CLIKR-5 UR5U-8780L remote control is desig...,"[['Electronics', 'Accessories & Supplies', 'Au...",CLIKR-5 Time Warner Cable Remote Control UR5U-...,23.36,,"{'also_viewed': ['B001KC08A4', 'B00KUL8O0W', '...",
4,528881469,http://ecx.images-amazon.com/images/I/51FnRkJq...,"Like its award-winning predecessor, the Intell...","[['Electronics', 'GPS & Navigation', 'Vehicle ...",Rand McNally 528881469 7-inch Intelliroute TND...,299.99,,"{'also_viewed': ['B006ZOI9OY', 'B00C7FKT2A', '...",


<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

In [32]:
# Run this cell for a knowledge check question %% [markdown] --- ## Model
# training with AutoGluon You can use AutoGluon to train a model by using a
# single line of code. You need to provide the dataset and tell AutoGluon which
# column from the dataset you are trying to predict. %% [markdown]

---
## Model training with AutoGluon

You can use AutoGluon to train a model by using a single line of code. You need to provide the dataset and tell AutoGluon which column from the dataset you are trying to predict.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">To prepare the datasets, run the following cell.<br/>
        This step is not required for AutoGluon to work, but it will reduce the time to train your first model.<br/>
The code randomly selects 1,000 rows from the dataset and splits them into training and validation datasets.</p>
    <br>
</div>

In [33]:
# Sampling 1,000
# Try setting the subsample_size to a much larger value to see what happens during training
subsample_size = 1000  # Sample a subset of the data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Print the first few rows
df_train_smaller.head()

Unnamed: 0,asin,imUrl,description,categories,title,price,salesRank,related,brand
471644,B00GANGND4,http://ecx.images-amazon.com/images/I/31tGSJhJ...,- Material: durable ABS.- Standard USB power p...,"[['Electronics', 'Computers & Accessories', 'C...",Pixnor 2013 USB Powered Colorful LED Fountain ...,28.99,,"{'also_bought': ['B00FJ0S1SO', 'B00GGRS6HU', '...",
419871,B00BQG99AY,http://ecx.images-amazon.com/images/I/514fZJ4h...,Q POWE 0 Gauge 3000 WATT Car Amplifier Install...,"[['Electronics', 'Car & Vehicle Electronics', ...",Q Power 1/0 Gauge Ga 3000W Car Amplifier Wirin...,41.99,,"{'also_viewed': ['B001F5YS1G', 'B00BQHDNMI', '...",
358015,B008LA8JS6,http://ecx.images-amazon.com/images/I/41OdU%2B...,Description:Feel opponents sneaking behind you...,"[['Electronics', 'Computers & Accessories', 'C...",Asus Xonar DSX 7.1 Channel PCIE Gaming Audio C...,68.75,,"{'also_viewed': ['B00198DM2K', 'B002VAD716', '...",Asus
348950,B008AF1AEM,http://ecx.images-amazon.com/images/I/51p6H%2B...,**ATTENTION CUSTOMERS**: If you are seeing thi...,"[['Electronics', 'Computers & Accessories', 'C...",2GB DDR3 Memory RAM for Gateway 10.1&quot; Net...,25.49,,"{'also_viewed': ['B00AWGZAEI', 'B008O4RWGO', '...",
440695,B00D8BN47O,http://ecx.images-amazon.com/images/I/51rs0ZM7...,Give your Lenovo IdeaPad Yoga 13 a stylish loo...,"[['Electronics', 'Computers & Accessories', 'L...",MightySkins Protective Skin Decal Cover for Le...,19.99,{'Electronics': 343174},"{'buy_after_viewing': ['B00DV6SMH8', 'B00AOHXT...",


In [34]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=df_train_smaller)

https://docs.google.com/spreadsheets/d/1-u9mN0Gwl0Ba3hwkdG69vUEYbH1EQiKs5WwBLCj_tkw/edit#gid=0


In [42]:
# Sampling 100
# Try setting the subsample_size to a much larger value to see what happens during training
subsample_size = 100  # Sample a subset of the data for faster demo
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Print the first few rows
df_train_smaller.head()

Unnamed: 0,asin,imUrl,description,categories,title,price,salesRank,related,brand
471644,B00GANGND4,http://ecx.images-amazon.com/images/I/31tGSJhJ...,- Material: durable ABS.- Standard USB power p...,"[['Electronics', 'Computers & Accessories', 'C...",Pixnor 2013 USB Powered Colorful LED Fountain ...,28.99,,"{'also_bought': ['B00FJ0S1SO', 'B00GGRS6HU', '...",
419871,B00BQG99AY,http://ecx.images-amazon.com/images/I/514fZJ4h...,Q POWE 0 Gauge 3000 WATT Car Amplifier Install...,"[['Electronics', 'Car & Vehicle Electronics', ...",Q Power 1/0 Gauge Ga 3000W Car Amplifier Wirin...,41.99,,"{'also_viewed': ['B001F5YS1G', 'B00BQHDNMI', '...",
358015,B008LA8JS6,http://ecx.images-amazon.com/images/I/41OdU%2B...,Description:Feel opponents sneaking behind you...,"[['Electronics', 'Computers & Accessories', 'C...",Asus Xonar DSX 7.1 Channel PCIE Gaming Audio C...,68.75,,"{'also_viewed': ['B00198DM2K', 'B002VAD716', '...",Asus
348950,B008AF1AEM,http://ecx.images-amazon.com/images/I/51p6H%2B...,**ATTENTION CUSTOMERS**: If you are seeing thi...,"[['Electronics', 'Computers & Accessories', 'C...",2GB DDR3 Memory RAM for Gateway 10.1&quot; Net...,25.49,,"{'also_viewed': ['B00AWGZAEI', 'B008O4RWGO', '...",
440695,B00D8BN47O,http://ecx.images-amazon.com/images/I/51rs0ZM7...,Give your Lenovo IdeaPad Yoga 13 a stylish loo...,"[['Electronics', 'Computers & Accessories', 'L...",MightySkins Protective Skin Decal Cover for Le...,19.99,{'Electronics': 343174},"{'buy_after_viewing': ['B00DV6SMH8', 'B00AOHXT...",


<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>It's time to check your knowledge!</i></h3>
    <br>
    <p style=" text-align: center; margin: auto;">To load the question, run the following cell.</p>
    <br>
</div>

### Training a model with a small sample

AutoGluon uses certain defaults. For example, AutoGluon uses `root_mean_squared_error` as an evaluation metric for regression problems. For more information, see [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) in the sklearn documentation.

__Note:__ Training on this smaller dataset will take approximately 3–4 minutes.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Use `TabularPredictor` to train the first version of the model along with the smaller 1000 sample training dataset so the model trains faster.<br>
</p>
    <br>
</div>

In [46]:
# Remove rows with NaN, Inf, or -Inf in the 'price' column
df_train_smaller = df_train_smaller.replace([np.inf, -np.inf], np.nan)  # Convert Inf to NaN
df_train_smaller = df_train_smaller.dropna(subset=["price"])  # Drop rows with NaN in price



In [47]:
# Now fit the model
predictor = TabularPredictor(label="price", eval_metric="mean_squared_error")
predictor.fit(train_data=df_train_smaller)

No path specified. Models will be saved in: "AutogluonModels/ag-20250323_044811"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       8.08 GB / 12.67 GB (63.7%)
Disk Space Avail:   59.69 GB / 107.72 GB (55.4%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accu

[1000]	valid_set's l2: 11335.9
[2000]	valid_set's l2: 11335.2
[3000]	valid_set's l2: 11335.2
[4000]	valid_set's l2: 11335.2
[5000]	valid_set's l2: 11335.2


	-11335.2375	 = Validation score   (-mean_squared_error)
	3.68s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Ensemble Weights: {'NeuralNetFastAI': 0.944, 'RandomForestMSE': 0.056}
	-6100.7658	 = Validation score   (-mean_squared_error)
	0.09s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 39.06s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 160.7 rows/s (18 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/content/AutogluonModels/ag-20250323_044811")


<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7d0e1ebd3390>

In [48]:
print(df_train.shape)

(498196, 9)


In [49]:
predictor.fit_summary()

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


*** Summary of fit() ***
Estimated performance of each model:
                  model     score_val         eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2  -6100.765762  mean_squared_error       0.112033   4.691279                0.000772           0.093289            2       True         11
1       NeuralNetFastAI  -6113.540517  mean_squared_error       0.036590   2.689377                0.036590           2.689377            1       True          7
2         ExtraTreesMSE  -7656.974754  mean_squared_error       0.069230   0.715285                0.069230           0.715285            1       True          6
3       RandomForestMSE  -9865.088206  mean_squared_error       0.074671   1.908613                0.074671           1.908613            1       True          5
4              LightGBM -10523.086527  mean_squared_error       0.005533   1.740019                0.005533           1.740019  

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestMSE': 'RFModel',
  'ExtraTreesMSE': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': np.float64(-15065.880052708517),
  'KNeighborsDist': np.float64(-15111.842152344587),
  'LightGBMXT': np.float64(-11536.416487311557),
  'LightGBM': np.float64(-10523.086527301188),
  'RandomForestMSE': np.float64(-9865.088206334662),
  'ExtraTreesMSE': np.float64(-7656.974754178085),
  'NeuralNetFastAI': np.float64(-6113.54051658192),
  'XGBoost': np.float64(-12481.271323756484),
  'NeuralNetTorch': np.float64(-12030.269097180804),
  'LightGBMLarge': np.float64(-11335.237534948186),
  'WeightedEnsemble_L2': np.float64(-6100.765762315019)},
 'model_bes

In [50]:
train_data, val_data = predictor.load_data_internal()  # Get training and validation sets

# Compute the ratio
train_size = len(train_data)
val_size = len(val_data)
ratio = train_size / (train_size + val_size)

print(f"Training Size: {train_size}, Validation Size: {val_size}")
print(f"Training to Total Data Ratio: {ratio:.2f}")

Training Size: 68, Validation Size: 68
Training to Total Data Ratio: 0.50


1. What is the shape of the training dataset?
2. What type of ML problem (such as classification or regression) does AutoGluon infer? (**Hint:** Remember, you didn't mention the problem type. You only provided the label column.)
3. What does AutoGluon suggest in case it inferred the wrong problem type?
4. What kind of data preprocessing and feature engineering did AutoGluon perform?
5. What are the basic statistics about the label in the print statements from AutoGluon?
6. How many extra features were generated in addition to the originals in the dataset? What was the runtime for that?
7. Which evaluation metric was used?
8. What does AutoGluon suggest in case it inferred the wrong metric?
9. What is the ratio between the training and validation dataset? (**Hint:** Look for `val` or `validation`.)
10. Where did AutoGluon save the predictor?
11. Which folder were the models saved in?
12. What file format are the models in? (**Note:** Look at the file name suffix. You don't need to open the file.)

Try to answer these questions before you check the solution.

### List your answers here:
1. The shape of the training dataset is print(df_train.shape) = (498196, 9).

2. The AutoGluon infers the type of machine learning problem as Regression due to the target variable being numerical and continuous.

3. AutoGluon recommends manually overriding the problem type using the problem_type argument in TabularPredictor.

4. AutoGluon performed data type inference & conversion to convert numerical columns (data preprocessing) and categorical features using
ordinal encoding and numerical features  (feature engineering).

5. The basic statistics would be predictor = TabularPredictor(label="target_column").fit(train_data)

6. 6 features in original data used to generate 71 features in processed data. Data preprocessing and feature engineering runtime = 1.16s

7. The evaluation metric used is mean_squared_error

8. If AutoGluon infers the wrong evaluation metric, it allows users to manually specify the correct one when calling fit().

9. Training Size: 68, Validation Size: 68
Training to Total Data Ratio: 0.50

10.  To load, use: predictor = TabularPredictor.load("/content/AutogluonModels/ag-20250323_044811")

11. The models were saved in: AutogluonModels/ag-20250323_044811

12. The File Format Models are saved in.pkl

<!-- SOLUTION -->
### Solution

In the following images, the arrows indicate where in the output you can find the answers to the questions. The numbers on the arrows correspond to the numbers of the questions in the previous cell.

<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_01.png"  width="900" height=auto>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_02.png"  width="900" height=auto>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_03.png"  width="900" height=auto>
<p style="padding: 10px; border: 1px solid black;">
<img src="./images/lab3_04.png"  width="900" height=auto>

<!-- END SOLUTION -->

----
## Conclusion

The purpose of this notebook was to explore a dataset of information about books and to use AutoGluon to build a basic model to predict book prices based on book features.

## Next lab
In the next lab, you will learn how to use AutoGluon features to refine your model.