<p style="padding: 10px; border: 1px solid black;">
<img src=".././images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
</p>

# MLU Day One Machine Learning - DEMO

In [1]:
# Load in libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as img
from sklearn.model_selection import train_test_split

## ML Problem Description
Predict the unit of measure (count, volume, or weight) that applies to a product from the Amazon Product Catalog.
> This is a multiclass classification task (3 distinct classes). <br>

The data for this ML problem has 32 features and 1 label column. Examples of features include:


| Feature | Description |
| :---        |    :----  |
| marketplace_id | Marketplace ID.|
| product_type   | Type of product.  |
| item_name | Short item description. |
| product_description   | Long item description.  |
| bullet_point | Bullet point item description. |
| brand   | Brand name.  |
| manufacturer | Manufacturer name. |
| ...   | ...  |
| list_price_value_with_tax   | Price of item including tax.  |
| imgID | ID for image of product. |
| ID   | Product identifier.  |

## 1. <a name="5">AutoGluon Installation</a>

We need to begin by installing AutoGluon (documentation [here](https://auto.gluon.ai/stable/install.html)).  


__NOTE__: This may take a few minutes to install (you can see that it is finished once the `[*]` symbol next to the cell disappears and turns into a number).

In [2]:
!python3 -m pip install -qU pip
!python3 -m pip install -qU setuptools wheel
!python3 -m pip install -qU "mxnet<2.0.0"
!python3 -m pip install -qU autogluon

Now we load the libraries needed to work with our Tabular dataset.

In [3]:
from autogluon.tabular import TabularPredictor, TabularDataset

___
## 2. <a name="5">Getting the Data</a>
Let's load the training and test datasets and look at the first few rows of each.

In [4]:
train = TabularDataset(".././datasets/uomi-train.csv")
test = TabularDataset(".././datasets/uomi-test.csv")

In [5]:
# Print size of training set
print(f"Size of training set: {len(train)}")

# Show the first rows of training data
train.head(2)

Size of training set: 28305


| | ID | marketplace_id | label | product_type | item_name | product_description | bullet_point | brand | manufacturer | part_number | ... | item_dimensions_height | item_dimensions_width | item_dimensions_length | normalized_item_weight | normalized_item_package_weight | list_price_currency | list_price_value | list_price_value_with_tax | imgID | ID_0 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1633 | 1 | 1 | GROCERY | JELL-O Play Ocean Build + Eat Kit, 6 oz Box | | One 6 oz. JELL-O Play Ocean Build + Eat Kit | Jell-O Play | Jell-o | 4300008150.0 | ... | 2.625 | 6.625 | 8.5 | 0.023438 | 0.500449 | USD | 3.99 | | 51sislDjTYL | 9cd726a519754b6bad27be39bc95cac6 |
| 1 | 18103 | 1 | 2 | GROCERY | Crystal Light Pure Variety Pack includes- Rasp... | With no artificial sweeteners, flavors or pres... | Customer Will Receive 6 Boxes Total - 1 Raspbe... | Crystal Light | Crystal Light | | ... | | | | | 0.599657 | | | | 41MsGCednqL | 44a997b7ff9f4d2ebd1615ac5f3861ff |


In [6]:
# Print size of test set
print(f"Size of test set: {len(test)}")

# Show the first rows of test data
test.head(2)

Size of test set: 12131


| | ID | marketplace_id | product_type | item_name | product_description | bullet_point | brand | manufacturer | part_number | model_number | ... | item_dimensions_height | item_dimensions_width | item_dimensions_length | normalized_item_weight | normalized_item_package_weight | list_price_currency | list_price_value | list_price_value_with_tax | imgID | ID_0 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 17912 | 1 | HEALTH_PERSONAL_CARE | ChicShop US 9.9 Inch Soft Penî's Lifelike Wate... | The products sold are all shipped confidential... | Perfect Gift: Dīldɔ suitable for beginners and... | ChicShop US | ChicShop US | ChicShop US | | ... | | | | 1.763698 | | | | | 41I-gf3fB+L | 49c30d495bb74a5e9ed988d4dffe9e23 |
| 1 | 36819 | 1 | BEAUTY | Zink Color Multi Purpose Glitter Brilliance Pr... | Z!NK COLORGlitter Brilliance Pro Emerald Green | Size: 5 gram gem shape jar | Zink Color inc. | Zink Color | emerald green glitter | | ... | 1.0 | 1.0 | 1.0 | 0.011023 | 0.022046 | USD | 5.99 | | 61OqJD7jjnL | dd82ff42aba547f6bdd0946ca5f61746 |
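
Since this is a three-class problem, it can help to check how the classes are distributed in the `label` column before training. The snippet below is a minimal sketch on a small made-up label series (the values are hypothetical and not taken from the dataset). With the real data, the same call could be applied to `train["label"]`.

```python
import pandas as pd

# Hypothetical labels, for illustration only (not taken from uomi-train.csv)
labels = pd.Series([2, 2, 1, 0, 2, 1, 2, 2, 1, 2], name="label")

# Share of each class, sorted by class value
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```

A strongly skewed distribution would be a reason to read accuracy scores with some caution.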


## 3. <a name="5">Train and Validation split (Optional)</a>

For unbiased evaluation, we want to split the training set into train and validation datasets. Also, it is good practice to create a small sample dataset to quickly run AutoGluon before using the full dataset.

In [7]:
# Take a sample of 1000 datapoints for a quick test
train_sample_small = train.sample(n=1000, random_state=1)

# For the small sample, we create a train & validation split
train_data_small, val_data_small = train_test_split(
    train_sample_small, test_size=0.1, shuffle=True, random_state=23
)
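
The split above is purely random, so a small validation set may end up with class proportions that differ from the training set. One possible alternative, not used in this notebook, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses a made-up data frame (the column values are hypothetical) to show that the class proportions are preserved.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with an imbalanced 3-class label
toy = pd.DataFrame(
    {
        "item_name": [f"item {i}" for i in range(100)],
        "label": [0] * 10 + [1] * 30 + [2] * 60,
    }
)

# stratify keeps the label proportions the same in both parts
toy_train, toy_val = train_test_split(
    toy, test_size=0.1, shuffle=True, random_state=23, stratify=toy["label"]
)

val_counts = toy_val["label"].value_counts().sort_index()
print(val_counts)
```

With the notebook's data, the equivalent change would be passing `stratify=train_sample_small["label"]` to the existing call.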

## 4. <a name="5">Training our model</a>

We can train a model using AutoGluon with only a single line of code.  All we need to do is tell it what column from the dataset we are trying to predict, and what the dataset is.

We are going to use only the small sample from our train dataset containing 1000 data points in order to train faster.

In [8]:
# We specify train and validation data for the model training
first_predictor = TabularPredictor(label="label").fit(
    train_data=train_data_small, tuning_data=val_data_small
) 

No path specified. Models will be saved in: "AutogluonModels/ag-20210909_205717/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20210909_205717/"
AutoGluon Version:  0.3.1
Train Data Rows:    900
Train Data Columns: 33
Tuning Data Rows:    100
Tuning Data Columns: 33
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	3 unique label values:  [1, 2, 0]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    62723.1 MB
	Train Data (Original)  Memory Usage: 2.44 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feat

## 5. <a name="5">AutoGluon Results</a>
Now let's take a look at all the training information AutoGluon provides via its __leaderboard function__. <br/>

In [9]:
first_predictor.leaderboard(silent=True)

| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | LightGBMLarge | 0.77 | 0.0355 | 4.728991 | 0.0355 | 4.728991 | 1 | True | 13 |
| 1 | WeightedEnsemble_L2 | 0.77 | 0.036028 | 5.128941 | 0.000527 | 0.39995 | 2 | True | 14 |
| 2 | LightGBM | 0.74 | 0.034194 | 1.57794 | 0.034194 | 1.57794 | 1 | True | 5 |
| 3 | LightGBMXT | 0.74 | 0.034866 | 1.851772 | 0.034866 | 1.851772 | 1 | True | 4 |
| 4 | CatBoost | 0.74 | 0.067553 | 6.483831 | 0.067553 | 6.483831 | 1 | True | 8 |
| 5 | NeuralNetFastAI | 0.74 | 0.246855 | 15.204956 | 0.246855 | 15.204956 | 1 | True | 3 |
| 6 | RandomForestEntr | 0.73 | 0.116251 | 1.443754 | 0.116251 | 1.443754 | 1 | True | 7 |
| 7 | XGBoost | 0.72 | 0.017889 | 2.779312 | 0.017889 | 2.779312 | 1 | True | 11 |
| 8 | ExtraTreesEntr | 0.71 | 0.115485 | 1.330225 | 0.115485 | 1.330225 | 1 | True | 10 |
| 9 | ExtraTreesGini | 0.71 | 0.115989 | 1.337786 | 0.115989 | 1.337786 | 1 | True | 9 |
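
The leaderboard can also be used to weigh validation score against prediction time. The sketch below rebuilds a few leaderboard rows by hand (values copied from the output above) and picks the fastest model whose `score_val` is within a chosen tolerance of the best score. The tolerance value is an arbitrary assumption for illustration. In the notebook, the data frame returned by `first_predictor.leaderboard(silent=True)` could be used directly instead.

```python
import pandas as pd

# A few rows copied from the leaderboard output above
leaderboard = pd.DataFrame(
    {
        "model": ["LightGBMLarge", "WeightedEnsemble_L2", "LightGBM", "CatBoost", "XGBoost"],
        "score_val": [0.77, 0.77, 0.74, 0.74, 0.72],
        "pred_time_val": [0.035500, 0.036028, 0.034194, 0.067553, 0.017889],
    }
)

# Assumed tolerance: accept models up to 0.035 below the best validation score
tolerance = 0.035
best_score = leaderboard["score_val"].max()
candidates = leaderboard[leaderboard["score_val"] >= best_score - tolerance]

# Among the acceptable models, take the one with the lowest prediction time
fastest = candidates.sort_values("pred_time_val").iloc[0]
print(fastest["model"], fastest["score_val"], fastest["pred_time_val"])
```

Note that the validation set here has only 100 rows, so small differences in `score_val` correspond to just a handful of examples.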


## 6. <a name="5">Making a Prediction</a>
Now that our model is trained, let's use it to predict the labels for the test dataset.

In [10]:
prediction = first_predictor.predict(test)
print(f"Predictions for the first 20 data points in the test dataset: {prediction.values[0:20]}")

Predictions for the first 20 data points in the test dataset: [2 2 2 2 2 2 2 1 2 2 1 0 2 2 2 2 2 2 2 2]
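
A quick sanity check is to count how often each class is predicted. The sketch below does this for the 20 predictions printed above. In the notebook, the same idea could be applied to the full `prediction` series.

```python
import numpy as np

# The first 20 predictions, copied from the output above
first_20 = np.array([2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2])

# Count occurrences of each predicted class
classes, counts = np.unique(first_20, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(class_counts)  # {0: 1, 1: 2, 2: 17}
```

If one class dominates the predictions much more than it does the training labels, the model may be biased toward that class.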


## 7. <a name="5">Submission to our MLU Leaderboard</a>
For this example, we will submit to the Units of Measurement Challenge on Leaderboard. The link for this challenge is https://leaderboard.corp.amazon.com/tasks/724.

In [11]:
# Creating a new dataframe for the submission
submission = test[["ID"]].copy(deep=True)

# Creating label column from the unit-of-measure predictions
submission["label"] = prediction

# Saving our csv file for Leaderboard submission
# index=False prevents printing the row IDs as separate values
submission.to_csv(
    ".././datasets/predictions/UOMI_Prediction_to_Leaderboard_First_Predictor.csv",
    index=False,
)  

In [12]:
# Inspect the first rows of our submission file
submission.head()

| | ID | label |
| :--- | :--- | :--- |
| 0 | 17912 | 2 |
| 1 | 36819 | 2 |
| 2 | 15782 | 2 |
| 3 | 6565 | 2 |
| 4 | 33314 | 2 |


Run the cell below to check that your submission file has the right IDs for the MLU Leaderboard.

In [13]:
print("Double-check submission file against the original test file...")
check_submission = TabularDataset(".././datasets/uomi-test.csv")
print(
    "Differences between project result IDs and sample submission IDs:",
    (check_submission["ID"] != submission["ID"]).sum(),
)

Double-check submission file against the original test file...


Loaded data from: .././datasets/uomi-test.csv | Columns = 33 / 33 | Rows = 12131 -> 12131


Differences between project result IDs and sample submission IDs: 0
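
The ID comparison above can be extended with a few more checks before uploading. The helper below is a hypothetical sketch, not part of the notebook or of any Leaderboard tooling. It assumes the submission should contain exactly the columns `ID` and `label`, and that valid labels are 0, 1, and 2, as seen in the training log.

```python
import pandas as pd


def validate_submission(submission, test_ids, allowed_labels=(0, 1, 2)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ["ID", "label"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != len(test_ids):
        problems.append("row count differs from the test set")
    elif not (submission["ID"].to_numpy() == test_ids.to_numpy()).all():
        problems.append("IDs differ from the test set or are in a different order")
    if submission["label"].isna().any():
        problems.append("missing labels")
    if not submission["label"].dropna().isin(list(allowed_labels)).all():
        problems.append("labels outside the allowed set")
    return problems


# Toy example using the first three IDs from the submission shown above
test_ids = pd.Series([17912, 36819, 15782], name="ID")
good = pd.DataFrame({"ID": [17912, 36819, 15782], "label": [2, 2, 2]})
bad = pd.DataFrame({"ID": [17912, 36819, 15782], "label": [2, 5, 2]})

good_problems = validate_submission(good, test_ids)
bad_problems = validate_submission(bad, test_ids)
print(good_problems)
print(bad_problems)
```

In the notebook, this could be called as `validate_submission(submission, check_submission["ID"])`.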


## 8. <a name="5">Re-Train with full data and predict again</a>


In [14]:
import os

# Allow AutoGluon's text models to train on CPU when no GPU is available
os.environ["AUTOGLUON_TEXT_TRAIN_WITHOUT_GPU"] = "1"

# Retrain the model using all training data, with a 20-minute time limit
# We let AutoGluon perform the train/validation split directly
second_predictor = TabularPredictor(label="label").fit(
    train_data=train, time_limit=60 * 20
)

prediction = second_predictor.predict(test)

# Creating a new dataframe for the submission
submission = test[["ID"]].copy(deep=True)

# Creating label column from the unit-of-measure predictions
submission["label"] = prediction

# Saving our csv file for Leaderboard submission
# index=False prevents printing the row IDs as separate values
submission.to_csv(
    ".././datasets/predictions/UOMI_Prediction_to_Leaderboard_Second_Predictor.csv",
    index=False,
)  

No path specified. Models will be saved in: "AutogluonModels/ag-20210909_205826/"
Beginning AutoGluon training ... Time limit = 1200s
AutoGluon will save models to "AutogluonModels/ag-20210909_205826/"
AutoGluon Version:  0.3.1
Train Data Rows:    28305
Train Data Columns: 33
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	3 unique label values:  [1, 2, 0]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    59847.97 MB
	Train Data (Original)  Memory Usage: 69.35 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manuall

## 9. <a name="5">Before You Go</a>
After you are done with this hands-on, you can remove the model artifacts by executing the cell below.

__It is always good practice to clean everything when you are done, preventing the disk from getting full.__

In [15]:
!rm -rf AutogluonModels