<p style="padding: 10px; border: 1px solid black;">
<img src="./images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>

# MLU Day One Machine Learning - Hands On

### <font color='orange'>Please make sure to run the below cell! It will allow you to print solutions for the code challenges.</font> 

In [None]:
# Load coding libraries
from sklearn.model_selection import train_test_split
import pandas as pd

# Import utility functions that provide answers to challenges
%load_ext autoreload
%aimport dayone_utils

## Objective
This hands-on notebook is meant to let you practice the concepts you have learned in this course so far.
Here we explore a big database of books (books of different genres, from thousands of authors).<br/>

We want to predict book prices using book features, such as genre, release date, ratings, and number of reviews. 
This is a regression problem: we have a book price column in our dataset that we can use as labels.

## Part I - Leaderboard Submission
In the first part of the notebook you are going to learn how __AutoGluon__ can solve the book price prediction problem.<br/>

You will learn how to build a simple and quick base model and then iterate on it to make it better. To measure how well you are doing (and to see how the model improves), you have to submit your model's predictions to the [__MLU Leaderboard__](https://leaderboard.corp.amazon.com/tasks/718). The leaderboard assesses your performance against other participants, and it also counts towards your course completion. 

We ask you to make 2 submissions in this section:<br/>
1. First, a simple prediction trained with a smaller dataset, so that you get your first submission in quickly.
2. Then another prediction trained with a full dataset, in order to submit an improved result.

## Part II - Advanced AutoGluon
In the second part of the notebook you will find some advanced features of AutoGluon to explore feature importance and explainability. You're welcome to use the insights you can gain from this section to make an additional 3rd submission. However, a quick word of warning - AutoGluon is very powerful in its base form so you might not see much additional model improvement.


After the hands-on work, we will walk through the solution as well.

___
## How does this Hands-On notebook work?


In this course, we are not trying to measure your coding skills, so you will find solutions throughout the notebook: 
All the challenges have answers that you can copy and paste into the challenge coding area.

**No matter how experienced and skilled you are with coding, you will be able to submit a solution!**

Throughout the notebook, you will be presented with two kinds of exercises: __Knowledge Tasks__ and __Coding Challenges__. <br/>


|Knowledge Tasks      | Coding Challenges |
|:---    |   ---  |
| No coding needed for these tasks. <br /> Try to understand what is happening and run the cells & code associated with them. | These are challenges where you can practice your coding skills. <br /> Once done, uncomment the challenge answer and check your solution. <br />  __NOTE:__ Try hard to code your solution before looking at the answer. <br /> __Learn and Be Curious, right?__|
| <img style="float: center;" src="./images/task_robot.png" alt="drawing" width="100"/>|<img style="float: center;" src="./images/challenge_robot.png" alt="drawing" width="130"/>| 



___

# Part I - Leaderboard Submission
Let's solve the book price prediction problem using __AutoGluon__.

## 1. <a name="5">AutoGluon Installation</a>

We need to begin by installing AutoGluon (documentation [here](https://auto.gluon.ai/stable/install.html)).  


__NOTE__: This may take a few minutes to install (you can see that it has finished once the `[*]` symbol next to the cell disappears and turns into a number).

In [None]:
!python3 -m pip install -qU pip
!python3 -m pip install -qU setuptools wheel
!python3 -m pip install -qU "mxnet<2.0.0"
!python3 -m pip install -qU autogluon

Now we load the libraries needed to work with our Tabular dataset.

In [None]:
# Importing the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

## 2. <a name="5">Business Problem Summary</a>

Let's output an overview of our book price predicting __business problem__. <br/>

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/>  
>Run the function below for a description of the business problem and the data dictionary that will be used to solve it.

In [None]:
dayone_utils.answer_html("BP1")

___
## 3. <a name="5">Getting the Data</a>

Let's get the data for our business problem.

>  <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100" /> 
>  Run the cell below to load and take a look at the first samples of our train dataset. <br/>
Compare it with your data dictionary to see if everything is there and if the data makes sense. This is a very basic check when performing __Data Exploration__.

In [None]:
df_train = TabularDataset(data="./datasets/training.csv")
df_train.head()

## 4. <a name="5">Training our model</a>

We can train a model using AutoGluon with only a single line of code.  All we need to do is to tell it which column from the dataset we are trying to predict, and what the dataset is.

For this first training, we are going to randomly sample 1000 samples of our train dataset in order to have a faster training.

### Why are we splitting our data into train and validation below?
The reason we split our original data into train and validation datasets is related to __overfitting__. Splitting off a validation dataset and measuring performance on it is a useful way to identify whether your model is overfitting.
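
As a minimal sketch of why a held-out split reveals overfitting, the example below uses synthetic data and a hypothetical "memorizer" model (a 1-nearest-neighbour lookup, not something AutoGluon trains here). The model reproduces its training labels perfectly, and only the validation rows show how poorly it generalizes. All names and numbers are illustrative assumptions.

```python
import math
import random

random.seed(0)

# Synthetic data: the label depends linearly on one feature, plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 * x + random.gauss(0, 2.0)) for x in xs]

# Hold out 10% of the rows for validation (similar in spirit to test_size=0.1).
random.shuffle(data)
n_val = len(data) // 10
val_rows, train_rows = data[:n_val], data[n_val:]


def predict_memorizer(x):
    """Return the label of the closest training point (1-nearest neighbour)."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]


def rmse(rows):
    """Root mean squared error of the memorizer on the given rows."""
    squared = [(predict_memorizer(x) - y) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))


train_rmse = rmse(train_rows)  # the model has memorized these rows
val_rmse = rmse(val_rows)      # these rows were never seen by the model
print(f"train RMSE: {train_rmse:.3f}, validation RMSE: {val_rmse:.3f}")
```

A large gap between the two numbers is the typical signature of overfitting; without the validation rows, the perfect training score would be misleading.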



> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/>  Run the cell below to prepare the datasets (AutoGluon is doing all the magic for us). <br/>
Here we are randomly selecting 1000 rows of our dataset and splitting it into train and validation datasets.
> 

<br/>

__NOTE__: The `random_state` parameter below makes the results repeatable when you run the code multiple times.
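
To illustrate what repeatability means here, the following sketch samples from a small made-up DataFrame (the column values are assumptions, not the real book data). Calling `sample` twice with the same `random_state` selects the same rows.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
df = pd.DataFrame(
    {"ID": range(10), "Price": [float(p) for p in range(100, 110)]}
)

# The same seed selects the same rows on every call.
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)
print(first.equals(second))

# A different seed will usually select a different subset.
other = df.sample(n=3, random_state=1)
print(other["ID"].tolist())
```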

In [None]:
# Run the code below
# Sampling 1000
subsample_size = 1000  # subsample subset of data for faster demo, try setting this to much larger values
df_train_smaller = df_train.sample(n=subsample_size, random_state=0)

# Splitting in train and validation datasets
train_data, val_data = train_test_split(
    df_train, test_size=0.1, shuffle=True, random_state=23
)
train_data_smaller, val_data_smaller = train_test_split(
    df_train_smaller, test_size=0.1, shuffle=True, random_state=23
)

# Printing the first rows
train_data_smaller.head()

## Training our model with a small sample

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
For this first training we are going to use the smaller dataset with 1000 samples of our original train dataset in order to have a faster training.

In [None]:
# Run the code below
smaller_predictor = TabularPredictor(label="Price").fit(train_data=df_train_smaller)

## Interpreting the Training Output
AutoGluon outputs a lot of information about what is happening.

<img style="float: left;" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
<br/><br/>
<br/>
<br/>

> After the training above finishes, examine the output and try to find the information below in the messages printed by AutoGluon. <br/>
1. What is the shape of your training dataset?
2. What kind of ML problem type does AutoGluon infer (classification, regression, ...)? Remember, you've never mentioned what kind of problem type it is; you only provided the label column.
3. What does AutoGluon suggest in case it inferred the wrong problem type?
4. Identify the kind of data preprocessing and feature engineering performed by AutoGluon.
5. Find the basic statistics about your label in the print statements from AutoGluon.
6. How many extra features were generated besides the originals in our dataset? What was the runtime for that?
7. What is the evaluation metric used?
8. What does AutoGluon suggest doing if it inferred the wrong metric?
9. How much of the training data was held out for validation when the data was split?
10. Identify the folder where the models are saved.
11. Identify where AutoGluon saved your predictor.
12. Enter a specific model folder and take a quick look to see the file format.

__Please, try hard to identify all information above before uncommenting the answer below. <br/>
Day One is about Learn and Be Curious, right?__

################# LIST YOUR ANSWERS HERE #################
1. <br/>
2. <br/>
3. <br/>
4. <br/>
5. <br/>
6. <br/>
7. <br/>
8. <br/>
9. <br/>
10. <br/>
11. <br/>
12. <br/>

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FIT_INFO")

## 5. <a name="5">AutoGluon Leaderboard</a>
Now let's take a look at all the information AutoGluon provides via its __leaderboard function__. <br/> 

__NOTE__: Don't confuse this with the MLU Leaderboard. The MLU Leaderboard is where you will make submissions with the predictions from your trained models; the AutoGluon leaderboard function is a summary of all models that AutoGluon trained.

<br/>

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Run the cell below and take a closer look at AutoGluon's leaderboard output. <br/>
__Which one is the best model?__

<br/>

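__NOTE__: AutoGluon reports every metric in a higher-is-better form, so an error metric such as RMSE is shown with a negative sign (the value closest to zero is the best). The sign is only there so that models can be ranked consistently.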
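
The sign convention can be sketched in a few lines of plain Python. The model names and predictions below are made up for illustration; the point is only that negating RMSE lets "pick the maximum score" select the model with the smallest error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long lists."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = [100.0, 200.0, 300.0]
predictions = {
    "model_a": [110.0, 190.0, 310.0],  # off by 10 on every row
    "model_b": [150.0, 150.0, 350.0],  # off by 50 on every row
}

# Negate the error so that a higher score always means a better model.
scores = {name: -rmse(y_true, pred) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```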


In [None]:
# Run the code below
smaller_predictor.leaderboard(silent=True)

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_BEST")

## 6. <a name="5">Making a Prediction</a>
### Now that your model is trained, let's use it to predict Prices

We are now reading the test dataset that was not used to train our model. It is a good practice to assess if your model is __overfitting__. 
#### Why are we using a different dataset that was not used so far during the training step?
We should always run a final model performance assessment using data that was unseen by the model.

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
Run the cell below to load the test dataset that we will use for the MLU leaderboard. 

In [None]:
# Run the code below
df_test_leaderboard = TabularDataset("./datasets/mlu-leaderboard-test.csv")

# We show the first row there.
df_test_leaderboard.head(1)

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
Use this new dataset as input to the model you have just trained to predict Book Prices on it <br/>
__TIP:__ look at the AutoGluon Tasks documentation and look for function __predict__ to see how to implement it [here](https://auto.gluon.ai/api/autogluon.task.html#autogluon.tabular.TabularPredictor.predict).

__Please try hard to solve the challenge before uncommenting the answer below. You know, it is about Learn and Be Curious, right?__

In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_PRED")

## 7. <a name="5">Your First MLU Leaderboard Submission</a>
### Now you are ready for your first submission to our MLU Leaderboard!

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Run the cell below to save your prediction file in the format expected by the MLU Leaderboard.

In [None]:
# define pandas columns
df_submission = pd.DataFrame(columns=["ID", "Price"])
# Creating ID column from ID list
df_submission["ID"] = df_test_leaderboard["ID"].tolist()
# Creating label column from price prediction list
df_submission["Price"] = price_prediction
# saving your csv file for Leaderboard submission
df_submission.to_csv(
    "./datasets/predictions/Prediction_to_Leaderboard.csv", index=False
)

#### Let's quickly check that the file contains the expected IDs
> <img style="float: left; padding-right: 30px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> 1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
> 2. If the difference is zero you are good to go!
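
The check in the next cell compares the two `ID` columns position by position, so it is sensitive to row order as well as to wrong IDs. A minimal sketch with made-up IDs (the names `expected_ids` and `submitted_ids` are illustrative and do not appear in the notebook):

```python
import pandas as pd

expected_ids = pd.Series([101, 102, 103, 104])
submitted_ids = pd.Series([101, 102, 104, 103])  # last two rows swapped

# Element-wise comparison yields True wherever the IDs differ;
# summing the booleans counts the mismatching positions.
mismatches = int((expected_ids != submitted_ids).sum())
print(mismatches)  # two positions differ
```

A result of zero therefore means that every row of the submission lines up with the corresponding row of the test file.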

In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_submission["ID"]).sum(),
)

### Downloading the Prediction File and Submitting
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100" align="left"/> 
> 1. Download the file you just saved to your local machine. <br/>
> 2. Follow the instructions on the Leaderboard submission page [here](https://leaderboard.corp.amazon.com/tasks/718/submit) to submit your file.

<br>
You can find your submission file in the folder <code>datasets > predictions</code>.

## 8. <a name="5">Your Second MLU Leaderboard Submission with the Full Train Dataset</a>

<img style="float: left;" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
<br/><br/>
<br/>
<br/>

> Now that you made your first submission using the small sample from your dataset, repeat the process using the full dataset and submit again to see if your score gets better.<br>
If you don't know how to write the code for this, uncomment the challenge answer; copy and paste it in the section below.

__NOTE__: It should take around 12-15 minutes to run this training with our CPU. Just in case, use the `time_limit` parameter (in seconds) to limit the run time to 20 minutes.



In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ### CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FULL_PRED")

### Second MLU Leaderboard Submission with the Full Train Dataset

><img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
1. Run the AutoGluon leaderboard function for the model trained on the smaller dataset in the first cell below.
2. Run the leaderboard function again for the model trained on the full dataset in the second cell below.
3. Compare the performances.

__How can you explain the differences in `score_val` and `fit_time` columns?__
 


In [None]:
############## FIRST CODE HERE ####################


############## END OF CODE ####################

In [None]:
############## SECOND CODE HERE ###############


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FULL_LEAD")

### Get the second submission for MLU Leaderboard ready

><img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Write the code that creates the output file using the predictions from your second model.


In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FULL_SUBM")

#### Let's quickly check that the file contains the expected IDs
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
2. If the difference is zero you are good to go

In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_full_submission["ID"]).sum(),
)

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Submit again to MLU leaderboard to improve your score. For the submission use the link as before [here](https://leaderboard.corp.amazon.com/tasks/718/submit).<br>

___
# Part II - Advanced AutoGluon
## 9. <a name="5">AutoGluon Advanced Features</a>

Now that you have made your first Leaderboard submission, let's practice using some advanced features of AutoGluon. <br/>

## 9.1. <a name="4">Explainability</a>

There are growing business needs and legislative regulations that require explanations of why a model made a certain decision.<br/>
To better understand our trained predictor, we can estimate the overall importance of each feature.

## 9.1.1 <a name="5">Feature Importance</a> 
A feature’s importance score represents the performance drop that results when the model makes predictions on a perturbed copy of the dataset where this feature’s values have been randomly shuffled across rows. A feature score of 0.01 would indicate that the predictive performance dropped by 0.01 when the feature was randomly shuffled. The higher the score a feature has, the more important it is to the model’s performance. If a feature has a negative score, this means that the feature is likely harmful to the final model, and a model trained without that feature  would be expected to achieve a better predictive performance.
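
The shuffling idea described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not AutoGluon's implementation: the data is synthetic, the "model" is a fixed function that uses only the first feature, and the shuffle is done once rather than repeated.

```python
import math
import random

random.seed(0)

# Toy rows of (feature_0, feature_1); the label depends only on feature_0.
rows = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]
labels = [2.0 * f0 for f0, _ in rows]


def model(features):
    """Stand-in for a trained model: uses feature 0 and ignores feature 1."""
    return 2.0 * features[0]


def neg_rmse(feature_rows):
    """Higher-is-better score: the negated root mean squared error."""
    squared = [(model(r) - y) ** 2 for r, y in zip(feature_rows, labels)]
    return -math.sqrt(sum(squared) / len(squared))


def permutation_importance(col):
    """Score drop after shuffling one feature column across rows."""
    shuffled_col = [r[col] for r in rows]
    random.shuffle(shuffled_col)
    perturbed = [
        tuple(s if i == col else v for i, v in enumerate(r))
        for r, s in zip(rows, shuffled_col)
    ]
    return neg_rmse(rows) - neg_rmse(perturbed)


importance = {col: permutation_importance(col) for col in (0, 1)}
print(importance)
```

Shuffling feature 0 destroys the information the model relies on, so its score drops noticeably; shuffling feature 1 changes nothing because the model never reads it, so its importance is zero.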



> <img style="float: left;padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100" align="left"/> 
> Run the code below to see the output of the AutoGluon feature importance function for the first model we have run, with only 1000 samples. <br/>

In [None]:
# Run the code below
smaller_predictor.feature_importance(val_data_smaller)

## 9.1.2. <a name="5">An Experiment on Tuning the Data</a>

With AutoGluon you don't have to worry about which model to choose; instead, you can focus on the data itself. 
In the book price case, there are a few columns which are clearly very poorly encoded, most importantly the ```Edition``` column. <br/>
For this experiment, let's use our small dataset __df_train_smaller__ to make everything run a bit faster.

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Use the functions below to clean things up a bit and expand that data out.<br/>
For this experiment, our feature engineering tasks will be:<br/><br/>
>1. Splitting the Column ```Edition``` into three new ones: ```hard_paper```, ```year``` and ```month```
>2. Creating two numerical features based on the features ```Reviews``` and ```Ratings```, named ```Reviews-n``` and ```Ratings-n``` respectively.
>3. Drop the old columns from the dataset: ```Edition```,  ```Reviews``` and ```Ratings```. 

__Please, try hard to solve the challenge before uncommenting for the answer below.__ <br/>


__Day One is about Learn and Be Curious, right?__
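
To get a feel for what the helper functions in the next cell return, the sketch below applies compact equivalents of them to a single made-up row. The raw strings are hypothetical examples of how such columns might be encoded, and deriving `hard_paper` from the text before the first comma is an assumption, not necessarily the approach used in the challenge answer.

```python
import re

MONTHS = r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"


def first_num(text):
    """Return the leading number of a string such as '4.0 out of 5 stars'."""
    return float(re.sub(r"[^0-9.]", "", text.split(" ")[0]))


def year_get(text):
    """Return the first four-digit run as an int, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def month_get(text):
    """Return the first month abbreviation found, or the string 'None'."""
    match = re.search(MONTHS, text)
    return match.group() if match else "None"


# Hypothetical raw values for one book.
row = {
    "Edition": "Paperback, 10 Mar 2016",
    "Reviews": "4.0 out of 5 stars",
    "Ratings": "1,234 customer reviews",
}

engineered = {
    "hard_paper": row["Edition"].split(",")[0],  # assumed derivation
    "year": year_get(row["Edition"]),
    "month": month_get(row["Edition"]),
    "Reviews-n": first_num(row["Reviews"]),
    "Ratings-n": first_num(row["Ratings"]),
}
print(engineered)
```

Note that `year_get` returns `None` when no year is present, while `month_get` returns the string `"None"`; only the former shows up as a missing value later on.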

In [None]:
# Run the code below
import re
import pandas as pd


def first_num(in_val):
    num_string = in_val.split(" ")[0]
    digits = re.sub(r"[^0-9\.]", "", num_string)
    return float(digits)


def year_get(in_val):
    m = re.compile(r"\d{4}").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return int(m[0])
    else:
        return None


def month_get(in_val):
    m = re.compile(r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec").findall(in_val)
    # print(in_val, m)
    if len(m) > 0:
        return m[0]
    else:
        return "None"


def drop_features(in_feat):
    train_data_feateng.drop(in_feat, axis=1, inplace=True)
    val_data_feateng.drop(in_feat, axis=1, inplace=True)
    return

In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FEAT_ENG")

><img style="float: left;padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
>Now print the dataset with the new features to see what they look like.

In [None]:
# Run the code below
train_data_feateng.head(2)

## A bit of Data Preprocessing: Identifying Missing values
By doing the feature engineering above we introduced a new challenge. 
We might now have some missing data.
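
One common way to count missing values per column with pandas is `isna().sum()`. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions standing in for the feature-engineered data) and shows that a Python `None` in a numeric column is counted as missing, whereas the string `"None"` is not.

```python
import pandas as pd

# Toy frame standing in for the feature-engineered data (values are made up).
toy = pd.DataFrame(
    {
        "hard_paper": ["Paperback", "Hardcover", "Paperback"],
        "year": [2016, None, None],        # None becomes NaN
        "month": ["Mar", "None", "None"],  # the string "None" is a real value
    }
)

# Count missing entries in every column, then keep only affected columns.
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])
```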

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Try to identify the features that may have missing values and how many are missing. <br/>
__Are there any missing values?__

__Please, try hard to solve the challenge before uncommenting for the answer below.__ <br/>


__Day One is about Learn and Be Curious, right?__

In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_MISSING")

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Let's train the model again with these new and clean features to compare results.



In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_PRED_FEAT")

> <img style="float: left; padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
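> Run the AutoGluon `leaderboard` function for the original smaller model in the first cell below and for the feature-engineered model in the second cell below, then compare the results. <br/>
__Are there any significant differences?__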


In [None]:
############## FIRST CODE FROM THE ANSWER HERE ####################


############## END OF CODE ########################################

In [None]:
############## SECOND CODE FROM THE ANSWER HERE ####################


############## END OF CODE #########################################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_LEAD_COMP")

> <img style="float: left; padding-right: 30px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
1. Run the AutoGluon `feature_importance` function for the original smaller dataset in the first cell below.
2. Run the `feature_importance` function again for the feature-engineered dataset in the second cell below.
3. Compare the results.

__Are there any significant differences?__


In [None]:
############## CODE FOR THE ORIGINAL DATASET FEATURE IMPORTANCE HERE ####################


############## END OF CODE ############################################################

In [None]:
############## CODE FOR THE FEATURE ENGINEERED DATASET FEATURE IMPORTANCE HERE  ####################


############## END OF CODE #########################################################################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FEAT_COMP")

## 11. <a name="5">Further Enhancement</a>
So far we have worked with AutoGluon's default settings; however there are settings that let you tune things further.  When you have text, the best default line to run is the following.  Letting this run (for 14 hours instead of < 20 minutes) will produce a model that comes in second on the global leaderboard, all with only about an hour of human labor.

> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> Now it is time to train your model using AutoGluon's __enhanced version__.


<br/>
For this experiment we will use a time limit of 20 min (`time_limit` in seconds below).

__NOTE__: 20 minutes may not be enough to have a better score than your previous submission. If you have time, try running for more than 20 minutes to improve your performance!

In [None]:
# As we're working on CPU based instances, we need to tell AutoGluon to train without GPU
import os
os.environ['AUTOGLUON_TEXT_TRAIN_WITHOUT_GPU']='1'

In [None]:
enhanced_predictor = TabularPredictor(label="Price").fit(
    train_data=df_train, time_limit=20 * 60, hyperparameters="multimodal"
)

### Time to make Your Final Submission to the MLU Leaderboard

> <img style="float: left;padding-right: 20px" src="./images/challenge_robot.png" alt="drawing" width="130"/> 
> Now make a final prediction and submit this to MLU leaderboard.<br>

In [None]:
############## CODE HERE ####################


############## END OF CODE ####################

In [None]:
# ## CHALLENGE ANSWER
# dayone_utils.answer_html("CH_FINAL_SUBM")

#### Let's quickly check that the file contains the expected IDs
><img style="float: left; padding-right: 30px" src="./images/task_robot.png" alt="drawing" width="100"/> 
> 1. Run the cell below to check if your submission file has the right IDs for the MLU Leaderboard.
2. If the difference is zero you are good to go!

In [None]:
# Run the code below
print("Double-check submission file against the original test file")
sample_submission_df = pd.read_csv("./datasets/mlu-leaderboard-test.csv", sep=",")
print(
    "Differences between project result IDs and sample submission IDs:",
    (sample_submission_df["ID"] != df_enhanced_submission["ID"]).sum(),
)

<p style="padding: 10px; border: 1px solid black;">
<img src="./images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
    
## Congrats for Finishing this Hands On!!
In the next module, __Code Walkthrough and Advanced AutoGluon__, we are going to walk through your solutions and also show a notebook that implements an __end-to-end__ solution, deploying your model for use in production.

## 12. <a name="5">Before You Go</a>
> <img style="float: left; padding-right: 20px" src="./images/task_robot.png" alt="drawing" width="100"/> 
>After you are done with this Hands On, you can clean up all model artifacts by uncommenting and executing the cell below.<br/>

__It's always a good practice to clean up everything when you are done.__

In [None]:
# !rm -r AutogluonModels