# AzureML Studio : Automated ML

In this notebook, we use AzureML Studio's [AutoML](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) to automatically select the best model under time and compute constraints.

The AutoML process is as follows:
- check the data for potential issues (class balancing detection, missing feature value imputation, high-cardinality feature detection)
- build data pipelines with:
  - a data preprocessor (scaling, standardisation, normalisation)
  - a prediction model (XGBoost, LightGBM, Random Forest, Linear Regression, etc.)
  - a set of hyperparameters (learning rate, number of trees, etc.)
- train each pipeline for a limited time
- evaluate the performance of each pipeline
- select the best pipeline
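
The steps above can be illustrated with a small scikit-learn sketch. This is not what AzureML runs internally: the synthetic data, the three candidate pipelines, the `TIME_BUDGET_S` value and the 3-fold ROC AUC scoring are all assumptions chosen for illustration, and the time budget is only checked between candidates rather than enforced per pipeline.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Synthetic stand-in data (assumption: the real experiment uses its own dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each candidate = one preprocessor + one model + fixed hyperparameters.
candidates = {
    "maxabs+logreg": make_pipeline(
        MaxAbsScaler(), LogisticRegression(C=1.0, max_iter=500)
    ),
    "standard+logreg": make_pipeline(
        StandardScaler(), LogisticRegression(C=0.1, max_iter=500)
    ),
    "maxabs+forest": make_pipeline(
        MaxAbsScaler(), RandomForestClassifier(n_estimators=20, random_state=0)
    ),
}

TIME_BUDGET_S = 30.0  # hypothetical overall budget, in seconds
start = time.monotonic()
scores = {}
for name, pipeline in candidates.items():
    if time.monotonic() - start > TIME_BUDGET_S:
        break  # budget exhausted: stop trying new pipelines
    # Evaluate the pipeline with 3-fold cross-validated ROC AUC.
    scores[name] = cross_val_score(pipeline, X, y, cv=3, scoring="roc_auc").mean()

# Keep the pipeline with the highest mean score.
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

A real AutoML service also searches over hyperparameters and chooses which candidates to try next; this sketch only ranks a fixed list.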

The experiment is visible in the AzureML Studio : [oc-p7-automated-ml](https://ml.azure.com/experiments/id/9bde22d7-75e6-41cc-9bfe-03e15873f292?wsid=/subscriptions/da2e4791-6dd1-422b-848a-a961cef6ab89/resourcegroups/OC_P7/workspaces/oc-p7-ml-workspace&tid=43204f6d-c600-4585-985a-6bafda08d2bb)

We will compare the resulting AutoML model to the baseline model from [1_baseline.ipynb](1_baseline.html).

## AutoML model after 1h training on CPU

In this version, we did not include DNN models in the AutoML process, because they require GPU resources.

This AutoML run is available in the AzureML Studio : [automl_1h-cpu](https://ml.azure.com/experiments/id/9bde22d7-75e6-41cc-9bfe-03e15873f292/runs/AutoML_d6a52352-b3dd-4cf3-82e3-b5e44d038436?wsid=/subscriptions/da2e4791-6dd1-422b-848a-a961cef6ab89/resourcegroups/OC_P7/workspaces/oc-p7-ml-workspace&tid=43204f6d-c600-4585-985a-6bafda08d2bb#models)

Here are the models that were trained during the AutoML run:

![AzureML - AutomatedML - 1h on CPU - models](img/azureml_automated_ml_1h_cpu_models.png)


The best model is an `XGBoostClassifier` (a wrapper for [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier)) with a [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) preprocessor.
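
To make the preprocessing step concrete, here is a minimal pure-Python sketch of max-abs scaling: each feature column is divided by its largest absolute value, so values land in [-1, 1] and zeros stay zeros. The helper name `max_abs_scale` and the toy matrix are hypothetical; in practice one would use scikit-learn's `MaxAbsScaler` directly.

```python
def max_abs_scale(rows):
    """Scale each column by its maximum absolute value (no centring)."""
    n_cols = len(rows[0])
    # Largest absolute value per column; use 1.0 for all-zero columns
    # so that they are left unchanged instead of dividing by zero.
    scales = [max(abs(row[j]) for row in rows) or 1.0 for j in range(n_cols)]
    return [[row[j] / scales[j] for j in range(n_cols)] for row in rows]


# Toy feature matrix (illustrative values only).
features = [
    [2.0, -10.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],
]
scaled = max_abs_scale(features)
for row in scaled:
    print(row)
```

Because the data is not shifted, sparse inputs stay sparse after scaling.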


| Confusion Matrix | Precision Recall Curve (AP = 0.79) | ROC Curve (AUC = 0.80) |
|---|---|---|
| ![Confusion Matrix](img/azureml_automated_ml_1h_cpu_confusion_matrix.png) | ![Precision Recall Curve](img/azureml_automated_ml_1h_cpu_precision_recall.png) | ![ROC Curve](img/azureml_automated_ml_1h_cpu_ROC.png) |


  

Performance on the dataset is noticeably better than that of our baseline model:
- Average Precision = 0.79 (baseline = 0.73, +8.2%)
- ROC AUC = 0.80 (baseline = 0.74, +8.1%)
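
The percentages in parentheses are relative changes with respect to the baseline score. A short check of the arithmetic, using only the rounded scores quoted above (the helper name `relative_change` is hypothetical):

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100


ap_gain = relative_change(0.79, 0.73)   # Average Precision vs baseline
auc_gain = relative_change(0.80, 0.74)  # ROC AUC vs baseline
print(f"Average Precision: {ap_gain:+.1f}%")
print(f"ROC AUC: {auc_gain:+.1f}%")
```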

Like our baseline model, this model is biased towards the _POSITIVE_ class, but much less so: it predicted 78403 _POSITIVE_ messages against 65597 _NEGATIVE_ ones, a gap of about 9% of all predictions (baseline = 35%, -74%).
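
The 9% figure can be reproduced from the two counts if it is read as the gap between _POSITIVE_ and _NEGATIVE_ predictions divided by the total number of predictions. That reading is an assumption, since the text does not state the formula; measured relative to the _NEGATIVE_ count alone, the same gap is larger. A quick check:

```python
positive, negative = 78403, 65597  # predicted counts quoted in the text
total = positive + negative
gap = positive - negative

gap_share = gap / total * 100     # gap as a share of all predictions
gap_ratio = gap / negative * 100  # gap relative to the NEGATIVE count

# Relative change of the rounded bias figure versus the baseline's 35%.
change_vs_baseline = (9 - 35) / 35 * 100

print(f"total predictions: {total}")
print(f"gap / total: {gap_share:.1f}%")
print(f"gap / negative: {gap_ratio:.1f}%")
print(f"change vs baseline: {change_vs_baseline:.0f}%")
```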
