# AzureML Studio : Designer

In this notebook, we use AzureML Studio's [Designer](https://docs.microsoft.com/en-us/azure/machine-learning/concept-designer) to build our data processing and model training pipeline.

The experiment is visible in the AzureML Studio : [oc-p7-automated-ml](https://ml.azure.com/experiments/id/9bde22d7-75e6-41cc-9bfe-03e15873f292?wsid=/subscriptions/da2e4791-6dd1-422b-848a-a961cef6ab89/resourcegroups/OC_P7/workspaces/oc-p7-ml-workspace&tid=43204f6d-c600-4585-985a-6bafda08d2bb)

![AzureML Designer - Pipeline](img/azureml_designer_pipeline.png)

We will compare the models trained in this pipeline to the baseline model from [1_baseline.ipynb](1_baseline.html).

## Text preprocessing

Before training our models, the data is prepared as follows:
- data is sampled to 2% of the original data (stratified according to the target variable)
- text is processed :
  - expand verb contractions
  - remove stop words
  - use lemmatization
  - detect sentences by adding a sentence terminator "|||" that can be used by the n-gram features extractor module
  - normalize case to lowercase
  - remove numbers
  - replace non-alphanumeric special characters with the "|" character
  - remove duplicate characters
  - remove email addresses
  - remove URLs
  - normalize backslashes to slashes
  - split tokens on special characters
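
The preparation steps above can be approximated outside the Designer. The following is a minimal sketch using only the Python standard library, not the Designer's own implementation: the names `stratified_sample` and `preprocess`, the tiny `CONTRACTIONS` and `STOP_WORDS` tables and the regular expressions are all assumptions made for illustration. Lemmatization and sentence detection are left out because they need a language model.

```python
import random
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lookup tables (the Designer uses its own).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "to", "at", "i", "am"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample `fraction` of the rows inside each class, so class ratios are kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


def preprocess(text):
    """Apply a subset of the cleaning steps listed above."""
    text = text.lower()                            # normalize case
    text = URL_RE.sub(" ", text)                   # remove URLs
    text = EMAIL_RE.sub(" ", text)                 # remove email addresses
    text = text.replace("\\", "/")                 # backslashes to slashes
    for short, full in CONTRACTIONS.items():       # expand verb contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+", " ", text)               # remove numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # runs of 3+ chars become 2
    text = re.sub(r"[^a-z\s]", " | ", text)        # special characters to "|"
    return " ".join(t for t in text.split() if t not in STOP_WORDS)


print(preprocess("I can't wait!!! See http://example.com at 10"))
# cannot wait | | see
```

Calling `stratified_sample(rows, label_of, 0.02)` would reproduce the 2% sampling, with `label_of` returning the target variable of a row.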


## Text vectorization

We need to represent the text as a vector of numbers.

### Feature Hashing

In this version, we will use the [Feature Hashing](https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/feature-hashing) module to extract features from the text, with the following parameters :
- Hashing bitsize : 10 => 2^10 = 1024 features
- N-grams : 2 => tokens are single words and pairs of consecutive words (unigrams and bigrams)
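
To make the two parameters concrete, here is a minimal sketch of the hashing trick in plain Python. It assumes that an N-grams value of 2 means "n-grams up to length 2", and it uses MD5 purely as a stand-in hash: the Designer component relies on its own hash function, so bucket indices will not match. The helper names `ngrams` and `hash_features` are hypothetical.

```python
import hashlib


def ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def hash_features(text, bits=10, max_n=2):
    """Map a text to a fixed-size count vector of 2**bits buckets."""
    size = 1 << bits                      # bitsize 10 -> 1024 features
    vector = [0] * size
    for gram in ngrams(text.split(), max_n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % size
        vector[index] += 1                # colliding n-grams share a bucket
    return vector


features = hash_features("great flight today")
print(len(features))   # 1024
print(sum(features))   # 5 (3 unigrams + 2 bigrams)
```

No vocabulary is stored: the vector size is fixed in advance, at the cost of possible collisions between unrelated n-grams.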

### N-Gram Features

In this version, we will use the [Extract N-Gram Features from Text](https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/extract-n-gram-features-from-text) module to extract features from the text, with the following parameters :
- Hashing bitsize : 10 => 2^10 = 1024 features
- N-grams : 2 => tokens are single words and pairs of consecutive words (unigrams and bigrams)
- Weighting function : TF-IDF Weight => reflects the importance of a term in a specific document relative to its frequency in the whole corpus
- Maximum word length : 25 => ignore words longer than 25 characters
- Minimum n-gram document absolute frequency : 5 => ignore rare n-grams that appear in fewer than 5 documents
- Maximum n-gram document ratio : 1 => do not exclude very frequent tokens
- Normalize n-gram feature vectors : True => normalize the vectors to unit length

This creates a vocabulary that is specific to our training data. The same vocabulary is reused to vectorize the test data when testing our model.
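
As an illustration of how such a vocabulary could be fitted on the training data and then reused, here is a minimal sketch in plain Python. It assumes the TF-IDF weight is the term frequency multiplied by `log(N / document_frequency)`; the function names are hypothetical and the word-length filter is omitted.

```python
import math
from collections import Counter


def extract_ngrams(text, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fit_vocabulary(docs, min_df=5, max_df_ratio=1.0):
    """Learn {n-gram: IDF} from training documents only."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(extract_ngrams(doc)))
    n_docs = len(docs)
    return {gram: math.log(n_docs / df)
            for gram, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio}


def transform(doc, idf):
    """Vectorize any document with the fitted vocabulary, then L2-normalize."""
    counts = Counter(g for g in extract_ngrams(doc) if g in idf)
    weights = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else weights


train = ["good flight", "good flight", "bad flight", "bad flight"]
idf = fit_vocabulary(train, min_df=2)      # small threshold for the toy corpus
vector = transform("good flight delayed", idf)
print("delayed" in vector)                 # False: unseen during training
```

An n-gram that occurs in every training document, such as "flight" here, receives an IDF of zero under this formula, so it stays in the vocabulary but contributes nothing to the vector.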

## Model training

We train a [Two-Class Logistic Regression](https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/two-class-logistic-regression) model with the following parameters :
- Optimization tolerance : 1e-7
- L2 regularization weight : 1
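
To show what these two parameters control, here is a minimal sketch of L2-regularized logistic regression in plain Python. It uses plain gradient descent as a stand-in for whatever optimizer the Designer component runs, and the way the penalty is scaled by the number of samples is an assumption. The function name, learning rate and iteration cap are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, l2=1.0, tolerance=1e-7,
                              learning_rate=0.1, max_iter=10000):
    """Return (weights, bias) fitted by gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    previous_loss = float("inf")
    for _ in range(max_iter):
        preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
        clipped = [min(max(p, 1e-12), 1 - 1e-12) for p in preds]
        log_loss = -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                        for yi, p in zip(y, clipped)) / n
        loss = log_loss + 0.5 * l2 * sum(wj * wj for wj in w) / n
        # Optimization tolerance: stop once the loss barely improves.
        if abs(previous_loss - loss) < tolerance:
            break
        previous_loss = loss
        grad_w = [sum((p - yi) * x[j] for p, yi, x in zip(preds, y, X)) / n
                  + l2 * w[j] / n for j in range(d)]
        grad_b = sum(p - yi for p, yi in zip(preds, y)) / n
        w = [wj - learning_rate * g for wj, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y, l2=1.0, tolerance=1e-7)
print(sigmoid(weights[0] * 3.0 + bias) > 0.5)   # True
```

A larger L2 weight pulls the coefficients towards zero, which trades some fit on the training data for less overfitting. A smaller tolerance lets the optimizer run longer before stopping.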


## Results

The test dataset goes through the same text pre-processing and vectorization steps as the training dataset, before being used to test the model.

| Model | Confusion Matrix | AP | Precision Recall Curve | ROC AUC | ROC Curve |
|-------|------------------|----|------------------------|---------|-----------|
| Feature Hashing | ![Confusion Matrix](img/azureml_designer_feature_hashing_confusion_matrix.png) | 0.663 | ![Precision Recall Curve](img/azureml_designer_feature_hashing_precision_recall_curve.png) | 0.726 | ![ROC Curve](img/azureml_designer_feature_hashing_ROC_curve.png) |
| N-Gram Features | ![Confusion Matrix](img/azureml_designer_n-gram_confusion_matrix.png)          | 0.723 | ![Precision Recall Curve](img/azureml_designer_n-gram_precision_recall_curve.png)          | 0.811 | ![ROC Curve](img/azureml_designer_n-gram_ROC_curve.png)          |

We can see that the N-Gram Features model performs better than the Feature Hashing model.

The performance of the N-Gram Features model on the test dataset is close to our baseline model on Average Precision and better on ROC AUC:
- Average Precision = 0.723 (baseline = 0.73, -1%)
- ROC AUC = 0.811 (baseline = 0.74, +9.6%)

Unlike our baseline model, this model is fairly balanced, with only a slight bias towards the _POSITIVE_ class: it predicted 6.8% more _POSITIVE_ (3305) than _NEGATIVE_ (3095) messages, compared with 35% for the baseline (-81%).
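
The percentages quoted in this section are relative changes. As a quick check, they can be recomputed from the figures above with a small helper (`relative_change` is a hypothetical name):

```python
def relative_change(value, baseline):
    """Relative difference between value and baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


ap_change = relative_change(0.723, 0.73)          # about -1%
auc_change = relative_change(0.811, 0.74)         # about +9.6%

# Excess of POSITIVE over NEGATIVE predictions, then compared with the
# 35% excess reported for the baseline model.
positive_excess = relative_change(3305, 3095)     # about 6.8%
bias_change = relative_change(positive_excess, 35.0)   # about -81%

print(f"AP: {ap_change:+.1f}%  ROC AUC: {auc_change:+.1f}%")
print(f"POSITIVE excess: {positive_excess:.1f}%  vs baseline: {bias_change:+.0f}%")
```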
