Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Classifying Commercial Blocks with Microsoft Azure
_**Comparing Automated Machine Learning with three standard Scikit Learn Models.**_

---

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
    1. [Accessing the Azure Workspace](#Accessing-the-Azure-Workspace)
    1. [Configuring the Azure Workspace](#Configuring-the-Azure-Workspace)
    1. [Importing Libraries](#Importing-Libraries)
    1. [Creating an Experiment](#Creating-an-Experiment)
1. [Data](#Data)
1. [Classifying with Scikit Learn](#Classifying-with-Scikit-Learn)
1. [Classifying with Automated Machine Learning](#Classifying-with-Automated-Machine-Learning)
1. [Results](#Results)
1. [Finding the Best Classification Model](#Finding-the-Best-Classification-Model)

---

## Introduction

**The Task:** Classify commercial blocks from TV news segments (+1 commercial, -1 Non-commercial).

This notebook trains a classification model on broadcast data to decide whether a given television segment is a commercial. The dataset comes from the UCI Machine Learning Repository and has over 120,000 instances and 4125 features. For more information about the features in this dataset, check out [the UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/tv+news+channel+commercial+detection+dataset).

**The Method:** This experiment will compare training a classification model with traditional machine learning models (KNN, Random Forest, Neural Networks) to training it with Microsoft Azure Automated Machine Learning. 

<br>
<img src="https://sportsradiopd.com/wp-content/uploads/2015/11/commercial-e1447279275378.jpg" style="width:500px;">

---

## Setup

This experiment uses [Microsoft Azure Machine Learning Service](https://docs.microsoft.com/en-us/azure/machine-learning/service/) to implement a machine learning classification model. For more examples of Microsoft Machine Learning Service, check out the [Azure Machine Learning Notebooks GitHub Repo](https://github.com/Azure/MachineLearningNotebooks).

### Accessing the Azure Workspace
To configure a Microsoft Azure workspace, you must [set up an Azure subscription](https://azure.microsoft.com/en-us/free/) and manage the subscription from the [Azure portal](https://portal.azure.com/). Once your workspace is configured in the Azure portal, your machine learning service workspace should look like the following:


<img src="Images/Azure-Configuration.png">

### Configuring the Azure Workspace
Before configuring your workspace, be sure to have the following installed:
```
$ conda install -y matplotlib tqdm scikit-learn
$ pip install azureml-sdk[notebooks,automl]
```
Then, add a `config.json` file in a directory two levels above anything that is being pushed to GitHub. **Never push your Subscription ID to GitHub**. `Workspace.from_config()` will find the config file even if it is not in the current directory, because it also searches the parent directories.

`config.json` should look like the following (see above figure for where to find these inputs):
```
{
    "subscription_id": "<my-subscription-id>",
    "resource_group": "<my-resource-group>",
    "workspace_name": "<my-workspace-name>"
}
```

In [1]:
# Now you're ready to load your workspace...
from azureml.core import Workspace

ws = Workspace.from_config()

Found the config file in: C:\Users\house\Documents\GitHub\config.json
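To make the lookup less mysterious, here is a minimal sketch of an upward search for `config.json`, followed by a check that the three workspace fields are present. The helper names `find_config` and `load_config` are hypothetical. This is a simplified illustration of the idea, not the SDK's actual implementation.

```python
import json
from pathlib import Path

# The three fields shown in the config.json example above.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def find_config(start_dir, filename="config.json"):
    """Walk from start_dir up to the filesystem root and return the first match.

    Hypothetical helper: a simplified stand-in for the search, not the SDK's code.
    """
    start = Path(start_dir).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


def load_config(path):
    """Read the JSON file and check that the three workspace fields are present."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config.json is missing: {sorted(missing)}")
    return config
```

Calling `find_config(".")` from inside a repository folder would return the path of a `config.json` stored one or more levels above it, which is why the file can live outside anything you push.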


### Importing Libraries

In [2]:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT AZURE LIBRARIES
# Azure Notebook Libraries
import azureml.core
import logging
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

### Creating an Experiment
This command will create a new experiment on Azure's Machine Learning Services Workspace. Experiments track important metrics of each model run, including training time (in seconds) and the ROC curve.

In [3]:
# Choose a name for the experiment and specify the project folder.
experiment_name = 'simple_classification'
project_folder = './simple_classification_project'

experiment = Experiment(ws, experiment_name)

<img src="Images/Azure-Experiments.png">

---

## Data
Data is from the UCI Machine Learning Repository with over 120,000 instances and 4125 features to [classify commercial blocks](http://archive.ics.uci.edu/ml/datasets/tv+news+channel+commercial+detection+dataset).

Data Citation:
Dr. Prithwijit Guha , Raghvendra D. Kannao and Ravishankar Soni 
Multimedia Analytics Lab, 
Department of Electrical and Electronics Engineering, 
Indian Institute of Technology, Guwahati, India 
rdkannao '@' gmail.com , prithwijit.guha '@' gmail.com

### Import Data from Local Directory
To import this data, download the datasets into `Data/` on your local machine and pass the file path to `get_data()`.

In [4]:
# Data Upload Functions
from sklearn.datasets import load_svmlight_file

def get_data(filepath):
    data = load_svmlight_file(filepath)
    return data[0], data[1]

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT DATA
print("\nImporting Data...")

X_train, y_train = get_data("Data/train_data.txt")
X_test, y_test = get_data("Data/test_data.txt")

X_train = X_train.toarray() # convert sparse matrix to array
X_test = X_test.toarray()
print("Data imported.")


Importing Data...
Data imported.
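The data files are read with `load_svmlight_file`, which expects the svmlight text format: each line holds a label followed by sparse `index:value` pairs. The sketch below parses one such line in plain Python to show the layout. The function name `parse_svmlight_line` and the sample values are made up for illustration; the real loader returns a sparse matrix instead of a dictionary.

```python
def parse_svmlight_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split("#", 1)[0].split()  # drop a trailing comment, if any
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Made-up example line: label +1 (commercial) and three non-zero features.
sample = "1 1:123 2:1.5 4125:0.25"
label, features = parse_svmlight_line(sample)
print(label, features)
```

Features that do not appear on a line are implicitly zero, which is why the loader returns a sparse matrix and why the notebook calls `.toarray()` afterwards.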


---

## Classifying with Scikit Learn

### Managing Dependencies
This is a local run of the classification model, so you must ensure that all the packages the training script needs are available in the Python environment it runs in.

In [5]:
from azureml.core.runconfig import RunConfiguration

# Edit a run configuration property on the fly: we manage the Python environment ourselves.
run_config_user_managed = RunConfiguration()

run_config_user_managed.environment.python.user_managed_dependencies = True

### Read Local Training Script

In [18]:
with open('./train.py', 'r') as f:
    print(f.read())  # print the script; a bare f.read() inside the block displays nothing

### Train with `train.py` Script and Log Metrics
With Azure, you can run a local `train.py` script. To track the accuracy in your local `train.py` script, you can add logging. With logging, Azure keeps track of any variable changes during training and plots the variables for you in the Azure Portal.

To add logging, add the following lines to your `train.py` code:
```
from azureml.core.run import Run

# Your training script goes here...

# Run this script to initialize logger in your experiment context
run_logger = Run.get_context()

# Log any variable that you would like to track!
run_logger.log(name='Accuracy', value=accuracy) 
run_logger.log(name='Training_Time', value=train_t) 
```
[Navigate here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-track-experiments#viewing-charts-in-run-details) for more information about logging variables.
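The contents of `train.py` are not reproduced in this notebook, so the following is only a sketch of how such a script could time each model, score it on the test set, and log the results. The function name `train_and_log` is hypothetical. It accepts any object with `fit` and `predict`, and a `log` callable that defaults to `print` so the sketch runs without an Azure run context.

```python
import time


def train_and_log(models, X_train, y_train, X_test, y_test, log=print):
    """Fit each model, time the fit, score it on the test set, and log both numbers.

    `models` maps a display name to any object with fit(X, y) and predict(X).
    `log` is called as log(name, value).
    """
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_t = time.perf_counter() - start  # training time in seconds

        predictions = model.predict(X_test)
        correct = sum(1 for p, t in zip(predictions, y_test) if p == t)
        accuracy = correct / len(y_test)

        log("Model", name)
        log("Accuracy", accuracy)
        log("Training_Time", train_t)
        results[name] = (accuracy, train_t)
    return results
```

In a real `train.py`, you could pass `log=Run.get_context().log` and a dictionary of Scikit Learn estimators, for example `RandomForestClassifier()`, `MLPClassifier()`, and `KNeighborsClassifier()`. Those three are an assumption about which default models the script uses.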

In [19]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='./', script='train.py', run_config=run_config_user_managed)
run = experiment.submit(src)

In [20]:
run

Experiment,Id,Type,Status,Details Page,Docs Page
simple_classification,simple_classification_1555074315_b61d4ca2,azureml.scriptrun,Completed,Link to Azure Portal,Link to Documentation


### View your Evaluation Metrics in Azure
Azure automatically visualizes your logged metrics for your convenience. This experiment logs `Training_Time` and `Accuracy` for the default Random Forest, Neural Networks, and K Nearest Neighbor models in Scikit Learn.

**Click `Link to Azure Portal` above** to view your visualizations:

<img src="Images/Azure-Logging-1.png">

To get your evaluation metrics as a dictionary, use the following command:

In [44]:
run_metrics = run.get_metrics()
run_metrics

{'Model': ['Random Forest', 'Neural Networks', 'K Nearest Neighbor'],
 'Accuracy': [0.884, 0.785, 0.777],
 'Training_Time': [2.26, 5.59, 0.17]}

Using well-known models in Scikit Learn, we only achieve up to ~88% accuracy on the test set with Random Forest classification. Let's store the best accuracy and see if Automated Machine Learning can beat our best model.

In [63]:
best_manual_accuracy = float(max(run_metrics['Accuracy'])) * 100 
print("The best Accuracy we achieved with Scikit Learn:  %.2f%%" % best_manual_accuracy)

The best Accuracy we achieved with Scikit Learn:  88.40%
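Taking `max` over the accuracy list gives the score but not the model that produced it. As a small sketch, the parallel lists in the metrics dictionary can be zipped into rows so that the model name and training time travel with the accuracy. The dictionary below is copied from the output above so the snippet runs on its own.

```python
# Metrics copied from the run.get_metrics() output shown above.
run_metrics = {
    "Model": ["Random Forest", "Neural Networks", "K Nearest Neighbor"],
    "Accuracy": [0.884, 0.785, 0.777],
    "Training_Time": [2.26, 5.59, 0.17],
}

# Pair each model with its accuracy and training time, then pick the row
# with the highest accuracy.
rows = list(zip(run_metrics["Model"],
                run_metrics["Accuracy"],
                run_metrics["Training_Time"]))
best_name, best_accuracy, best_time = max(rows, key=lambda row: row[1])

print(f"{best_name}: {best_accuracy * 100:.2f}% accuracy, trained in {best_time:.2f}s")
```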


---

## Classifying with Automated Machine Learning
Microsoft Azure's [Automated Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train) functionality trains models for you to help find the best model for your machine learning problem. The following experiment compares the metrics I achieved with the simple `train.py` script to the automated machine learning metrics.

### Configure Automated ML for classification
Automated ML offers many different configurations to match your machine learning task. [This article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#select-your-experiment-type) describes all the possible configurations you can choose from. I decided to run 10 iterations, each evaluating one candidate pipeline with 3-fold cross-validation, to attempt to beat my initial 88% accuracy result.

In [25]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 60,
                             iterations = 10,
                             n_cross_validations = 3,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             path = project_folder)
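Setting `n_cross_validations = 3` means each candidate pipeline is scored on three train/validation splits of the training data. The sketch below shows the basic idea with contiguous, unshuffled folds. The function name `k_fold_indices` is hypothetical, and how Automated ML actually builds its folds (shuffling, stratification) is not shown in this notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Each sample is used for validation exactly once across the three folds.
for train_idx, val_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), len(val_idx))
```

The reported metric for a pipeline is then typically an aggregate over the folds, which makes it less sensitive to one lucky or unlucky split.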

### Run the Automated ML Experiment Locally

In [26]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_26005e4b-041d-4401-bfa6-ad42832bddad
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:00:19       0.9495    0.9495
         1   RobustScaler LightGBM                          100.0000    0:00:21       0.9719    0.9719
         2   RobustScaler LogisticRegression                100

In [31]:
local_run

Experiment,Id,Type,Status,Details Page,Docs Page
simple_classification,AutoML_26005e4b-041d-4401-bfa6-ad42832bddad,automl,Completed,Link to Azure Portal,Link to Documentation


---

## Results

### View your Machine Learning Performance in Azure
To view the `Weighted AUC` of your experiment as a visualization, **Click `Link to Azure Portal` above**.

<img src="Images/Azure-AutoML-1.png">

In [36]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

### Select the Best Auto ML Model
Next, retrieve the best classification model from the Automated Machine Learning iterations so it can be compared with the manual `train.py` script.

In [39]:
best_run, fitted_model = local_run.get_output()

---

## Finding the Best Classification Model
Let's compare the testing accuracy of the manual training script versus the automated machine learning script to find the best classification model.

In [65]:
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_pred = fitted_model.predict(X_test)
# Note: f1_score returns the F1 score of the positive class (+1), not plain accuracy.
aml_accuracy = f1_score(y_test, y_pred) * 100

print('Train.py Accuracy: %.2f%%' % best_manual_accuracy)
print('Automated Machine Learning Accuracy: %.2f%%' % aml_accuracy)

Train.py Accuracy: 88.40%
Automated Machine Learning Accuracy: 91.65%


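By using Automated Machine Learning, we increased our test score by 3.25 percentage points (from 88.40% to 91.65%)! Keep in mind that the `train.py` figure was logged as accuracy, while the Automated Machine Learning figure is computed with `f1_score`, so the two numbers are not strictly comparable.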
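Accuracy and F1 can differ noticeably on the same predictions, so it helps to see both side by side. The sketch below implements the two metrics in plain Python for the +1/-1 labels used here and evaluates them on a tiny made-up example. The function names and the toy labels are illustrative only and are not results from this experiment.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_positive(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 commercials (+1) and 5 non-commercials (-1).
toy_true = [1, 1, 1, -1, -1, -1, -1, -1]
toy_pred = [1, 1, -1, 1, -1, -1, -1, -1]

print("accuracy: %.4f" % accuracy(toy_true, toy_pred))
print("F1 (+1):  %.4f" % f1_positive(toy_true, toy_pred))
```

For a like-for-like comparison in this notebook, `accuracy_score(y_test, y_pred)` from `sklearn.metrics` could be computed for the Automated ML model alongside the F1 score.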


<img src="https://media.giphy.com/media/YJ5OlVLZ2QNl6/giphy.gif">