Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Classifying Commercial Blocks with Microsoft Azure
_**Comparing Automated Machine Learning with three standard Scikit Learn Models.**_

---

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
    1. [Accessing the Azure Workspace](#Accessing-the-Azure-Workspace)
    1. [Configuring the Azure Workspace](#Configuring-the-Azure-Workspace)
    1. [Importing Libraries](#Importing-Libraries)
    1. [Creating an Experiment](#Creating-an-Experiment)
1. [Data](#Data)
1. [Classifying with Scikit Learn](#Classifying-with-SciKit-Learn)
1. [Classifying with Automated Machine Learning](#Classifying-with-Automated-Machine-Learning)
1. [Results](#Results)
1. [Finding the Best Classification Model](#Finding-the-Best-Classification-Model)

---

## Introduction

**The Task:** Classify commercial blocks from TV news segments (+1 commercial, -1 Non-commercial).

This classification model uses broadcast data to predict whether a given television segment is a commercial. The dataset comes from the UCI Machine Learning Repository and has over 120,000 instances and 4125 features. For more information about the features in this dataset, check out [the UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/tv+news+channel+commercial+detection+dataset).

**The Method:** This experiment will compare training a classification model with traditional machine learning models (KNN, Random Forest, Neural Networks) to training it with Microsoft Azure Automated Machine Learning. 

<br>
<img src="https://sportsradiopd.com/wp-content/uploads/2015/11/commercial-e1447279275378.jpg" style="width:500px;">

---

## Setup

This experiment uses [Microsoft Azure Machine Learning Service](https://docs.microsoft.com/en-us/azure/machine-learning/service/) to implement a machine learning classification model. For more examples of Microsoft Machine Learning Service, check out the [Azure Machine Learning Notebooks GitHub Repo](https://github.com/Azure/MachineLearningNotebooks).

### Accessing the Azure Workspace
To configure a Microsoft Azure workspace, you must [set up an Azure subscription](https://azure.microsoft.com/en-us/free/) and manage the subscription from the [Azure portal](https://portal.azure.com/). Once your workspace is configured in the Azure portal, your machine learning service workspace should look like the following screenshot.

<img src="img/configuration.png">

Then, download the `config.json` file and save it in a directory two levels above anything that is pushed to GitHub. **Never push your config file to GitHub**. Your Azure notebook will find the config file even if it is not in the current directory.
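To make the "found even if it is not in the current directory" behavior concrete, here is a minimal sketch of an upward directory search. It is only an illustration: `find_config` is a hypothetical helper, the workspace name is a placeholder, and the SDK's actual lookup may check additional locations.

```python
import json
import tempfile
from pathlib import Path


def find_config(start_dir, filename="config.json"):
    """Hypothetical helper: walk from start_dir up to the root, return the first match."""
    current = Path(start_dir).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


# Demo layout: config.json sits two levels above the notebook directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    notebook_dir = root / "repo" / "notebooks"
    notebook_dir.mkdir(parents=True)
    # Placeholder content, not a real workspace configuration.
    (root / "config.json").write_text(json.dumps({"workspace_name": "demo-workspace"}))

    found = find_config(notebook_dir)
    found_relative = found.relative_to(root).as_posix()
    workspace_name = json.loads(found.read_text())["workspace_name"]

print(found_relative, workspace_name)
```

Keeping the file above the repository root means it can never be staged by accident, while code running inside the repository can still reach it.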

### Configuring the Azure Workspace
Before configuring your workspace, be sure to have the following installed:
```
$ conda install -y matplotlib tqdm scikit-learn
$ pip install azureml-sdk[notebooks,automl]
```

In [1]:
# Now you're ready to load your workspace...
from azureml.core import Workspace

ws = Workspace.from_config()

### Importing Libraries

In [2]:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT AZURE LIBRARIES
# Azure Notebook Libraries
import azureml.core
import logging
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

### Creating an Experiment
This command will create a new experiment on Azure's Machine Learning Services Workspace. Experiments track important metrics of each model run, including training time (in seconds) and the ROC curve.

In [3]:
# Choose a name for the experiment and specify the project folder.
experiment_name = 'commercial_block_classification'
project_folder = './commercial_block_classification'

experiment = Experiment(ws, experiment_name)

### Viewing your Experiment
Once you create an experiment, you can view the experiment and all of the metrics on the Azure [Machine Learning Services workspace](https://portal.azure.com/).


<img src="img/experiments.png">

---

## Data
Data is from the UCI Machine Learning Repository with over 120,000 instances and 4125 features to [classify commercial blocks](http://archive.ics.uci.edu/ml/datasets/tv+news+channel+commercial+detection+dataset).

Data Citation:
Dr. Prithwijit Guha , Raghvendra D. Kannao and Ravishankar Soni 
Multimedia Analytics Lab, 
Department of Electrical and Electronics Engineering, 
Indian Institute of Technology, Guwahati, India 
rdkannao '@' gmail.com , prithwijit.guha '@' gmail.com

### Import Data from Local Directory
To import this data, make sure to download the datasets in `data/` on your local machine and point to your filepath in `get_data()`.

In [4]:
# Data Upload Functions
from sklearn.datasets import load_svmlight_file

def get_data(filepath):
    data = load_svmlight_file(filepath)
    return data[0], data[1]

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IMPORT DATA
print("\nImporting Data...")

X_train, y_train = get_data("data/train_data.txt")
X_test, y_test = get_data("data/test_data.txt")

X_train = X_train.toarray()  # convert sparse matrix to array
X_test = X_test.toarray()
print("Data imported.")

print("\nTraining data has %i rows" % len(X_train))
print("Testing data has %i rows" % len(X_test))



Importing Data...
Data imported.

Training data has 13500 rows
Testing data has 5750 rows
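`load_svmlight_file` reads the svmlight/libsvm text format, where each line holds a label followed by `index:value` pairs for the non-zero features only. That is why the loader returns a sparse matrix and the notebook calls `.toarray()` afterwards. The sketch below is a simplified pure-Python parser for illustration, not scikit-learn's implementation; the sample lines are made up, and 1-based feature indices are an assumption.

```python
def parse_svmlight_line(line, n_features):
    """Turn one '<label> <index>:<value> ...' line into (dense_row, label)."""
    label_text, *pairs = line.split()
    row = [0.0] * n_features  # features not listed on the line stay zero
    for pair in pairs:
        index_text, value_text = pair.split(":")
        row[int(index_text) - 1] = float(value_text)  # assumes 1-based indices
    return row, float(label_text)


# Made-up lines, not taken from the real dataset.
sample_lines = ["+1 1:0.5 4:2.0", "-1 2:1.5"]
rows, labels = zip(*(parse_svmlight_line(line, n_features=4) for line in sample_lines))

print(rows)
print(labels)
```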


---

## Classifying with SciKit Learn

### Managing Dependencies
This is a local run of the classification model, so you must make sure that all the packages the training script needs are available in the Python environment where it runs.

In [5]:
from azureml.core.runconfig import RunConfiguration

# Edit a run configuration property on the fly.
run_config_user_managed = RunConfiguration()

run_config_user_managed.environment.python.user_managed_dependencies = True
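With user-managed dependencies, nothing installs missing packages for you, so it can help to check the environment before submitting a run. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of required names is an assumption based on the imports in `train.py`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names used by train.py (assumed from its import statements).
required = ["sklearn", "numpy", "azureml"]
missing = missing_packages(required)

if missing:
    print("Install before submitting the run:", ", ".join(missing))
else:
    print("All required packages are importable.")
```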

### Train with `train.py` Script and Log Metrics
With Azure, you can train with almost any local `train.py` script! The training script in this scenario implements SciKit Learn models with the standard SciKit Learn libraries. It was taken from a homework assignment that did not originally use Microsoft Azure. Azure Machine Learning Services can run standalone `train.py` scripts in an experiment without any alterations. However, logging calls were added to the original script to track model accuracy and other important metrics. To add logging, the following lines were added to the original `train.py` script:

**Added to the library imports:**
```
from azureml.core.run import Run
```
**Added after the model was trained** 

This code initializes the logger in your experiment context
```
run_logger = Run.get_context()
```

**Logged multiple variables to track!** 
```
run_logger.log(name='Model', value=model_name)
run_logger.log(name='Accuracy', value=accuracy)
run_logger.log(name='Training_Time', value=train_t)
```


For more documentation about logging variables, [navigate here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-track-experiments#viewing-charts-in-run-details).
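Because the training script logs the same metric names once per model, each name ends up holding a list of values rather than a single number, which is the shape that `run.get_metrics()` returns later in this notebook. The sketch below imitates that behavior with a hypothetical stand-in class (`FakeRun` is not part of the Azure SDK) and placeholder values, so it runs without a workspace.

```python
from collections import defaultdict


class FakeRun:
    """Hypothetical stand-in for the SDK run object; it only collects values."""

    def __init__(self):
        self._metrics = defaultdict(list)

    def log(self, name, value):
        # Repeated calls with the same name append to that metric's list.
        self._metrics[name].append(value)

    def get_metrics(self):
        return dict(self._metrics)


run_logger = FakeRun()
# Placeholder values: one pair of log calls per model, as in the training loop.
for model_name, accuracy in [("model_a", 0.9), ("model_b", 0.8)]:
    run_logger.log(name="Model", value=model_name)
    run_logger.log(name="Accuracy", value=accuracy)

metrics = run_logger.get_metrics()
print(metrics)
```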

### Read Local Training Script
This local training script (with added logging capabilities) iterates through three Scikit Learn models, `RandomForestClassifier`, `MLPClassifier`, `KNeighborsClassifier`, with default parameters and logs the test accuracy of each model.

In [6]:
with open('train.py', 'r') as f:
    print(f.read())

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Manual train.py for Classification of Commericial Blocks
# ~ By: Katie House
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# IMPORT LIBRARIES
from sklearn.datasets import load_svmlight_file
import numpy as np
import time
from sklearn.metrics import f1_score
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# IMPORT AZURE LIBRARY
from azureml.core.run import Run


# DEFINE FUNCTIONS
def get_data(filepath):
    data = load_svmlight_file(filepath)
    return data[0], data[1]


def Classifier_Test_Train(model, model_name):
    # Training Model...
    start_time = time.time()  # track train time
    model.fit(X_train, y_train)
    train_t = round(time.time() - start_time, 2)

    # Testing Model...
    predictions = model.predict(X_test)
    accuracy = round(f1_score(y_test, predictions), 3)

    # AZURE LOGGING VA

### Submit a Local Run of the Training Script
Passing the `script='train.py'` argument to `ScriptRunConfig()` runs your local training script as part of the `commercial_block_classification` experiment.

In [7]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='./', script='train.py', run_config=run_config_user_managed)
run = experiment.submit(src)

### View your Evaluation Metrics in Azure
Azure automatically visualizes your logged metrics for your convenience. This experiment logs `Training_Time` and `Accuracy` for the default Random Forest, Neural Network, and K Nearest Neighbor models in SciKit Learn.

Evaluate `run` by itself in a cell to get a link to the run in the Azure Portal. **Click `Link to Azure Portal` below** to view your visualizations.

In [None]:
run

<img src="img/azure-link.png">

In the "Experiments" tab you can view your logging metrics as a visualization!

<img src="img/logging.png">

To get your evaluation metrics as a dictionary, use the following command:

In [9]:
run_metrics = run.get_metrics()
run_metrics

{'Model': ['Random Forest', 'Neural Networks', 'K Nearest Neighbor'],
 'Accuracy': [0.879, 0.822, 0.777],
 'Training_Time': [1.2, 6.71, 0.2]}

Using well-known models in Scikit Learn, we only achieve up to ~88% accuracy on the test set with Random Forest classification. Let's store the best accuracy and see if Automated Machine Learning can beat our best model.

In [10]:
best_manual_accuracy = float(max(run_metrics['Accuracy'])) * 100 
print("The best accuracy we achieved with Scikit Learn:  %.2f%%" % best_manual_accuracy)

The best accuracy we achieved with Scikit Learn:  87.90%
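`max()` on the `Accuracy` list gives the best score but not the model that produced it. Since the three lists in the metrics dictionary are aligned by position, the winning model and its training time can be recovered by zipping them together. This is a small sketch that copies the dictionary from the output above so it runs on its own.

```python
# Metrics copied from the run.get_metrics() output shown above.
run_metrics = {
    "Model": ["Random Forest", "Neural Networks", "K Nearest Neighbor"],
    "Accuracy": [0.879, 0.822, 0.777],
    "Training_Time": [1.2, 6.71, 0.2],
}

# Pair each model with its score and training time, then pick the top score.
results = list(zip(run_metrics["Model"],
                   run_metrics["Accuracy"],
                   run_metrics["Training_Time"]))
best_model, best_score, best_time = max(results, key=lambda item: item[1])

print(f"{best_model}: score {best_score:.3f}, trained in {best_time} s")
```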


---

## Classifying with Automated Machine Learning
Microsoft Azure's [Automated Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train) functionality trains models for you to help find the best model for your machine learning problem. The following experiment will compare the metrics I achieved with the simple `train.py` script with the automated machine learning metrics.

### Configure Automated ML for Classification
Automated ML offers many different configurations to match your machine learning task. [This article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#select-your-experiment-type) describes all the possible configurations you can choose from. I decided to run 10 iterations (one candidate pipeline each) with 3-fold cross-validation to try to beat my initial 88% accuracy result.

In [12]:
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'AUC_weighted',
                             iteration_timeout_minutes = 60,
                             iterations = 10,
                             n_cross_validations = 3,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             path = project_folder)
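The `n_cross_validations = 3` setting asks for each candidate pipeline to be scored on three train/validation splits of the training data instead of a single one. The sketch below shows the general idea of k-fold splitting with contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper, and how Automated ML actually builds its splits (for example, whether it shuffles or stratifies) is not shown here.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Ten toy samples split three ways: every sample is validated exactly once.
folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, validation in folds:
    print(f"train on {len(train)} rows, validate on {validation}")
```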

### Run the AML Experiment Locally

In [13]:
local_run = experiment.submit(automl_config, show_output = True)

Running on local machine
Parent Run ID: AutoML_8f520f62-3996-4f07-b9ec-363dc3fa34aa
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      0:00:19       0.9373    0.9373
         1   StandardScalerWrapper SGD                      0:00:18       0.9340    0.9373
         2   MinMaxScaler LightG

---

## Results
Evaluate `local_run` by itself in a cell to get a link to the run in the Azure Portal. The experiment will contain visualizations of each model's performance across the AutoML iterations.

In [None]:
local_run

<img src="img/azure-link-3.png">

### View your Machine Learning Performance in Azure
AzureML widgets automatically generate 20 interactive visualizations of each of the model iterations. View these visualizations by running the following code:

In [15]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

<img src="img/auc.png">
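The AutoML run was configured with `AUC_weighted` as its primary metric. As a rough intuition, binary AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes plain binary AUC that way on made-up scores; `binary_auc` is a hypothetical helper, and the per-class weighting implied by `AUC_weighted` is not reproduced.

```python
def binary_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores (higher means "more likely a commercial"); not real model output.
y_true = [1, 1, 1, -1, -1]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc_value = binary_auc(y_true, scores)

print(round(auc_value, 3))
```

A value of 1.0 would mean every commercial is ranked above every non-commercial, and 0.5 corresponds to random ranking.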

### Select the Best Auto ML Model
Next, select the best classification model from the Automated Machine Learning iterations to compare with the manual `train.py` script.

In [17]:
best_run, fitted_model = local_run.get_output()
print(fitted_model)

Pipeline(memory=None,
     steps=[('stackensembleclassifier', StackEnsembleClassifier(base_learners=[('2', PipelineWithYTransformations(Pipeline={'memory': None, 'steps': [('MinMaxScaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('LightGBMClassifier', <automl.client.core.common.model_wrappers.LightGBMClassifier object...olver='warn',
          tol=0.0001, verbose=0, warm_start=False),
            training_cv_folds=5))])
Y_transformer(['LabelEncoder', LabelEncoder()])


---

## Finding the Best Classification Model
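Let's compare the test score of the manual training script with that of the automated machine learning run to find the best classification model. Note that the "accuracy" reported here is computed with `f1_score`, the same metric that `train.py` logs as `Accuracy`.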

In [18]:
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_pred = fitted_model.predict(X_test)
aml_accuracy = f1_score(y_test, y_pred) * 100

print('Train.py Accuracy: %.2f%%' % best_manual_accuracy)
print('Automated Machine Learning Accuracy: %.2f%%' % aml_accuracy)

Train.py Accuracy: 87.90%
Automated Machine Learning Accuracy: 88.13%
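Both numbers above come from `f1_score`, which is not the same quantity as plain classification accuracy. F1 looks only at the positive class (here, commercials) and balances precision and recall, so the two metrics can differ noticeably. The following sketch computes both by hand on made-up labels to show the difference; `accuracy_and_f1` is a hypothetical helper, not part of the notebook.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute plain accuracy and the F1 score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Made-up labels using the +1 / -1 convention; not real model output.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)

print(f"accuracy={accuracy:.3f}  f1={f1:.3f}")
```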


By using Automated Machine Learning, we increased our test score from 87.90% to 88.13%, a gain of about 0.2 percentage points!


<img src="https://media.giphy.com/media/YJ5OlVLZ2QNl6/giphy.gif">