# Train a deep learning model
In this notebook you will train a deep learning model to classify the descriptions of car components as compliant or non-compliant. You will train the model on the Azure Databricks cluster and use MLflow integration with Azure Machine Learning to track and log experiment metrics and artifacts in the Azure Machine Learning workspace.

Each document in the supplied training data set is a short text description of the component as documented by an authorized technician. 
The contents include:
- Manufacture year of the component (e.g. 1985, 2010)
- Condition of the component (poor, fair, good, new)
- Materials used in the component (plastic, carbon fiber, steel, iron)

The compliance regulations dictate:
*Any component manufactured before 1995 or in fair or poor condition or made with plastic or iron is out of compliance.*

For example:
* Manufactured in 1985 made of steel in fair condition -> **Non-compliant**
* Good condition carbon fiber component manufactured in 2010 -> **Compliant**
* Steel component manufactured in 1995 in fair condition -> **Non-compliant**

The labels present in this data are 0 for compliant, 1 for non-compliant.
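
The rule and the label encoding above can be written as a small function. This is a minimal sketch for illustration only: the name `label_component` and its structured arguments are assumptions, since the training data stores each description as free text rather than as separate fields.

```python
# Hypothetical helper: applies the compliance rule to structured fields.
NON_COMPLIANT_CONDITIONS = {"poor", "fair"}
NON_COMPLIANT_MATERIALS = {"plastic", "iron"}

def label_component(year, condition, material):
    """Return 1 (non-compliant) or 0 (compliant), matching the data set's labels."""
    non_compliant = (
        year < 1995
        or condition in NON_COMPLIANT_CONDITIONS
        or material in NON_COMPLIANT_MATERIALS
    )
    return 1 if non_compliant else 0

# The three examples listed above
print(label_component(1985, "fair", "steel"))         # 1
print(label_component(2010, "good", "carbon fiber"))  # 0
print(label_component(1995, "fair", "steel"))         # 1
```

Note that a component manufactured in 1995 is not excluded by the year alone; the third example is non-compliant because of its fair condition.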

The challenge with classifying text data is that deep learning models only understand vectors (e.g., arrays of numbers) and not text. To encode the car component descriptions as vectors, we use an algorithm from Stanford called [GloVe (Global Vectors for Word Representation)](https://nlp.stanford.edu/projects/glove/). GloVe provides pre-trained word vectors that we can use to convert a string of text into a sequence of vectors.
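
To make the idea concrete, the sketch below looks up each word of a description in a tiny, made-up table of 3-dimensional vectors. The vector values and the names `toy_vectors` and `encode` are invented for illustration; the GloVe file used in this notebook provides 100-dimensional vectors for a far larger vocabulary.

```python
# Toy stand-in for pre-trained word vectors (values are made up).
toy_vectors = {
    "steel": [0.1, 0.7, -0.2],
    "fair": [0.4, -0.3, 0.5],
    "condition": [0.0, 0.2, 0.9],
}

def encode(text, vectors, dim=3):
    """Map each lowercased word to its vector; unknown words become zeros."""
    return [vectors.get(word, [0.0] * dim) for word in text.lower().split()]

encoded = encode("Steel in fair condition", toy_vectors)
for row in encoded:
    print(row)
```

The word "in" is missing from the toy table, so it is encoded as a zero vector. The notebook treats unknown words the same way when it builds the embedding matrix.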

The model is a Long Short-Term Memory (LSTM) recurrent neural network, a type of deep neural network (DNN), built with the Keras library running on TensorFlow.

In [2]:
# Install the necessary libraries directly into the notebook context

dbutils.library.installPyPI('tensorflow')
dbutils.library.installPyPI('keras')
dbutils.library.installPyPI('mlflow')
dbutils.library.installPyPI('azureml-mlflow')
dbutils.library.restartPython()
dbutils.library.list()

## Connect to the Azure Machine Learning Workspace

In [4]:
# Import required packages

import azureml
from azureml.core import Run
from azureml.core import Workspace
from azureml.core.experiment import Experiment

import mlflow
import mlflow.keras

print("azureml SDK version:", azureml.core.VERSION)
print("mlflow version:", mlflow.__version__)

### Configure access to the Azure Machine Learning resources
To begin, you will need to provide the following information about your Azure Subscription.

**If you are using your own Azure subscription, please provide names for subscription_id, resource_group, workspace_name and workspace_region to use.** Note that the workspace needs to be of type [Machine Learning Workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace).

In the following cell, be sure to set the values for `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` as directed by the comments (*these values can be acquired from the Azure Portal*).

To get these values, do the following:
1. Navigate to the Azure Portal and log in with the credentials provided.
2. From the left-hand menu, under Favorites, select `Resource Groups`.
3. In the list, select the resource group with the name similar to `MCW-AI-Lab`.
4. From the Overview tab, capture the desired values.

In addition to these, be sure to set `experiment_name` to the name of the experiment you used when training the model with Automated Machine Learning.

Execute the following cell by selecting the `>|Run` button in the command bar above.

In [6]:
#Provide the Subscription ID of your existing Azure subscription
subscription_id = "281b526e-0f57-4142-ae7c-b89b634fd26e" # <- subscription you are using for this hands-on lab

#Provide values for the existing Resource Group 
resource_group = "MCW-AI-Lab"

#Provide the Workspace Name and Azure Region of the Azure Machine Learning Workspace
workspace_name = "AML-workspace-181384"
workspace_region = "westus2" # <- region of your resource group

#Provide the name of the Automated ML experiment you executed previously
experiment_name = "Battery-Cycles"

Run the following cell to connect to your **Azure Machine Learning Workspace**.

**Important Note**: You will be prompted to log in by the text that is output below the cell. Be sure to navigate to the URL displayed and enter the code that is provided. Once you have entered the code, return to this notebook and wait for the output to read `Workspace Provisioning complete`.

In [8]:
# By using the exist_ok param, if the workspace already exists we get a reference to the existing workspace
ws = Workspace.create(
    name = "AML-workspace-181384",
    subscription_id = "281b526e-0f57-4142-ae7c-b89b634fd26e",
    resource_group = "ODL-ml-181384", 
    location = "westus2",
    exist_ok = True)

print("Workspace Provisioning complete")

# Creating a Deep Learning model from text data
The following cells will guide you through the process of preparing the data and using it to train a model.

In [10]:
# Import required packages

import os
import random

import numpy as np
import pandas as pd

import tensorflow
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Embedding, Flatten, Dense, LSTM

import matplotlib
from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow

print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("matplotlib version:", matplotlib.__version__)
print("keras version: {} tensorflow version: {}".format(keras.__version__, tensorflow.__version__))

### Download the GloVe embeddings to your environment.
Run the following cell to download the embeddings to the current working directory of your environment. Note: this may take a **few minutes** as the GloVe file is about 340 MB.

In [12]:
def download_glove():
    print("Downloading GloVe embeddings...")
    import urllib.request
    glove_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                 'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/'
                 'quickstarts/connected-car-data/glove.6B.100d.txt')
    urllib.request.urlretrieve(glove_url, 'glove.6B.100d.txt')
    print("Download complete.")

download_glove()

Next, load the data into a Pandas DataFrame and create the training, validation and test data sets by running the following cell.

In [14]:
# Load the car components labeled data
print("Loading car components data...")
data_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
            'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/'
            'quickstarts/connected-car-data/connected-car_components.csv')
car_components_df = pd.read_csv(data_url)
components = car_components_df["text"].tolist()
labels = car_components_df["label"].tolist()
print("Loading car components data completed.")

# split data 60% for training, 20% for validation, 20% for test
print("Splitting data...")
train, validate, test = np.split(car_components_df.sample(frac=1), [int(.6*len(car_components_df)), int(.8*len(car_components_df))])
print(train.shape)
print(test.shape)
print(validate.shape)

In the following cell, you use the Keras `Tokenizer` to "learn" a vocabulary from the entire car components text. The data (both the text and the compliance labels) is then shuffled and split into three subsets: one used to train the deep learning model, one used during training to validate the model after each epoch, and one used after training to evaluate how the model performs on data it has never seen.
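
The three-way split can be pictured as slicing one shuffled list of row indices. The sketch below uses only the standard library and small, made-up sizes (10 rows, 6 for training, 2 for validation). The helper name `three_way_split` is hypothetical; the notebook cell itself works on NumPy arrays with `training_samples` and `validation_samples`.

```python
import random

def three_way_split(n_rows, n_train, n_val, seed=0):
    """Shuffle row indices, then slice them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(n_rows=10, n_train=6, n_val=2)
print(len(train_idx), len(val_idx), len(test_idx))  # 6 2 2
```

Because the slices are taken from the same shuffled list, every row lands in exactly one subset.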

Run the following cell.

In [16]:
# use the Tokenizer from Keras to "learn" a vocabulary from the entire car components text
print("Tokenizing data...")

maxlen = 15                                           
training_samples = 90000                                 
validation_samples = 5000    
max_words = 10000      

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(components)
sequences = tokenizer.texts_to_sequences(components)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])                     
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]

x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

x_test = data[training_samples + validation_samples:]
y_test = labels[training_samples + validation_samples:]
print("Tokenizing data complete.")

Now take a look at how the text was encoded as an array in the previous cell. Run the following cell to see an example.

Each text vector has a fixed length of 15 because `maxlen` was set to 15 above. The text "manufactured in 1971 made of carbon fiber in good condition" has 10 words, and each word is represented by an integer assigned by `keras.preprocessing.text.Tokenizer`. For example, the word "manufactured" is represented by the integer 3. Finally, the text vector is pre-padded with zeros to bring its length to 15.
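
The encoding can be imitated in plain Python. The sketch below is a simplified stand-in, not the Keras implementation: `build_word_index` and `pre_pad` are hypothetical helpers that rank words by frequency (index 0 reserved for padding) and left-pad with zeros, which mirrors the default behavior of `Tokenizer` and `pad_sequences`. Because the toy vocabulary is learned from a single sentence, the integers differ from those produced on the full data set.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency (most frequent = 1); 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pre_pad(sequence, maxlen):
    """Keep the last `maxlen` items and left-pad with zeros up to `maxlen`."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["manufactured in 1971 made of carbon fiber in good condition"]
toy_word_index = build_word_index(texts)
sequence = [toy_word_index[word] for word in texts[0].split()]
padded = pre_pad(sequence, maxlen=15)
print(padded)
```

The 10-word sentence yields 10 integers, so the padded vector starts with 5 zeros.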

In [18]:
print("The text '{text}' is represented as the vector '{data}'".format(text=components[indices[0]], data=x_train[0]))

Next, you will apply the vectors provided by GloVe to create a word embedding matrix. This matrix will be used shortly to set the model weights of the first layer of the deep neural network.
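
As a rough illustration of the matrix layout, the sketch below builds an embedding matrix from two made-up GloVe-style lines. All names and values (`toy_glove_lines`, `toy_word_index`, the 3-dimensional vectors) are assumptions for the example. The key idea is that row `i` of the matrix holds the vector of the word whose tokenizer index is `i`, and words without a pre-trained vector keep a row of zeros.

```python
# Toy GloVe-style text: one word per line followed by its vector components.
toy_glove_lines = [
    "steel 0.1 0.7 -0.2",
    "fair 0.4 -0.3 0.5",
]
toy_word_index = {"steel": 1, "fair": 2, "widget": 3}  # 0 is reserved for padding

# Parse each line into a word -> vector mapping
toy_embeddings = {}
for line in toy_glove_lines:
    word, *values = line.split()
    toy_embeddings[word] = [float(v) for v in values]

dim = 3
num_rows = len(toy_word_index) + 1  # +1 for the padding row at index 0
matrix = [[0.0] * dim for _ in range(num_rows)]
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector  # row i holds the vector for word index i

for row in matrix:
    print(row)
```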

Run the following cell.

In [20]:
# apply the vectors provided by GloVe to create a word embedding matrix
print("Applying GloVe vectors...")
glove_dir =  './'

embeddings_index = {}
# Each line holds a word followed by its 100 vector components
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector    
print("Applying GloVe vectors completed.")

## Build the LSTM recurrent neural network

In the next cell, you will use Keras to define the structure of the deep neural network. In this case, we will build an LSTM recurrent neural network. The network will have a word embedding layer that converts the word indices to GloVe word vectors. The GloVe word vectors are then passed to the LSTM layer, followed by an output layer.
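
The size of each layer can be worked out by hand, which is a useful cross-check against the model summary. The sketch below assumes the configuration used in this notebook (a 10,000 x 100 embedding matrix, an LSTM with 100 units, and a single-unit dense output); the helper functions are hypothetical and simply encode the standard parameter-count formulas.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary row
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight sets (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weights plus one bias per unit
    return input_dim * units + units

emb = embedding_params(10000, 100)
lstm = lstm_params(100, 100)
dense = dense_params(100, 1)
print(emb, lstm, dense, emb + lstm + dense)  # 1000000 80400 101 1080501
```

Because the embedding layer is created with `trainable=False`, only the LSTM and dense parameters (80,501 in total) are updated during training.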

Run the following cell to structure the network and view a summary description of it.

In [22]:
# use Keras to define the structure of the deep neural network   
print("Creating model structure...")

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

## Use MLflow with Azure Machine Learning for Model Training

In the subsequent cells you will learn to do the following:
- Set the MLflow tracking URI to use Azure ML
- Create MLflow experiment – this will create a corresponding experiment in Azure ML Workspace
- Train a model on Azure Databricks cluster while logging metrics and artifacts using MLflow

After this notebook, you should return to the `HOL step-by step - Machine Learning` guide and follow instructions to review the model performance metrics and training artifacts in the Azure Machine Learning workspace.

### Set MLflow tracking URI

Set the MLflow tracking URI to point to your Azure ML Workspace. The subsequent logging calls from MLflow APIs will go to Azure ML services and will be tracked under your Workspace.

In [25]:
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

### Create Experiment

In [27]:
experiment_name = "component-classifier"
mlflow.set_experiment(experiment_name)

### Train Model and Log Metrics and Artifacts

Now you are ready to train the model. Run the cell below to do the following:
- Log model training metrics
- Train model
- Save model
- Log model training curves
- Evaluate model
- Log evaluation metrics

Note that the metrics and artifacts will be recorded with your Azure ML Workspace.
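
The training cell logs the learning rate and the final evaluation metrics, and stores the per-epoch history as a text artifact. As an optional variation, the per-epoch values could also be logged as stepped metrics. The sketch below is an assumption rather than part of the lab: `log_history` is a hypothetical helper, and a stand-in logger replaces `mlflow.log_metric` so that the example runs without an MLflow installation.

```python
def log_history(history, log_metric):
    """Call log_metric(name, value, step=epoch) for every recorded value."""
    for name, values in history.items():
        for epoch, value in enumerate(values, start=1):
            log_metric(name, float(value), step=epoch)

# Stand-in logger that records calls; used here instead of mlflow.log_metric
recorded = []

def fake_log_metric(key, value, step=None):
    recorded.append((key, value, step))

example_history = {"loss": [0.40, 0.25], "accuracy": [0.82, 0.91]}  # made-up numbers
log_history(example_history, fake_log_metric)
print(recorded)
```

Inside the `with mlflow.start_run()` block, the corresponding call would be `log_history(history.history, mlflow.log_metric)`, assuming an MLflow version whose `log_metric` accepts a `step` argument.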

The cell will take **a few minutes** to run on a CPU cluster.

In [29]:
print("Training model...")
model_save_path = "model"
os.makedirs('./outputs/model', exist_ok=True)

with mlflow.start_run() as run:
  lr = 0.001
  mlflow.log_metric('lr', lr)
  opt = keras.optimizers.Adam(lr=lr)
  model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

  history = model.fit(x_train, y_train,
                      epochs=5, 
                      batch_size=16,
                      validation_data=(x_val, y_val))
  print("Training model completed.")
  
  mlflow.keras.log_model(model, model_save_path)

  print("Saving model files...")
  model.save('./outputs/model/model.h5')
  print("model saved in ./outputs/model folder")
  with open(os.path.join('./outputs/model', 'history.txt'), 'w') as f:
    f.write(str(history.history))
  print("history saved in ./outputs/model folder")
  print("Saving model files completed.")
  
  mlflow.log_artifact(os.path.join('./outputs/model', 'history.txt'))
  
  acc = history.history['accuracy']
  val_acc = history.history['val_accuracy']
  loss = history.history['loss']
  val_loss = history.history['val_loss']

  epochs = range(1, len(acc) + 1)

  fig, axes = plt.subplots(ncols=2, figsize=(12.6, 4.8))

  axes[0].plot(epochs, acc, 'bo', label='Training acc')
  axes[0].plot(epochs, val_acc, 'b', label='Validation acc')
  axes[0].set_title('Training and validation accuracy')
  axes[0].legend()

  axes[1].plot(epochs, loss, 'bo', label='Training loss')
  axes[1].plot(epochs, val_loss, 'b', label='Validation loss')
  axes[1].set_title('Training and validation loss')
  axes[1].legend()
  
  training_results_graph = os.path.join('./outputs', 'training_results.png')
  fig.savefig(training_results_graph)
  mlflow.log_artifact(training_results_graph)
  
  print('Evaluating model performance...')
  evaluation_metrics = model.evaluate(x_test, y_test)
  print(evaluation_metrics)
  mlflow.log_metric('eval_loss', evaluation_metrics[0])
  mlflow.log_metric('eval_accuracy', evaluation_metrics[1])

### View the Experiment in Azure Machine Learning Workspace

Run the cell below to display the most recent run of the experiment you just completed in the Azure Machine Learning Workspace.

In [31]:
list(ws.experiments[experiment_name].get_runs())[0]

| Experiment | Id | Type | Status | Details Page | Docs Page |
| --- | --- | --- | --- | --- | --- |
| component-classifier | 0231a3be-52bd-4b7d-999e-88d0bd10ffec | | Completed | Link to Azure Machine Learning studio | Link to Documentation |


### Review Model Performance Metrics and Training Artifacts in Azure Machine Learning Workspace

Return to the `HOL step-by step - Machine Learning` guide and follow instructions to review the model performance metrics and training artifacts in the Azure Machine Learning workspace.