<a href="https://colab.research.google.com/github/arangoml/arangopipe/blob/add_pytorch_example/Arangopipe_with_Pytorch_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Arangopipe with Pytorch

This notebook serves to demonstrate the details of using Arangopipe with Pytorch to store meta-data from machine learning pipelines. This notebook uses the message classification data in the *torchtext* package. This example is available in pytorch. The notebook builds a classifier on the *AG_NEWS* dataset. The task is to predict the label of a message - Science/Technology, World News, Sports etc. The documentation narrative is supplemented to include the details of using *Arangopipe* to capture the metadata from this notebook.

Note: Training will take the order of a few minutes

## Installation Prerequisites

In [0]:
!pip install torch<=1.2.0
!pip install torchtext==0.4
!pip install python-arango
!pip install arangopipe==0.0.6.8.9
!pip install pandas PyYAML==5.1.1
%matplotlib inline

/bin/bash: =1.2.0: No such file or directory
Collecting torchtext==0.4
[?25l  Downloading https://files.pythonhosted.org/packages/43/94/929d6bd236a4fb5c435982a7eb9730b78dcd8659acf328fd2ef9de85f483/torchtext-0.4.0-py3-none-any.whl (53kB)
[K     |████████████████████████████████| 61kB 1.9MB/s 
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.4.0
Collecting python-arango
[?25l  Downloading https://files.pythonhosted.org/packages/f9/c3/1f8445ffc2505997da53ad5ce5e9f40875d561623d6bc84c259879e5cbcc/python-arango-5.2.1.tar.gz (78kB)
[K     |████████████████████████████████| 81kB 2.3MB/s 
Building wheels for collected packages: python-arango
  Building wheel for python-arango (setup.py) ... [?25l[?25hdone
  Created wheel for python-arango: filename=python_arango-5.2.1-py2.py3-none-any.whl size=86479 sha256=e83e5698baf2eb8d3f8f468db27891


Text Classification Tutorial
============================

This tutorial shows how to use the text classification datasets,
including

::

   - AG_NEWS,
   - SogouNews, 
   - DBpedia, 
   - YelpReviewPolarity,
   - YelpReviewFull, 
   - YahooAnswers, 
   - AmazonReviewPolarity,
   - AmazonReviewFull

This example shows the application of ``TextClassification`` Dataset for
supervised learning analysis.

Load data with ngrams
---------------------

A bag of ngrams feature is applied to capture some partial information
about the local word order. In practice, bi-gram or tri-gram are applied
to provide more benefits as word groups than only one word. An example:

::

   "load data with ngrams"
   Bi-grams results: "load data", "data with", "with ngrams"
   Tri-grams results: "load data with", "data with ngrams"

``TextClassification`` Dataset supports the ngrams method. By setting
ngrams to 2, the example text in the dataset will be a list of single
words plus bi-grams string.




In [0]:
import torch
import torchtext
from torchtext.datasets import text_classification
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
	os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ag_news_csv.tar.gz: 11.8MB [00:00, 83.5MB/s]
120000lines [00:08, 13870.11lines/s]
120000lines [00:17, 6916.69lines/s]
7600lines [00:01, 7230.39lines/s]


Define the model
----------------

The model is composed of the
`EmbeddingBag <https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag>`__
layer and the linear layer (see the figure below). ``nn.EmbeddingBag``
computes the mean value of a “bag” of embeddings. The text entries here
have different lengths. ``nn.EmbeddingBag`` requires no padding here
since the text lengths are saved in offsets.

Additionally, since ``nn.EmbeddingBag`` accumulates the average across
the embeddings on the fly, ``nn.EmbeddingBag`` can enhance the
performance and memory efficiency to process a sequence of tensors.

![](https://github.com/pytorch/tutorials/blob/gh-pages/_static/img/text_sentiment_ngrams_model.png?raw=1)





In [0]:
import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

Initiate an instance
--------------------

The AG_NEWS dataset has four labels and therefore the number of classes
is four.

::

   1 : World
   2 : Sports
   3 : Business
   4 : Sci/Tec

The vocab size is equal to the length of vocab (including single word
and ngrams). The number of classes is equal to the number of labels,
which is four in AG_NEWS case.




In [0]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUN_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUN_CLASS).to(device)

Functions used to generate batch
--------------------------------




Since the text entries have different lengths, a custom function
generate_batch() is used to generate data batches and offsets. The
function is passed to ``collate_fn`` in ``torch.utils.data.DataLoader``.
The input to ``collate_fn`` is a list of tensors with the size of
batch_size, and the ``collate_fn`` function packs them into a
mini-batch. Pay attention here and make sure that ``collate_fn`` is
declared as a top level def. This ensures that the function is available
in each worker.

The text entries in the original data batch input are packed into a list
and concatenated as a single tensor as the input of ``nn.EmbeddingBag``.
The offsets is a tensor of delimiters to represent the beginning index
of the individual sequence in the text tensor. Label is a tensor saving
the labels of individual text entries.




In [0]:
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    # torch.Tensor.cumsum returns the cumulative sum
    # of elements in the dimension dim.
    # torch.Tensor([1.0, 2.0, 3.0]).cumsum(dim=0)
    
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label

Define functions to train the model and evaluate results.
---------------------------------------------------------




`torch.utils.data.DataLoader <https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader>`__
is recommended for PyTorch users, and it makes data loading in parallel
easily (a tutorial is
`here <https://pytorch.org/tutorials/beginner/data_loading_tutorial.html>`__).
We use ``DataLoader`` here to load AG_NEWS datasets and send it to the
model for training/validation.




In [0]:
from torch.utils.data import DataLoader

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()
    
    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            loss = criterion(output, cls)
            loss += loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return loss / len(data_), acc / len(data_)

Split the dataset and run the model
-----------------------------------

Since the original AG_NEWS has no valid dataset, we split the training
dataset into train/valid sets with a split ratio of 0.95 (train) and
0.05 (valid). Here we use
`torch.utils.data.dataset.random_split <https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split>`__
function in PyTorch core library.

`CrossEntropyLoss <https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss>`__
criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class.
It is useful when training a classification problem with C classes.
`SGD <https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html>`__
implements stochastic gradient descent method as optimizer. The initial
learning rate is set to 4.0.
`StepLR <https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR>`__
is used here to adjust the learning rate through epochs.




In [0]:
import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 25 seconds
	Loss: 0.0261(train)	|	Acc: 84.7%(train)
	Loss: 0.0001(valid)	|	Acc: 90.4%(valid)
Epoch: 2  | time in 0 minutes, 25 seconds
	Loss: 0.0119(train)	|	Acc: 93.7%(train)
	Loss: 0.0001(valid)	|	Acc: 91.1%(valid)
Epoch: 3  | time in 0 minutes, 25 seconds
	Loss: 0.0069(train)	|	Acc: 96.4%(train)
	Loss: 0.0000(valid)	|	Acc: 89.2%(valid)
Epoch: 4  | time in 0 minutes, 25 seconds
	Loss: 0.0039(train)	|	Acc: 98.1%(train)
	Loss: 0.0000(valid)	|	Acc: 91.2%(valid)
Epoch: 5  | time in 0 minutes, 25 seconds
	Loss: 0.0023(train)	|	Acc: 99.0%(train)
	Loss: 0.0000(valid)	|	Acc: 90.2%(valid)


Evaluate the model with test dataset
------------------------------------




In [0]:
print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0003(test)	|	Acc: 87.6%(test)


Test on a random news
---------------------

Use the best model so far and test a golf news. The label information is
available
`here <https://pytorch.org/text/datasets.html?highlight=ag_news#torchtext.datasets.AG_NEWS>`__.




In [0]:
import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news


# Using Arangopipe to Capture Metadata

In [0]:

from arangopipe.arangopipe_storage.arangopipe_api import ArangoPipe
from arangopipe.arangopipe_storage.arangopipe_admin_api import ArangoPipeAdmin
from arangopipe.arangopipe_storage.arangopipe_config import ArangoPipeConfig
from arangopipe.arangopipe_storage.managed_service_conn_parameters import ManagedServiceConnParam
mdb_config = ArangoPipeConfig()
msc = ManagedServiceConnParam()
conn_params = { msc.DB_SERVICE_HOST : "arangoml.arangodb.cloud", \
                        msc.DB_SERVICE_END_POINT : "createDB",\
                        msc.DB_SERVICE_NAME : "createDB",\
                        msc.DB_SERVICE_PORT : 8529,
                        msc.DB_CONN_PROTOCOL : 'https'}
        
mdb_config = mdb_config.create_connection_config(conn_params)

In [0]:
admin = ArangoPipeAdmin(reuse_connection = False, config = mdb_config)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)
# Error indicating "heart beat check was not found" is expected.

DEBUG:arangopipe_admin_logger:Connection reuse: False
INFO:arangopipe_admin_logger:Requesting a managed service database...


API endpoint: https://arangoml.arangodb.cloud:8529/_db/_system/createDB/createDB


INFO:arangopipe_admin_logger:Managed service database was created !


Host Connection: https://arangoml.arangodb.cloud:8529


2020-02-20 11:11:23,236 - arangopipe_logger - ERROR - The dataset by name: heart beat check was not found in Arangopipe!
ERROR:arangopipe_logger:The dataset by name: heart beat check was not found in Arangopipe!


In [0]:
proj_info = {"name": "Text_Classification_Using_Pytorch"}
proj_reg = admin.register_project(proj_info)

ds_info = {"name" : "text classification dataset in torchtext",\
           "description": "Classification task pertaining to classifiying the category of a message" }
ds_reg = ap.register_dataset(ds_info)

featureset = {'name': 'model_embeddings',\
              'description': 'see model state for embeddings'}
fs_reg = ap.register_featureset(featureset, ds_reg["_key"])

model_info = {"name": "Neural Network",\
              "type": "Neural network with a linear layer and an embedding layer using the Cross-Entropy loss with SGD optimizer  "}
model_reg = ap.register_model(model_info, project = "Text_Classification_Using_Pytorch")

INFO:arangopipe_logger:Recording dataset dataset link {'_id': 'datasets/54082370', '_key': '54082370', '_rev': '_aEHJj5y---'}
INFO:arangopipe_logger:Recording featureset {'_id': 'featuresets/54082371', '_key': '54082371', '_rev': '_aEHJkAy--A'}
INFO:arangopipe_logger:Recording featureset dataset link {'_id': 'featureset_dataset/54082371-54082370', '_key': '54082371-54082370', '_rev': '_aEHJkHa--A'}
INFO:arangopipe_logger:Recording project model link {'_id': 'project_models/54082369-54082372', '_key': '54082369-54082372', '_rev': '_aEHJkau--A'}


In [0]:

import uuid #used as run id

from datetime import datetime

# current date and time
now = datetime.now()
timestamp = datetime.timestamp(now)
ruuid = uuid.uuid4()

model_params = {"run_id": str(ruuid), 'nn_parameters_storage_loc': '< a hdfs url>'}
model_perf = {'training_accuracy': train_acc, 'test_accuracy': test_acc, "run_id": str(ruuid), "timestamp": timestamp}
run_info = {"dataset" : ds_reg["_key"],\
              "featureset": fs_reg["_key"],\
              "run_id": str(ruuid),\
              "model": model_reg["_key"],\
              "model-params": model_params,\
              "model-perf": model_perf,\
              "pipeline" : "Text Classification Pipeline",\
              "tag": "text_classification_model_params_saved",\
              "project": "Text_Classification_Using_Pytorch"}

ap.log_run(run_info)

INFO:arangopipe_logger:Run info {'_key': '96318554-52c1-4396-9d30-5e0bf2ea3708', 'timestamp': 1582197095.216473, 'tag': 'text_classification_model_params_saved'}
INFO:arangopipe_logger:Recording run {'_id': 'run/96318554-52c1-4396-9d30-5e0bf2ea3708', '_key': '96318554-52c1-4396-9d30-5e0bf2ea3708', '_rev': '_aEHJs3m--_'}
INFO:arangopipe_logger:Recording model run link {'_id': 'run_models/96318554-52c1-4396-9d30-5e0bf2ea3708-54082372', '_key': '96318554-52c1-4396-9d30-5e0bf2ea3708-54082372', '_rev': '_aEHJt-G--A'}
INFO:arangopipe_logger:Recording model params {'_id': 'modelparams/96318554-52c1-4396-9d30-5e0bf2ea3708', '_key': '96318554-52c1-4396-9d30-5e0bf2ea3708', '_rev': '_aEHJtEm--A'}
INFO:arangopipe_logger:Recording run featureset link {'_id': 'run_featuresets/96318554-52c1-4396-9d30-5e0bf2ea3708-54082371', '_key': '96318554-52c1-4396-9d30-5e0bf2ea3708-54082371', '_rev': '_aEHJtLK---'}
INFO:arangopipe_logger:Recording run model params {'_id': 'run_modelparams/96318554-52c1-4396-9d30-

## Review the Metadata Captured

The above example illustrated the details of storing metadata from machine learning model development into Arangopipe. This information can be subsequently retrieved as shown below. Since, we are using the same notebook, we can simply request a connection to the Arangopipe database used above by specifying the reconnection parameter to be true. If you need to reconnect to the database created earlier in another notebook, you can do so by exporting the connection information to a file. You can then import this connection configuration in your new notebook. See the notebooks *Arangopipe_Feature_Examples* and *Reuse_Old_Arangopipe_Connection* for the details.


In [0]:
admin = ArangoPipeAdmin(reuse_connection = True)
ap_config = admin.get_config()
ap = ArangoPipe(config = ap_config)

INFO:arangopipe_admin_logger:If a config is provided, it will be used for setting up the connection
INFO:arangopipe_admin_logger:DB name for connection: MLdnkskf2efzdtskfnq3rs7f
INFO:arangopipe_admin_logger:user name for connection: MLjh7zu8wy07mk1bz026dyn
INFO:arangopipe_admin_logger:A specific password was requested !
DEBUG:arangopipe_admin_logger:Connection reuse: True


Host Connection: https://arangoml.arangodb.cloud:8529


2020-02-20 11:11:38,813 - arangopipe_logger - ERROR - The dataset by name: heart beat check was not found in Arangopipe!
ERROR:arangopipe_logger:The dataset by name: heart beat check was not found in Arangopipe!


Use the connection obtained above to retrieve the model performance of the model created earlier.

In [0]:
ap.lookup_modelperf(tag_value='text_classification_model_params_saved')

{'_id': 'devperf/96318554-52c1-4396-9d30-5e0bf2ea3708',
 '_key': '96318554-52c1-4396-9d30-5e0bf2ea3708',
 '_rev': '_aEHJtYO--_',
 'run_id': '96318554-52c1-4396-9d30-5e0bf2ea3708',
 'test_accuracy': 0.8763157894736842,
 'timestamp': 1582197095.216473,
 'training_accuracy': 0.9900701754385965}