# Initial RoBERTa NLP Explorations

Author: Cognitech LLC

### Environment Setup

__USE_TPU__: Set to `True` to use the TPU runtime. Before doing so, change the Colab notebook's runtime type to TPU.

## Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](wsc/README.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)

### Imports, hardware and software verification

In [1]:
!pip install splinter

Collecting splinter
  Downloading https://files.pythonhosted.org/packages/4e/33/6d6e3abeb0fa47284f2d3b74de7f72018a3af0267b0752b9e7c66175c99e/splinter-0.11.0.tar.gz
Collecting selenium>=3.141.0 (from splinter)
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 5.2MB/s 
Building wheels for collected packages: splinter
  Building wheel for splinter (setup.py) ... done
  Created wheel for splinter: filename=splinter-0.11.0-cp36-none-any.whl size=30144 sha256=fa74ebf66e992f7527efd545c50d51bad073326ea7e71dd40a067a44198d836d
  Stored in directory: /root/.cache/pip/wheels/9c/e1/35/4809427c48cb88853ddc1e701e9c5629bf192ae9324f2d0042
Successfully built splinter
Installing collected packages: selenium, splinter
Successfully installed selenium-3.141.0 splinter-0.11.0


In [2]:
!pip install bs4



In [3]:
!pip install regex

Collecting regex
  Downloading https://files.pythonhosted.org/packages/ff/60/d9782c56ceefa76033a00e1f84cd8c586c75e6e7fea2cd45ee8b46a386c5/regex-2019.08.19-cp36-cp36m-manylinux1_x86_64.whl (643kB)
     |████████████████████████████████| 645kB 3.0MB/s 
Installing collected packages: regex
Successfully installed regex-2019.8.19


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os

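# Use .get() so a missing variable triggers the assertion message instead of a KeyError
assert os.environ.get('COLAB_TPU_ADDR'), 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

# Compare (major, minor) as integers; a float comparison would treat 1.9 as >= 1.14
assert tuple(int(part) for part in tf.__version__.split('.')[:2]) >= (1, 14), 'Make sure that TensorFlow version is at least 1.14'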

# Web Scraping
from splinter import Browser
from bs4 import BeautifulSoup
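
A version check like the one in the cell above is easier to get right when dotted versions are compared as tuples of integers, because a float comparison would treat 1.9 as newer than 1.14. The following is a minimal sketch of that idea; `version_at_least` is a hypothetical helper introduced here for illustration and is not part of TensorFlow.

```python
import re


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= a tuple such as (1, 14).

    Only the leading digits of each component are used, so a suffix
    such as '-rc0' on a later component does not break the comparison.
    """
    parts = []
    for piece in version.split('.')[:len(minimum)]:
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts) >= tuple(minimum)


print(version_at_least('1.14.0', (1, 14)))
print(version_at_least('1.9.0', (1, 14)))
```

In the notebook this could be called as `version_at_least(tf.__version__, (1, 14))`.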

In [0]:
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']

### Load RoBERTa from torch.hub (PyTorch >= 1.1):

In [5]:
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

Downloading: "https://github.com/pytorch/fairseq/archive/master.zip" to /root/.cache/torch/hub/master.zip
100%|██████████| 655283069/655283069 [00:17<00:00, 37119719.06B/s]


loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz from cache at /root/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2
extracting archive file /root/.cache/torch/pytorch_fairseq/83e3a689e28e5e4696ecb0bbb05a77355444a5c8a3437e0f736d8a564e80035e.c687083d14776c1979f3f71654febb42f2bb3d9a94ff7ebdfe1ac6748dba89d2 to temp dir /tmp/tmp5elu94d8
| dictionary: 50264 types


1042301B [00:00, 2565164.20B/s]
456318B [00:00, 2000449.87B/s]


RobertaHubInterface(
  (model): RobertaModel(
    (decoder): RobertaEncoder(
      (sentence_encoder): TransformerSentenceEncoder(
        (embed_tokens): Embedding(50265, 1024, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 1024, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): Linear(in_features=1024, out_features=1024, bias=Tru

### Use RoBERTa for sentence-pair classification tasks:

In [6]:
# Download RoBERTa already fine-tuned on MNLI (Multi-Genre Natural Language Inference)
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation

Using cache found in /root/.cache/torch/hub/pytorch_fairseq_master
100%|██████████| 751652118/751652118 [00:19<00:00, 38175956.89B/s]


loading archive file http://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz from cache at /root/.cache/torch/pytorch_fairseq/7685ba8546f9a5ce1a00c7a6d7d44f7e748d22681172f0f391c3d48f487c801c.74e37d47306b3cc51c5f8d335022a392c29f1906c8cd9e9cd3446d7422cf55d8
extracting archive file /root/.cache/torch/pytorch_fairseq/7685ba8546f9a5ce1a00c7a6d7d44f7e748d22681172f0f391c3d48f487c801c.74e37d47306b3cc51c5f8d335022a392c29f1906c8cd9e9cd3446d7422cf55d8 to temp dir /tmp/tmpg6v9qara
| dictionary: 50264 types


RobertaHubInterface(
  (model): RobertaModel(
    (decoder): RobertaEncoder(
      (sentence_encoder): TransformerSentenceEncoder(
        (embed_tokens): Embedding(50265, 1024, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 1024, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): Linear(in_features=1024, out_features=1024, bias=Tru

### Legend for Sentence Prediction
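__0__: contradiction, meaning the second sentence conflicts with the first

__1__: neutral, meaning the first sentence neither supports nor contradicts the second

__2__: entailment, meaning the second sentence follows from the first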
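
The legend can also be kept in code so that predictions print as names rather than bare integers. This is a small sketch; `MNLI_LABELS` and `mnli_label` are hypothetical names introduced here, and the index order is taken from the legend above.

```python
# Index-to-label mapping, following the legend above
MNLI_LABELS = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}


def mnli_label(class_index):
    """Map a predicted class index (int or zero-dimensional tensor) to its label."""
    return MNLI_LABELS[int(class_index)]


print(mnli_label(0))
print(mnli_label(2))
```

With the model loaded, the argument would be the result of `roberta.predict('mnli', tokens).argmax()`; `int()` accepts a zero-dimensional PyTorch tensor.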

In [7]:
# Encode a pair of sentences and make a prediction
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

tensor(0)

In [8]:
# Encode another pair of sentences
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment

tensor(2)

In [9]:
# Encode a custom sentence pair written by Finian
tokens_fin = roberta.encode('Mathematicians seek and use patterns to formulate new conjectures.', 'Patterns are always used by mathematicians.')
roberta.predict('mnli', tokens_fin).argmax()  # 1: neutral

tensor(1)

In [10]:
# Encode another custom sentence pair written by Finian
tokens_giants = roberta.encode('The San Francisco Giants play baseball at Oracle Park in downtown San Francisco.', 'The Giants play in New York.')
roberta.predict('mnli', tokens_giants).argmax()  # 0: contradiction

tensor(0)
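
Taking `argmax()` hides how confident the model is. Assuming `roberta.predict` returns log-probabilities over the three classes (as fairseq's hub interface does when logits are not requested), they can be exponentiated to recover probabilities. The sketch below uses plain Python lists and made-up numbers; `to_probabilities` is a hypothetical helper, and the example values are not model output.

```python
import math


def to_probabilities(log_probs):
    """Exponentiate log-probabilities and renormalize so they sum to 1."""
    probs = [math.exp(lp) for lp in log_probs]
    total = sum(probs)
    return [p / total for p in probs]


# Made-up log-probabilities for illustration only (not real model output)
example_log_probs = [math.log(0.9), math.log(0.07), math.log(0.03)]

probs = to_probabilities(example_log_probs)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```

For a real prediction, the input list could be obtained with something like `roberta.predict('mnli', tokens)[0].tolist()`, assuming the result has shape `(1, 3)`.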

### Scrape Sample Article to Feed into RoBERTa

In [14]:
# Install chromedriver, then look up its path and store it in a variable
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_driver_path = !which chromedriver

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [564 B]
Get:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release [564 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [819 B]
Get:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release.gpg [833 B]
Hit:9 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:11 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic InRelease [15.4 kB]
Get

In [0]:

# Set Chrome driver options that keep the browser from crashing in Colab
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [0]:
# Set the executable path and initialize the chrome browser in splinter
executable_path = {'executable_path': chrome_driver_path[0]}
chrome_browser = Browser('chrome', **executable_path, options=chrome_options)

In [0]:
# URL of the Wikipedia article on physics
physics_url = 'https://en.wikipedia.org/wiki/Physics'
# Have Chrome navigate to that URL
chrome_browser.visit(physics_url)

In [0]:
# Convert the HTML of the Wikipedia physics page loaded above into a BeautifulSoup object
physics_html = chrome_browser.html
physics_beautiful_soup = BeautifulSoup(physics_html, 'html.parser')

In [31]:
# Display all paragraph tags, numbered from 1
for counter, entry in enumerate(physics_beautiful_soup.find_all('p'), start=1):
  print("{0}: {1}".format(counter, entry))

1: <p class="mw-empty-elt">
</p>
2: <p><b>Physics</b> (from <a class="mw-redirect" href="/wiki/Ancient_Greek_language" title="Ancient Greek language">Ancient Greek</a>: <span lang="grc">φυσική (ἐπιστήμη)</span>, <small><a class="mw-redirect" href="/wiki/Romanization_of_Ancient_Greek" title="Romanization of Ancient Greek">romanized</a>: </small><i lang="grc-Latn" title="Ancient Greek-language romanization">physikḗ (epistḗmē)</i>, <small><a href="/wiki/Literal_translation" title="Literal translation">lit.</a> </small>'knowledge of nature', from <span lang="grc" title="Ancient Greek language text">φύσις</span> <i>phýsis</i> 'nature')<sup class="reference" id="cite_ref-etymonline-physics_1-0"><a href="#cite_note-etymonline-physics-1">[1]</a></sup><sup class="reference" id="cite_ref-etymonline-physic_2-0"><a href="#cite_note-etymonline-physic-2">[2]</a></sup><sup class="reference" id="cite_ref-LSJ_3-0"><a href="#cite_note-LSJ-3">[3]</a></sup> is the <a href="/wiki/Natural_science" title="Na

In [41]:
# Paragraph 11 seems like worthwhile text to test on RoBERTa, so store it in a variable
test_text = physics_beautiful_soup.find_all('p')[10].text
print(test_text)

But this is completely erroneous, and our view may be corroborated by actual observation more effectively than by any sort of verbal argument. For if you let fall from the same height two weights of which one is many times as heavy as the other, you will see that the ratio of the times required for the motion does not depend on the ratio of the weights, but that the difference in time is a very small one. And so, if the difference in the weights is not considerable, that is, of one is, let us say, double the other, there will be no difference, or else an imperceptible difference, in time, though the difference in weight is by no means negligible, with one body weighing twice as much as the other[18]


In [43]:
# Strip the trailing citation marker and end the text with a period
test_text_clean = test_text.split("[18]")[0] + "."
print(test_text_clean)

But this is completely erroneous, and our view may be corroborated by actual observation more effectively than by any sort of verbal argument. For if you let fall from the same height two weights of which one is many times as heavy as the other, you will see that the ratio of the times required for the motion does not depend on the ratio of the weights, but that the difference in time is a very small one. And so, if the difference in the weights is not considerable, that is, of one is, let us say, double the other, there will be no difference, or else an imperceptible difference, in time, though the difference in weight is by no means negligible, with one body weighing twice as much as the other.
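
Splitting on the literal `"[18]"` only works for this one paragraph. A more general option, sketched here under the assumption that citation markers always look like a number in square brackets, is to remove them with a regular expression. `strip_citations` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re


def strip_citations(text):
    """Remove markers such as [18] and make sure the text ends with punctuation."""
    cleaned = re.sub(r'\[\d+\]', '', text).strip()
    if cleaned and cleaned[-1] not in '.!?':
        cleaned += '.'
    return cleaned


print(strip_citations('with one body weighing twice as much as the other[18]'))
```

Applied to `test_text`, this should give the same result as the split-based cell above, and it also handles markers that appear in the middle of a paragraph.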


In [47]:
# encode this and test on RoBERTa
tokens_physics = roberta.encode(test_text_clean, 'The weight of the object determines how fast the object falls.')
roberta.predict('mnli', tokens_physics).argmax()  # 0: contradiction

tensor(0)
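
The scraped premise is several sentences long, whereas the earlier premises were single sentences. One option, offered as an assumption rather than something this notebook does, is to split the paragraph and score each sentence against the hypothesis separately. The sketch below uses a naive regular expression that breaks after `.`, `!` or `?`; `split_sentences` is a hypothetical helper and will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(paragraph):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    pieces = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [piece for piece in pieces if piece]


for sentence in split_sentences('A b. C d? E.'):
    print(sentence)
```

Each returned sentence could then be passed as the first argument to `roberta.encode`, with the hypothesis as the second.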