## Masked Language Model

In [5]:
from transformers import pipeline
import numpy as np
import pandas as pd
import seaborn as sn
import textwrap
import matplotlib.pyplot as plt
from pprint import pprint

from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split

In [2]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)
if device == 'cuda':
    print("current_device: ", torch.cuda.current_device())

Device: cuda
current_device:  0


In [3]:
# Name the checkpoint explicitly (distilroberta-base is the fill-mask default)
# and run on the device detected above.
mlm = pipeline("fill-mask", model="distilroberta-base", device=device)



Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [4]:
mlm("The cat <mask> over the box")

[{'score': 0.10449201613664627,
  'token': 13855,
  'token_str': ' jumps',
  'sequence': 'The cat jumps over the box'},
 {'score': 0.05758359655737877,
  'token': 33265,
  'token_str': ' crawling',
  'sequence': 'The cat crawling over the box'},
 {'score': 0.048404544591903687,
  'token': 33189,
  'token_str': ' leaping',
  'sequence': 'The cat leaping over the box'},
 {'score': 0.047166697680950165,
  'token': 10907,
  'token_str': ' climbing',
  'sequence': 'The cat climbing over the box'},
 {'score': 0.03080787882208824,
  'token': 32564,
  'token_str': ' leaps',
  'sequence': 'The cat leaps over the box'}]
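
The prompt above hard-codes `<mask>`, the mask token used by RoBERTa-family tokenizers; BERT-family checkpoints use `[MASK]` instead. Below is a minimal sketch of a helper that masks one word of a sentence, assuming words are separated by whitespace. The name `mask_word` is hypothetical and not part of `transformers`.

```python
def mask_word(sentence, index, mask_token="<mask>"):
    """Replace the whitespace-separated word at `index` with `mask_token`."""
    words = sentence.split()
    if not -len(words) <= index < len(words):
        raise IndexError(f"index {index} out of range for {len(words)} words")
    words[index] = mask_token
    return " ".join(words)

# Rebuild the prompt used above from the unmasked sentence.
prompt = mask_word("The cat jumps over the box", 2)
print(prompt)
```

With the pipeline loaded, the token can be taken from the tokenizer rather than typed by hand, for example `mlm(mask_word(sentence, 2, mask_token=mlm.tokenizer.mask_token))`.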

In [6]:
df = pd.read_csv('data/bbc_text_cls.csv')

In [7]:
df.head()

| Unnamed: 0 | text | labels |
|---|---|---|
| 0 | Ad sales boost Time Warner profit\n\nQuarterly... | business |
| 1 | Dollar gains on Greenspan speech\n\nThe dollar... | business |
| 2 | Yukos unit buyer faces loan claim\n\nThe owner... | business |
| 3 | High fuel prices hit BA's profits\n\nBritish A... | business |
| 4 | Pernod takeover talk lifts Domecq\n\nShares in... | business |


In [8]:
labels = set(df['labels'])
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [10]:
label = 'business'

In [11]:
texts = df[df['labels'] == label]['text']
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [13]:
i = np.random.choice(texts.shape[0])
doc = texts.iloc[i]
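
`np.random.choice` draws a different document on every run, so the outputs below will not match on re-execution. A small sketch of a reproducible draw using NumPy's `Generator` API follows; the seed value and the stand-in list are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability
texts = ["doc a", "doc b", "doc c"]  # stand-in for the `texts` Series in the notebook
i = rng.integers(len(texts))         # uniform integer in [0, len(texts))
doc = texts[i]
print(i, doc)
```

With the real pandas Series, the last step would be `texts.iloc[i]` as in the cell above.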

In [14]:
doc

"Troubled Marsh under SEC scrutiny\n\nThe US stock market regulator is investigating troubled insurance broker Marsh & McLennan's shareholder transactions, the firm has said.\n\nThe Securities and Exchange Commission has asked for information about transactions involving holders of 5% or more of the firm's shares. Marsh has said it is co-operating fully with the SEC investigation. Marsh is also the focus of an inquiry the New York attorney-general into whether insurers rigged the market. Since that inquiry was launched in October, Marsh has replaced its chief executive and held a boardroom shake-out to meet criticism by lessening the number of company executives on the board. Prosecutors allege that Marsh - the world's biggest insurance broker - and other US insurance firms may have fixed bids for corporate cover. This is the issue at the heart of the inquiry by New York's top law officer, Eliot Spitzer, and a separate prosecution of five insurers by the State of California. The SEC's 

In [15]:
print(textwrap.fill(doc, replace_whitespace=False, fix_sentence_endings=True))

Troubled Marsh under SEC scrutiny

The US stock market regulator is
investigating troubled insurance broker Marsh & McLennan's shareholder
transactions, the firm has said.

The Securities and Exchange
Commission has asked for information about transactions involving
holders of 5% or more of the firm's shares.  Marsh has said it is co-
operating fully with the SEC investigation.  Marsh is also the focus
of an inquiry the New York attorney-general into whether insurers
rigged the market.  Since that inquiry was launched in October, Marsh
has replaced its chief executive and held a boardroom shake-out to
meet criticism by lessening the number of company executives on the
board.  Prosecutors allege that Marsh - the world's biggest insurance
broker - and other US insurance firms may have fixed bids for
corporate cover.  This is the issue at the heart of the inquiry by New
York's top law officer, Eliot Spitzer, and a separate prosecution of
five insurers by the State of California.  The SEC'
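
In the output above, the first line of each paragraph breaks early. With `replace_whitespace=False`, `textwrap.fill` keeps the embedded `\n\n` but still counts the preceding text toward the line width. The `textwrap` documentation suggests wrapping paragraphs separately in this case; the sketch below does that with a hypothetical helper, `wrap_paragraphs`, assuming paragraphs are separated by blank lines.

```python
import textwrap

def wrap_paragraphs(text, width=70):
    """Wrap each blank-line-separated paragraph on its own, then rejoin them."""
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(p, width=width, fix_sentence_endings=True)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)

sample = (
    "Troubled Marsh under SEC scrutiny\n\n"
    "The US stock market regulator is investigating troubled insurance "
    "broker Marsh & McLennan's shareholder transactions, the firm has said."
)
print(wrap_paragraphs(sample))
```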

In [18]:
mlm("The Securities and Exchange Commission has asked for information about <mask>")

[{'score': 0.026470301672816277,
  'token': 24,
  'token_str': ' it',
  'sequence': 'The Securities and Exchange Commission has asked for information about it'},
 {'score': 0.023832373321056366,
  'token': 734,
  'token_str': '...',
  'sequence': 'The Securities and Exchange Commission has asked for information about...'},
 {'score': 0.02131667733192444,
  'token': 35,
  'token_str': ':',
  'sequence': 'The Securities and Exchange Commission has asked for information about:'},
 {'score': 0.01988436095416546,
  'token': 42,
  'token_str': ' this',
  'sequence': 'The Securities and Exchange Commission has asked for information about this'},
 {'score': 0.014431827701628208,
  'token': 13070,
  'token_str': ' settlements',
  'sequence': 'The Securities and Exchange Commission has asked for information about settlements'}]

In [20]:
text = 'The uncertainty unleashed by the scandal has prompted <mask>'
pprint(mlm(text))

[{'score': 0.2745307385921478,
  'sequence': 'The uncertainty unleashed by the scandal has prompted protests',
  'token': 3246,
  'token_str': ' protests'},
 {'score': 0.18643581867218018,
  'sequence': 'The uncertainty unleashed by the scandal has prompted '
              'speculation',
  'token': 6116,
  'token_str': ' speculation'},
 {'score': 0.03370516747236252,
  'sequence': 'The uncertainty unleashed by the scandal has prompted '
              'investigations',
  'token': 4941,
  'token_str': ' investigations'},
 {'score': 0.027479568496346474,
  'sequence': 'The uncertainty unleashed by the scandal has prompted '
              'resignation',
  'token': 6985,
  'token_str': ' resignation'},
 {'score': 0.020345035940408707,
  'sequence': 'The uncertainty unleashed by the scandal has prompted criticism',
  'token': 3633,
  'token_str': ' criticism'}]
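
Each prediction is a dict with `score`, `token`, `token_str`, and `sequence` keys, and `token_str` keeps the leading space produced by the tokenizer. The sketch below reduces such a list to `(word, score)` pairs. `summarize_fills` is a hypothetical helper, and the sample list is copied from the output above with scores rounded to four decimals.

```python
def summarize_fills(predictions, min_score=0.0):
    """Return (word, score) pairs, highest score first, dropping low-score fills."""
    pairs = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Same structure the fill-mask pipeline returned above (scores rounded).
sample = [
    {"score": 0.2745, "token": 3246, "token_str": " protests"},
    {"score": 0.1864, "token": 6116, "token_str": " speculation"},
    {"score": 0.0337, "token": 4941, "token_str": " investigations"},
    {"score": 0.0275, "token": 6985, "token_str": " resignation"},
    {"score": 0.0203, "token": 3633, "token_str": " criticism"},
]

for word, score in summarize_fills(sample, min_score=0.03):
    print(f"{word:<15} {score:.3f}")
```

In the notebook the same call would take the pipeline output directly, as in `summarize_fills(mlm(text))`.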
