In [4]:
# Dataset source: https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
# Uncomment to download the CSV used below:
#!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

In [5]:
!pip install transformers



In [6]:
import numpy as np
import pandas as pd
import textwrap
from pprint import pprint

from transformers import pipeline

In [8]:
df = pd.read_csv('bbc_text_cls.csv')

In [9]:
df.head()

Unnamed: 0  text                                               labels
0           Ad sales boost Time Warner profit\n\nQuarterly...  business
1           Dollar gains on Greenspan speech\n\nThe dollar...  business
2           Yukos unit buyer faces loan claim\n\nThe owner...  business
3           High fuel prices hit BA's profits\n\nBritish A...  business
4           Pernod takeover talk lifts Domecq\n\nShares in...  business


In [10]:
labels = set(df['labels'])
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [11]:
# Pick a label
label = 'business'

In [12]:
texts = df[df['labels'] == label]['text']
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

In [13]:
np.random.seed(1234)

In [14]:
# Draw a random row position among the documents with the chosen label
i = np.random.choice(texts.shape[0])
doc = texts.iloc[i]

In [15]:
print(textwrap.fill(doc, replace_whitespace=False, fix_sentence_endings=True))

Bombardier chief to leave company

Shares in train and plane-making
giant Bombardier have fallen to a 10-year low following the departure
of its chief executive and two members of the board.

Paul Tellier,
who was also Bombardier's president, left the company amid an ongoing
restructuring.  Laurent Beaudoin, part of the family that controls the
Montreal-based firm, will take on the role of CEO under a newly
created management structure.  Analysts said the resignations seem to
have stemmed from a boardroom dispute.  Under Mr Tellier's tenure at
the company, which began in January 2003, plans to cut the worldwide
workforce of 75,000 by almost a third by 2006 were announced.  The
firm's snowmobile division and defence services unit were also sold
and Bombardier started the development of a new aircraft seating 110
to 135 passengers.

Mr Tellier had indicated he wanted to stay at the
world's top train maker and third largest manufacturer of civil
aircraft until the restructuring was comple
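
The wrapped output above breaks in odd places (for example right after "plane-making"). With `replace_whitespace=False`, `textwrap.fill` keeps the document's own newlines but still counts characters across them, so the headline and the start of the first paragraph are measured as one line. The `textwrap` documentation suggests wrapping each paragraph separately instead. A minimal sketch of that approach, assuming paragraphs are separated by blank lines as in this dataset; `wrap_paragraphs` and `sample_doc` are hypothetical names, not part of the original notebook:

```python
import textwrap

def wrap_paragraphs(text, width=70):
    """Wrap each blank-line-separated paragraph on its own."""
    paragraphs = text.split("\n\n")
    wrapped = [
        textwrap.fill(
            p,
            width=width,
            fix_sentence_endings=True,
            break_on_hyphens=False,  # keep words like "10-year" on one line
        )
        for p in paragraphs
    ]
    # Re-join with blank lines so the paragraph structure survives
    return "\n\n".join(wrapped)

# Shortened stand-in for `doc`, copied from the article shown above
sample_doc = (
    "Bombardier chief to leave company\n\n"
    "Shares in train and plane-making giant Bombardier have fallen to a "
    "10-year low following the departure of its chief executive and two "
    "members of the board."
)

print(wrap_paragraphs(sample_doc))
```

In the notebook this would be called as `print(wrap_paragraphs(doc))`.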

In [16]:
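# No model is named here, so the pipeline falls back to its default
# fill-mask checkpoint (see the log message below for which one)
mlm = pipeline('fill-mask')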

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
mlm('Bombardier chief to leave <mask>')

[{'score': 0.06950808316469193,
  'token': 633,
  'token_str': ' job',
  'sequence': 'Bombardier chief to leave job'},
 {'score': 0.06693080812692642,
  'token': 1470,
  'token_str': ' France',
  'sequence': 'Bombardier chief to leave France'},
 {'score': 0.052735332399606705,
  'token': 558,
  'token_str': ' office',
  'sequence': 'Bombardier chief to leave office'},
 {'score': 0.025823058560490608,
  'token': 2201,
  'token_str': ' Paris',
  'sequence': 'Bombardier chief to leave Paris'},
 {'score': 0.021368617191910744,
  'token': 896,
  'token_str': ' Canada',
  'sequence': 'Bombardier chief to leave Canada'}]
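
Each prediction is a dict with `score`, `token`, `token_str`, and `sequence` keys, and `token_str` carries a leading space. A small sketch of a helper that reduces the list to (word, score) pairs for easier reading; `summarize` and `sample_predictions` are hypothetical names, and the sample list copies the first three results above so the block runs without loading a model:

```python
def summarize(predictions, digits=3):
    """Return (word, rounded score) pairs from fill-mask pipeline output."""
    return [
        (p['token_str'].strip(), round(p['score'], digits))
        for p in predictions
    ]

# First three predictions copied from the output above
sample_predictions = [
    {'score': 0.06950808316469193, 'token': 633, 'token_str': ' job',
     'sequence': 'Bombardier chief to leave job'},
    {'score': 0.06693080812692642, 'token': 1470, 'token_str': ' France',
     'sequence': 'Bombardier chief to leave France'},
    {'score': 0.052735332399606705, 'token': 558, 'token_str': ' office',
     'sequence': 'Bombardier chief to leave office'},
]

for word, score in summarize(sample_predictions):
    print(f'{word:<10}{score:.3f}')
```

With the pipeline loaded, the same helper would apply directly to `mlm('Bombardier chief to leave <mask>')`.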

In [18]:
# Mask the word "train" in the article's first sentence
text = (
    'Shares in <mask> and plane-making '
    'giant Bombardier have fallen to a 10-year low following the departure '
    'of its chief executive and two members of the board.'
)

mlm(text)

[{'score': 0.6640968322753906,
  'token': 11016,
  'token_str': ' Airbus',
  'sequence': 'Shares in Airbus and plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.'},
 {'score': 0.2614646553993225,
  'token': 6722,
  'token_str': ' Boeing',
  'sequence': 'Shares in Boeing and plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.'},
 {'score': 0.023635275661945343,
  'token': 15064,
  'token_str': ' aerospace',
  'sequence': 'Shares in aerospace and plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of the board.'},
 {'score': 0.014581810683012009,
  'token': 8537,
  'token_str': ' airlines',
  'sequence': 'Shares in airlines and plane-making giant Bombardier have fallen to a 10-year low following the departure of its chief executive and two members of th

In [19]:
# Same sentence, this time masking the word "departure"
text = (
    'Shares in train and plane-making '
    'giant Bombardier have fallen to a 10-year low following the <mask> '
    'of its chief executive and two members of the board.'
)

pprint(mlm(text))

[{'score': 0.5513917207717896,
  'sequence': 'Shares in train and plane-making giant Bombardier have fallen '
              'to a 10-year low following the resignation of its chief '
              'executive and two members of the board.',
  'token': 6985,
  'token_str': ' resignation'},
 {'score': 0.2109048217535019,
  'sequence': 'Shares in train and plane-making giant Bombardier have fallen '
              'to a 10-year low following the departure of its chief executive '
              'and two members of the board.',
  'token': 5824,
  'token_str': ' departure'},
 {'score': 0.13042035698890686,
  'sequence': 'Shares in train and plane-making giant Bombardier have fallen '
              'to a 10-year low following the departures of its chief '
              'executive and two members of the board.',
  'token': 25624,
  'token_str': ' departures'},
 {'score': 0.036515578627586365,
  'sequence': 'Shares in train and plane-making giant Bombardier have fallen '
              'to a 10-ye
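
The two cells above build the masked sentence by hand. A sketch of a helper that performs the same substitution for any word; `mask_word` is a hypothetical name, and `'<mask>'` is assumed as the mask token because that is what the RoBERTa-style checkpoint in this notebook uses (`mlm.tokenizer.mask_token` reports the token for whichever model is loaded):

```python
import re

def mask_word(sentence, word, mask_token='<mask>'):
    """Replace the first whole-word occurrence of `word` with the mask token."""
    pattern = r'\b' + re.escape(word) + r'\b'
    # A function replacement avoids backslash interpretation in mask_token
    masked, n = re.subn(pattern, lambda m: mask_token, sentence, count=1)
    if n == 0:
        raise ValueError(f'{word!r} not found in sentence')
    return masked

sentence = (
    'Shares in train and plane-making giant Bombardier have fallen '
    'to a 10-year low following the departure of its chief executive '
    'and two members of the board.'
)

print(mask_word(sentence, 'train'))
print(mask_word(sentence, 'departure'))
```

The result can be passed straight to the pipeline, for example `pprint(mlm(mask_word(sentence, 'departure')))`.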

In [20]:
# Prompt for the MLM
prompt = "boys are very <mask>"

# Get the 10 most likely fills for the masked token
result = mlm(prompt, top_k=10)

# Print each completed sentence
for item in result:
    print(item['sequence'])

boys are very polite
boys are very smart
boys are very nice
boys are very naughty
boys are very good
boys are very cute
boys are very lucky
boys are very stupid
boys are very cool
boys are very respectful
