In [2]:
#!pip install transformers
import transformers
from transformers import pipeline
import torch
import numpy as np
import tqdm

In [10]:
#!pip list

In [1]:
!which python

/home/iris/paulshab/NLP_env/nlp_environ/bin/python


In [2]:
!which pip

/home/iris/paulshab/NLP_env/nlp_environ/bin/pip


### Using pre-trained transformers
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
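
To make the three stages concrete, here is a minimal, dependency-free sketch. Everything in it (`TOY_WEIGHTS`, `preprocess`, `backbone`, `postprocess`, `toy_pipeline`) is a hypothetical stand-in written for illustration; it is not how `transformers` implements its pipelines, it only mirrors the shape of the data flow.

```python
import math

# Hypothetical per-word weights; a real pipeline loads learned weights from a checkpoint.
TOY_WEIGHTS = {"amazing": 2.0, "great": 1.5, "terrible": -2.0, "boring": -1.0}

def preprocess(text):
    """Pre-processing: lowercase and split into word tokens (no subword segmentation here)."""
    return [tok.strip(".,!?") for tok in text.lower().split()]

def backbone(tokens):
    """Stand-in for the model: returns raw logits for [NEGATIVE, POSITIVE]."""
    score = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-score, score]

def postprocess(logits):
    """Post-processing: softmax over the logits, then report the most likely label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": labels[best], "score": probs[best]}]

def toy_pipeline(text):
    """Chain the three stages, returning output shaped like a classification pipeline's."""
    return postprocess(backbone(preprocess(text)))

print(toy_pipeline("BERT is amazing!"))
```

The real pipelines follow the same pattern, with a learned tokenizer and a neural network in place of the toy functions.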

Let's see it in action:

In [12]:
#!pip install ipywidgets
#!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [8]:
import base64

classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))

data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

outputs = {}

for key in data:
    outputs[key] = classifier(data[key])[0]['score'] > 0.8

assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

[{'label': 'POSITIVE', 'score': 0.9998860955238342}]
Well done!




You can also access a vanilla masked language model that was trained to predict masked words. Here's how:

In [4]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
    print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [5]:
# Your turn: use BERT to recall what year the Soviet Union was founded in
for hypo in mlm_model(f"the Soviet Union was founded in the year {MASK}."):
    print(f"P={hypo['score']:.5f}", hypo['sequence'])

P=0.07628 the soviet union was founded in the year 1917.
P=0.06085 the soviet union was founded in the year tsar.
P=0.03568 the soviet union was founded in the year 1918.
P=0.02235 the soviet union was founded in the year ussr.
P=0.01872 the soviet union was founded in the year stalin.


In [7]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad" )
question_answerer(
    question="What year was the Soviet Union founded?",
    context= "Following the 1917 Revolution, four socialist republics were established on the territory of the former empire: the Russian and Transcaucasian Soviet Federated Socialist Republics and the Ukrainian and Belorussian Soviet Socialist Republics. On December 30, 1918"
  )

{'score': 0.7281541228294373, 'start': 258, 'end': 262, 'answer': '1918'}


Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [8]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner_model = pipeline("ner", aggregation_strategy="simple", model="elastic/distilbert-base-uncased-finetuned-conll03-english")

named_entities = ner_model(text)

named_entities

[{'entity_group': 'MISC',
  'score': 0.8614428,
  'word': 'rose',
  'start': 112,
  'end': 116},
 {'entity_group': 'LOC',
  'score': 0.5662445,
  'word': '##tta',
  'start': 116,
  'end': 119},
 {'entity_group': 'ORG',
  'score': 0.9864218,
  'word': 'guardian',
  'start': 179,
  'end': 187},
 {'entity_group': 'PER',
  'score': 0.9983278,
  'word': 'ian sample',
  'start': 207,
  'end': 217},
 {'entity_group': 'ORG',
  'score': 0.48886383,
  'word': 'astronomy',
  'start': 223,
  'end': 232},
 {'entity_group': 'PER',
  'score': 0.9990902,
  'word': 'stuart clark',
  'start': 240,
  'end': 252},
 {'entity_group': 'LOC',
  'score': 0.9989255,
  'word': 'germany',
  'start': 414,
  'end': 421},
 {'entity_group': 'PER',
  'score': 0.9877829,
  'word': 'phil',
  'start': 471,
  'end': 475},
 {'entity_group': 'PER',
  'score': 0.98374355,
  'word': '##ae',
  'start': 475,
  'end': 477}]
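
The output above still contains split WordPiece fragments (`rose` + `##tta`, `phil` + `##ae`). One way to tidy this up is to glue a `##` fragment onto the previous group whenever their character spans touch. The sketch below is a hypothetical post-processing helper, not part of `transformers`; keeping the label of the first fragment is an assumption made for simplicity.

```python
# A few entity groups copied from the output above (scores omitted for brevity).
groups = [
    {"entity_group": "MISC", "word": "rose", "start": 112, "end": 116},
    {"entity_group": "LOC", "word": "##tta", "start": 116, "end": 119},
    {"entity_group": "ORG", "word": "guardian", "start": 179, "end": 187},
    {"entity_group": "PER", "word": "phil", "start": 471, "end": 475},
    {"entity_group": "PER", "word": "##ae", "start": 475, "end": 477},
]

def merge_subword_groups(groups):
    """Attach '##' continuation pieces to the preceding group when the spans are adjacent."""
    merged = []
    for g in groups:
        if merged and g["word"].startswith("##") and g["start"] == merged[-1]["end"]:
            prev = merged[-1]
            prev["word"] += g["word"][2:]  # drop the '##' marker
            prev["end"] = g["end"]         # extend the span; keep the first piece's label (assumption)
        else:
            merged.append(dict(g))         # copy so the input list is left untouched
    return merged

for g in merge_subword_groups(groups):
    print(g["entity_group"], g["word"], g["start"], g["end"])
```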

In [77]:
print('OUTPUT:', named_entities)
# grouped results store the label under 'entity_group' (not 'entity'),
# and this uncased model returns lowercased, multi-token words
word_to_entity = {item['word']: item['entity_group'] for item in named_entities}
assert 'org' in word_to_entity.get('guardian', '').lower() and 'per' in word_to_entity.get('stuart clark', '').lower()
print("All tests passed")

OUTPUT: [{'entity_group': 'MISC', 'score': 0.8614428, 'word': 'rose', 'start': 112, 'end': 116}, {'entity_group': 'LOC', 'score': 0.5662445, 'word': '##tta', 'start': 116, 'end': 119}, {'entity_group': 'ORG', 'score': 0.9864218, 'word': 'guardian', 'start': 179, 'end': 187}, {'entity_group': 'PER', 'score': 0.9983278, 'word': 'ian sample', 'start': 207, 'end': 217}, {'entity_group': 'ORG', 'score': 0.48886383, 'word': 'astronomy', 'start': 223, 'end': 232}, {'entity_group': 'PER', 'score': 0.9990902, 'word': 'stuart clark', 'start': 240, 'end': 252}, {'entity_group': 'LOC', 'score': 0.9989255, 'word': 'germany', 'start': 414, 'end': 421}, {'entity_group': 'PER', 'score': 0.9877829, 'word': 'phil', 'start': 471, 'end': 475}, {'entity_group': 'PER', 'score': 0.98374355, 'word': '##ae', 'start': 475, 'end': 477}]
All tests passed

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [9]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
k = tokens_info['input_ids']
model(k)["last_hidden_state"].shape

torch.Size([2, 15, 768])

In [10]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]
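
The `input_ids` and `attention_mask` tensors above follow a simple rule: shorter sequences are right-padded with the pad id (0 here) up to the longest sequence in the batch, and the mask marks real tokens with 1 and padding with 0. A minimal sketch in plain Python, where `pad_batch` is a hypothetical helper written for illustration (the tokenizer does this internally when `padding=True`):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Token ids copied from the output above, with the padding removed.
luke = [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102]
life = [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060, 3488, 1012, 102]

ids, mask = pad_batch([luke, life])
for row_ids, row_mask in zip(ids, mask):
    print(row_ids)
    print(row_mask)
```

The model uses the mask to ignore padded positions in attention, which is why both tensors are passed together via `model(**tokens_info)`.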


In [15]:
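# You can now apply the model to get embeddings
with torch.no_grad():
    model_output = model(**tokens_info)

# the model returns a dict-like output object; tuple-unpacking it yields its keys, not tensors
token_embeddings = model_output.last_hidden_state  # [batch, seq_len, hidden]
sentence_embedding = model_output.pooler_output    # [batch, hidden]

print(sentence_embedding.shape)

torch.Size([2, 768])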


In [17]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # note: loads model_name ('bert-base-uncased'), not checkpoint
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

### The search for similar questions.

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to **really** solve this task using context-aware embeddings.

In [29]:
# download the data (quote the URL so the shell does not treat '?' as a glob pattern):
#!wget "https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1" -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


__Main task (3 pts):__
* Implement a function that takes a text string and finds top-k most similar questions from `quora.txt`
* Demonstrate your function using at least 5 examples

There are no prompts this time: you will have to write everything from scratch.

In [37]:
#!pip install -U sentence-transformers

## Load Data

In [9]:
lines = []
with open('quora.txt') as f:
    lines = f.readlines()
lines[5]

'Why was George RR Martin critical of JK Rowling after losing the Hugo award?\n'

## Load Model

In [10]:
from sentence_transformers import SentenceTransformer, util

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
#checkpoint = 'sentence-transformers/distilbert-base-nli-stsb-mean-tokens'
checkpoint = 'sentence-transformers/LaBSE'
model = SentenceTransformer(checkpoint).to(device)

## Get Token IDs

In [150]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokens = tokenizer.tokenize("hello how are you")

# get token ids
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("token ids:", input_ids)

# decode token ids back into a string
decoded_string = tokenizer.decode(input_ids)
print("decoded tokens:", decoded_string)

# get padding id:
tokenizer.pad_token_id

token ids: [83285, 16369, 15179, 15100]
decoded tokens: hello how are you


0

## Get Encodings

In [11]:
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.4557]])
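
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal pure-Python sketch (the function name `cosine_similarity` is ours; `util.cos_sim` computes the same quantity on tensors):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); ranges from -1 (opposite) to 1 (same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel, different lengths
```

Because the vectors are normalized by their lengths, two sentences can score high even if their embeddings differ in magnitude.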


In [34]:
import os

def most_similiar_q(input_sentence, model, k=5):
    # embed the query string
    input_embedding = model.encode(input_sentence)

    # embed the corpus once and cache the result on disk
    if os.path.isfile('tensor.pt'):
        embeddings = torch.load('tensor.pt')
        print(embeddings.shape)
    else:
        embeddings = model.encode(lines, show_progress_bar=True)
        torch.save(embeddings, 'tensor.pt')

    # cosine similarity between the query and every corpus sentence
    cos_sim = torch.squeeze(util.cos_sim(input_embedding, embeddings))

    # pick the k highest-scoring sentences without sorting the whole corpus
    top = torch.topk(cos_sim, k=k)
    for score, i in zip(top.values, top.indices):
        print(lines[int(i)])
        print(score)

In [38]:
input_sentence = "How old is the universe?"
most_similiar_q(input_sentence, model)

(537273, 768)
torch.Size([537273])
How old is the world?

tensor(0.9344)
How big is the universe?

tensor(0.8689)
How old is the earth?

tensor(0.8620)
How vast is the universe?

tensor(0.8567)
What is the size of the universe?

tensor(0.8387)
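
The retrieval step itself is independent of the embedding model: score every corpus vector against the query, then keep the k best. A small sketch on toy 2-D vectors, assuming nothing beyond the standard library; `top_k_similar` and the toy data are hypothetical and stand in for the real 768-dimensional embeddings.

```python
import heapq
import math

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return the k (cosine score, corpus index) pairs with the highest scores, best first."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, unit(v))), i)  # dot product of unit vectors = cosine
        for i, v in enumerate(corpus_vecs)
    ]
    return heapq.nlargest(k, scored)

# Toy "embeddings" (illustrative only).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
for score, idx in top_k_similar([1.0, 0.1], corpus, k=2):
    print(idx, round(score, 3))
```

`heapq.nlargest` avoids sorting the full list, which matters when the corpus has hundreds of thousands of entries.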


In [24]:
input_sentence = "Who is Donald Trump?"

#Encode all sentences + input_sentence
embeddings = model.encode(lines, show_progress_bar=True)
input_embedding = model.encode(input_sentence)

print(embeddings.shape)
# print(input_embedding.shape)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(input_embedding, embeddings)
cos_sim = torch.squeeze(cos_sim)
print(cos_sim.shape)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(cos_sim.shape[0]):
    all_sentence_combinations.append([cos_sim[i], i])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

#print("Top-5 most similar pairs:")
for score, i in all_sentence_combinations[0:5]:
    print(lines[i])
    print(cos_sim[i])

(537273, 768)
torch.Size([537273])
Who is Donald Trump?

tensor(1.0000)
Who is the father of Donald Trump?

tensor(0.8873)
How old is Donald Trump?

tensor(0.8792)
Who is Donald Trump's hairdresser?

tensor(0.8691)
Who or what is Donald Trump, really?

tensor(0.8683)


__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but it's really cool, we promise :) */`

In [1]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
#device = "cpu"
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

In [3]:
text = "Irakli Ubilava"
tokens = tokenizer.encode(text)
num_steps = 1024
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    # GPT-2 has a fixed context window (model.config.n_positions, 1024 for 'gpt2'),
    # so feed only the most recent tokens; longer inputs fail with a size-mismatch error
    context = tokens[-model.config.n_positions:]
    with torch.no_grad():
        logits = model(torch.as_tensor([context], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).cpu().numpy()

    # YOUR TASK: change the code so that it performs nucleus sampling
    # (as written, this line samples from the full distribution)
    next_token_index = np.random.choice(p_next.shape[0], p=p_next)

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()



 Irakli Ubilava the 10,000-year-old family of the city's ancient Bronze Age people, faces
 a six-month trial for child abuse.

Aurora Court jury unanimously rejected
 the defence's weekend claim that Kurilas lent the game for sexual gratification
, for reaching them with candy laced with magic mushrooms. In a first in
 a series of rulings, the jury sentenced Kurilas to five years in prison
 and three years of supervised release while the other six men paid a fine
 of about US$40,000. The appeal had been on hold until close to a two-
year judge's decision Sunday.

Today, it appears Kurilas was forced to
 take aim at court officials that he had violated them when he sent the
 charges against the prosecutors, who would not speak until a police investigation
 carried out in April deemed insufficient to provide needed answers.


When Naveen Agrawal, the defence lawyer in charge of Hello Bosi's trouble
, questioned why all cases were brought against the two trial court employees
, Kurilas ad
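
For the nucleus sampling task, the idea is to sample only from the smallest set of most likely tokens whose total probability reaches a threshold `top_p`, after renormalizing within that set. Below is a minimal pure-Python sketch; the helper names `nucleus_filter` and `nucleus_sample` and the default `top_p=0.9` are assumptions chosen for illustration, not part of the assignment code.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Map token index -> renormalized probability for the smallest top set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:  # stop as soon as the nucleus covers enough probability
            break
    return {i: probs[i] / mass for i in kept}

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(toy_probs, top_p=0.9))  # the 0.05 tail token is excluded
print(nucleus_sample(toy_probs, top_p=0.9))
```

In the generation loop, this could replace the plain sampling line with something like `next_token_index = nucleus_sample(p_next.tolist(), top_p=0.9)`; a vectorized version with `np.argsort` and `np.cumsum` would be faster for a 50k-token vocabulary.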

Transformers knowledge hub: https://huggingface.co/transformers/