In [None]:
!pip install transformers
import transformers

### Using pre-trained transformers
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
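
The three stages above can be sketched with toy stand-ins. Everything in this block is hypothetical: `WORD_SCORES` is a made-up lexicon playing the role of the backbone model, and `toy_pipeline` is not part of the library. It only illustrates how text becomes tokens, tokens become logits, and logits become a label with a score.

```python
import math

# Hypothetical toy lexicon standing in for a real fine-tuned backbone model.
WORD_SCORES = {"amazing": 2.0, "great": 1.5, "awful": -2.0, "bad": -1.5}

def preprocess(text):
    """Pre-processing: lowercase the text and split it into crude word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def toy_model(tokens):
    """Backbone stand-in: return [negative_logit, positive_logit]."""
    total = sum(WORD_SCORES.get(tok, 0.0) for tok in tokens)
    return [-total, total]

def postprocess(logits):
    """Post-processing: softmax over the logits, then report the best label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": labels[best], "score": probs[best]}]

def toy_pipeline(text):
    """Chain the three stages, mimicking the output format of a classifier pipeline."""
    return postprocess(toy_model(preprocess(text)))

print(toy_pipeline("BERT is amazing!"))
print(toy_pipeline("BERT is awful!"))
```

A real pipeline replaces each stage with a learned component (subword tokenizer, transformer, task head), but the data flow is the same.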

Let's see it in action:

In [17]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))
print(classifier("BERT is awful!"))
print(classifier("I will find you and I'll kill you!"))

print(classifier("As High as Honor."))
print(classifier("Ours is the fury."))
print(classifier("Winter is coming."))
print(classifier("Growing strong."))

[{'label': 'POSITIVE', 'score': 0.9998860359191895}]
[{'label': 'NEGATIVE', 'score': 0.9996429681777954}]
[{'label': 'POSITIVE', 'score': 0.9947445392608643}]
[{'label': 'POSITIVE', 'score': 0.9997355341911316}]
[{'label': 'NEGATIVE', 'score': 0.695769727230072}]
[{'label': 'POSITIVE', 'score': 0.9950939416885376}]
[{'label': 'POSITIVE', 'score': 0.9998761415481567}]


In [18]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict
# <...>
# outputs = <YOUR CODE: dict (house name) : True if positive, False if negative>

outputs = {house: classifier(motto)[0]['label'] == 'POSITIVE' for house, motto in data.items()}
print(outputs)
print(outputs.values())

assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

{'arryn': True, 'baratheon': False, 'stark': True, 'tyrell': True}
dict_values([True, False, True, True])
Well done!


You can also access a vanilla Masked Language Model that was trained to predict masked words. Here's how:

In [19]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [20]:
# Your turn: use BERT to recall the year in which the Soviet Union was founded
# (note: a notebook cell only displays the value of its last expression)
mlm_model(f"Soviet Union was founded in {MASK} .")
mlm_model(f"Mask aligner is {MASK} in many labs .")
mlm_model(f"{MASK} is the greatest human!")

[{'score': 0.7481589317321777,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'he is the greatest human!'},
 {'score': 0.11051268130540848,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'she is the greatest human!'},
 {'score': 0.02057752013206482,
  'token': 2023,
  'token_str': 'this',
  'sequence': 'this is the greatest human!'},
 {'score': 0.014749483205378056,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'it is the greatest human!'},
 {'score': 0.007025654893368483,
  'token': 2008,
  'token_str': 'that',
  'sequence': 'that is the greatest human!'}]



Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [21]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
# ner_model = <YOUR CODE>
ner_model = pipeline('ner', model='dslim/bert-base-NER')

named_entities = ner_model(text)

In [22]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'entity': 'B-LOC', 'score': 0.7991047, 'index': 27, 'word': 'Rose', 'start': 112, 'end': 116}, {'entity': 'I-LOC', 'score': 0.95119274, 'index': 28, 'word': '##tta', 'start': 116, 'end': 119}, {'entity': 'B-ORG', 'score': 0.998223, 'index': 40, 'word': 'Guardian', 'start': 179, 'end': 187}, {'entity': 'B-PER', 'score': 0.9997613, 'index': 46, 'word': 'Ian', 'start': 207, 'end': 210}, {'entity': 'I-PER', 'score': 0.99978715, 'index': 47, 'word': 'Sam', 'start': 211, 'end': 214}, {'entity': 'I-PER', 'score': 0.99964595, 'index': 48, 'word': '##ple', 'start': 214, 'end': 217}, {'entity': 'B-PER', 'score': 0.9997831, 'index': 53, 'word': 'Stuart', 'start': 240, 'end': 246}, {'entity': 'I-PER', 'score': 0.9997482, 'index': 54, 'word': 'Clark', 'start': 247, 'end': 252}, {'entity': 'B-LOC', 'score': 0.9997228, 'index': 85, 'word': 'Germany', 'start': 414, 'end': 421}, {'entity': 'B-PER', 'score': 0.9963127, 'index': 99, 'word': 'Phil', 'start': 471, 'end': 475}, {'entity': 'I-PER',
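
The output above is per word piece: `Rose` and `##tta` are tagged separately, and names such as `Ian Sample` span several tokens. Below is a minimal sketch of merging such predictions into whole entities by hand. The helper `group_entities` is hypothetical (not a library function), and the sample records are copied from the output above with scores omitted. The library's NER pipeline may also offer built-in grouping (an `aggregation_strategy` argument in recent versions; check the documentation for the version you have installed).

```python
def group_entities(tokens):
    """Merge word-piece predictions into whole entities using '##' pieces and B-/I- tags."""
    groups = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        piece = tok["word"]
        is_continuation = piece.startswith("##")
        same_entity = prefix == "I" and groups and groups[-1]["label"] == label
        if groups and (is_continuation or same_entity):
            last = groups[-1]
            # glue word pieces directly; separate whole words with a space
            last["text"] += piece[2:] if is_continuation else " " + piece
            last["end"] = tok["end"]
        else:
            groups.append({"label": label, "text": piece,
                           "start": tok["start"], "end": tok["end"]})
    return groups

# First predictions from the output above (scores omitted for brevity).
sample = [
    {"entity": "B-LOC", "word": "Rose", "start": 112, "end": 116},
    {"entity": "I-LOC", "word": "##tta", "start": 116, "end": 119},
    {"entity": "B-ORG", "word": "Guardian", "start": 179, "end": 187},
    {"entity": "B-PER", "word": "Ian", "start": 207, "end": 210},
    {"entity": "I-PER", "word": "Sam", "start": 211, "end": 214},
    {"entity": "I-PER", "word": "##ple", "start": 214, "end": 217},
    {"entity": "B-PER", "word": "Stuart", "start": 240, "end": 246},
    {"entity": "I-PER", "word": "Clark", "start": 247, "end": 252},
    {"entity": "B-LOC", "word": "Germany", "start": 414, "end": 421},
]

entities = group_entities(sample)
for ent in entities:
    print(ent["label"], ent["text"], (ent["start"], ent["end"]))
```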

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


In [None]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))
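
To make `padding=True` less mysterious, here is a rough sketch of what padding a batch involves: shorter sequences are filled up to the longest one, and an attention mask marks which positions are real tokens. The helper `pad_batch` is hypothetical, the token ids are illustrative only, and `pad_id=0` is an assumption (the real tokenizer knows its own padding id).

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids only; a real tokenizer produces these from text.
batch = pad_batch([[101, 7, 8, 102], [101, 5, 6, 9, 10, 102]])
for key, value in batch.items():
    print(key, value)
```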

In [None]:
[tokenizer.decode(int(i)) for i in tokens_info['input_ids'][1].data.numpy()]

In [None]:
# You can now apply the model to get embeddings
with torch.no_grad():
    # the model returns an output object: `last_hidden_state` holds per-token embeddings,
    # `pooler_output` holds one vector per sequence
    out = model(**tokens_info)

# display the full output object
out
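
The model produces one vector per token, so turning a sentence into a single embedding needs a pooling step. One common choice is to average the token vectors while ignoring padded positions. The sketch below uses NumPy and random arrays as stand-ins for real model outputs; `mean_pool` is a hypothetical helper, and this is not necessarily how any particular model computes its sentence vector.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape [batch, seq_len, hidden]
    attention_mask:   array of shape [batch, seq_len] with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Random stand-ins for real model outputs: 2 sentences, 4 tokens, hidden size 8.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 8))
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 1, 1]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)
```

With real outputs you would pass the per-token hidden states and `tokens_info['attention_mask']` converted to NumPy arrays.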

In [None]:
list(model.parameters())

### The search for similar questions.

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to **really** solve this task using context-aware embeddings.

In [None]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

__Main task (3 pts):__
* Implement a function that takes a text string and finds top-k most similar questions from `quora.txt`
* Demonstrate your function using at least 5 examples

There are no prompts this time: you will have to write everything from scratch.

In [None]:
# <A whole lot of your code. Feel free to format it as you see fit>

In [8]:
import numpy as np

data = list(open("./quora.txt", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

In [9]:
!pip install -U sentence-transformers



In [10]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

[[ 6.76569119e-02  6.34959117e-02  4.87131104e-02  7.93049857e-02
   3.74480635e-02  2.65278947e-03  3.93749550e-02 -7.09845126e-03
   5.93614280e-02  3.15370224e-02  6.00980595e-02 -5.29051945e-02
   4.06067967e-02 -2.59308480e-02  2.98428312e-02  1.12691941e-03
   7.35149086e-02 -5.03819585e-02 -1.22386619e-01  2.37028524e-02
   2.97264922e-02  4.24768627e-02  2.56337821e-02  1.99517142e-03
  -5.69190606e-02 -2.71598138e-02 -3.29035632e-02  6.60248771e-02
   1.19007140e-01 -4.58791219e-02 -7.26214647e-02 -3.25839818e-02
   5.23413792e-02  4.50552814e-02  8.25301930e-03  3.67023796e-02
  -1.39415115e-02  6.53919131e-02 -2.64272355e-02  2.06399694e-04
  -1.36643350e-02 -3.62810381e-02 -1.95043776e-02 -2.89738197e-02
   3.94270197e-02 -8.84090737e-02  2.62426492e-03  1.36713926e-02
   4.83063087e-02 -3.11565641e-02 -1.17329188e-01 -5.11690117e-02
  -8.85288119e-02 -2.18962412e-02  1.42986597e-02  4.44168150e-02
  -1.34815276e-02  7.43392482e-02  2.66382881e-02 -1.98762510e-02
   1.79191

In [11]:
# data_vectors = np.array([get_phrase_embedding(l) for l in data])
data_vectors = model.encode(data)


In [12]:
data_vectors.shape
data_vectors[1,:]

array([ 5.29168807e-02,  1.56224016e-02,  4.28081155e-02,  8.24510977e-02,
       -4.42565344e-02,  4.34900187e-02, -1.05001703e-02, -8.61628130e-02,
        8.16449465e-04, -1.21423215e-01,  6.56440482e-02, -1.67419296e-02,
       -4.15178239e-02, -3.25913318e-02,  5.27058020e-02, -1.48751616e-01,
        6.32438511e-02,  7.99805671e-02,  6.66677505e-02, -6.18607849e-02,
       -2.31075753e-02, -3.88341099e-02,  9.50826041e-04,  1.23370858e-02,
       -3.47167589e-02,  1.19227953e-02, -2.29910687e-02, -4.05294225e-02,
       -3.67469117e-02, -2.93446891e-02,  5.39622530e-02, -5.96254356e-02,
       -1.14789978e-02,  5.00036068e-02, -5.01460731e-02, -1.15877446e-02,
        6.90763891e-02, -6.00012094e-02,  2.31682286e-02,  4.64344816e-03,
        1.25187226e-02, -6.27211556e-02, -2.18027532e-02, -3.10064275e-02,
        4.93833348e-02, -8.87878984e-02, -5.90049475e-02,  4.40719798e-02,
        3.24760564e-02, -3.09204105e-02, -1.44695286e-02, -1.61288437e-02,
        2.89997421e-02,  

In [13]:
def find_nearest(query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    hint: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    """
    # YOUR CODE
    
    query_embedding = model.encode(query)
    query_norm = np.linalg.norm(query_embedding)
    cosins = []
    for vec in data_vectors:
        vec_norm = np.linalg.norm(vec)  
        cosin = (query_embedding @ vec) / (vec_norm * query_norm)
        cosins.append(cosin)

    # NOTE: the slice [2:k+2] skips the two highest-scoring lines; use [:k] to return the true top-k
    args = np.argsort(cosins)[::-1][2:k+2]
    print(f"{args}")
    
    return [data[i] for i in args]
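
The Python loop over every row gets slow on a corpus this size. As a sketch of the vectorized route hinted at in the docstring, the hypothetical helper `top_k_cosine` below normalizes all rows at once, computes every cosine with one matrix product, and uses `np.argpartition` so that only the k best candidates are fully sorted. It is shown on toy 2-D vectors and assumes no row is all zeros.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=10):
    """Return indices of the k rows of `matrix` most similar to `query_vec`, best first."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    sims = matrix_unit @ query_unit               # cosine similarity for every row
    k = min(k, len(sims))
    candidates = np.argpartition(-sims, k - 1)[:k]  # k best rows, in arbitrary order
    return candidates[np.argsort(-sims[candidates])]  # sort only those k

# Toy data: four 2-D "embeddings" and a query pointing mostly along the first axis.
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])
print(top_k_cosine(toy_query, toy_vectors, k=2))
```

In the notebook setting, the same call would take `model.encode(query)` and `data_vectors`; pre-normalizing `data_vectors` once would avoid repeating that work on every query.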

In [14]:
find_nearest(query="How do i enter the matrix?", k=10)

[195289  87199 231348 415628 128022  39464 259923 516438 289747 145469]


['How can this matrix be written in [math]\\LaTeX[/math]?\n',
 'How do I write an algorithm that compute matrix?\n',
 'What is The Matrix about?\n',
 'How do you multiply matrices?\n',
 'How do you multiply three matrices?\n',
 'How would you rotate a matrix?\n',
 'What is the shortcut to finding an inverse of a square matrix?\n',
 'How would I solve this math problem using a matrix?\n',
 'What is a matrix?\n',
 'How do I do a matrix transpose in Go?\n']

In [15]:
find_nearest(query="How does Trump?", k=10)

[ 36971 180884  77623 472078 311696 499904 103444  10908 178919 221239]


['How can Trump win?\n',
 'How did Trump win the presidency?\n',
 'How is Donald Trump winning?\n',
 'How is Donald Trump as a businessman?\n',
 'How did Trump get Elected?\n',
 'How good is Donald Trump for America as President?\n',
 'What is Trump?\n',
 'How did trump become president?\n',
 'How did Donald Trump win the presidency?\n',
 'In what ways is Donald Trump good for U.S. politics?\n']


__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but it's really cool, we promise :) */`

In [23]:
import random
import torch

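def get_nucleous_index(probs, p : float):
    """Sample a token index from `probs`, restricted to the most probable tokens (nucleus / top-p sampling)."""
    # sort probabilities from most to least likely and accumulate them
    sorted_probs, indices = torch.sort(torch.tensor(probs, requires_grad=False, dtype=torch.float64), dim=-1, descending=True)
    cum_sum_probs = torch.cumsum(sorted_probs, dim=-1)

    # if the top token alone exceeds p, return it deterministically
    if cum_sum_probs[0] > p:
        return indices[0]
    
    # drop every token whose cumulative probability exceeds p, then renormalize
    sorted_probs[cum_sum_probs > p] = 0
    sorted_probs /= torch.sum(sorted_probs, dim=-1)

    return random.choices(indices, weights=sorted_probs)[0]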

In [24]:
probs = [0.1, 0.12, 0.05, 0.03, 0.32, 0.21, 0.01]
# probs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p = 0.8
print(get_nucleous_index(probs, p))


tensor(5)
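
For comparison, here is a NumPy sketch of the more common formulation of nucleus sampling, which keeps the smallest set of tokens whose cumulative probability reaches `p`, including the token that crosses the threshold (the function above zeroes that token out instead). The names `top_p_filter` and `sample_top_p` are hypothetical, and the sketch assumes `0 < p <= 1`.

```python
import numpy as np

def top_p_filter(probs, p):
    """Zero out tokens outside the nucleus and renormalize the rest."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()                  # tolerate unnormalized input
    order = np.argsort(-probs)                   # most likely tokens first
    sorted_probs = probs[order]
    mass_before = np.cumsum(sorted_probs) - sorted_probs
    keep = mass_before < p                       # keep until mass p is already covered
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

def sample_top_p(probs, p, rng=None):
    """Draw one token index from the top-p filtered distribution."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))

toy_probs = [0.05, 0.5, 0.15, 0.3]
print(top_p_filter(toy_probs, p=0.7))
print(sample_top_p(toy_probs, p=0.7))
```

With `p=0.7` only the two most likely tokens (indices 1 and 3) survive, so sampling can only return one of them.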


In [29]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

text = "The Fermi paradox "
# text = "Anastasia is a "
tokens = tokenizer.encode(text)
num_steps = 128
line_length, max_length = 0, 70
p = 0.7

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()
    next_token_index = get_nucleous_index(p_next, p)

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()



cuda
 The Fermi paradox  are highly plausible, and it's hard to believe that the two of them are
 not involved in a philosophical issue in which one of them was justly
 criticized.  I'm not saying that one of them is wrong, but that one of
 them is not important.  I don't think that this conclusion is wholly dis
putable.  I think that there is a problem with this conclusion, and that
 it is very hard to reconcile with our actual understanding of  Cosmos
  .  It is more plausible that he has a legitimate argument for an  int
ended but not justified way of considering the two phenomena. 

Transformers knowledge hub: https://huggingface.co/transformers/