
In [1]:
!pip install laserembeddings
!python -m laserembeddings download-models

Collecting laserembeddings
  Downloading https://files.pythonhosted.org/packages/a2/4b/a9e3ee9f4825bd2bb6b48f26370e2c341860ec0cb2a9a27deea9be6c2299/laserembeddings-1.1.0-py3-none-any.whl
Collecting transliterate==1.10.2
  Downloading https://files.pythonhosted.org/packages/a1/6e/9a9d597dbdd6d0172427c8cc07c35736471e631060df9e59eeb87687f817/transliterate-1.10.2-py2.py3-none-any.whl (45kB)
Collecting sacremoses==0.0.35
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
Collecting subword-nmt<0.4.0,>=0.3.6
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8cae49b837f93f5842a41fe513a694912/subword_nmt-0.3.7-py2.py3-none-any.whl
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ..

In [2]:
from laserembeddings import Laser

laser = Laser()
#can this be any simpler? :)
embeddings = laser.embed_sentences(['I love pasta.',"J'adore les pâtes.",'Ich liebe Pasta.'],lang=['en', 'fr', 'de'])

print(embeddings)
print(embeddings.shape)


[[-5.2039017e-04 -2.8321840e-05 -1.6871469e-04 ...  3.4788840e-03
  -1.9968930e-03  8.1148231e-03]
 [ 3.2193204e-03 -9.9815654e-05  5.9067555e-05 ...  7.6490263e-03
   1.1962679e-03  2.4502634e-03]
 [ 5.3412444e-04 -3.6210116e-05 -1.4794576e-04 ...  6.1386470e-03
  -1.6569846e-03  6.3126544e-03]]
(3, 1024)


# Test the embeddings

* We are working on a paraphrase corpus, from which I borrowed some early data
* The two files below, `yle.txt` and `hs.txt`, contain 200+ news titles from YLE and HS, judged by a human to be paraphrases or near-paraphrases of each other
* The pairs were selected so that lexical overlap is minimized
* The two files are line-aligned
* We can make a simple test of LASER: embed both files and see whether the embeddings let us pair the titles up
* In other words: for every HS title, find the nearest YLE title
* Measure how often the nearest title is the correct one, i.e. the one on the same line
* Random baseline is roughly 1/200, i.e. about 0.5%
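
The test described above can be sketched on toy data before running it on the real titles. This is a minimal illustration, not the code used later in the notebook: the 2-D vectors are made up, and the helper names `nearest_index` and `alignment_accuracy` are hypothetical.

```python
import math

def nearest_index(query, candidates):
    """Return the index of the candidate closest to query (Euclidean distance)."""
    distances = [math.dist(query, c) for c in candidates]
    return distances.index(min(distances))

def alignment_accuracy(vectors1, vectors2):
    """Fraction of rows in vectors1 whose nearest row in vectors2 has the same index."""
    assert len(vectors1) == len(vectors2), "data must be line-aligned"
    hits = sum(nearest_index(v, vectors2) == i for i, v in enumerate(vectors1))
    return hits / len(vectors1)

# Made-up 2-D "embeddings" for four aligned title pairs (illustration only)
toy_hs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_yle = [(0.1, 0.0), (0.9, 0.1), (0.9, 0.9), (0.1, 1.1)]

accuracy = alignment_accuracy(toy_hs, toy_yle)
random_baseline = 1 / len(toy_hs)  # chance of picking the right line at random
print(accuracy, random_baseline)
```

In this toy set the last two pairs are deliberately swapped in space, so only half of the titles find their aligned partner.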

In [3]:
!wget -nc http://dl.turkunlp.org/.ginter/hs.txt
!wget -nc http://dl.turkunlp.org/.ginter/yle.txt

--2020-10-05 20:23:12--  http://dl.turkunlp.org/.ginter/hs.txt
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24291 (24K) [text/plain]
Saving to: ‘hs.txt’


2020-10-05 20:23:13 (122 KB/s) - ‘hs.txt’ saved [24291/24291]

--2020-10-05 20:23:13--  http://dl.turkunlp.org/.ginter/yle.txt
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21392 (21K) [text/plain]
Saving to: ‘yle.txt’


2020-10-05 20:23:13 (115 KB/s) - ‘yle.txt’ saved [21392/21392]



In [4]:
def read_file(fname):
  lines=[]
  with open(fname) as f:
    for line in f:
      line=line.strip()
      if not line:
        continue
      lines.append(line)
  return lines

hs=read_file("hs.txt")
yle=read_file("yle.txt")

hs_vectors=laser.embed_sentences(hs,"fi")
yle_vectors=laser.embed_sentences(yle,"fi")

print("hs",hs_vectors.shape)
print("yle",yle_vectors.shape)

hs (217, 1024)
yle (217, 1024)


In [5]:
import sklearn.metrics
# Given two sets of vectors, this function calculates all-pair distances. Note: the default metric is Euclidean; pass metric="cosine" to get cosine distances
all_dist=sklearn.metrics.pairwise_distances(hs_vectors,yle_vectors)
print("Distance matrix shape:", all_dist.shape)
#we get a sentence-by-sentence matrix, with distances

Distance matrix shape: (217, 217)
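
A note on the metric: `sklearn.metrics.pairwise_distances` uses Euclidean distance unless a `metric` argument such as `metric="cosine"` is given. The two metrics can disagree about which neighbor is nearest when the vectors differ in length. The sketch below shows this on made-up 2-D vectors, using plain Python rather than scikit-learn; whether the choice of metric changes the results on the news titles is not tested here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors (illustration only)
query = (1.0, 0.0)
same_direction_far = (5.0, 0.0)    # same angle as the query, but much longer
other_direction_near = (1.0, 1.0)  # close in space, but at a 45 degree angle
candidates = [same_direction_far, other_direction_near]

nearest_euclidean = min(range(len(candidates)),
                        key=lambda i: euclidean_distance(query, candidates[i]))
nearest_cosine = min(range(len(candidates)),
                     key=lambda i: cosine_distance(query, candidates[i]))
print(nearest_euclidean, nearest_cosine)
```

Euclidean distance picks the spatially close vector, while cosine distance picks the one pointing in the same direction.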


In [6]:
#Calculate for every row the document with minimal distance in that row (axis=-1 means minimum along the last axis)
nearest=all_dist.argmin(axis=-1) #These are the nearest neighbors for each HS title (indices into YLE), perfect solution would be [0,1,2,3...,216]
print(nearest)

[121   1   1   3 120 116   6   9   8   9 177  11 102  13  44  15 126  17
 196 206 189  96  22  23  70   9  77 189 216  67  30 136  78  61  34  74
  53  37 161  39  40 147 203  13  44  45   9  47  48  49  14  51 120  53
  54  55 121  46  58 123  60  61 189  48  64  65  39  13 126  69  70  71
  61  73  74  61  76  26 189  74  14  81  82  83  84  85  86  87 182  64
  15 135 176 189  70  59 206  97 208  99  99  84 102 103 104 105  48  67
  91 175 110  14 112 113  39 115 116 121 118  73 120 121 121 123 124 125
 126 159 128 129 130  92 132 133 134  61 136 137 138   9  39 141 142 123
  39   9  35  15 150 149  67 151 152 153 180 155 156 137 158  79 160 176
  79 163 164 112  39 167 168 169 132 171 172 146  61 175 201 177 178 179
 180  63 189 183 184  67  39 187 197 189  94 146 192 193 150 195 196 197
  78  39 198  74 202 203 204 133 206   9 208 146  86  59  30 213 214 215
 216]


In [7]:
# Let's package this all nicely into a function
import random


def eval_embeddings(texts1,texts2,vectors1,vectors2):
  assert len(texts1)==len(texts2), "We assume aligned data"
  all_dist=sklearn.metrics.pairwise_distances(vectors1,vectors2)
  nearest=all_dist.argmin(axis=-1) #These are the nearest neighbors for each text in texts1 (indices into texts2), perfect solution would be [0,1,2,...,len(texts1)-1]
  correct=[] #Let's put here the correct pairs
  incorrect=[] #Let's put here the incorrect pairs
  for i,txt1 in enumerate(texts1):
    j=nearest[i] #the index at which the nearest sentence is
    txt2=texts2[j] #..and its text
    if i==j:
      #This is correct
      correct.append((txt1,txt2))
    else:
      incorrect.append((txt1,txt2))

  print(f"Correct {len(correct)}/{len(texts1)}={len(correct)/len(texts1)*100}%") #these f-strings are really neat, you can embed expressions and have them printed
  random.shuffle(correct)
  random.shuffle(incorrect)
  print("\n\n---------- Sample of correct ones:")
  for t1,t2 in correct[:15]:
    print(t1)
    print(t2)
    print()
  print("\n\n---------- Sample of incorrect ones:\n")
  for t1,t2 in incorrect[:15]:
    print(t1)
    print(t2)
    print()

In [8]:
eval_embeddings(hs,yle,hs_vectors,yle_vectors)

Correct 110/217=50.69124423963134%


---------- Sample of correct ones:
Vetoomustuomioistuin: Trump ei voi estää eri mieltä olevia seuraamasta Twitter-tiliään
Yhdysvaltalainen tuomioistuin: Trumpin Twitter-blokkaukset ovat perustuslain vastaista toimintaa

Ensimmäinen avaruuteen lähetetty suomalais­satelliitti tuhoutui tähden­lentona
Aalto-2 paloi poroksi ilmakehässä

Kolme ihmistä kuoli hallituksen vastaisissa mielenosoituksissa Kolumbiassa, sadattuhannet protestoijat vaativat turvaa
Kolumbiassa mielenosoittajat vastustavat hallitusta suurprotesteissa – Korruptio, huumekauppa ja toimeentulo huolestuttavat kansaa

Viikinmäen typenpoisto toimii lähes normaalisti – Mysteerimyrkyn arvoitus ei ole selvinnyt
Viikinmäen jätevedenpuhdistamon häiriö jatkuu – Typpeä valuu vesistöön edelleen, mutta pahin on ohi

Li Andersson toivoo opetusministerin salkkua, sisäministerin paikka menossa vihreille – Tämä tiedetään salkkujaosta
Vuorossa salkkujako – puolueiden puheenjohtajat koolle ratkomaan minis

# Try with BERT?

*   We could try with BERT
*   Test the [CLS] token as the sentence embedding
*   Test the average of token embeddings as the sentence embedding
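
The two pooling strategies listed above can be illustrated on a single made-up sentence. This is a plain-Python sketch with hypothetical helper names (`cls_pool`, `mean_pool`) and invented numbers; it is not the torch code that runs on the real BERT output.

```python
def cls_pool(token_vectors):
    """Use the vector of the first token ([CLS]) as the sentence embedding."""
    return list(token_vectors[0])

def mean_pool(token_vectors, mask):
    """Average the vectors of real tokens (mask 1), ignoring padding (mask 0)."""
    n_real = sum(mask)
    dim = len(token_vectors[0])
    return [sum(vec[d] * m for vec, m in zip(token_vectors, mask)) / n_real
            for d in range(dim)]

# One toy "sentence": four positions, 2-D vectors (values are made up)
toy_tokens = [
    [1.0, 0.0],  # position 0, where [CLS] sits
    [3.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding position, must not affect the average
]
toy_mask = [1, 1, 1, 0]

cls_vec = cls_pool(toy_tokens)
avg_vec = mean_pool(toy_tokens, toy_mask)
print(cls_vec, avg_vec)
```

The mask matters for the average: without it, the padding position would pull the mean away from the real tokens.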



In [9]:
#Note: since LASER is torch, maybe we continue in torch for the fun of it? :) (and you also asked for some torch examples)
!pip install transformers
import transformers

bert_model = transformers.BertModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1") #models can be loaded by name from this list: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L35
bert_model = bert_model.cuda() #move the model to GPU
bert_model.eval() #tell the model it will be used for predictions, not training (disables dropout for example)


Collecting transformers
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Installing collected packages: sentencepiece, tokenizers, transformers
Successfully installed sentencepiece-0.1.91 tokenizers-0.8.1rc2 transformers-3.3.1






BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(50105, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [11]:
import torch
import torch.nn

#Load the Finnish BERT tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1") #tokenizers can also be loaded by name

def tokenize_texts(texts):
  tokenized_ids=[tokenizer.encode(txt,add_special_tokens=True) for txt in texts] #this runs the BERT tokenizer, returns list of lists of integers
  tokenized_ids_t=[torch.tensor(ids,dtype=torch.long) for ids in tokenized_ids] #turn lists of integers into torch tensors
  tokenized_single_batch=torch.nn.utils.rnn.pad_sequence(tokenized_ids_t,batch_first=True) #zero-padding
  return tokenized_single_batch

hs_data=tokenize_texts(hs).cuda() #tokenize and move to GPU
yle_data=tokenize_texts(yle).cuda()

print(hs_data.shape)

torch.Size([217, 32])
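
The shape above shows that every title was padded to the length of the longest one (32 tokens). As a rough sketch of what zero-padding does, and of how a padding mask can be read off the padded ids, here is a plain-Python version. The token ids are made up and the helper names `pad_batch` and `attention_mask` are hypothetical; the approach assumes the padding id is 0 and that no real token has id 0.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every id sequence with pad_id to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

def attention_mask(padded, pad_id=0):
    """1 for real tokens, 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in padded]

# Made-up token ids for two "sentences" of different length (illustration only)
toy_ids = [[5, 17, 42, 6], [5, 99, 6]]

padded = pad_batch(toy_ids)
mask = attention_mask(padded)
print(padded)
print(mask)
```

The resulting mask has the same shape as the padded batch, which is what BERT expects for its `attention_mask` argument.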


In [12]:
#This is how you run BERT in torch
data=hs_data
with torch.no_grad(): #tell the model not to gather gradients since we are evaluating, not training, saves memory and troubles
  mask=data.clone().float() # this is a mask telling which tokens are padding and which are real
  mask[data>0]=1.0 #We need to set this to 1 for tokens that the attention should see, and 0 for those that are mere padding

  emb=bert_model(data.cuda(),attention_mask=mask) #applies the model and returns several things, we care about the first. Documentation: https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L648
  print(emb[0].shape)  # batch x sequence x embedding


torch.Size([217, 32, 768])


In [13]:
#Let's pack this into a nice function
def embed(data,how_to_pool="CLS"):
  with torch.no_grad(): #tell the model not to gather gradients
    mask=data.clone().float() #mask telling which tokens are real and which are padding (padding has id 0)
    mask[data>0]=1.0 #1 for tokens the attention should see, 0 for padding
    emb=bert_model(data.cuda(),attention_mask=mask.cuda()) #runs BERT and returns several things, we care about the first
    #emb[0]  # batch x word x embedding
    if how_to_pool=="AVG":
      pooled=emb[0]*(mask.unsqueeze(-1)) #multiply everything by the mask
      pooled=pooled.sum(1)/mask.sum(-1).unsqueeze(-1) #sum and divide by non-zero elements in mask to get masked average
    elif how_to_pool=="CLS":
      pooled=emb[0][:,0,:] #Pick the first token as the embedding; the slice is already batch x embedding, so no squeeze is needed
    else:
      raise ValueError("how_to_pool should be CLS or AVG")
    print("Pooled shape:",pooled.shape)
  return pooled.cpu().numpy() #done! move data back to CPU and extract the numpy array

hs_emb_cls=embed(hs_data,"CLS")
yle_emb_cls=embed(yle_data,"CLS")

hs_emb_avg=embed(hs_data,"AVG")
yle_emb_avg=embed(yle_data,"AVG")


Pooled shape: torch.Size([217, 768])
Pooled shape: torch.Size([217, 768])
Pooled shape: torch.Size([217, 768])
Pooled shape: torch.Size([217, 768])


In [14]:
eval_embeddings(hs,yle,hs_emb_avg,yle_emb_avg)

Correct 152/217=70.04608294930875%


---------- Sample of correct ones:
Ulkomaalaisten opiskelijoiden into tulla Suomeen notkahti lukuvuosi­maksujen jälkeen vain hetkeksi
Suomi kiinnostaa ulkomaalaisia opiskelijoita, vaikka opiskelusta tuli maksullista – hakijamäärät ovat jälleen reippaassa nousussa

Matkustajia nostettiin vinssillä korkeuksiin rajussa tuulessa – uusi video näyttää, miten merihätään joutuneen laivan pelastus­operaatio eteni Norjassa
Yli tuhat henkilöä evakuoidaan merihätään joutuneelta risteilyalukselta Norjan rannikolla – video

Venäjä ja Kiina estivät jatkoajan Syyria-avulle, Yhdysvaltain ulkoministeri Pompeo kuvailee toimintaa ”häpeälliseksi”
Venäjä ja Kiina torppasivat jatkoajan rajan yli toimitettavalle avulle Syyriaan

EU ei päässyt yhteisymmärrykseen ehdokkaasta Kansainvälisen valuuttarahaston johtoon, perjantaina edessä äänestys
EU:n valtiovarainministerit aloittivat äänestyksen IMF-ehdokkaasta

Pilkille lähtenyt 80-vuotias mies hukkui Kuopiossa, useita jäihin 

In [15]:
eval_embeddings(hs,yle,hs_emb_cls,yle_emb_cls)

Correct 143/217=65.89861751152074%


---------- Sample of correct ones:
Eduskuntapuolueilta yhteinen vetoomus kansalaisille: Olkaa tarkkana vale­uutisten suhteen
9 puolueelta yhteinen kannanotto: Vaalihäirintä uhkaa demokratiaa – tärkeää, että äänestäjät tunnistavat valeuutisoinnin

Työtön voi jatkossa kerryttää aktiivisuutta entistä useammalla tavalla
Työtön voi nyt olla hyväksyttävästi aktiivinen ilman työviranomaisten palveluita – Kela vaatii todistuksen

Kokoomus esittelee sittenkin sotemallin ennen vaaleja, mutta aikataulusta ei ole vielä tietoa
Ryhmäjohtaja Jokinen korjaa sanomisiaan: Kokoomus esittelee sote-mallinsa, aikataulu vielä auki

Useilla pankeilla oli sunnuntaina vakavia yhteysongelmia – palvelut palautuivat käyttöön alkuillasta
Usealla pankilla laaja ongelma verkkopankin ja pankkikorttien kanssa – vian kestosta ei ole tietoa

Eduskunta jää tänään kesätauolle
Eduskunta aloittaa kesälomansa

Lännen Media: EU ei myönnä kriisiapua maatalouden kuivuus­ongelmaan, koska rahaa