# Introduction to Sentence Transformers

[Sentence Transformers](https://www.sbert.net/index.html) is a Python framework for sentence, text, and image embeddings based on the Sentence-BERT [paper](https://arxiv.org/abs/1908.10084).

## Imports, Inits, and Functions

In [1]:
%load_ext autoreload
%autoreload 2
%config IPCompleter.greedy=True

import pdb, pickle, sys, warnings, tqdm, time, torch, json, gzip
warnings.filterwarnings(action='ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string

import os
import bokeh, bokeh.models, bokeh.plotting
import tensorflow.compat.v2 as tf
from simpleneighbors import SimpleNeighbors
from tqdm import tqdm, trange

from tqdm.notebook import tqdm as tqdm_notebook  # tqdm._tqdm_notebook is a deprecated import path
tqdm_notebook.pandas()

from sentence_transformers import SentenceTransformer, util, CrossEncoder
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [2]:
def bm25_tokenizer(text):
  tokenized_doc = []
  for token in text.lower().split():
    token = token.strip(string.punctuation)

    if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
      tokenized_doc.append(token)
  return tokenized_doc

def search(query):
  print("Input question:", query)

  ##### BM25 search (lexical search) #####
  bm25_scores = bm25.get_scores(bm25_tokenizer(query))
  top_n = np.argpartition(bm25_scores, -5)[-5:]
  bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
  bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

  print("Top-3 lexical search (BM25) hits")
  for hit in bm25_hits[0:3]:
    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

  ##### Semantic Search #####
  # Encode the query using the bi-encoder and find potentially relevant passages
  question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
  question_embedding = question_embedding.to(corpus_embeddings.device)  # match the corpus device (CPU or GPU)
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
  hits = hits[0]  # Get the hits for the first query

  ##### Re-Ranking #####
  # Now, score all retrieved passages with the cross_encoder
  cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
  cross_scores = cross_encoder.predict(cross_inp)

  # Attach the cross-encoder score to each hit
  for idx in range(len(cross_scores)):
    hits[idx]['cross-score'] = cross_scores[idx]

  # Output of top-3 hits from bi-encoder
  print("\n-------------------------\n")
  print("Top-3 Bi-Encoder Retrieval hits")
  hits = sorted(hits, key=lambda x: x['score'], reverse=True)
  for hit in hits[0:3]:
    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

  # Output of top-3 hits from re-ranker
  print("\n-------------------------\n")
  print("Top-3 Cross-Encoder Re-ranker hits")
  hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
  for hit in hits[0:3]:
    print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
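
The `search` helper above relies on `BM25Okapi` for its lexical baseline. As a rough illustration of what a BM25 score measures, here is a minimal sketch of a textbook Okapi BM25 formula. The function name `bm25_scores`, the toy corpus, the `k1` and `b` defaults, and the smoothed IDF are all assumptions of this sketch; `rank_bm25` may differ in details such as how it handles IDF.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (the "1 +" keeps it non-negative); an assumption of this sketch.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy, already-tokenized corpus (hypothetical data).
toy_corpus = [
    ["cats", "live", "long"],
    ["dogs", "bark", "loudly"],
    ["cats", "and", "dogs", "live", "together"],
]
toy_scores = bm25_scores(["cats", "live"], toy_corpus)
best_doc = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, best_doc)
```

The first and third documents both contain the two query terms, but the shorter one scores higher because of the length normalization; the second document shares no term with the query and scores zero.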

We first load a pre-trained model. For this tutorial we use the `all-mpnet-base-v2` model, which was trained on all available training data (more than 1 billion training pairs) and is designed as a general-purpose model. `sentence_transformers` provides other pretrained models, listed [here](https://www.sbert.net/docs/pretrained_models.html).

In [3]:
model = SentenceTransformer('all-mpnet-base-v2')

## Semantic Textual Similarity

Sentence embeddings can be used to compare how semantically similar two sentences are. A common way to compare semantic content is to use `cosine` similarity.
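
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A minimal pure-Python sketch (the function name and the toy vectors are illustrative, not part of the library):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Parallel vectors score (up to rounding) 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.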

In [4]:
sentences = [
  'The cat sits outside',
  'A man is playing guitar',
  'I love pasta',
  'The new movie is awesome',
  'The cat plays in the garden',
  'A woman watches TV',
  'The new movie is so great',
  'Do you like pizza?',
]

embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
  for j in range(i+1, len(cosine_scores)):
    pairs.append({'index': [i, j], 'score': cosine_scores[i][j].cpu().numpy()})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

sim = {'Sentence 1': [], 'Sentence 2': [], 'Score': []}
for pair in pairs[0:10]:
  i, j = pair['index']
  sim['Sentence 1'].append(sentences[i])
  sim['Sentence 2'].append(sentences[j])
  sim['Score'].append(pair['score'])

pd.DataFrame(sim)

| | Sentence 1 | Sentence 2 | Score |
|---|---|---|---|
| 0 | The new movie is awesome | The new movie is so great | 0.91006535 |
| 1 | The cat sits outside | The cat plays in the garden | 0.6738438 |
| 2 | I love pasta | Do you like pizza? | 0.5188928 |
| 3 | I love pasta | The new movie is so great | 0.2034733 |
| 4 | The new movie is awesome | Do you like pizza? | 0.19310407 |
| 5 | The new movie is so great | Do you like pizza? | 0.178824 |
| 6 | I love pasta | The new movie is awesome | 0.17515305 |
| 7 | I love pasta | The cat plays in the garden | 0.11078526 |
| 8 | The cat sits outside | The new movie is so great | 0.10594223 |
| 9 | The cat plays in the garden | Do you like pizza? | 0.09111772 |


## Semantic Search

Semantic search improves search accuracy by understanding the content of the search query. In addition to finding documents based on lexical matches, semantic search can also find synonyms.
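
Once the query and the corpus are embedded, the retrieval step amounts to ranking corpus vectors by similarity to the query vector and keeping the best `k`. The sketch below illustrates that ranking with `heapq.nlargest`; the 3-dimensional vectors are made up and merely stand in for real sentence embeddings, and the helper names are hypothetical.

```python
import heapq
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, corpus_vecs, k):
    """Return (score, index) pairs for the k corpus vectors most similar to the query."""
    scored = ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored)

# Made-up vectors standing in for real sentence embeddings.
toy_corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
toy_query_vec = [1.0, 0.2, 0.0]

toy_hits = top_k(toy_query_vec, toy_corpus_vecs, k=2)
for score, idx in toy_hits:
    print(idx, round(score, 3))
```

Here the vectors at indices 0 and 2 point in nearly the same direction as the query, so they are returned first and second.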

In [5]:
corpus = [
  'A man is eating food.',
  'A man is eating a piece of bread.',
  'The girl is carrying a baby.',
  'A man is riding a horse.',
  'A woman is playing violin.',
  'Two men pushed carts through the woods.',
  'A man is riding a white horse on an enclosed ground.',
  'A monkey is playing drums.',
  'A cheetah is running behind its prey.'
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = [
  'A man is eating pasta.',
  'Someone in a gorilla costume is playing a set of drums.',
  'A cheetah chases prey on across a field.'
]


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
  query_embedding = model.encode(query, convert_to_tensor=True)

  # We use cosine-similarity and torch.topk to find the highest 5 scores
  cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
  top_results = torch.topk(cos_scores, k=top_k)

  print("\n\n======================\n\n")
  print("Query:", query)
#   print("\nTop 5 most similar sentences in corpus:")
  
  sim = {'Similar Sentence': [], 'Similarity Score': []}  
  for score, idx in zip(top_results[0], top_results[1]):
    sim['Similar Sentence'].append(corpus[idx])
    sim['Similarity Score'].append(score.cpu().numpy())
  
  display(pd.DataFrame(sim))

#         print(corpus[idx], "(Score: {:.4f})".format(score))





Query: A man is eating pasta.


| | Similar Sentence | Similarity Score |
|---|---|---|
| 0 | A man is eating food. | 0.5898959 |
| 1 | A man is eating a piece of bread. | 0.42187655 |
| 2 | A man is riding a horse. | 0.19115905 |
| 3 | A man is riding a white horse on an enclosed g... | 0.099979654 |
| 4 | A cheetah is running behind its prey. | 0.008013582 |






Query: Someone in a gorilla costume is playing a set of drums.


| | Similar Sentence | Similarity Score |
|---|---|---|
| 0 | A monkey is playing drums. | 0.643533 |
| 1 | A cheetah is running behind its prey. | 0.17746502 |
| 2 | A woman is playing violin. | 0.11836575 |
| 3 | A man is riding a white horse on an enclosed g... | 0.08056441 |
| 4 | A man is eating food. | 0.0723409 |






Query: A cheetah chases prey on across a field.


| | Similar Sentence | Similarity Score |
|---|---|---|
| 0 | A cheetah is running behind its prey. | 0.800229 |
| 1 | A man is riding a white horse on an enclosed g... | 0.16320977 |
| 2 | A monkey is playing drums. | 0.15903687 |
| 3 | A man is riding a horse. | 0.11885679 |
| 4 | A woman is playing violin. | 0.0761629 |


## Paraphrase Mining

Paraphrase mining is the task of finding texts with identical or similar meaning in a large corpus of sentences.
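
Conceptually, paraphrase mining scores every pair of sentences and keeps the most similar ones. The brute-force sketch below does exactly that on made-up 2-dimensional vectors and returns `[score, i, j]` triples. The name `mine_pairs` and the vectors are assumptions for illustration; this all-pairs loop grows quadratically with the number of sentences, so it is only suitable for small inputs.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_pairs(vectors, top_n=3):
    """Score every unordered pair (i < j) and return the top_n as [score, i, j] lists."""
    scored = [
        [cosine_similarity(vectors[i], vectors[j]), i, j]
        for i, j in combinations(range(len(vectors)), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]

# Made-up vectors: 0 and 1 are near-duplicates, as are 2 and 3.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.9]]
mined = mine_pairs(toy_vectors, top_n=2)
for score, i, j in mined:
    print(i, j, round(score, 3))
```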

In [6]:
# Single list of sentences - possibly tens of thousands of sentences
sentences = [
  'The cat sits outside',
  'A man is playing guitar',
  'I love pasta',
  'The new movie is awesome',
  'The cat plays in the garden',
  'A woman watches TV',
  'The new movie is so great',
  'Do you like pizza?'
]

paraphrases = util.paraphrase_mining(model, sentences)
sim = {'Source Sentence': [], 'Paraphrase': [], 'Similarity Score': []}
for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    sim['Source Sentence'].append(sentences[i])
    sim['Paraphrase'].append(sentences[j])
    sim['Similarity Score'].append(score)
    
pd.DataFrame(sim)    

| | Source Sentence | Paraphrase | Similarity Score |
|---|---|---|---|
| 0 | The new movie is awesome | The new movie is so great | 0.910065 |
| 1 | The cat sits outside | The cat plays in the garden | 0.673844 |
| 2 | I love pasta | Do you like pizza? | 0.518893 |
| 3 | I love pasta | The new movie is so great | 0.203473 |
| 4 | The new movie is awesome | Do you like pizza? | 0.193104 |
| 5 | The new movie is so great | Do you like pizza? | 0.178824 |
| 6 | I love pasta | The new movie is awesome | 0.175153 |
| 7 | I love pasta | The cat plays in the garden | 0.110785 |
| 8 | The cat sits outside | The new movie is so great | 0.105942 |
| 9 | The cat plays in the garden | Do you like pizza? | 0.091118 |


## Retrieval & Re-ranking

Given a search query, a **retrieval system** retrieves a large list of potentially relevant hits for the query using a **bi-encoder**. Then a **re-ranker** based on a **cross-encoder** scores the relevancy of all candidates for the given search query.
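
The two-stage idea can be sketched independently of any model: a cheap scorer is run on every passage to shortlist candidates, and a more expensive scorer is run only on that shortlist. In the sketch below both scorers are hypothetical word-overlap stand-ins (not a bi-encoder or a cross-encoder), and the passages are toy data.

```python
def retrieve_and_rerank(query, passages, cheap_score, expensive_score, top_k=2, final_k=2):
    """Shortlist top_k passages with cheap_score, then re-order them with expensive_score."""
    # Stage 1 (bi-encoder role): score every passage with the fast scorer.
    candidates = sorted(
        range(len(passages)),
        key=lambda i: cheap_score(query, passages[i]),
        reverse=True,
    )[:top_k]
    # Stage 2 (cross-encoder role): run the slow scorer on the shortlist only.
    reranked = sorted(
        candidates,
        key=lambda i: expensive_score(query, passages[i]),
        reverse=True,
    )
    return [passages[i] for i in reranked[:final_k]]

# Hypothetical stand-in scorers, NOT real models.
def shared_words(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

def shared_word_ratio(query, passage):
    # Penalises long passages in which the matching words are diluted.
    return shared_words(query, passage) / len(passage.lower().split())

toy_query = "how long do cats live"
toy_passages = [
    "cats usually live for 13 to 20 years",
    "how long is a piece of string and how long do you want it to be",
    "the stock market fell four points",
    "dogs bark",
]
reranked = retrieve_and_rerank(toy_query, toy_passages, shared_words, shared_word_ratio)
print(reranked)
```

The first stage ranks the "piece of string" passage highest because it shares three words with the query, while the second stage moves the passage about cats to the top.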

In [7]:
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 32                          #Number of passages we want to retrieve with the bi-encoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
wikipedia_filepath = '../project_dir/simplewiki-2020-11-01.jsonl.gz'
passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
  for line in fIn:
    data = json.loads(line.strip())
    #passages.extend(data['paragraphs'])
    passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597


In [8]:
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

tokenized_corpus = []
for passage in tqdm_notebook(passages):
  tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

queries = [
  'What is the best orchestra in the world?',
  'Number countries Europe',
  'When did the cold war end?',
  'How long do cats live?',
  'How many people live in Toronto?',
  'Oldest US president',
  'Coldest place earth',
  'Elon Musk year birth',
  'Paris eiffel tower',
  'Which US president was killed?',
  'When is Chinese New Year',  
]


In [9]:
for query in queries:
  search(query)  # search() prints its results and returns None, so it is not wrapped in print()
  print("\n\n======================\n\n")

Input question: What is the best orchestra in the world?
Top-3 lexical search (BM25) hits
	15.328	The BBC Symphony Orchestra is the main orchestra of the British Broadcasting Corporation. It is one of the best orchestras in Britain.
	15.320	The NHK Symphony Orchestra is a Japanese orchestra based in Tokyo, Japan. In Japanese it is written: NHK交響楽団, pronounced: Enueichikei Kōkyō Gakudan. When the orchestra was started in 1926 it was called "New Symphony Orchestra". It was the first large professional orchestra in Japan. Later, it changed its name to "Japan Symphony Orchestra". In 1951 it started to get money from the Japanese radio station NHK (Nippon Hōsō Kyōkai), so it changed its name again to the name it has now. It is thought of as the best orchestra in Japan. They have played in many parts of the world, including at the BBC Proms in London.
	14.079	The Bamberger Symphoniker (Bamberg Symphony Orchestra) is a world-famous orchestra from the city of Bamberg, Germany. It was formed in

Top-3 lexical search (BM25) hits
	22.997	Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.
	16.974	The sabertoothed cats or sabretooth cats are some of the best known and most popular extinct animals. They are among the most impressive carnivores that ever have lived. These cats had long canines and jaws which opened wider than modern cats. This suggests a different style of killing from modern felines.
	16.490	The Cyprus cat is a breed of cat. These cats are thought to have first come from ancient Egypt or Palestine. They were brought to the island of Cyprus by St. Helen. These are now common domestic cats that live in homes or outside. Many of these cats still live all over Cyprus. But, a large number are now feral. 

Top-3 lexical search (BM25) hits
	24.891	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side of the Transantarctic Mountains. It is the coldest, windiest, and driest part of Earth. East Antarctica holds the record as the coldest place on earth.
	12.650	Earth Day is a day that is supposed to inspire more awareness and appreciation for the Earth's natural environment. It takes place each year on April 22. It now takes place in more than 193 countries around the world. During Earth Day, the world encourages everyone to turn off all unwanted lights.
	12.172	Heinrich events occurred during the coldest point of "Bond Cycles" in which many icebergs were discharged into the North Atlantic and melted.

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.633	East Antarctica, also called Greater Antarctica, is the largest part (two-thirds) of the Antarctic continent. It is on the Indian Ocean side o

Top-3 lexical search (BM25) hits
	10.179	Lyndon Baines Johnson (August 27, 1908 – January 22, 1973) was a member of the Democratic Party and the 36th president of the United States serving from 1963 to 1969. Johnson took over as president when President Kennedy was killed in November 1963. He was then re-elected in the 1964 election.
	10.091	Lech Kaczyński, the fourth President of the Republic of Poland, died on 10 April 2010. He died in a plane crash outside of Smolensk, Russia. The plane was a Tu-154 belonging to the Polish Air Force. The crash killed all 96 on board. His wife, Maria Kaczyńska, was also among those killed.
	9.791	Jacobo Majluta Azar (October 9, 1934 – March 2, 1996) was a Dominican politician. He was Vice President of the Dominican Republic during the Antonio Guzmán Fernández presidency between 1978 to 1982. He became President of the Dominican Republic after Guzmán Fernández killed himself in 1982. He was president for a month between July to August 1982.

---------

## Computing Sentence Embeddings

A sentence embedding model computes an embedding vector for a piece of text. While thinking in terms of sentences is natural here, the same method also works on shorter phrases and on longer texts containing multiple sentences.

In [10]:
sentences = [
  'This framework generates embeddings for each input sentence',
  'Sentences are passed as a list of string.',
  'The quick brown fox jumps over the lazy dog.'
]

embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, embeddings):
  print("Sentence:", sentence)
  print("Embedding:", embedding)
  print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [ 6.41697226e-03  7.04137795e-03 -2.81442143e-02  5.12470752e-02
 -8.93958658e-03  2.12669224e-02  2.30778940e-02 -1.44860102e-02
 -5.55314077e-03 -2.49297004e-02  4.53493707e-02  2.48958636e-02
 -3.07578910e-02  5.66224307e-02  6.32021949e-02 -5.62528186e-02
  5.16509861e-02  5.78277186e-03 -2.62116492e-02  1.31875766e-03
  1.99272186e-02 -1.30595511e-03 -2.28706747e-03  4.72541675e-02
 -3.72494720e-02 -2.85245366e-02 -4.10241000e-02 -1.57975797e-02
  3.17328353e-03 -8.74162768e-04 -2.96460111e-02  3.21501270e-02
  3.51344086e-02  1.09738000e-02  9.16706938e-07 -1.18584954e-03
 -2.53640544e-02 -7.92879052e-03 -5.09485463e-03  7.40652019e-03
  2.80068498e-02  1.06995292e-02  1.07513396e-02  2.76827645e-02
 -5.19132428e-02 -4.98179607e-02  5.34074865e-02  5.79066686e-02
  7.86073431e-02  7.73014277e-02 -1.01112109e-02 -6.35445938e-02
 -1.71579663e-02 -6.77374518e-03 -2.45815911e-03  2.61346120e-02
 -5.38508

## Cross-lingual Retrieval 

We show a use case of a sentence transformer model in cross-lingual retrieval: a cross-lingual semantic-similarity search engine.

This notebook has been adapted from [this paper](https://aclanthology.org/2020.acl-demos.12/).

### Load Transformer Model

We load the `xlm-mlm-100-1280` transformer, a language model trained on a masked-token prediction task on a dataset covering 100 languages.

In [11]:
model = SentenceTransformer('xlm-mlm-100-1280')

def embed_text(texts):
  # `texts` may be a single string or a list of strings
  return model.encode(texts)

No sentence-transformers model found with name /home/75y/.cache/torch/sentence_transformers/xlm-mlm-100-1280. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/75y/.cache/torch/sentence_transformers/xlm-mlm-100-1280 were not used when initializing XLMModel: ['pred_layer.proj.bias', 'pred_layer.proj.weight']
- This IS expected if you are initializing XLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
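
The log above notes that the checkpoint was wrapped with MEAN pooling. As a rough sketch of what mean pooling does, the per-token vectors of a sentence are averaged into one fixed-size vector, and padding positions are skipped using the attention mask. The function name and the toy values below are assumptions for illustration; the library's own pooling operates on batched tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens followed by one padding token (toy values).
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_attention_mask = [1, 1, 0]

pooled = mean_pool(toy_token_vectors, toy_attention_mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would drag the average toward zero.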


### Calculate Multilingual Semantic Similarity

With sentence embeddings in hand, we can measure their semantic similarity. The cell below uses cosine similarity (`util.pytorch_cos_sim`), which equals the dot product of the L2-normalized embeddings.
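
A raw dot product depends on vector length, whereas cosine similarity does not; the two agree once the vectors are scaled to unit length. A small pure-Python check (the helper names and the toy vectors are illustrative):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

vec_a, vec_b = [3.0, 4.0], [4.0, 3.0]

raw_dot = dot(vec_a, vec_b)                                   # grows with vector length
cos = cosine_similarity(vec_a, vec_b)                         # length-independent
normalized_dot = dot(l2_normalize(vec_a), l2_normalize(vec_b))
print(raw_dot, cos, normalized_dot)
```

The raw dot product here is 24, while the cosine similarity and the dot product of the normalized vectors are both about 0.96.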

In [12]:
# an example of calculating semantic similarity between two multilingual sentences
sentence1 = "natural language understanding" #en
sentence2 = "comprensión del lenguaje natural" #es
# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())


# Retrieve top K most similar sentences from a multilingual corpus given a query sentence
corpus = ["I like Python because I can build AI applications",
          "Me gusta Python porque puedo hacer análisis de datos",
          "The cat sits on the ground",
         "El gato camina por la acera."]
# encode corpus to get corpus embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
sentence = "I like Javascript because I can build web applications"
# encode sentence to get sentence embeddings
sentence_embedding = model.encode(sentence, convert_to_tensor=True)
# top_k results to return
top_k=2
# compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]
# Sort the results in decreasing order and get the first top_k
top_results = np.argpartition(-cos_scores.cpu(), range(top_k))[0:top_k]
print("\nSentence:", sentence)
print("Top", top_k, "most similar sentences in corpus:")
for idx in top_results[0:top_k]:
    print(corpus[idx], "(Score: %.4f)" % (cos_scores[idx]))


Sentence 1: natural language understanding
Sentence 2: comprensión del lenguaje natural
Similarity score: 0.7081338167190552

Sentence: I like Javascript because I can build web applications
Top 2 most similar sentences in corpus:
I like Python because I can build AI applications (Score: 0.8240)
Me gusta Python porque puedo hacer análisis de datos (Score: 0.6684)


### Visualize Multilingual Semantic Similarity

With sentence embeddings in hand, we can compute their pairwise cosine similarity to visualize how similar sentences are across languages. A darker color indicates that the embeddings are more semantically similar.

In [13]:
def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2,
                         plot_title,
                         plot_width=1200, plot_height=600,
                         xaxis_font_size='12pt', yaxis_font_size='12pt'):

  assert len(embeddings_1) == len(labels_1)
  assert len(embeddings_2) == len(labels_2)

  # cosine based text similarity
  sim = util.pytorch_cos_sim(embeddings_1, embeddings_2)

  embeddings_1_col, embeddings_2_col, sim_col = [], [], []
  for i in range(len(embeddings_1)):
    for j in range(len(embeddings_2)):
      embeddings_1_col.append(labels_1[i])
      embeddings_2_col.append(labels_2[j])
      sim_col.append(float(sim[i][j]))  # np.float was removed in NumPy 1.24; the builtin float works
  df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col),
                    columns=['embeddings_1', 'embeddings_2', 'sim'])

  mapper = bokeh.models.LinearColorMapper(
      palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),
      high=df.sim.max())

  p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,
                            x_axis_location="above",
                            y_range=[*reversed(labels_2)],
                            plot_width=plot_width, plot_height=plot_height,
                            tools="save",toolbar_location='below', tooltips=[
                                ('pair', '@embeddings_1 ||| @embeddings_2'),
                                ('sim', '@sim')])
  p.rect(x="embeddings_1", y="embeddings_2", width=1, height=1, source=df,
         fill_color={'field': 'sim', 'transform': mapper}, line_color=None)

  p.title.text_font_size = '12pt'
  p.axis.axis_line_color = None
  p.axis.major_tick_line_color = None
  p.axis.major_label_standoff = 16
  p.xaxis.major_label_text_font_size = xaxis_font_size
  p.xaxis.major_label_orientation = 0.25 * np.pi
  p.yaxis.major_label_text_font_size = yaxis_font_size
  p.min_border_right = 300

  bokeh.io.output_notebook()
  bokeh.io.show(p)


In [14]:
# Some texts of different lengths in different languages.
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']

# Compute embeddings.
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)

# visualize semantic similarity
visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')

### Build a simple cross-lingual document retrieval engine

#### Download Data to Index
Download news sentences in multiple languages (English and Spanish) from the [News Commentary Corpus](http://opus.nlpl.eu/News-Commentary-v11.php) [[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.2874&rep=rep1&type=pdf).

To speed up the demo, we limit the data to 1,000 sentences per language.


In [15]:
def download_files(corpus_metadata):
  language_to_sentences = {}
  language_to_news_path = {}
  for language_code, zip_file, news_file, language_name in corpus_metadata:
    zip_path = tf.keras.utils.get_file(
        fname=zip_file,
        origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,
        extract=True)
    news_path = os.path.join(os.path.dirname(zip_path), news_file)
    language_to_sentences[language_code] = pd.read_csv(news_path, sep='\t', header=None)[0][:1000]
    language_to_news_path[language_code] = news_path

    print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))
  return language_to_sentences, language_to_news_path


In [16]:
corpus_metadata = [
    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),
    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),
]
language_to_sentences, language_to_news_path = download_files(corpus_metadata)

1,000 English sentences
1,000 Spanish sentences


#### Get document embeddings 

In [17]:
sample_size = 1000
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nComputing {} embeddings'.format(language_name))
  with tqdm(total=len(language_to_sentences[language_code])) as pbar:
    df = pd.read_csv(language_to_news_path[language_code],
                             sep='\t',header=None).head(sample_size)
    for i in range(len(df)):
      language_to_embeddings.setdefault(language_code, []).append(embed_text(df[0][i]))
      pbar.update(1)  # advance the progress bar; without this it stays at 0%



Computing English embeddings


  0%|                                                  | 0/1000 [00:28<?, ?it/s]



Computing Spanish embeddings


  0%|                                                  | 0/1000 [00:28<?, ?it/s]


#### Building an index of semantic vectors

Use the [SimpleNeighbors](https://pypi.org/project/simpleneighbors/) library, a wrapper for the [Annoy](https://github.com/spotify/annoy) library, to efficiently look up results from the corpus.
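
To make the role of the index concrete, here is a minimal sketch of an exact, brute-force counterpart. `BruteForceNeighbors` is a hypothetical class written for illustration; it mirrors only the three operations this notebook uses (`add_one`, `build`, `nearest`) and compares the query against every stored vector, which an approximate index avoids on large corpora.

```python
import math

class BruteForceNeighbors:
    """Hypothetical exact nearest-neighbour index using cosine similarity."""

    def __init__(self, dims):
        self.dims = dims
        self.items = []
        self.vectors = []
        self.unit_vectors = []

    def add_one(self, item, vector):
        if len(vector) != self.dims:
            raise ValueError("vector has the wrong number of dimensions")
        self.items.append(item)
        self.vectors.append(list(vector))

    def build(self):
        # Pre-normalize stored vectors so nearest() only needs dot products.
        self.unit_vectors = [self._normalize(vec) for vec in self.vectors]

    def nearest(self, vector, n=3):
        query = self._normalize(vector)
        ranked = sorted(
            range(len(self.items)),
            key=lambda i: sum(a * b for a, b in zip(query, self.unit_vectors[i])),
            reverse=True,
        )
        return [self.items[i] for i in ranked[:n]]

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

toy_index = BruteForceNeighbors(2)
toy_index.add_one("east", [1.0, 0.0])
toy_index.add_one("north", [0.0, 1.0])
toy_index.add_one("north-east", [1.0, 1.0])
toy_index.build()

toy_neighbors = toy_index.nearest([0.9, 0.2], n=2)
print(toy_neighbors)
```

The query vector points mostly east, so `"east"` is returned first and `"north-east"` second.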

In [18]:
%%time

num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nAdding {} embeddings to index'.format(language_name))
  index = SimpleNeighbors(embedding_dimensions, metric='cosine')

  for i in trange(len(language_to_sentences[language_code])):
    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])

  print('Building {} index with {} trees...'.format(language_name, num_index_trees))
  index.build(n=num_index_trees)
  language_name_to_index[language_name] = index


Adding English embeddings to index


100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 4599.64it/s]


Building English index with 40 trees...

Adding Spanish embeddings to index


100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 4173.77it/s]


Building Spanish index with 40 trees...
CPU times: user 611 ms, sys: 83.4 ms, total: 694 ms
Wall time: 745 ms


In [19]:
%%time

num_index_trees = 60
print('Computing mixed-language index')
combined_index = SimpleNeighbors(embedding_dimensions, metric='cosine')
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('Adding {} embeddings to mixed-language index'.format(language_name))
  for i in trange(len(language_to_sentences[language_code])):
    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])
    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])

print('Building mixed-language index with {} trees...'.format(num_index_trees))
combined_index.build(n=num_index_trees)

Computing mixed-language index
Adding English embeddings to mixed-language index


100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 4204.85it/s]


Adding Spanish embeddings to mixed-language index


100%|█████████████████████████████████████| 1000/1000 [00:00<00:00, 4351.75it/s]


Building mixed-language index with 60 trees...
CPU times: user 611 ms, sys: 74.8 ms, total: 685 ms
Wall time: 678 ms


#### Validate the document retrieval pipeline

1.   Retrieve sentences from the corpus that are semantically similar to the given query.
2.   Cross-lingual: Issue queries in a different language from the indexed corpus
3.   Mixed-corpus: Issue queries in one language and retrieve similar documents from multiple languages

##### Cross-lingual

Issue queries in a different language from the indexed corpus.

In [20]:
sample_query = 'The stock market fell four points.'
index_language = 'Spanish'  #@param ["English", "Spanish"]
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(sample_query)
search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)

print('{} sentences similar to: "{}"\n'.format(index_language, sample_query))
search_results

Spanish sentences similar to: "The stock market fell four points."



['Los precios del oro incluso alcanzaron recientemente un récord de 1.300 dólares.',
 'Pero el acuerdo alcanzado tiene tres grandes defectos.',
 'Las grandes empresas deben aceptar la responsabilidad por sus acciones.',
 'Si se extendiera la política a empresas de terceros países, esto tendría un fuerte impacto liberalizador.',
 'Después me distraje durante aproximadamente 40 años.',
 'Ese aumento de los activos permitirá, a su vez, a los mercados crediticios locales, como, por ejemplo, el de la microfinanciación comenzar a funcionar.',
 'Desde entonces, el índice ha trepado por encima de 10.000.',
 'También existe el un creciente peligro del terrorismo interno.',
 'Los recursos de inteligencia se han redireccionado.',
 'Desde que aparecieron sus artículos, el precio del oro aumentó aún más.']

In [21]:
sample_query = 'Desde entonces, el índice ha trepado por encima de 10.000.'
index_language = 'English'  #@param ["English", "Spanish"]
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(sample_query)
search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)

print('{} sentences similar to: "{}"\n'.format(index_language, sample_query))
search_results

English sentences similar to: "Desde entonces, el índice ha trepado por encima de 10.000."



['Since then, the index has climbed above 10,000.',
 'His ratings have dipped below 50% for the first time.',
 'Since their articles appeared, the price of gold has moved up still further.',
 'But where will increased competitiveness come from?',
 'Gold prices even hit a record-high $1,300 recently.',
 'At $1,300, today’s price is probably more than double very long-term, inflation-adjusted, average gold prices.',
 'After adjusting for inflation, today’s price is nowhere near the all-time high of January 1980.',
 'International cooperation has increased markedly, in part because governments that cannot agree on many things can agree on the need to cooperate in this area.',
 'Last December, many gold bugs were arguing that the price was inevitably headed for $2,000.',
 'Of course, there are obvious differences between 1989 and now.']

##### Mixed-corpus capabilities

Issue a query in English and the results will come from any of the indexed languages.

In [22]:
query = 'The stock market fell four points.'  #@param {type:"string"}
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(query)
search_results = combined_index.nearest(query_embedding, n=num_results)

print('Multilingual sentences similar to: "{}"\n'.format(query))
search_results

Multilingual sentences similar to: "The stock market fell four points."



['(English) Since their articles appeared, the price of gold has moved up still further.',
 '(English) Intelligence assets have been redirected.',
 '(English) But, globally, our innovation system needs much bigger changes.',
 '(English) But neither improved competitiveness, nor reduction of total debt, can be achieved overnight.',
 '(English) As legions of new consumers gain purchasing power, demand inevitably rises, driving up the price of scarce commodities.',
 '(English) Other IMF accounting practices, including how the capital expenditures of government-owned enterprises are treated, are also causing outrage. If a state owned enterprise',
 '(Spanish) Los precios del oro incluso alcanzaron recientemente un récord de 1.300 dólares.',
 '(English) His ratings have dipped below 50% for the first time.',
 '(English) With serious management of the new funds, food production in Africa will soar.',
 '(Spanish) Pero el acuerdo alcanzado tiene tres grandes defectos.']