# Cross-Lingual Semantic-Similarity Search Engine Use Case

This notebook has been adapted from [this paper](https://aclanthology.org/2020.acl-demos.12/).

## Import packages

In [1]:
#@title Setup common imports and functions
import bokeh
import bokeh.models
import bokeh.plotting
import numpy as np
import os
import pandas as pd
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
import sklearn.metrics.pairwise
from sentence_transformers import SentenceTransformer, util
import sklearn

from simpleneighbors import SimpleNeighbors
from tqdm import tqdm
from tqdm import trange



## Application of a pre-trained transformer model in document retrieval

### Load Transformer Model

We load the `xlm-mlm-100-1280` transformer, a language model trained with a masked-token prediction objective on text in 100 languages.

In [2]:
model = SentenceTransformer('xlm-mlm-100-1280')

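def embed_text(texts):
  # Accepts a single string or a list of strings and returns the embedding(s).
  # The parameter is named `texts` to avoid shadowing the built-in `input`.
  return model.encode(texts)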

No sentence-transformers model found with name /net/kdinxidk03/opt/NFS/sentence_transformers_cache/xlm-mlm-100-1280. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /net/kdinxidk03/opt/NFS/sentence_transformers_cache/xlm-mlm-100-1280 were not used when initializing XLMModel: ['pred_layer.proj.bias', 'pred_layer.proj.weight']
- This IS expected if you are initializing XLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
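
The log above reports that the checkpoint was wrapped with MEAN pooling: the sentence embedding is the average of the token embeddings. The sketch below illustrates that idea on toy numbers. It is not the library's implementation; `mean_pool`, `tokens`, and `mask` are hypothetical names, and it assumes padding positions are marked with 0 in the mask and excluded from the average.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: three real tokens and one padding position, 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

Because the padding vector is masked out, it does not affect the result, however large its values are.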


### Example

In [3]:
# an example of calculating semantic similarity between two sentences
sentence1 = "natural language understanding"
sentence2 = "comprensión del lenguaje natural"
# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())


# Retrieve top K most similar sentences from a corpus given a query sentence
corpus = ["I like Python because I can build AI applications",
          "Me gusta Python porque puedo hacer análisis de datos",
          "The cat sits on the ground",
         "El gato camina por la acera."]
# encode corpus to get corpus embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
sentence = "I like Javascript because I can build web applications"
# encode sentence to get sentence embeddings
sentence_embedding = model.encode(sentence, convert_to_tensor=True)
# top_k results to return
top_k=2
# compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]
# Sort the results in decreasing order and get the first top_k
top_results = np.argpartition(-cos_scores.cpu(), range(top_k))[0:top_k]
print("\nSentence:", sentence)
print("Top", top_k, "most similar sentences in corpus:")
for idx in top_results[0:top_k]:
    print(corpus[idx], "(Score: %.4f)" % (cos_scores[idx]))


Sentence 1: natural language understanding
Sentence 2: comprensión del lenguaje natural
Similarity score: 0.7081338167190552

Sentence: I like Javascript because I can build web applications
Top 2 most similar sentences in corpus:
I like Python because I can build AI applications (Score: 0.8240)
Me gusta Python porque puedo hacer análisis de datos (Score: 0.6684)
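
The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. A minimal standard-library sketch of the same computation and of top-k selection follows. The helper names and the toy vectors are assumptions made for illustration, not part of the notebook.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus_vectors: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the k highest-scoring vectors."""
    scores = [(i, cosine_similarity(query, v)) for i, v in enumerate(corpus_vectors)]
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy 2-dimensional "embeddings"; real ones come from model.encode.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]
results = top_k(toy_query, toy_corpus, k=2)
for index, score in results:
    print(index, round(score, 4))
```

The query points in the same direction as the first corpus vector, so index 0 ranks first with a score of 1.0 even though the two vectors differ in length.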


### Visualize Multilingual Semantic Similarity
With text embeddings in hand, we can compute their pairwise cosine similarity to visualize how similar sentences are across languages. A darker color indicates that the two embeddings are more similar.
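
The plotting function below calls `util.pytorch_cos_sim`, which scores by cosine similarity. A raw dot product matches it only when the vectors have unit length. The sketch below builds a pairwise similarity matrix by normalizing first and then taking dot products. The function names and toy vectors are hypothetical and stand in for real sentence embeddings.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(rows: Sequence[Sequence[float]],
                      cols: Sequence[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity between rows[i] and cols[j]."""
    rows_n = [l2_normalize(r) for r in rows]
    cols_n = [l2_normalize(c) for c in cols]
    return [[sum(x * y for x, y in zip(r, c)) for c in cols_n] for r in rows_n]


en_toy = [[3.0, 4.0], [1.0, 0.0]]
es_toy = [[6.0, 8.0], [0.0, 2.0]]
matrix = similarity_matrix(en_toy, es_toy)

# The unnormalized dot product depends on vector length, not only on direction.
raw_dot = sum(x * y for x, y in zip(en_toy[0], es_toy[0]))
print(raw_dot)
```

Here `en_toy[0]` and `es_toy[0]` point in the same direction, so their cosine similarity is 1 while their raw dot product is 50.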

In [4]:
def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2,
                         plot_title,
                         plot_width=1200, plot_height=600,
                         xaxis_font_size='12pt', yaxis_font_size='12pt'):

  assert len(embeddings_1) == len(labels_1)
  assert len(embeddings_2) == len(labels_2)

  # cosine based text similarity
  sim = util.pytorch_cos_sim(embeddings_1, embeddings_2)

  embeddings_1_col, embeddings_2_col, sim_col = [], [], []
  for i in range(len(embeddings_1)):
    for j in range(len(embeddings_2)):
      embeddings_1_col.append(labels_1[i])
      embeddings_2_col.append(labels_2[j])
      sim_col.append(float(sim[i][j]))  # np.float was deprecated in NumPy 1.20 and removed in 1.24
  df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col),
                    columns=['embeddings_1', 'embeddings_2', 'sim'])

  mapper = bokeh.models.LinearColorMapper(
      palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),
      high=df.sim.max())

  p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,
                            x_axis_location="above",
                            y_range=[*reversed(labels_2)],
                            plot_width=plot_width, plot_height=plot_height,
                            tools="save",toolbar_location='below', tooltips=[
                                ('pair', '@embeddings_1 ||| @embeddings_2'),
                                ('sim', '@sim')])
  p.rect(x="embeddings_1", y="embeddings_2", width=1, height=1, source=df,
         fill_color={'field': 'sim', 'transform': mapper}, line_color=None)

  p.title.text_font_size = '12pt'
  p.axis.axis_line_color = None
  p.axis.major_tick_line_color = None
  p.axis.major_label_standoff = 16
  p.xaxis.major_label_text_font_size = xaxis_font_size
  p.xaxis.major_label_orientation = 0.25 * np.pi
  p.yaxis.major_label_text_font_size = yaxis_font_size
  p.min_border_right = 300

  bokeh.io.output_notebook()
  bokeh.io.show(p)


In [5]:
# Some texts of different lengths in different languages.
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']

# Compute embeddings.
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)

# visualize semantic similarity
visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')



## Build a simple cross-lingual document retrieval engine

### Download Data to Index
Download news sentences in multiple languages (English and Spanish) from the [News Commentary Corpus](http://opus.nlpl.eu/News-Commentary-v11.php) [1].

To speed up the demo, we limit the corpus to 1,000 sentences per language.
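
The Moses-format files in this download are plain text with one sentence per line. One way to apply the limit is to read the first lines directly, sketched below. This is an alternative, not what the notebook does: the notebook uses `pd.read_csv`, whose default quote handling may treat a line containing `"` differently from a plain line reader. `read_first_sentences` is a hypothetical helper, and an in-memory file stands in for the downloaded one.

```python
import io
from itertools import islice
from typing import Iterable, List


def read_first_sentences(lines: Iterable[str], limit: int) -> List[str]:
    """Return up to `limit` lines with the trailing newline removed."""
    return [line.rstrip("\n") for line in islice(lines, limit)]


# Stand-in for an opened corpus file; quotes are kept verbatim.
fake_file = io.StringIO('First sentence.\nHe said "yes" and left.\nThird sentence.\n')
sample = read_first_sentences(fake_file, limit=2)
print(sample)
```

With a real file, the same helper could be applied to a handle opened with `open(news_path, encoding='utf-8')`, where `news_path` is the path returned by the download function.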

In [6]:
def download_files(corpus_metadata):
  language_to_sentences = {}
  language_to_news_path = {}
  for language_code, zip_file, news_file, language_name in corpus_metadata:
    zip_path = tf.keras.utils.get_file(
        fname=zip_file,
        origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,
        extract=True)
    news_path = os.path.join(os.path.dirname(zip_path), news_file)
    language_to_sentences[language_code] = pd.read_csv(news_path, sep='\t', header=None)[0][:1000]
    language_to_news_path[language_code] = news_path

    print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))
  return language_to_sentences, language_to_news_path


In [7]:
corpus_metadata = [
    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),
    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),
]
language_to_sentences, language_to_news_path = download_files(corpus_metadata)

1,000 English sentences
1,000 Spanish sentences


### Get document embeddings 

In [8]:
sample_size = 1000
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nComputing {} embeddings'.format(language_name))
  with tqdm(total=len(language_to_sentences[language_code])) as pbar:
    df = pd.read_csv(language_to_news_path[language_code],
                             sep='\t',header=None).head(sample_size)
    for i in range(len(df)):
      language_to_embeddings.setdefault(language_code, []).append(embed_text(df[0][i]))
      pbar.update(1)  # advance the progress bar; without this call it stays at 0%



Computing English embeddings


  0%|                                                                                                                           | 0/1000 [00:13<?, ?it/s]



Computing Spanish embeddings


  0%|                                                                                                                           | 0/1000 [00:13<?, ?it/s]
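
The cell above encodes one sentence per call. `model.encode` also accepts a list of sentences, as the earlier example cells show, so the corpus could be encoded in batches instead. The sketch below shows the chunking logic only. `batched`, `embed_in_batches`, and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in so the example runs without loading the model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(sentences: Sequence[str],
                     encode: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 32) -> List[List[float]]:
    """Encode sentences batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in batched(sentences, batch_size):
        vectors.extend(encode(list(batch)))
    return vectors


def fake_encode(batch: List[str]) -> List[List[float]]:
    """Stand-in encoder: a 1-dimensional 'embedding' equal to the string length."""
    return [[float(len(s))] for s in batch]


toy_sentences = ["a", "bb", "ccc", "dddd", "eeeee"]
toy_vectors = embed_in_batches(toy_sentences, fake_encode, batch_size=2)
print(toy_vectors)
```

In the notebook, `embed_text` could play the role of `encode`; whether batching is faster depends on the hardware and the batch size.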


### Building an index of semantic vectors

Use the [SimpleNeighbors](https://pypi.org/project/simpleneighbors/) library, a wrapper around the [Annoy](https://github.com/spotify/annoy) approximate nearest-neighbor library, to look up results from the corpus efficiently.
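
Annoy performs approximate search, so its results can differ from an exact scan of the corpus; building more trees generally improves accuracy at the cost of a larger index. One common way to quantify the gap is recall at k against exact neighbors. The sketch below uses hypothetical names and made-up neighbor ids, not measurements from this notebook.

```python
from typing import Hashable, Sequence


def recall_at_k(approx_ids: Sequence[Hashable],
                exact_ids: Sequence[Hashable]) -> float:
    """Fraction of the exact neighbors that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


# Made-up ids: the approximate search missed neighbor 7 and returned 3 instead.
exact_neighbors = [4, 7, 1, 9]
approx_neighbors = [4, 1, 3, 9]
recall = recall_at_k(approx_neighbors, exact_neighbors)
print(recall)  # 0.75
```

With only 1,000 sentences per language, an exact scan is cheap, so this kind of check is practical for tuning the number of trees.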

In [9]:
%%time

num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nAdding {} embeddings to index'.format(language_name))
  index = SimpleNeighbors(embedding_dimensions, metric='cosine')

  for i in trange(len(language_to_sentences[language_code])):
    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])

  print('Building {} index with {} trees...'.format(language_name, num_index_trees))
  index.build(n=num_index_trees)
  language_name_to_index[language_name] = index


Adding English embeddings to index


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 5286.84it/s]


Building English index with 40 trees...

Adding Spanish embeddings to index


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 5764.69it/s]


Building Spanish index with 40 trees...
CPU times: user 482 ms, sys: 63.9 ms, total: 546 ms
Wall time: 552 ms


In [10]:
%%time

num_index_trees = 60
print('Computing mixed-language index')
combined_index = SimpleNeighbors(embedding_dimensions, metric='cosine')
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('Adding {} embeddings to mixed-language index'.format(language_name))
  for i in trange(len(language_to_sentences[language_code])):
    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])
    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])

print('Building mixed-language index with {} trees...'.format(num_index_trees))
combined_index.build(n=num_index_trees)

Computing mixed-language index
Adding English embeddings to mixed-language index


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 5735.82it/s]


Adding Spanish embeddings to mixed-language index


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 6093.68it/s]


Building mixed-language index with 60 trees...
CPU times: user 449 ms, sys: 47.6 ms, total: 496 ms
Wall time: 487 ms


### Validate the document retrieval pipeline

1.   Retrieve sentences from the corpus that are semantically similar to a given query.
2.   Cross-lingual: issue queries in a different language from the indexed corpus.
3.   Mixed-corpus: issue queries in one language and retrieve similar documents from multiple languages.
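
The cells below inspect results by eye. If the English and Spanish files are line-aligned, so that sentence i in one language is the translation of sentence i in the other, retrieval quality could also be summarized as a hit rate at k. The sketch below assumes that alignment and assumes each result has already been mapped back to its corpus row number; `hit_rate_at_k` and the toy rankings are hypothetical.

```python
from typing import List, Sequence


def hit_rate_at_k(ranked_results: Sequence[Sequence[int]], k: int) -> float:
    """Share of queries whose own row number appears among their top-k results.

    ranked_results[i] holds the corpus row numbers returned for query i,
    best match first; the correct answer for query i is assumed to be row i.
    """
    if not ranked_results:
        raise ValueError("ranked_results must not be empty")
    hits = sum(1 for query_id, ranked in enumerate(ranked_results)
               if query_id in ranked[:k])
    return hits / len(ranked_results)


# Toy rankings for three queries (not taken from the notebook's output).
toy_rankings: List[List[int]] = [[0, 5, 2], [3, 1, 0], [7, 8, 9]]
rate_at_1 = hit_rate_at_k(toy_rankings, k=1)
rate_at_2 = hit_rate_at_k(toy_rankings, k=2)
print(rate_at_1, rate_at_2)
```

Query 0 finds its translation at rank 1, query 1 at rank 2, and query 2 not at all, which gives 1/3 at k=1 and 2/3 at k=2.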

#### Cross-lingual

Issue queries in a different language from the indexed corpus.

In [11]:
sample_query = 'The stock market fell four points.'
index_language = 'Spanish'  #@param ["English", "Spanish"]
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(sample_query)
search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)

print('{} sentences similar to: "{}"\n'.format(index_language, sample_query))
search_results

Spanish sentences similar to: "The stock market fell four points."



['Los precios del oro incluso alcanzaron recientemente un récord de 1.300 dólares.',
 'Pero el acuerdo alcanzado tiene tres grandes defectos.',
 'Las grandes empresas deben aceptar la responsabilidad por sus acciones.',
 'Si se extendiera la política a empresas de terceros países, esto tendría un fuerte impacto liberalizador.',
 'Después me distraje durante aproximadamente 40 años.',
 'Ese aumento de los activos permitirá, a su vez, a los mercados crediticios locales, como, por ejemplo, el de la microfinanciación comenzar a funcionar.',
 'Desde entonces, el índice ha trepado por encima de 10.000.',
 'También existe el un creciente peligro del terrorismo interno.',
 'Los recursos de inteligencia se han redireccionado.',
 'Desde que aparecieron sus artículos, el precio del oro aumentó aún más.']

In [12]:
sample_query = 'Desde entonces, el índice ha trepado por encima de 10.000.'
index_language = 'English'  #@param ["English", "Spanish"]
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(sample_query)
search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)

print('{} sentences similar to: "{}"\n'.format(index_language, sample_query))
search_results

English sentences similar to: "Desde entonces, el índice ha trepado por encima de 10.000."



['Since then, the index has climbed above 10,000.',
 'His ratings have dipped below 50% for the first time.',
 'Since their articles appeared, the price of gold has moved up still further.',
 'But where will increased competitiveness come from?',
 'Gold prices even hit a record-high $1,300 recently.',
 'At $1,300, today’s price is probably more than double very long-term, inflation-adjusted, average gold prices.',
 'After adjusting for inflation, today’s price is nowhere near the all-time high of January 1980.',
 'International cooperation has increased markedly, in part because governments that cannot agree on many things can agree on the need to cooperate in this area.',
 'Last December, many gold bugs were arguing that the price was inevitably headed for $2,000.',
 'Of course, there are obvious differences between 1989 and now.']

#### Mixed-corpus capabilities

Issue a query in English and the results will come from any of the indexed languages.

In [13]:
query = 'The stock market fell four points.'  #@param {type:"string"}
num_results = 10  #@param {type:"slider", min:0, max:100, step:10}

query_embedding = embed_text(query)
search_results = combined_index.nearest(query_embedding, n=num_results)

print('Multilingual sentences similar to: "{}"\n'.format(query))
search_results

Multilingual sentences similar to: "The stock market fell four points."



['(English) Since their articles appeared, the price of gold has moved up still further.',
 '(English) Intelligence assets have been redirected.',
 '(English) But, globally, our innovation system needs much bigger changes.',
 '(English) But neither improved competitiveness, nor reduction of total debt, can be achieved overnight.',
 '(English) As legions of new consumers gain purchasing power, demand inevitably rises, driving up the price of scarce commodities.',
 '(English) Other IMF accounting practices, including how the capital expenditures of government-owned enterprises are treated, are also causing outrage. If a state owned enterprise',
 '(Spanish) Los precios del oro incluso alcanzaron recientemente un récord de 1.300 dólares.',
 '(English) His ratings have dipped below 50% for the first time.',
 '(English) With serious management of the new funds, food production in Africa will soar.',
 '(Spanish) Pero el acuerdo alcanzado tiene tres grandes defectos.']
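
Each item in the mixed-language index is stored as `'(Language) sentence'`, so the language of a result can be recovered from its prefix. The sketch below splits that prefix and counts results per language. `split_language_tag` is a hypothetical helper, and the three strings are copied from the output above.

```python
import re
from collections import Counter
from typing import Tuple

# Matches the '(Language) sentence' format used when filling the mixed index.
TAG_PATTERN = re.compile(r"^\((?P<language>[^)]+)\) (?P<sentence>.*)$", re.DOTALL)


def split_language_tag(annotated: str) -> Tuple[str, str]:
    """Split '(Language) sentence' into (language, sentence)."""
    match = TAG_PATTERN.match(annotated)
    if match is None:
        raise ValueError("missing '(Language) ' prefix: {!r}".format(annotated))
    return match.group("language"), match.group("sentence")


toy_results = [
    "(English) His ratings have dipped below 50% for the first time.",
    "(Spanish) Pero el acuerdo alcanzado tiene tres grandes defectos.",
    "(English) Intelligence assets have been redirected.",
]
language_counts = Counter(split_language_tag(r)[0] for r in toy_results)
print(language_counts)
```

In the ten results printed above, eight carry the English tag and two the Spanish tag.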

[1] J. Tiedemann, 2012, [Parallel Data, Tools and Interfaces in OPUS](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf). In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).