# Multilingual Universal Sentence Encoder (MUSE)

## What is it?

* A multilingual text embedding model that converts text into a 512-dimensional vector representation

## Salient Features

* Supports 16 languages. The model is language-agnostic, meaning the language of the input text does not have to be specified

* Optimized for multi-word text like sentences and paragraphs. 

## How to use it?

* The pre-trained model is available as a module on TensorFlow Hub.
* See the demo sections below for a simple example

## Limitations

* The model is fairly large, so inference can be slow on large datasets. For those, use TensorFlow's input pipeline API (`tf.data`) and run the model on batches of text rather than all at once
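
The batching advice above can be sketched roughly as follows. This is a minimal illustration, not a tuned pipeline: `embed_in_batches` and `fake_embed` are hypothetical names introduced here, the batch size of 64 is an arbitrary assumption, and `fake_embed` is only a stand-in so the snippet runs without downloading the model.

```python
import numpy as np
import tensorflow as tf


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed a non-empty list of strings batch by batch.

    `embed_fn` is any callable that maps a batch of strings to a
    (batch, dim) tensor, for example the module loaded with hub.load().
    Returns a single (len(texts), dim) NumPy array.
    """
    # tf.data slices the list into string tensors of at most `batch_size` items.
    dataset = tf.data.Dataset.from_tensor_slices(list(texts)).batch(batch_size)
    # Run the model once per batch and collect the results on the host.
    chunks = [np.asarray(embed_fn(batch)) for batch in dataset]
    return np.concatenate(chunks, axis=0)


def fake_embed(batch):
    """Hypothetical stand-in for the real model: one 4-dim row per string."""
    lengths = tf.cast(tf.strings.length(batch), tf.float32)
    return tf.tile(lengths[:, None], [1, 4])


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors.shape)
```

With the real model loaded as `embed`, the call would be `embed_in_batches(sentences, embed)`. Collecting all batches in memory is a simplification; for very large datasets the per-batch results could be written to disk instead.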


## Resources
* https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
* https://arxiv.org/abs/1907.04307
* https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html


## Demo with English Sentences

In [32]:
# Example data
sentences = ["I have a dog","I would love to have a pet","I live in a big city"]

In [33]:
# Load the model
import tensorflow_hub as hub
import numpy as np
import tensorflow_text  # not used directly, but registers the text ops the model needs

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
embed = hub.load(module_url)

In [34]:
# Compute embeddings for our sample data
embeddings = embed(sentences)

In [35]:
embeddings

<tf.Tensor: shape=(3, 512), dtype=float32, numpy=
array([[ 0.06540088, -0.03345164,  0.03330484, ..., -0.02112507,
         0.04501928, -0.02898191],
       [ 0.02742887, -0.01307626, -0.01204817, ..., -0.01096772,
         0.01694283,  0.09336831],
       [ 0.05165894, -0.05038729, -0.0263377 , ..., -0.07095326,
         0.01125661, -0.06504603]], dtype=float32)>

### Use embeddings to compute sentence similarity scores
Embeddings returned by MUSE are approximately normalized, so the cosine similarity of two sentences can be approximated by the dot product of their embeddings.

In [36]:
print(f"Similarity score between \"{sentences[0]}\" and \"{sentences[1]}\" is {np.dot(embeddings[0],embeddings[1])}")
print(f"Similarity score between \"{sentences[0]}\" and \"{sentences[2]}\" is {np.dot(embeddings[0],embeddings[2])}")
print(f"Similarity score between \"{sentences[1]}\" and \"{sentences[2]}\" is {np.dot(embeddings[1],embeddings[2])}")


Similarity score between "I have a dog" and "I would love to have a pet" is 0.6374889016151428
Similarity score between "I have a dog" and "I live in a big city" is 0.34745800495147705
Similarity score between "I would love to have a pet" and "I live in a big city" is 0.26571404933929443
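
The pairwise scores above rely on the embeddings being approximately unit length. When exact cosine similarity is wanted, or when all pairs should be scored at once, the vectors can be normalized explicitly. The sketch below uses a hypothetical helper, `cosine_similarity_matrix`, and toy vectors rather than real model output.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Scale every row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy vectors (not model output): both rows have norm 5.
toy = np.array([[3.0, 4.0], [4.0, 3.0]])
print(cosine_similarity_matrix(toy, toy))  # diagonal is 1, off-diagonal is about 0.96
```

In the notebook, `cosine_similarity_matrix(embeddings, embeddings)` would give all three sentence pairs in one call, with values close to the dot products printed above.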


## Demo with multilingual data

Now let's see what happens when the input data is in multiple languages (Spanish, French, and Italian here).

In [37]:
sentences_2 = ["tengo un perro", "J'aimerais avoir un animal de compagnie", "Vivo in una grande città"]

In [38]:
embeddings_2 = embed(sentences_2)

### Examine similarity scores between sentences in different languages

In [39]:
# Similarity scores for direct translations
print(f"Similarity between \"{sentences[0]}\" and its direct translation into Spanish is {np.dot(embeddings[0],embeddings_2[0])}")
print(f"Similarity between \"{sentences[1]}\" and its direct translation into French is {np.dot(embeddings[1],embeddings_2[1])}")
print(f"Similarity between \"{sentences[2]}\" and its direct translation into Italian is {np.dot(embeddings[2],embeddings_2[2])}")


Similarity between "I have a dog" and its direct translation into Spanish is 0.922921895980835
Similarity between "I would love to have a pet" and its direct translation into French is 0.8489859700202942
Similarity between "I live in a big city" and its direct translation into Italian is 0.9239956140518188
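
Because direct translations score high, one possible use of these scores is matching each sentence to its closest counterpart in another language. The sketch below shows the idea on toy vectors; `best_matches` is a hypothetical helper, and it uses the plain dot product on the assumption that the inputs are approximately normalized, as the model's embeddings are.

```python
import numpy as np


def best_matches(query_vecs, candidate_vecs):
    """For each query row, return the index and score of the best candidate row."""
    # Dot product as an approximation of cosine similarity for near-unit vectors.
    scores = np.asarray(query_vecs) @ np.asarray(candidate_vecs).T
    return scores.argmax(axis=1), scores.max(axis=1)


# Toy unit vectors standing in for sentence embeddings.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.0, 1.0], [1.0, 0.0]])
indices, scores = best_matches(queries, candidates)
print(indices, scores)
```

In the notebook, `best_matches(embeddings, embeddings_2)` would pair each English sentence with its highest-scoring sentence from `sentences_2`.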


In [40]:
# Similarity scores for different sentences, within one language and across languages
print(f"Similarity score between \"{sentences[0]}\" and \"{sentences[1]}\" is {np.dot(embeddings[0],embeddings[1])}")
print(f"Similarity score between \"{sentences[0]}\" and \"{sentences_2[1]}\" is {np.dot(embeddings[0],embeddings_2[1])}")
print(f"Similarity score between \"{sentences_2[0]}\" and \"{sentences_2[1]}\" is {np.dot(embeddings_2[0],embeddings_2[1])}")


Similarity score between "I have a dog" and "I would love to have a pet" is 0.6374889016151428
Similarity score between "I have a dog" and "J'aimerais avoir un animal de compagnie" is 0.5772875547409058
Similarity score between "tengo un perro" and "J'aimerais avoir un animal de compagnie" is 0.5664279460906982
