
GH-852: Add crosslingual MUSE embeddings #853

Merged (1 commit, Jul 1, 2019)

Conversation

alanakbik
Collaborator

Closes #852

Use the new MuseCrosslingualEmbeddings() class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes, the class first detects the language of the sentence to be embedded and then embeds it with the appropriate language-specific embeddings. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may vary widely.

Here's how to embed:

from flair.data import Sentence
from flair.embeddings import MuseCrosslingualEmbeddings
import torch

# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()

# two parallel sentences in different languages
sentence_1 = Sentence("This red shoe is new .")
sentence_2 = Sentence("Dieser rote Schuh ist neu .")

# language code is auto-detected
print(sentence_1.get_language_code())
print(sentence_2.get_language_code())

# embed sentences
embeddings.embed([sentence_1, sentence_2])

# print token-by-token cosine similarities
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token_1, token_2 in zip(sentence_1, sentence_2):
    print(f"'{token_1.text}' and '{token_2.text}' similarity: {cos(token_1.embedding, token_2.embedding)}")
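To illustrate the idea behind the class, here is a minimal, self-contained sketch of the mechanism: each language has its own word-vector table already aligned into a shared space, a language-detection step picks the right table, and lookups from different languages then land in the same space. All names and vectors here are hypothetical toy values, not the real MUSE vectors, and the language detector is a stand-in for the real one:

```python
import torch

# Toy aligned embedding tables (hypothetical 2-d vectors). In real MUSE,
# each language's fastText vectors are mapped into one shared space.
ALIGNED_VECTORS = {
    "en": {"red": torch.tensor([0.90, 0.10]), "shoe": torch.tensor([0.10, 0.90])},
    "de": {"rot": torch.tensor([0.88, 0.12]), "schuh": torch.tensor([0.12, 0.88])},
}

def detect_language(tokens):
    # Stand-in for real language detection: pick the first table that
    # covers every token. (flair uses a proper language detector.)
    for lang, table in ALIGNED_VECTORS.items():
        if all(t.lower() in table for t in tokens):
            return lang
    return "en"

def embed(tokens):
    # Detect the language, then look each token up in that language's
    # aligned table; unknown tokens get a zero vector.
    table = ALIGNED_VECTORS[detect_language(tokens)]
    zero = torch.zeros(2)
    return [table.get(t.lower(), zero) for t in tokens]

# Translation pairs from different languages end up close together:
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
en_vecs = embed(["red", "shoe"])
de_vecs = embed(["rot", "Schuh"])
for v1, v2 in zip(en_vecs, de_vecs):
    print(round(cos(v1, v2).item(), 3))
```

Because the per-language tables share one space, a model trained on vectors from one language can consume vectors from another at prediction time, which is what makes the "train once, run on 30 languages" usage possible.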

@yosipk
Collaborator

yosipk commented Jul 1, 2019

👍

@alanakbik
Collaborator Author

👍

@alanakbik alanakbik merged commit 34f2490 into master Jul 1, 2019
@alanakbik alanakbik deleted the GH-852-crosslingual-muse branch July 16, 2019 10:03