In [1]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
import json
import os
from sentence_transformers import SentenceTransformer, util

In [3]:
# Load the papers dataset
file = 'emnlp2016-2018.json'

if not os.path.exists(file):
  util.http_get('https://sbert.net/datasets/emnlp2016-2018.json',file)

with open(file) as f:
  papers = json.load(f)

print(len(papers), "papers loaded")

974 papers loaded


In [11]:
with open('./emnlp2016-2018.json','r') as f:
  d = json.load(f)

In [13]:
import pandas as pd
df = pd.DataFrame(d)
df.head()

   title                                              abstract                                           url                                   venue  year
0  Rule Extraction for Tree-to-Tree Transducers b...  Finite-state transducers give efficient repres...  http://aclweb.org/anthology/D16-1002  EMNLP  2016
1  A Neural Network for Coordination Boundary Pre...  We propose a neural-network based model for co...  http://aclweb.org/anthology/D16-1003  EMNLP  2016
2  Distinguishing Past, On-going, and Future Even...  The tremendous amount of user generated data t...  http://aclweb.org/anthology/D16-1005  EMNLP  2016
3  Nested Propositions in Open Information Extrac...  We introduce Graphene, an Open IE system whose...  http://aclweb.org/anthology/D16-1006  EMNLP  2016
4  Learning to Recognize Discontiguous Entities       This paper focuses on the study of recognizing...  http://aclweb.org/anthology/D16-1008  EMNLP  2016


In [17]:
# Load the model

model = SentenceTransformer('allenai-specter')

# To encode the papers, we combine each title and abstract
# into a single string, separated by [SEP]

paper_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]
paper_texts[:3]

['Rule Extraction for Tree-to-Tree Transducers by Cost Minimization[SEP]Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restrictions encountered, without involving the use of a large set of complex rules difficult to analyze. We here show that these representations can be made very compact, indicate how to perform the corresponding minimization, and point out interesting linguistic side-effects of this operation.',
 'A Neural Network for Coordination Boundary Prediction[SEP]We propose a neural-network based model for coordination boundary prediction. The network is designed to incorporate two signals: the similarity between conjuncts and the observation that replacing the whole coordination phrase with a conjunct tends to produce a coherent sentences. The modeling makes use of several LSTM networks. The model is trained solely on conjunction annotations in a Treebank, without using external resources.
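
The title/abstract join above is repeated inside the search function further down, so a small helper can keep the two in sync. The sketch below is an assumption, not part of the original notebook: `build_paper_text` is a hypothetical name, and falling back to the title alone when the abstract is missing is an illustrative choice.

```python
def build_paper_text(title, abstract, sep='[SEP]'):
  """Join a title and an abstract into the single string passed to the encoder."""
  # Assumption: strip stray whitespace and tolerate a missing abstract
  title = (title or '').strip()
  abstract = (abstract or '').strip()
  # Same format as the notebook: title + '[SEP]' + abstract
  return title + sep + abstract if abstract else title

# Example with made-up strings
print(build_paper_text('Some title', 'Some abstract'))
```

With such a helper, the list comprehension could be written as `[build_paper_text(p['title'], p['abstract']) for p in papers]`.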

In [18]:
# Encode the papers
embeddings = model.encode(paper_texts,convert_to_tensor=True)
print("Shape of embeddings:",embeddings.shape)

Shape of embeddings: torch.Size([974, 768])


In [19]:
# Define a function to search for similar papers given a title and abstract

def search_papers(title,abstract):
  query_embedding = model.encode(title+'[SEP]'+abstract,convert_to_tensor=True)

  search_hits = util.semantic_search(query_embedding,embeddings)
  search_hits = search_hits[0]

  print("Paper:",title)
  print("Most similar papers:")
  for hit in search_hits:
    related_paper = papers[hit['corpus_id']]
    print('{:.2f}\t{}\t{} {}'.format(hit['score'],related_paper['title'],
                                     related_paper['venue'],
                                     related_paper['year']))
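
For intuition, `util.semantic_search` with its default settings scores every corpus embedding against the query by cosine similarity and returns the 10 highest-scoring hits as dicts with `corpus_id` and `score` keys. A minimal pure-Python sketch of that ranking on made-up 2-dimensional vectors follows; `cosine_similarity`, `top_k_similar`, and the toy data are hypothetical names introduced here, and the library itself works on batched tensors rather than Python lists.

```python
import math

def cosine_similarity(a, b):
  """Cosine similarity between two equal-length vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(y * y for y in b))
  return dot / (norm_a * norm_b)

def top_k_similar(query_vec, corpus_vecs, top_k=10):
  """Rank corpus vectors by cosine similarity to the query, best first."""
  hits = [
    {'corpus_id': idx, 'score': cosine_similarity(query_vec, vec)}
    for idx, vec in enumerate(corpus_vecs)
  ]
  hits.sort(key=lambda hit: hit['score'], reverse=True)
  return hits[:top_k]

# Toy 2-dimensional "embeddings" (made up for illustration)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_similar([1.0, 0.2], toy_corpus, top_k=2)
for hit in toy_hits:
  print('{:.2f}\tcorpus_id={}'.format(hit['score'], hit['corpus_id']))
```

The hit format mirrors what `search_papers` reads back via `hit['corpus_id']` and `hit['score']`.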

In [20]:
# This paper was the EMNLP 2019 Best Paper
search_papers(title='Specializing Word Embeddings (for Parsing) by Information Bottleneck',
              # Adjacent string literals are concatenated, so no indentation
              # whitespace ends up inside the abstract
              abstract=('Pre-trained word embeddings like ELMo and BERT contain '
                        'rich syntactic and semantic information, resulting in '
                        'state-of-the-art performance on various tasks. We propose a very '
                        'fast variational information bottleneck (VIB) method to '
                        'nonlinearly compress these embeddings, keeping only the '
                        'information that helps a discriminative parser. We compress each '
                        'word embedding to either a discrete tag or a continuous vector. '
                        'In the discrete version, our automatically compressed tags form '
                        'an alternative tag set: we show experimentally that our tags '
                        'capture most of the information in traditional POS tag '
                        'annotations, but our tag sequences can be parsed more accurately '
                        'at the same level of tag granularity. In the continuous version, '
                        'we show experimentally that moderately compressing the word '
                        'embeddings by our method yields a more accurate parser in 8 of 9 '
                        'languages, unlike simple dimensionality reduction.'))


Paper: Specializing Word Embeddings (for Parsing) by Information Bottleneck
Most similar papers:
0.88	An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing	EMNLP 2018
0.87	NORMA: Neighborhood Sensitive Maps for Multilingual Word Embeddings	EMNLP 2018
0.87	Generalizing Word Embeddings using Bag of Subwords	EMNLP 2018
0.87	Word Embeddings for Code-Mixed Language Processing	EMNLP 2018
0.87	LAMB: A Good Shepherd of Morphologically Rich Languages	EMNLP 2016
0.87	Word Mover's Embedding: From Word2Vec to Document Embedding	EMNLP 2018
0.87	Charagram: Embedding Words and Sentences via Character n-grams	EMNLP 2016
0.87	Segmentation-Free Word Embedding for Unsegmented Languages	EMNLP 2017
0.86	Addressing Troublesome Words in Neural Machine Translation	EMNLP 2018
0.86	Conditional Word Embedding and Hypothesis Testing via Bayes-by-Backprop	EMNLP 2018


In [21]:
# EMNLP 2020 paper on making Sentence-BERT multilingual
search_papers(title='Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation',
              abstract='We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training is lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embeddings models to more than 400 languages is publicly available.')


Paper: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Most similar papers:
0.90	Sentence Compression for Arbitrary Languages via Multilingual Pivoting	EMNLP 2018
0.90	Learning Crosslingual Word Embeddings without Bilingual Corpora	EMNLP 2016
0.89	Unsupervised Multilingual Word Embeddings	EMNLP 2018
0.89	InferLite: Simple Universal Sentence Representations from Natural Language Inference Data	EMNLP 2018
0.88	Improving Cross-Lingual Word Embeddings by Meeting in the Middle	EMNLP 2018
0.88	Dynamic Meta-Embeddings for Improved Sentence Representations	EMNLP 2018
0.88	Porting an Open Information Extraction System from English to German	EMNLP 2016
0.88	Unsupervised Statistical Machine Translation	EMNLP 2018
0.87	Contextual Parameter Generation for Universal Neural Machine Translation	EMNLP 2018
0.87	Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations	EMNLP 2018
