# Using embeddings to find missing citations in papers

In this notebook we will be exploring the problem of finding missing citations automatically using text embeddings (powered by `intfloat/multilingual-e5-large-instruct`) and the S2ORC API for acquiring paper metadata.

The focus will be on using citing sentences (which we know are citations) to find the papers they are citing, as opposed to classifying whether a sentence is a citation or not.

We will be using the `database` module, a wrapper over an Elasticsearch instance, to index text as embeddings and to perform kNN searches.
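The `database` module itself is not shown in this notebook. As a rough idea of what such a wrapper might send to Elasticsearch, here is a minimal sketch that builds the `knn` clause of a search request. The helper name `build_knn_clause` and the `num_candidates` default are assumptions made for illustration, not part of the module.

```python
from typing import Sequence


def build_knn_clause(query_vector: Sequence[float], field: str,
                     k: int = 10, num_candidates: int = 100) -> dict:
    """Build the `knn` clause of an Elasticsearch 8.x search request.

    Hypothetical helper: the real `VectorDatabase.knn_search` may differ.
    """
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "field": field,
        # plain floats, so the clause is JSON-serializable even for NumPy input
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }


clause = build_knn_clause([0.1, 0.2, 0.3], "abstract_embedding", k=2, num_candidates=10)
print(clause)
# With the official Python client, the clause could then be passed as
# es.search(index="test_index", knn=clause)
```

Keeping the clause construction separate from the client call makes it easy to inspect what is being searched without a running cluster.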

In [2]:
import os
import requests

%load_ext autoreload
%autoreload 2

S2ORC_API_KEY = os.getenv("S2ORC_API_KEY")

from database.vector_database import VectorDatabase

db = VectorDatabase('http://localhost:9200')

db.es.indices.delete(index="paper_abstracts", ignore_unavailable=True)


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Connected to Elasticsearch!
{'name': '82cd70c8ec7d', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'PSnM0wksRKWsSl_HOGhSbA', 'version': {'number': '8.15.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'f97532e680b555c3a05e73a74c28afb666923018', 'build_date': '2024-10-09T22:08:00.328917561Z', 'build_snapshot': False, 'lucene_version': '9.11.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


ObjectApiResponse({'acknowledged': True})

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large-instruct')

model.device

device(type='cuda', index=0)

## Acquiring the data

We will be using the Semantic Scholar Open Research Corpus (S2ORC) API to acquire the data we need.

Let's start by getting a paper's full text, citations, and references.

In [2]:
from paper_text_extractor.paper_text_extractor import get_paper_text

pdf_text = get_paper_text("../data/loose_pdfs/transformer_model.pdf", remove_references=True)

print(pdf_text)

arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.

Attention Is All You Need

Ashish Vaswani∗
Google Brain
avaswani@google.com

Noam Shazeer∗
Google Brain
noam@google.com

Niki Parmar∗
Google Research
nikip@google.com

Jakob Uszkoreit∗
Google Research
usz@google.com

Llion Jones∗
Google Research
llion@google.com

Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu

Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com

Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com

Abstract

The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely o

In [19]:
# Get information about the paper, including citations and references
rsp = requests.get('https://api.semanticscholar.org/graph/v1/paper/search/match',
                   headers={'X-API-KEY': S2ORC_API_KEY},
                   params={'query': 'Attention Is All You Need',
                           'limit': 1, 'fields': 'title,abstract,year,references,references.abstract,references.title,references.year'})

paper_info = rsp.json()['data'][0]

year = paper_info['year']

# Turn the response into a list of papers, keeping only those that have both a paperId and an abstract
references = [reference for reference in paper_info['references'] if reference['paperId'] and reference['abstract']]

print(f"Found {len(references)} references. Samples:")
[reference['title'] for reference in references[:5]]

Found 39 references. Samples:


['A Deep Reinforced Model for Abstractive Summarization',
 'Convolutional Sequence to Sequence Learning',
 'Massive Exploration of Neural Machine Translation Architectures',
 'A Structured Self-attentive Sentence Embedding',
 'Factorization tricks for LSTM networks']

In [3]:
papers = references
len(papers)

39

In [6]:
db.create_index("test_index", {'properties': {
                                    'abstract_embedding': {
                                        'type': 'dense_vector',
                                        'similarity': 'cosine'
                                    },
                                    'title_embedding': {
                                        'type': 'dense_vector',
                                        'similarity': 'cosine'
                                    }
                                }
                            })

abstract_embeddings = model.encode([paper['abstract'] for paper in papers], normalize_embeddings=True)
title_embeddings = model.encode([paper['title'] for paper in papers], normalize_embeddings=True)

# add embeddings to the paper dicts
for paper, abstract_embedding, title_embedding in zip(papers, abstract_embeddings, title_embeddings):
    paper['abstract_embedding'] = abstract_embedding
    paper['title_embedding'] = title_embedding

db.insert_documents_in_index(papers, "test_index")

ObjectApiResponse({'errors': False, 'took': 200, 'items': [{'index': {'_index': 'test_index', '_id': 'B30IQpUBg0RW2KZAO9IN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'test_index', '_id': 'CH0IQpUBg0RW2KZAO9IN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'test_index', '_id': 'CX0IQpUBg0RW2KZAO9IN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'test_index', '_id': 'Cn0IQpUBg0RW2KZAO9IN', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'test_index', '_id': 'C30IQpUBg0RW2KZAO9IN', '_version': 1, 'result': 'created', '_shard

## Gathering other papers

Indexing only the paper's own references is not enough, because every candidate in the index would then be a paper it actually cites. We also need to index other papers that the paper does not cite, so that the search has to pick the right reference out of a much larger pool.
This will help us evaluate whether the embeddings produce meaningful results, free of false positives.

In [3]:
# index more papers from the S2ORC dataset
from datasets import load_dataset

ds = load_dataset("sentence-transformers/s2orc", "title-abstract-pair")['train']


In [15]:
# grab 50000 papers
ds = ds.shuffle(seed=42).select(range(50000))

In [16]:
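# generate and index embeddings in batches of 256
batch_size = 256

for i in range(0, len(ds), batch_size):
    end = min(i + batch_size, len(ds))
    print(f"Indexing papers {i} to {end}")
    # slicing the dataset returns a dict of column lists for just this batch
    batch = ds[i:end]
    batch_abstracts = batch['abstract']
    batch_titles = batch['title']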

    abstract_embeddings = model.encode(batch_abstracts, normalize_embeddings=True)
    title_embeddings = model.encode(batch_titles, normalize_embeddings=True)

    documents = []

    for j in range(0, len(batch_abstracts)):
        documents.append({
            'title': batch_titles[j],
            'abstract': batch_abstracts[j],
            'abstract_embedding': abstract_embeddings[j],
            'title_embedding': title_embeddings[j],
        })

    db.insert_documents_in_index(documents, "test_index")

Indexing papers 0 to 256
Indexing papers 256 to 512
Indexing papers 512 to 768
Indexing papers 768 to 1024
Indexing papers 1024 to 1280
Indexing papers 1280 to 1536
Indexing papers 1536 to 1792
Indexing papers 1792 to 2048
Indexing papers 2048 to 2304
Indexing papers 2304 to 2560
Indexing papers 2560 to 2816
Indexing papers 2816 to 3072
Indexing papers 3072 to 3328
Indexing papers 3328 to 3584
Indexing papers 3584 to 3840
Indexing papers 3840 to 4096
Indexing papers 4096 to 4352
Indexing papers 4352 to 4608
Indexing papers 4608 to 4864
Indexing papers 4864 to 5120
Indexing papers 5120 to 5376
Indexing papers 5376 to 5632
Indexing papers 5632 to 5888
Indexing papers 5888 to 6144
Indexing papers 6144 to 6400
Indexing papers 6400 to 6656
Indexing papers 6656 to 6912
Indexing papers 6912 to 7168
Indexing papers 7168 to 7424
Indexing papers 7424 to 7680
Indexing papers 7680 to 7936
Indexing papers 7936 to 8192
Indexing papers 8192 to 8448
Indexing papers 8448 to 8704
Indexing papers 8704 to

## Evaluation

We will now perform kNN searches on the indexed papers. The queries are sentences from the paper we extracted earlier that we know cite another paper, with the citation itself left out.

The embedding model we are using is an instruction-based model, so we will also include the task description in the query.
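To make the search step concrete: with unit-length embeddings, cosine similarity is just a dot product, and an exact kNN search sorts the documents by it. The sketch below uses toy vectors invented for illustration. Elasticsearch answers the same question with an approximate index rather than this brute-force loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def knn(query, docs, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [(title, sum(a * b for a, b in zip(q, normalize(vec))))
              for title, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# toy 3-dimensional "embeddings", invented for this example
docs = {
    "paper A": [1.0, 0.0, 0.0],
    "paper B": [0.7, 0.7, 0.0],
    "paper C": [0.0, 0.0, 1.0],
}
top = knn([0.9, 0.1, 0.0], docs, k=2)
print([title for title, _ in top])
```

The query vector points mostly along the first axis, so "paper A" ranks first and "paper B", which shares part of that direction, ranks second.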

In [3]:
citation_to_paper = {
    "We used the Adam optimizer.":
        "Adam: A Method for Stochastic Optimization",
    "In contrast to RNN sequence-to-sequence models":
        "Grammar as a foreign language",
    "the Transformer outperforms the Berkeley-Parser":
        "Learning accurate, compact, and interpretable tree annotation",
    "We trained a 4-layer transformer with dmodel = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank":
        "Building a Large Annotated Corpus of English: The Penn Treebank",
    "we replace our sinusoidal positional encoding with learned positional embeddings":
        "Convolutional Sequence to Sequence Learning",
    "Recent work has achieved significant improvements in computational efficiency through factorization tricks":
        "Factorization tricks for LSTM networks",
}

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

def extract_citation(query: str) -> str:
    return query.split("Query: ")[1]

task = "Given a sentence where a paper is cited, find the abstract of the paper it cites."

queries = [get_detailed_instruct(task, query) for query in citation_to_paper.keys()]

In [4]:
# Search for the query in the abstracts and titles of the indexed papers
results = {}

for query in queries:
    query_embedding = model.encode(query, normalize_embeddings=True)

    abstract_results = db.knn_search(query_embedding, "test_index", "abstract_embedding", k=50)
    title_results = db.knn_search(query_embedding, "test_index", "title_embedding", k=50)

    results[query] = (abstract_results, title_results)

    citation = extract_citation(query)
    cited_paper = citation_to_paper[citation]

    title_results = [result['_source']['title'] for result in title_results]
    abstract_results = [result['_source']['title'] for result in abstract_results]

    print(f"Query: {citation}\n")
    print(f"Cited paper: {cited_paper}")

    found = False

    if cited_paper in title_results:
        print("Found in title results")
        print(f"Position: {title_results.index(cited_paper) + 1}\n")
        found = True

    if cited_paper in abstract_results:
        print("Found in abstract results")
        print(f"Position: {abstract_results.index(cited_paper) + 1}\n")
        found = True

    if not found:
        print("Not found in results")
        print(title_results)
        print(abstract_results)

    print("")

Query: We used the Adam optimizer.

Cited paper: Adam: A Method for Stochastic Optimization
Found in title results
Position: 2

Found in abstract results
Position: 1


Query: In contrast to RNN sequence-to-sequence models

Cited paper: Grammar as a foreign language
Not found in results
['Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling', 'Sequence to Sequence Learning with Neural Networks', 'Convolutional Sequence to Sequence Learning', 'Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation', 'Multi-task Sequence to Sequence Learning', 'Generating Sequences With Recurrent Neural Networks', 'Effective Approaches to Attention-based Neural Machine Translation', 'Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation', 'Exploring the Limits of Language Modeling', 'Long Short-Term Memory-Networks for Machine Reading']
['Convolutional Sequence to Sequence Learning', 'Sequence to Sequence Learning 

## Results

The model shows fairly strong results, even without reranking (which might move the correct result further up) or filtering (e.g., some retrieved papers might be newer than the citing paper).

For the most part, the correct paper shows up within the first few results, with outliers such as "Grammar as a foreign language", whose citing sentence does not carry enough context for the paper to show up at all. This is reasonable, as the sentence is vague and could refer to a number of papers.
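Positions like the ones printed above can be summarized with standard retrieval metrics. The following is a minimal sketch of mean reciprocal rank (MRR) and recall@k. The titles in `rankings` are placeholders, not results from this notebook.

```python
def reciprocal_rank(ranked_titles, target):
    """1 / position of the target (1-based), or 0.0 if it is absent."""
    for position, title in enumerate(ranked_titles, start=1):
        if title == target:
            return 1.0 / position
    return 0.0


def recall_at_k(rankings, k):
    """Fraction of queries whose target appears in the top k results."""
    hits = sum(target in ranked[:k] for ranked, target in rankings)
    return hits / len(rankings)


# (ranked titles, expected title) pairs; placeholder titles, not real results
rankings = [
    (["p1", "p2", "p3"], "p1"),
    (["p4", "p5", "p6"], "p5"),
    (["p7", "p8", "p9"], "p0"),
]
mrr = sum(reciprocal_rank(r, t) for r, t in rankings) / len(rankings)
print(f"MRR: {mrr:.3f}, recall@1: {recall_at_k(rankings, 1):.3f}")
```

A query whose cited paper never appears, as with the vague sentence discussed above, contributes zero to both metrics.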

## Reranking and filtering

Now that we have a list of fairly relevant papers, we can use a reranking model to improve their order. Rerankers are fairly slow, but they can improve the ranking significantly.

We will use the `BAAI/bge-reranker-v2-m3` model for this purpose.
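Before reranking, it can help to merge the abstract and title hits into a single candidate list, drop duplicates, and filter out papers published after the citing paper. Here is a minimal sketch of that step. The `year` field on each hit is an assumption, since not every indexed document in this notebook carries one.

```python
def merge_candidates(abstract_hits, title_hits, citing_year):
    """Merge two hit lists, keeping the first occurrence of each title and
    dropping papers published after the citing paper."""
    merged = {}
    for hit in abstract_hits + title_hits:
        source = hit["_source"]
        year = source.get("year")
        # keep papers with an unknown year rather than guessing
        if year is not None and year > citing_year:
            continue
        merged.setdefault(source["title"], source)
    return list(merged.values())


# made-up hits in the `_source` shape used by the search results
abstract_hits = [{"_source": {"title": "A", "year": 2014}},
                 {"_source": {"title": "B", "year": 2020}}]
title_hits = [{"_source": {"title": "A", "year": 2014}},
              {"_source": {"title": "C"}}]
candidates = merge_candidates(abstract_hits, title_hits, citing_year=2017)
print([c["title"] for c in candidates])
```

Deduplicating first also means the reranker scores each paper once, which matters given how slow rerankers are.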

In [8]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-v2-m3')
rerank_model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-v2-m3')
rerank_model = rerank_model.to('cuda' if torch.cuda.is_available() else 'cpu')
rerank_model.eval()


XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 1024, padding_idx=1)
      (position_embeddings): Embedding(8194, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-23): 24 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=1024, o

In [20]:
from time import sleep

for query in queries:
    pairs = []

    for result in results[query][0]:
        title = result['_source']['title']
        abstract = ""

        if 'abstract' not in result['_source']:
            res = requests.get('https://api.semanticscholar.org/graph/v1/paper/search/match',
                               headers={'X-API-KEY': S2ORC_API_KEY},
                               params={'query': result['_source']['title'],
                                       'limit': 1, 'fields': 'abstract'})

            if res.status_code == 200 and res.json()['data'][0]['abstract'] is not None:
                abstract = res.json()['data'][0]['abstract']
                sleep(1)

        else:
            abstract = result['_source']['abstract']

        pairs.append([query, title + '\n\n' + abstract])


    for result in results[query][1]:
        title = result['_source']['title']
        abstract = ""

        if 'abstract' not in result['_source']:
            res = requests.get('https://api.semanticscholar.org/graph/v1/paper/search/match',
                               headers={'X-API-KEY': S2ORC_API_KEY},
                               params={'query': result['_source']['title'],
                                       'limit': 1, 'fields': 'abstract,year'})

            if res.status_code == 200 and res.json()['data'][0]['abstract'] is not None:
                abstract = res.json()['data'][0]['abstract']
                sleep(1)
            # skip papers published after the citing paper; the API may return no year (None)
            if res.status_code == 200 and (res.json()['data'][0]['year'] or 0) > year:
                continue
        else:
            abstract = result['_source']['abstract']

        pairs.append([query, title + '\n\n' + abstract])

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    with torch.no_grad():
        inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(device)
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()
        # get list of paper titles sorted by score
        sorted_titles = [pair[0][1].split('\n\n')[0] for pair in sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)]

        cited_paper = citation_to_paper[extract_citation(query)]
        citation = extract_citation(query)

        print(f"Query: {citation}\n")
        print(f"Cited paper: {cited_paper}")

        found = False

        sorted_titles = list(dict.fromkeys(sorted_titles))

        if cited_paper in sorted_titles:
            print("Found in title results")
            print(f"Position: {sorted_titles.index(cited_paper) + 1}\n")
            found = True

        if not found:
            print("Not found in results")
            print(sorted_titles)

        print("")

Query: We used the Adam optimizer.

Cited paper: Adam: A Method for Stochastic Optimization
Found in title results
Position: 1


Query: In contrast to RNN sequence-to-sequence models

Cited paper: Grammar as a foreign language
Not found in results
['Sequence to Sequence Learning with Neural Networks', 'Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling', 'Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation', 'A Deep Reinforced Model for Abstractive Summarization', 'End-To-End Memory Networks', 'Recurrent Neural Network Grammars', 'Multi-task Sequence to Sequence Learning', 'Convolutional Sequence to Sequence Learning', 'Long Short-Term Memory-Networks for Machine Reading', 'Exploring the Limits of Language Modeling', 'Generating Sequences With Recurrent Neural Networks', 'Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation', 'Effective Approaches to Attention-based Neural Machine Translat

Reranking improves the results somewhat, but they are still not perfect. Unfortunately, this reranker is at the limit of what I can run efficiently on my own machine. Going forward, I will attempt to use remote LLMs to reorder these papers.