# Using embeddings to find citations in papers

In this notebook we explore the problem of automatically finding missing citations using text embeddings and the S2ORC API.

We will use the `database` module, a wrapper around an Elasticsearch instance, to index text as embeddings.

In [34]:
import os
import re
from time import sleep

import pypdf
import requests

from database.paper_database import PaperDatabase

db = PaperDatabase("http://localhost:9200")
S2ORC_API_KEY = os.getenv("S2ORC_API_KEY")

Connected to Elasticsearch!
{'name': '82cd70c8ec7d', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'PSnM0wksRKWsSl_HOGhSbA', 'version': {'number': '8.15.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'f97532e680b555c3a05e73a74c28afb666923018', 'build_date': '2024-10-09T22:08:00.328917561Z', 'build_snapshot': False, 'lucene_version': '9.11.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}
API key looks good!
Model loaded and on GPU!


## Acquiring the data

We will be using the Semantic Scholar Open Research Corpus (S2ORC) API to acquire the data we need.

Let's start by getting a paper's full text, citations, and references.

In [36]:
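# Read the PDF of the paper
py_pdf = pypdf.PdfReader("./data/loose_pdfs/BERT.pdf")

# Concatenate all pages into a single string
pdf_text = ''.join(page.extract_text() for page in py_pdf.pages)

# Replace every newline that is not preceded by a period with a space,
# so that only line breaks following a sentence end remain
pdf_text = re.sub(r'(?<!\.)\n', ' ', pdf_text)

# Remove all "- " to rejoin words that were hyphenated across line breaks
# (this also removes the hyphen from compounds that were split at the hyphen)
pdf_text = re.sub(r'- ', '', pdf_text)

# Drop the first paragraph (everything up to the first remaining newline)
pdf_text = pdf_text.split('\n', 1)[1]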

print(pdf_text)

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
1 Introduction Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produ
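
The effect of the two cleanup substitutions above is easier to see on a small string. The following is a minimal sketch: `clean_pdf_text` is a hypothetical helper that only repeats the two `re.sub` calls from the cell above, and the sample text is made up for illustration.

```python
import re


def clean_pdf_text(raw_text: str) -> str:
    """Repeat the two cleanup substitutions from the cell above (hypothetical helper)."""
    # Replace newlines that do not follow a period with a space
    text = re.sub(r'(?<!\.)\n', ' ', raw_text)
    # Remove "- " to rejoin words that were hyphenated across a line break
    return re.sub(r'- ', '', text)


# Made-up sample that mimics text extracted from a PDF
sample = "Language model pre-\ntraining is effective.\nIt improves many\ntasks."
cleaned = clean_pdf_text(sample)

# Each remaining newline now marks a line break that followed a period
print(cleaned.split('\n'))
```

Note that "pre-training" comes out as "pretraining": the second substitution cannot tell a line-break hyphen from a hyphen that belongs to the word.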

In [26]:
# Get information about the paper, including citations and references
rsp = requests.get('https://api.semanticscholar.org/graph/v1/paper/search/match',
                   headers={'X-API-KEY': S2ORC_API_KEY},
                   params={'query': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
                           'limit': 1, 'fields': 'title,abstract,citations,citations.abstract,citations.title,references,references.abstract,references.title'})

paper_info = rsp.json()['data'][0]

# Turn the response into lists of papers, keeping only those that have both a `paperId` and an abstract
citations = [citation for citation in paper_info['citations'] if citation['paperId'] and citation['abstract']]
references = [reference for reference in paper_info['references'] if reference['paperId'] and reference['abstract']]

len(references)

59

## Gathering other papers

Collecting only the citations and references of a paper is not enough. We also need papers that are related to it but that it does not cite.
These let us check whether the embeddings produce meaningful results: a good search should rank the actual references above related but uncited papers, which would otherwise show up as false positives.

In [27]:
# Get papers that the S2ORC API recommends as related to the given paper
def get_recommendations(paper_title):
    # Pause between requests, then resolve the title to a paperId
    sleep(1)
    r = requests.get('https://api.semanticscholar.org/graph/v1/paper/search/match',
                     headers={'X-API-KEY': S2ORC_API_KEY},
                     params={'query': paper_title,
                             'limit': 1, 'fields': 'title'})

    paper_id = r.json()['data'][0]['paperId']
    print(paper_id)

    sleep(1)

    # Fetch up to 100 recommendations for that paper
    r = requests.get(f"https://api.semanticscholar.org/recommendations/v1/papers/forpaper/{paper_id}",
                     headers={'X-API-KEY': S2ORC_API_KEY},
                     params={'fields': 'title,abstract', 'limit': 100, 'from': 'all-cs'})

    return [paper for paper in r.json()['recommendedPapers'] if paper['paperId'] and paper['abstract']]
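
Recommended papers may still overlap with the reference list, so it can be worth filtering them before treating them as "uncited". A minimal sketch, assuming both lists hold dicts with a `paperId` key as in the API responses above; `drop_cited` and the toy data are hypothetical and not part of the original notebook.

```python
def drop_cited(recommended, references):
    """Return the recommended papers whose paperId is not among the references."""
    cited_ids = {reference['paperId'] for reference in references}
    return [paper for paper in recommended if paper['paperId'] not in cited_ids]


# Toy data in the same shape as the API responses (made up for illustration)
toy_references = [{'paperId': 'a1', 'title': 'Cited paper', 'abstract': '...'}]
toy_recommended = [
    {'paperId': 'a1', 'title': 'Cited paper', 'abstract': '...'},
    {'paperId': 'b2', 'title': 'Related but uncited paper', 'abstract': '...'},
]

uncited = drop_cited(toy_recommended, toy_references)
print([paper['title'] for paper in uncited])
```

Comparing by `paperId` rather than by title avoids mismatches caused by small differences in how titles are written.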


In [28]:
papers = references
len(papers)

59

## Indexing the papers

Now that we have all the papers we need, we can index them in Elasticsearch.

In [29]:
# Filter out the papers that are already indexed and index the rest
papers = [paper for paper in papers if not db.check_already_indexed(paper['title'])]
for paper in papers:
    db.insert_document_in_index(paper, "paper_abstracts")

print(f"Indexed {len(papers)} new papers")

Indexed 0 new papers


## Finding the missing citations

Now that our paper's references are indexed, we can attempt to find the missing citations.
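
Conceptually, the search embeds each paragraph and looks for the indexed abstracts whose vectors lie closest to it. The sketch below illustrates the idea with an exact, brute-force ranking in plain Python. It assumes cosine similarity and uses made-up three-dimensional vectors; the similarity metric actually configured in the index is not shown in this notebook, and Elasticsearch's kNN search is approximate rather than exhaustive.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, indexed, k=1):
    """Return the k titles whose vectors are most similar to the query vector."""
    ranked = sorted(indexed,
                    key=lambda title: cosine_similarity(query, indexed[title]),
                    reverse=True)
    return ranked[:k]


# Toy "index": title -> embedding (made-up vectors)
toy_index = {
    'Paper A': [1.0, 0.0, 0.0],
    'Paper B': [0.0, 1.0, 0.0],
    'Paper C': [0.7, 0.7, 0.0],
}

print(top_k([0.9, 0.1, 0.0], toy_index, k=2))
```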

In [32]:
paragraphs = pdf_text.split('\n')
sentences = pdf_text.split('. ')
embeddings = [db.get_embedding(paragraph) for paragraph in paragraphs]


In [33]:
titles = []

for embedding, paragraph in zip(embeddings, paragraphs):
    # Search for the paragraph in the indexed papers
    results = db.es.search(
        index="paper_abstracts",
        knn={
            'field': 'embedding',
            'query_vector': embedding,
            'num_candidates': 10000,
            'k': 20,
        }
    )

    top_title = results['hits']['hits'][0]['_source']['title']
    titles.append(top_title)

    # Check if the found paper is in the references
    # (`references` is a list of dicts, so compare against their titles)
    if any(top_title == reference['title'] for reference in references):
        print(paragraph)

titles

['Making Neural Machine Reading Comprehension Faster',
 'Enhancing Joint Entity and Relation Extraction with Language Modeling and Hierarchical Attention',
 'Encoder-Agnostic Adaptation for Conditional Language Generation',
 'Tree Transformer: Integrating Tree Structures into Self-Attention',
 'Tree Transformer: Integrating Tree Structures into Self-Attention',
 'Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training',
 'Tree Transformer: Integrating Tree Structures into Self-Attention',
 'Tree Transformer: Integrating Tree Structures into Self-Attention',
 'Grammar inference, automata induction, and language acquisition',
 "Word Mover's Embedding: From Word2Vec to Document Embedding",
 'DisSent: Sentence Representation Learning from Explicit Discourse Relations',
 'Distributed syntactic representations with an application to part-of-speech tagging',
 'Tree Transformer: Integrating Tree Structures into Self-Attention',
 "Word Mover's Embedding: From Word2V
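
One way to summarize a run like this is the fraction of paragraphs whose top match is one of the paper's references. A minimal sketch, assuming the top matches are collected as a list of title strings and the references are dicts with a `title` key; `reference_hit_rate` and the toy data are hypothetical.

```python
def reference_hit_rate(top_titles, references):
    """Fraction of top-match titles that appear among the reference titles."""
    reference_titles = {reference['title'] for reference in references}
    if not top_titles:
        return 0.0
    hits = sum(1 for title in top_titles if title in reference_titles)
    return hits / len(top_titles)


# Toy data (made up for illustration)
toy_references = [{'title': 'Paper A'}, {'title': 'Paper C'}]
toy_titles = ['Paper A', 'Paper B', 'Paper A', 'Paper D']

print(reference_hit_rate(toy_titles, toy_references))
```

Exact title matching is strict: differences in casing or punctuation between sources would count as misses.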

## Results

Unfortunately, we weren't able to find anything :(