<a href="https://colab.research.google.com/github/aaalexlit/omdena_climate_change_challenge_notebooks/blob/main/Index_scientific_papers_abstracts_for_searches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download the CSV with abstracts retrieved from OpenAlex using keyword search

This is a shortened version of the dataset that contains only about 2,000 abstracts.

In [36]:
!gdown 'https://drive.google.com/uc?id=1RzgO3aWvYO4OXTD-z_LfQyuO2IE9PmbT' -O 'openalex_abstracts.csv'

Downloading...
From: https://drive.google.com/uc?id=1RzgO3aWvYO4OXTD-z_LfQyuO2IE9PmbT
To: /content/openalex_abstracts.csv
100% 2.91M/2.91M [00:00<00:00, 65.5MB/s]
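
Before indexing, it can help to confirm that the downloaded file has the columns the later cells read (`id`, `doi`, `title`, `authors`, `publication_year`, `abstract`). The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the tiny in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

# Columns that the indexing cells further down read from each row.
EXPECTED_COLUMNS = {"id", "doi", "title", "authors", "publication_year", "abstract"}


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(EXPECTED_COLUMNS - header)


# Hypothetical two-column sample standing in for the real download.
sample = io.StringIO("id,title\nW1,Some title\n")
print(missing_columns(sample))  # ['abstract', 'authors', 'doi', 'publication_year']
```

With the real file, the same helper could be called as `missing_columns(open('openalex_abstracts.csv', newline='', encoding='utf-8'))`; an empty list would mean every expected column is present.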


# Check whether a GPU is available to decide which FAISS build to install

In [37]:
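import os
import torch

# farm-haystack offers FAISS as an optional extra: "faiss-gpu" or "faiss" (CPU only)
faiss_to_install = "faiss-gpu" if torch.cuda.is_available() else "faiss"

# quote the requirement so the shell does not interpret the square brackets
ret_code = os.system(f"pip install 'farm-haystack[{faiss_to_install}]'")
# os.system returns 0 when the command succeeds
if ret_code == 0:
  print(f"Installed {faiss_to_install}")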

Installed faiss
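
The selection logic above can also be kept separate from the installation itself, which makes it easy to check without a GPU. This is only a sketch: `build_install_command` is a hypothetical helper, and running pip through `sys.executable -m pip` is a design choice assumed here so that the package lands in the interpreter the notebook is using.

```python
import sys


def build_install_command(cuda_available):
    """Build the pip command for the FAISS flavour matching the hardware."""
    extra = "faiss-gpu" if cuda_available else "faiss"
    # Passing a list of arguments avoids shell quoting of the square brackets.
    return [sys.executable, "-m", "pip", "install", f"farm-haystack[{extra}]"]


print(build_install_command(False)[-1])  # farm-haystack[faiss]
print(build_install_command(True)[-1])   # farm-haystack[faiss-gpu]
```

The resulting list could then be handed to `subprocess.run(..., check=True)`, which raises an exception if the installation fails instead of returning a code that has to be inspected.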


# Index documents

In [38]:
import os
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
import logging
from timeit import default_timer as timer
from haystack import Document
import pandas as pd

In [39]:
# model to build semantic encodings
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L-6-v3'
# embedding size used by msmarco-MiniLM-L-6-v3
EMBEDDING_DIM = 384

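# number of abstracts written and embedded per batch
chunk_size = 100
# to resume an interrupted run, replace 0 with the number of chunks already indexed
start_from_row = 0 * chunk_size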

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)


In [40]:
class FAISSIndexer:
    def __init__(self, path_to_index_dir,
                 model_name, embedding_dim,
                 path_to_postgres=None) -> None:
        self.path_to_index_dir = path_to_index_dir
        # the database is Postgres: only the path to the FAISS index needs to be set
        if path_to_postgres:
            self.path_to_db = path_to_postgres
            self._set_path_to_index()
        # the database is SQLite: it is stored next to the FAISS index
        else:
            self._set_path_to_index_and_db()
        self.embedding_dim = embedding_dim
        self.model_name = model_name
        self.document_store = self._init_document_store()
        self.retriever = self._init_retriever()

    def _set_path_to_index(self):
        # create the index directory if it does not exist yet
        os.makedirs(self.path_to_index_dir, exist_ok=True)
        self.path_to_index = os.path.join(self.path_to_index_dir, "faiss_index")

    def _set_path_to_index_and_db(self):
        self._set_path_to_index()
        self.path_to_db = f"sqlite:///{os.path.join(self.path_to_index_dir, 'faiss_document_store.db')}"

    def _init_document_store(self):
        if os.path.exists(self.path_to_index):
            return FAISSDocumentStore.load(index_path=self.path_to_index)
        else:
            return FAISSDocumentStore(
                sql_url=self.path_to_db,
                return_embedding=True,
                similarity='cosine',
                embedding_dim=self.embedding_dim,
                duplicate_documents='skip'
            )

    def _init_retriever(self, progress_bar=True):
        return EmbeddingRetriever(
            document_store=self.document_store,
            embedding_model=self.model_name,
            model_format='sentence_transformers',
            # include article title into the embedding
            embed_meta_fields=["title"],
            progress_bar=progress_bar
        )

    def write_documents(self, docs):
        self.document_store.write_documents(docs)

        print('Updating embeddings ...')

        self.document_store.update_embeddings(
            retriever=self.retriever,
            update_existing_embeddings=False
        )

        print(f'current embedding count is {self.document_store.get_embedding_count()}')
        self.document_store.save(index_path=self.path_to_index)

    def retrieve_matches_for_a_phrase(self, phrase, top_k=10):
        return self.retriever.retrieve(phrase, top_k=top_k)

    def retrieve_matches_for_phrases(self, phrases, top_k=10):
        return self.retriever.retrieve_batch(phrases, top_k=top_k)
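
The document store above is created with `similarity='cosine'`. As a reminder of what that metric measures, here is a minimal sketch in plain Python; the three-dimensional toy vectors are assumptions chosen for readability (the model configured above produces 384-dimensional embeddings), and the score Haystack reports may be rescaled, so treat this as an illustration of the metric rather than a reproduction of the `Similarity` values printed later.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # longer vector, same direction
doc_orthogonal = [0.0, 3.0, 0.0]      # no component shared with the query

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
```

Because the metric depends only on direction, the first document scores close to 1 even though its vector is twice as long, while the orthogonal one scores 0.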

In [41]:
def index_docs_from_csv(filename, docs_extractor,
                        indexer, chunk_size, start_from_row):
    for docs in docs_extractor(filename, chunk_size, start_from_row):
        indexer.write_documents(docs)

In [42]:
def convert_openalex_abstracts_to_haystack_documents(row):
    meta_information = {
        'title': row['title'],
        'publication_year': row['publication_year'],
        'authors': row['authors'],
        'doi': row['doi'],
        'open_alex_id': row['id']
    }
    return Document(content=row['abstract'],
                    meta=meta_information)

In [43]:
def read_csv_yield_haystack_documents(filename, chunk_size, start_from_row):
    chunk_number = 1
    # skip the first `start_from_row` data rows but keep row 0 (the header),
    # so that column names are still available when resuming
    for df in pd.read_csv(filename, chunksize=chunk_size,
                          skiprows=range(1, start_from_row + 1)):
        print(f'starting to index chunk number {chunk_number}')
        # replace missing values with empty strings
        df.fillna("", inplace=True)
        row_dict = df.to_dict('records')
        chunk_number += 1
        yield [convert_openalex_abstracts_to_haystack_documents(row)
               for row in row_dict]
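
The chunk-and-resume idea behind this generator can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the notebook's code: `iter_row_chunks` is a hypothetical name, the header is always read first, and `start_from_row` counts data rows only.

```python
import csv
import io
from itertools import islice


def iter_row_chunks(csv_file, chunk_size, start_from_row=0):
    """Yield lists of row dicts, skipping the first `start_from_row` data rows."""
    reader = csv.DictReader(csv_file)  # reads the header before any data row
    rows = islice(reader, start_from_row, None)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical five-row sample; resume after the first data row.
sample = io.StringIO("id,abstract\nW1,a\nW2,b\nW3,c\nW4,d\nW5,e\n")
chunks = list(iter_row_chunks(sample, chunk_size=2, start_from_row=1))
print([[row["id"] for row in chunk] for chunk in chunks])  # [['W2', 'W3'], ['W4', 'W5']]
```

Every yielded row keeps its column names because the header is consumed separately from the skipped rows, which is the property that matters when restarting from a later chunk.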

In [44]:
start = timer()

STORE_PATH = '/content/data/faiss/'
csv_path = '/content/openalex_abstracts.csv'

faiss_indexer = FAISSIndexer(STORE_PATH, 
                             MODEL_NAME, 
                             EMBEDDING_DIM)
index_docs_from_csv(csv_path,
                    read_csv_yield_haystack_documents,
                    faiss_indexer,
                    chunk_size,
                    start_from_row
                    )

end = timer()
print(end - start)

starting to index chunk number 1
Updating embeddings ...
current embedding count is 100
starting to index chunk number 2
Updating embeddings ...
current embedding count is 200
starting to index chunk number 3
Updating embeddings ...
current embedding count is 300
starting to index chunk number 4
Updating embeddings ...
current embedding count is 400
starting to index chunk number 5
Updating embeddings ...
current embedding count is 500
starting to index chunk number 6
Updating embeddings ...
current embedding count is 600
starting to index chunk number 7
Updating embeddings ...
current embedding count is 700
starting to index chunk number 8
Updating embeddings ...
current embedding count is 800
starting to index chunk number 9
Updating embeddings ...
current embedding count is 900
starting to index chunk number 10
Updating embeddings ...
current embedding count is 1000
starting to index chunk number 11
Updating embeddings ...
current embedding count is 1100
starting to index chunk number 12
Updating embeddings ...
current embedding count is 1200
starting to index chunk number 13
Updating embeddings ...
current embedding count is 1300
starting to index chunk number 14
Updating embeddings ...
current embedding count is 1400
starting to index chunk number 15
Updating embeddings ...
current embedding count is 1500
starting to index chunk number 16
Updating embeddings ...
current embedding count is 1600
starting to index chunk number 17
Updating embeddings ...
current embedding count is 1700
starting to index chunk number 18
Updating embeddings ...
current embedding count is 1800
starting to index chunk number 19
Updating embeddings ...
current embedding count is 1900
starting to index chunk number 20
Updating embeddings ...
current embedding count is 2000
633.2427071990005


In [45]:
def get_abstracts_matching_claims(claims, store_path, top_k=10, debug=False):
    start = timer()
    faiss_indexer = FAISSIndexer(store_path, MODEL_NAME, EMBEDDING_DIM)
    all_matches = faiss_indexer.retrieve_matches_for_phrases(claims,
                                                             top_k=top_k)
    if debug:
        for claim_n, matches in enumerate(all_matches):
            print(f"Claim:\n{claims[claim_n]}\n")
            for i, match in enumerate(matches):
                print(f'Evidence {i}:\n',
                      f'Similarity: {match.score:.3f}\n'
                      f'Quote: {match.content}\n',
                      f'Article Title: {match.meta.get("title", "")}\n',
                      f'DOI: {match.meta.get("doi", "")}\n',
                      f'year: {match.meta.get("publication_year", "")}\n', )
    end = timer()
    print(f"Took {(end - start):.0f} seconds")
    return all_matches
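
For further processing, the nested list returned here can be flattened into one record per claim and match. The sketch below is hypothetical: `Match` is a stand-in dataclass exposing only the three attributes the debug printout reads (`content`, `score`, `meta`), and `matches_to_records` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Stand-in for a retrieved document; not the Haystack class itself."""
    content: str
    score: float
    meta: dict = field(default_factory=dict)


def matches_to_records(claims, all_matches):
    """Flatten per-claim match lists into a list of plain dicts."""
    records = []
    for claim, matches in zip(claims, all_matches):
        for rank, match in enumerate(matches):
            records.append({
                "claim": claim,
                "rank": rank,
                "similarity": round(match.score, 3),
                "title": match.meta.get("title", ""),
                "doi": match.meta.get("doi", ""),
                "year": match.meta.get("publication_year", ""),
            })
    return records


# Made-up example values, for illustration only.
example = [[Match("some abstract text", 0.5, {"title": "Example title"})]]
print(matches_to_records(["an example claim"], example))
```

A list of dicts in this shape can be passed directly to `pd.DataFrame(...)` for sorting or filtering by similarity.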

In [48]:
get_abstracts_matching_claims(['CO2 is not the cause of our current warming trend.',
                               'Natural variation explains a substantial part of global warming observed since 1850'], 
                              store_path=STORE_PATH, debug=True);

Claim:
CO2 is not the cause of our current warming trend.

Evidence 0:
 Similarity: 0.770
Quote: The global temperature rose by 0.2 degrees C between the middle 1960's and 1980, yielding a warming of 0.4 in past century. This increase is consistent with calculated greenhouse effect due to measured increases atmospheric carbon dioxide. Variations volcanic aerosols possibly solar luminosity appear be primary causes observed fluctuations about mean trend increasing temperature. It shown that anthropogenic dioxide should emerge from noise level natural climate variability end century, there high probability 1980's. Potential effects on 21st century include creation drought-prone regions North America central Asia as part shifting climatic zones, erosion West Antarctic ice sheet consequent worldwide rise sea level, opening fabled Northwest Passage.
 Article Title: Climate Impact of Increasing Atmospheric Carbon Dioxide
 DOI: https://doi.org/10.1126/science.213.4511.957
 year: 1981

Evidence