<a href="https://colab.research.google.com/github/aaalexlit/omdena_climate_change_challenge_notebooks/blob/main/Index_claims_from_abstracts_for_searches_with_sem_scholar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Index claims from abstracts for searches

## Download the CSV of abstracts retrieved from OpenAlex by keyword search and enriched with information from Semantic Scholar

This is the full version, which has  
874,028 abstracts in `mod_abstracts_abstract_keyword_search.csv` and  
9,271 abstracts in `mod_abstracts_title_keyword_search.csv`.

The CSV was obtained by:
1. Running [this code](https://github.com/aaalexlit/cc-claim-verification/blob/main/download/query_openalex.py), a slightly modified version of the code available [here](https://github.com/mcallaghan/NLP-climate-science-tutorial-CCAI/blob/main/A_obtaining_data.ipynb).
2. Enriching the result with citation information from Semantic Scholar ([link to the code](https://github.com/aaalexlit/cc-claim-verification/blob/main/download/semanticscholar/add_info.py)).

In [None]:
# import locale
# locale.getpreferredencoding = lambda: 'UTF-8'

In [None]:
!gdown https://drive.google.com/uc?id=1-C-fbIchuiv8Kz9f3fjwym3zPkvnsRiF -O 'faiss_index.json'
!gdown https://drive.google.com/uc?id=1-A-eRyAT-8j0qJlMNg0URpxWljl4GzVh -O 'faiss_index'
!gdown https://drive.google.com/uc?id=1-4uNKkIvzmc_RZkilHJ1b_XuhfMjMsE8 -O 'faiss_document_store.db'

!gdown https://drive.google.com/uc?id=1l5_zGXV5p9jESnqb1bQFGTRTfgFrthFp -O 'openalex_abstracts_abstract_keyword_search.csv'

Downloading...
From: https://drive.google.com/uc?id=1-C-fbIchuiv8Kz9f3fjwym3zPkvnsRiF
To: /content/faiss_index.json
100% 187/187 [00:00<00:00, 413kB/s]
Downloading...
From: https://drive.google.com/uc?id=1-A-eRyAT-8j0qJlMNg0URpxWljl4GzVh
To: /content/faiss_index
100% 114M/114M [00:01<00:00, 69.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-4uNKkIvzmc_RZkilHJ1b_XuhfMjMsE8
To: /content/faiss_document_store.db
100% 188M/188M [00:05<00:00, 32.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1l5_zGXV5p9jESnqb1bQFGTRTfgFrthFp
To: /content/openalex_abstracts_abstract_keyword_search.csv
100% 928M/928M [00:15<00:00, 58.6MB/s]


# Check whether a GPU is available and install the matching version of faiss

In [None]:
import torch
import os
faiss_to_install = "faiss-gpu"
if not torch.cuda.is_available():
  faiss_to_install = "faiss"

ret_code = os.system(f"pip install farm-haystack[{faiss_to_install}]")
if not ret_code:
  print(f"Installed {faiss_to_install}")

Installed faiss-gpu


In [None]:
%%capture
!python -m spacy download en_core_web_sm

# Index documents 

In [None]:
import os
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
import logging
from timeit import default_timer as timer
from haystack import Document
import pandas as pd
import torch
from transformers import AutoTokenizer, pipeline, \
RobertaForSequenceClassification, AutoModelForSequenceClassification
import gc
import itertools
import spacy

### Model to use for semantic embeddings

In [None]:
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
# embedding size used by all-MiniLM-L6-v2
EMBEDDING_DIM = 384

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
pd.options.mode.chained_assignment = None

nlp = spacy.load('en_core_web_sm',
                 enable=['tok2vec', 'senter'],
                 config={"nlp": {"disabled": []}}
                 )

In [None]:
def split_into_sentences(text):
    return [sent.text for sent in nlp(text).sents]

In [None]:
%%timeit
ss = split_into_sentences("""Atmospheric carbon dioxide concentration is expected to exceed 500 parts per million and global temperatures to rise by at least 2\u00b0C by 2050 to 2100, values that significantly exceed those of at least the past 420,000 years during which most extant marine organisms evolved. Under conditions expected in the 21st century, global warming and ocean acidification will compromise carbonate accretion, with corals becoming increasingly rare on reef systems. The result will be less diverse reef communities and carbonate reef structures that fail to be maintained. Climate change also exacerbates local stresses from declining water quality and overexploitation of key species, driving reefs increasingly toward the tipping point for functional collapse. This review presents future scenarios for coral reefs that predict increasingly serious consequences for reef-associated fisheries, tourism, coastal protection, and people. As the International Year of the Reef 2008 begins, scaled-up management intervention and decisive action on global emissions are required if the loss of coral-dominated ecosystems is to be avoided.""")
len(ss)
ss

9.21 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
import gc
gc.collect()

952

## Class that handles initializing and writing to the FAISS store

In [None]:
class FAISSIndexer():
    def __init__(self, path_to_index_dir,
                 model_name, embedding_dim,
                 path_to_postgres=None) -> None:
        self.path_to_index_dir = path_to_index_dir
        # our db is Postgres, so we only need to set the path to the FAISS index
        if path_to_postgres:
            self.path_to_db = path_to_postgres
            self._set_path_to_index()
        # our db is SQLite, so the db file lives next to the FAISS index
        else:
            self._set_path_to_index_and_db()
        self.embedding_dim = embedding_dim
        self.model_name = model_name
        self.document_store = self._init_document_store()
        self.retriever = self._init_retriever()

    def _set_path_to_index(self):
        if not os.path.exists(self.path_to_index_dir):
            os.makedirs(self.path_to_index_dir)
        self.path_to_index = os.path.join(self.path_to_index_dir, "faiss_index")

    def _set_path_to_index_and_db(self):
        self._set_path_to_index()
        self.path_to_db = f"sqlite:///{os.path.join(self.path_to_index_dir, 'faiss_document_store.db')}"

    def _init_document_store(self):
        if os.path.exists(self.path_to_index):
            return FAISSDocumentStore.load(index_path=self.path_to_index)
        else:
            return FAISSDocumentStore(
                sql_url=self.path_to_db,
                return_embedding=True,
                similarity='cosine',
                embedding_dim=self.embedding_dim,
                duplicate_documents='skip'
            )

    def _init_retriever(self, progress_bar=True):
        return EmbeddingRetriever(
            document_store=self.document_store,
            embedding_model=self.model_name,
            model_format='sentence_transformers',
            # include article title into the embedding
            embed_meta_fields=["title"],
            progress_bar=progress_bar
        )

    def write_documents(self, docs):
        self.document_store.write_documents(docs)

        print('Updating embeddings ...')

        self.document_store.update_embeddings(
            retriever=self.retriever,
            update_existing_embeddings=False
        )

        print(f'current embedding count is {self.document_store.get_embedding_count()}')
        self.document_store.save(index_path=self.path_to_index)

    def retrieve_matches_for_a_phrase(self, phrase, top_k=10):
        return self.retriever.retrieve(phrase, top_k=top_k)

    def retrieve_matches_for_phrases(self, phrases, top_k=10):
        return self.retriever.retrieve_batch(phrases, top_k=top_k)
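
The document store above is created with `similarity='cosine'`. As a rough illustration of what that scoring means, the sketch below ranks a few toy vectors by cosine similarity, computed as the inner product of L2-normalized vectors. The 3-dimensional vectors and the helper names (`l2_normalize`, `cosine_similarity`, `top_k`) are made up for this example; the real index holds 384-dimensional embeddings, and how FAISS and Haystack compute and scale scores internally is not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity as the inner product of the normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def top_k(query, candidates, k=2):
    """Return the k (score, name) pairs with the highest similarity."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in candidates.items()]
    return sorted(scored, reverse=True)[:k]


# Toy "index": names and vectors are illustrative only
toy_index = {
    "sentence_a": [1.0, 0.0, 0.0],
    "sentence_b": [0.7, 0.7, 0.0],
    "sentence_c": [0.0, 0.0, 1.0],
}
ranking = top_k([2.0, 0.1, 0.0], toy_index, k=2)
print(ranking)
```

Because the vectors are normalized first, the length of the query vector does not affect the ranking, only its direction does.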

## Functions for indexing

In [None]:
def convert_openalex_rows_to_haystack_document(row):
    """Split one abstract into sentences and return one Haystack-style
    dict per sentence, each carrying the article's metadata."""
    sentences = split_into_sentences(row['abstract'])
    return [{'content': sent,
             'meta': {
                 'title': row['title'],
                 'year': row['year'],
                 'doi': row['doi'],
                 'openalex_id': row['openalex_id'],
                 'citation_count': row['citationCount'],
                 'influential_citation_count': row['influentialCitationCount'],
             }} for sent in sentences]
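
To make the shape of the converter's output concrete, here is a small self-contained sketch that runs without spaCy. It assumes a naive regex splitter (`naive_split`, a hypothetical stand-in for `split_into_sentences`) and a made-up sample row; only the column names are taken from the function above.

```python
import re


def naive_split(text):
    # Hypothetical stand-in for split_into_sentences: split after ., ! or ?
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def row_to_docs(row, splitter=naive_split):
    """Return one {'content', 'meta'} dict per sentence of the abstract."""
    meta = {
        'title': row['title'],
        'year': row['year'],
        'doi': row['doi'],
        'openalex_id': row['openalex_id'],
        'citation_count': row['citationCount'],
        'influential_citation_count': row['influentialCitationCount'],
    }
    # dict(meta) gives every sentence its own copy of the metadata
    return [{'content': sent, 'meta': dict(meta)}
            for sent in splitter(row['abstract'])]


# Made-up row, for illustration only
sample_row = {
    'abstract': 'Reefs are declining. Warming is one driver.',
    'title': 'Example title',
    'year': 2008,
    'doi': '10.0000/example',
    'openalex_id': 'W0000000000',
    'citationCount': 12,
    'influentialCitationCount': 1,
}
docs = row_to_docs(sample_row)
for doc in docs:
    print(doc['content'], '|', doc['meta']['title'])
```

Each sentence becomes its own document, so a single abstract contributes several entries to the index, all pointing back to the same article.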

In [None]:
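def convert_abstracts_from_openalex_to_haystack_docs(filename, chunk_size, start_from_row):
    """Yield, for each CSV chunk, a flat list of sentence-level documents."""
    chunk_number = 1
    # skiprows keeps the header (line 0) and skips data rows 1..start_from_row - 1
    for df in pd.read_csv(filename, chunksize=chunk_size, skiprows=range(1, start_from_row)):
        print(f'starting to index chunk number {chunk_number}')
        # replace NaN with empty strings before building the documents
        df.fillna("", inplace=True)
        row_dict = df.to_dict('records')
        yield list(itertools.chain.from_iterable(
            convert_openalex_rows_to_haystack_document(row) for row in row_dict))
        chunk_number += 1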
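
The `skiprows=range(1, start_from_row)` argument is what lets an interrupted run resume: the header line is kept and reading restarts at data row `start_from_row` (counting from 1). The sketch below reproduces that behaviour with the standard library only (`csv` and `itertools.islice` instead of pandas), so it is an illustration of the idea rather than the code used in this notebook. The function name `iter_csv_chunks` and the toy CSV are assumptions.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(file_obj, chunk_size, start_from_row=1):
    """Yield lists of row dicts, resuming at 1-based data row start_from_row."""
    reader = csv.DictReader(file_obj)  # reads the header line first
    # Skip data rows 1 .. start_from_row - 1, like skiprows=range(1, start_from_row)
    rows = islice(reader, start_from_row - 1, None)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        yield chunk


# Toy CSV with seven data rows: t1 .. t7
toy_csv = "title,abstract\n" + "".join(f"t{i},a{i}\n" for i in range(1, 8))
chunks = list(iter_csv_chunks(io.StringIO(toy_csv), chunk_size=3, start_from_row=3))
print([[row['title'] for row in chunk] for chunk in chunks])
```

With `start_from_row=1` nothing is skipped, which matches `range(1, 1)` being empty in the pandas call.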


In [None]:
def index_docs_from_csv(filename, docs_extractor,
                        indexer,
                        chunk_size, start_from_row):
  for docs in docs_extractor(filename, chunk_size, start_from_row):
    indexer.write_documents(docs)

## Launch indexing process

In [None]:
start = timer()
STORE_PATH = '/content/'
csv_path = '/content/openalex_abstracts_abstract_keyword_search.csv'

chunk_size = 4096
start_from_row = 1 * 2024 + 25 * chunk_size

faiss_indexer = FAISSIndexer(STORE_PATH, MODEL_NAME, EMBEDDING_DIM)

index_docs_from_csv(csv_path,
                    convert_abstracts_from_openalex_to_haystack_docs,
                    faiss_indexer,
                    chunk_size,
                    start_from_row
                    )

end = timer()
print(end - start)

starting to index chunk number 1
Updating embeddings ...
current embedding count is 111041
starting to index chunk number 2
Updating embeddings ...
current embedding count is 147339
starting to index chunk number 3
Updating embeddings ...
current embedding count is 184458
starting to index chunk number 4
Updating embeddings ...
current embedding count is 221136
starting to index chunk number 5
Updating embeddings ...
current embedding count is 257808
starting to index chunk number 6
Updating embeddings ...
current embedding count is 294918
starting to index chunk number 7
Updating embeddings ...
current embedding count is 332109
starting to index chunk number 8
Updating embeddings ...
current embedding count is 369186
starting to index chunk number 9
Updating embeddings ...
current embedding count is 406388
starting to index chunk number 10
Updating embeddings ...
current embedding count is 443045
starting to index chunk number 11
Updating embeddings ...
current embedding count is 480395
starting to index chunk number 12
Updating embeddings ...
current embedding count is 517642
starting to index chunk number 13
Updating embeddings ...
current embedding count is 555012
starting to index chunk number 14
Updating embeddings ...
current embedding count is 592297
starting to index chunk number 15
Updating embeddings ...
current embedding count is 629337
starting to index chunk number 16
Updating embeddings ...
current embedding count is 666744
starting to index chunk number 17
Updating embeddings ...
current embedding count is 704253
starting to index chunk number 18
Updating embeddings ...
current embedding count is 742141
starting to index chunk number 19
Updating embeddings ...
current embedding count is 779883
starting to index chunk number 20
Updating embeddings ...
current embedding count is 816950
starting to index chunk number 21
Updating embeddings ...
current embedding count is 854045
starting to index chunk number 22
Updating embeddings ...
current embedding count is 891699
starting to index chunk number 23
Updating embeddings ...
current embedding count is 929175
starting to index chunk number 24
Updating embeddings ...
current embedding count is 966487
starting to index chunk number 25
Updating embeddings ...
current embedding count is 1004113


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-6a8f34df8730>", line 10, in <cell line: 10>
    index_docs_from_csv(csv_path,
  File "<ipython-input-12-36da364fab42>", line 5, in index_docs_from_csv
    indexer.write_documents(docs)
  File "<ipython-input-9-b5d092f758bf>", line 60, in write_documents
    self.document_store.save(index_path=self.path_to_index)
  File "/usr/local/lib/python3.9/dist-packages/haystack/document_stores/faiss.py", line 669, in save
    faiss.write_index(self.faiss_indexes[self.index], str(index_path))
  File "/usr/local/lib/python3.9/dist-packages/faiss/swigfaiss.py", line 9843, in write_index
    return _swigfaiss.write_index(*args)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dis

TypeError: ignored

In [None]:
def get_abstracts_matching_claims(claims, store_path, top_k=10, debug=False):
    start = timer()
    faiss_indexer = FAISSIndexer(store_path, MODEL_NAME, EMBEDDING_DIM)
    all_matches = faiss_indexer.retrieve_matches_for_phrases(claims,
                                                             top_k=top_k)
    if debug:
        for claim_n, matches in enumerate(all_matches):
            print(f"Claim:\n{claims[claim_n]}\n")
            for i, match in enumerate(matches):
                print(f'Evidence {i}:\n',
                      f'Similarity: {match.score:.3f}\n'
                      f'Abstract: {match.content}\n',
                      f'Article Title: {match.meta.get("title", "")}\n',
                      f'DOI: {match.meta.get("doi", "")}\n',
                      f'year: {match.meta.get("year", "")}\n',
                      f'citation_count: {match.meta.get("citation_count", "")}\n',
                      f'influential_citation_count: {match.meta.get("influential_citation_count", "")}\n',)
    end = timer()
    print(f"Took {(end - start):.0f} seconds")
    return all_matches
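
The function returns one list of matches per claim, in the same order as the claims. For further processing it can be convenient to flatten that nested result into plain records. The sketch below is one possible way to do so; `matches_to_rows` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `score`, `content` and `meta` attributes read by the function above, so the example runs without Haystack.

```python
from types import SimpleNamespace


def matches_to_rows(claims, all_matches):
    """Flatten per-claim match lists into one list of plain dicts."""
    rows = []
    for claim, matches in zip(claims, all_matches):
        for rank, match in enumerate(matches):
            rows.append({
                'claim': claim,
                'rank': rank,
                'score': match.score,
                'sentence': match.content,
                'title': match.meta.get('title', ''),
                'doi': match.meta.get('doi', ''),
            })
    return rows


# Stand-in matches with made-up values, for illustration only
fake_matches = [[
    SimpleNamespace(score=0.9, content='Sentence one.',
                    meta={'title': 'T1', 'doi': 'd1'}),
    SimpleNamespace(score=0.8, content='Sentence two.',
                    meta={'title': 'T2'}),
]]
rows = matches_to_rows(['Example claim'], fake_matches)
for row in rows:
    print(row)
```

Records in this form can be passed to `pd.DataFrame(rows)` for sorting or filtering.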

In [None]:
get_abstracts_matching_claims(['CO2 is not the cause of our current warming trend.',
                               'Natural variation explains a substantial part of global warming observed since 1850'], 
                              store_path=STORE_PATH, debug=True);

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Claim:
CO2 is not the cause of our current warming trend.

Evidence 0:
 Similarity: 0.836
Abstract: The yield somewhat similar results regarding atmospheric CO 2 levels, but they reach substantially different conclusions climate change.
 Article Title: Fossil-fuel constraints on global warming
 DOI: 10.1016/j.enpol.2009.06.068
 year: 2010
 citation_count: 138.0
 influential_citation_count: 4.0

Evidence 1:
 Similarity: 0.832
Abstract: Specifically, it confirmed former, especially CO2, are main drivers of recent warming.
 Article Title: On the causal structure between CO2 and global temperature
 DOI: 10.1038/srep21691
 year: 2016
 citation_count: 132.0
 influential_citation_count: 3.0

Evidence 2:
 Similarity: 0.829
Abstract: The Anthropocene idea cannot be justified by anthropogenic CO2.
 Article Title: Role of Atmospheric Convection in Global Warming
 DOI: 10.9734/jgeesi/2019/v19i430091
 year: 2019
 citation_count: 26.0
 influential_citation_count: 1.0

Evidence 3:
 Similarity: 0.826


In [None]:
%load_ext sql


In [None]:
%%sql
sqlite:///faiss_document_store.db

In [None]:
%%sql
SELECT * FROM document where created_at > '2023-04-08' limit 10;

 * sqlite:///faiss_document_store.db
(sqlite3.OperationalError) no such table: document
[SQL: SELECT * FROM document where created_at > '2023-04-08' limit 10;]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
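
The error above means that the database file opened by the connection has no table named `document`. One way to investigate is to list the tables SQLite actually sees in the file. The sketch below uses the standard `sqlite3` module; `list_tables` is a hypothetical helper, and the in-memory database with a toy `document` table is only there so the example runs anywhere. Note that SQLite creates a new, empty database when the given path does not exist, so it may be worth checking that the path resolves to the downloaded file; whether that is the cause here is not established by the output above.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables visible through an SQLite connection."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in cur.fetchall()]


# Toy in-memory database; the table layout is illustrative only
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, content TEXT)")
tables = list_tables(conn)
print(tables)
```

Against the real store, the connection would be opened with the path of the downloaded file instead, for example `sqlite3.connect('/content/faiss_document_store.db')`.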
