<font size="6.5">
    CORD-19 Dataset
</font>

<br>

CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge">[1] COVID-19 Open Research Dataset Challenge</a>

Mining this literature is valuable to any scientist studying COVID-19.

<br>
<font size="5">
    Dataset
</font>

<br>

We extracted the abstracts from all articles in the CORD-19 dataset and saved them as text files, one file per article.
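
The extraction step itself is not shown in this notebook. Below is a minimal sketch of how it might look, assuming each article is a JSON file with a `paper_id` field and an `abstract` list of paragraph objects that carry a `text` field. That layout, the function names, and the directory arguments are assumptions for illustration, not the code that produced the text files used below.

```python
import json
from pathlib import Path


def extract_abstract(paper: dict) -> str:
    """Join the abstract paragraphs of one paper into a single string."""
    # Assumed layout: "abstract" is a list of {"text": ...} paragraph objects.
    paragraphs = paper.get("abstract") or []
    texts = [p.get("text", "").strip() for p in paragraphs]
    return "\n".join(t for t in texts if t)


def export_abstracts(json_dir: str, txt_dir: str) -> int:
    """Write one <paper_id>.txt file per paper that has a non-empty abstract.

    Returns the number of text files written.
    """
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for json_path in sorted(Path(json_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            paper = json.load(f)
        abstract = extract_abstract(paper)
        if not abstract:
            continue  # skip papers without an abstract
        # Fall back to the file name if the paper has no "paper_id" field.
        name = paper.get("paper_id", json_path.stem)
        (out / f"{name}.txt").write_text(abstract, encoding="utf-8")
        written += 1
    return written
```

Calling `export_abstracts` once per subset would yield one text directory per subset, matching the directory layout used in the indexing cell.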


<br>
<font size="5">
    Haystack NLP
</font>

We use Haystack to run extractive question answering over the CORD-19 abstracts.

In [1]:
from haystack import Finder
from haystack.database.sql import SQLDocumentStore
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.io import write_documents_to_db, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.retriever.tfidf import TfidfRetriever
from haystack.utils import print_answers

## Indexing & cleaning documents

In [10]:
# Documents are saved as text files
preprint_dir = "C:\\data\\CORD-19\\biorxiv_medrxiv\\biorxiv_medrxiv_txt"
comm_dir = "C:\\data\\CORD-19\\comm_use_subset\\comm_use_subset_txt"
noncomm_dir = "C:\\data\\CORD-19\\noncomm_use_subset\\noncomm_use_subset_txt"
custom_dir = "C:\\data\\CORD-19\\custom_license\\custom_license_txt"

# Use SQLite to store data
document_store = SQLDocumentStore(url="sqlite:///cord.db")

# Save docs in DB
write_documents_to_db(document_store=document_store, document_dir=preprint_dir, clean_func=clean_wiki_text, only_empty_db=False)
write_documents_to_db(document_store=document_store, document_dir=comm_dir, clean_func=clean_wiki_text, only_empty_db=False)
write_documents_to_db(document_store=document_store, document_dir=noncomm_dir, clean_func=clean_wiki_text, only_empty_db=False)
write_documents_to_db(document_store=document_store, document_dir=custom_dir, clean_func=clean_wiki_text, only_empty_db=False)

03/22/2020 16:26:40 - INFO - haystack.indexing.io -   Wrote 885 docs to DB
03/22/2020 16:27:05 - INFO - haystack.indexing.io -   Wrote 9118 docs to DB
03/22/2020 16:27:10 - INFO - haystack.indexing.io -   Wrote 2353 docs to DB
03/22/2020 16:27:43 - INFO - haystack.indexing.io -   Wrote 16959 docs to DB


## Initialize Reader, Retriever & Finder

In [11]:
# A retriever identifies the k most promising chunks of text that might contain the answer for our question
# Retrievers use some simple but fast algorithm, here: TF-IDF
retriever = TfidfRetriever(document_store=document_store)

03/22/2020 16:27:56 - INFO - haystack.retriever.tfidf -   Found 21976 candidate paragraphs from 30200 docs in DB
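
To make the retrieval step concrete, here is a small pure-Python sketch of TF-IDF ranking: each paragraph is scored by the summed TF-IDF weight of the query terms it contains, and the best `top_k` are returned. This illustrates the general technique only. It is not Haystack's `TfidfRetriever` implementation, and the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, paragraphs, top_k=3):
    """Return indices of up to top_k paragraphs, best TF-IDF match first."""
    tokenized = [tokenize(p) for p in paragraphs]
    n = len(tokenized)
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    query_terms = set(tokenize(query))
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        # Term frequency is normalized by paragraph length.
        score = sum(tf[t] / len(tokens) * idf[t] for t in query_terms if t in tf)
        scores.append((score, i))

    # Highest score first; ties are broken by original position.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    # Paragraphs that share no term with the query are dropped.
    return [i for score, i in ranked[:top_k] if score > 0]
```

Because the score depends only on term counts, this kind of ranking is cheap enough to run over every paragraph in the store, which is why it is used as a filter in front of the slower reader.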


In [12]:
# A reader scans the text chunks in detail and extracts the k best answers
# Readers use more powerful but slower deep learning models
# You can select a local model or any of the QA models published on Hugging Face's model hub (https://huggingface.co/models)
# Here: a medium-sized BERT QA model trained via FARM on SQuAD 2.0
reader = FARMReader(model_name_or_path="deepset/bert-base-cased-squad2", use_gpu=False)

# Alternatively, use a reader from Hugging Face's transformers package (https://github.com/huggingface/transformers)
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

03/22/2020 16:28:22 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/22/2020 16:28:22 - INFO - farm.infer -   Could not find `deepset/bert-base-cased-squad2` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
03/22/2020 16:28:29 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None


In [13]:
# The Finder combines the reader and retriever into a pipeline that answers our actual questions
finder = Finder(reader, retriever)
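
The retrieve-then-read pattern behind this pipeline can be sketched in a few lines. The version below is a generic illustration, not Haystack's `Finder` internals: `answer_question` is a hypothetical helper, and the `retrieve` and `read` callables, as well as the `answer` and `score` keys, are assumed interfaces invented for the example.

```python
from typing import Callable, Dict, List


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    read: Callable[[str, str], List[Dict]],
    top_k_retriever: int = 10,
    top_k_reader: int = 5,
) -> List[Dict]:
    """Run a fast retriever first, then a slower reader on its candidates."""
    # Step 1: cheaply narrow the corpus down to a few candidate paragraphs.
    candidates = retrieve(question, top_k_retriever)

    # Step 2: let the expensive reader extract scored answers from each one.
    answers: List[Dict] = []
    for paragraph in candidates:
        answers.extend(read(question, paragraph))

    # Step 3: keep only the highest-scoring answers overall.
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k_reader]
```

The reader only ever sees `top_k_retriever` paragraphs, so its cost stays bounded regardless of how many documents are in the store.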

## Voilà! Ask a question!

In [14]:
# You can configure how many candidates the retriever and the reader return
# A higher top_k_retriever gives the reader more candidates, which tends to improve answers but slows inference
prediction = finder.get_answers(question="Do patients have gastroenteritis?", top_k_retriever=10, top_k_reader=5)

03/22/2020 16:28:43 - INFO - haystack.retriever.tfidf -   Identified 10 candidates via retriever:
  paragraph_id  document_id

03/22/2020 16:28:43 - INFO - haystack.finder -   Reader is looking for detailed answer in 16372 chars ...
Inferencing Samples: 100%|██████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.64s/ Batches]


In [15]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Uncommon viral causes of acute gastroenteritis and viruses '
                  'causing gastroenteritis in immunodeficient',
        'context': ' measures are reviewed. Uncommon viral causes of acute '
                   'gastroenteritis and viruses causing gastroenteritis in '
                   'immunodeficient patients are mentioned. '},
    {   'answer': 'Uncommon viral causes of acute gastroenteritis and viruses '
                  'causing gastroenteritis in immunodeficient patients',
        'context': 'e briefly reviewed. Uncommon viral causes of acute '
                   'gastroenteritis and viruses causing gastroenteritis in '
                   'immunodeficient patients are mentioned. The '},
    {   'answer': 'HBoV-2 was more frequently associated with cases of '
                  'gastroenteritis',
        'context': 'mples were genotyped, with HBoV-1 predominantly found in '
                   'controls whilst HBoV-2 was more frequently associated with