# Free-form questioning on the COVID-19 dataset

>We will use a universal sentence encoder to encode text from the COVID-19 dataset and use the resulting embeddings to answer free-form queries.

- Based on the paper at: https://arxiv.org/abs/1803.11175

- Dataset available at: https://pages.semanticscholar.org/coronavirus-research

- By Dattaraj J Rao (Persistent Systems) - https://www.linkedin.com/in/dattarajrao

## We will use the Sentence Transformers library

The approach is to encode the relevant text corpus from the COVID-19 dataset, then compare the question embedding against the corpus embeddings to find the top 5 matching answers.
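The matching step can be illustrated without any model at all. The sketch below is a toy example under stated assumptions: the three-dimensional vectors and passage labels are made up for illustration (real sentence embeddings have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper written in plain Python rather than a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

# Made-up embeddings standing in for encoded corpus passages (assumption).
toy_corpus = {
    "passage about smoking risk": [0.9, 0.1, 0.0],
    "passage about vaccine trials": [0.0, 0.2, 0.9],
    "passage about lung disease": [0.8, 0.3, 0.1],
}
# Made-up embedding standing in for the encoded question (assumption).
toy_question = [1.0, 0.2, 0.0]

# Rank passages from most to least similar to the question.
ranked = sorted(
    toy_corpus,
    key=lambda title: cosine_similarity(toy_question, toy_corpus[title]),
    reverse=True,
)
for title in ranked:
    print(title, "%.4f" % cosine_similarity(toy_question, toy_corpus[title]))
```

Passages whose vectors point in nearly the same direction as the question vector rank first, which is the same idea the notebook applies to the full corpus.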

In [1]:
#hide
pip install -U sentence-transformers


### Read the dataset, create embeddings vector

In [2]:
#hide
import os
import json
import warnings
warnings.simplefilter('ignore')

JSON_PATH = 'CORD-19-research-challenge/2020-03-13/noncomm_use_subset/noncomm_use_subset/'

json_files = [pos_json for pos_json in os.listdir(JSON_PATH) if pos_json.endswith('.json')]

corpus = []

# loop through the files
for jfile in json_files:
    # for each file open it and read as json
    with open(os.path.join(JSON_PATH, jfile)) as json_file:
        covid_json = json.load(json_file)
        # read abstract
        for item in covid_json['abstract']:
            corpus.append(item['text'])
        # read body text
        for item in covid_json['body_text']:
            corpus.append(item['text'])
            
print("Corpus size = %d"%(len(corpus)))

from sentence_transformers import SentenceTransformer
import scipy.spatial.distance  # import the submodule explicitly so scipy.spatial.distance.cdist resolves

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
corpus_embeddings = embedder.encode(corpus)

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/CORD-19-research-challenge/2020-03-13/noncomm_use_subset/noncomm_use_subset/'
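The `FileNotFoundError` above means the dataset directory was not present at the path the cell looked in, so no corpus or embeddings were built. One way to make that failure easier to diagnose is to wrap the loading loop in a function that checks the directory first. This is only a sketch: `load_corpus` is a hypothetical helper name, and the tolerance for files that lack an `abstract` or `body_text` key is an assumption about the data, not something the original cell handles.

```python
import json
import os

def load_corpus(json_dir):
    """Collect abstract and body paragraphs from every .json file in json_dir."""
    if not os.path.isdir(json_dir):
        # Fail early with a message that names the path that was tried.
        raise FileNotFoundError("Dataset directory not found: %s" % json_dir)

    paragraphs = []
    for name in sorted(os.listdir(json_dir)):
        if not name.endswith('.json'):
            continue
        with open(os.path.join(json_dir, name), encoding='utf-8') as handle:
            paper = json.load(handle)
        # Assumption: a paper may lack one of these sections, so default to [].
        for section in ('abstract', 'body_text'):
            for item in paper.get(section, []):
                text = item.get('text', '').strip()
                if text:  # skip empty paragraphs
                    paragraphs.append(text)
    return paragraphs
```

With this in place, `corpus = load_corpus(JSON_PATH)` would replace the loop, and a wrong `JSON_PATH` is reported before any model is downloaded or loaded.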

In [None]:
#hide
def ask_question(query):
    queries = [query]
    query_embeddings = embedder.encode(queries)

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    closest_n = 5
    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x: x[1])
        
        # get the closest answers
        for idx, distance in results[0:closest_n]:
            print("\n\n======================\n\n")        
            print("ANSWER = \n", corpus[idx].strip(), "(Score: %.4f)" % (1-distance))
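`ask_question` prints its matches, which makes them hard to reuse or inspect. A variant that returns the matches could look like the sketch below. It is written in plain Python so it runs without the model or scipy, which means `cosine_distance` and `top_matches` are hypothetical helpers rather than the notebook's own functions, and `heapq.nsmallest` is swapped in for the full sort as one possible way to pick the closest entries.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity as the "cosine" metric used above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat zero vectors as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)

def top_matches(query_embedding, corpus_embeddings, corpus, n=5):
    """Return up to n (text, score) pairs, best match first."""
    scored = (
        (cosine_distance(query_embedding, embedding), idx)
        for idx, embedding in enumerate(corpus_embeddings)
    )
    # nsmallest keeps only the n closest entries instead of sorting everything.
    best = heapq.nsmallest(n, scored)
    return [(corpus[idx], 1.0 - distance) for distance, idx in best]

# Toy data (made up for illustration).
texts = ["a", "b", "c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for text, score in top_matches([1.0, 0.0], vectors, texts, n=2):
    print(text, "(Score: %.4f)" % score)
```

The score is `1 - distance`, matching what `ask_question` prints, so results from the two versions are directly comparable.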

## Question 1: Does smoking or pre-existing pulmonary disease increase risk of COVID-19?

In [None]:
ask_question('Does smoking or pre-existing pulmonary disease increase risk of COVID-19?')

## Question 2: Are neonates and pregnant women at greater risk of COVID-19?

In [None]:
ask_question('Are neonates and pregnant women at greater risk of COVID-19?')

## Question 3: Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups

In [None]:
ask_question('Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups')

## Question 4: Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.

In [None]:
ask_question('Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.')


## Ask your own question

In [3]:
ask_question(input())

NameError: name 'ask_question' is not defined