# Free form questioning on COVID-19 dataset 

>We will use a universal sentence encoder to encode text from the COVID-19 dataset and use the encodings to answer queries

- Based on the paper at: https://arxiv.org/abs/1803.11175

- Dataset available at: https://pages.semanticscholar.org/coronavirus-research

- By Dattaraj J Rao (Persistent Systems) - https://www.linkedin.com/in/dattarajrao

## We will use the Sentence Transformers library

The approach is to encode the relevant text corpus from the COVID-19 dataset, then compare the question embedding against the corpus embeddings by cosine similarity to find the top 5 matching answers.
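The matching step can be illustrated without the real model. The sketch below ranks toy 2-dimensional vectors by cosine similarity; `cosine_similarity`, `top_k_matches`, and the toy vectors are hypothetical names and data made up for illustration, not part of the notebook, and real sentence embeddings have far more dimensions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vec, corpus_vecs, k=5):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = (
        (idx, cosine_similarity(query_vec, vec))
        for idx, vec in enumerate(corpus_vecs)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data (assumption): three "passages" embedded in 2 dimensions.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

matches = top_k_matches(toy_query, toy_corpus, k=2)
for idx, score in matches:
    print(f"passage {idx}: score {score:.4f}")
```

The score is the same quantity the notebook reports as `1 - distance`, since the cosine distance is defined as one minus the cosine similarity.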

In [1]:
#hide
# ! pip install -U sentence-transformers
# !pip install nbinteract

### Read the dataset, create embeddings vector

In [2]:
#hide
# ! curl --header "Host: storage.googleapis.com" --header "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36" --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header "Accept-Language: en-US,en;q=0.9" --header "Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-data-sets/551982/1008364/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1584988666&Signature=EdfLusMOxjufDKEVw7EiScOcwd9dCyQlCHlz2yIh0lfVFpn9NUpLnxIi96Ftz7b44hNl0uT1G8EhUnN%2FWprajpyigrZBTMkVh3qzNszPIDmIqNestubkCAJOl81r6dtpy3%2FP5E1dDntrhBQPPpeW2h5pd31EgkSJIwvwGzMGVuyw7ayW8EyWpBeAKdjOVS3LRSFZSGLoeievgWMxv7Vm00LwMmK3YfwSRtFVEuDpYdEYAcr6P3hl1MGnJxS5uJqF6LVNlmm%2Bxwim%2BTAU2eQfB2T2U8F7hfuawQtpkkuHkv3lf2QsXvZlLugmV0wDyrUHXxYkftdQdHPTLASe9Hfj1Q%3D%3D&response-content-disposition=attachment%3B+filename%3DCORD-19-research-challenge.zip" -o "CORD-19-research-challenge.zip" -L

In [3]:
#hide
# ! mkdir data

In [4]:
#hide
# ! unzip CORD-19-research-challenge.zip -d data

In [5]:
#hide
import os
import json
import pickle
from sentence_transformers import SentenceTransformer
import scipy.spatial.distance  # loads the submodule that provides cdist

JSON_PATH = 'data/2020-03-13/noncomm_use_subset/noncomm_use_subset/'

json_files = [pos_json for pos_json in os.listdir(JSON_PATH) if pos_json.endswith('.json')]

corpus = []

# loop through the files
for jfile in json_files:
    # for each file open it and read as json
    with open(os.path.join(JSON_PATH, jfile)) as json_file:
        covid_json = json.load(json_file)
        # read abstract
        for item in covid_json['abstract']:
            corpus.append(item['text'])
        # read body text
        for item in covid_json['body_text']:
            corpus.append(item['text'])
            


embedder = SentenceTransformer('bert-base-nli-mean-tokens')
# corpus_embeddings = embedder.encode(corpus) 
# pickle.dump(corpus_embeddings, open('corpus_embedding.pkl','wb'))
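The extraction loop above assumes every file has both an `abstract` and a `body_text` list. As a hedged sketch, the same step could be written as a small helper that tolerates missing sections and skips empty paragraphs. `extract_paragraphs` and `toy_paper` are hypothetical names introduced here. Note that skipping paragraphs changes the corpus indices, so this variant is only appropriate if the embeddings are regenerated from the same list rather than loaded from a precomputed pickle.

```python
def extract_paragraphs(paper, sections=("abstract", "body_text")):
    """Collect non-empty paragraph texts from a parsed paper dict.

    `paper` is assumed to map each section name to a list of
    {"text": ...} items, as in the loop above.
    """
    paragraphs = []
    for section in sections:
        # A missing section is treated as an empty list.
        for item in paper.get(section, []):
            text = item.get("text", "").strip()
            if text:
                paragraphs.append(text)
    return paragraphs


# Toy input (assumption): a paper with no abstract and one empty paragraph.
toy_paper = {"body_text": [{"text": " First paragraph. "}, {"text": ""}]}
print(extract_paragraphs(toy_paper))
```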

In [None]:
! curl --header "Host: doc-0k-6g-docs.googleusercontent.com" --header "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36" --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header "Accept-Language: en-US,en;q=0.9" --header "Cookie: AUTH_54o58k3mrosb0ntis1d6ki80s5u2suja_nonce=an1gmni0poujk; NID=198=Ur67qjvAf1i-tnwiUIgNirKuT_vpDT-dQFBEpw1j-TqAwCLeMW_JiF4XgHk3lTe93MRbZ02lUdw5qfXH0Em0ewYTjs-dfoit2-G7LdsU167QVcVzy0GVGfPKjWniJ6ymSJxjYLcq1uc6NvhRyj-emsL0X13_WSAzTdiTHTsTQb0" --header "Connection: keep-alive" "https://doc-0k-6g-docs.googleusercontent.com/docs/securesc/lggtpbl23ghctm8snd2br4sbmhr20ifs/foacldgvepp7u9cotr6nrqoko3vh8n2g/1584733950000/09981225647225991988/09981225647225991988/1sCbqmVvxj0EacUoceZEv1EC_NME7ZEMF?e=download&h=11307479845000005816&authuser=0&nonce=an1gmni0poujk&user=09981225647225991988&hash=ttt49oojsrjkqdjp51g9348od17m0jdq" -o "corpus_embedding.pkl" -L

In [6]:
with open('corpus_embedding.pkl', 'rb') as f:
    corpus_embeddings = pickle.load(f)
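Because the embeddings are loaded from a file rather than computed from `corpus` in this session, the lookup `corpus[idx]` is only meaningful if row `idx` of the embeddings was produced from the same passage. A minimal sanity check, assuming only that both objects support `len()`, might look like the following; `check_alignment` is a hypothetical helper, and matching lengths are a necessary condition for alignment, not a guarantee of it.

```python
def check_alignment(passages, embeddings):
    """Raise if the number of passages and embedding rows differ."""
    n_passages = len(passages)
    n_vectors = len(embeddings)
    if n_passages != n_vectors:
        raise ValueError(
            f"{n_passages} passages but {n_vectors} embedding rows; "
            "indices into the corpus would not match the embeddings"
        )
    return n_passages


# Toy usage (assumption): two passages with two 1-dimensional vectors.
print(check_alignment(["a", "b"], [[0.1], [0.2]]))
```

In the notebook this would be called as `check_alignment(corpus, corpus_embeddings)` before asking any question.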

In [16]:
#hide
def ask_question(query):
    queries = [query]
    query_embeddings = embedder.encode(queries)

    # Find the closest 5 passages of the corpus for each query based on cosine similarity
    closest_n = 5

    for query, query_embedding in zip(queries, query_embeddings):
        distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

        # pair each corpus index with its distance and sort by ascending distance
        results = sorted(enumerate(distances), key=lambda x: x[1])

        # print the closest answers, numbered from 1
        for i, (idx, distance) in enumerate(results[:closest_n], start=1):
            print("\n======================\n")
            print(f"Answer {i}: \n", corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))

#### Question 1: Does smoking or pre-existing pulmonary disease increase risk of COVID-19?

In [17]:
ask_question('Does smoking or pre-existing pulmonary disease increase risk of COVID-19?')



Answer 1: 
 Among patients with cancer 7,8 , pulmonary diseases 5,9*11 , immunological conditions 8, 11, 12 , renal diseases 13 , or organ transplantation 8,14,15 corticosteroid therapy has been associated with increased risk of ON. Further, important lifestyle factors, including tobacco use 16, 17 and high alcohol consumption 16*18 may increase ON risk. (Score: 0.7884)


Answer 2: 
 The role of eosinophils in COPD and the mechanism of eosinophil influx into airways remain to be clarified. Hogg et al. 5 reported that eosinophils do exist in the small airways of various severity of COPD. In a prospective clinical observation, nearly one third of COPD patients had sputum eosinophilia and the number of eosinophils was significantly correlated to the level of exhaled nitric oxide 17 . The degree of eosinophilic inflammation has been related to early changes in lung function and smoking habits. The higher counts of eosinophils in induced sputum is associated with higher pack-years and low

#### Question 2: Are neonates and pregnant women at greater risk of COVID-19?

In [13]:
ask_question('Are neonates and pregnant women at greater risk of COVID-19?')



Answer 1: 
 Newborns are considered at high risk of COVID-19 in case that they are born to mothers diagnosed with COVID-19, or have close contact with someone with probable or confirmed COVID-19, or live in or travel to the epidemic area. Clinical manifestations of infected neonates, especially preterm infants, might be nonspecific, which might include temperature instability, gastrointestinal and cardiovascular dysfunction, and dominant respiratory problems. Some severe patients could rapidly develop acute respiratory distress syndrome. All infants with suspected COVID-19 should be isolated and monitored regardless of whether or not they present with symptoms. Diagnosis of neonatal COVID-19 could be confirmed if the suspected patients have positive nucleic acid test for COVID-19 from the respiratory tract, stool or blood specimens. 4 Infants with highly suspected or confirmed COVID-19 should be referred to the designated neonatal ward. All medical staff involved should wear protecti

#### Question 3: Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups

In [10]:
ask_question('Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups')



Answer 62769: 
 In outbreaks of infectious disease, healthcare personnel (HCP) are at increased risk of contracting emerging infections in the process of patient care [1, 2] . (Score: 0.8333)


Answer 1633: 
 High morbidity and mortality in influenza are seen especially among those at the extremes of age (elderly and very young), those with underlying health conditions and pregnant women. 29 Underlying health conditions especially associated with an increased risk for complicated influenza are immunecompromised individuals, either due to the underlying disease, or to immunomodulatory treatment, like organ transplant recipients and those taking medication for autoimmune conditions. 30 Furthermore, chronic pulmonary disease 31 , diabetes mellitus, cardiovascular disease and malignancies are also considered risk factors for developing severe influenza or complications. 32 Impact on Travellers Even a relatively mild, self-limiting seasonal influenza virus infection can have drastic impac

#### Question 4: Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.

In [None]:
ask_question('Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.')

#### Question 5: Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups

In [None]:
ask_question('Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups')

# Ask your own question

In [None]:
from ipywidgets import interact

In [None]:
interact(ask_question, query='')