# 1. Summary

### 1.1 THE CHALLENGE

The challenge we chose was "What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?". The challenge outline raised a set of questions, and we set out to answer them through data analysis. To keep the focus on what was asked, we opted for a clean, simple, yet effective approach.

### 1.2 THE APPROACH

Our approach is based on a small selection of models benchmarked against the provided dataset, in order to maximise answering power and keep the results tied to the scope of the challenge. After reviewing references on cutting-edge methods in the field and comparing the benchmarks reported in various papers, we decided to look for the best BERT model for this purpose. We tried several models and obtained the best results with a BERT model fine-tuned for question answering.

Other submissions picked the same model, but applied it only narrowly: they first filtered the paper database by searching for keywords and strings, and then applied BERT to find the answer in the resulting filtered data. Our approach flips this logic: we first identify the best answers across the whole paper database, and only afterwards filter by keyword where needed. We found this ordering extremely effective.
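
The difference between the two orderings can be sketched on toy data. This is only an illustration under stated assumptions, not the notebook's code: `toy_score` is a hypothetical stand-in for the BERT confidence score (it simply counts shared words), and the three abstracts are invented.

```python
def toy_score(question, abstract):
    """Hypothetical stand-in for the model confidence: number of shared words."""
    q_words = set(question.lower().replace("?", "").split())
    return len(q_words & set(abstract.lower().split()))


def filter_then_score(question, abstracts, keyword):
    """Keyword filter first, scoring only on the abstracts that survive."""
    kept = [a for a in abstracts if keyword in a.lower()]
    return sorted(kept, key=lambda a: toy_score(question, a), reverse=True)


def score_then_filter(question, abstracts, keyword=None):
    """Score every abstract, then filter by keyword only if one is given."""
    ranked = sorted(abstracts, key=lambda a: toy_score(question, a), reverse=True)
    if keyword is None:
        return ranked
    return [a for a in ranked if keyword in a.lower()]


# Invented abstracts, for illustration only.
abstracts = [
    "tobacco use is a risk factor for severe pneumonia",
    "smoking history was recorded for all patients",
    "hand washing reduces transmission",
]
question = "Is smoking a risk factor?"

narrow = filter_then_score(question, abstracts, keyword="smoking")
broad = score_then_filter(question, abstracts)
```

In this toy case the keyword filter discards the "tobacco" abstract before it is ever scored, whereas scoring the whole list first ranks it at the top.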

### 1.3 THE RESULTS
Our results provide satisfying answers for 7 out of 8 questions. Extracting an answer takes less than 30 seconds per question on a faster machine, but we decided to pre-calculate the answers because of the limited computing power available on Kaggle notebooks. A summary of the results follows; the extended form can be found in section 3.1:

<table width="100%">
    <thead>
    <tr>
        <td width="30%"  style="text-align:left"> Question </td>
        <td  style="text-align:left"> Answer </td>
    </tr>
    </thead>
    <tbody>
    <tr>
        <td style="text-align:left"> Risk factors: Smoking, pre-existing pulmonary disease. </td> 
        <td style="text-align:left"> Very likely but not demonstrated. It was demonstrated that smoking has no clinical or preventive significance for risk of influenza in the elderly. However, asthma is to be considered a risk factor for rhinovirus, which in turn often explains more underlying pulmonary illness. </td>
    </tr>
    <tr>
        <td style="text-align:left"> Risk factors: Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities. </td>
        <td style="text-align:left"> Likely not. It was observed that individuals infected with the COVID-19 coronavirus but without any symptoms could very often still transfer the virus to others. However, for pneumonia in relation to viruses similar to Covid-19, i.e. HCoV-NL63 and influenza A/H1N1, it was observed that co-infections caused significantly higher rates of breathing difficulties, cough, and sore throat than those of single infections. </td>
    </tr>
    <tr>
        <td style="text-align:left"> Neonates and pregnant women. </td>
        <td style="text-align:left"> Likely. While multiple studies demonstrated that pregnant women are more likely to develop severe illness after infection with Covid-19 and other influenza viruses, there is no study demonstrating that Covid-19 poses higher risk for neonates. However, neonates are thought to be susceptible to the virus, because their immune system is not well developed. </td>
    </tr>
    <tr>
        <td style="text-align:left"> Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences. </td>
        <td style="text-align:left"> Likely but not demonstrated. While no study looked directly at socio-economic and behavioral factors in relation to Covid-19, previous studies focused on risk factors for population from poorer areas of the World, identifying different health risks when compared to local population - such as other infectious diseases, inadequate immunity to vaccine-preventable illnesses, higher likelihood of having malnutrition and developmental delay and psychological trauma. </td>
    </tr> 
    <tr>
        <td style="text-align:left"> Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors. </td>
        <td style="text-align:left"> Covid-19 is spread by human-to-human transmission via direct contact, droplets and aerosol transmission, and infection has been estimated to have a mean incubation period of 6.4 days and a basic reproduction number of 2.24–3.58. With regard to environmental factors, tests on virus surrogates for Covid demonstrated that low air temperature and low humidity are likely to increase virus survival on surfaces up to 28 days, with studies going as far as suggesting that hospitals keep high temperatures at relatively high RH to reduce the survival of airborne influenza virus. It has also been demonstrated that an environment that allows for substantial undocumented infection facilitates the rapid dissemination of COVID-19, while after travel restrictions and control measures are imposed the reproductive number falls considerably. Also, testing strategies that do not produce repeated false negatives, as the existing tests do, are deemed beneficial for the containment of the virus. </td>
    </tr> 
    <tr>
        <td style="text-align:left"> Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups. </td>
        <td style="text-align:left"> The risk of fatality among infected individuals is 0.3% to 0.6%, but among hospitalised cases this increases to 14%. Based on reported deaths, the high-risk groups are pregnant women, people older than the general population (mean age 45), and people in hospices. </td>
    </tr> 
    <tr>
        <td style="text-align:left"> Susceptibility of populations. </td>
        <td style="text-align:left"> Unknown </td>
    </tr> 
        <tr>
        <td style="text-align:left"> Public health mitigation measures that could be effective for control. </td>
        <td style="text-align:left"> Although little data is available for Covid-19 control measures, a forecasting model based on Covid-19 data has shown that closing borders and schools, and suspending community services and commuting, are effective. More generally, a number of recent studies have shown how different intervention measures such as travel restrictions, school closures, treatment and prophylaxis might make it possible to control outbreaks of diseases such as SARS, pandemic influenza and others. It is also worth noting that, because the elevated death risk estimates from Covid-19 are likely associated with a breakdown of the medical/health system, enhanced public health interventions including social distancing and movement restrictions should be effectively implemented to bring the epidemic under control. </td>
    </tr> 
       
    </tbody>
</table>


# 2. Methodology

### 2.1 Data preparation

We first explored the dataset as well as clustering techniques, taking inspiration from Carrot [https://search.carrot2.org/#/pubmed]. However, we decided to focus on actually providing answers to the very specific questions that the challenge posed. We therefore first broke the challenge questions down into a more specific set of questions that we could use to compare models and obtain better results overall.

In [1]:
# task 1: What do we know about COVID-19 risk factors?
# Task details: What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?
# https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=558

questions = [{'question':"Is smoking a risk factor?",'keyword':None},
             {'question':"Is a pre-existing pulmonary disease a risk factor?",'keyword':None},
             {'question':"Do co-existing conditions make the virus more transmissible?",'keyword':None},
             {'question':"Is being a pregnant woman a risk factor?",'keyword':'pregnant'},
             {'question':"Is being a neonate a risk factor?",'keyword':'neonate'},
             {'question':"Are there differences in risk factors associated to socio-economic factors?",'keyword':None},
             {'question':"How does the transmission happen?",'keyword':'transmission'},
             {'question':"What is the reproductive rate?",'keyword':None},
             {'question':"What is the incubation period?",'keyword':None},
             {'question':"What are the modes of transmission?",'keyword':None},
             {'question':"What are the enviromental factors?",'keyword':None},
             {'question':"How long is the serial interval?",'keyword':None},
             {'question':"What is the severity of disease among high risk groups and patients?",'keyword':None},
             {'question':"What is the risk of death among high risk groups and patients?",'keyword':None},
             {'question':"What is the susceptibility of populations?",'keyword':None},
             {'question':"What are the public health mitigation measures that could be effective for control?",'keyword':None}]

### 2.2 Model selection

Then, we looked at various models trained on different datasets, including SBERT [https://arxiv.org/pdf/1908.10084.pdf] and SciBERT [https://github.com/allenai/scibert], and compared their results across the proposed questions. We compared results by manually verifying the top papers picked by each method and scoring them from 1 to 5. Eventually we settled on a BERT model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output that computes span start logits and span end logits).
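
The span head emits one start logit and one end logit per input token, and an answer is a pair of positions with start before or at end. The sketch below shows one common way to turn the two logit vectors into a valid span. It is an assumption-based illustration, not the notebook's method (the notebook takes the two argmaxes independently); `best_span` and `max_answer_len` are hypothetical names and the logits are invented.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logit + end_logit,
    subject to start <= end < start + max_answer_len."""
    best = None
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits for a 6-token input (invented values).
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 4.0, 0.2, 0.1, 3.5, 0.1]

span = best_span(start_logits, end_logits)
```

With these values the independent argmaxes would give start 2 and end 1, which is not a valid span; the constrained search picks tokens 2 to 4 instead.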


### 2.3 Detailed Methodology (Code)

The detailed methodology follows, including the code for each step.

## Initialise environment

We first install the packages needed, get the paper corpus and load the local files, including the model results that we pre-computed on another machine to save execution time on the Kaggle machine. If those result files are not in the folder, the same code generates them.

In [2]:
!pip install transformers
import torch
import pandas as pd
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
#device_available = torch.cuda.is_available()
device_available = False
from IPython.core.display import display, HTML
import seaborn as sns
import matplotlib.pyplot as plt


# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
#sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (20,8)

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
if device_available:
    model.cuda()

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')





In [3]:
# this is just to load the files needed instead of running the model each time.

import pickle

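def load_or_run_answer_question_dict(question, keyword):
    # The cache file name is derived from the question text.
    pickle_name = question.replace(' ','_').replace('?','_')
    path_to_file = F"/kaggle/input/kaggle/{pickle_name}.pickle"
    print(path_to_file)
    try:
        # Reuse the pre-computed answers when the pickle exists.
        with open(path_to_file, "rb") as f:
            df = pickle.load(f)
    except (OSError, IOError) as e:
        # Otherwise run the model and store the result for next time.
        df = answer_question_dict(question, keyword)
        with open(path_to_file, "wb") as f:
            pickle.dump(df, f)
    return df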
# print(os.listdir("../input"))
# print(os.listdir("../input/datacompetition"))
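
The load-or-compute pattern above can be sketched independently of Kaggle. The version below is a minimal variant under stated assumptions: `cached_call`, `cache_dir` and the `compute` callback are hypothetical names, and the cache directory is assumed to be writable (on Kaggle, `/kaggle/input` is mounted read-only, so freshly computed results would have to be written elsewhere, for example under `/kaggle/working`).

```python
import os
import pickle
import tempfile


def cached_call(key, compute, cache_dir):
    """Return the pickled result for `key` if present, else compute and store it."""
    safe_name = key.replace(' ', '_').replace('?', '_')
    path = os.path.join(cache_dir, f"{safe_name}.pickle")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute(key)
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def fake_compute(question):
    # Hypothetical stand-in for the slow model run.
    calls.append(question)
    return {"question": question, "answer": "placeholder"}


cache_dir = tempfile.mkdtemp()  # writable scratch directory
first = cached_call("Is smoking a risk factor?", fake_compute, cache_dir)
second = cached_call("Is smoking a risk factor?", fake_compute, cache_dir)
```

The second call is served from the pickle, so `fake_compute` runs only once.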

## Get data

In [5]:
import textwrap

def get_dataset(csv_path):
    corpus = []
    csv_df = pd.read_csv(csv_path).dropna(subset=['authors', 'abstract']).drop_duplicates(subset='abstract')
    csv_df = csv_df[csv_df['abstract']!='Unknown']
    for ix,row in csv_df.iterrows():
        if row['abstract'] and not pd.isna(row['abstract']):
            temp_dict = dict()
            temp_dict['abstract'] = row['abstract']
            temp_dict['title'] = row['title']
            temp_dict['authors'] = row['authors']
            temp_dict['url'] = row['doi']
            temp_dict['publish_time'] = row['publish_time']
            corpus.append(temp_dict)
    return corpus

wrapper = textwrap.TextWrapper(width=80) 

corpus = get_dataset('metadata.csv')

## Set up question-answer algorithm

Here we define our main answering function. As you can see, we score every token in every abstract, not just a few pre-selected papers or abstracts as in the other submissions we saw. This gives us a score for the full database, which we then rank, returning only the top results.
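
The "score everything, keep only the best" step can be sketched with a bounded min-heap. This is not the notebook's implementation, which stores answers in a dict keyed by score (so two abstracts with identical scores overwrite each other); `top_k` is a hypothetical helper and the scores are invented.

```python
import heapq


def top_k(scored_items, k=10):
    """Keep the k highest-scoring (score, payload) pairs from an iterable."""
    heap = []  # min-heap: heap[0] is the weakest answer currently kept
    for index, (score, payload) in enumerate(scored_items):
        entry = (score, index, payload)  # index breaks ties without comparing payloads
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)
    # Highest score first.
    return [(s, p) for s, _, p in sorted(heap, reverse=True)]


scores = [(2.5, "a"), (7.1, "b"), (0.3, "c"), (7.1, "d"), (5.0, "e")]
best = top_k(scores, k=3)
```

Memory stays proportional to `k` regardless of how many abstracts are scored, and tied scores are both kept.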

In [6]:
def answer_question_dict(question, keyword=None, show_visualization=False):

    '''
    Scores every abstract in `corpus` against the `question` string
    (skipping abstracts that do not contain `keyword`, when one is given)
    and returns a DataFrame with the highest-scoring answer spans,
    sorted by confidence.
    '''
    # select corpus
    answer_text = corpus

    # Initializing answers list
    answers = {}
    min_score = 0
    counter = 0 # for stopping iterations earlier
    
    for answer_option in answer_text:
      if keyword and keyword not in answer_option['abstract']:
        continue

      # ======== Tokenize ========
      # Apply the tokenizer to the input text, treating them as a text-pair.
      # Truncate explicitly so that long abstracts fit the 512-token limit.
      input_ids = tokenizer.encode(question, answer_option['abstract'], max_length=512, truncation=True)

      # Report how long the input sequence is.
      #print('Query has {:,} tokens.\n'.format(len(input_ids)))

      # ======== Set Segment IDs ========
      # Search the input_ids for the first instance of the `[SEP]` token.
      sep_index = input_ids.index(tokenizer.sep_token_id)

      # The number of segment A tokens includes the [SEP] token itself.
      num_seg_a = sep_index + 1

      # The remainder are segment B.
      num_seg_b = len(input_ids) - num_seg_a

      # Construct the list of 0s and 1s.
      segment_ids = [0]*num_seg_a + [1]*num_seg_b
    

      # There should be a segment_id for every input token.
      assert len(segment_ids) == len(input_ids)
      

      # ======== Evaluate ========
      # Run our example question through the model.
        
      input_ids_tensor = torch.tensor([input_ids])
      segment_ids_tensor = torch.tensor([segment_ids])
      if device_available:
         input_ids_tensor = input_ids_tensor.to('cuda:0')
         segment_ids_tensor = segment_ids_tensor.to('cuda:0')

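      outputs = model(input_ids_tensor, # The tokens representing our input text.
                      token_type_ids=segment_ids_tensor) # The segment IDs to differentiate question from answer_text
      # Index the output instead of unpacking it: this works whether the model
      # returns a plain tuple or a ModelOutput (unpacking a ModelOutput directly
      # would yield its string keys instead of the logit tensors).
      start_scores, end_scores = outputs[0], outputs[1]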
    
      # only review answers with score above threshold
      score = round(torch.max(start_scores).item(), 3)

      if score>min_score and score>0:

        # ======== Reconstruct Answer ========
        
        # Find the tokens with the highest `start` and `end` scores.
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores)


        # Get the string versions of the input tokens.
        tokens = tokenizer.convert_ids_to_tokens(input_ids)

        # Start with the first token.
        answer = tokens[answer_start]

        # Select the remaining answer tokens and join them with whitespace.
        for i in range(answer_start + 1, answer_end + 1):
            
            # If it's a subword token, then recombine it with the previous token.
            if tokens[i][0:2] == '##':
                answer += tokens[i][2:]
            
            # Otherwise, add a space then the token.
            else:
                answer += ' ' + tokens[i]

        # ======== Add Answer to best answers list ========

        if len(answers)>4:
          min_score = min([d for d in answers.keys()])
          
        if len(answers)==10:
          answers.pop(min_score)
        answers[score] = [answer, score, '<a href="https://doi.org/'+str(answer_option['url'])+'" target="_blank">' + str(answer_option['title']) +'</a>', answer_option['abstract'], answer_option['publish_time']]
        
        visualization_start = max(answer_start-20,0)
        visualization_end = min((answer_end+1)+20,len(tokens))
        # Variables needed for graphs
        s_scores = start_scores.cpu().detach().numpy().flatten()
        e_scores = end_scores.cpu().detach().numpy().flatten()
        
        # We'll use the tokens as the x-axis labels. In order to do that, they all need
        # to be unique, so we'll add the token index to the end of each one.
        token_labels = []
        for (i, token) in enumerate(tokens):
            token_labels.append('{:} - {:>2}'.format(token, i))
        answers[score] = [answer, score, '<a href="https://doi.org/'+str(answer_option['url'])+'" target="_blank">' + str(answer_option['title']) +'</a>', answer_option['abstract'], answer_option['publish_time'], s_scores, e_scores, token_labels, visualization_start, visualization_end]
        
    # Return dataframe with relevant data
    df_columns = ['Answer', 'Confidence', 'Title', 'Abstract', 'Published', 's_scores', 'e_scores', 'token_labels', 'visualization_start', 'visualization_end']
    df = pd.DataFrame.from_dict(answers, orient='index',columns = df_columns)
    df.sort_values(by=['Confidence'], inplace=True, ascending=False)
    return df
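
The answer reconstruction step inside the function can be isolated as a small helper. This is a minimal sketch: `join_wordpieces` is a hypothetical name, and the token list is invented for illustration rather than taken from the tokenizer.

```python
def join_wordpieces(tokens):
    """Merge WordPiece tokens into a string: '##' pieces attach to the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            # Sub-word continuation: glue it to what came before.
            text += token[2:]
        elif text:
            # New word: separate it with a space.
            text += " " + token
        else:
            text = token
    return text


# Invented tokens, for illustration only.
answer = join_wordpieces(["asthma", "is", "a", "risk", "factor", "for", "rhino", "##virus"])
```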

# 3. Results & Discussion

### 3.1 Results in full

Here the questions defined in the sections above are answered and results are presented in full. All answers in the results above come from these results.

In [7]:
for question in questions:
    print("======================")
    print(question)
    df = load_or_run_answer_question_dict(question['question'], question['keyword'])
    display(HTML(df[['Answer', 'Confidence', 'Title', 'Abstract', 'Published']].to_html(render_links=True, escape=False, index=False)))
    print("======================")



{'question': 'Is smoking a risk factor?', 'keyword': None}
/kaggle/input/kaggle/Is_smoking_a_risk_factor_.pickle



### 3.2 Discussion

We believe our method is a novel approach to this challenge, and it allowed us to answer the posed questions in an accurate and clear manner. The code is simple, and the results are surfaced so that even non-experts can use them without much effort.

There are a few gaps in the answers, in particular on susceptible populations. This is an interesting point about data gaps: we know there are publications addressing this question (e.g. https://www.sciencedirect.com/science/article/pii/S0896841120300469 ), but they are not in the database and therefore the algorithm did not surface them. Also, a lot of the data is unstructured and not parsed, in particular in the full-text form.

Also, the database is very heterogeneous, and the results often relate to old research or to other viruses or outbreaks (rhinovirus, SARS, etc.). This presents a risk of oversimplifying and generalizing the findings: a careful reader can spot it when looking at the best set of results, but the models (not only ours, all of those that we checked) struggle to grasp it. This could lead to really dangerous results, and we therefore think these tools should be used only to rank the findings and surface the possible answers; then, as demonstrated in the Summary above, it is very simple for a reader to pick the best answer. That choice is often based on the full abstract, the title and the year of publication, so we included those elements (as well as a link to the full text) in our results output.

In terms of model parameters, different scoring methods affect the results of the model. We therefore assessed the scores and compared results using two different scoring methods: first using only the start-token score, and then using a composite score based on both start and end tokens. An example set of token scores for the first question is shown in the figures below. We found that taking the highest scores in a 40-token range around the start token (green spike in the first figure below) and the end token (green spike in the second figure below) gives the best answers, so we implemented that in the question/answer model.
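
The two scoring methods can be contrasted on toy data. The exact composite formula is not given in the text, so the sketch below is only one possible reading: summing the start and end logits, and searching for the end token in the `window` tokens from the start token onward, are both assumptions, and the function names and scores are invented.

```python
def start_only_score(start_scores):
    """First method: confidence is the highest start-token score."""
    return max(start_scores)


def composite_score(start_scores, end_scores, window=40):
    """Second method (assumed form): best start score plus the best end score
    found within `window` tokens from the start token onward."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    stop = min(start + window, len(end_scores))
    end = max(range(start, stop), key=end_scores.__getitem__)
    return start, end, start_scores[start] + end_scores[end]


# Invented per-token scores for a 4-token input.
start_scores = [0.1, 6.0, 0.2, 0.1]
end_scores = [5.0, 0.3, 0.1, 4.0]

single = start_only_score(start_scores)
combined = composite_score(start_scores, end_scores)
```

Restricting the end search to positions at or after the start means the high end score at token 0 is ignored here, since it would produce a span that ends before it begins.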

In [8]:
def start_word_plot(token_labels, s_scores):
  ax = sns.barplot(x=token_labels, y=s_scores, ci=None)
  ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
  ax.grid(True)
  plt.title('Start Word Scores [for first answer]')
  plt.show()

def end_word_plot(token_labels,e_scores):
  ax = sns.barplot(x=token_labels, y=e_scores, ci=None)
  ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
  ax.grid(True)
  plt.title('End Word Scores [for first answer]')
  plt.show()

# Use the keyword of the first question, not the leftover loop variable `question`.
df = load_or_run_answer_question_dict(questions[0]['question'], questions[0]['keyword'])
start_word_plot(df.iloc[0]['token_labels'],df.iloc[0]['s_scores'])
end_word_plot(df.iloc[0]['token_labels'],df.iloc[0]['e_scores'])

/kaggle/input/kaggle/Is_smoking_a_risk_factor_.pickle



### 3.3 Pros and cons and future work

We think that our model is accurate and can surface the best answers. The model is, however, not fast on a cheap machine, and the answers need to be pre-computed on a faster machine. It can also be improved considerably. Our plan for the next submission is to split the answers by outbreak and virus, as many papers relate to rhinovirus, SARS-CoV-2 or other respiratory pathogens. We also plan to extend the answering model to the full text of the papers, and to provide more guidance to doctors on how to pick the best answers in the surfaced set.

# 4. REUSABILITY
---
The code needed to train and run the models is all in this file, as well as the results. To add a question, simply add it to the list of questions in section 2.1 and re-run the rest of the code.
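
As a minimal sketch of adding a question: the entry below is hypothetical (the diabetes question is not part of the challenge list), and `pickle_name` is an illustrative helper that mirrors the file naming used by `load_or_run_answer_question_dict`.

```python
# Abbreviated copy of the section 2.1 list, for illustration.
questions = [{'question': "Is smoking a risk factor?", 'keyword': None}]

# Hypothetical new entry. 'keyword' may be None; when set, abstracts that do
# not contain it as a (case-sensitive) substring are skipped.
questions.append({'question': "Is diabetes a risk factor?", 'keyword': 'diabetes'})


def pickle_name(question):
    """Cache file stem derived from the question text."""
    return question.replace(' ', '_').replace('?', '_')


new_cache_file = pickle_name(questions[-1]['question']) + ".pickle"
```

Since no cached file exists for a new question, its answers are computed by the model on the first run.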


# 5. REFERENCES
---

Scoring and other notes we used: 

https://docs.google.com/spreadsheets/d/1PKQVCSBK2Xsuvh_oLbQkejBDI0Jve23hJCeDMkYWu80/edit?ts=5e7a9121#gid=164381334


Other references that we used:

https://arxiv.org/pdf/1908.10084.pdf

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

https://search.carrot2.org/#/pubmed

https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

https://arxiv.org/pdf/1512.01337.pdf

https://www.aclweb.org/anthology/D14-1070.pdf

https://arxiv.org/abs/1903.10676

https://arxiv.org/pdf/1706.03762.pdf

https://rajpurkar.github.io/SQuAD-explorer/

https://arxiv.org/pdf/1810.04805.pdf