# Analysis of COVID-19 medical dataset

In [1]:
from question_answering.paths import extractive_qa_paths
from question_answering.utils import core_qa_utils
from transformers import AutoTokenizer

In [2]:
# Load COVID-19 dataset
df_train, df_val, df_test = core_qa_utils.load_train_val_test_datasets(
    extractive_qa_paths.medical_dataset_dir
)

train_dataset, val_dataset, test_dataset = core_qa_utils.convert_dataframes_to_datasets(
    [df_train, df_val, df_test]
)

# What do the raw samples look like?

In [16]:
first_sample = train_dataset[1]  # index 1, i.e. the second training sample

In [17]:
question = first_sample['question']
print(f"Question: {question}")

Question: What entities with no genes satisfy the criteria for life?


In [18]:
context = first_sample['context']
context_char_len = len(context)
context_word_len = len(context.split())

print(f"Context: {context}\n")
print(f"Context char length: {context_char_len}")
print(f"Context words length: {context_word_len}")

Context: Viruses and Evolution – Viruses First? A Personal Perspective

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6433886/

SHA: f3b9fc0f8e0a431366196d3e835e1ec368b379d1

Authors: Moelling, Karin; Broecker, Felix
Date: 2019-03-19
DOI: 10.3389/fmicb.2019.00523
License: cc-by

Abstract: The discovery of exoplanets within putative habitable zones revolutionized astrobiology in recent years. It stimulated interest in the question about the origin of life and its evolution. Here, we discuss what the roles of viruses might have been at the beginning of life and during evolution. Viruses are the most abundant biological entities on Earth. They are present everywhere, in our surrounding, the oceans, the soil and in every living being. Retroviruses contributed to about half of our genomic sequences and to the evolution of the mammalian placenta. Contemporary viruses reflect evolution ranging from the RNA world to the DNA-protein world. How far back can we trace their contribution? Earliest r

In [20]:
unstructured_samples = train_dataset.filter(lambda sample: "Text: " not in sample['context'])
len(unstructured_samples)

Filter:   0%|          | 0/1413 [00:00<?, ? examples/s]

106

In [23]:
print(unstructured_samples[0]['context'])

Estimating the number of infections and the impact of non-
pharmaceutical interventions on COVID-19 in 11 European countries

30 March 2020 Imperial College COVID-19 Response Team

Seth Flaxmani Swapnil Mishra*, Axel Gandy*, H JulietteT Unwin, Helen Coupland, Thomas A Mellan, Harrison
Zhu, Tresnia Berah, Jeffrey W Eaton, Pablo N P Guzman, Nora Schmit, Lucia Cilloni, Kylie E C Ainslie, Marc
Baguelin, Isobel Blake, Adhiratha Boonyasiri, Olivia Boyd, Lorenzo Cattarino, Constanze Ciavarella, Laura Cooper,
Zulma Cucunuba’, Gina Cuomo—Dannenburg, Amy Dighe, Bimandra Djaafara, Ilaria Dorigatti, Sabine van Elsland,
Rich FitzJohn, Han Fu, Katy Gaythorpe, Lily Geidelberg, Nicholas Grassly, Wi|| Green, Timothy Hallett, Arran
Hamlet, Wes Hinsley, Ben Jeffrey, David Jorgensen, Edward Knock, Daniel Laydon, Gemma Nedjati—Gilani, Pierre
Nouvellet, Kris Parag, Igor Siveroni, Hayley Thompson, Robert Verity, Erik Volz, Caroline Walters, Haowei Wang,
Yuanrong Wang, Oliver Watson, Peter Winskill, Xiaoyue X

In [13]:
answer = first_sample['answer_text']
answer_char_len = len(answer)
answer_words_len = len(answer.split())

print(f"Answer: {answer}")
print(f"Answer char length: {answer_char_len}")
print(f"Answer words length: {answer_words_len}")

Answer: ion channel activity of the 3a protein
Answer char length: 38
Answer words length: 7


## Processing and normalization
Text normalization can be problematic for extractive QA, because it may shift the start position of the answer within the context.
One possible solution is to first check how many times the answer text appears in the context. If it appears more than once in some samples, then for each sample we should record which occurrence (first, second, ...) is the labelled answer. After normalizing both the question and the context, we can recover the answer's start position in the normalized context by locating that same occurrence.
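
The occurrence-based remapping described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation. The helper names (`find_all`, `occurrence_index`, `remap_answer_start`) and the whitespace-collapsing `normalize` function are hypothetical. It also assumes that the answer start is stored as a character offset and that the answer text is normalized in the same way as the context.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    # Hypothetical normalization: collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def find_all(haystack: str, needle: str) -> List[int]:
    # Start offsets of every (possibly overlapping) occurrence of needle.
    starts = []
    pos = haystack.find(needle)
    while pos != -1:
        starts.append(pos)
        pos = haystack.find(needle, pos + 1)
    return starts


def occurrence_index(context: str, answer: str, answer_start: int) -> int:
    # Which occurrence (0-based) of the answer text is the labelled one?
    return find_all(context, answer).index(answer_start)


def remap_answer_start(context: str, answer: str, answer_start: int) -> Tuple[str, str, int]:
    # Record the occurrence number before normalization, then look up
    # the same occurrence in the normalized context.
    k = occurrence_index(context, answer, answer_start)
    norm_context = normalize(context)
    norm_answer = normalize(answer)
    norm_starts = find_all(norm_context, norm_answer)
    if k >= len(norm_starts):
        raise ValueError("answer occurrence not found after normalization")
    return norm_context, norm_answer, norm_starts[k]


# Toy example (made-up text, not taken from the dataset).
raw_context = "The  3a protein binds.\n\nThe 3a protein  forms channels."
raw_answer = "3a protein"
raw_start = raw_context.rfind(raw_answer)  # the second occurrence is the label

norm_context, norm_answer, new_start = remap_answer_start(raw_context, raw_answer, raw_start)
print(norm_context[new_start:new_start + len(norm_answer)])
```

One caveat of this sketch: normalization can itself change how many times the answer text occurs (for example, when two spellings that differed only in whitespace become identical), so the occurrence counts before and after normalization should be compared before trusting the remapped position.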

# Maximum number of tokens in any sample across datasets for various tokenizers

In [None]:
# Check the maximum number of tokens in any sample across datasets
# def tokenize_sample(sample, tokenizer, max_tokens=None, padding=False)