# Analysis of the COVID-19 medical dataset

In [1]:
from question_answering.paths import extractive_qa_paths
from question_answering.utils import core_qa_utils
from transformers import AutoTokenizer

In [2]:
# Load COVID-19 dataset
df_train, df_val, df_test = core_qa_utils.load_train_val_test_datasets(
    extractive_qa_paths.medical_dataset_dir
)

train_dataset, val_dataset, test_dataset = core_qa_utils.convert_dataframes_to_datasets(
    [df_train, df_val, df_test]
)

In [14]:
train_dataset, val_dataset, test_dataset

(Dataset({
     features: ['index', 'context', 'question', 'is_impossible', 'id', 'answer_text', 'answer_start'],
     num_rows: 1413
 }),
 Dataset({
     features: ['index', 'context', 'question', 'is_impossible', 'id', 'answer_text', 'answer_start'],
     num_rows: 303
 }),
 Dataset({
     features: ['index', 'context', 'question', 'is_impossible', 'id', 'answer_text', 'answer_start'],
     num_rows: 303
 }))

# What do the raw samples look like?

In [33]:
first_sample = train_dataset[0]

In [34]:
question = first_sample["question"]
print(f"Question: {question}")

Question: What ion channel is essential for 3a-mediated IL-1Beta secretion?


In [35]:
context = first_sample["context"]
context_char_len = len(context)
context_word_len = len(context.split())

print(f"Context: {context}\n")
print(f"Context char length: {context_char_len}")
print(f"Context words length: {context_word_len}")

Context: Severe Acute Respiratory Syndrome Coronavirus Viroporin 3a Activates the NLRP3 Inflammasome

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6361828/

SHA: f02d0c1e8b0109648e578662dc250abe349a033c

Authors: Chen, I-Yin; Moriyama, Miyu; Chang, Ming-Fu; Ichinohe, Takeshi
Date: 2019-01-29
DOI: 10.3389/fmicb.2019.00050
License: cc-by

Abstract: Nod-like receptor family, pyrin domain-containing 3 (NLRP3) regulates the secretion of proinflammatory cytokines interleukin 1 beta (IL-1β) and IL-18. We previously showed that influenza virus M2 or encephalomyocarditis virus (EMCV) 2B proteins stimulate IL-1β secretion following activation of the NLRP3 inflammasome. However, the mechanism by which severe acute respiratory syndrome coronavirus (SARS-CoV) activates the NLRP3 inflammasome remains unknown. Here, we provide direct evidence that SARS-CoV 3a protein activates the NLRP3 inflammasome in lipopolysaccharide-primed macrophages. SARS-CoV 3a was sufficient to cause the NLRP3 inflammasome a

# Do all samples follow the structure of:
* metadata header (title, URL, SHA, authors, date, DOI, license)
* abstract
* text?

In [38]:
unstructured_samples = train_dataset.filter(
    lambda sample: "Text: " not in sample["context"]
)
len(unstructured_samples)


106

In [39]:
print(unstructured_samples[0]["context"])

Estimating the number of infections and the impact of non-
pharmaceutical interventions on COVID-19 in 11 European countries

30 March 2020 Imperial College COVID-19 Response Team

Seth Flaxmani Swapnil Mishra*, Axel Gandy*, H JulietteT Unwin, Helen Coupland, Thomas A Mellan, Harrison
Zhu, Tresnia Berah, Jeffrey W Eaton, Pablo N P Guzman, Nora Schmit, Lucia Cilloni, Kylie E C Ainslie, Marc
Baguelin, Isobel Blake, Adhiratha Boonyasiri, Olivia Boyd, Lorenzo Cattarino, Constanze Ciavarella, Laura Cooper,
Zulma Cucunuba’, Gina Cuomo—Dannenburg, Amy Dighe, Bimandra Djaafara, Ilaria Dorigatti, Sabine van Elsland,
Rich FitzJohn, Han Fu, Katy Gaythorpe, Lily Geidelberg, Nicholas Grassly, Wi|| Green, Timothy Hallett, Arran
Hamlet, Wes Hinsley, Ben Jeffrey, David Jorgensen, Edward Knock, Daniel Laydon, Gemma Nedjati—Gilani, Pierre
Nouvellet, Kris Parag, Igor Siveroni, Hayley Thompson, Robert Verity, Erik Volz, Caroline Walters, Haowei Wang,
Yuanrong Wang, Oliver Watson, Peter Winskill, Xiaoyue X

In [40]:
answer = first_sample["answer_text"]
answer_char_len = len(answer)
answer_words_len = len(answer.split())

print(f"Answer: {answer}")
print(f"Answer char length: {answer_char_len}")
print(f"Answer words length: {answer_words_len}")

Answer: ion channel activity of the 3a protein
Answer char length: 38
Answer words length: 7


In [64]:
print(f"Can all questions be answered: {all([sample['is_impossible'] == 0 for sample in train_dataset])}")

Can all questions be answered: True


# How long are the texts?

In [41]:
def get_max_num_of_words(sentences: list[str]) -> int:
    """Return the largest whitespace-separated word count among the given texts."""
    return max(len(sentence.split()) for sentence in sentences)

In [43]:
print(f"Max word length of questions in train set: {get_max_num_of_words(train_dataset['question'])}")
print(f"Max word length of questions in val set: {get_max_num_of_words(val_dataset['question'])}")
print(f"Max word length of questions in test set: {get_max_num_of_words(test_dataset['question'])}")

print("--------------------------")

print(f"Max word length of contexts in train set: {get_max_num_of_words(train_dataset['context'])}")
print(f"Max word length of contexts in val set: {get_max_num_of_words(val_dataset['context'])}")
print(f"Max word length of contexts in test set: {get_max_num_of_words(test_dataset['context'])}")

Max word length of questions in train set: 28
Max word length of questions in val set: 25
Max word length of questions in test set: 27
--------------------------
Max word length of contexts in train set: 11368
Max word length of contexts in val set: 11368
Max word length of contexts in test set: 11368


### It seems there is a word limit after which the context is cut off

In [57]:
def find_samples_with_word_count(dataset, num_words: int, key: str = 'context'):
    return dataset.filter(lambda sample: len(sample[key].split()) == num_words)

In [58]:
max_length_samples = find_samples_with_word_count(train_dataset, num_words=11368)


In [63]:
max_length_samples

Dataset({
    features: ['index', 'context', 'question', 'is_impossible', 'id', 'answer_text', 'answer_start'],
    num_rows: 83
})

In [59]:
print(f"How many: {len(max_length_samples)}")

How many: 83


In [55]:
# Take the context text of the last of the 83 maximum-length samples
some_cut_off_context = max_length_samples[82]["context"]
print(some_cut_off_context)

MERS coronavirus: diagnostics, epidemiology and transmission

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4687373/

SHA: f6fcf1a99cbd073c5821d1c4ffa3f2c6daf8ae29

Authors: Mackay, Ian M.; Arden, Katherine E.
Date: 2015-12-22
DOI: 10.1186/s12985-015-0439-5
License: cc-by

Abstract: The first known cases of Middle East respiratory syndrome (MERS), associated with infection by a novel coronavirus (CoV), occurred in 2012 in Jordan but were reported retrospectively. The case first to be publicly reported was from Jeddah, in the Kingdom of Saudi Arabia (KSA). Since then, MERS-CoV sequences have been found in a bat and in many dromedary camels (DC). MERS-CoV is enzootic in DC across the Arabian Peninsula and in parts of Africa, causing mild upper respiratory tract illness in its camel reservoir and sporadic, but relatively rare human infections. Precisely how virus transmits to humans remains unknown but close and lengthy exposure appears to be a requirement. The KSA is the focal point of ME

### It is all the same context. In this dataset the same contexts repeat with varying questions and answers. Therefore, the number of distinct contexts to take into account during normalization is limited and perhaps manageable
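
One way to check this observation is to count the distinct context strings among the maximum-length samples. The snippet below is a minimal sketch on toy data: `count_distinct_contexts` is a hypothetical helper (not part of `core_qa_utils`), and the toy list stands in for `max_length_samples["context"]`.

```python
from collections import Counter


def count_distinct_contexts(contexts: list[str]) -> Counter:
    """Map each distinct context string to the number of samples that use it."""
    return Counter(contexts)


# Toy stand-in for max_length_samples["context"] (assumed data, not taken from the dataset).
toy_contexts = ["paper A full text", "paper A full text", "paper B full text"]
context_counts = count_distinct_contexts(toy_contexts)

print(f"Number of samples: {len(toy_contexts)}")
print(f"Number of distinct contexts: {len(context_counts)}")
```

If the number of distinct contexts is far smaller than the number of samples, the normalization can be done once per distinct context rather than once per sample.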

## Processing and normalization
In theory, normalization can cause a problem: after the text is normalized, the start position of the answer may change.
A possible solution is to check whether the answer text appears in the context once or more than once. If some samples contain the answer text more than once, then for each sample we should record which occurrence of the answer (first, second, and so on) the original start position refers to. After normalizing both the question and the context, we can recover the start position of the answer in the normalized context from that previously acquired information.
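
The idea above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual implementation: `normalize` is a placeholder normalization (lowercasing and whitespace collapsing), all helper names are hypothetical, and the approach assumes that normalization keeps the number and order of answer occurrences unchanged.

```python
import re


def find_all_starts(text: str, answer: str) -> list[int]:
    """Return the start offset of every occurrence of `answer` in `text`."""
    # A lookahead pattern also reports overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?={re.escape(answer)})", text)]


def occurrence_index(context: str, answer: str, answer_start: int) -> int:
    """Return which occurrence (0-based) of `answer` begins at `answer_start`.

    Raises ValueError if `answer` does not start at `answer_start`.
    """
    return find_all_starts(context, answer).index(answer_start)


def relocate_answer(norm_context: str, norm_answer: str, occurrence: int) -> int:
    """Return the start offset of the given occurrence in the normalized context."""
    return find_all_starts(norm_context, norm_answer)[occurrence]


def normalize(text: str) -> str:
    """Placeholder normalization (assumption): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


# Toy sample (assumed data): the answer appears twice; the label points at the second one.
context = "IL-1β is a cytokine.   The 3a protein   triggers IL-1β secretion."
answer = "IL-1β"
answer_start = context.rfind(answer)

# Step 1: before normalization, remember which occurrence the label refers to.
occurrence = occurrence_index(context, answer, answer_start)

# Step 2: normalize, then look up the same occurrence in the normalized text.
norm_context, norm_answer = normalize(context), normalize(answer)
new_start = relocate_answer(norm_context, norm_answer, occurrence)

print(f"Occurrence: {occurrence}, old start: {answer_start}, new start: {new_start}")
```

One caveat: a normalization such as lowercasing can create new matches (for example, `Ion channel` and `ion channel` become identical), which would shift the occurrence numbering. Comparing the occurrence counts before and after normalization is a cheap way to detect such samples.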

# Maximum number of tokens in any sample across datasets for various tokenizers

In [None]:
# Check the maximum number of tokens in any sample across datasets
# def tokenize_sample(sample, tokenizer, max_tokens=None, padding=False)