<a href="https://colab.research.google.com/github/dilaratank/roBERTa-Symptom-Tracking/blob/main/Symptom_tracking_roBERTa_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

## Motivation

The COVID-19 pandemic is a challenging time for all of us. NLP models could help make COVID-19 information in medical interviews easier to interpret and visualize. This research focuses on symptom extraction from medical dialogue before and after COVID-19.

## Research question

What are the most common symptoms before and after COVID-19 in medical interviews? 

## Example as a task illustration

The code is designed to work as follows:
- Feed it a medical dialogue, where symptoms are discussed
- The code will extract the symptoms
- The code will display the most common symptoms in that conversation

An example can be illustrated with the following conversation: \\
Patient: Hello doctor, these last few days I have been coughing a lot. \\
Doctor: That is unfortunate to hear, do you have any other symptoms like sore throat, chest pains, etc? \\
Patient: I have a cough and in addition to that also a sore throat, but that's about it. \\
Doctor: Alright, a cough and a sore throat are symptoms of the Coronavirus but because of the time of year I assume you just have a cold. It is advised to take a test and take medicine for your cough and sore throat. \\

The model would then output 'cough' and 'sore throat' as the most common symptoms discussed in this dialogue.
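
The final step, displaying the most common symptoms, can be sketched with the standard library once symptoms have been extracted per sentence. The `extracted` list below is hand-written to resemble the example dialogue and is only an assumption about what the model might return; the extraction itself is not shown here.

```python
from collections import Counter

# Hypothetical per-sentence extraction results (not real model output)
extracted = [
    [],                              # "... I have been coughing a lot."
    ['sore throat', 'chest pains'],  # doctor asks about other symptoms
    ['cough', 'sore throat'],        # patient confirms two symptoms
    ['cough', 'sore throat', 'cold'],
    ['cough', 'sore throat'],
]

# Flatten the lists and count each symptom, ignoring case
counts = Counter(s.lower() for sentence in extracted for s in sentence)

# Keep the two most frequent symptoms
top = counts.most_common(2)
print(top)
```

With this toy input, 'sore throat' and 'cough' come out on top, in line with the expected output described above.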

## Related literature
- BERT model
- roBERTa model
- Named Entity Recognition 

# Experimental Setup

## Imports and installations 

In [22]:
# Imports
import zipfile
import nltk
import os
from nltk.tokenize import word_tokenize
from nltk import tokenize
nltk.download('punkt')
import pandas as pd
import numpy as np

# Installations

# To avoid version conflicts in the Colab notebook
# %pip install pip -U
# %pip install sentencepiece
# %pip install sortedcontainers==2.1.0

# # The model
# %pip install tner

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Datasets

The following datasets will be used: 
- COVID-19 Dialogue Dataset (during/after COVID-19) https://www.kaggle.com/xuehaihe/covid-dialogue-dataset?select=COVID-Dialogue-Dataset-English.txt
- MedDialog Dataset (English) (before COVID-19) https://github.com/UCSD-AI4H/Medical-Dialogue-System
  - This dataset consists of 4 datasets; the 'icliniq' dataset will be used for this project because of its wide variety of medical subjects

These datasets are structured as follows: 
- ID number
- Description
- Dialogue


### Download data
Let's first download the data!

In [4]:
!git clone https://github.com/dilaratank/roBERTa-Symptom-Tracking.git
%cd roBERTa-Symptom-Tracking/
with zipfile.ZipFile("/content/roBERTa-Symptom-Tracking/data.zip","r") as zip_ref:
    zip_ref.extractall("/content/")
%cd /content/

Cloning into 'roBERTa-Symptom-Tracking'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 21 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (21/21), done.
/content/roBERTa-Symptom-Tracking/roBERTa-Symptom-Tracking
/content


### Dataset Preprocessing

The roBERTa model requires the data to be structured per sentence, which is done in the preprocessing steps. First, the data is split on dialogue to get rid of other unnecessary information. Then the dialogues are split into sentences and saved as a .csv file for later use.
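
The two splitting steps can be tried on a small in-memory sample before running them on the full files. The sketch below is a simplified stand-in, not the notebook's implementation: the sample record is invented (its `id=1` line and description are assumptions about the layout), and a naive regular expression replaces NLTK's sentence tokenizer so that the example runs without downloads.

```python
import csv
import io
import re

# Invented sample in the ID / Description / Dialogue layout
sample = """id=1
Description
Q. Could a cough be a sign of COVID-19?
Dialogue
Patient:
Hello doctor, I have a cough. I also have a sore throat.
Doctor:
Please take a test. Rest well.

"""


def dialogue_lines(text):
    """Collect the lines between each 'Dialogue' header and the next empty line."""
    conversations, convo, inside = [], [], False
    for line in text.splitlines():
        if line.startswith('Dialogue'):
            inside = True
        elif inside and not line.strip():
            conversations.append(convo)
            convo, inside = [], False
        elif inside:
            convo.append(line)
    if convo:  # input ended without a closing empty line
        conversations.append(convo)
    return conversations


def naive_sentences(conversations):
    """Split on sentence-final punctuation and drop the speaker labels."""
    sentences = []
    for convo in conversations:
        for line in convo:
            for part in re.split(r'(?<=[.!?])\s+', line.strip()):
                if part and part not in ('Patient:', 'Doctor:'):
                    sentences.append(part)
    return sentences


sentences = naive_sentences(dialogue_lines(sample))

# Write one sentence per row under a 'sentences' column
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['sentences'])
writer.writerows([s] for s in sentences)
print(buffer.getvalue())
```

The regular expression will split incorrectly on abbreviations such as "Dr.", which is one reason the helper functions below rely on NLTK instead.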

In [7]:
# Helper functions

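def split_on_dialogue(data_path):
    """
    Returns a list of conversations.
    Format: [[conversation_1], [conversation_2], ..., [conversation_n]],
    where each conversation holds the lines that follow a 'Dialogue' header.
    """

    with open(data_path) as f:
        lines = f.readlines()

    i = 0           # 1-based number of the current line
    j = 0           # number of lines collected for the current conversation
    dialogue_i = 0  # line number at which the current conversation starts
    convo = []
    conversations = []

    for line in lines:
        i += 1
        tokens = word_tokenize(line)

        # A 'Dialogue' header means the conversation starts on the next line
        if line[:8] == 'Dialogue':
            dialogue_i = i + 1

        if i == dialogue_i + j:
            convo.append(line)
            j += 1
            # An empty line marks the end of the conversation
            if len(tokens) == 0:
                conversations.append(convo)
                convo = []
                j = 0

    # Keep the last conversation if the file does not end with an empty line
    if convo:
        conversations.append(convo)

    return conversations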

def split_on_sentences(conversations):
  """
  A function that splits the conversations into sentences.
  """

  sentence_list = []

  for conversation in conversations:
      for sentences in conversation:
          token_sen = tokenize.sent_tokenize(sentences)
          for sentence in token_sen:
              if sentence != 'Patient:' and sentence != 'Doctor:':
                  sentence_list.append(sentence)
    
  return sentence_list


def save(df, save_preprocessed_dataframe_path, name):
    """
    Function that saves the created dataframe as a csv.
    """
    
    # os.path.join works whether or not the directory path ends with a slash
    df.to_csv(os.path.join(save_preprocessed_dataframe_path, name + '.csv'), index=False)


In [12]:
# Final Function 

def preprocess_to_csv(data_path, save_to):
  """
  A function that preprocesses the data (so that it is stored per sentence),
  and saves it as a .csv file for later use.
  """
    
  # Split on dialogue 
  conversations = split_on_dialogue(data_path)
  
  # Split on sentence
  sentences = split_on_sentences(conversations)
  
  # Make dataframe
  df_sent = pd.DataFrame(np.array(sentences), columns=['sentences'])
  
  # Save
  name = os.path.basename(data_path)
  # Strip the file extension (e.g. '.txt') from the file name
  save(df_sent, save_to, os.path.splitext(name)[0])
  
  print(name, 'done')
  

Preprocessing the data might take a while!

In [23]:
preprocess_to_csv('/content/data/COVID-Dialogue-Dataset-English.txt', '/content/data/')
preprocess_to_csv('/content/data/icliniq_dialogue.txt', '/content/data/')

COVID-Dialogue-Dataset-English.txt done
icliniq_dialogue.txt done


The data now looks like this:

In [24]:
covid_dialogue_df = pd.read_csv('/content/data/COVID-Dialogue-Dataset-English.csv')
covid_dialogue_df.head()

Unnamed: 0,sentences
0,"Hello doctor, I get a cough for the last few d..."
1,No raise in temperature but feeling tired with...
2,No contact with any Covid-19 persons.
3,It has been four to five days and has drunk a ...
4,Doctors have shut the OP so do not know what t...


## Approaches

- Use pre-trained roBERTa to extract symptoms from medical dialogue before and after COVID-19
- Display most common symptoms
- Model evaluation using accuracy score


## Implementation details

In [None]:
# TODO: explain and write functions that extract the symptoms from the code 

# Evaluation

## Metrics
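The model will be evaluated using the accuracy metric, computed as the number of correctly predicted symptoms divided by the number of ground-truth symptoms. Since only ground-truth symptoms are counted, this ratio is what is commonly called recall.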

In [None]:
def get_predicted_symptoms(prediction):
  """
  This function takes in the prediction of a sentence of the pre-trained model 
  and returns the symptoms mentioned in that sentence. 
  """
  symptoms = []

  # Check if there is a predicted entity
  if len(prediction[0]['entity']) > 0:

    number_of_entities = len(prediction[0]['entity'])

    # Loop over predicted entities and get symptoms (here called: disease)
    for i in range(number_of_entities):
      if prediction[0]['entity'][i]['type'] == 'disease':
        symptoms.append(prediction[0]['entity'][i]['mention'])

  return symptoms 

def accuracy(df):
  """
  This function computes the accuracy score, given a dataframe. 
  """

  number_of_symptoms = 0
  number_of_well_predicted = 0

  for ind in df.index:

    sentence = df['sentences'][ind]

    # Pad short sentences, because the model is not used to very short input
    if len(sentence) < 45:
      sentence = sentence+'...'

    prediction = trainer.predict([sentence])

    predicted_symptom = get_predicted_symptoms(prediction)
    predicted_symptom = [x.lower() for x in predicted_symptom]
    predicted_symptom = [x.split(', ')[0] for x in predicted_symptom]

    gt_symptom = df['symptoms'][ind]


    # If it's not nan
    if isinstance(gt_symptom, str):
      gt_list = gt_symptom.split(', ')

      # Keep track of symptoms 
      for symptom in gt_list:
        number_of_symptoms += 1

        # Keep track of well predicted symptoms 
        if symptom in predicted_symptom:
          number_of_well_predicted += 1

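  print('Ground truth symptoms: ', number_of_symptoms)
  print('Correctly predicted symptoms: ', number_of_well_predicted)
  # Guard against a dataframe without any annotated symptoms
  if number_of_symptoms > 0:
    print('accuracy: ', number_of_well_predicted/number_of_symptoms)
  else:
    print('accuracy: undefined (no ground truth symptoms)')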
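
To make the metric concrete without loading the model, the sketch below repeats the same counting on hand-written data. It is a simplified version of the functions above, not a replacement for them. The prediction dictionaries mimic the structure that `get_predicted_symptoms` reads (a list whose first item has an `entity` list with `type` and `mention` keys); the sentences, labels, the `chemical` type, and the names `toy_predictions` and `symptom_score` are made up for illustration, and real model output may contain further fields.

```python
def extract_mentions(prediction):
    """Return the mentions tagged 'disease' in one prediction (assumed structure)."""
    return [e['mention'] for e in prediction[0]['entity'] if e['type'] == 'disease']


def symptom_score(predictions, ground_truths):
    """Fraction of ground-truth symptoms that also appear in the predictions."""
    total = 0
    found = 0
    for prediction, gt in zip(predictions, ground_truths):
        predicted = [m.lower() for m in extract_mentions(prediction)]
        # Rows without annotated symptoms (NaN in the dataframe) are skipped
        if not isinstance(gt, str):
            continue
        for symptom in gt.split(', '):
            total += 1
            if symptom in predicted:
                found += 1
    return found / total if total else 0.0


# Hypothetical predictions for three sentences
toy_predictions = [
    [{'entity': [{'type': 'disease', 'mention': 'Cough'}]}],
    [{'entity': [{'type': 'disease', 'mention': 'sore throat'},
                 {'type': 'chemical', 'mention': 'paracetamol'}]}],
    [{'entity': []}],
]
# Hypothetical annotations; NaN marks a sentence without symptoms
toy_ground_truths = ['cough, fever', 'sore throat', float('nan')]

score = symptom_score(toy_predictions, toy_ground_truths)
print('score:', round(score, 3))
```

In this toy case two of the three annotated symptoms are found ('fever' is missed), so the score is 2/3. Predicted symptoms that are not in the ground truth do not lower the score.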

## Results

## Error analysis

# Findings

## Illustration

## Interpretation

## Discussion

# Conclusions

## Summary

## Lessons learned