<a href="https://colab.research.google.com/github/elif-tr/COVID19-Text-Processing/blob/main/Covid19_Project_original.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## COVID-19 related papers published during the first week of May 2020

In [None]:
import pandas as pd
import os
import json
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /Users/elif/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Define a global variable for the name of the folder that contains the JSON files of the papers

In [None]:
DATA_DIR = 'archive (1)'

### Working with the metadata file

First, read in the metadata file so that we can filter the papers by the time frame we want.

In [None]:
def meta_data(folder_name, metadata="metadata.csv"):
    '''Read the metadata file and return it as a data frame.

    :param folder_name: name of the folder where the COVID data is located
    :param metadata: name of the metadata file
    :return: data frame with the 'pdf_json_files' and 'publish_time' columns
    '''
    file_name = os.path.join(os.getcwd(), folder_name, metadata)
    data = pd.read_csv(file_name, usecols=['pdf_json_files', 'publish_time'])
    return data

In [None]:
def json_files(start_date='2020-05-01', end_date='2020-05-07'):
    '''Filter the metadata for papers published within a given time frame.
    If no dates are given, the first week of May 2020 is used.

    :param start_date: earliest publication date to include
    :param end_date: latest publication date to include
    :return: list of JSON file names whose publication date falls within the range
    '''
    file = meta_data(DATA_DIR)
    file['publication_date'] = pd.to_datetime(file['publish_time'])
    # Both bounds are inclusive so that papers published on start_date are kept.
    in_range = (file['publication_date'] >= start_date) & (file['publication_date'] <= end_date)
    may_first_week = file[in_range]

    return list(may_first_week['pdf_json_files'].dropna())
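
To illustrate the date filter, here is a minimal sketch on a toy metadata frame. The rows and file names are made up for the example, and it assumes the intent is to keep papers published on both boundary dates; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

# Toy stand-in for metadata.csv; the file names and dates are invented.
toy_meta = pd.DataFrame({
    "pdf_json_files": ["a.json", "b.json", None, "d.json"],
    "publish_time": ["2020-04-30", "2020-05-01", "2020-05-03", "2020-05-07"],
})

dates = pd.to_datetime(toy_meta["publish_time"])

# between() keeps both endpoints, so May 1 and May 7 are included.
in_window = dates.between(pd.Timestamp("2020-05-01"), pd.Timestamp("2020-05-07"))

# Rows without a JSON file are dropped, as in json_files().
selected = toy_meta.loc[in_window, "pdf_json_files"].dropna().tolist()
print(selected)
```

In this toy data the April 30 paper falls outside the window and the May 3 paper has no JSON file, so only the two boundary-date papers remain.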

### Files that contain multiple papers

I observed that some entries in the metadata list more than one JSON file, which makes it harder to read the files in individually. To handle this, we split those entries so that every item in our list refers to exactly one file.

In [None]:
# In those entries the file paths are separated by ';' rather than ','.

all_files = []
for file in json_files():
    all_files.extend(map(str.strip, file.split(";")))
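
As a small worked example of the splitting step, the sketch below uses two hypothetical entries (the paths are invented, written in the style of the `pdf_json_files` column) and flattens them into one path per item.

```python
# Hypothetical entries; the second one lists two files separated by ';'.
entries = [
    "document_parses/pdf_json/aaa.json",
    "document_parses/pdf_json/bbb.json; document_parses/pdf_json/ccc.json",
]

flat = []
for entry in entries:
    # Split on ';' and strip the whitespace left around each path.
    flat.extend(path.strip() for path in entry.split(";"))

print(flat)
```

Stripping matters here because a space after the semicolon would otherwise end up at the start of the second path and the file would not be found.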


### Extracting only the columns needed for our analysis

Some of the code was taken from: https://www.kaggle.com/davidbetancur8/symptoms-word-cloud

In [None]:
def read_json_files(file_list=all_files):
    '''Read the date-filtered JSON files into a data frame with three columns:
    paper_id, title and the body text of the paper.

    :param file_list: list of JSON file paths, relative to DATA_DIR
    :return: data frame with the columns "paper_id", "title" and "full_text"
    '''
    docs = []
    for file in file_list:
        file_name = os.path.join(os.getcwd(), DATA_DIR, file)
        with open(file_name) as f:
            data_json = json.load(f)

        title = data_json["metadata"]["title"]
        paper_id = data_json['paper_id']

        # Lower-case each paragraph and join them with a space so that words
        # at paragraph boundaries do not run together.
        full_text = " ".join(text["text"].lower() for text in data_json["body_text"])
        docs.append([paper_id, title, full_text])

    df = pd.DataFrame(docs, columns=["paper_id", "title", "full_text"])

    return df
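
To make the fields used above concrete, here is a minimal sketch that parses a made-up record. The record mimics only the three fields this notebook reads (`paper_id`, `metadata.title`, `body_text[].text`); its contents are invented, and joining paragraphs with a space is an assumption of the sketch.

```python
import json

# A made-up record that mimics only the fields read by read_json_files().
raw = '''
{
  "paper_id": "abc123",
  "metadata": {"title": "A Toy Paper"},
  "body_text": [
    {"text": "Fever was common.", "section": "Results"},
    {"text": "Cough was also reported.", "section": "Results"}
  ]
}
'''

record = json.loads(raw)
paper_id = record["paper_id"]
title = record["metadata"]["title"]

# Join paragraphs with a space so the last word of one paragraph
# does not fuse with the first word of the next.
full_text = " ".join(p["text"].lower() for p in record["body_text"])

print(paper_id, title, full_text, sep=" | ")
```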
    

### Extracting sentences that contain symptoms

Defining the symptoms that we will extract from our sentences. The list was taken from: https://www.kaggle.com/davidbetancur8/symptoms-word-cloud

In [None]:
symptoms = [
    "weight loss","chills","shivering","convulsions","deformity","discharge","dizziness",
    "vertigo","fatigue","malaise","asthenia","hypothermia","jaundice","muscle weakness",
    "pyrexia","sweats","swelling","swollen","painful lymph node","weight gain","arrhythmia",
    "bradycardia","chest pain","claudication","palpitations","tachycardia","dry mouth","epistaxis",
    "halitosis","hearing loss","nasal discharge","otalgia","otorrhea","sore throat","toothache","tinnitus",
    "trismus","abdominal pain","fever","bloating","belching","bleeding","blood in stool","melena","hematochezia",
    "constipation","diarrhea","dysphagia","dyspepsia","fecal incontinence","flatulence","heartburn",
    "nausea","odynophagia","proctalgia fugax","pyrosis","steatorrhea","vomiting","alopecia","hirsutism",
    "hypertrichosis","abrasion","anasarca","bleeding into the skin","petechia","purpura","ecchymosis and bruising",
    "blister","edema","itching","laceration","rash","urticaria","abnormal posturing","acalculia","agnosia","alexia",
    "amnesia","anomia","anosognosia","aphasia and apraxia","apraxia","ataxia","cataplexy","confusion","dysarthria",
    "dysdiadochokinesia","dysgraphia","hallucination","headache","akinesia","bradykinesia","akathisia","athetosis",
    "ballismus","blepharospasm","chorea","dystonia","fasciculation","muscle cramps","myoclonus","opsoclonus",
    "tremor","flapping tremor","insomnia","loss of consciousness","syncope","neck stiffness","opisthotonus",
    "paralysis and paresis","paresthesia","prosopagnosia","somnolence","abnormal vaginal bleeding",
    "vaginal bleeding in early pregnancy", "miscarriage","vaginal bleeding in late pregnancy","amenorrhea",
    "infertility","painful intercourse","pelvic pain","vaginal discharge","amaurosis fugax","amaurosis",
    "blurred vision","double vision","exophthalmos","mydriasis","miosis","nystagmus","amusia","anhedonia",
    "anxiety","apathy","confabulation","depression","delusion","euphoria","homicidal ideation","irritability",
    "mania","paranoid ideation","suicidal ideation","apnea","hypopnea","cough","dyspnea","bradypnea","tachypnea",
    "orthopnea","platypnea","trepopnea","hemoptysis","pleuritic chest pain","sputum production","arthralgia",
    "back pain","sciatica","Urologic","dysuria","hematospermia","hematuria","impotence","polyuria",
    "retrograde ejaculation","strangury","urethral discharge","urinary frequency","urinary incontinence","urinary retention"]

## Check which papers contain the words in our list of symptoms, then extract only those sentences

I had to create a separate data frame without the title column, because the title kept getting mixed in with the full text when using nltk.tokenize.

I spent quite a long time trying to figure out why this was happening, so I decided to work with a data frame that contains only the paper ID and the full text.

In [None]:
data_frame = read_json_files()

In [None]:
df_no_title = data_frame[['paper_id', 'full_text']]

## nltk.tokenize

I used nltk.tokenize to split the full text into one sentence per row.

In [None]:
sentences = []
for row in df_no_title.itertuples():
    # row[1] is the paper_id and row[2] is the full_text
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))

new_df = pd.DataFrame(sentences, columns=['Paper_Id', 'Sentence'])
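
The same one-sentence-per-row reshaping can also be sketched with `DataFrame.explode`. In the example below, `naive_split` is a hypothetical regex stand-in for `sent_tokenize`, used only so the sketch runs without the punkt data, and the toy texts are invented.

```python
import re
import pandas as pd

def naive_split(text):
    # Stand-in for sent_tokenize: split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

toy = pd.DataFrame({
    "paper_id": ["p1", "p2"],
    "full_text": ["fever was common. cough followed.", "no symptoms were reported."],
})

long_df = (
    toy.assign(Sentence=toy["full_text"].map(naive_split))  # list of sentences per paper
       .explode("Sentence")                                 # one row per sentence
       .drop(columns="full_text")
       .rename(columns={"paper_id": "Paper_Id"})
       .reset_index(drop=True)
)
print(long_df)
```

The explicit loop above and this version produce the same shape; the loop is arguably easier to follow, while `explode` avoids building the list of tuples by hand.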


I added a Sentence_ID column to keep track of which sentences are retrieved when we check each sentence for symptoms. The ID values start from 1, hence the increment of 1 on the existing data frame index.

In [None]:
new_df['Sentence_ID'] = new_df.index + 1

In [None]:
new_df['Sentence'][108]

'fifty-two patients had a severe hypercapnia with a p aco2 ≥ 60 mmhg and 28 a severe acidosis with a ph < 7.2. all subjects were sedated and invasive mechanically ventilated in a pressure controlled mode with a shorter duration before pecla in the survivor group.'

### Final data frame that includes only the sentences that contain any of the symptoms from our list

In [None]:
final_df = new_df[new_df['Sentence'].str.contains('|'.join(symptoms))]
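
One thing to be aware of with a plain `'|'.join(symptoms)` pattern is that it matches substrings, so a short term such as "rash" is also found inside "crash". The sketch below, using a small illustrative subset of the symptom list and invented sentences, shows how word boundaries change the result. It is one possible refinement, not what the cell above does.

```python
import re

symptoms_demo = ["rash", "chest pain", "fever"]  # small illustrative subset

# Escape each term, put longer terms first, and wrap the alternation
# in word boundaries so only whole words or phrases match.
alternation = "|".join(
    re.escape(s) for s in sorted(symptoms_demo, key=len, reverse=True)
)
bounded = re.compile(r"\b(?:" + alternation + r")\b")
unbounded = re.compile("|".join(symptoms_demo))

sentence = "the car crash was unrelated to the trial."
unbounded_hit = bool(unbounded.search(sentence))  # True: "rash" inside "crash"
bounded_hit = bool(bounded.search(sentence))      # False: no whole-word match
print(unbounded_hit, bounded_hit)

real_hit = bool(bounded.search("patients reported chest pain and fever."))
print(real_hit)  # True
```

The string in `bounded.pattern` could be passed to `str.contains` in place of the plain join. The group is non-capturing, which avoids the pandas warning about match groups in the pattern.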

In [None]:
final_df['Sentence'].head()

112    11% died due to infaust neurologic prognosis (...
259    we retrospectively analyzed medical charts of ...
260    the patient first had 4-5 episodes of watery d...
261    she was given intravenous (iv) fluids and disc...
262    however, she returned to the ed the next day w...
Name: Sentence, dtype: object

### Rearrange the columns for display

In [None]:
columnsTitles = ['Paper_Id', 'Sentence_ID', 'Sentence']

covid_df = final_df.reindex(columns=columnsTitles)

### Save the sentences to a CSV file with three columns (paper ID, sentence ID, and sentence text) to serve as our testing data

I did not save the index, as it would create an extra index column when we read the file back in.

In [None]:
covid_df.to_csv('/Users/elif/Desktop/covid_testing_data.csv', index = False)
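
The effect of `index = False` can be seen in a small round trip through an in-memory buffer. The toy frame below is invented for the example; nothing is written to disk.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"Paper_Id": ["p1", "p2"], "Sentence": ["a", "b"]},
    index=[112, 259],
)

# With the index: it is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# Without the index: only the real columns come back.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Paper_Id', 'Sentence']
print(cols_without)  # ['Paper_Id', 'Sentence']
```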

### Create another spreadsheet with three columns: sentence #, word, and tag
### Here the sentence # should be consecutive integers (similar to a surrogate key in a database table) and is not the same as the sentence ID in the first spreadsheet.


    Place one word per row and label symptom words in the tag column:
    mark the beginning (B-Sym) and inside (I-Sym) of each symptom term.
    If a term consists of only one word, simply mark it as B-Sym with no I-Sym.
    Label all other words as O.

In [None]:
frame = final_df[['Sentence']]

In [None]:
frame.head()

|     | Sentence |
| --- | --- |
| 112 | 11% died due to infaust neurologic prognosis (... |
| 259 | we retrospectively analyzed medical charts of ... |
| 260 | the patient first had 4-5 episodes of watery d... |
| 261 | she was given intravenous (iv) fluids and disc... |
| 262 | however, she returned to the ed the next day w... |


### Tokenize sentences

In this part of the task, I first tried plain string comparison, but could not get it to work in my code and got stuck on the labeling again.

Instead, I went beyond what I was comfortable with and tried to use regular expressions to do the tagging.

In [None]:
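# Build (regex, tag) pairs for nltk.RegexpTagger to tag the symptoms.
# Note: the tagger matches each regex against one token at a time, so the
# multi-word patterns (and their 'B-SYM I-SYM' tags) never match a token.
pattern = [(symptom, ' '.join(['B-SYM'] + ['I-SYM'] * symptom.count(' '))) for symptom in symptoms]
tagger = nltk.RegexpTagger(pattern)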

In [None]:
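words = []
for row in frame.itertuples():
    # row[0] is the data frame index and row[1] is the sentence text
    for word, tag in tagger.tag(word_tokenize(row[1])):
        # Tokens that match no symptom pattern come back with a tag of None
        if tag is None:
            tag = 'O'
        words.append((row[0], word, tag))

tag_df = pd.DataFrame(words, columns=['Sentence', 'Words', 'Tag'])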

    - Checking the words to tag them according to the tags in the provided Excel sheet.

    - The tagging above works; however, I still need to identify the symptoms that consist of multiple words.
    
    I am stuck here.
    
I observed that there are 43 multi-word symptoms and 375 rows containing multi-word symptoms that I still need to identify.
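
One possible way to handle the multi-word symptoms is to match whole token sequences instead of single tokens. The sketch below is only an illustration under stated assumptions: `bio_tag` is a hypothetical helper, the symptom subset and sentence are invented, tokens are compared by exact lower-cased equality, and a plain `split()` stands in for `word_tokenize`.

```python
def bio_tag(tokens, symptom_terms):
    """Tag tokens as B-Sym / I-Sym / O, preferring the longest matching term."""
    # Pre-split each symptom into its words; try longer terms first so that
    # "pleuritic chest pain" wins over "chest pain" when both could match.
    term_tokens = sorted(
        (term.lower().split() for term in symptom_terms), key=len, reverse=True
    )
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)

    i = 0
    while i < len(tokens):
        for words in term_tokens:
            n = len(words)
            if lowered[i:i + n] == words:
                tags[i] = "B-Sym"                         # first word of the term
                tags[i + 1:i + n] = ["I-Sym"] * (n - 1)   # remaining words, if any
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return list(zip(tokens, tags))


demo_terms = ["chest pain", "pleuritic chest pain", "fever"]
demo_tokens = "she had chest pain and fever .".split()
tagged = bio_tag(demo_tokens, demo_terms)
for word, tag in tagged:
    print(word, tag)
```

Because the comparison is exact, inflected forms such as "fevers" would be left as O in this sketch; whether that is acceptable depends on how strictly the symptom list should be applied.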

In [None]:
# Styling of sentence numbers: instead of repeating the same number on every row,
# we show it only on the first row of each sentence and replace the repeats with a space
is_duplicate = tag_df['Sentence'].duplicated()

tag_df['Sentence_ID'] = tag_df['Sentence'].where(~is_duplicate, ' ')
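
The task description asks for the sentence # to be consecutive integers, whereas the values carried over from the earlier data frame index (112, 259, ...) have gaps. A minimal sketch of one way to renumber them with `pd.factorize` follows; the toy rows and the column names `Sentence_No` and `Sentence_Display` are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence": [112, 112, 112, 259, 259],  # original row labels, as in tag_df
    "Words": ["fever", "was", "common", "dry", "cough"],
})

# factorize() assigns 0, 1, 2, ... in order of first appearance.
codes, _ = pd.factorize(toy["Sentence"])
toy["Sentence_No"] = codes + 1

# Show the number only on the first row of each sentence.
first_rows = ~toy["Sentence_No"].duplicated()
toy["Sentence_Display"] = toy["Sentence_No"].astype(str).where(first_rows, "")

print(toy[["Sentence_Display", "Words"]])
```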

In [None]:
tagged_data = tag_df[['Sentence_ID', 'Words', 'Tag']]

In [None]:
tagged_data.to_csv('/Users/elif/Desktop/covid_tagged_data.csv', index = False)

## THANK YOU!