## COVID-19 related papers published during the first week of May 2020

Papers extracted from `https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge`

In [1]:
import pandas as pd
import numpy as np
import os
import json
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import parse_cord as cord

[nltk_data] Downloading package punkt to /Users/elif/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf

In [3]:
# For Notebooks
init_notebook_mode(connected=True)

# For offline use
cf.go_offline()

### Define a global variable for the folder name that contains the json files of the papers

In [4]:
DATA_DIR = 'data/archive'

Call the module to read in the data

In [5]:
# Use the module that reads in the json files - parse_cord file
data_frame = cord.read_json_files(cord.json_files(DATA_DIR), DATA_DIR)
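
The `parse_cord` module is not shown in this notebook. As a rough sketch of what such a reader might look like, the code below assumes the usual CORD-19 JSON layout (a `paper_id` field, a `metadata` object with a `title`, and a `body_text` list of paragraphs that each carry a `text` field). The function name `list_json_files` and the column names are illustrative assumptions, not the module's actual API.

```python
import json
import os

import pandas as pd


def list_json_files(data_dir):
    """Return the sorted names of the .json files directly inside data_dir."""
    return sorted(f for f in os.listdir(data_dir) if f.endswith(".json"))


def read_json_files(file_names, data_dir):
    """Build one row per paper with its id, title and concatenated body text."""
    rows = []
    for name in file_names:
        with open(os.path.join(data_dir, name), encoding="utf-8") as handle:
            paper = json.load(handle)
        # Assumed CORD-19 layout: "paper_id", "metadata" -> "title", and
        # "body_text" as a list of paragraphs that each carry a "text" field.
        body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
        rows.append({
            "paper_id": paper.get("paper_id"),
            "title": paper.get("metadata", {}).get("title", ""),
            "full_text": body,
        })
    return pd.DataFrame(rows, columns=["paper_id", "title", "full_text"])
```

Under these assumptions, `read_json_files(list_json_files(DATA_DIR), DATA_DIR)` would yield a frame with the `paper_id` and `full_text` columns used later in the notebook.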

### Extracting sentences that contain symptoms

Defining the symptoms that we will extract from our sentences. The list was taken from: https://www.kaggle.com/davidbetancur8/symptoms-word-cloud

Max modified the list to include more up-to-date symptoms, and I added the following symptoms, which were either taken from the CDC's website or observed frequently in the papers:

    -"difficulty breathing"
    -"muscle ache"
    -"congestion"
    -"runny nose"
    -"trouble breathing"
    -"persistent pain"
    -"pressure in the chest"
    -"inability to wake"
    -"stay awake"
    -"bluish lips"
    -"bluish face"
    -"fevers"
    -"decreased appetite"

In [6]:
symptoms = [
    "weight loss","chills","shivering","convulsions","deformity","discharge","dizziness", "lymphopenia", "sneezing",
    "vertigo","fatigue","malaise","asthenia","hypothermia","jaundice","muscle weakness", "chest discomfort",
    "pyrexia","sweats","swelling","swollen","painful lymph node","weight gain","arrhythmia", "loss of smell", 
    "loss of appetite", "loss of taste", "bradycardia","chest pain","claudication","palpitations","tachycardia",
    "dry mouth","epistaxis", "dysgeusia", "hypersomnia", "taste loss", "halitosis","hearing loss","nasal discharge", 
    "nasal inflammation", "otalgia","otorrhea","sore throat","toothache","tinnitus", "dysphonia",
    "trismus","abdominal pain","fever","bloating","belching","bleeding","bloody stool","melena","hematochezia", 
    "burning sensation in the chest", "constipation","diarrhea","dysphagia","dyspepsia","fecal incontinence",
    "flatulence", "heartburn", "chest tightness", "chest pressure","nausea","odynophagia","proctalgia fugax",
    "pyrosis","steatorrhea","vomiting","alopecia","hirsutism", "tachypnoea", "nasal obstruction",
    "hypertrichosis","abrasion","anasarca","bleeding into skin","petechia","purpura","ecchymosis", "bruising", 
    "blister","edema","itching","laceration","rash","urticaria","abnormal posturing","acalculia","agnosia","alexia",
    "amnesia","anomia","anosognosia","aphasia","apraxia","ataxia","cataplexy","confusion","dysarthria", 
    "nasal congestion","dysdiadochokinesia","dysgraphia","hallucination","headache","akinesia","bradykinesia",
    "ballismus","blepharospasm","chorea","dystonia","fasciculation","muscle cramps","myoclonus","opsoclonus",
    "tremor","flapping tremor","insomnia","loss of consciousness","syncope","neck stiffness","opisthotonus",
    "paralysis","paresis","paresthesia","prosopagnosia","somnolence","abnormal vaginal bleeding", "neuralgia",
    "vaginal bleeding in early pregnancy", "miscarriage","vaginal bleeding in late pregnancy","amenorrhea", "body aches",
    "infertility","painful intercourse","pelvic pain","vaginal discharge","amaurosis fugax","amaurosis", "skin lesions",
    "blurred vision","double vision","exophthalmos","mydriasis","miosis","nystagmus","amusia","anhedonia",
    "anxiety","apathy","confabulation","depression","delusion","euphoria","homicidal ideation","irritability",
    "mania","paranoid ideation","suicidal ideation","apnea","hypopnea","cough","dyspnea","bradypnea","tachypnea",
    "orthopnea","platypnea","trepopnea","hemoptysis","pleuritic chest pain","sputum production","arthralgia",
    "back pain","sciatica","urologic","dysuria","hematospermia","hematuria","impotence","polyuria",
    "retrograde ejaculation","strangury","urethral discharge","urinary frequency","urinary incontinence", 
    "anosmia", "myalgia", "rhinorrhea", "shortness of breath", "difficulty breathing", "muscle ache", "congestion",
    "runny nose", "trouble breathing", "persistent pain", "pressure in the chest", "inability to wake", "stay awake",
    "bluish lips", "bluish face","akathisia","athetosis", "urinary retention", "fevers", 
    "decreased appetite"]

In [7]:
len(symptoms)

209

## Check papers that contain any of the symptoms from list of symptoms and then extract those sentences only

In [8]:
# I created a data frame without titles because I could not work with the frame that also includes paper titles
df_no_title = data_frame[['paper_id', 'full_text']]

### Final data frame that only includes sentences from each text that contains any of the symptoms from our list of symptoms

    - The split text module takes in a data frame that has paper id and full text.

    - It preprocesses the texts by splitting them into one sentence per row, together with the paper id each sentence belongs to.

    - Then it extracts the sentences that include any of the given symptoms and produces a data frame.

In [9]:
# Use the split text module to preprocess the texts and output a data frame of sentences that include one or more of the symptoms
import split_text as sp

final_df = sp.sentence_w_symptoms(sp.split_sentences(df_no_title), symptoms)

Note: To retrieve sentences for a different set of symptoms, you only need to change the second argument of the `sentence_w_symptoms` call above.
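
The `split_text` module is not included here, so the following is only a minimal sketch of the two steps described above. It assumes a naive regex sentence splitter as a stand-in (the notebook imports NLTK's `sent_tokenize`, which could be passed as `splitter` instead), and the function name `sentences_with_terms` is hypothetical.

```python
import re

import pandas as pd


def simple_sentence_split(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_sentences(df, splitter=simple_sentence_split):
    """Return one row per sentence, keeping the id of the source paper."""
    rows = [
        {"Paper_Id": paper_id, "Sentence": sentence}
        for paper_id, text in zip(df["paper_id"], df["full_text"])
        for sentence in splitter(text)
    ]
    out = pd.DataFrame(rows, columns=["Paper_Id", "Sentence"])
    out["Sentence_ID"] = range(1, len(out) + 1)  # running id across all papers
    return out


def sentences_with_terms(sentences, terms):
    """Keep only the sentences that mention at least one term (case-insensitive)."""
    pattern = "|".join(re.escape(t) for t in terms)
    mask = sentences["Sentence"].str.contains(pattern, case=False, regex=True)
    return sentences[mask]
```

Because the filter keeps the original row index, gaps in the index of the result (as in the output below) simply mark the sentences that were dropped.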


In [10]:
final_df['Sentence'].head()

96     Subjects who died in hospital were significant...
202    We retrospectively analyzed medical charts of ...
203    The patient first had 4-5 episodes of watery d...
205    However, she returned to the ED the next day w...
217    The patient reported an overall improvement in...
Name: Sentence, dtype: object

### Rearranged the display of the columns

In [None]:
columnsTitles = ['Paper_Id', 'Sentence_ID', 'Sentence']

covid_df = final_df.reindex(columns=columnsTitles)

In [12]:
covid_df.head()

Unnamed: 0,Paper_Id,Sentence_ID,Sentence
96,84d22b71f6df277a11824433ccf14137303f55f5,97,Subjects who died in hospital were significant...
202,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,203,We retrospectively analyzed medical charts of ...
203,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,204,The patient first had 4-5 episodes of watery d...
205,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,206,"However, she returned to the ED the next day w..."
217,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,218,The patient reported an overall improvement in...


### Output of the test data file 

Save the sentences in a csv file with 3 columns (paper ID, sentence ID, and sentence text) to be our testing data. This dataframe includes one sentence per row and the paper id it belongs to.

I did not save the index, as it would create an extra index column when we reread the file.

In [None]:
covid_df.to_csv('/Users/elif/Desktop/covid_testing_data_May.csv', index = False, encoding= 'utf-8')
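
The effect of `index = False` can be checked with a small in-memory round trip. This is an illustrative sketch with made-up rows, not the notebook's data:

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"Paper_Id": ["a", "b"], "Sentence_ID": [97, 203], "Sentence": ["s1", "s2"]},
    index=[96, 202],
)

# With the index written, the old index comes back as an extra, unnamed column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# With index=False, only the three data columns survive the round trip.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

The first list starts with `Unnamed: 0`, which is the extra column the note above refers to.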

## Characteristics of data

Create a program that will take a csv file and report:
 - number of sentences
 - number of sentences with terms
 - individual term counts (how many times each term appears)
 - given a collection of symptom terms X in a csv file, for each term in X, its count in the collection, sorted in descending order
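
The report items above can be sketched as a single function. This is one possible design, not the code used in the rest of the notebook: the names `term_counts` and `summarize` are hypothetical, the CSV is assumed to have a `Sentence` column, and matches are counted case-insensitively and without overlap.

```python
import re
from collections import Counter

import pandas as pd


def term_counts(sentences, terms):
    """Count case-insensitive occurrences of each term, sorted descending."""
    # Longer terms first so that e.g. "fevers" is not consumed as "fever".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    counts = Counter()
    for sentence in sentences:
        counts.update(m.group(0).lower() for m in pattern.finditer(sentence))
    return pd.Series(counts, name="Counts", dtype="int64").sort_values(ascending=False)


def summarize(csv_source, terms, column="Sentence"):
    """Report sentence totals and per-term counts for a CSV of sentences."""
    sentences = pd.read_csv(csv_source)[column].astype(str)
    pattern = "|".join(re.escape(t) for t in terms)
    with_terms = sentences.str.contains(pattern, case=False, regex=True).sum()
    return {
        "n_sentences": len(sentences),
        "n_sentences_with_terms": int(with_terms),
        "term_counts": term_counts(sentences, terms),
    }
```

`csv_source` can be a file path or any file-like object accepted by `pd.read_csv`.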

## Function that produces total number of sentences, papers that include our symptoms 

In [13]:
def char_of_data(df):
    '''
    Prints data characteristics: the number of sentences in the data frame (all of which contain
    at least one symptom) and the number of unique papers those sentences come from.
    
    :param df: data frame that includes the Sentence and Paper_Id columns for our data
    '''
    #print('Total number of papers published in the month of May', len(new_df['Paper_Id'].unique()))
    #print('Total number of sentences from the papers that are published in the given time frame is', len(new_df['Sentence']))
    print('Total number of sentences in the final data frame with symptoms is', len(df['Sentence']))
    print('Total number of unique papers in the final data frame is', len(df['Paper_Id'].unique()))


In [14]:
char_of_data(covid_df)

Total number of sentences in the final data frame with symptoms is 27265
Total number of unique papers in the final data frame is 5970


## Function that produces the count of symptoms in the given data frame

In [15]:
#use df symptom count module
import df_symptom_count as sym_count
sym_df = sym_count.symptoms_df(covid_df, symptoms)

In [16]:
sym_df.head()

Unnamed: 0_level_0,Counts
Symptoms,Unnamed: 1_level_1
fever,7145
anxiety,5130
cough,4775
depression,2990
discharge,2601


`sym_df` shows how many times each symptom from our list of symptoms appears in the given data frame.
If you wish to check the counts for a different list of symptoms, please update the second parameter in the above function call.

### Visualization of symptom counts

In [17]:
sym_df.iplot(kind='scatter',y='Counts',mode='markers',size=10)

Convert symptoms to dictionary with their counts

In [16]:
sym_df.to_dict()

{'Counts': {'fever': 7145,
  'anxiety': 5130,
  'cough': 4775,
  'depression': 2990,
  'discharge': 2601,
  'diarrhea': 1582,
  'fatigue': 1393,
  'dyspnea': 1358,
  'bleeding': 1207,
  'headache': 1138,
  'edema': 1092,
  'lymphopenia': 1048,
  'shortness of breath': 1026,
  'anosmia': 945,
  'vomiting': 909,
  'confusion': 787,
  'nausea': 733,
  'urologic': 727,
  'myalgia': 721,
  'sore throat': 651,
  'rash': 585,
  'sneezing': 501,
  'abdominal pain': 490,
  'arrhythmia': 457,
  'congestion': 440,
  'chest pain': 409,
  'weight loss': 398,
  'tachycardia': 325,
  'insomnia': 294,
  'dizziness': 245,
  'swelling': 238,
  'dysgeusia': 222,
  'chills': 222,
  'loss of smell': 214,
  'urticaria': 207,
  'malaise': 200,
  'paralysis': 185,
  'dysphagia': 175,
  'rhinorrhea': 174,
  'nasal congestion': 158,
  'ataxia': 154,
  'suicidal ideation': 153,
  'skin lesions': 151,
  'apnea': 144,
  'weight gain': 143,
  'loss of taste': 142,
  'runny nose': 129,
  'bradycardia': 127,
  'nasal

In [18]:
#List of symptoms that appeared in the published papers that were extracted with our list of symptoms
counted_symptoms = sym_df.index.to_list()

### Check to see if any of the symptoms do not appear in the papers from our original symptoms list

In [19]:
main_list = np.setdiff1d(symptoms,counted_symptoms)
# yields the elements of `symptoms` that are NOT in `counted_symptoms`
len(main_list)

28

It appears that there are 28 symptoms on our original symptoms list that do not occur in the papers.

### 3. Encountered problem: the 'fevers' symptom appears on the list of symptoms that do not occur in the papers.

I have not been able to find out why. It clearly does occur: the check below finds it 196 times (192 as 'fevers' and 4 as 'Fevers').

In [21]:
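# Search only for the terms reported as missing; each term is escaped so it is matched literally
pattern = '({})'.format('|'.join(re.escape(term) for term in main_list))
# groupby(level=0).sum() replaces Series.sum(level=0), which was removed in pandas 2.0
df_test1 = covid_df.Sentence.str.extractall(pattern, flags=re.IGNORECASE)\
                            .iloc[:, 0].str.get_dummies().groupby(level=0).sum()
print(df_test1.sum(axis=0))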

Fevers      4
fevers    192
dtype: int64
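
One possible explanation, offered as an assumption since `df_symptom_count` is not shown here: if the counting module joins the whole symptoms list into a single regex alternation in list order, then `fever` comes before `fevers`. Python's `re` tries alternatives from left to right and accepts the first one that matches, so every "fevers" in the text would be counted as "fever". The check above does find "fevers" because `main_list` does not contain "fever". The small demonstration below shows the effect and one way around it (sorting the terms longest first).

```python
import re

text = "Recurrent fevers were reported; fever resolved by day 5."

# List order: "fever" precedes "fevers", as in the symptoms list above.
in_list_order = re.findall("fever|fevers", text, flags=re.IGNORECASE)

# Longest term first, so the longer alternative gets the first chance to match.
terms = sorted(["fever", "fevers"], key=len, reverse=True)
longest_first = re.findall("|".join(map(re.escape, terms)), text, flags=re.IGNORECASE)

print(in_list_order)   # ['fever', 'fever']
print(longest_first)   # ['fevers', 'fever']
```

If this is indeed the cause, other terms that extend a shorter term placed earlier in the list would be affected in the same way.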


### Create another spreadsheet with three columns: sentence #, word, and tag.


    Place one word in a row and label symptoms words in the tag column: 
    mark the beginning (B-Sym) and inside (I-Sym) of each symptom term. 
    If a term consists of only one word, simply mark it as B-Sym with no I-Sym.  
    Label all other words as O. 
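
The labeling scheme above can be sketched as follows. This is a minimal illustration rather than the `symp_search` or `symptom_tagging` implementation: the function name `bio_tag` is hypothetical, and the regex tokenizer (words and single punctuation marks) is a simplifying assumption in place of NLTK's `word_tokenize`.

```python
import re


def bio_tag(sentence, terms):
    """Return (word, tag) pairs with B-Sym / I-Sym on symptom terms, O elsewhere."""
    words = re.findall(r"\w+|[^\w\s]", sentence)
    lowered = [w.lower() for w in words]
    # Try longer terms first so multi-word symptoms win over their sub-terms.
    term_tokens = sorted(
        (t.lower().split() for t in terms if t.strip()), key=len, reverse=True
    )
    tags = ["O"] * len(words)
    i = 0
    while i < len(words):
        for tokens in term_tokens:
            n = len(tokens)
            if lowered[i:i + n] == tokens:
                tags[i] = "B-Sym"                        # first word of the term
                tags[i + 1:i + n] = ["I-Sym"] * (n - 1)  # remaining words, if any
                i += n - 1                               # skip past the tagged term
                break
        i += 1
    return list(zip(words, tags))


for word, tag in bio_tag("She reported shortness of breath and fever.",
                         ["fever", "shortness of breath"]):
    print(word, tag)
```

Here "shortness of breath" receives `B-Sym I-Sym I-Sym`, "fever" receives a lone `B-Sym`, and every other token, including the final period, is tagged `O`.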

In [22]:
import symp_search
covid_data = symp_search.symptom_search(covid_df, symptoms)

In [23]:
covid_data.head()


Unnamed: 0,Paper_Id,Sentence_ID,Sentence,Token
96,84d22b71f6df277a11824433ccf14137303f55f5,Sentence #1,Subjects who died in hospital were significant...,Subjects who died in hospital were significant...
202,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,Sentence #2,We retrospectively analyzed medical charts of ...,We retrospectively analyzed medical charts of ...
203,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,Sentence #3,The patient first had 4-5 episodes of watery d...,The patient first had 4-5 episodes of watery B...
205,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,Sentence #4,"However, she returned to the ED the next day w...","However, she returned to the ED the next day w..."
217,b382ff1b00757c3cb6a7408d8e993aa6d94d3e28,Sentence #5,The patient reported an overall improvement in...,The patient reported an overall improvement in...


### Tokenize sentences

In this part of the task, I tokenized the words and tokens columns and then replaced the tokens
with the actual B-SYM and I-SYM tags, with the dashes added in the middle.

Then, I produced a data frame with three columns only.

In [26]:
#Use symptom tagging module to tag the words based on our symptom list
import symptom_tagging as sym_tag
df = sym_tag.tokenize_sentences(covid_data)
tagged_data = sym_tag.remove_duplicate_sentence_ids(df)
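
The name `remove_duplicate_sentence_ids` suggests that the sentence id is kept only on the first word of each sentence. A hedged sketch of that idea, with a hypothetical helper `blank_repeated_ids` and made-up rows, and assuming repeated ids are replaced with empty strings:

```python
import pandas as pd


def blank_repeated_ids(df, id_col="Sentence_ID"):
    """Keep the id on the first row of each sentence and blank the rest."""
    out = df.copy()
    repeated = out[id_col].eq(out[id_col].shift())  # same id as the row above
    out.loc[repeated, id_col] = ""
    return out


tokens = pd.DataFrame({
    "Sentence_ID": ["Sentence #1", "Sentence #1", "Sentence #2"],
    "Words": ["Subjects", "who", "We"],
    "Tag": ["O", "O", "O"],
})
print(blank_repeated_ids(tokens))
```

The input frame is copied first, so the original token table keeps its full ids.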


In [27]:
tagged_data.head(10)

Unnamed: 0,Sentence_ID,Words,Tag
0,Sentence #1,Subjects,O
1,,who,O
2,,died,O
3,,in,O
4,,hospital,O
5,,were,O
6,,significantly,O
7,,older,O
8,,and,O
9,,SOFA,O


In [None]:
#Final tagged covid data
tagged_data.to_csv('/Users/elif/Desktop/covid_tagged_data_May.csv', index = False, encoding= 'utf-8')