## Working with Colloquial datasets

This notebook works with the provided colloquial files, which are already tagged. We read in the two existing data sets, analyze the number of symptoms in each, and create one merged list of symptoms from both data sets.
The steps taken in this notebook are as follows:

    1. Read in the patient-site_lableled.xlsx and plm_dataset_labeled.xlsx files
    2. Merge them into one tagged file
    3. Extract the list of symptoms from both files
    4. Count the occurrences of each symptom in the files
    5. Join the words of each sentence to retrieve the full sentences, and select only those that include any symptom from the symptom list
    6. Remove sentences that have one or fewer characters
    7. After extracting the sentences that include one or more symptoms, split each sentence back into words and tag them
    8. The final output file is the tagged version of both files combined and includes only sentences that contain any of the given symptoms

In [1]:
import pandas as pd
import numpy as np
import xlrd
import operator
from collections import Counter

In [2]:
#Patient Site data
paSi_df = pd.read_excel('/Users/elif/Downloads/OneDrive_1_10-29-2020/patient-site_lableled.xlsx', sheet_name='in', usecols="A,B,C")
# PatientsLikeMe site data
plm_df = pd.read_excel('/Users/elif/Downloads/OneDrive_1_10-29-2020/plm_dataset_labeled.xlsx', sheet_name='plm_dataset', usecols="A,B,C")

In [3]:
#Combine both colloquial dataframes into one
combined_colloquial_df = pd.concat([paSi_df,plm_df],ignore_index=True)

Use the `colloq_data_symp_count` module to get a dictionary of the tagged symptoms and the number of times each one appears in the tagged dataframe.

In [4]:
import Colloq_data_symp_count as coll_count
plm_sym_dict = coll_count.symAndCount(plm_df)
paSi_sym_dict = coll_count.symAndCount(paSi_df)
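
The internals of `symAndCount` are not shown in this notebook. As a rough sketch of what such a counter might do, the example below assumes that each row holds one word and one tag, that `BSYM` marks the first word of a symptom and `ISYM` a continuation word (the tags that appear in the `Token` column output later in this notebook), and that any other tag means the word is not part of a symptom. The function name `count_tagged_symptoms`, the `O` tag, and the sample words are all hypothetical.

```python
from collections import Counter


def count_tagged_symptoms(words, tags):
    """Count symptom phrases in parallel sequences of words and tags.

    Assumption: 'BSYM' starts a symptom, 'ISYM' continues it, and every
    other tag (written 'O' here) is outside a symptom.
    """
    counts = Counter()
    current = []  # words of the symptom phrase being assembled
    for word, tag in zip(words, tags):
        if tag == "BSYM":
            if current:  # a new symptom starts right after another one
                counts[" ".join(current)] += 1
            current = [str(word).lower()]
        elif tag == "ISYM" and current:
            current.append(str(word).lower())
        else:
            if current:  # the symptom phrase has ended
                counts[" ".join(current)] += 1
            current = []
    if current:  # flush a symptom that ends the sequence
        counts[" ".join(current)] += 1
    return dict(counts)


# Made-up tokens and tags, for illustration only
words = ["I", "had", "sore", "throat", "and", "fever", ",", "then", "fever"]
tags = ["O", "O", "BSYM", "ISYM", "O", "BSYM", "O", "O", "BSYM"]
print(count_tagged_symptoms(words, tags))  # {'sore throat': 1, 'fever': 2}
```

With a dataframe, the two sequences could be passed as columns, for example `count_tagged_symptoms(df["Word"], df["Tag"])`, where the column names are placeholders for whatever the labeled files actually use.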

Full dictionary of symptoms collected from both tagged data files, which cover July 2020 through November 2020.

In [5]:
# Merge the two symptom dictionaries, summing the counts of shared symptoms
new_dict = dict(Counter(plm_sym_dict) + Counter(paSi_sym_dict))

# Sort the merged symptoms by count, largest first
sorted_d = dict(sorted(new_dict.items(), key=operator.itemgetter(1), reverse=True))
# sorted_d
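
A small worked example of the merge above, using made-up counts (not values from the data sets). Adding two `Counter` objects sums the counts of shared keys; note that `Counter` addition keeps only keys whose resulting count is positive. `Counter.most_common()` yields the same descending order as sorting with `operator.itemgetter(1)`.

```python
import operator
from collections import Counter

# Made-up counts, for illustration only
site_a = {"cough": 3, "fever": 1}
site_b = {"cough": 2, "fatigue": 4}

# Sum the counts of shared symptoms and keep the rest
merged = dict(Counter(site_a) + Counter(site_b))

# Sort by count, largest first
by_count = dict(sorted(merged.items(), key=operator.itemgetter(1), reverse=True))
print(by_count)  # {'cough': 5, 'fatigue': 4, 'fever': 1}

# Same ordering via Counter.most_common()
print(Counter(merged).most_common())  # [('cough', 5), ('fatigue', 4), ('fever', 1)]
```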


Create a dataframe of the symptoms and their counts as well as a full list of symptoms that are collected from both colloquial datasets.

In [6]:
# Create a dataframe of the combined symptoms and how often each appears in the tagged colloquial data
df_sym_freq = pd.DataFrame(list(new_dict.items()), columns=['Symptom', 'Frequency'])

# Write the combined symptom frequencies from both colloquial datasets to CSV
df_sym_freq.to_csv('sym_freq.csv', index=False)

# List of symptoms from the colloquial datasets
colloq_data_symps = list(new_dict.keys())

In [7]:
#Wordcloud of symptoms

# wc = WordCloud(background_color = "black", width = 1000, height= 1000).generate_from_frequencies(new_dict)
# fig = plt.figure(figsize = (15,15))
# plt.imshow(wc, interpolation = "bilinear")
# plt.axis("off")
# plt.show()

In [8]:
df_sym_freq.groupby('Symptom').sum().sort_values(['Frequency'],ascending=False).head()

Symptom,Frequency
cough,41
fever,32
sore throat,16
fatigue,15
headache,15


In [9]:
# colloq_data_symps
print('Total number of symptoms retrieved from both data sets is:', len(colloq_data_symps))


Total number of symptoms retrieved from both data sets is: 400


### Create a version of the COLL-DATA that only contains *sentences* with symptom terms

Use the `colloq_tagged_data_processing` module to process the tagged colloquial dataframes and find the sentences that contain symptoms.

In [10]:
import colloq_tagged_data_processing as coll_process

In [11]:
sent_value, colloquial_df = coll_process.colloquial_data_processing(combined_colloquial_df, colloq_data_symps)
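
The body of `colloquial_data_processing` is not shown here. The sketch below illustrates steps 5 and 6 from the list at the top of the notebook under stated assumptions: the input is a sequence of `(sentence_id, word)` pairs ordered by sentence, and a symptom counts as present when it matches case-insensitively on word boundaries. The function name `sentences_with_symptoms` and the sample rows are hypothetical.

```python
import re
from itertools import groupby


def sentences_with_symptoms(rows, symptoms):
    """Return (all_sentences, sentences_containing_a_symptom).

    rows: iterable of (sentence_id, word) pairs, already ordered by sentence.
    symptoms: iterable of symptom strings, possibly multi-word.
    """
    # Join each sentence's words back into one string
    sentences = [
        " ".join(str(word) for _, word in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]
    # Drop sentences with one character or fewer
    sentences = [s for s in sentences if len(s) > 1]

    # Longest symptoms first so multi-word phrases are tried before their parts
    ordered = sorted((s for s in symptoms if s), key=len, reverse=True)
    if not ordered:
        return sentences, []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b",
        re.IGNORECASE,
    )
    matched = [s for s in sentences if pattern.search(s)]
    return sentences, matched


# Made-up rows, for illustration only
rows = [
    (1, "I"), (1, "have"), (1, "a"), (1, "Sore"), (1, "throat"), (1, "."),
    (2, "."),
    (3, "Feeling"), (3, "fine"), (3, "today"),
]
all_sents, hits = sentences_with_symptoms(rows, ["sore throat", "fever"])
print(len(all_sents), hits)  # 2 ['I have a Sore throat .']
```

Returning both lists mirrors the two values unpacked in the cell above, so the total and the matching counts can be compared.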

In [12]:
def colloqual_df_info(sent, df):
    print('Total number of sentences in the combined colloquial file is:', len(sent))
    print('Number of sentences that include any of the provided symptoms is:', len(df))
    print('Number of sentences that do NOT include any of the provided symptoms is:', (len(sent) - len(df)))

colloqual_df_info(sent_value, colloquial_df)

Total number of sentences in the combined colloquial file is: 1384
Number of sentences that include any of the provided symptoms is: 469
Number of sentences that do NOT include any of the provided symptoms is: 915


## Put the colloquial data sentences into tagged format

Tagging was done on the combined dataframe from both sites, and the symptoms used for tagging this dataset were those extracted from the colloquial files.

In [13]:
colloquial_df.reset_index(drop=True, inplace=True)

import split_text
plm_colloq_df = split_text.sentence_w_symptoms(colloquial_df, colloq_data_symps)
plm_colloq_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 469 entries, 0 to 468
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  469 non-null    object
dtypes: object(1)
memory usage: 7.3+ KB


In [18]:
import symp_search
colloq_df = symp_search.symptom_search(plm_colloq_df, colloq_data_symps)
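
The implementation of `symp_search.symptom_search` is not part of this notebook. Judging from the `Token` column in the `colloq_df` output, symptom words are replaced by `BSYM` (first word of a symptom) and `ISYM` (each following word). The sketch below shows one way to do that replacement, assuming whitespace tokenization and case-insensitive exact token matching; `tag_symptom_tokens` is a hypothetical name and the sample sentence is made up.

```python
def tag_symptom_tokens(sentence, symptoms):
    """Replace symptom words in a space-separated sentence with BSYM/ISYM."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    # Tokenized symptoms, longest first so multi-word phrases win
    phrases = sorted(
        (s.lower().split() for s in symptoms if s.strip()),
        key=len,
        reverse=True,
    )
    out = []
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if lowered[i:i + n] == phrase:
                # First word becomes BSYM, the remaining words ISYM
                out.extend(["BSYM"] + ["ISYM"] * (n - 1))
                i += n
                break
        else:  # no symptom starts at position i
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tagged = tag_symptom_tokens(
    "I ' ve had severe neuropathy and a cough",
    ["severe neuropathy", "cough"],
)
print(tagged)  # I ' ve had BSYM ISYM and a BSYM
```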


In [19]:
import symptom_tagging
#df5 = symptom_tagging.tokenize_sentences(plm_colloq_df)
plm_peS_df_out= symptom_tagging.remove_duplicate_sentence_ids(symptom_tagging.tokenize_sentences(colloq_df))

AttributeError: module 'regex' has no attribute 'compile'
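
The traceback for this error is not included, so its cause cannot be determined from the notebook alone. One common cause of a "module has no attribute" error for an installed package is a local file or folder with the same name (here, `regex.py` or `regex/`) shadowing the real package. The sketch below is a diagnostic suggestion, not part of the original pipeline: it reports where Python would load a module from. `module_origin` is a hypothetical helper.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would import `name` from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # the module is not importable at all
    return spec.origin


# A path inside the project folder rather than site-packages would suggest
# that a local file is shadowing the installed package.
print(module_origin("regex"))
```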

In [20]:
colloq_df

,Sentence,Token,Sentence_ID
0,I ' ve had 2 sets of prohylactic antibiotics t...,I ' ve had 2 sets of prohylactic antibiotics t...,Sentence #1
1,I ' ve had severe neuropathy during the illnes...,I ' ve had BSYM ISYM during the illness and so...,Sentence #2
2,I went to hospital Day 14 after a week of seri...,I went to hospital Day 14 after a week of seri...,Sentence #3
3,I was prescribed a 2 nd course of antibiotics ...,I was prescribed a 2 nd course of antibiotics ...,Sentence #4
4,"But ' better ' is still bed ridden , just mean...","But ' better ' is still BSYM ISYM , just means...",Sentence #5
...,...,...,...
464,I also gargled regularly with mouthwash with a...,I also gargled regularly with mouthwash with a...,Sentence #465
465,I recently had respiratory failure and aspirat...,I recently had BSYM ISYM and aspiration pneumo...,Sentence #466
466,I ' m still currently going through data and t...,I ' m still currently going through data and t...,Sentence #467
467,Since this virus attacks violently and has a h...,Since this virus attacks violently and has a h...,Sentence #468
