## Scraped Data from the sites

    - Created one CSV file that stores all the text data from both sites
    - Loaded the CSV file and selected the sentences that mention symptoms
    - Output the tagged file for symptoms

In [1]:
import pandas as pd
scraped_data = pd.read_csv('/Users/elif/Downloads/combined_csv-4.csv')


The `df_symptom_count` module counts how often each symptom appears in this dataset.
It is run against a merged symptom list that contains the symptoms taken from Kaggle as well as
the ones we extracted from the tagged colloquial datasets.

The `combined_symptoms_list` module holds that merged list: symptoms gathered from online searches and Kaggle, plus the ones retrieved from the tagged datasets.

In [2]:
import combined_symptoms_list
import df_symptom_count

df_symptom_count.symptoms_df(scraped_data, combined_symptoms_list.combined_symptoms)

Symptoms,Counts
symptoms,146
feeling,63
cough,49
breath,49
taste,42
...,...
lung issues,1
feeling lungs are clogged up,1
pressure in my chest,1
pins and needles in my hands,1


We see that 174 of the symptoms in our merged list appear in the newly scraped file.

The combined list includes 609 symptoms in total.
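
As a rough illustration of how a count table like the one above could be built, here is a minimal sketch. It is not the actual `df_symptom_count` implementation, which is not shown in this notebook. The function name `count_symptoms`, the case-insensitive whole-phrase matching, and the sample data are all assumptions made for the example.

```python
import re
from collections import Counter


def count_symptoms(texts, symptoms):
    """Count how often each symptom phrase occurs across all texts.

    Hypothetical helper: matching is case-insensitive and anchored on word
    boundaries, which may differ from what df_symptom_count actually does.
    """
    counts = Counter()
    for symptom in symptoms:
        pattern = re.compile(r"\b" + re.escape(symptom) + r"\b", re.IGNORECASE)
        for text in texts:
            counts[symptom] += len(pattern.findall(text))
    # Keep only symptoms that appear at least once, most frequent first.
    return [(s, n) for s, n in counts.most_common() if n > 0]


# Made-up sample data, not taken from the scraped file.
sample_texts = [
    "I lost my taste and have a dry cough.",
    "The cough got worse, and I am short of breath.",
]
sample_symptoms = ["cough", "taste", "breath", "fever"]
symptom_counts = count_symptoms(sample_texts, sample_symptoms)
```

Symptoms with a zero count are dropped, which mirrors how only a subset of the merged list shows up in the count table.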


In [3]:
len(combined_symptoms_list.combined_symptoms)


609

### Double-checking that the scraped CSV file contains only one sentence per row

The `scraped_data` CSV file is a merged version of some 50+ individual CSV files.
The reason is that the Agenty scraping tool we used can only save one page of results at a time,
so the comments retrieved from different pages had to be saved into separate CSV files.
At the end we merged them into one CSV file, but for the benefit of our model we go over the text once more to make sure there is only one sentence per row.
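
The merge step itself is not part of this notebook. A minimal sketch of one way to do it with the standard library is shown here, assuming every per-page file has the same header row. The function name `merge_csv_files` and its parameters are hypothetical, and this is not necessarily how the original files were combined.

```python
import csv
import glob
import os


def merge_csv_files(folder, output_path):
    """Append the rows of every CSV in `folder` to a single output CSV.

    Assumes all per-page files share the same header row; the header is
    written once, taken from the first non-empty file. Keep `output_path`
    outside `folder` so the merged file is never picked up as an input.
    """
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    rows_written = 0
    header_written = False
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

The function returns the number of data rows written, which gives a quick sanity check against the row counts of the individual files.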


In [4]:
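# Make sure there is only one sentence per row
import re

sentences = []
# Split on spaces that follow '.' or '?' and precede a capital letter. The
# [^A-Z] check two characters before the punctuation avoids splitting after
# short capitalized abbreviations such as "Dr."
sentence_pattern = r'(?<=[^A-Z].[.?]) +(?=[A-Z])'
for row in scraped_data.itertuples():
    # row[0] is the DataFrame index; row[1] is the first column, which holds the text
    for sentence in re.split(sentence_pattern, row[1]):
        sentences.append((row[0], sentence))

collo_df_new = pd.DataFrame(sentences, columns=['Index', 'Sentence'])

# The original row index is not needed downstream
collo_df_new.drop('Index', axis=1, inplace=True)
print('Number of sentences retrieved from the sites for the past 3 months is: ', len(collo_df_new))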



Number of sentences retrieved from the sites for the past 3 months is:  1292
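
To make the behavior of the split pattern concrete, here is a small check on made-up strings (the sample text is an assumption, not taken from the scraped data). It shows that the pattern splits after `.` and `?`, leaves a short abbreviation such as "Dr." intact, and does not treat `!` as a sentence end.

```python
import re

# Same pattern as in the cell above.
sentence_pattern = r'(?<=[^A-Z].[.?]) +(?=[A-Z])'

text = "I have a cough. It started last week. Dr. Smith said to rest."
parts = re.split(sentence_pattern, text)
# parts == ['I have a cough.', 'It started last week.', 'Dr. Smith said to rest.']

# '!' is not in the [.?] character class, so no split happens after it.
exclaim = re.split(sentence_pattern, "No taste at all! Is this normal? Yes.")
# exclaim == ['No taste at all! Is this normal?', 'Yes.']
```

Sentences that start with a lowercase letter are also left joined to the previous one, since the lookahead requires a capital letter.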


In [5]:
import split_text
colloquial_new_data = split_text.sentence_w_symptoms(collo_df_new, combined_symptoms_list.combined_symptoms)
print('Of those sentences ', len(colloquial_new_data), 'of them include any of the symptoms')


Of those sentences  474 of them include any of the symptoms
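
The `split_text.sentence_w_symptoms` code is not shown here. As a hedged sketch of the general idea, a filter that keeps only sentences mentioning at least one symptom could look like the following. The function name `sentences_with_symptoms`, the matching rules, and the sample data are assumptions.

```python
import re


def sentences_with_symptoms(sentences, symptoms):
    """Return the sentences that mention at least one symptom phrase.

    Hypothetical helper: case-insensitive, whole-phrase matching.
    """
    if not symptoms:
        return []
    # One alternation pattern; longer phrases first so they win over substrings.
    ordered = sorted(symptoms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]


# Made-up sample data.
sample_sentences = [
    "I have a dry cough.",
    "The vaccine update was posted today.",
    "Anyone else lose their taste?",
]
sample_symptoms = ["cough", "taste", "pins and needles in my hands"]
matching = sentences_with_symptoms(sample_sentences, sample_symptoms)
```

Compiling a single pattern up front avoids re-scanning each sentence once per symptom, which matters with a list of several hundred phrases.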


In [6]:
import symp_search
import symptom_tagging
df5 = symp_search.symptom_search(colloquial_new_data, combined_symptoms_list.combined_symptoms)

df6 = symptom_tagging.tokenize_sentences(df5)
tagged_scraped_data = symptom_tagging.remove_duplicate_sentence_ids(df6)

In [7]:
tagged_scraped_data

Unnamed: 0,Sentence_ID,Words,Tag
0,Sentence #1,So,O
1,,I,O
2,,'m,O
3,,experiencing,O
4,,a,O
...,...,...,...
12223,,else,O
12224,,had,O
12225,,that,O
12226,,happen,O
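
For readers unfamiliar with this one-token-per-row layout, here is a minimal sketch of how such rows could be produced. It is not the `symptom_tagging` implementation. The tokenizer, the function name `tag_sentence`, and in particular the `B-SYM` / `I-SYM` tag names are hypothetical, since the output above only shows the `O` tag.

```python
import re

# Simple tokenizer: clitics such as 'm, then words, then single punctuation marks.
TOKEN_PATTERN = r"'\w+|\w+|[^\w\s]"


def tag_sentence(sentence, symptoms, sentence_number):
    """Return (sentence_id, word, tag) rows for one sentence.

    Hypothetical scheme: the first token of a symptom phrase gets 'B-SYM',
    the following tokens get 'I-SYM', and every other token gets 'O'.
    """
    tokens = re.findall(TOKEN_PATTERN, sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Longer phrases first, so "dry cough" is tagged before plain "cough".
    for symptom in sorted(symptoms, key=len, reverse=True):
        phrase = re.findall(TOKEN_PATTERN, symptom.lower())
        n = len(phrase)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # Only tag spans that match and are not already tagged.
            if lowered[start:start + n] == phrase and all(tags[k] == "O" for k in span):
                tags[start] = "B-SYM"
                for k in range(start + 1, start + n):
                    tags[k] = "I-SYM"
    rows = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        # The sentence ID appears only on the first token, as in the table above.
        sentence_id = f"Sentence #{sentence_number}" if i == 0 else ""
        rows.append((sentence_id, token, tag))
    return rows


rows = tag_sentence("So I'm experiencing a dry cough", ["cough", "dry cough"], 1)
```

The rows can be passed to `pd.DataFrame(rows, columns=['Sentence_ID', 'Words', 'Tag'])` to get a table shaped like the one above.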


In [8]:
#Save the tagged file in csv
tagged_scraped_data.to_csv('/Users/elif/Desktop/New_Colloquial_data_Nov5.csv', index = False, encoding= 'utf-8')


## Things to address - Nov 6 Meeting

    - We changed the way we split text into the one-sentence-per-row format, so we now have a different testing set for the MAY papers, which is saved in the MAX&ELIF folder

    - The existing tagged colloquial datasets were split in a different fashion, so their quality might differ from that of the other files (for example, how I've is split, or numbers being split across different cells)

    - Not many posts on those 2 sites were COVID-related; many were just updates on recent changes in the vaccine process

    - Scraped data was saved in different csv files and we merged them into one csv file to capture all the information

    - Some posts on the sites showed an edit date of 2 weeks or months ago, but some of their comments dated back 4-5 months, so it is unclear which part of such a post was actually edited recently

    - The sites do not have actual timestamps on the posts, so I went back 3 months to collect all the information that we have right now.


