# Data Cleaning
This notebook cleans the collected article data in two steps:
* Convert special-character encodings (for example, `\xa0` and HTML entities) to plain text
* Remove irrelevant and unwanted content from the corpus

In [1]:
import json
import os
import re
import unicodedata  # normalize Unicode compatibility characters (e.g. '\xa0') to plain equivalents

### Code testing on data cleaning automation

**Load raw data**

In [4]:
# list all files in directory
folder = '/Users/dahliama/Desktop/SJSU/data298/github/projectmarley/data-collect/petmd/adultdog'
filenames = os.listdir(folder)
filenames

['nutrition_how-find-your-dogs-body-condition-score.json',
 'general-health_understanding-how-probiotics-for-dogs-work.json',
 'general-health_best-hunting-dogs.json',
 'general-health_summer-dog-walking-tips.json',
 'general-health_signs-dog-dying.json',
 'symptom_runny-nose-in-dogs.json',
 'digestive_food-allergies-dogs.json',
 'dog_sago-palm-poisoning-dogs.json',
 'behavior_why-do-dogs-howl.json',
 'general-health_side-effects-pet-medications.json',
 'behavior_why-do-dogs-chase-their-tails.json',
 'parasites_how-to-remove-a-tick-from-dog-cat.json',
 'general-health_CBD-for-dogs-what-you-need-to-know.json',
 'general-health_fish-oil-dogs.json',
 'symptoms_why-my-dog-peeing-lot.json',
 'training_balanced-dog-training-vs-positive-reinforcement.json',
 'nutrition_what-to-feed-a-dog-with-diarrhea.json',
 'general-health_imodium-dogs-it-good-idea.json',
 'systemic_pollen-allergies-dogs.json',
 'general-health_dog-in-heat.json',
 'general-health_dog-hospice-and-palliative-care.json',
 'gen

In [5]:
len(filenames)

150

In [33]:
filename = filenames[1]

file = f'{folder}/{filename}'
with open(file, 'r') as file_obj:  # the context manager closes the file after loading
    json_data = json.load(file_obj)

print(f'File: {filename}\n')
json_data

File: general-health_understanding-how-probiotics-for-dogs-work.json



[{'@context': 'https://schema.org',
  '@type': 'BreadcrumbList',
  'itemListElement': [{'@type': 'ListItem',
    'position': 0,
    'name': 'Home',
    'item': 'https://www.petmd.com/'},
   {'@type': 'ListItem',
    'position': 1,
    'name': 'Care & Healthy Living',
    'item': 'https://www.petmd.com/dog/care'}]},
 {'@context': 'https://schema.org',
  '@type': 'Article',
  'headline': 'Understanding How Probiotics for Dogs Work',
  'inLanguage': 'en-US',
  'isFamilyFriendly': 'True',
  'isAccessibleForFree': 'True',
  'articleSection': '',
  'thumbnailUrl': 'https://image.petmd.com/files/styles/863x625/public/2023-04/dog-probiotics.jpg',
  'image': {'@type': 'ImageObject',
   'url': 'https://image.petmd.com/files/styles/863x625/public/2023-04/dog-probiotics.jpg'},
  'author': {'@type': 'Person',
   'name': 'Jennifer Coates, DVM',
   'url': 'https://www.petmd.com/author/dr-jennifer-coates'},
  'keywords': 'Dog, Adult',
  'publisher': {'@type': 'Organization',
   'name': 'PetMD',
   'lo

In [34]:
# extract raw article body
raw_text = json_data[-1]['articleBody']
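
The cell above relies on the article being the last JSON-LD block in the file. A minimal sketch of a lookup by `@type` instead, assuming every file follows the structure shown in the output above; `find_article_body` and `sample_blocks` are hypothetical names used only for illustration:

```python
def find_article_body(json_ld_blocks):
    """Return the articleBody of the first block typed 'Article', or None."""
    for block in json_ld_blocks:
        is_article = isinstance(block, dict) and block.get('@type') == 'Article'
        if is_article and 'articleBody' in block:
            return block['articleBody']
    return None

# hypothetical, shortened stand-in for the loaded json_data
sample_blocks = [
    {'@type': 'BreadcrumbList', 'itemListElement': []},
    {'@type': 'Article', 'headline': 'Example', 'articleBody': 'Example body.'},
]
article_body = find_article_body(sample_blocks)
print(article_body)
```

Returning `None` when no article block is found lets the caller skip a file rather than fail with an `IndexError` or `KeyError`.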

**Normalize Unicode and HTML entities in the raw text**

In [35]:
# normalize compatibility characters (e.g. '\xa0' becomes a regular space),
# then replace the HTML entity for an apostrophe and drop the entity for a double quote
text = unicodedata.normalize("NFKD", raw_text).replace('&#039;',"'").replace('&quot;','')
text

'Here’s a fun fact: A mammal’s intestinal tract contains approximately 10 times more microbes than the number of cells in the rest of their body. This is likely true for dogs, cats, and even people. Is it any wonder why the intestinal microbiome’s role in our health is the focus of such intense research? Nutritional supplements that contain probiotics and prebiotics are popular ways to support a healthy microbiome. Let’s take a look at some common pre- and probiotics and what they do inside a dog’s body. Probiotics for Dogs Probiotics are nutritional supplements containing beneficial living microorganisms that are normally found in the gastrointestinal tract. While we don’t know exactly what each microorganism does, most probiotics for dogs are designed primarily to support digestive health. But they can have other effects as well. Lactobacillus Lactobacillus bacteria are included in many probiotics meant to support intestinal health. They may help restore the intestinal microbiome aft
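
To make the normalization step concrete, here is a small sketch on a made-up string. It also shows `html.unescape` from the standard library as a possible alternative to the chained `.replace()` calls; that swap is an assumption, not what the cell above does, and it turns `&quot;` into a quote character instead of removing it.

```python
import html
import unicodedata

# made-up sample containing a non-breaking space and three HTML entities
sample = 'Dogs\xa0&amp; cats: it&#039;s &quot;fine&quot;'

# NFKD replaces compatibility characters, so '\xa0' becomes a regular space
normalized = unicodedata.normalize('NFKD', sample)

# html.unescape decodes all named and numeric HTML entities in one pass
unescaped = html.unescape(normalized)
print(unescaped)
```

Note that NFKD also splits accented letters into a base letter plus a combining mark, and it leaves curly apostrophes (`’`) untouched, which is why they still appear in the output above.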

**Remove unwanted and irrelevant content in text** <br>
Ref: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences

In [36]:
#------------------------------------------------------------------------------------#
# Define function to split sentences in large text
#------------------------------------------------------------------------------------#

# raw strings keep regex escapes such as \s from being read as (invalid) string escapes
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

In [37]:
# split text into sentences
text_splits = split_into_sentences(text)
text_splits

['Here’s a fun fact: A mammal’s intestinal tract contains approximately 10 times more microbes than the number of cells in the rest of their body.',
 'This is likely true for dogs, cats, and even people.',
 'Is it any wonder why the intestinal microbiome’s role in our health is the focus of such intense research?',
 'Nutritional supplements that contain probiotics and prebiotics are popular ways to support a healthy microbiome.',
 'Let’s take a look at some common pre- and probiotics and what they do inside a dog’s body.',
 'Probiotics for Dogs Probiotics are nutritional supplements containing beneficial living microorganisms that are normally found in the gastrointestinal tract.',
 'While we don’t know exactly what each microorganism does, most probiotics for dogs are designed primarily to support digestive health.',
 'But they can have other effects as well.',
 'Lactobacillus Lactobacillus bacteria are included in many probiotics meant to support intestinal health.',
 'They may help 
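
The splitter works by temporarily replacing periods that do not end a sentence with a `<prd>` marker, tagging the remaining terminators with `<stop>`, and then restoring the protected periods. A reduced sketch of that idea, covering only decimals and a few titles; `split_simple` is a hypothetical helper, not a replacement for `split_into_sentences`:

```python
import re

def split_simple(text):
    """Split text into sentences using the placeholder technique."""
    # protect periods inside decimal numbers, e.g. 2.5
    text = re.sub(r'(\d)\.(\d)', r'\1<prd>\2', text)
    # protect periods after a few common titles
    text = re.sub(r'\b(Dr|Mr|Mrs|Ms)\.', r'\1<prd>', text)
    # every remaining terminator ends a sentence
    text = re.sub(r'([.?!])', r'\1<stop>', text)
    # restore the protected periods
    text = text.replace('<prd>', '.')
    return [s.strip() for s in text.split('<stop>') if s.strip()]

demo = 'Dr. Coates recommends 2.5 cups daily. Is that enough? Yes!'
demo_sentences = split_simple(demo)
print(demo_sentences)
# ['Dr. Coates recommends 2.5 cups daily.', 'Is that enough?', 'Yes!']
```

Headings that have no terminating punctuation, such as "Probiotics for Dogs" in the output above, stay attached to the sentence that follows them under this approach.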

In [29]:
# define a list of words to remove (in lowercase)
remove_words = ['petmd'
                ,'chewy'
                ,'featured image' 
                ,'image credit'
                ,'istock' 
                ,'https://' 
                ,'.com'
                ,'.gov'
                ,'.edu'
                ,'ncbi.'
                ,'nlm.' 
                ,'citations '
                ,'., et al.'
                ,', vol.'
                ,', no.'
                ,', pp.'
                ,'by: '
               ]

In [31]:
text_splits = split_into_sentences(text)

def word_check(sentences, words):
    """Keep only sentences that contain none of the unwanted words (case-insensitive substring match)."""
    lowered = [w.lower() for w in words]
    return [s for s in sentences if not any(w in s.lower() for w in lowered)]

# remove sentences with unwanted content
clean_text = word_check(text_splits, remove_words)
new_text = ' '.join(clean_text)

print(f'Old text contains {len(text_splits)} sentences.\nNew text contains {len(clean_text)} sentences.\n')
print(f'Below is the new text body: \n {new_text}')

Old text contains 58 sentences.
New text contains 56 sentences.

Below is the new text body: 
 What Is a Body Condition Score? Stepping on a scale is an easy way for people to monitor their weight, but routinely weighing your dog can be a hassle. What if there was a simpler method for determining if your dog is too skinny, too overweight, or just right? There is... it’s called a body condition score (BCS). The only tools you need to determine your dog’s body condition score are your hands and your eyes. A 1997 study first described a body condition score system for dogs that is the basis for the ones that are in common use today. How Veterinarians Use Body Condition Scores for Dogs Veterinarians check their patients’ body condition scores almost every time they see them. Monitoring for changes in both a dog’s weight and body condition score provides vets with more information than either measure does by itself. Sometimes a dog’s weight will change for perfectly normal reasons. For exampl
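
Because the filter is a plain substring match, it removes a whole sentence whenever any listed term appears anywhere inside it, including inside unrelated words. A small sketch on made-up sentences; `keep_sentences` is a hypothetical stand-in with the same matching rule:

```python
def keep_sentences(sentences, blocked_terms):
    """Drop every sentence that contains a blocked term as a substring."""
    lowered_terms = [term.lower() for term in blocked_terms]
    return [s for s in sentences
            if not any(term in s.lower() for term in lowered_terms)]

# made-up sentences for illustration
sample_sentences = [
    'Probiotics support digestive health.',
    'Featured image: iStock.com/example',
    'Some dogs prefer chewy treats over crunchy ones.',
]
kept = keep_sentences(sample_sentences, ['istock', 'chewy'])
print(kept)
# ['Probiotics support digestive health.']
```

The third sentence is ordinary article content, yet it is dropped because it contains "chewy". Comparing sentence counts before and after cleaning, as done above, is one way to spot files where this happens often.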

**Export and save cleaned text**

In [32]:
savefolder = '/Users/dahliama/Desktop/SJSU/data298/github/projectmarley/data-preprocess/petmd/adultdog'
savefilename = filename.replace('.json','')

# mode "x" creates the file and raises FileExistsError if it already exists
with open(f"{savefolder}/{savefilename}.txt","x") as file:
    file.write(new_text)
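
The export step assumes the save folder already exists, and mode `"x"` fails if the notebook is re-run. A hedged sketch of a helper that creates the folder and makes overwriting an explicit choice; `save_text` and its `overwrite` parameter are hypothetical, and a temporary directory stands in for the real save location:

```python
import tempfile
from pathlib import Path

def save_text(text, save_dir, stem, overwrite=False):
    """Write text to <save_dir>/<stem>.txt, creating the folder if needed."""
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f'{stem}.txt'
    # 'x' refuses to replace an existing file; 'w' overwrites it
    mode = 'w' if overwrite else 'x'
    with open(out_path, mode, encoding='utf-8') as handle:
        handle.write(text)
    return out_path

with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / 'adultdog'
    saved = save_text('Example cleaned text.', target_dir, 'example-article')
    saved_content = saved.read_text(encoding='utf-8')
    try:
        save_text('Second write.', target_dir, 'example-article')
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True

print(saved_content, second_write_failed)
```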

### Automated text data cleaning

In [45]:
#------------------------------------------------------------------------------------#
# Define function to split sentences in large text
#------------------------------------------------------------------------------------#

# raw strings keep regex escapes such as \s from being read as (invalid) string escapes
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

In [46]:
# define a list of words to remove (in lowercase)
remove_words = ['petmd'
                ,'chewy'
                ,'featured image' 
                ,'image credit'
                ,'istock' 
                ,'https://' 
                ,'.com'
                ,'.gov'
                ,'.edu'
                ,'ncbi.'
                ,'nlm.' 
                ,'citations '
                ,'., et al.'
                ,', vol.'
                ,', no.'
                ,', pp.'
                ,'by: '
               ]

In [47]:
def word_check(sentences, words):
    """Keep only sentences that contain none of the unwanted words (case-insensitive substring match)."""
    lowered = [w.lower() for w in words]
    return [s for s in sentences if not any(w in s.lower() for w in lowered)]

In [58]:
def clean_data(directory, folder, remove_words, savedir):
    '''
    Purpose: to clean all JSON text data files in a given folder
    @params directory: the directory path that contains different folders of data
    @params folder: the individual folder that contains data for a topic
    @params remove_words: a list of words to remove from text body
    @params savedir: the directory path of the save location
    returns: a dictionary where key = filename, value = sentence count after cleaning
             minus sentence count before cleaning (zero or negative)
    '''
    # initiate lists to store filenames and split text lengths for validation
    cleaned_files = []
    old_text_len = []
    new_text_len = []
    
    # get a list of JSON files for a given folder (skips any non-JSON entries)
    files = [f for f in os.listdir(f'{directory}/{folder}') if f.endswith('.json')]
    
    for file in files:
        # open with the full path; a bare filename only resolves in the working directory
        with open(f'{directory}/{folder}/{file}', 'r') as file_obj:
            json_data = json.load(file_obj)
        
        # clean text data
        raw_text = json_data[-1]['articleBody']  # extract raw article body
        text = unicodedata.normalize("NFKD", raw_text).replace('&#039;',"'").replace('&quot;','')  # clean up unicode
        text_splits = split_into_sentences(text)
        clean_text = word_check(text_splits, remove_words)
        new_text = ' '.join(clean_text)
        
        # store text length stats in a local list (not the global `filenames`)
        cleaned_files.append(file)
        old_text_len.append(len(text_splits))
        new_text_len.append(len(clean_text))
        
        # export and save cleaned text data; the output file is closed automatically
        savefilename = file.replace('.json','')

        with open(f"{savedir}/{folder}/{savefilename}.txt","w") as writefile:
            writefile.write(new_text)
        
    # merge outputs for validation of data cleansing
    text_len_diff = [new - old for old, new in zip(old_text_len, new_text_len)]
    output = dict(zip(cleaned_files, text_len_diff))
        
    return output

In [60]:
directory = '/Users/dahliama/Desktop/SJSU/data298/github/projectmarley/data-collect/petmd'
folders = os.listdir(directory)
folders = [folder for folder in folders if '.' not in folder]
folders

['behavior',
 'adultdog',
 'disease-illness-injury',
 'care',
 'puppy',
 'poisoning',
 'seniordog',
 'breed',
 'symptoms',
 'medication',
 'nutrition',
 'allergies']

In [None]:
savedir = '/Users/dahliama/Desktop/SJSU/data298/github/projectmarley/data-preprocess/petmd'
directory = '/Users/dahliama/Desktop/SJSU/data298/github/projectmarley/data-collect/petmd'
folders = os.listdir(directory)
folders = [folder for folder in folders if '.' not in folder]

# clean all data
text_len_validates = []

for folder in folders:
    res = clean_data(directory, folder, remove_words, savedir)
    text_len_validates.append(res)
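
`text_len_validates` ends up as a list of dictionaries, one per folder, mapping each filename to the change in sentence count. A minimal sketch of how those results could be scanned for files that lost many sentences; `summarize_validation`, its `threshold` default, and the sample values are hypothetical:

```python
def summarize_validation(results, threshold=-5):
    """Collect files whose sentence count dropped by at least abs(threshold)."""
    flagged = {}
    for per_folder in results:
        for name, diff in per_folder.items():
            if diff <= threshold:
                flagged[name] = diff
    return flagged

# made-up results in the same shape as text_len_validates
sample_results = [
    {'a.json': -2, 'b.json': -12},
    {'c.json': 0, 'd.json': -5},
]
flagged = summarize_validation(sample_results)
print(flagged)
# {'b.json': -12, 'd.json': -5}
```

Since the per-folder dictionaries carry no folder name, two files with the same name in different folders would overwrite each other in this summary.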