
### Purpose
Currently used as a shortcut for creating training data.

*However:* A more rigorous method for cleaning the data is needed, ideally a pipeline in which changes to the cleaning steps can lead to better training results.

Use as reference: https://www.safaribooksonline.com/library/view/hands-on-automated-machine/9781788629898/ccac8d45-a703-42b9-992c-d82eafafe94d.xhtml

* Use as an example for text utilities: https://github.com/openai/finetune-transformer-lm/blob/master/text_utils.py



### Pipeline considerations:
* 1) Cleaning
    * Stop words
    * Newlines and tabs
    * Lowercasing
* 2) Normalizing
    * Splitting stemmed words into stem and removed portion
    * Restricting the length of sentences (query and/or response)
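
The considerations above can be sketched without any NLP library. This is a minimal, hypothetical illustration: the tiny `STOP_WORDS` set, the `SUFFIXES` tuple standing in for a real stemmer, and all function names are assumptions made for the example, not part of the notebook's pipeline.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "for", "and"}
# Assumption: a naive suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def clean(text, remove_stop=True):
    """Cleaning: collapse newlines/tabs, lowercase, optionally drop stop words."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    words = text.split(" ") if text else []
    if remove_stop:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def split_stem(word):
    """Normalizing: split a word into (stem, removed suffix) with a naive rule."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""


def normalize(words, max_words=100):
    """Normalizing: truncate to max_words, then emit each stem and its removed part."""
    parts = []
    for word in words[:max_words]:
        stem, suffix = split_stem(word)
        parts.append(stem)
        if suffix:
            parts.append("+" + suffix)
    return parts


tokens = normalize(clean("The swelling\tis\nworse"), max_words=10)
print(tokens)  # ['swell', '+ing', 'worse']
```

Keeping the removed suffix as its own `+suffix` token is one possible reading of "splitting stemmed words into stem and removed portion"; the marker format is an assumption.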

In [2]:
import string
import pickle
punctuations = string.punctuation

import spacy
from spacy.lang.en import English
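# the stop-word list lives in sklearn.feature_extraction.text (the stop_words module was removed in newer scikit-learn releases)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS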
parser = English()

In [5]:
# Load data
with open("../data/query_response_direct.p", "rb") as fp:
    query_response = pickle.load(fp)

## spaCy pipeline

In [113]:
nlp_pipeline_sentencize = spacy.load('en')
nlp = spacy.load('en')

import spacy
import re
class QueryResponseNormalizer:
    """
    Able to initialize parameters for cleansing and normalizing both query and response.
    """
    
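    # Optional parameters control which cleaning and normalizing steps are applied.
    def __init__(self,
                 remove_stop=False,
                 lowercase=True,
                 remove_punct=True,
                 lem_stem=True,
                 num_sent_words=100):
        # optionally drop stop words
        self.remove_stop = remove_stop
        # optionally drop punctuation tokens
        self.remove_punct = remove_punct
        # optionally lowercase all words (stored, but not yet applied in normalize_text)
        self.lowercase = lowercase
        # planned options: None (don't do anything), stem (stem but remove extra) and stem_split (stem and add extra);
        # currently treated as a boolean: True keeps the lemma, False keeps the original token text
        self.lem_stem = lem_stem
        # limits the number of words (not tokens) in a query (stored, but not yet applied)
        self.num_sent_words = num_sent_words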
    
    def newline_replace(self, txt_blob):
        """
        Recursively collapse runs of newlines, then replace each remaining newline with a space.
        """
        txt_blob = txt_blob.replace('\n\n', '\n')
        if '\n\n' in txt_blob:
            # return the recursive result; without the return the method yields None
            return self.newline_replace(txt_blob)
        return txt_blob.replace('\n', ' ')
        
    def newline_join(self, doc):
        """
        Rejoin a hyphenated word that was split across a newline.
        """
        new_doc = []
        for token in doc.split(' '):
            # strip whitespace
            token = token.strip()
            # join the word halves if necessary, else simply remove the newline
            if '\n' in token and '-' in token:
                token = token.replace('\n', '').replace('-', '')
            else:
                token = token.replace('\n',' ')
            if len(token) > 0:
                new_doc.append(token)

        return ' '.join(new_doc).strip()

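    def normalize_text(self, txt_blob):
        """
        Clean a text blob (a single string) and return its tokens as a list of strings.

        Newlines and non-word characters are replaced with spaces before spaCy
        tokenizes the text. Stop words and punctuation are then optionally dropped,
        and each remaining token is optionally replaced by its lemma.
        """
        txt_blob = self.newline_join(txt_blob)
        txt_blob = self.newline_replace(txt_blob)
        regex = r"[\W_]"
        txt_blob = re.sub(regex, " ", txt_blob)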
        txt_blob = ' '.join(txt_blob.split())
        spcy_txt = nlp(txt_blob)
        
        spcy_txt_new = []
        for token in spcy_txt:
            # optionally remove stops
            if self.remove_stop and token.is_stop:
                continue
            elif self.remove_punct and token.is_punct:
                continue
            elif self.lem_stem:
                spcy_txt_new.append(token.lemma_)
            else:
                spcy_txt_new.append(token.text)
        return spcy_txt_new
    
    def sentencizer(self, txt_blob, num_sent=1):
        """
        Return the first num_sent sentences (by default only the first) as a list of strings.

        Input:
            Raw text
        """
        spcy_txt = nlp(txt_blob)
        
        return [sent.text for sent in spcy_txt.sents][:num_sent]
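
As an aside on `newline_join` and `newline_replace`: the same two steps (rejoining words hyphenated across a line break, then turning runs of newlines into a single space) can be sketched with two regular expressions. The function name `flatten_newlines` is hypothetical, and unlike `newline_join` this sketch only removes a hyphen that sits directly before the newline.

```python
import re


def flatten_newlines(text):
    """Rejoin line-break hyphenation, then turn newline runs into one space."""
    # "swell-\ning" -> "swelling"; hyphens elsewhere (e.g. "well-known") are kept.
    text = re.sub(r"-\n+", "", text)
    # Any run of newlines (with surrounding spaces/tabs) becomes a single space.
    text = re.sub(r"[ \t]*\n+[ \t]*", " ", text)
    return text.strip()


print(flatten_newlines("sore throat\n\n\nand swell-\ning of a well-known kind\n"))
# sore throat and swelling of a well-known kind
```

A regex avoids the recursion entirely, at the cost of behaving slightly differently from the class methods for tokens that contain a hyphen away from the line break.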

In [86]:
example = nlp(" I'm like the tree.  I'm like the tree")
for s in example.sents:
    print(s)

 I'm like the tree.  
I'm like the tree


**Comparison**

In [60]:
q,r = query_response[43]
print(q)
print('_______________________________________')
print(r)

So, I've had a mild sore throat for about 4 days. When I woke up yesterday morning the sore throat was worse than any strep I've ever dealt with. On top of that, because of the swelling of my tonsils, I haven't slept(it's now 6:22 am) because I just stop breathing in my sleep and wake up feeling like I've been holding my breath. I took DayQuil yesterday, plenty of acetaminophen and ibuprofen, and phenylephrine HCL hoping that would have some effect but my tonsils are bigger than ever. I can't eat, sleep, or drink anything and the amoxicillin hasn't had any effect(it's only been a day of antibiotics so far).  What are some remedies/other OTC medicines that will reduce the swelling? I've already tried gargling salt water, throat lozenges, tea(couldn't swallow It), ice cubes, and Ice packs on my swollen lymph nodes 
_______________________________________
I'd say that's probably mononucleosis (i'm a last year medical student). Large tonsils. worse than any strep you've ever get, no effect

In [83]:
# test text (first attempt: fails because txt_blob is an argument of normalize_text, not of __init__)
normalizer = QueryResponseNormalizer(txt_blob=q,remove_stop=True,lem_stem=False).normalize_text()
dir(normalizer)

TypeError: __init__() got an unexpected keyword argument 'txt_blob'

In [106]:
# test text
normalizer_q = QueryResponseNormalizer(lowercase=True,remove_stop=True,lem_stem=True)
s_q = normalizer_q.sentencizer(q)
print(s_q)
print(' '.join(normalizer_q.normalize_text(s_q)))
print('--------------------------------------------')
normalizer_r = QueryResponseNormalizer(lowercase=True,remove_stop=True,lem_stem=False)
r_q = normalizer_r.sentencizer(r)
print(r_q)
print(' '.join(normalizer_r.normalize_text(r_q)))

So, I've had a mild sore throat for about 4 days.
so -PRON- ve mild sore throat 4 day
--------------------------------------------
I'd say that's probably mononucleosis (i'm a last year medical student).
I d s probably mononucleosis m year medical student


In [114]:
%%time
query_response_new = []
# build the normalizers once instead of once per pair
normalizer_q = QueryResponseNormalizer(lowercase=True,remove_stop=True,lem_stem=True)
normalizer_r = QueryResponseNormalizer(lowercase=True,remove_stop=True,lem_stem=False)
for (q,r) in query_response[:4000]:
    # skip pairs with an empty query or response
    if (len(str(q).strip().split()) >= 1) and (len(str(r).strip().split()) >= 1):
        # note: sentencizer returns a list of sentences, but normalize_text expects a single string
        s_q = normalizer_q.sentencizer(str(q),num_sent=2)
        s_q = ' '.join(normalizer_q.normalize_text(s_q))

        # use the response normalizer (lem_stem=False) for the response
        s_r = normalizer_r.sentencizer(str(r),num_sent=1)
        s_r = ' '.join(normalizer_r.normalize_text(s_r))
        query_response_new.append((s_q,s_r))

AttributeError: 'list' object has no attribute 'split'

In [116]:
s_q

['Pertinent facts\n\n', '*']
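
The `AttributeError` above follows from a type mismatch: `sentencizer` returns a list of sentence strings, while `normalize_text` calls string methods on its argument. One possible fix, sketched here with stand-in functions so it runs without spaCy, is to join the sentences before normalizing. `first_sentences` and `normalize` are hypothetical simplifications, not the notebook's methods, and the sample pair is made up for illustration.

```python
import re


def first_sentences(text, num_sent=1):
    """Stand-in for sentencizer(): naive split after '.', '!' or '?'; returns a list."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sentences[:num_sent]


def normalize(text):
    """Stand-in for normalize_text(): expects a single string, returns tokens."""
    return re.sub(r"[\W_]", " ", text).split()


# Made-up sample pair, only to exercise the loop.
pairs = [("Pertinent facts\n\n* age 30. More text.", "Rest. Drink fluids.")]
cleaned = []
for q, r in pairs:
    # Join the list of sentences into one string before normalizing.
    s_q = " ".join(first_sentences(str(q), num_sent=2))
    s_r = " ".join(first_sentences(str(r), num_sent=1))
    cleaned.append((" ".join(normalize(s_q)), " ".join(normalize(s_r))))

print(cleaned)  # [('Pertinent facts age 30 More text', 'Rest')]
```

An alternative would be to have `normalize_text` itself accept either a string or a list; joining at the call site keeps the method's contract simple.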

In [111]:
# save data from data creation option 2: Every comment is a question and answer
with open('../data/query_response_direct_one_sentence.p', "wb") as fp:
    pickle.dump(query_response_new, fp)

**Write to file**

In [24]:
with open('seq2seq_examples/fra-eng/all_responses_equal.txt', 'w') as fp:
    fp.write('\n'.join('%s -----+----- %s' % x for x in qa_new))
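
The file written above stores one pair per line, with the two fields separated by ` -----+----- `. A minimal sketch of writing that format and reading it back, assuming neither field contains a newline or the separator itself; the helper names, the sample pair, and the temporary path are hypothetical.

```python
import os
import tempfile

SEPARATOR = " -----+----- "


def write_pairs(path, pairs):
    """Write one 'query SEPARATOR response' line per pair."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("\n".join("%s%s%s" % (q, SEPARATOR, r) for q, r in pairs))


def read_pairs(path):
    """Read the pairs back, splitting each non-empty line on the first separator."""
    with open(path, encoding="utf-8") as fp:
        return [tuple(line.rstrip("\n").split(SEPARATOR, 1)) for line in fp if line.strip()]


# Made-up sample pair, written to a temporary directory.
pairs = [("mild sore throat 4 day", "probably mononucleosis")]
path = os.path.join(tempfile.mkdtemp(), "pairs.txt")
write_pairs(path, pairs)
print(read_pairs(path) == pairs)  # True
```

Because the cleaned text has already had newlines replaced by spaces, the one-pair-per-line assumption holds for normalized data but not for the raw queries.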