
## Pre-processing of Data and EDA

In this section, we will split our review data into sentences, with the intent of identifying one aspect per sentence. We will use the sentence segmentation of spaCy's `en_core_web_sm` pipeline for this.

In [1]:
# import libraries
import re  # used by decontracted() below

import gensim
import pandas as pd
import spacy
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from spacy.language import Language

# load the small English pipeline used for sentence splitting and parsing
nlp = spacy.load("en_core_web_sm")

In [2]:
df1 = pd.read_csv('../datasets/raw/tripadvisor_fullerton_210403_1449')

In [20]:
df = df1.rename(columns={'Unnamed: 0': 'rev_id'})

In [22]:
df.duplicated().sum()

0

In [23]:
df.drop_duplicates(inplace=True)

In [24]:
df

,rev_id,property,rev_source,rev_date,rev_location,rev_title,rev_content,rev_score,rev_visit_date
0,689620611,fullerton,tripadvisor,2019-07-01,Australia,Relaxing Visit,Our suite overlooked the Singapore River and s...,10.0,2019-07-01
1,689494350,fullerton,tripadvisor,2019-07-01,,Lovely view and friendly crew,Thank you Giri and Ian for this experience at ...,10.0,2019-07-01
2,689448855,fullerton,tripadvisor,2019-07-01,,Birthday celebration,"The restaurant staff (Steve, Pat, Belle,KC, JH...",10.0,2019-07-01
3,689447212,fullerton,tripadvisor,2019-07-01,,Dinner at Jade Restaurant,The food is great! Service is great too! A spe...,10.0,2019-07-01
4,689447097,fullerton,tripadvisor,2019-07-01,Australia,Fall for Singapore at The Fullerton,I fell in love with Singapore at The Fullerton...,10.0,2019-07-01
...,...,...,...,...,...,...,...,...,...
495,661243953,fullerton,tripadvisor,2019-03-01,United Kingdom,Great place,Second visit here.every aspect is excellent. W...,10.0,2019-03-01
496,661154909,fullerton,tripadvisor,2019-03-01,United Kingdom,A wonderful experience,This is a top rate hotel. The Service is both ...,10.0,2019-03-01
497,661115280,fullerton,tripadvisor,2019-03-01,United Kingdom,Leading hotel in Singapore,"When returning to Singapore, on business usual...",10.0,2019-03-01
498,660976489,fullerton,tripadvisor,2019-03-01,Florida,Spectacular Historical Property,On our last trip to Singapore we became intere...,10.0,2019-03-01


We will now break our review content into sentences. Sentences will also be split after the word 'but', because the content before and after 'but' usually expresses contrasting opinions. We will do this by adding a custom component to the spaCy pipeline, ahead of the parser.

In [25]:
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    '''
    Marks the token following each 'but' as a sentence start, so that the content after 'but' becomes a new sentence.
    The match is case-sensitive, and 'but' itself stays at the end of the preceding sentence.
    '''
    for token in doc[:-1]:
        if token.text == "but":
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")

<function __main__.set_custom_boundaries(doc)>
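
The effect of this boundary rule can be illustrated without loading a spaCy model. The sketch below is a plain-Python stand-in (the helper name `split_after_but` is hypothetical) that works on whitespace-separated tokens rather than spaCy tokens; it mimics only the rule itself, not the parser's own sentence splitting.

```python
def split_after_but(tokens):
    """Group tokens into segments, starting a new segment after each 'but'."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == "but":  # same case-sensitive test as the pipeline component
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]


example = "The room was lovely but the service was slow".split()
print(split_after_but(example))
# ['The room was lovely but', 'the service was slow']
```

Note that 'but' stays attached to the first segment, which matches how the component marks the token after 'but' as the start of the new sentence.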

We will also expand contractions (for example, "wasn't" becomes "was not") to assist sentiment analysis algorithms in identifying negation as accurately as possible.

In [26]:
def decontracted(phrase):
    '''
    Function expands contractions in a string into their un-contracted form to assist in sentiment analysis.
    Note that "'s" is always expanded to " is", so possessives such as "hotel's" are changed too.
    Returns a string.
    '''
    phrase = re.sub(r"won't", "will not", phrase)      # won't -> will not
    phrase = re.sub(r"can\'t", "can not", phrase)      # can't -> can not
    phrase = re.sub(r"n\'t", " not", phrase)           # any other n't -> not
    phrase = re.sub(r"\'re", " are", phrase)           # 're -> are
    phrase = re.sub(r"\'s", " is", phrase)             # 's -> is
    phrase = re.sub(r"\'d", " would", phrase)          # 'd -> would
    phrase = re.sub(r"\'ll", " will", phrase)          # 'll -> will
    phrase = re.sub(r"\'t", " not", phrase)            # any remaining 't -> not
    phrase = re.sub(r"\'ve", " have", phrase)          # 've -> have
    phrase = re.sub(r"\'m", " am", phrase)             # 'm -> am
    return phrase
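
To show what these substitutions do in practice, here is a small self-contained sketch. It restates the same rules, in the same order, as a list of pattern/replacement pairs; the names `RULES` and `expand_contractions` are hypothetical and used only for this illustration.

```python
import re

# Same substitutions as decontracted(), applied in the same order.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def expand_contractions(phrase):
    """Apply each substitution in turn and return the expanded string."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


print(expand_contractions("We won't return, the room wasn't clean."))
# We will not return, the room was not clean.
print(expand_contractions("The hotel's pool was closed."))
# The hotel is pool was closed.
```

The second call shows a limitation of this rule-based approach: a possessive "'s" is expanded as if it were "is". The order of the rules also matters, since "won't" and "can't" must be handled before the general "n't" rule.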


Since our data has no labels yet, we will have to add them manually. To reduce the manual labelling effort, we will **bootstrap** the data: models from our next section will continually produce rough annotations, which are then corrected by hand.

**Note:** The first batch of 1000 rows took me 4 hours to annotate; subsequent batches took 2-3 hours each. Bootstrapping of sentiment started even in the first batch.
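
The bootstrapping workflow can be pictured roughly as follows. This is a minimal sketch only: `pre_annotate`, `apply_corrections` and the toy `keyword_suggester` are hypothetical names invented for illustration, and the keyword rule merely stands in for whichever model produces the rough labels.

```python
def pre_annotate(sentences, suggest_label):
    """Attach a suggested label to each sentence for a human to review."""
    return [{"sent_text": s, "suggested": suggest_label(s)} for s in sentences]


def apply_corrections(rows, corrections):
    """Keep the suggestion unless the annotator supplied a correction for that row."""
    return [
        {**row, "sentiment": corrections.get(i, row["suggested"])}
        for i, row in enumerate(rows)
    ]


def keyword_suggester(sentence):
    """Toy stand-in for a real model (hypothetical): looks for two keywords."""
    text = sentence.lower()
    if "great" in text:
        return 1
    if "slow" in text:
        return -1
    return 0


batch = ["The food is great!", "Check-in was slow.", "Not great at all."]
rows = pre_annotate(batch, keyword_suggester)
# The annotator only has to fix rows where the suggestion is wrong (row 2 here).
labelled = apply_corrections(rows, {2: -1})
print([row["sentiment"] for row in labelled])
# [1, -1, -1]
```

The saving comes from the annotator confirming most suggestions and correcting only the wrong ones, instead of labelling every row from scratch.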

Our next function will break review data into sentences for further processing.

In [52]:
def update_df_pp(raw_table):
    '''
    Builds df_pp, a sentence-level table of preprocessed data, from a raw review table.
    Each sentence of each review becomes one row.
    Returns df_pp.
    '''
    pp_columns = [
        'rev_date',     # review date from raw_table
        'rev_id',       # review id from raw_table
        'sent_text',    # sentence text from spaCy .sents
        # 'sent_sw',    # sentence with stopwords removed by spaCy
        'objects',      # list of found dobj, pobj and nsubj
        # 'contains_staff_terms',  # contains words commonly associated with staff, indicating it's talking about service
        # 'contains_names',        # contains names, indicating that it's probably talking about service
        'descriptive',  # adjectives (with any adverb modifiers) -- indicating sentiment
        'vader_neg',    # vader sentiment negative score
        'vader_neu',    # vader sentiment neutral score
        'vader_pos',    # vader sentiment positive score
        'vader_comp',   # vader compound score
        'category',     # category -- will be manually entered for training set
        'sentiment',    # overall sentence sentiment
    ]
    df_pp = pd.DataFrame(columns = pp_columns)
    pp_index = 0
    
#     Index(['rev_id', 'property', 'rev_source', 'rev_date', 'rev_location',
#        'rev_title', 'rev_content', 'rev_score', 'rev_visit_date'],
#       dtype='object')
    
    for index, row in raw_table.iterrows(): # iterate through rows in raw_table
        rev_date = row['rev_date']
        # insert rev_id
        review_id = row['rev_id']
   
        doc = nlp(row['rev_content'])
        #iterate through sentences
        for sentence in doc.sents:
            # insert review id
            df_pp.loc[pp_index, 'rev_id'] = review_id
            df_pp.loc[pp_index, 'rev_date'] = rev_date
            
            # insert sentence
            df_pp.loc[pp_index, 'sent_text'] = sentence.text
                        
            # expand contractions; note that sentence_mod is not used further below
            # (VADER scores the original sentence.text)
            sentence_mod = decontracted(sentence.text)
            
            descriptive_terms = []
            target = []
            contains_names = False
            contains_staff_terms = False

            
            for token in sentence:
                # get objects for reference
                if token.dep_ in ('dobj', 'pobj', 'nsubj'):
                    target.append(token.text)

                # get descriptive terms for reference: each adjective,
                # prefixed with any adverbs that modify it
                if token.pos_ == 'ADJ':
                    prepend = ''
                    for child in token.children:
                        if child.pos_ == 'ADV':
                            prepend += child.text + ' '
                    descriptive_terms.append(prepend + token.text)
                        
            df_pp.loc[pp_index, 'objects'] = ", ".join(target)
            df_pp.loc[pp_index, 'descriptive'] = ", ".join(descriptive_terms)
            
            # VADER sentiment analysis -- used to bootstrap the sentiment labels
            sid = SentimentIntensityAnalyzer()
            ss = sid.polarity_scores(sentence.text)
            df_pp.loc[pp_index, 'vader_neg'] = ss['neg']
            df_pp.loc[pp_index, 'vader_neu'] = ss['neu']
            df_pp.loc[pp_index, 'vader_pos'] = ss['pos']
            df_pp.loc[pp_index, 'vader_comp'] = ss['compound']
            
            # initial overall sentiment is derived from vader_comp
            # note: scores with 0 < compound <= 0.1 match no branch, so 'sentiment' stays NaN for them
            if ss['compound'] > 0.1:
                df_pp.loc[pp_index, 'sentiment'] = 1
            elif ss['compound'] == 0:
                df_pp.loc[pp_index, 'sentiment'] = 0
            elif ss['compound'] < 0:
                df_pp.loc[pp_index, 'sentiment'] = -1
            
            # enter new row for new sentence
            pp_index += 1
    # return full df_pp
    return df_pp
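
In `update_df_pp`, a compound score greater than 0 but not above 0.1 matches none of the three branches, so those sentences receive no initial label. One way to make the mapping exhaustive is sketched below. The helper name `label_from_compound` and the symmetric cutoff of 0.05 are assumptions, not the logic used in this notebook; 0.05 is the threshold commonly suggested in VADER's own documentation.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a label of 1, 0 or -1."""
    if compound >= threshold:
        return 1   # positive
    if compound <= -threshold:
        return -1  # negative
    return 0       # neutral: everything strictly between the two cutoffs


for score in (0.6077, 0.03, 0.0, -0.4):
    print(score, label_from_compound(score))
```

With this mapping every score falls into exactly one class, so no row is left without a starting label before manual correction.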

In [28]:
test_df = update_df_pp(df[0:1000])

In [29]:
test_df.head()

,rev_id,sent_text,objects,descriptive,vader_neg,vader_neu,vader_pos,vader_comp,category,sentiment
0,689620611,Our suite overlooked the Singapore River and s...,"suite, River, we, views",amazing,0.073,0.659,0.268,0.6077,,1
1,689620611,Its located within easy walk of many attractio...,"walk, attractions, others, Island","easy, many, worst",0.153,0.634,0.213,0.1531,,1
2,689620611,Our check-in was swift and friendly.,check,"swift, friendly",0.0,0.444,0.556,0.6124,,1
3,689620611,Staff were welcoming at all times and most spo...,"Staff, times, English, us",most,0.0,0.642,0.358,0.8104,,1
4,689620611,Breakfast was of high caliber and of considera...,"Breakfast, caliber, variety","high, considerable",0.0,1.0,0.0,0.0,,0


In [30]:
test_df.shape

(3006, 10)

In [31]:
test_df['batch_date'] = '210426'

In [32]:
# test_df.to_csv('../testdata/sentencized/test_df_210426.csv')

We will perform EDA and Modeling in our next section.