## Sentiment Analysis for Whispr

1. import data from google sheets
2. clean dataset and create synthetic variables
3. summarize dataset: how many records per category, reviews over time
4. evaluate sentiment of review, give confidence interval
5. calculate summary insights: average sentiment / subjectivity per item, reviews per item
6. compare against manual evaluation
7. export data to google sheets

In [1]:
#operational packages
import pandas as pd
import numpy as np
import string

#packages for google sheets
import gspread
import pygsheets
from oauth2client.service_account import ServiceAccountCredentials

#plotting
from matplotlib import pyplot as plt
import seaborn as sns

#natural language processing
from textblob import TextBlob
import nltk
from nltk import pos_tag_sents, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords, words

#plot formatting
%matplotlib inline
sns.set_style('darkgrid')
pd.options.display.max_rows = 100

### 1a. Import data from GS using GSpread
- connect to google sheets API
- create spreadsheet and worksheet objects, explore GSpread library
- create dataframe of reviews

In [89]:
# #1 define the scope of your access tokens
# scope = ['https://www.googleapis.com/auth/drive','https://spreadsheets.google.com/feeds']

# #2 after getting oauth2 credentials in a json, obtain an access token from google authorization server
# #by creating serviceaccountcredentials and indicating scope, which controls resources / operations that an
# #access token permits
# creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret2.json', scope)

# #3 log into the google API using oauth2 credentials
# #returns gspread.Client instance
# c = gspread.authorize(creds)

spreadsheet = c.open('UK Sentiment')
worksheet = spreadsheet.worksheet('WHotel_Sentiment')
records = worksheet.get_all_records()
df = pd.DataFrame(records)
df = df[['Contents','Sentiment','Topic','Location','Comment']]

### 1b. Import data from GS using pygsheets

In [88]:
# #authorization in one step - read client_secret
# gc = pygsheets.authorize(service_file='client_secret.json')
# spreadsheet2 = gc.open('UK Sentiment')

# #clean up workbook 
# for item in spreadsheet2.worksheets():
#     title = item.title
#     if item.title not in ['UK_Reviews','WHotel_Sentiment','WHOTELS_analyzed']:
#         worksheet2 = spreadsheet2.worksheet_by_title(str(item.title))
#         spreadsheet2.del_worksheet(worksheet2)
#         print('{} sheet deleted'.format(item.title))
        
# worksheet2 = spreadsheet2.worksheet_by_title('WHotel_Sentiment')
# records2 = worksheet2.get_all_records()
# df2 = pd.DataFrame(records2)
# df2 = df2[['Contents','Sentiment','Topic','Location','Comment']]

#get data for kind bars
kindbar = spreadsheet.worksheet('UK_Reviews')
kindrecords = kindbar.get_all_records()
kind_df = pd.DataFrame(kindrecords)
kind_df = kind_df[['review_rating','Review','review_headline','Product (Taste/Experience)']]

kind_df.head()

Unnamed: 0,review_rating,Review,review_headline,Product (Taste/Experience)
0,5.0 out of 5 stars,"I really like these bars, and so do the other ...",A very tasty and well-balanced treat,1
1,5.0 out of 5 stars,I purchased these because I’m on the 16:8 IF d...,Great size snack for those of us wanting a hea...,1
2,5.0 out of 5 stars,These are great bars. I find when I'm training...,Price varies a lot !!!,1
3,5.0 out of 5 stars,Not a protein bar but a very health-designed s...,Possibly the best tasting healthiest snack bar...,1
4,5.0 out of 5 stars,So good and actually quite low in sugar all co...,Definitely a bar to try and enjoy,1


### 2. Simple sentiment analysis

In [90]:
#baseline sentiment analysis - use textblob polarity, compare accuracy
df['Sentiment_Category'] = df['Sentiment'].map({1: 'Positive',2:'Neutral',3:'Negative'})

def pos_neg(polarity):
    if polarity >= 0.1:
        return 'Positive'
    if polarity >= 0 and polarity < 0.1:
        return 'Neutral'
    else:
        return 'Negative'

df['Polarity'] = [TextBlob(x).polarity for x in df['Contents']]
df['Subjectivity'] = [TextBlob(x).subjectivity for x in df['Contents']]
df['Textblob_Score'] = df['Polarity'].apply(pos_neg)

df.groupby(['Sentiment_Category','Textblob_Score'])['Polarity'].agg({'mean':np.mean, 'count':len})

is deprecated and will be removed in a future version. Use                 named aggregation instead.

    >>> grouper.agg(name_1=func_1, name_2=func_2)

  app.launch_new_instance()


Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Sentiment_Category,Textblob_Score,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,Negative,-0.229419,11.0
Negative,Neutral,0.003046,72.0
Negative,Positive,0.379133,55.0
Neutral,Neutral,0.028125,1.0
Positive,Negative,-0.4,1.0
Positive,Neutral,0.001145,14.0
Positive,Positive,0.425419,20.0


### 3. KNN Sentiment Analysis
- data cleaning: remove hashtags and extra whitepsaces
- lemmatize contents
- count word frequencies of lemmatized words
- calculate polarity and choose positive / negative words

In [108]:
#function to convert penn POS tags to wordnet
lemmatizer = WordNetLemmatizer()
def nltk2wn(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None        

#function to lemmatize sentence with appropriate pos
def lemmatize_sent(sentence):
    nltk_tagged = pos_tag([x.lower() for x in nltk.word_tokenize(sentence)])
    converted_tags = [(x[0], nltk2wn(x[1])) for x in nltk_tagged]
    lemmatized_sent = []
    for x in converted_tags:
        if x[1] is None:
            lemmatized_sent.append(x[0])
        else:
            lemmatized_sent.append(lemmatizer.lemmatize(x[0], pos = x[1]))                     
    #final_sentence = ' '.join(lemmatized_sent)
    return lemmatized_sent 

#function to find most popular positive and negative words
def count_words(df, lemmatized_col):
    #create three checks: stopwords, punctuation, english
    mystop = stopwords.words('english')
    punctuation = string.punctuation
    englishwords = [x.lower() for x in words.words()]

    #lemmatize words in comments
    allwords = TextBlob(str(df[lemmatized_col].values.tolist())).tokenize()
    print(allwords)
    #create list of lemmatized words
    finalwords = [word for word in allwords if word not in punctuation and word not in mystop and word in englishwords]

    #for lemmatized words, create counts and polarity scores
    counts = {x: finalwords.count(x) for x in finalwords}
    word_df = pd.DataFrame(counts.items(), columns = ['word','count']).sort_values('count', ascending = False)
    word_df['polarity'] = word_df['word'].apply(lambda x: TextBlob(x).polarity)
    positives = word_df[word_df['polarity']>0].sort_values(['count','polarity'], ascending = False)
    negatives = word_df[word_df['polarity']<0].sort_values(['count','polarity'], ascending = False)

    toptenpos=positives.nlargest(10, columns='count').reset_index(drop=True)
    toptenneg=negatives.nlargest(10, columns='count').reset_index(drop=True)
    return toptenpos, toptenneg

#function to create dummies for pos and neg words
def pos_dummies(df, review_col, pos_words, neg_words):
    for word in pos_words.values:
        newcol = 'pos_{}'.format(word[0])
        df[newcol] = [1 if word[0] in x else 0 for x in df[review_col]]
    for word in neg_words.values:
        newcol = 'neg_{}'.format(word[0])
        df[newcol] = [1 if word[0] in x else 0 for x in df[review_col]]
    #df['total'] = (df[df.columns[-20:]]).apply(sum, axis = 1)
    return df

In [127]:
l = kind_df['Lemmatized'].tolist()
test = [item for sublist in l for item in sublist]
set(test)

{'!',
 '%',
 '&',
 "'",
 "''",
 "'d",
 "'kind",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 '(',
 ')',
 ',',
 '-',
 '.',
 '...',
 '.but',
 '.im',
 '.unless',
 '/',
 '1',
 '12',
 '14',
 '140-150',
 '16:8',
 '16g',
 '18',
 '2',
 '2-3',
 '200kcal',
 '203cal',
 '2nd',
 '37',
 '3rd',
 '4',
 '40',
 '40g',
 '5gms',
 '7.3g',
 '75p',
 '81',
 ':',
 ';',
 '?',
 '@',
 '``',
 'a',
 'about',
 'absolutely',
 'accord',
 'account',
 'accurate',
 'across',
 'act',
 'actually',
 'ad',
 'add',
 'added',
 'addict',
 'addictive',
 'advertise',
 'advertised',
 'advertising',
 'advice',
 'affect',
 'after',
 'afternoon',
 'again',
 'again.what',
 'ago',
 'albeit',
 'alike',
 'all',
 'allergy',
 'almon',
 'almond',
 'almonds',
 'almost',
 'along',
 'already',
 'also',
 'alternative',
 'although',
 'always',
 'amaze',
 'amazing',
 'amazon',
 'amazon.do',
 'amount',
 'an',
 'analyze',
 'and',
 'animal',
 'another',
 'any',
 'anybody',
 'anymore',
 'anyone',
 'apple',
 'around',
 'arrive',
 'arrived',
 'artificial',
 

In [121]:
test = [x.items for x in kind_df['Lemmatized']]
test

AttributeError: 'list' object has no attribute 'items'

In [None]:
#lemmatize and tokenize kind bars
kind_df['Lemmatized'] = kind_df['Review'].apply(lemmatize_sent)
kind_pos, kind_neg = count_words(kind_df, 'Lemmatized')

#lemmatize and tokenize whotels
df['Lemmatized'] = df['Contents'].apply(lemmatize_sent)
whotels_pos, whotels_neg = count_words(df, 'Lemmatized')

['[', '[', "'", 'i', "'", ',', "'really", "'", ',', "'like", "'", ',', "'these", "'", ',', "'bar", "'", ',', "'", ',', "'", ',', "'and", "'", ',', "'so", "'", ',', "'do", "'", ',', "'the", "'", ',', "'other", "'", ',', "'member", "'", ',', "'of", "'", ',', "'my", "'", ',', "'family", "'", ',', "'", '.', "'", ',', "'it", "'", ',', '``', "'s", "''", ',', "'the", "'", ',', "'slight", "'", ',', "'saltiness", "'", ',', "'and", "'", ',', "'the", "'", ',', "'dark", "'", ',', "'chocolate", "'", ',', "'that", "'", ',', "'work", "'", ',', "'very", "'", ',', "'well", "'", ',', "'together", "'", ',', "'with", "'", ',', "'the", "'", ',', "'nuts.as", "'", ',', "'sweet", "'", ',', "'snack", "'", ',', "'go", "'", ',', "'", ',', "'", ',', "'", 'i', "'", ',', '``', "'d", "''", ',', "'consider", "'", ',', "'it", "'", ',', "'reasonably", "'", ',', "'healthy", "'", ',', "'in", "'", ',', "'moderation", "'", ',', "'", ',', "'", ',', "'because", "'", ',', "'of", "'", ',', "'the", "'", ',', "'balance", "'", ',



In [107]:
whotels_pos

Unnamed: 0,word,count,polarity


In [102]:
whotels = pos_dummies(df, 'Lemmatized', whotels_pos, whotels_neg)
kindbars = pos_dummies(kind_df, 'Lemmatized',kind_pos, kind_neg)
#kindbars['total']=(kindbars[kindbars.columns[-20:-1]]).apply(sum, axis = 1)

In [104]:
kindbars.head()

Unnamed: 0,review_rating,Review,review_headline,Product (Taste/Experience),Lemmatized
0,5.0 out of 5 stars,"I really like these bars, and so do the other ...",A very tasty and well-balanced treat,1,"[i, really, like, these, bar, ,, and, so, do, ..."
1,5.0 out of 5 stars,I purchased these because I’m on the 16:8 IF d...,Great size snack for those of us wanting a hea...,1,"[i, purchase, these, because, i, ’, m, on, the..."
2,5.0 out of 5 stars,These are great bars. I find when I'm training...,Price varies a lot !!!,1,"[these, be, great, bar, ., i, find, when, i, '..."
3,5.0 out of 5 stars,Not a protein bar but a very health-designed s...,Possibly the best tasting healthiest snack bar...,1,"[not, a, protein, bar, but, a, very, health-de..."
4,5.0 out of 5 stars,So good and actually quite low in sugar all co...,Definitely a bar to try and enjoy,1,"[so, good, and, actually, quite, low, in, suga..."


In [44]:
pos_neg = [column for column in kindbars.columns if column[0:3] == 'pos' or column[0:3] == 'neg']
kindbars[['Tokenized']+pos_neg]
kindbars['total_count'] = sum(kindbars[])

Unnamed: 0,Tokenized,pos_good,pos_love,pos_delicious,pos_great,pos_healthy,pos_sweet,pos_nice,pos_kind,pos_really,...,neg_dark,neg_expensive,neg_hard,neg_little,neg_slightly,neg_long,neg_bad,neg_ill,neg_single,neg_less
0,"[i, really, like, these, bar, ,, and, so, do, ...",1,0,0,0,1,1,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,"[i, purchase, these, because, i, ’, m, on, the...",0,0,1,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,"[these, be, great, bar, ., i, find, when, i, '...",0,0,0,1,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,0
3,"[not, a, protein, bar, but, a, very, health-de...",1,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
4,"[so, good, and, actually, quite, low, in, suga...",1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,"[the, price-, a, little, too, high, .]",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
280,"[these, be, fill, without, too, much, sweetnes...",0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
281,"[super, bar, nice, they, be, not, chocolatey, ...",0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
282,"[this, product, be, a, tasty, nut, snack, .]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
kindbars.head()

Unnamed: 0,review_rating,Review,review_headline,Product (Taste/Experience),Lemmatized,pos_good,pos_love,pos_delicious,pos_great,pos_healthy,...,neg_dark,neg_expensive,neg_hard,neg_little,neg_slightly,neg_long,neg_bad,neg_ill,neg_single,neg_less
0,5.0 out of 5 stars,"I really like these bars, and so do the other ...",A very tasty and well-balanced treat,1,"i really like these bar , and so do the other ...",1,0,0,0,1,...,1,0,0,0,0,0,0,1,0,0
1,5.0 out of 5 stars,I purchased these because I’m on the 16:8 IF d...,Great size snack for those of us wanting a hea...,1,i purchase these because i ’ m on the 16:8 if ...,0,1,1,0,0,...,0,0,0,0,1,1,0,0,0,0
2,5.0 out of 5 stars,These are great bars. I find when I'm training...,Price varies a lot !!!,1,these be great bar . i find when i 'm train an...,0,0,0,1,0,...,1,0,0,0,0,0,0,1,0,0
3,5.0 out of 5 stars,Not a protein bar but a very health-designed s...,Possibly the best tasting healthiest snack bar...,1,not a protein bar but a very health-designed s...,1,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
4,5.0 out of 5 stars,So good and actually quite low in sugar all co...,Definitely a bar to try and enjoy,1,so good and actually quite low in sugar all co...,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [15]:
columns_to_keep = [column for column in whotels.columns if column[:3] == 'pos' or column[:3] == 'neg']
columns_to_keep

final_columns = ['Contents','Sentiment','Polarity'] + columns_to_keep
whotels[final_columns]

Unnamed: 0,Contents,Sentiment,Polarity,pos_love,pos_new,pos_live,pos_beautiful,pos_amazing,pos_fun,pos_first,...,neg_little,neg_parade,neg_past,neg_long,neg_limited,neg_wet,neg_drag,neg_pink,neg_sharp,neg_due
0,What I thought was the weirdest design choice ...,1,0.229401,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,"New day, new sunset 🌅 #wkohsamui #beachlife #h...",1,0.344156,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,#amsterdam #wamsterdam #finertravel #travelpho...,1,0.000000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Best breakfast ever whotels at #whoteldubai 🤩 ...,1,0.766667,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,#그립다😢 #bali #wbali #seminyak,1,0.000000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,...but really you can! 👙 Thank you to dukespir...,3,0.312500,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
170,#Goa #Wgoa #VagatorBeach #nature #photography ...,3,0.675000,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
171,Rule #01- Be healthy . . . #whotel #singapore ...,3,0.500000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
172,Movida night #eventdinner #wbarcelonahotel #ba...,3,0.000000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
