## Sentiment Analysis for Whispr

1. import data from google sheets
2. clean dataset and create synthetic variables
3. summarize dataset: how many records per category, reviews over time
4. evaluate sentiment of review, give confidence interval
5. calculate summary insights: average sentiment / subjectivity per item, reviews per item
6. compare against manual evaluation
7. export data to google sheets

In [152]:
import pandas as pd
import numpy as np
import os
from textblob import TextBlob
import gspread
from datetime import datetime
from oauth2client.service_account import ServiceAccountCredentials
from matplotlib import pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords, words
import string


%matplotlib inline
sns.set_style('darkgrid')
pd.options.display.max_rows = 100


### 1. Import data from GS
- connect to google sheets API
- create spreadsheet and worksheet objects, explore GSpread library
- create dataframe of reviews

In [3]:
#1 define the scope of your access tokens
scope = ['https://www.googleapis.com/auth/drive','https://spreadsheets.google.com/feeds']

#2 after getting oauth2 credentials in a json, obtain an access token from google authorization server
#by creating serviceaccountcredentials and indicating scope, which controls resources / operations that an
#access token permits
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)

#3 log into the google API using oauth2 credentials
#returns gspread.Client instance
c = gspread.authorize(creds)

In [12]:
spreadsheet = c.open('UK Sentiment')

worksheet = spreadsheet.worksheet('WHotel_Sentiment')

records = worksheet.get_all_records()
df = pd.DataFrame(records)
df = df[['Contents','Sentiment','Topic','Location','Comment']]

In [16]:
df.head()

Unnamed: 0,Contents,Sentiment,Topic,Location,Comment
0,What I thought was the weirdest design choice ...,1,Design,Washington DC,
1,"New day, new sunset 🌅 #wkohsamui #beachlife #h...",1,Location&View,Koh Samui,
2,#amsterdam #wamsterdam #finertravel #travelpho...,1,Location&View,Amsterdam,
3,Best breakfast ever whotels at #whoteldubai 🤩 ...,1,Restaurant,Dubai,Breakfast
4,#그립다😢 #bali #wbali #seminyak,1,Guest Experience,Bali,


### 2. Data preprocessing

In [181]:
df['Sentiment_Category'] = df['Sentiment'].map({1: 'Positive',2:'Neutral',3:'Negative'})

def pos_neg(polarity):
    if polarity >= 0.1:
        return 'Positive'
    if polarity >= 0 and polarity < 0.1:
        return 'Neutral'
    else:
        return 'Negative'

df['Polarity'] = [TextBlob(x['Contents']).polarity for i, x in df.iterrows()]
df['Subjectivity'] = [TextBlob(x['Contents']).subjectivity for i, x in df.iterrows()]
df['Textblob_Score'] = df['Polarity'].apply(pos_neg)

df.groupby(['Sentiment_Category','Textblob_Score'])['Polarity'].agg({'mean':np.mean, 'count':len})

is deprecated and will be removed in a future version. Use                 named aggregation instead.

    >>> grouper.agg(name_1=func_1, name_2=func_2)

  from ipykernel import kernelapp as app


Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Sentiment_Category,Textblob_Score,Unnamed: 2_level_1,Unnamed: 3_level_1
Negative,Negative,-0.229419,11.0
Negative,Neutral,0.003046,72.0
Negative,Positive,0.379133,55.0
Neutral,Neutral,0.028125,1.0
Positive,Negative,-0.4,1.0
Positive,Neutral,0.001145,14.0
Positive,Positive,0.425419,20.0


## Create sentiment analyzer
- count frequency of meaningful words
- rate positivity and negativity of most frequent words
- create dummy variables for the presence of these words
- use knn model to classify positive and negative

In [178]:
#has to be legitimate english word, no stopwords, no punctuation
#create dictionary of counts of each word
#calculate polarity of each word - identify positive and negative words
#create dummy variables for the presence of these words
#split dataset into test / train, fit knn model
#make predictions using knn

mystop = set(stopwords.words('english'))
punctuation = string.punctuation
englishwords = words.words()

contentblob = [x for x in TextBlob(str(df['Contents'].values.tolist())).tokenize() 
               if x.lower() not in mystop and x not in string.punctuation
              and x in englishwords]
counts = {x: contentblob.count(x) for x in contentblob}

word_df = pd.DataFrame(counts.items(), columns = ['word','count']).sort_values('count', ascending = False)
word_df['polarity'] = word_df['word'].apply(lambda x: TextBlob(x).polarity)

In [180]:
word_df.head(100)

Unnamed: 0,word,count,polarity
49,W,61,0.0
229,de,23,0.0
18,hotel,22,0.0
151,barcelona,17,0.0
145,travel,17,0.0
28,love,13,0.5
231,la,11,0.0
110,party,10,0.0
50,would,10,0.0
172,us,9,0.0
