<div>
<img src="images/icon_important.jpg" width="50" align="left"/>
</div>
<br>
<br>

### __Important Legal Notice__
By running and editing this Jupyter notebook with the corresponding dataset, you agree that you will not use or store the data for purposes other than participating in Champagne Coding with DNB & Women in Data Science, Oslo. You will delete the data and notebook after the event and will not attempt to identify any of the commenters.

## Sentiment Analysis

#### What is sentiment analysis?

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text, or document, written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positive, neutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely’ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting’ to be labeled as Negative.
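
As a rough illustration of this labeling task (not the method used later in this notebook), the toy sketch below counts hits in two tiny word lists. The word lists and the function name `toy_sentiment_label` are invented for this example only.

```python
import re

# Hypothetical mini-lexicons, invented for this illustration only.
POSITIVE_WORDS = {"lovely", "great", "clean", "friendly"}
NEGATIVE_WORDS = {"disgusting", "dirty", "rude", "awful"}


def toy_sentiment_label(text):
    """Label a text by comparing hits in the two toy word lists."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "Positive"
    if negative_hits > positive_hits:
        return "Negative"
    return "Neutral"


print(toy_sentiment_label("The staff were lovely"))
print(toy_sentiment_label("The shared bathroom was absolutely disgusting"))
```

A real classifier needs far more than word counting, which is why the preparation steps below matter.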

#### The process

<div>
<img style="float: right" 
     src="images/process.png" 
     width="200" />
    
<p> 

Before we can build our sentiment classifier, we need to go through a few steps that prepare our data, ending with the classification itself:
    <li>Tokenization</li>
    <li>Stop Word filtering</li>
    <li>Negation handling</li>
    <li>Stemming</li>
    <li>Classification </li>

</p> 
    
</div>
<br>
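
The preparation steps listed above can be sketched in a few lines of plain Python. This is a deliberately naive illustration: the stop word list, the negation words, the suffix list, and all function names are assumptions made for this example, and a real project would more likely rely on a library such as NLTK.

```python
import re

# Hypothetical, tiny word lists for illustration only.
# Negation words are kept out of the stop words so they survive filtering.
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "to", "and", "it"}
NEGATIONS = {"not", "no", "never"}
SUFFIXES = ("ing", "ed", "ly", "s")  # naive stand-in for a real stemmer


def tokenize(text):
    """Split lowercased text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def mark_negation(tokens):
    """Drop each negation word and prefix the following token with 'not_'."""
    marked, negate = [], False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        marked.append("not_" + token if negate else token)
        negate = False
    return marked


def naive_stem(token):
    """Strip one common suffix while keeping a 'not_' prefix intact."""
    prefix = "not_" if token.startswith("not_") else ""
    word = token[len(prefix):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return prefix + word


def preprocess(text):
    tokens = mark_negation(remove_stop_words(tokenize(text)))
    return [naive_stem(token) for token in tokens]


print(preprocess("The app is not working"))
```

The output of `preprocess` is what a classifier would receive as features in the final step.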




Parts of the code are taken from: https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a

Also, this repository contains very useful code to perform different NLP techniques!
https://github.com/susanli2016/NLP-with-Python

In [1]:
import pandas as pd
from pathlib import Path
current_directory = Path.cwd()
reviews_directory = Path(current_directory, 'reviews')

#### Read data frame

In [3]:
df = pd.read_csv(Path(reviews_directory, 'dnb_reviews-translated.csv'))

#### Clean data frame

In [5]:
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)
df = df.drop(['Unnamed: 0'], axis=1)

#### Polarity

A basic task in sentiment analysis is classifying the __polarity__ of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. We will use TextBlob to calculate the polarity across our reviews.
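
TextBlob reports polarity as a float between -1.0 (most negative) and 1.0 (most positive). A minimal sketch of turning such a score into one of the three labels could look like this; the function name and the `threshold` value of 0.1 are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    The threshold is a hypothetical cut-off; tune it on your own data.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


for score in (0.8, 0.05, -0.6):
    print(score, polarity_to_label(score))
```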

In [6]:
from textblob import TextBlob
from nltk.tokenize import sent_tokenize # for tokenizing into sentences
import statistics

In [7]:
df.sample(5)

| | Name | Date | Review_Score | Review_Text | Language | Review_Eng |
|---|---|---|---|---|---|---|
| 381 | Vilde Lund | March 12, 2019 | 2 | Hvem er den smartingen som har kommet opp med ... | no | Who is the smart, no one has come up with this... |
| 714 | Asbjørn Iversen | December 17, 2015 | 1 | Can no longer log in. | en | Can no longer log in. |
| 696 | Anette Haugland | August 23, 2016 | 3 | Got some challenge to login. Not easy to use | en | Got some challenge to login. Not easy to use |
| 727 | Stig Lien | June 27, 2015 | 1 | A link to mobile website. | en | A link to mobile website. |
| 29 | Alf Jørgensen | April 5, 2019 | 1 | what the heck is this update?! i can not do wh... | en | what the heck is this update?! i can not do wh... |


Let's start by calculating the polarity on the English text.

In [8]:
df['polarity'] = df['Review_Eng'].map(lambda text: TextBlob(text).sentiment.polarity)

Some helper functions for encapsulating the calls.

In [9]:
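def calc_polarity(text):
    """Return the TextBlob polarity of `text`, translating it to English first if needed."""
    text_blob = TextBlob(text)

    # detect_language() and translate() call an external translation service,
    # so they need network access; they are deprecated in newer TextBlob releases.
    if text_blob.detect_language() != 'en':
        text_blob = text_blob.translate(to='en')

    return text_blob.sentiment.polarity

# We average the polarity scores of the individual sentences. Does that make sense? Go ahead and test it :)
def calc_polarity_sentence(text):
    """Return the mean polarity over the sentences of `text`."""
    sentence_polarity = [calc_polarity(sentence) for sentence in sent_tokenize(text)]
    return statistics.mean(sentence_polarity)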
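
One way to probe whether a plain average is sensible is to compare it with a length-weighted average, where long sentences count for more. The sketch below uses made-up polarity scores (not TextBlob output), so only the arithmetic is meaningful.

```python
import statistics

# Hypothetical (sentence, polarity) pairs; the scores are invented for illustration.
scored_sentences = [
    ("Great.", 0.8),
    ("The login fails every single time I try to open the app after the update.", -0.4),
]

# Plain mean: every sentence counts equally, regardless of length.
plain_mean = statistics.mean([score for _, score in scored_sentences])

# Length-weighted mean: each sentence is weighted by its word count.
total_words = sum(len(sentence.split()) for sentence, _ in scored_sentences)
weighted_mean = sum(
    len(sentence.split()) * score for sentence, score in scored_sentences
) / total_words

print("plain mean:", plain_mean)
print("length-weighted mean:", weighted_mean)
```

Here the one-word sentence pulls the plain mean to the positive side, while the weighted mean stays negative, so the two aggregations can disagree even on the sign.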

In [10]:
text1 = 'Hvorfor kan ikke JEG bestemme om jeg vil bruke denne appen på en rootet telefon? Det er jo JEG som tar risikoen ved å gjøre det, ikke dere. Det blir så sinnsykt baklengs at dere skal diktere hva jeg kan og ikke kan med min egen telefon.'
text2 = 'Hvis du har prøved denne appen tidligere og likte den ikke som meg, bør du gi denne en annen sjanse. DnB har oppdatert appen grundig i det siste året og lagt til en del gode funksjoner som var veldig savnet. gir den 4 ut av 5 for det er fremdeles tider hvor widgeten ikke fungerer.'

In [11]:
text_blob = TextBlob(text1)
try:
    print(text_blob.translate(to='en').sentiment)
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('HTTPError')

HTTPError


In [12]:
text_blob = TextBlob(text2)
try:
    print(text_blob.translate(to='en').sentiment)
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('HTTPError')

HTTPError


In [13]:
text = text2 # test with text1 too!

try:
    print('Polarity score for the whole text at once: {}'.format(calc_polarity(text)))
except Exception:
    print('HTTPError')

try:
    # calc_polarity_sentence already splits the text into sentences and averages them
    print('Polarity score for aggregating over sentences: {}'.format(calc_polarity_sentence(text)))
except Exception:
    print('HTTPError')

HTTPError
HTTPError
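
The cells above print `HTTPError` for any failure, which hides what actually went wrong. A small sketch of a wrapper that reports the real exception type and returns `None` instead is shown below. The names `polarity_or_none` and `failing_scorer` are hypothetical; `failing_scorer` only simulates a failed translation request.

```python
def polarity_or_none(text, scorer):
    """Call scorer(text); on failure, report the real error and return None."""
    try:
        return scorer(text)
    except Exception as error:
        print('{}: {}'.format(type(error).__name__, error))
        return None


def failing_scorer(text):
    # Stand-in for a scorer whose translation request fails (illustration only).
    raise ConnectionError('translation service unreachable')


print(polarity_or_none('Hvorfor kan ikke JEG bestemme ...', failing_scorer))
```

In the notebook, `calc_polarity` could be passed as `scorer`. Because the dataset already contains the translated `Review_Eng` column, scoring that column directly avoids the translation call altogether.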


#### Review the reviews classified as positive, neutral, and negative

In [14]:
print('5 random reviews with the most positive sentiment (1) polarity: \n')

cl = df.loc[df.polarity == 1, ['Review_Text']].sample(5).values
for c in cl:
    print(c[0])

5 random reviews with the most positive sentiment (1) polarity: 

dnb er best
The best way to accses dnb
Appen slet litt med barndomsproblemer, men nå er alt som det skal være; innlogging ved bruk av fingeravtrykk funker nå og appen har blitt raskere. Veldig fornøyd med oppsettet og alle funksjonene appen tilbyr.
Best app ever
Superb


In [15]:
print('5 random reviews with the most neutral sentiment (0) polarity: \n')
cl = df.loc[df.polarity == 0, ['Review_Text']].sample(5).values
for c in cl:
    print(c[0])

5 random reviews with the most neutral sentiment (0) polarity: 

Trying to login but the application is not responding.
After the update I even can't login. 😒
it's not optimal. userfriendly, not at the moment.
Crashed...
app doesn't work, can't log in


In [16]:
print('5 random reviews with the most negative sentiment (-1) polarity: \n')
cl = df.loc[df.polarity == -1, ['Review_Text']].sample(5).values
for c in cl:
    print(c[0])

5 random reviews with the most negative sentiment (-1) polarity: 

Worst bank app ever
horrible after update...
En av de dårligste nettbank-appene for mobil.
tidenes verste app. Ukke testet av noen forutenom mossfornøyde brukere
Ghastly AND Awful


## Visualizations

Let's create some visualizations to get a feel for our data. We will start with a histogram of the polarity.

#### Distribution of review sentiment polarity score 

In [17]:
%matplotlib inline

from plotly.offline import iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

df['polarity'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')

#### Distribution of review rating 

In [18]:
df['Review_Score'].iplot(
    kind='hist',
    xTitle='rating',
    linecolor='black',
    yTitle='count',
    title='Review Rating Distribution')

#### 2D Density jointplot of sentiment polarity and rating

In [19]:
import plotly.graph_objs as go

trace1 = go.Scatter(
    x=df['polarity'], y=df['Review_Score'], mode='markers', name='points',
    marker=dict(color='rgb(102,0,0)', size=2, opacity=0.4)
)
trace2 = go.Histogram2dContour(
    x=df['polarity'], y=df['Review_Score'], name='density', ncontours=20,
    colorscale='Hot', reversescale=True, showscale=False
)
trace3 = go.Histogram(
    x=df['polarity'], name='Sentiment polarity density',
    marker=dict(color='rgb(102,0,0)'),
    yaxis='y2'
)
trace4 = go.Histogram(
    y=df['Review_Score'], name='Rating density', marker=dict(color='rgb(102,0,0)'),
    xaxis='x2'
)
data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    showlegend=False,
    autosize=False,
    width=600,
    height=550,
    xaxis=dict(
        domain=[0, 0.85],
        showgrid=False,
        zeroline=False
    ),
    yaxis=dict(
        domain=[0, 0.85],
        showgrid=False,
        zeroline=False
    ),
    margin=dict(
        t=50
    ),
    hovermode='closest',
    bargap=0,
    xaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    ),
    yaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='2dhistogram-2d-density-plot-subplots')
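
To complement the density plot with a single number, one option is the Pearson correlation between polarity and rating. The sketch below implements it by hand on made-up values; the function name `pearson` and the sample lists are assumptions for illustration, not results from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical polarity scores and star ratings, invented for this example.
sample_polarity = [-1.0, -0.5, 0.0, 0.5, 1.0]
sample_rating = [1, 1, 3, 4, 5]

print(pearson(sample_polarity, sample_rating))
```

On the notebook's DataFrame, `df['polarity'].corr(df['Review_Score'])` should compute the same quantity with pandas.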

### Top unigrams after removing stop words

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    """Return the n most frequent words in corpus, ignoring English stop words."""
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)   # document-term count matrix
    sum_words = bag_of_words.sum(axis=0)   # total count per vocabulary term
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['Review_Eng'], 20)
for word, freq in common_words:
    print(word, freq)
df2 = pd.DataFrame(common_words, columns=['Review_Eng', 'count'])

app 602
log 150
login 147
dnb 129
bank 122
time 117
work 113
new 111
update 110
use 108
balance 107
mobile 103
works 98
old 96
just 91
does 89
check 87
version 85
banking 85
account 67
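
To make the counting step less of a black box, here is a minimal pure-Python sketch of n-gram counting with `collections.Counter`. It approximates what the `CountVectorizer` helper does but is not its implementation; the function name `top_n_ngrams` and the two-document corpus are assumptions for illustration.

```python
import re
from collections import Counter


def top_n_ngrams(corpus, ngram=1, n=None, stop_words=frozenset()):
    """Count word n-grams over all documents and return the n most common."""
    counts = Counter()
    for document in corpus:
        # Keep tokens of at least two word characters, then drop stop words.
        tokens = [
            token
            for token in re.findall(r"\b\w\w+\b", document.lower())
            if token not in stop_words
        ]
        counts.update(
            " ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)
        )
    return counts.most_common(n)


toy_corpus = ["The app does not work", "The app works"]  # invented examples
print(top_n_ngrams(toy_corpus, ngram=1, n=3, stop_words={"the"}))
print(top_n_ngrams(toy_corpus, ngram=2, n=3))
```

Changing `ngram` to 3 gives trigram counts of the same kind as the sections that follow.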


### Top trigrams before removing stop words

In [21]:
def get_top_n_trigram(corpus, n=None):
    """Return the n most frequent trigrams in corpus, keeping stop words."""
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['Review_Eng'], 20)
for word, freq in common_words:
    print(word, freq)
df5 = pd.DataFrame(common_words, columns=['Review_Eng', 'count'])

to log in 58
does not work 50
the old app 39
the app is 24
log in with 23
have to log 19
the new update 16
without logging in 16
of the app 15
in the app 15
the previous version 14
to use the 13
the new app 13
one of the 12
you have to 12
the app to 12
this app is 12
in with bankid 12
it is not 11
old app was 11


In [22]:
df5.groupby('Review_Eng').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in review before removing stop words')

### Top trigrams after removing stop words

In [23]:
def get_top_n_trigram(corpus, n=None):
    """Return the n most frequent trigrams in corpus after removing English stop words."""
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['Review_Eng'], 20)
for word, freq in common_words:
    print(word, freq)
df6 = pd.DataFrame(common_words, columns=['Review_Eng', 'count'])

norway largest bank 9
check account balance 8
log online banking 8
bank id time 5
log bank id 5
old app better 5
app doesn work 5
log mobile banking 5
does work anymore 5
online banking app 5
app does work 5
getting error 1010 4
time open app 4
new update app 4
login error retry 4
just check balance 4
app completely useless 3
want use app 3
old app worked 3
app check balances 3
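
Entries such as `app doesn work` and `app does work` look odd. They probably come from two effects: the default `CountVectorizer` token pattern only keeps runs of two or more word characters, so "doesn't" loses its "t", and the stop word list removes "not". The sketch below reproduces both effects with the standard library; `TOY_STOP_WORDS` is a tiny hypothetical subset, not scikit-learn's full list.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w\w+\b"       # CountVectorizer's default token_pattern
TOY_STOP_WORDS = {"not", "the", "is"}  # hypothetical subset for illustration


def tokens_after_filtering(text):
    """Tokenize like the default pattern, then drop the toy stop words."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [token for token in tokens if token not in TOY_STOP_WORDS]


print(tokens_after_filtering("The app doesn't work"))   # ['app', 'doesn', 'work']
print(tokens_after_filtering("The app does not work"))  # ['app', 'does', 'work']
```

The second line shows why negation handling matters: once "not" is removed, a complaint reads like praise.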


In [24]:
df6.groupby('Review_Eng').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in review after removing stop words')

## Now for the fun part - Challenges!

#### 1 - How do the reviews change if we consider only the period since the native MobilApp was launched?
#### 2 - Is there any relationship between the reviews and the release dates of some of the app's features? 
(data will be provided during the coding session)
#### 3 - What are the relationships between topics, releases of features, sentiments and scoring?
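
As a possible starting point for the first challenge, the sketch below parses dates in the format shown in the sample rows above ("March 12, 2019") and keeps only reviews on or after a cut-off. `LAUNCH_DATE` is a hypothetical placeholder, not the real launch date, and the three rows are copied from the sample output.

```python
from datetime import datetime

# Hypothetical placeholder; replace with the real launch date.
LAUNCH_DATE = datetime(2018, 1, 1)


def parse_review_date(text):
    """Parse a date such as 'March 12, 2019' (English month names)."""
    return datetime.strptime(text, "%B %d, %Y")


# (Date, Review_Score) pairs taken from the sample rows shown earlier.
rows = [("March 12, 2019", 2), ("December 17, 2015", 1), ("April 5, 2019", 1)]

recent = [(date, score) for date, score in rows if parse_review_date(date) >= LAUNCH_DATE]
print(recent)
```

On the full DataFrame, `pd.to_datetime(df['Date'], format='%B %d, %Y')` should perform the same conversion, after which the rows can be filtered with a boolean mask.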