> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Exercise Set 17: Text as Data 2

*Morning, August 23, 2018*

In this Exercise Set you will practice methods within Information Extraction in Python. 
You will:
* Practice doing look-ups using set operations.
* Implement and compare different lexicon-based methods for sentiment analysis.
* Play with the output from a Word2Vec model and a Topic Model, both trained on 4 million reviews from the TrustPilot Review dataset that we practiced scraping.

## Exercise Section 17.1: Look-ups and Dictionary Methods
In this exercise you will practice using curated lexicons to extract knowledge from text.

First we load the dataset. Again we use the Review Data Set. Load it by running the following:
```python
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv')
```


> **Ex 17.1.1:**
Define two concepts you want to measure in the reviews, and curate a list of words expressing each concept, e.g. words related to Travelling or Computers, or, for the bold, words that indicate Trolling.
* Convert the two lists into sets, and assign them to variables of your choice.

These will be the lexicons that you want to match up with the documents.


In [None]:
#[Answer 17.1.1]

In [1]:
#SOLUTION
import pandas as pd
import nltk
df = pd.read_csv('https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv')
dictionaries = {'trolling':set(['lol','haha','fuck','fuuck']),
               'thankful':set(['thank','wonderful','love'])}


> **Ex 17.1.2:**
Now we design a simple preprocessing function that:
* first tokenizes the string using the `nltk.word_tokenize` function,
* and then converts each token to lowercase, using a list comprehension.

In [None]:
#[Answer 17.1.2]

In [2]:
#SOLUTION
# tokenize documents
def preprocess(document):
    tokenized = nltk.word_tokenize(document)
    # lower all tokens
    lowered = [i.lower() for i in tokenized]
    return lowered

> **Ex. 17.1.3:**
* Now we apply the preprocessing function to all of our documents, assigning the result to a variable: `tokenized_docs`.
* Secondly we convert all of the tokenized docs into sets, by looping through the documents and applying the `set()` command.

In [None]:
#[Answer 17.1.3]

In [3]:
# SOLUTION
tokenized_docs = df.reviewBody.apply(preprocess).values
# convert each document to a set of tokens instead of list of tokens
doc_sets = [set(doc) for doc in tokenized_docs]

> **Ex 17.1.4:** Now we shall find the overlap between our curated lexicons and each document set.
We do this by defining a container named `overlap`. 
Then we run through all document sets:
* Take the length of the overlap between the document set and the curated lexicon.
* Append the length to the `overlap` container.
* Finally, assign the overlap values to a new column in the dataframe.

HINT: Overlaps between sets are found using the `&` operator, and the length you get from the `len()` builtin function.
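
As a small illustration of the `&` and `len()` hint, the sketch below counts lexicon hits for a few toy documents. The documents and the lexicon are invented for the example and are not taken from the review data.

```python
# Toy data, invented for illustration only
toy_lexicon = {'thank', 'wonderful', 'love'}
toy_doc_sets = [
    {'i', 'love', 'this', 'shop'},
    {'terrible', 'service'},
    {'thank', 'you', 'wonderful', 'staff'},
]

overlap = []  # container for the overlap sizes
for doc_set in toy_doc_sets:
    shared = doc_set & toy_lexicon  # set intersection
    overlap.append(len(shared))     # number of lexicon words found in the document

print(overlap)  # [1, 0, 2]
```

Because each document is a set, a word that occurs several times in a review is only counted once.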

In [None]:
#[Answer 17.1.4]

In [4]:
# SOLUTION
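for key,lexicon in dictionaries.items(): # iterate through my curated lexicons.
    # True if the document shares at least one word with the lexicon (drop `>0` to store the count itself)
    df[key] = [len(s&lexicon)>0 for s in doc_sets] # take the overlap using a list comprehension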

## Exercise section 17.2 Lexical Based Sentiment Analysis using Dictionaries
Here I want you to test 4 different lexicon-based approaches to sentiment analysis on the review dataset.

* You will compare each document to the `nltk.corpus.opinion_lexicon` built into nltk.
* You will try the Afinn package: `pip install afinn`
* And finally you will compare the rule-based version of the VADER sentiment analyser (Valence Aware Dictionary and sEntiment Reasoner; "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", Hutto and Gilbert 2014) to a simple lookup version. 

This means comparing 4 different lexicon-based sentiment analyzers. For two of them you will use your `set` operations to check the overlap between documents and a word set. The other two come prepackaged with builtin methods.


> **17.2.1:** First we need to get our curated lists of words with strong signals of sentiment. One wordlist you will need for this exercise is in `nltk.corpus.opinion_lexicon`. The other is the wordlist used in the VADER method, which you will download from GitHub using the following link: https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt
* First assign `nltk.corpus.opinion_lexicon` to the variable `lexicon_1`.
* Next run the following command to parse the VADER lexicon from GitHub:
```python
import pandas as pd
vader_df = pd.read_csv('https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt'
            ,sep='\t',header=None) # changing the separator to tab, and specifying no header
vader_df.columns = ['token','average_score','variance','ratings'] # adding the header
```
* To get sets of positive and negative words, we extract them from the dataframe by filtering on the `average_score` column. Take the tokens where the average score is less than 0 and assign them to a variable `vader_negative`, and do the opposite for a variable `vader_positive`. Remember to convert them to sets. 
* Define a dictionary for each of the two curated lexicons, looking like this: 
```python
opinion_lexicon = {'positive':positive_words,'negative':negative_words}
vader_lexicon = {'positive':vader_positive,'negative':vader_negative}
``` 
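
The filtering step can also be sketched without downloading anything. The example below uses the standard-library `csv` module instead of pandas and assumes the same four tab-separated fields as above (token, average score, variance, ratings). The three rows are invented for illustration and are not real lexicon entries.

```python
import csv
import io

# Three made-up rows in the tab-separated layout described above
toy_lexicon_text = (
    "great\t3.1\t0.8\t[3, 4, 3]\n"
    "awful\t-2.5\t0.9\t[-3, -2, -3]\n"
    "okay\t0.0\t0.5\t[0, 0, 0]\n"
)

rows = list(csv.reader(io.StringIO(toy_lexicon_text), delimiter='\t'))

# keep the tokens whose average score is above (or below) zero
toy_positive = {token for token, score, _, _ in rows if float(score) > 0}
toy_negative = {token for token, score, _, _ in rows if float(score) < 0}

toy_vader_lexicon = {'positive': toy_positive, 'negative': toy_negative}
print(toy_vader_lexicon)
```

Tokens with an average score of exactly 0 end up in neither set, which matches the strict `<` and `>` filters described in the exercise.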


In [5]:
import nltk
# get the built in lexicon
positive_words = set(nltk.corpus.opinion_lexicon.positive())
negative_words = set(nltk.corpus.opinion_lexicon.negative())
# get the raw vader lexicon
import pandas as pd
vader_df = pd.read_csv('https://raw.githubusercontent.com/cjhutto/vaderSentiment/master/vaderSentiment/vader_lexicon.txt'
            ,sep='\t',header=None) # changing the separator to tab, and specifying no header
vader_df.columns = ['token','average_score','variance','ratings'] # adding the header
# extract positive and negative words by filtering on the average score column
vader_positive = set(vader_df[vader_df['average_score']>0].token.values)
vader_negative = set(vader_df[vader_df['average_score']<0].token.values)
# define two dictionaries
vader_lexicon = {'positive':vader_positive,'negative':vader_negative}
opinion_lexicon = {'positive':positive_words,'negative':negative_words}

print('Vader has %d negative words and %d positive VS. the opinion lexicon %d and %d'%(len(vader_negative),
                                                                                         len(vader_positive),
                                                                                         len(negative_words),
                                                                                         len(positive_words)))

Vader has 4170 negative words and 3335 positive VS. the opinion lexicon 4783 and 2006


>**Ex. 17.2.3:** Scoring a document using the dictionary.
Now we write a function that takes in a document and a dictionary containing negative and positive words. The function will tokenize the document and return a sentiment score based on the overlapping words.

* First you apply the `preprocess` function you created in exercise 17.1 to the document.
* Then you keep only the words in the document that are in the positive word set and take the length of the resulting list. 
* You do the same with the negative word set. 
* Finally you calculate a polarity score by subtracting the negative count from the positive count and dividing the result by the length of the document: `(pos-neg)/len(doc)`    
(Hint: Filter like this: `[w for w in doc if w in pos]`)

Wrap the above in a function called `apply_sentiment_dictionary`.
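
To make the formula concrete, here is a worked toy example. It assumes a plain `str.split()` tokenizer in place of `preprocess`, and the two word sets are invented. The guard against empty documents is also an addition for the sketch. Note the parentheses around `pos - neg`: without them only `neg` would be divided by the document length.

```python
# Invented word sets, for illustration only
toy_dictionary = {'positive': {'good', 'great'}, 'negative': {'bad', 'slow'}}

def toy_polarity(text, dictionary):
    # stand-in tokenizer: lower-case and split on whitespace
    doc = text.lower().split()
    if not doc:  # avoid dividing by zero on empty documents
        return 0.0
    pos = len([w for w in doc if w in dictionary['positive']])
    neg = len([w for w in doc if w in dictionary['negative']])
    return (pos - neg) / len(doc)

# 8 tokens, 1 positive ('great') and 2 negative ('slow', 'bad'): (1 - 2) / 8
print(toy_polarity('Great product but slow delivery and bad support', toy_dictionary))  # -0.125
```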

In [None]:
#[Answer 17.2.3]

In [6]:
# SOLUTION
def apply_sentiment_dictionary(doc,dictionary):
    doc = preprocess(doc)
    pos = len([w for w in doc if w in dictionary['positive']])
    neg = len([w for w in doc if w in dictionary['negative']])
    polarity = (pos-neg)/len(doc)
    return polarity



> **Ex. 17.2.4:** Make two new columns in the dataframe, 'opinion' and 'vader_raw', by applying the function with their respective dictionaries as input. This means you will have to give the `.apply` method an extra argument: `.apply(apply_sentiment_dictionary,args=(vader_lexicon,))`

In [None]:
#[Answer Ex. 17.2.4]

In [7]:
# SOLUTION
df['vader_raw'] = df.reviewBody.apply(apply_sentiment_dictionary,args=(vader_lexicon,))
df['opinion'] = df.reviewBody.apply(apply_sentiment_dictionary,args=(opinion_lexicon,))

>**Ex.17.2.5:** Applying the prepackaged methods.
* Figure out how to apply the Afinn method here: https://github.com/fnielsen/afinn
* Compute the Afinn score for each document and store it in a column 'afinn_score'. 

In [None]:
#[Answer Ex.17.2.5]

In [8]:
#SOLUTION
from afinn import Afinn
afinn = Afinn()
df['afinn_score'] = [afinn.score(doc) for doc in df.reviewBody]

>**Ex.17.2.6:** Applying the prepackaged methods (2).
The VADER analyzer is run by initializing `analyzer = nltk.sentiment.SentimentIntensityAnalyzer()` and then calling its builtin method `.polarity_scores(string)`. The method returns a dictionary with more than one score; we will only use the 'compound' value.
* Apply the `.polarity_scores` method to each document and extract the 'compound' value from the dictionary output of the sentiment analyzer. Define a new column called 'vader_compound' in the dataframe.

In [None]:
#[Answer 17.2.6]

In [9]:
# SOLUTION
import nltk.sentiment
analyzer = nltk.sentiment.SentimentIntensityAnalyzer()
vader_scores = [analyzer.polarity_scores(doc) for doc in df.reviewBody]
for key in ['compound','neg','neu','pos']:
    df['vader_%s'%key] = [i[key] for i in vader_scores]



> **Ex. 17.2.7:** Comparing the performance of the Sentiment Analyzers.
There is no definitive way to evaluate the scores, since we do not have a human-labelled sentiment score for each review, and the analyzers are on different scales. However, we might do the following:
* Convert all ratings into a binary: 1 if above 3 and 0 otherwise. 
* Do the same with the scores from the sentiment analyzers: 1 if above 0 and 0 otherwise.
* Calculate an accuracy score. 

Alternatively:
* We could compute a simple correlation between the score and the rating, and compare which has the best fit.
    * Use `np.corrcoef()`.
* Or even train a classifier to predict the rating using the output from the sentiment analyzers.
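
The first approach can be sketched on a handful of invented ratings and scores. Treating a score of exactly 0 as "no signal" and skipping it is an assumption made for this sketch, not something the exercise prescribes.

```python
# Invented ratings (1-5 stars) and sentiment scores, for illustration only
toy_ratings = [5, 1, 4, 2, 5, 3]
toy_scores = [0.4, -0.2, -0.1, 0.3, 0.0, -0.5]

# keep only the documents where the analyzer produced a non-zero score
pairs = [(r, s) for r, s in zip(toy_ratings, toy_scores) if s != 0]

y_true = [1 if r > 3 else 0 for r, _ in pairs]  # positive review: rating above 3
y_pred = [1 if s > 0 else 0 for _, s in pairs]  # positive prediction: score above 0

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(pairs)  # 3 of the 5 scored documents agree
print(accuracy)
```

Dividing by `len(pairs)` gives the accuracy among scored documents only; dividing by the total number of documents instead would also penalize an analyzer for every document it could not score.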

In [92]:
#[Answer 17.2.7]

| | Analyzer | accuracy | correlation | Precision |
|---|---|---|---|---|
| 0 | vader_raw | 0.7776 | 0.329029 | 0.896989 |
| 1 | vader_compound | 0.8318 | 0.54686 | 0.900119 |
| 2 | afinn_score | 0.8032 | 0.384245 | 0.900045 |
| 3 | opinion | 0.7943 | 0.343067 | 0.906115 |


In [10]:
# SOLUTION
import sklearn.metrics
import numpy as np
header = ['Analyzer','accuracy','correlation','Precision']
performance = []
for column in ['vader_raw','vader_compound','afinn_score','opinion']:
    subdf = df[df[column]!=0] # keep only the documents where the sentiment score is non-zero
    y = np.array([1 if i>3 else 0 for i in subdf.reviewRating_ratingValue]) # convert the rating to a binary
    x = np.array([1 if i>0 else 0 for i in subdf[column]]) # convert the sentiment score to a binary
    # share of ALL documents that are both scored and classified correctly
    accuracy = sum(x==y)/len(df)
    # calculate the correlation between the raw score and the rating
    correlation = np.corrcoef(df[column],df['reviewRating_ratingValue'])[0][1]
    # note: the 'Precision' column holds the accuracy among the scored documents only
    row = [column,accuracy,correlation,sklearn.metrics.accuracy_score(y,x)]
    performance.append(row)
pd.DataFrame(performance,columns=header)

| | Analyzer | accuracy | correlation | Precision |
|---|---|---|---|---|
| 0 | vader_raw | 0.7776 | 0.329029 | 0.896989 |
| 1 | vader_compound | 0.8318 | 0.54686 | 0.900119 |
| 2 | afinn_score | 0.8032 | 0.384245 | 0.900045 |
| 3 | opinion | 0.7943 | 0.343067 | 0.906115 |


## Exercise Section 17.3: Playing around with Outputs from Unsupervised Models
Here I want you to get acquainted with the capabilities and the syntax of the Python implementations of two famous unsupervised methods for text data: Topic Modelling and Word2Vec. 

You need to install the pyLDAvis package: `conda install -c conda-forge pyldavis`


Download the Word2Vec model here: https://www.dropbox.com/sh/lwpoyipspunzojl/AABSoO8j7EUjPLixSBkOe7Uda?dl=0

Download the TopicModel here: https://www.dropbox.com/sh/fmmxcyvnti0c1y7/AAAOgHmnD2mbbHEQiwtJsjW-a?dl=0

In [42]:
# load the models
from gensim.models import Word2Vec
model = Word2Vec.load('word_embeddings_review/w2vec_review')
from gensim.models import LdaMulticore
lda = LdaMulticore.load('topicmodel_review/lda50_reviews')

> **Ex. 17.3.1:** The Word2Vec model object. 
Here we will use the `model.wv.most_similar()` method to search the vector space. This can be used when developing lexicons.
We will see how the model has embedded the negative and positive words from our lexicons.
* First we define the union of the Vader lexicon and the Opinion lexicon. The union of two sets can be computed using the `|` operator.
* Then we keep only the words that are actually in the vocabulary of the model. The vocabulary can be found under the `model.wv.vocab` property (renamed `model.wv.key_to_index` in gensim 4), and you use an `if ... in` condition to filter.
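
The two steps can be tried out on toy data first. In the sketch below the word sets are invented, and a plain Python set named `toy_vocabulary` stands in for the model vocabulary.

```python
# Invented word sets and vocabulary, for illustration only
toy_vader_negative = {'awful', 'bad', 'grrr'}
toy_opinion_negative = {'bad', 'terrible'}
toy_vocabulary = {'bad', 'terrible', 'good', 'delivery'}  # stands in for the model vocabulary

toy_all_negative = toy_vader_negative | toy_opinion_negative  # union of the two lexicons
toy_in_vocab = [w for w in toy_all_negative if w in toy_vocabulary]  # keep known words only

print(sorted(toy_all_negative))  # ['awful', 'bad', 'grrr', 'terrible']
print(sorted(toy_in_vocab))      # ['bad', 'terrible']
```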

In [18]:
all_negative = vader_negative|negative_words
all_positive = vader_positive|positive_words

in_w2vec_positive = [w for w in all_positive if w in model.wv]
in_w2vec_negative = [w for w in all_negative if w in model.wv]
print(len(in_w2vec_negative),len(in_w2vec_positive))

3670 2530


> **Ex 17.3.2:**
Now we pick a random word from the negative (or positive) words and apply the `.most_similar` method. 
* We use the `random` module and the `random.choice` function to get a word.
* And then we `print(word,model.wv.most_similar(word))`

In [43]:
#[Answer 17.3.2]

In [45]:
# SOLUTION
import random
w = random.choice(in_w2vec_positive)
print(w,model.wv.most_similar(w))

acceptance [('submission', 0.5592600107192993), ('denial', 0.5328542590141296), ('approval', 0.5284862518310547), ('completion', 0.5146688222885132), ('finalization', 0.4938328266143799), ('eligibility', 0.48998504877090454), ('issuance', 0.48656946420669556), ('underwriting', 0.4714069366455078), ('submitting', 0.4659322202205658), ('acknowledgment', 0.46115410327911377)]






> **Ex.17.3.3:** Now we do some of the famous linear algebra (King - Man + Woman = Queen).
But instead we ask: what is good - :) + :( = ? 
We do this by applying the same method:
`.most_similar(positive=['good',':('],negative=[':)'])` 
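
To see what this kind of query does, the sketch below reproduces the idea on invented two-dimensional vectors: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The vectors and the helper names are hypothetical, and gensim's own implementation differs in detail (for example in how it normalizes the vectors).

```python
import math

# Invented 2-D "embeddings": first axis ~ quality talk, second axis ~ sentiment
toy_vectors = {
    'good': (1.0, 1.0),
    'poor': (1.0, -1.0),
    ':)': (0.0, 1.0),
    ':(': (0.0, -1.0),
    'delivery': (1.0, 0.0),
}

def cosine(a, b):
    # cosine similarity between two 2-D vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def toy_most_similar(positive, negative):
    # target = sum of the positive vectors minus sum of the negative vectors
    target = [0.0, 0.0]
    for w in positive:
        target = [t + v for t, v in zip(target, toy_vectors[w])]
    for w in negative:
        target = [t - v for t, v in zip(target, toy_vectors[w])]
    # rank every word that was not part of the query itself
    candidates = [w for w in toy_vectors if w not in positive + negative]
    return sorted(((w, cosine(toy_vectors[w], target)) for w in candidates),
                  key=lambda pair: pair[1], reverse=True)

print(toy_most_similar(positive=['good', ':('], negative=[':)']))
```

With these toy vectors, good - :) + :( lands exactly on the vector for 'poor', so 'poor' is ranked first and 'delivery' second.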

In [None]:
#[Answer 17.3.3]

In [46]:
# SOLUTION
model.wv.most_similar(positive=['good',':('],negative=[':)'])





[('poor', 0.6456157565116882),
 ('terrible', 0.5774315595626831),
 ('decent', 0.5770268440246582),
 ('bad', 0.5701460838317871),
 ('subpar', 0.5187545418739319),
 ('horrible', 0.5081883668899536),
 ('disappointing', 0.5017893314361572),
 ('awful', 0.4973694682121277),
 ('lousy', 0.4924020767211914),
 ('substandard', 0.48289310932159424)]

> **Exercise 17.3.5:** Interactive Plotting of Word Embedding
**Inspecting clusters of Words**
Run PCA on a subsample of the word vectors by running the cell below.
* Inspect what the different dimensions seem to represent by hovering over the words.

In [16]:

import plotly.offline as py # import plotly in offline mode
py.init_notebook_mode(connected=True) # initialize the offline mode, with access to the internet or not.
import plotly.tools as tls 
tls.embed('https://plot.ly/~cufflinks/8') # embed cufflinks.
# import cufflinks and make it offline
import cufflinks as cf
cf.go_offline() # initialize cufflinks in offline mode
import random
negative_sample = random.sample(list(in_w2vec_negative),200)
positive_sample = random.sample(list(in_w2vec_positive),200)
neutral = random.sample(list(model.wv.vocab),600)
words = negative_sample+positive_sample+neutral
valence = (['Negative']*200)+(['Positive']*200)+(['Neutral']*600) # same order as `words`: negative, positive, neutral
X = [model.wv[w] for w in words]
#from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
#embedding = TSNE(n_components=2).fit_transform(X)
embedding = PCA(n_components=2).fit_transform(X)
embedding_df = pd.DataFrame(embedding,columns=['x','y'])
embedding_df['word'] = words
embedding_df['valence'] = valence
embedding_df.iplot(x='x',y='y',categories='valence',text='word')

>**Ex.17.3.6:** Inspecting TOPIC MODELS using pyLDAvis
Let's look at the `lda` model object. We shall use the pyLDAvis package to "discover" what the topic model has found.

In [52]:
# load corpus
import pickle
sample_corpus = pickle.load(open('topicmodel_review/lda_sample_corpus.pkl','rb'))
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda,corpus=sample_corpus,dictionary=lda.id2word) # this takes a while to run

In [54]:
#vis