# 05 - Taming Text

## Deadline
Thursday December 15, 2016 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution
you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code

## Background
In this homework you will explore a relatively large corpus of emails released to the public during the
[Hillary Clinton email controversy](https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy).
You can find the corpus in the `hillary-clinton-emails` directory of this repository, while more detailed information 
about the [schema is available here](https://www.kaggle.com/kaggle/hillary-clinton-emails).

## Assignment
1. Generate a word cloud based on the raw corpus -- I recommend using the [Python word_cloud library](https://github.com/amueller/word_cloud).
With the help of `nltk` (already available in your Anaconda environment), implement a standard text pre-processing 
pipeline (e.g., tokenization, stopword removal, stemming, etc.) and generate a new word cloud. Discuss briefly the pros and
cons (if any) of the two word clouds you generated.

2. Find all the mentions of world countries in the whole corpus, using the `pycountry` utility (*HINT*: remember that
there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.)
Perform sentiment analysis on every email message using the demo methods in the `nltk.sentiment.util` module. Aggregate 
the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level)
that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo
methods from the sentiment analysis module -- can you find substantial differences?

3. Using the `models.ldamodel` module from the [gensim library](https://radimrehurek.com/gensim/index.html), run topic
modeling over the corpus. Explore different numbers of topics (varying from 5 to 50), and settle for the parameter which
returns topics that you consider to be meaningful at first sight.

4. *BONUS*: build the communication graph (unweighted and undirected) among the different email senders and recipients
using the `NetworkX` library. Find communities in this graph with `community.best_partition(G)` method from the 
[community detection module](http://perso.crans.org/aynaud/communities/index.html). Print the most frequent 20 words used
by the email authors of each community. Do these word lists look similar to what you've produced at step 3 with LDA?
Can you identify clear discussion topics for each community? Discuss briefly the obtained results.


In [None]:
#required imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import string
import collections
import re
import pycountry
from os import path
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import bigrams
from nltk import ngrams
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment import SentimentAnalyzer
from nltk.corpus import opinion_lexicon
from nltk.tokenize import treebank
from pylab import rcParams
import sentiment as senth
from gensim import models, corpora


#local helper file import
import nlp_helper as nlph

# 1. Generate a word cloud based on emails' content

Load email with subjects, already extracted text and raw text.

In [None]:
raws = pd.read_csv('hillary-clinton-emails/Emails.csv',usecols=['ExtractedSubject','ExtractedBodyText','RawText'])
raws.head()

We see some NaN values in ExtractedSubject and ExtractedBodyText. After inspecting RawText, the majority of these cases can be explained (e.g. no subject, forwarded email, ...).
We decide to trust the extracted fields and drop RawText.

In [None]:
raws.drop(['RawText'], axis= 1,inplace=True)

In [None]:
body = pd.DataFrame()
body['text'] = raws.apply(nlph.concat_subj_txt, axis=1)
body.head()

As a first step, we build a naive word cloud from the concatenation of all rows.

In [None]:
text = ' '.join(body['text'])

In [None]:
text[:1000]

### 1.1. Naive Wordcloud on raw text

In [None]:
nlph.classic_cloud(text)

Not bad. Let's try something more stylish.

In [None]:
nlph.img_cloud(text, 'hc.png')

### 1.2. WordCloud on clean/preprocessed data
We inspect the most frequent tokens and remove the unwanted ones.

In [None]:
body_tokenized = word_tokenize(text.lower())

In [None]:
stop = set(stopwords.words('english'))
stop.update(string.punctuation) #Remove punctuation

In [None]:
body_tokenized = [i for i in word_tokenize(text.lower()) if i not in stop]

After removing the 'english' stopwords and the punctuation, we check what we still need to remove

In [None]:
counter=collections.Counter(body_tokenized)
print(counter.most_common()[:50])

We remove the unwanted words

In [None]:
stop.update(['nan', '\'s', '--', '``', 'w', 'fw', 'n\'t', '\'m', 'also', 'thx', 'fyi', 'pls', '\'\'', '-', '—', 'pm', '•']) #Remove other unwanted characters and words

In [None]:
body_tokenized = [i for i in word_tokenize(text.lower()) if i not in stop]

In [None]:
nlph.classic_cloud(' '.join(body_tokenized))

If we print body_tokenized, we can still see some unwanted strings such as dots ('..', '...'), dates and URLs. Since they are quite rare, we could ignore them for the sole purpose of drawing the word cloud, which only considers the most frequent tokens. However, we will need clean data later on, so we decide to remove them.

In [None]:
print(len(body_tokenized))
r = re.compile('[.]{1,3}$')
dots = set(filter(r.match, body_tokenized))
body_tokenized = [token for token in body_tokenized if token not in dots]
print(len(body_tokenized))

We thus removed 477 tokens. Now, let's remove the URLs and file names.

In [None]:
is_file = lambda x: x.endswith('.docx')
is_url = lambda x: x.startswith('htte/') or x.startswith('http/')
body_tokenized = [token for token in body_tokenized if not is_file(token) and not is_url(token)]
print(len(body_tokenized))

It is still not perfect, since we can still find some unwanted symbols, useless dates/times and numbers.
Let's now apply some normalization techniques to our data.
Two classic methods are stemming and lemmatization.

In [None]:
porter = PorterStemmer()
wnl = WordNetLemmatizer()

stemmed = [porter.stem(token) for token in body_tokenized]
lemmatized = [wnl.lemmatize(token) for token in body_tokenized]

We now compute the word frequencies and compare their respective top 20 for potential differences.

In [None]:
count_stemm = collections.Counter(stemmed)
print("################## Stemmed Tokens ####################")
print(count_stemm.most_common()[:20])

count_lemm = collections.Counter(lemmatized)
print("################## Lemmatized Tokens #################")
print(count_lemm.most_common()[:20])

*TODO*: discuss the differences between the two lists.

We print the lemmatized word cloud.

In [None]:
nlph.img_cloud(' '.join(lemmatized), 'hc.png')

There is not much difference compared with the cloud built from the tokens that were neither stemmed nor lemmatized.
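
One way to back such a comparison with numbers is to look at which tokens appear in only one of two top-k lists. The sketch below uses only the standard library; the function name `top_k_difference` and the toy token lists are hypothetical and not taken from the notebook or the corpus.

```python
from collections import Counter

def top_k_difference(tokens_a, tokens_b, k=20):
    """Return the tokens that appear in only one of the two top-k lists."""
    top_a = {token for token, _ in Counter(tokens_a).most_common(k)}
    top_b = {token for token, _ in Counter(tokens_b).most_common(k)}
    return sorted(top_a - top_b), sorted(top_b - top_a)

# Toy token lists (hypothetical, not taken from the corpus)
raw_tokens = ["states", "state", "calls", "call", "call", "states"]
lemmatized_tokens = ["state", "state", "call", "call", "call", "state"]

only_raw, only_lemmatized = top_k_difference(raw_tokens, lemmatized_tokens, k=2)
print(only_raw, only_lemmatized)
```

In the notebook, the same function could presumably be called on `body_tokenized` and `lemmatized` with `k=20`.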

*TODO*: discuss the pros and cons of each cloud.

### 1.3. Test: Frequency Analysis of bigrams

We see plenty of verbs and adjectives that lack the context needed to be meaningful. Let's start by creating bigrams from the "clean" text we obtained before. We can then check the top bigrams in terms of frequency.

In [None]:
tokens_bigram = bigrams(lemmatized)
count_bigram= collections.Counter(tokens_bigram)
print(count_bigram.most_common()[:50])

We observe some interesting pairs of words coming up. 
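
For readers unfamiliar with n-grams, here is a minimal standard-library sketch of what counting consecutive token runs amounts to. The helper `count_ngrams` and the toy tokens are hypothetical; the notebook itself relies on `nltk.bigrams`.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    # zip over n shifted views of the token list yields the n-grams as tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Toy tokens (hypothetical, not taken from the corpus)
tokens = ["white", "house", "press", "secretary", "white", "house"]
bigram_counts = count_ngrams(tokens, n=2)
print(bigram_counts.most_common(1))
```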

# 2. Identification of countries and sentiment analysis per email. Aggregation of perceptions on countries.

### 2.1. Detect Countries
We now want to extract the mentions of countries in the text of each email.
We will start with our DataFrame "body" which contained the aggregated text of the two fields (subject + body text). 

In [None]:
geo_body = body.copy()
geo_body.head()

We will try to extract the country names from each of our rows (one row = one email). We thus create helper dictionaries from the pycountry set of countries. They map a country's alpha_2 code, alpha_3 code and lower-case name back to its name.

In [None]:
a2_name = {}
a3_name = {}
for c in pycountry.countries:
    a2_name[c.alpha_2] = c.name
    a3_name[c.alpha_3] = c.name
    
names = [c.name for c in pycountry.countries]
lowers = map(lambda s: s.lower(), names)
lower_to_name = {}
for c in pycountry.countries:
    lower_to_name[c.name.lower()] = c.name

We tokenize our rows and clean them a little by removing punctuation tokens. We do not remove stop words, since a country name could be composed of multiple words including a stop word.
We also need to remove tokens like 'PM', 'RE' and others which could be mistaken for country codes (e.g. PM <=> Saint Pierre and Miquelon).

In [None]:
#tokenization
geo_body['text'] = geo_body.apply(lambda row: word_tokenize(row.text) , axis=1)
#punctuation removal
geo_body['text'] = geo_body.apply(lambda row: [i for i in row.text if i not in string.punctuation], axis=1)
#bad ids
geo_body['text'] = geo_body.apply(lambda row: [i for i in row.text if i not in ['PM', 'RE', 'AM', 'TO', 'FM', 'NO', 'AND']], axis=1)

geo_body.head()

In [None]:
geo_body['countries'] = geo_body.apply(nlph.extract_country, axis=1)
geo_body.head(10)

We have some first country results. However, we need to go further, since we are overlooking countries in two different scenarios:
- a country name can be composed of multiple tokens (e.g. United States)
- a country can be represented by a city (e.g. Cairo, row 3)
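
For reference, a single-token lookup of the kind applied so far might look like the sketch below. This is an assumption about the general approach, not the actual `nlph.extract_country` helper: the lookup tables are tiny hand-written stand-ins for the dictionaries built from pycountry, and the names `lookup_country` and `extract_countries` are hypothetical.

```python
# Hypothetical lookup tables; the notebook builds the real ones from pycountry
A2_NAME = {"CH": "Switzerland", "FR": "France", "PM": "Saint Pierre and Miquelon"}
A3_NAME = {"CHE": "Switzerland", "FRA": "France"}
LOWER_TO_NAME = {"switzerland": "Switzerland", "france": "France"}
# Upper-case tokens that look like country codes but are usually ordinary words
AMBIGUOUS_CODES = {"PM", "AM", "RE", "TO", "NO"}

def lookup_country(token):
    """Return the country name for a single token, or None if there is no match."""
    if token in AMBIGUOUS_CODES:
        return None
    # Codes are matched case-sensitively so that words like "to" never match
    if token in A2_NAME:
        return A2_NAME[token]
    if token in A3_NAME:
        return A3_NAME[token]
    # Full names are matched case-insensitively
    return LOWER_TO_NAME.get(token.lower())

def extract_countries(tokens):
    """Collect the distinct countries mentioned in a token list, in order of appearance."""
    found = []
    for token in tokens:
        name = lookup_country(token)
        if name is not None and name not in found:
            found.append(name)
    return found

print(extract_countries(["Meeting", "in", "switzerland", "at", "3", "PM", "with", "FRA"]))
```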

We first create new columns containing the bigrams and 3-grams of each text and then search again for countries.

In [None]:
def composed_token(grams):
    return list(map(lambda x: ' '.join(x), grams))
        

geo_body['bigrams'] = geo_body.apply(lambda r: composed_token(bigrams(r.text)), axis=1)
geo_body['trigrams'] = geo_body.apply(lambda r: composed_token(ngrams(r.text,3)), axis=1)
geo_body.head()

We update the countries column, adding the ones found through bigrams and trigrams.

In [None]:
geo_body['countries'] = geo_body.apply(nlph.bigram_search, axis=1)
geo_body['countries'] = geo_body.apply(nlph.trigram_search, axis=1)
geo_body.drop(['bigrams', 'trigrams'], axis=1, inplace=True)

In [None]:
geo_body.head(10)

We now detect multi-word countries as well, for instance the United States.
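
The multi-token search can be sketched as follows. This is only an illustration under stated assumptions: `MULTIWORD_NAMES` is a hand-written stand-in for the pycountry names, and `join_ngrams` and `multiword_countries` are hypothetical helpers, not the `nlph.bigram_search` and `nlph.trigram_search` functions used above.

```python
# Hypothetical table of multi-word country names, keyed by lower-case form
MULTIWORD_NAMES = {
    "united states": "United States",
    "saudi arabia": "Saudi Arabia",
    "united arab emirates": "United Arab Emirates",
}

def join_ngrams(tokens, n):
    """Return every run of n consecutive tokens joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def multiword_countries(tokens, max_n=3):
    """Look up bigrams first, then trigrams, against the multi-word names."""
    found = []
    for n in range(2, max_n + 1):
        for gram in join_ngrams(tokens, n):
            name = MULTIWORD_NAMES.get(gram.lower())
            if name is not None and name not in found:
                found.append(name)
    return found

tokens = ["Call", "with", "the", "United", "Arab", "Emirates", "and", "Saudi", "Arabia"]
print(multiword_countries(tokens))
```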

We still miss the cases where cities are mentioned in an email without the country.
We thus use a named-entity recognition classifier (stanford-nltk) and then query for the country of the locations obtained.

->__SEE NOTEBOOK: "__country_detection_stanfordNER.ipynb"

It is computationally expensive (~4-5h for a full location detection). 
TO RUN AND INSERT HERE.

### 2.2. Sentiment Analysis

As seen in the demos of the following page http://www.nltk.org/_modules/nltk/sentiment/util.html, there are many ways to compute a sentiment polarity or subjectivity of a given document.

A first solution would be to train a classifier on an existing corpus and then classify our emails (see demo_tweets, demo_movie_reviews). However, those datasets are very domain-specific.

We thus decided to use two algorithms, namely the Vader approach and the Liu Hu lexicon method.
Both share the property that they need no prior training to classify a given sentence or text.
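
To make the lexicon idea concrete, here is a toy sketch of a training-free classifier: it counts positive and negative words and lets the majority decide. The two word sets and the function `lexicon_polarity` are hypothetical illustrations; the actual Liu Hu method uses NLTK's `opinion_lexicon` corpus, and Vader additionally weights words by intensity.

```python
# Toy lexicons (hypothetical); the real method uses NLTK's opinion_lexicon corpus
POSITIVE = {"good", "great", "support", "thanks"}
NEGATIVE = {"bad", "crisis", "attack", "fail"}

def lexicon_polarity(text):
    """Return 1, -1 or 0 depending on whether positive or negative words dominate."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    # (pos > neg) - (pos < neg) is 1, -1 or 0 for more, fewer or equal positives
    return (pos > neg) - (pos < neg)

print(lexicon_polarity("Great meeting, thanks for the support"))
print(lexicon_polarity("The attack made the crisis worse"))
```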


We first define the two per-row classification functions:


In [None]:
def vote(pos, neg):
    """Return 1 if pos > neg, -1 if pos < neg and 0 on a tie."""
    polarity = 0
    if pos > neg:
        polarity = 1
    elif pos < neg:
        polarity = -1
    return polarity

# Build the analyzer, the tokenizer and the lexicon sets once,
# instead of rebuilding them for every row or every word
vader_analyzer = SentimentIntensityAnalyzer()
liu_hu_tokenizer = treebank.TreebankWordTokenizer()
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def vader_classify(row):
    polarity = vader_analyzer.polarity_scores(row.text)
    pos = polarity['pos']
    neg = polarity['neg']
    return vote(pos, neg)

def liu_hu_classify(row):
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in liu_hu_tokenizer.tokenize(row.text)]
    for word in tokenized_sent:
        if word in positive_words:
            pos_words += 1
        elif word in negative_words:
            neg_words += 1
    return vote(pos_words, neg_words)

We now create a new dataframe with one column per sentiment analysis method, holding its output for each email.

In [None]:
if 0:
    sent_body = body.copy()
    sent_body['Vader'] = sent_body.apply(vader_classify, axis=1)

Since the run is very computationally expensive we save the obtained dataframe.

|text|vader polarity|

In [None]:
if 0:
    sent_body.to_csv('sentiment_vader.csv', sep=',', encoding='utf-8')

Same for the Liu Hu method.
We do not run this sentiment analysis for now: it is very expensive and takes too much time. We will try to run it later on and compare the results with the Vader outputs.
 
|text|vader polarity|liu hu polarity|

In [None]:
if 0:
    sent_body['LiuHu'] = sent_body.apply(liu_hu_classify, axis=1)
    sent_body.to_csv('sentiment_vader_liu_hu.csv', sep=',', encoding='utf-8')

We load our csv containing the table with vader sentiment analysis and concatenate it with the table containing the detected countries.

In [None]:
sent_body = pd.read_csv('sentiment_vader.csv', index_col=0)
sent_body.head()

In [None]:
geo_sent = pd.concat([geo_body, sent_body], axis=1)
geo_sent.drop(['text'], axis=1, inplace=True)
geo_sent.head(10)

We now disaggregate the countries: each country mentioned in an email gets its own row, paired with that email's polarity.
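
In plain Python, this disaggregation and the per-country averaging that follows amount to something like the sketch below. The function `aggregate_by_country` and the toy input are hypothetical; the notebook does the same job with a pandas DataFrame.

```python
from collections import defaultdict

def aggregate_by_country(emails):
    """emails: iterable of (countries, polarity) pairs, one per email.

    Returns {country: (mention_count, mean_polarity)}.
    """
    totals = defaultdict(lambda: [0, 0])  # country -> [mentions, polarity sum]
    for countries, polarity in emails:
        for country in countries:
            totals[country][0] += 1
            totals[country][1] += polarity
    return {country: (n, s / n) for country, (n, s) in totals.items()}

# Toy input (hypothetical values, not taken from the corpus)
emails = [(["France", "Libya"], -1), (["France"], 1), ([], 1), (["Libya"], -1)]
summary = aggregate_by_country(emails)
print(summary)
```

Emails that mention no country simply contribute nothing, which mirrors the `len(row.countries) != 0` check in the cell below.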

In [None]:
country_sentiment = pd.DataFrame(columns=['Country', 'Sentiment'])

def desaggregate(row):
    if len(row.countries) != 0:
        for c in row.countries:
            country_sentiment.loc[len(country_sentiment)] = [c, row.Vader]


geo_sent.apply(desaggregate, axis=1)
country_sentiment.head(10)

In [None]:
country_sentiment['Frequency'] = 1
country_sentiment = country_sentiment.groupby('Country').sum()
country_sentiment['Sentiment'] = country_sentiment['Sentiment'] / country_sentiment['Frequency']

In [None]:
country_sentiment = country_sentiment.sort_values(by='Frequency', ascending=False)
country_sentiment.head(10)

We now plot the country frequencies as well as a sentiment indicator.

In [None]:
# Credits to Stack Overflow :
# http://stackoverflow.com/questions/31313606/pyplot-matplotlib-bar-chart-with-fill-color-depending-on-value
from matplotlib import cm
from matplotlib.colors import Normalize
rcParams['figure.figsize'] = 12, 8

# Keep the 50 most frequently mentioned countries (fewer if there are not that many)
top = country_sentiment.head(50)
y = np.array(top['Sentiment'], dtype=float)

# Set up colors : red to green over the observed sentiment range,
# so that the bar colors and the colorbar share the same scale
norm = Normalize(vmin=y.min(), vmax=y.max())
colors = cm.RdYlGn(norm(y))
mappable = cm.ScalarMappable(norm=norm, cmap='RdYlGn')
mappable.set_array(y)

# Display bar plot : country frequency vs. country name, with color indicating polarity score
fig, ax = plt.subplots()
ax.bar(range(len(top)), top['Frequency'], align='center', tick_label=top.index, color=colors)
clb = fig.colorbar(mappable, ax=ax)
clb.ax.set_title("Sentiment")
plt.xticks(rotation=70, ha='right')
ax.set_xlabel("Country")
ax.set_ylabel("Frequency")
plt.show()

# 3. Topic Modeling on the corpus

We start from the extracted subjects and body texts of the emails, concatenate them, and remove the stopwords and punctuation.

In [None]:
raws = pd.read_csv('hillary-clinton-emails/Emails.csv',usecols=['ExtractedSubject','ExtractedBodyText'])
body = pd.DataFrame()
body['text'] = raws.apply(nlph.concat_subj_txt, axis=1)

In [None]:
stop = set(stopwords.words('english'))
stop.update(string.punctuation) #Remove punctuation
stop.update(['nan', '\'s', '--', '``', 'w', 'fw', 'n\'t', '\'m', 'also', 'thx', 'fyi', 'pls', '\'\'', '-', '—', 'pm', '•'])

In [None]:
body_tokenized = [[i for i in word_tokenize(text.lower()) if i not in stop] for text in body['text']]

We create a dictionary and the corpus using the tokenized body of the emails.

In [None]:
dictionary = corpora.Dictionary(body_tokenized)
corpus = [dictionary.doc2bow(text) for text in body_tokenized]

We then build the LDA model with a number of topics ranging from 5 to 50 (in steps of 5) and print 7 words per topic, to see which number of topics returns meaningful topics at first sight:

In [None]:
for num_topics in range(5, 51, 5):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    print("Number of topics : ", num_topics, " / Number of words per topic : 7 \n")
    topics = lda.show_topics(num_topics=num_topics, num_words=7, formatted=False)
    # Number the printed topics from 1 without reusing the outer loop variable
    for topic_number, (topic_id, wordlist) in enumerate(topics, start=1):
        print("Topic " + str(topic_number) + " has words :", end=" ")
        for word, prob in wordlist:
            print(word, end=", ")
        print("")
    print("\n-------------------------------------------------------------\n")

When the number of topics is appropriate, the words within a topic seem related. The number of topics is too small when a topic contains words that could be split into two different topics. Conversely, the number of topics is too high when the words defining one topic are spread across multiple topics.

In our case, we find that 20 topics give meaningful results.
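
As a rough complement to inspecting the topics by eye, one could flag pairs of topics whose top words overlap heavily, which hints at one theme being spread across several topics. The sketch below is a heuristic of our own and not part of gensim: the Jaccard measure, the 0.5 threshold, the function names and the toy topics are all assumptions.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundant_topic_pairs(topics, threshold=0.5):
    """topics: list of top-word lists, one per topic.

    Returns the index pairs of topics whose overlap reaches the threshold.
    """
    return [
        (i, j)
        for (i, words_i), (j, words_j) in combinations(enumerate(topics), 2)
        if jaccard(words_i, words_j) >= threshold
    ]

# Toy topics (hypothetical, not actual model output)
topics = [
    ["state", "department", "secretary", "office"],
    ["state", "department", "secretary", "call"],
    ["israel", "peace", "talks", "palestinian"],
]
print(redundant_topic_pairs(topics))
```

The word lists could presumably be taken from the second element of each tuple returned by `lda.show_topics(..., formatted=False)`.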