# Tutorial 1

In this session, we will look at the WikiLeaks DNC email dataset and learn how to gather statistics about it, preprocess the emails, and extract useful information.

https://en.wikipedia.org/wiki/2016_Democratic_National_Committee_email_leak
https://wikileaks.org/dnc-emails/

## DNC emails

Around 40 000 emails were leaked from the DNC, involving around 1000 distinct users.

You are given an already pre-processed dataset in .json format, in which the emails have been cleaned up a bit and put into a 'nice' structure. If you are interested in how this file was crawled and generated, find me later or watch the repository: https://github.com/hanveiga/nlp-amld-2018

## Loading the JSON file

In the data folder you will find a JSON file.

In [153]:
import pandas as pd

path_data = '../../data/clean_json.json'

def load_json_data(path_to_file):
    data_DF = pd.read_json(path_to_file, encoding='ascii')
    # Lower-case sender addresses so the same user is not counted twice
    data_DF['from'] = data_DF['from'].str.lower()
    # Collapse runs of whitespace (newlines, tabs) in each body into single spaces
    data_DF['body'] = data_DF['body'].apply(lambda x: " ".join(str(x).split()))
    # To speed up the computation while experimenting, return a slice instead,
    # e.g. data_DF[0:12000]. You should play with the full dataset later!
    return data_DF

Loading dataset from data folder

In [154]:
data = load_json_data(path_data)
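
To make the expected structure concrete, here is a minimal sketch of a single record and of the clean-up that `load_json_data` applies to it. The field names (`body`, `date`, `from`, `from_name`, `subject`, `to`) are inferred from the code in this notebook; the sample values and the helper `normalise` are invented for illustration and are not part of the real dataset or repository.

```python
# Hypothetical record: field names are inferred from this notebook, values are made up.
sample = [{
    "from": "Alice@Example.org",
    "from_name": '"Doe, Alice"',
    "to": [["Bob Roe", "Bob@Example.org"]],
    "subject": "Lunch",
    "date": "2016-05-17 14:05:00",
    "body": "Hi  Bob,\n\n  see you\tat noon.",
}]

def normalise(record):
    """Apply the same clean-up as load_json_data to one record (illustrative helper)."""
    cleaned = dict(record)
    cleaned["from"] = record["from"].lower()
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    cleaned["body"] = " ".join(str(record["body"]).split())
    return cleaned

records = [normalise(r) for r in sample]
print(records[0]["from"])
print(records[0]["body"])
```

Lower-casing the sender address matters later on, because the per-user counts group emails by the `from` field.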

Defining the Dataset object

In [155]:
from collections import defaultdict, Counter
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
import string

stop_words_list = stopwords.words('english') + list(string.punctuation) #TODO: add other words?

class Dataset(object):
    def __init__(self, dataframe):
        self.data = dataframe
        self.user_emails = list(set(self.data['from']))
        self._generate_email2name()
        self.get_total_vocabulary()
        self.word_count = Counter()
        self.stop_words_list = stop_words_list
        

    def _generate_email2name(self):
        self.EMAIL2NAME = defaultdict(list) # in case there are aliases
        user_emails = self.data['from']
        user_names = self.data['from_name']
        receivers = self.data['to']
        for email, name in zip(list(user_emails),list(user_names)):
            email = email.lower()
            name = name.replace('"','')
            if name not in self.EMAIL2NAME[email]:
                self.EMAIL2NAME[email].append(name)

        # Also register the names of the recipients
        for receiver in list(receivers):
            for name, email in receiver:
                email = email.lower()
                name = name.replace('"','')
                if name not in self.EMAIL2NAME[email]:
                    self.EMAIL2NAME[email].append(name)

    def get_top_spammers(self, ntop=9999):
        print("Count \t Email \t \t \t Name")
        list_spammers = []
        printout = 0
        for a in self.data.groupby(self.data['from'])['from'].count()\
                                        .reset_index(name='count') \
                                        .sort_values(['count'], ascending=False)\
                                        .iterrows():
                _, email = a
                if printout < ntop:
                    print("%i \t %s \t %s" %(email['count'],email['from'],self.EMAIL2NAME[email['from']][0]))
                    printout += 1
                    list_spammers.append([email['count'],email['from'],self.EMAIL2NAME[email['from']][0]])
        return list_spammers
            
    def get_total_vocabulary(self):
        # Concatenates all bodies and all subjects into one long string
        self.vocabulary = self.data['body'].str.cat(sep=' ') + ' ' + self.data['subject'].str.cat(sep=' ')
        return self.vocabulary
    
    def get_vocabulary_count(self,stop_words=False):
        if stop_words:
            self.word_count = Counter([x for x in self.vocabulary.split(' ') if x not in self.stop_words_list])
        else:
            self.word_count = Counter([x for x in self.vocabulary.split(' ')])
        return self.word_count
    
    def get_top_words(self,stop_words=False):
        if len(self.word_count.keys())==0:
            self.get_vocabulary_count(stop_words=stop_words)
        print('Word \t Count')
        for a,b in self.word_count.most_common(20):
            print('%s \t %i' %(a, b))
        return self.word_count.most_common(20)
        
    def generate_reduced_dataset(self, list_of_users):
        pass
        #returns a smaller dataframe

Let's explore this dataset a bit.

1. For example, who sends out most emails?

In [156]:
# Initiate the dataset
DataObject = Dataset(data)

# TODO:
# tab = DataObject.get_top_spammers(ntop=...)


2. Which words are most common?

In particular, how can we improve the output of question 2 if the most common words aren't particularly interesting?

In [157]:
word_count = DataObject.get_top_words()

Word 	 Count
the 	 582792)
to 	 411832)
of 	 286168)
and 	 285184)
a 	 251727)
in 	 212081)
that 	 159347)
for 	 148611)
on 	 141931)
is 	 126653)
Trump 	 111331)
- 	 93388)
with 	 88619)
I 	 81842)
he 	 75887)
have 	 75697)
be 	 75553)
at 	 74138)
PM 	 73430)
his 	 69326)


That doesn't seem very relevant... How can we find more relevant words?

Hint:
Try modifying:

~~~
stop_words_list = stopwords.words('english') + list(string.punctuation) 
~~~

In [158]:
# Recompute the word counts, this time filtering out the stop words
DataObject.get_vocabulary_count(stop_words=True)
word_count = DataObject.get_top_words()

Word 	 Count
Trump 	 111331)
I 	 81842)
PM 	 73430)
2016 	 67980)
The 	 66008)
· 	 53101)
May 	 52171)
Subject: 	 46889)
From: 	 46479)
Democratic 	 46336)
To: 	 46042)
Sent: 	 45427)
would 	 43856)
Donald 	 42110)
– 	 40382)
said 	 39250)
RE: 	 36435)
Republican 	 36106)
National 	 33004)
-- 	 32753)


Try extending the stop_words_list, for example:
~~~
DataObject.stop_words_list = stop_words_list + ...
~~~

In [142]:
DataObject.stop_words_list = stop_words_list + []
DataObject.get_vocabulary_count(stop_words=True)
word_count = DataObject.get_top_words()

Word 	 Count
Trump 	 25733)
I 	 22574)
– 	 19457)
The 	 18198)
2016 	 17798)
Democratic 	 16978)
May 	 14464)
· 	 13100)
would 	 12978)
From: 	 11408)
Subject: 	 11260)
said 	 11206)
To: 	 11093)
Donald 	 10994)
Republican 	 10977)
Sent: 	 10761)
National 	 10471)
RE: 	 8273)
And 	 8017)
PM 	 7932)
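
The output above is still dominated by capitalised variants ("The", "And"), header debris ("From:", "Subject:") and stray symbols. One possible direction, sketched below under assumptions, is to normalise each token before looking it up in the stop list. The small stop sets and the helper `count_words` are hypothetical stand-ins so that the sketch runs without NLTK; in the notebook you would start from `stop_words_list` instead.

```python
import string
from collections import Counter

# Small stand-in stop list so the sketch runs without downloading NLTK data.
BASE_STOP_WORDS = {"the", "to", "of", "and", "a", "in", "i", "would"}
# Email header debris of the kind seen in the output above; extend as you see fit.
EXTRA_STOP_WORDS = {"from", "sent", "subject", "re", "pm", "am"}

def count_words(text, stop_words):
    """Lower-case tokens, strip surrounding punctuation, then drop stop words."""
    strip_chars = string.punctuation + "·–—"
    counts = Counter()
    for token in text.split():
        word = token.strip(strip_chars).lower()
        # Tokens made only of punctuation become empty strings and are skipped.
        if word and word not in stop_words:
            counts[word] += 1
    return counts

text = "From: Trump To: The DNC Subject: RE: Trump said -- the debate · Sent: 5 PM"
counts = count_words(text, BASE_STOP_WORDS | EXTRA_STOP_WORDS)
print(counts.most_common(3))
```

Lower-casing merges "The" with "the", which is convenient for stop words but also merges "May" (the month) with "may", so whether to do it depends on what you want to find.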


### One last question...
When do people email the most?

In [159]:
%matplotlib inline
from matplotlib import pyplot as plt

def plot_time(dataframe):
    # Work on a copy so that adding a column does not trigger SettingWithCopyWarning
    new = dataframe[['date']].copy()
    new['hour'] = pd.DatetimeIndex(new['date']).hour
    new['hour'].hist(bins=24)
    plt.title('Emails per hour')
    
#plot_time(...)
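
To see what the histogram is counting, here is a minimal sketch that buckets timestamps by hour using only the standard library instead of pandas. The timestamps are invented, and the assumption that dates parse as ISO-formatted strings may not hold for the real `date` column.

```python
from collections import Counter
from datetime import datetime

# Invented timestamps; the format of the real 'date' column is an assumption.
dates = [
    "2016-05-17 09:15:00",
    "2016-05-17 09:48:00",
    "2016-05-18 14:02:00",
    "2016-05-19 23:30:00",
]

# Count how many emails fall into each hour of the day (0-23).
emails_per_hour = Counter(datetime.fromisoformat(d).hour for d in dates)

# Print a crude text histogram, one row per hour that has at least one email.
for hour in sorted(emails_per_hour):
    print(f"{hour:02d}h {'#' * emails_per_hour[hour]}")
```

Keep in mind that the hours are only comparable if all timestamps are in the same time zone.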

## Part 2: Exploring the user

Now let's try to get a feeling of what these people are talking about.

In this example, we will build a simple topic model and use spaCy to pick up relevant entities.

In particular:
1. Aggregate the communication between two people
2. Perform topic modelling on the subset of exchanged emails
3. Perform named entity extraction on the subset

The goal of this task is to find pairs of people and the keywords/topics they are talking about in their emails.
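
The first step, aggregating the communication between two people, can be sketched in a few lines. This is a simplified illustration and not the approach of the `user` class in this notebook, which collects messages per correspondent for one fixed user; the addresses and bodies are invented.

```python
from collections import defaultdict

# Invented (sender, recipient, body) triples standing in for rows of the dataset.
emails = [
    ("alice@example.org", "bob@example.org", "Debate schedule attached."),
    ("bob@example.org", "alice@example.org", "Thanks, will review the schedule."),
    ("alice@example.org", "carol@example.org", "Press call at noon."),
]

conversations = defaultdict(list)
for sender, recipient, body in emails:
    # A frozenset key ignores direction, so A->B and B->A share one bucket.
    pair = frozenset((sender.lower(), recipient.lower()))
    conversations[pair].append(body)

for pair, bodies in conversations.items():
    print(sorted(pair), len(bodies))
```

Each bucket is then the "subset of exchanged emails" on which topic modelling and entity extraction are run.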

In [160]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

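# The 'en' shortcut no longer works in spaCy 3; load an installed English pipeline by name
nlp = spacy.load('en_core_web_sm')
# All entity labels the NER component can predict
list_of_entities = nlp.get_pipe('ner').labels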
relevant_entities = ['EVENT','FAC','GPE','LAW','NORP','LOC','ORG','PRODUCT', 'PERSON']

def clean_text(text):
    return text

def display_topics(model, feature_names, no_top_words):
    topics = []
    for topic_idx, topic in enumerate(model.components_):
        for i in topic.argsort()[:-no_top_words - 1:-1]:
            topics.append(feature_names[i])
        
    return topics

def get_keywords(sentence):
    keywords = defaultdict(list)
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ in relevant_entities:
            keywords[ent.label_].append(ent.text)
    return keywords

def get_topics(emails):
    # Takes a list of emails (each expected to be a one-element list holding the body)
    # and returns the top words of every NMF and LDA topic as one flat list
    # Note: written against 2018-era scikit-learn. In scikit-learn >= 1.2,
    # get_feature_names() is get_feature_names_out() and NMF's alpha is split
    # into alpha_W / alpha_H.
    temp = []
    for em in emails:
        try:
            # Split each body on full stops; every fragment becomes one document
            temp += em[0].split('.')
        except (AttributeError, IndexError, TypeError):
            continue
    # NMF is able to use tf-idf
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=False)
    tfidf = tfidf_vectorizer.fit_transform(temp)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # LDA is a probabilistic graphical model, so it can only use raw term counts
    tf_vectorizer = CountVectorizer(stop_words='english', lowercase=False)
    tf = tf_vectorizer.fit_transform(temp)
    tf_feature_names = tf_vectorizer.get_feature_names()

    no_topics = 5

    # Run NMF
    nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
    # Run LDA
    lda = LatentDirichletAllocation(n_components=no_topics, max_iter=10, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
    no_top_words = 3
    topics1 = display_topics(nmf, tfidf_feature_names, no_top_words)
    topics2 = display_topics(lda, tf_feature_names, no_top_words)
    return topics1 + topics2

class user(object):
    def __init__(self, data, email):
        self.user = email
        self.emails = data.data.loc[data.data['from']==self.user]['body']
        self.vocabulary_raw = self.get_vocabulary(data)
        self.keywords = defaultdict(list)
        self.correspondents_count = Counter()
        self.correspondents_emails = defaultdict(list)
        self.correspondents_keywords = defaultdict(dict)
        self.get_connections(data)
        self.correspondents_topics = defaultdict(list)
        self.get_topics_correspondents()
        self.connections = self.correspondents_count.keys()

    def get_vocabulary(self, data):
        return data.data.loc[data.data['from']==self.user]['body'].str.cat(sep=' ')
    
    def get_connections(self,data):
        # return person, number of emails, top entities
        self.keywords_per_receiver = defaultdict(dict)
        for row in data.data.loc[data.data['from']==self.user].itertuples():
            indx, body, date, sender, from_name, subject, corres = row
            #try:
            if len(corres) == 0:
                continue
            for r in corres[0]:
                    if '@' not in r:
                        pass
                    else:
                        r = r.lower()
                        self.correspondents_count[r] += 1
                        self.correspondents_emails[r].append([clean_text(body)])
                        keywords = get_keywords(clean_text(body))
                        
                        if r not in self.correspondents_keywords.keys():
                            for key in relevant_entities:
                                self.correspondents_keywords[r][key] = []
                        for key in keywords.keys():
                            if key in relevant_entities:
                                self.correspondents_keywords[r][key] += keywords[key]
                                
        for receiver in self.correspondents_emails.keys():
            for row in data.data.loc[data.data['from']==receiver].itertuples():
                indx, body, date, sender, from_name, subject, corres = row
                if len(corres) == 0:
                    continue
                if self.user not in corres[0]:
                    continue
                    
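                # Wrap the body in a list, like the entries added for sent emails above,
                # because get_topics reads each email as em[0]
                self.correspondents_emails[receiver].append([str(body)])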
                keywords = get_keywords(body)
                        
                if receiver not in self.correspondents_keywords.keys():
                    #instanciate dictionary
                    for key in relevant_entities:
                        self.correspondents_keywords[receiver][key] = []
                        
                for key in keywords.keys():
                    if key in relevant_entities:
                        self.correspondents_keywords[receiver][key] += keywords[key]
    
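    def get_topics_correspondents(self):
        for corres in self.correspondents_keywords.keys():
            try:
                # Use this user's own emails rather than the global userA
                topics = get_topics(self.correspondents_emails[corres])
            except Exception:
                # Topic modelling can fail on tiny or empty subsets; fall back to no topics
                topics = []
            counter = Counter(topics)
            self.correspondents_topics[corres] = counter.most_common(5)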
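
`get_topics` feeds TF-IDF weights into NMF. As a rough intuition for what that weighting does, the sketch below scores terms with a textbook TF-IDF formula in plain Python. This is not scikit-learn's exact formula (`TfidfVectorizer` additionally smooths the document frequencies and normalises each row), and the three documents are invented.

```python
import math
from collections import Counter

# Invented mini-corpus; each string plays the role of one document.
docs = [
    "debate schedule for the sanders campaign",
    "sanders campaign statement on the debate",
    "press release on the convention schedule",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = Counter(doc)
    return {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()}

scores = tfidf(tokenised[2])
# Show the three highest-scoring terms of the third document.
for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(score, 3))
```

A word that appears in every document, such as "the" here, gets a weight of zero, which is why TF-IDF tends to surface the terms that distinguish one conversation from another.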

Suppose we are now interested in one person in particular. Some names were central to the controversy, for example:

Debbie Wasserman Schultz (email: hrtsleeve@gmail.com)     
Brad Marshall (email: marshall@dnc.org)       
Luis Miranda (email: mirandal@dnc.org) (he's just the top spammer :) )


In [161]:
userA = user(DataObject,"hrtsleeve@gmail.com")

Interesting attributes of the user object:
1. user.correspondents_emails
2. user.vocabulary_raw

In [162]:
print(userA.vocabulary_raw)
vocab_list = userA.vocabulary_raw.split(' ')
print(Counter(vocab_list).most_common())

Good. Thanks everyone. This is how he responds to Reid??? Please refer the reporter to Luis Miranda, the DNC's Communications Director. Thank you for asking and the heads up! I have copied Luis and April from the DNCC on this reply. Damn liar. Particularly scummy that he barely acknowledges the violent and threatening behavior that occurred. We need to discuss the point of disagreement as I feel strongly the initial rollout needs to be done by me. We can and should do alongside Cummings but I'm not going to hide. I am for the second one. What do others think? No, I would not encourage them to do that. As of right now, the Sanders campaign is not supporting a DNC sanctioned debate, which as you know, was part of the agreement when we added the four extra debates. We need to make sure that if we were going to entertain the possibility of having FOX host the final debate, that they understand that we start with this as a DNC sanctioned‎ debate and that is non-negotiable. Thanks. ‎Good to 

Ew, can we get something more representative?
Hint: Remove some words...

In [163]:
stop_words_list = []
vocab = [a for a in vocab_list if a not in stop_words_list]
print(Counter(vocab).most_common())

[('the', 193), ('to', 148), ('and', 90), ('a', 79), ('that', 70), ('is', 60), ('of', 59), ('in', 54), ('for', 48), ('Wasserman', 43), ('I', 34), ('on', 32), ('have', 27), ('Schultz', 27), ('it', 26), ('be', 24), ('at', 22), ('his', 22), ('he', 22), ('this', 22), ('was', 21), ('with', 21), ('We', 21), ('she', 20), ('Sanders', 20), ('been', 20), ('we', 20), ('you', 20), ('Debbie', 18), ('about', 18), ('by', 18), ('her', 18), ('The', 17), ('has', 17), ('DNC', 16), ('my', 16), ('as', 15), ('not', 14), ('from', 14), ('they', 14), ('one', 14), ('are', 14), ('when', 13), ('tax', 12), ('—', 12), ('just', 11), ('but', 11), ('This', 11), ('up', 11), ('if', 11), ('Trump', 11), ('do', 11), ('Democratic', 11), ('first', 10), ('can', 10), ('an', 10), ('Chair', 10), ('Clinton', 10), ('get', 10), ("Schultz's", 10), ('said', 10), ('going', 9), ('had', 9), ('there', 9), ('need', 9), ('what', 9), ('should', 9), ('being', 8), ('me', 8), ('before', 8), ('made', 8), ('Bernie', 8), ('no', 8), ('Jones', 8), (

## Part 3: Extracting topics
Now for the last part of this session: let's see if we can extract some interesting topics from the emails.
In particular, we want to find out what person A and person B are talking about.

We are interested in the following functions:

1. user.get_topics_correspondents     
2. get_topics   
3. get_keywords

All of them are defined above!

In [199]:
def get_top_words(dictionary, exclude=None):
    # Prints the topic words plus the most frequent entity of each type
    # for one (sender, correspondent) entry of the graph
    exclude = exclude or []  # avoid a mutable default argument
    if len(dictionary['email']) == 0:
        return
    # Topics are (word, count) pairs; keep only the words
    all_words = [word for word, _ in dictionary['topics']]
    for key in list(dictionary['keywords'].keys()):
        entities = dictionary['keywords'][key]
        if len(entities) < 2:
            continue
        count = Counter(entities)
        for a, b in count.most_common(1): # keep only the top entity per type, for example
            all_words.append(a)

    all_words = [a for a in all_words if a not in exclude]

    print('Email: ', dictionary['email'], 'To: ', dictionary['correspondent'],\
          'Words: ', all_words)

In [200]:
tab = DataObject.get_top_spammers(ntop=10)
top_s = [a[1] for a in tab]

graph = []

email = "hrtsleeve@gmail.com"
userA = user(DataObject,email)
for key in userA.correspondents_count.keys():
    graph.append({'email': email, 'correspondent': key, 'topics': userA.correspondents_topics[key], 'keywords': userA.correspondents_keywords[key], 'count': userA.correspondents_count[key]})

import pickle
# Save the result so it can be reloaded without recomputing
with open('graph_topics_dict_t.pkl', 'wb') as f:
    pickle.dump(graph, f)
#with open('graph_topics_dict_t.pkl', 'rb') as f:
#    a = pickle.load(f)

Count 	 Email 	 	 	 Name
1780 	 mirandal@dnc.org 	 Miranda, Luis
1518 	 hendricksl@dnc.org 	 Hendricks, Lauren
1319 	 brinsterj@dnc.org 	 Brinster, Jeremy
1110 	 walkere@dnc.org 	 Walker, Eric
1104 	 dncpress@dnc.org 	 DNC Press
1098 	 sargem@dnc.org 	 Sarge, Matthew
1077 	 freundlichc@dnc.org 	 Freundlich, Christina
1032 	 comers@dnc.org 	 Comer, Scott
892 	 garciaw@dnc.org 	 Garcia, Walter
889 	 bhatnagara@dnc.org 	 Bhatnagar, Akshai


In [201]:
for e in graph:
    exclude_words = []
    get_top_words(e, exclude = exclude_words)

Email:  hrtsleeve@gmail.com To:  paustenbachm@dnc.org Words:  ['We', 'worded', 'statement', 'liar', 'Damn', 'Democratic', 'Capitol', 'CNN', 'Nevada', '’s Indiana primary', 'the Hillary Victory Fund', 'Sanders', 'the Bay Area']
Email:  hrtsleeve@gmail.com To:  mirandal@dnc.org Words:  ['Wasserman', 'Schultz', 'Sanders', 'We', 'Good', 'Democratic', 'Nevada Democratic', 'DNC', 'Miranda', 'the Election Law Blog', 'Sanders', 'Holocaust', 'Sanders']
Email:  hrtsleeve@gmail.com To:  allenz@dnc.org Words:  []
Email:  hrtsleeve@gmail.com To:  ldd@demconvention.com Words:  []
Email:  hrtsleeve@gmail.com To:  houghtonk@dnc.org Words:  ['Florida', 'Wasserman', 'Alaska', 'We', 'Debbie', 'American', 'Interview CNN New Day', 'Alaska', 'Debbie Wasserman Schultz']
Email:  hrtsleeve@gmail.com To:  pought@dnc.org Words:  ['strongly', 'Here', 'important', 'alongside', 'point', 'INVALUABLE', 'Kate']
Email:  hrtsleeve@gmail.com To:  bonoskyg@dnc.org Words:  ['mins', 'time', 'freed', 'Comer', 'Approps', 'Dem

In [177]:
#exclude_words = [DataObject.EMAIL2NAME[e['email']][0], DataObject.EMAIL2NAME[e['correspondent']][0], e['email'], e['correspondent'],\
#                    ] + DataObject.EMAIL2NAME[e['email']][0].split(',') + DataObject.EMAIL2NAME[e['correspondent']][0].split(',')
    

Maybe we can look at the top spammers' keywords.

In [None]:
tab = DataObject.get_top_spammers(ntop=10)
top_s = [a[1] for a in tab]

graph = []

for email in top_s[0:10]:
    userA = user(DataObject,email)
    for key in userA.correspondents_count.keys():
        graph.append({'email': email, 'correspondent': key, 'topics': userA.correspondents_topics[key], 'keywords': userA.correspondents_keywords[key], 'count': userA.correspondents_count[key]})

for e in graph:
    exclude_words = []
    get_top_words(e, exclude = exclude_words)

Count 	 Email 	 	 	 Name
1780 	 mirandal@dnc.org 	 Miranda, Luis
1518 	 hendricksl@dnc.org 	 Hendricks, Lauren
1319 	 brinsterj@dnc.org 	 Brinster, Jeremy
1110 	 walkere@dnc.org 	 Walker, Eric
1104 	 dncpress@dnc.org 	 DNC Press
1098 	 sargem@dnc.org 	 Sarge, Matthew
1077 	 freundlichc@dnc.org 	 Freundlich, Christina
1032 	 comers@dnc.org 	 Comer, Scott
892 	 garciaw@dnc.org 	 Garcia, Walter
889 	 bhatnagara@dnc.org 	 Bhatnagar, Akshai
