# Tutorial 1

In this session, we look at the WikiLeaks DNC email dataset and learn how to gather basic statistics about it, preprocess the emails, and extract useful information.

https://en.wikipedia.org/wiki/2016_Democratic_National_Committee_email_leak
https://wikileaks.org/dnc-emails/

## DNC emails

Around 40 000 emails were leaked from the DNC (Democratic National Committee), involving around 1000 distinct users.

I provide an already pre-processed dataset in `.json` format, in which the emails have been lightly cleaned and put into a 'nice' structure. If you are interested in how the emails were crawled and how this file was generated, find me later or watch the repository: https://github.com/hanveiga/nlp-amld-2018


Before we start:

~~~
sudo python3.6 -m spacy download en
sudo python3.6 -m nltk.downloader all
cd data
curl "https://www.dropbox.com/s/k16jptjyccxfdkn/clean_json.json?dl=0" -L -o clean_json.json
~~~

## Loading JSON file

In the `data` folder you will find the JSON file downloaded above.

In [26]:
import pandas as pd

path_data = 'data/clean_json.json'

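def load_json_data(path_to_file):
    """Load the email dump into a DataFrame and normalise a few columns."""
    data_DF = pd.read_json(path_to_file, encoding='ascii')
    # Lower-case sender addresses so the same user is not counted twice.
    data_DF['from'] = data_DF['from'].str.lower()
    # Collapse runs of whitespace (newlines, tabs) in each body into single spaces.
    data_DF['body'] = data_DF['body'].apply(lambda x: " ".join(str(x).split()))
    return data_DF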

Load the dataset from the `data` folder:

In [27]:
data = load_json_data(path_data)

## Part 0:

Quick exploration of the dataset

How many users?   
Who sends the most emails?    
What are the most common words? 


In [35]:
from collections import Counter
from matplotlib import pyplot as plt
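
One possible way to answer these three questions is sketched below. It relies only on `Counter`, so it works on any iterable of strings, including the `data["from"]` and `data["body"]` columns loaded above. The function names and the toy `senders`/`bodies` lists are made up for illustration, and splitting on whitespace is a deliberately crude stand-in for a real tokenizer. The name `word_count_body` mirrors the one used later in this tutorial, but what it originally contained is an assumption.

```python
from collections import Counter

def count_senders(senders):
    """Count how many emails each address sent."""
    return Counter(senders)

def count_words(bodies):
    """Count lower-cased, whitespace-separated words over all email bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(str(body).lower().split())
    return counts

# Toy stand-ins for data["from"] and data["body"] (illustrative values only).
senders = ["a@example.org", "b@example.org", "a@example.org"]
bodies = ["Thanks everyone", "thanks for the update", "See the update"]

sender_counts = count_senders(senders)
word_count_body = count_words(bodies)

print("distinct users:", len(sender_counts))
print("top senders:", sender_counts.most_common(2))
print("top words:", word_count_body.most_common(3))
```

On the real dataset this would be called as `count_senders(data["from"])` and `count_words(data["body"])`; the `most_common` output could then be unpacked into labels and counts for a bar plot with `plt.bar`.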


## Part 1:

Exploring one of the people in the dataset.

For example, some people were particularly central to the controversy, such as:


Debbie Wasserman Schultz (email: hrtsleeve@gmail.com)  
Brad Marshall (email: marshall@dnc.org)  
Luis Miranda (email: mirandal@dnc.org) (he's just the top spammer :) )


In [36]:
email =  "hrtsleeve@gmail.com"
data[data["from"]==email][0:10]

|  | body | date | from | from_name | subject | to |
|---|---|---|---|---|---|---|
| 10211 | Good. Thanks everyone. | 2016-05-06 15:29:08 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Update | [["Miranda, Luis", MirandaL@dnc.org]] |
| 10518 | This is how he responds to Reid??? | 2016-05-17 19:11:55 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: FOR REVIEW: DNC Statement on Nevada Democr... | [["Miranda, Luis", MirandaL@dnc.org], [, "Banf... |
| 10560 | Please refer the reporter to Luis Miranda, the... | 2016-05-17 21:38:24 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Platform Committee Inquiry | [[Greg Rosenbaum, greg@palisadesassociates.com]] |
| 10816 | Damn liar. Particularly scummy that he barely ... | 2016-05-17 14:38:09 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Weaver on CNN re Nevada | [["Paustenbach, Mark", PaustenbachM@dnc.org], ... |
| 11281 | We need to discuss the point of disagreement a... | 2016-05-22 14:04:41 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Platform Rollout Plan | [[Tracie Pough, PoughT@dnc.org]] |
| 11493 | I am for the second one. What do others think? | 2016-05-17 17:32:38 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: FOR REVIEW: DNC Statement on Nevada Democr... | [["Paustenbach, Mark", PaustenbachM@dnc.org], ... |
| 11782 | No, I would not encourage them to do that. As ... | 2016-05-12 02:25:18 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Connecting you... | [[Erik Smith, erik@blueenginemedia.com], [, Lu... |
| 11946 | Good to go. No changes. | 2016-05-17 23:34:49 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: FOR REVIEW: DWS statement about KY and OR ... | [["Miranda, Luis", MirandaL@dnc.org]] |
| 12369 | Excellent! | 2016-04-29 23:50:56 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: The Hill - Sanders drops lawsuit against DNC | [["Miranda, Luis", MirandaL@dnc.org], [, "Dace... |
| 12652 | I'm at a black tie dinner. Will quickly review... | 2016-05-08 01:17:21 | hrtsleeve@gmail.com | Debbie Wasserman Schultz | Re: Final Medium Post | [[Leah Daughtry, ldaughtry@demconvention.com],... |


Which words does this person use most often in their emails?

In [37]:
from nltk.tokenize import word_tokenize
from collections import Counter
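
A minimal sketch of how this question could be answered is shown below. The helper names `simple_tokenize` and `top_words` are hypothetical. The default tokenizer is a crude regular-expression stand-in so that the example runs without any NLTK data; passing `tokenize=word_tokenize` would use the NLTK tokenizer imported above instead.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Crude stand-in tokenizer: runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text)

def top_words(bodies, tokenize=simple_tokenize, ntop=10):
    """Return the ntop most frequent tokens over an iterable of email bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(tokenize(str(body)))
    return counts.most_common(ntop)

# Toy bodies standing in for data[data["from"] == email]["body"].
example_bodies = ["Good. Thanks everyone.", "Good to go. No changes."]
print(top_words(example_bodies, ntop=3))
```

On the real data, a call such as `top_words(data[data["from"] == email]["body"], tokenize=word_tokenize)` should give the raw word frequencies for the selected sender.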


OK, this does not tell us much. How can we improve it? By removing stop words?

In [31]:
import string
from nltk.corpus import stopwords

string_list = list(string.punctuation)  # one entry per punctuation character

In [38]:
all_stopwords = stopwords.words('english') + string_list + ['Re','FWD', "''", "``",'...',"-"]
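
The stop-word list above mixes NLTK's lower-case entries with capitalised ones such as 'Re' and 'FWD', so a case-sensitive comparison would miss tokens like "The" or "re". The sketch below filters case-insensitively. The function name `remove_stopwords` and the tiny `demo_stopwords` list are hypothetical; in the tutorial, `all_stopwords` would be passed instead.

```python
from collections import Counter

def remove_stopwords(tokens, stopword_list):
    """Drop tokens whose lower-cased form appears in stopword_list."""
    blocked = {word.lower() for word in stopword_list}
    return [tok for tok in tokens if tok.lower() not in blocked]

# Hypothetical, tiny stop-word list; in the tutorial this would be all_stopwords.
demo_stopwords = ["the", "to", "Re", "FWD", ".", "..."]
tokens = ["Re", "The", "statement", "to", "the", "press", ".", "statement"]

filtered = remove_stopwords(tokens, demo_stopwords)
print(Counter(filtered).most_common(2))
```

Counting the filtered tokens instead of the raw ones should push content words to the top of the frequency list.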


Now let's try to get a feeling for what these people are talking about.

In this example, we will build a simple topic-mining model and use spaCy to pick up relevant entities.

In particular:

Aggregate the communication between two people  
Perform named entity extraction on the subset of emails

The output of this task is the set of pairs of people, together with the keywords/topics they talk about in their emails.

In [39]:
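import spacy
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Note: the 'en' shortcut only works with spaCy 2.x; spaCy 3 requires a full
# model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
# Entity labels we consider informative enough to keep.
relevant_entities = ['EVENT','FAC','GPE','LAW','NORP','ORG','PRODUCT', 'PERSON']

def get_keywords(sentence, ntop = 5):
    """Return the ntop most frequent entities per relevant label in `sentence`."""
    if len(sentence) > 100000:
        # Truncate very long texts before running the spaCy pipeline.
        sentence = sentence[0:100000]

    # keywords[label][entity text] -> number of occurrences
    keywords = defaultdict(Counter)
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ in relevant_entities:
            keywords[ent.label_][ent.text] += 1

    # Keep only the ntop most common entities for each label.
    most_common_keywords = defaultdict(list)
    for key in keywords.keys():
        most_common_keywords[key] = keywords[key].most_common(ntop)

    return most_common_keywords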
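
The aggregation step (collecting everything one person wrote to another) could be sketched as follows. This is only an illustration under stated assumptions: each record is a dict with 'from', 'to' and 'body' keys, and 'to' is a well-formed list of `[name, address]` pairs, as the printed table above suggests. Malformed recipient entries in the real data would need extra handling. The name `aggregate_by_pair` and the toy records are hypothetical.

```python
from collections import defaultdict

def aggregate_by_pair(records):
    """Concatenate email bodies per (sender, recipient) pair of addresses.

    Each record is assumed to be a dict with 'from', 'to' and 'body' keys,
    where 'to' is a list of [name, address] pairs.
    """
    texts = defaultdict(list)
    for rec in records:
        sender = rec["from"].lower()
        for _name, address in rec["to"]:
            # Lower-case the recipient so the same address is grouped together.
            texts[(sender, address.lower())].append(rec["body"])
    return {pair: " ".join(bodies) for pair, bodies in texts.items()}

# Toy records (illustrative, not taken from the dataset).
records = [
    {"from": "a@example.org", "to": [["B", "B@example.org"]], "body": "Call Reid."},
    {"from": "a@example.org",
     "to": [["B", "b@example.org"], ["C", "c@example.org"]],
     "body": "Nevada statement."},
]
pair_texts = aggregate_by_pair(records)
for pair, text in pair_texts.items():
    print(pair, "->", text)
```

With the DataFrame loaded earlier, `data.to_dict(orient="records")` yields records in this shape, and each aggregated text can then be passed to `get_keywords` to obtain the entities a given pair talks about.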



## Last task

Who has the most similar vocabulary?

We use scikit-learn for this!

In [18]:
# Remember before we generated the word counts and user emails?
# user_emails, word_count_body
vocabulary_body_set = word_count_body.keys()

In [41]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [42]:
from sklearn.metrics.pairwise import cosine_similarity
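
To make explicit what a cosine similarity between two users' vocabularies measures, here is a dependency-free sketch that works directly on `Counter` word counts. It is not the scikit-learn code itself: `CountVectorizer` followed by `cosine_similarity` performs roughly the same computation on sparse matrices, with its own tokenization rules. The names `cosine`, `most_similar_pair` and `user_texts` are hypothetical, and the texts are toy data.

```python
import math
from collections import Counter

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar_pair(user_texts):
    """Return (similarity, user_a, user_b) for the most similar pair of users."""
    counts = {user: Counter(text.lower().split()) for user, text in user_texts.items()}
    users = sorted(counts)
    best = None
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            score = cosine(counts[a], counts[b])
            if best is None or score > best[0]:
                best = (score, a, b)
    return best

# Toy texts; the keys and contents are illustrative only.
user_texts = {
    "a@example.org": "nevada statement press statement",
    "b@example.org": "press statement nevada",
    "c@example.org": "dinner tonight",
}
print(most_similar_pair(user_texts))
```

A score near 1 means two users draw on almost the same words in similar proportions, while 0 means they share no words at all. Comparing every pair is quadratic in the number of users, which is one reason to prefer the vectorized scikit-learn route on the full dataset.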
