<center> **TEXT AS DATA**



This week we will mine the famous **Enron Email Dataset**, which was released after the company's major scandal.

We will combine what we have learned so far about networks with new methods from **information extraction (IE)** and **natural language processing (NLP)**.



In [None]:
import networkx as nx
import pandas as pd
import nltk 
import spacy
import afinn
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
import re

In [None]:
# Uncomment the next three lines to download the data once and cache it locally
#enron_link ='https://query.data.world/s/boypyv5j5ey55s3mgevys7hbz3halo' 
#enron_df = pd.read_csv(enron_link)
#enron_df.to_csv('enron_data.csv',index=False) 
enron_df = pd.read_csv('enron_data.csv')
enron_df = enron_df[enron_df.content.apply(type)==str] # keep only rows whose content is a string

In [None]:
# Drop the category and weight annotation columns; the exercises only use the raw email fields
enron_categories = []
for cat in range(1, 13):
    for level in range(1, 3):
        category = 'Cat_%d_level_%d' % (cat, level)
        if category in enron_df.columns:
            enron_categories.append(category)
enron_df = enron_df[[col for col in enron_df.columns
                     if col not in enron_categories and 'weight' not in col]]

# Regex exercises

***Ex. 1.1***
* Use regex to mine any number mentioned in the emails.
    * Capture both number words (millions, billions, etc.) and digits.
    * Match currency amounts (`$`) and percentages (`%`).
* Rank the people talking most about numbers.
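
One possible starting point for Ex. 1.1 is sketched below. The pattern and the helper names (`NUMBER_RE`, `find_numbers`, `rank_senders`) are illustrative assumptions, not a reference solution, and the pattern is deliberately incomplete (it ignores spelled-out numbers such as "twenty", for example).

```python
import re
from collections import Counter

# Hypothetical pattern: digits with optional "$", thousands separators,
# decimals and a trailing unit, OR a bare magnitude word.
NUMBER_RE = re.compile(
    r"""
    \$?\d+(?:,\d{3})*(?:\.\d+)?                          # 42, 1,200, $3.5
    (?:\s?%|\s(?:percent|thousand|million|billion)\b)?   # optional unit
    |
    \b(?:hundreds?|thousands?|millions?|billions?)\b     # bare magnitude words
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_numbers(text):
    """Return every substring of `text` that the pattern treats as a number."""
    return [m.group(0) for m in NUMBER_RE.finditer(text)]


def rank_senders(emails):
    """Rank senders by number mentions; `emails` yields (sender, content) pairs."""
    counts = Counter()
    for sender, content in emails:
        counts[sender] += len(find_numbers(content))
    return counts.most_common()


example = "We lost $1,200.50 and 15% of 3 million; billions more at risk."
print(find_numbers(example))
# ['$1,200.50', '15%', '3 million', 'billions']
```

With the notebook's DataFrame, the pairs could come from something like `zip(enron_df['X-From'], enron_df['content'])`.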

***Ex. 1.2***
* Find names mentioned in emails, i.e. words starting with a capital letter that do not directly follow punctuation or a newline.
    * This is a really hard task, so don't worry that it won't be perfect.
    * **NOTE** When parsing both first names and surnames, note the difference between a "lookahead" for another name pattern and a simple consuming pattern.
        * i.e. "John Carl Johnson Jack Jackson" will become either ["John Carl", "Johnson Jack", "Jackson"] (consuming) or ["John Carl", "Carl Johnson", "Johnson Jack", "Jack Jackson"] (lookahead).
* Which names are mentioned the most, and which people talk most about who?
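
The difference between a consuming pattern and a lookahead can be seen on the example string from the note. This is a minimal sketch; the `NAME` pattern is a simplifying assumption (one capital letter followed by lowercase letters) and will miss many real names.

```python
import re

text = "John Carl Johnson Jack Jackson"
NAME = r"[A-Z][a-z]+"  # simplistic, hypothetical name-token pattern

# Consuming: each match swallows up to two tokens, so no token is reused.
consuming = re.findall(rf"{NAME}(?: {NAME})?", text)

# Lookahead: only the first token (and the space) is consumed; the second
# token is captured inside (?=...) and stays available for the next match.
overlapping = [
    f"{m.group(1)} {m.group(2)}"
    for m in re.finditer(rf"\b({NAME}) (?=({NAME}))", text)
]

print(consuming)    # ['John Carl', 'Johnson Jack', 'Jackson']
print(overlapping)  # ['John Carl', 'Carl Johnson', 'Johnson Jack', 'Jack Jackson']
```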


***Ex. 1.3***
* Look up how American phone numbers and ZIP codes are formatted, and design a regex to capture them.
    * Make sure you start broad in your search, so you won't miss the variations.
    * Use the ZIP codes in conjunction with the **geopy** module to geocode them and extract latitude and longitude.
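
A broad first attempt at the two patterns might look like the sketch below. Both patterns are assumptions about which variations occur in the emails, and the ZIP pattern will also match any other standalone five-digit number, so the results need manual inspection. Geocoding with geopy is left out here because it requires network access.

```python
import re

# Hypothetical pattern for common US phone layouts:
# (713) 853-1234, 713-853-1234, 713.853.1234, +1 713 853 1234
PHONE_RE = re.compile(
    r"""
    (?<!\d)                  # do not start in the middle of a longer number
    (?:\+?1[\s.-]?)?         # optional country code
    (?:\(\d{3}\)|\d{3})      # area code, with or without parentheses
    [\s.-]?
    \d{3}
    [\s.-]?
    \d{4}
    (?!\d)                   # do not stop in the middle of a longer number
    """,
    re.VERBOSE,
)

# Five-digit ZIP with optional +4 extension, not glued to other digits or hyphens.
ZIP_RE = re.compile(r"(?<![\d-])\d{5}(?:-\d{4})?(?![\d-])")

sample = "Call (713) 853-1234 or 713.853.5678. Houston, TX 77002-1234"
phones = [m.group(0) for m in PHONE_RE.finditer(sample)]
zips = ZIP_RE.findall(sample)
print(phones)  # ['(713) 853-1234', '713.853.5678']
print(zips)    # ['77002-1234']
```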


***Ex. 1.4***
* Extract all email addresses and links from the content column.
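
As a rough sketch for Ex. 1.4, the patterns below cover the most common address and link shapes. They are simplified assumptions (real email-address grammar is far more permissive), and the sample text is made up for illustration.

```python
import re

# Simplified, hypothetical patterns; good enough for exploration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def find_emails(text):
    """Return all email-like substrings."""
    return EMAIL_RE.findall(text)


def find_links(text):
    """Return all URL-like substrings, minus trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?)") for m in URL_RE.finditer(text)]


sample = "Mail jane.doe@enron.com or see http://www.example.com/news and www.example.org."
print(find_emails(sample))  # ['jane.doe@enron.com']
print(find_links(sample))   # ['http://www.example.com/news', 'www.example.org']
```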

***Ex. 1.5*** (extras)
* Extract dates and "in-the-wild" scheduling expressions, e.g. "on Friday at 20 o'clock".
* Extract emoticons.

# Exploratory analysis and data formatting

**Getting acquainted with the Enron dataset.**

***Ex. 2.1***

* Do a basic exploratory analysis of the dataset.
    * Plot basic distributions: 
        * e.g. how many different users
        * Activity over time (daily,weekly, monthly).
            * Of different users.
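
The activity counts can be sketched with pandas resampling. The toy frame below stands in for `enron_df`; the column name `Date` is an assumption about the dataset, and it is assumed to be parsed to datetimes first (e.g. with `pd.to_datetime`).

```python
import pandas as pd

# Toy stand-in for enron_df; "Date" is an assumed column name.
toy = pd.DataFrame({
    "X-From": ["Ann", "Ann", "Bob", "Ann"],
    "Date": pd.to_datetime(["2001-01-01", "2001-01-02", "2001-01-02", "2001-01-10"]),
})

n_users = toy["X-From"].nunique()                        # number of distinct senders

by_date = toy.set_index("Date")
daily = by_date.resample("D").size()                     # emails per day
weekly = by_date.resample("W").size()                    # emails per week (weeks end on Sunday)
monthly = toy.groupby(toy["Date"].dt.to_period("M")).size()  # emails per calendar month

# Weekly activity per sender
per_user_week = toy.groupby(["X-From", pd.Grouper(key="Date", freq="W")]).size()

print(weekly)
print(per_user_week)
```

Each resulting Series can be plotted directly, e.g. `weekly.plot()`.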


## Combine with network statistics

***Ex. 2.2***
* Construct a directed network from the columns ['X-From','X-To','X-cc','X-bcc'] (at least the first two).
    * Make sure you add the index of the email to the edge metadata for later integration.
    * Make sure you parse the To columns so they match the X-From.
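
One way to keep the email index on each edge is to store a list of indices as an edge attribute. The sketch below assumes the sender and recipient names have already been normalized to the same form; `build_email_graph` and the attribute name `emails` are hypothetical.

```python
import networkx as nx


def build_email_graph(records):
    """Build a directed graph from (email_index, sender, recipients) tuples.

    Every edge carries an "emails" list holding the indices of the emails
    sent along it, so edge statistics can be joined back to the DataFrame.
    """
    g = nx.DiGraph()
    for idx, sender, recipients in records:
        for recipient in recipients:
            if g.has_edge(sender, recipient):
                g[sender][recipient]["emails"].append(idx)
            else:
                g.add_edge(sender, recipient, emails=[idx])
    return g


records = [(0, "Ann", ["Bob", "Cid"]), (1, "Ann", ["Bob"]), (2, "Bob", ["Ann"])]
g = build_email_graph(records)
print(g["Ann"]["Bob"]["emails"])  # [0, 1]
```

The length of each `emails` list then doubles as a simple activity weight for the edge.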


***Ex. 2.3***
**Explore the network**
* How many different edges are present in the data over time?
* Plot a suitably sized subgraph using the k-core algorithm for extracting central components.
* Extract **central actors** according to a metric of your choice, and save these for later investigations.
* Make sets of **important edges**, according to three principles:
    * They are **highly active**.
    * They have **bridging** qualities (e.g. high edge betweenness centrality).
    * They are **clustered**: e.g. their endpoints have a high overlap of neighbors, or they are the strongest edges within a community.
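
The k-core, bridging and clustering ideas can be tried out on a toy graph first. In this sketch the graph is undirected and hand-made (two triangles joined by a bridge, plus one pendant node), and `neighbor_overlap` is a hypothetical helper computing the Jaccard overlap of the endpoints' neighborhoods.

```python
import networkx as nx

# Toy graph: triangles {1,2,3} and {4,5,6}, bridge 3-4, pendant node 7.
g = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (6, 7)])

# k-core: maximal subgraph in which every node has degree >= k.
core = nx.k_core(g, k=2)

# Bridging: the edge carrying the largest share of shortest paths.
ebc = nx.edge_betweenness_centrality(g)
bridge = max(ebc, key=ebc.get)


def neighbor_overlap(graph, u, v):
    """Jaccard overlap of the neighbors of u and v, ignoring u and v themselves."""
    nu, nv = set(graph[u]) - {v}, set(graph[v]) - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


print(sorted(core.nodes))         # pendant node 7 is pruned
print(bridge)                     # the edge joining the two triangles
print(neighbor_overlap(g, 1, 2))  # edge inside a triangle
print(neighbor_overlap(g, 3, 4))  # the bridge
```

For the directed email network, one option is to run these on `G.to_undirected()`, since neighbor overlap is defined here for undirected neighborhoods.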

## Characterizing actors and edges based on the content of the emails

***Ex. 2.4***

First we should get acquainted with one of the exploratory tools: the wordcloud.

**Wordclouds** Install the module [wordcloud](https://github.com/amueller/word_cloud) with `conda install -c conda-forge wordcloud`, then import it with
`from wordcloud import WordCloud`

Look in the documentation to see how to construct wordclouds.
* Make wordclouds of the top 5 people ranked by a network measure you like (e.g. in-degree):
    * Aggregate word statistics from:
        * The emails they write
        * The emails they receive, and
        * **Extra** the difference between the two.
            * **Note** Think about how to calculate a distance between word distributions. (Look up TF-IDF.)
    * **Note** This takes a lot of iterations of designing filters and stopwords. Start from the `normalize_tokens` function defined during today's lecture.

**Extra**
* Aggregate word statistics from sets of *important edges* and compare the differences.
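
For the "difference between the two" part, a very simple baseline is to subtract relative word frequencies. This is not TF-IDF, just a plain frequency difference, and the helper names are hypothetical; the token lists are assumed to be already normalized.

```python
from collections import Counter


def relative_freq(tokens):
    """Map each token to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def freq_difference(sent_tokens, received_tokens):
    """Positive values: over-represented in sent mail; negative: in received mail."""
    sent = relative_freq(sent_tokens)
    received = relative_freq(received_tokens)
    vocab = set(sent) | set(received)
    return {w: sent.get(w, 0.0) - received.get(w, 0.0) for w in vocab}


sent = ["gas", "gas", "deal", "meeting"]
received = ["meeting", "meeting", "lunch", "deal"]
diff = freq_difference(sent, received)
print(sorted(diff.items(), key=lambda kv: kv[1], reverse=True))
```

The positive part of such a dictionary can be handed to `WordCloud().generate_from_frequencies(...)` to draw a cloud of the words that characterize the sent mail.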

In [None]:
# Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
# The stemmers and the lemmatizer need to be initialized before being run
porter = nltk.stem.porter.PorterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer('english')
wordnet = nltk.stem.WordNetLemmatizer()

def normalize_tokens(tokens, lowercase=False, remove_non_alpha=False,
                     stop_words=False, stemmer=False, lemmer=False):
    """Apply the selected normalization steps to tokens and return a list."""
    # We can use generators here as we only need to iterate once
    workingIter = tokens
    # Remove non-words
    if remove_non_alpha:
        workingIter = (w for w in workingIter if w.isalpha())

    # Lowercase
    if lowercase:
        workingIter = (w.lower() for w in workingIter)

    # Apply the stemmer, if provided
    if stemmer:
        workingIter = (stemmer.stem(w) for w in workingIter)

    # Apply the lemmatizer, if provided
    if lemmer:
        workingIter = (lemmer.lemmatize(w) for w in workingIter)

    # Remove the stopwords. Note that this runs after stemming/lemmatizing,
    # so a stemmed stopword may no longer match the stop list.
    if stop_words:
        workingIter = (w for w in workingIter if w not in stop_words)

    # Materialize the generator chain into a list
    return list(workingIter)

# Information extraction

***Ex. 3.1***
**Sentiment analysis**
* Perform a sentiment analysis to characterize edges.
* Plot the sentiment score against a measure of edge importance as described above (betweenness, activity level, clustering).
    * **Extra:** calculate a relative sentiment score for each edge, in relation to the average sentiment of each node.

**Extra**
* See if the results differ when using different sentiment analysis methods. 
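
The relative score from the extra task can be sketched as the edge score minus a per-node baseline. The version below makes one specific assumption: the baseline is the mean over the sender's outgoing edges. Other baselines (recipient, or both endpoints) are just as defensible. The input scores are made up; in practice they might come from something like `afinn.Afinn().score(text)` averaged over the emails on each edge.

```python
from collections import defaultdict
from statistics import mean


def relative_edge_sentiment(edge_scores):
    """Subtract each sender's mean edge sentiment from the sender's edges.

    edge_scores maps (sender, recipient) -> sentiment score of that edge.
    """
    by_sender = defaultdict(list)
    for (sender, _recipient), score in edge_scores.items():
        by_sender[sender].append(score)
    baseline = {sender: mean(scores) for sender, scores in by_sender.items()}
    return {edge: score - baseline[edge[0]] for edge, score in edge_scores.items()}


# Hypothetical edge scores
scores = {("Ann", "Bob"): 2.0, ("Ann", "Cid"): -1.0, ("Bob", "Ann"): 0.5}
relative = relative_edge_sentiment(scores)
print(relative)
```

A sender with a single outgoing edge always gets a relative score of zero on that edge, which is worth keeping in mind when plotting.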


***Ex. 3.2***
**POS-tagging**
* Investigate the use of different word classes.
    * e.g. what comes after possessive pronouns (`PRP$`: e.g. 'my', 'our') and possessive endings (`POS`: e.g. "'s")
* Characterize edges by the use of adjectives and verbs
    * E.g. a ratio of verbs/adjectives.
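
Once tokens carry Penn Treebank tags (for instance from `nltk.pos_tag`), both sub-tasks reduce to simple passes over `(word, tag)` pairs. The helpers below are hypothetical, and the tagged sentence is written by hand for illustration rather than produced by a tagger.

```python
def verb_adjective_ratio(tagged_tokens):
    """Ratio of verb tags (VB*) to adjective tags (JJ*); NaN if no adjectives."""
    verbs = sum(1 for _, tag in tagged_tokens if tag.startswith("VB"))
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith("JJ"))
    return verbs / adjectives if adjectives else float("nan")


def words_after(tagged_tokens, target_tag="PRP$"):
    """Return the words that directly follow a token tagged `target_tag`."""
    return [
        next_word
        for (_, tag), (next_word, _) in zip(tagged_tokens, tagged_tokens[1:])
        if tag == target_tag
    ]


# Hand-tagged example sentence
tagged = [("my", "PRP$"), ("new", "JJ"), ("boss", "NN"), ("approved", "VBD"),
          ("our", "PRP$"), ("deal", "NN"), ("and", "CC"), ("left", "VBD")]
print(verb_adjective_ratio(tagged))  # 2.0
print(words_after(tagged))           # ['new', 'deal']
```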

**Extra**: Create a co-occurrence network between verbs and adjectives that appear in the same email or sentence (or context window). Locate clusters of verbs and adjectives.

***Ex. 3.3***
**NER-tagging**
* Extract entities from emails.
* What organizations are important in this company?

**Combine the sentiment analysis with the entities located.**
* Tokenize sentences, and attribute the sentiment of each sentence to the entities present in that sentence.
* Can you locate any malicious gossip?




## Helpers

In [68]:
def get_To_names(val): # A rough, hand-rolled implementation
    if type(val)!=str:
        return None
    names = []
    current_name = []

    opposite_name = False # flags names written as "Anderson, John" instead of "John Anderson"
    for name in val.split():
        # two names
        if ',' in name:
            if len(current_name)==0:
                opposite_name = True
                current_name.append(name.strip(','))
            else:
                if opposite_name:
                    names.append(' '.join(current_name))
                else:
                    names.append(' '.join(reversed(current_name)))
                current_name = []
                opposite_name = False
        elif name[0]!='<':
            current_name.append(name)
        else:
            names.append(' '.join(current_name))
            current_name = []
    if len(current_name)>0:
        names.append(' '.join(current_name))
    return [i for i in list(map(lambda x: x.strip('"\' '),names)) if len(i)!=0]
sample = enron_df.sample(10)
list(zip(sample['X-To'].apply(get_To_names),sample['X-To'])) # exercise: try writing a regex-based version

[(['Apollo Beth',
   'Beck Sally',
   'Becker Melissa',
   'Belden Tim',
   'Black Don',
   'Bradford William S.',
   'Bryan Jennifer',
   'Buy Rick',
   'Carrizales Blanca',
   'Causey Richard',
   'Colwell Wes',
   'Dayao Anthony',
   'Dietrich Janet',
   'Heathman Karen K.',
   'Hinojosa Esmeralda',
   'Holmes Sean',
   'Leff Dan',
   'Presto Kevin M.',
   'Stubblefield Wade',
   'Tijerina Shirley',
   'Wadsworth Sue'],
  'Apollo, Beth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Bapollo>, Beck, Sally </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbeck>, Becker, Melissa </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mbecker>, Belden, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tbelden>, Black, Don </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dblack>, Bradford, William S. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Wbradfo>, Bryan, Jennifer </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Notesaddr/cn=c4b2bc0b-5c606c03-86256849-57b86a>, Buy, Rick </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Rbuy>, Carrizales, Blanca </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Bcarriz>, Causey, Richard </O=