# Summarizing Emails using Machine Learning: Data Wrangling
## Table of Contents
1. Imports & Initialization <br>
2. Retrieve Preprocessed Data <br>
3. TextRank Modeling <br>
    A. Prepare Model <br>
    B. BC3 Corpus Summary <br>
    C. BC3 Corpus ROUGE Evaluation <br>
    D. Enron Dataset Summary <br>

This notebook uses the TextRank algorithm found at:
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

The algorithm generates summaries of a person's inbox over a given period of time. The BC3 Corpus contains human-written summaries, which are used to compute ROUGE metrics that measure how closely the generated summaries match them.

## Imports & Initialization

In [14]:
import pandas as pd
import numpy as np
import rouge
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

## Retrieve Preprocessed Data

Load the pickled dataframes produced by the Process_Emails notebook.

In [2]:
#Load Data
ENRON_PICKLE_LOC = "../data/dataframes/wrangled_enron_full_df.pkl"
BC3_EMAIL_PICKLE_LOC = "../data/dataframes/wrangled_BC3_email_df.pkl"
BC3_SUMMARY_PICKLE_LOC = "../data/dataframes/wrangled_BC3_summary_df.pkl"

enron_df = pd.read_pickle(ENRON_PICKLE_LOC)
BC3_emails_df = pd.read_pickle(BC3_EMAIL_PICKLE_LOC)
BC3_summary_df = pd.read_pickle(BC3_SUMMARY_PICKLE_LOC)

## TextRank Modeling

### Prepare Model

This ML algorithm uses GloVe word embeddings from the model pre-trained on Wikipedia 2014 + Gigaword 5. The commands below are one-time executions that download this model locally.

In [None]:
#nltk.download('punkt')
#nltk.download('stopwords')
#! wget http://nlp.stanford.edu/data/glove.6B.zip
#! unzip glove*.zip

In [3]:
def extract_word_vectors():
    #Return a dict mapping each word to its vector from the pretrained GloVe model.
    word_embeddings = {}
    #The context manager closes the file even if parsing fails.
    with open('glove.6B.300d.txt', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            word_embeddings[word] = coefs
    return word_embeddings
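
To make the file layout that `extract_word_vectors` relies on concrete, here is a minimal sketch that parses two made-up lines in the same text format (one token followed by space-separated coefficients). The sample vectors and the `parse_glove` helper are hypothetical and use plain Python lists rather than NumPy arrays.

```python
import io

# Hypothetical three-dimensional vectors in the GloVe text layout:
# one token per line, followed by its space-separated coefficients.
SAMPLE_GLOVE = "the 0.1 0.2 0.3\nemail -0.5 0.0 0.25\n"

def parse_glove(handle):
    """Map each token to its list of float coefficients."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = [float(x) for x in values[1:]]
    return embeddings

toy_embeddings = parse_glove(io.StringIO(SAMPLE_GLOVE))
print(toy_embeddings["email"])
```

The real file has 300 coefficients per line, which is why the rest of the notebook hard-codes a vector length of 300.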

In [4]:
#Extract word vectors. This only needs to be done once. 
word_embeddings = extract_word_vectors()

In [5]:
#Retrieve original email sentences and index them. This will be used to generate the extracted summaries. 
def get_extractive_sentences(df):
    #Flatten into [dataframe position, sentence] pairs to reassociate each summary sentence with its original email. 
    sentences = []
    sentences_list = df.Extractive_Sentences.tolist()
    for counter, sublist in enumerate(sentences_list):
        for item in sublist:
            sentences.append([counter, item])
    return sentences
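
As an illustration of the flattening step, the sketch below pairs each sentence with the position of the email it came from. The `emails` list is hypothetical and stands in for the `Extractive_Sentences` column.

```python
# Hypothetical per-email sentence lists, standing in for the
# Extractive_Sentences column.
emails = [
    ["Meeting moved to 3pm.", "Bring the report."],
    ["Please ignore."],
]

# Pair every sentence with the position of the email it came from.
indexed = [
    (email_pos, sentence)
    for email_pos, sentence_list in enumerate(emails)
    for sentence in sentence_list
]
print(indexed)
```

Keeping the email position next to each sentence is what later lets the summary display look up the date, subject, and sender with `.iloc`.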

In [6]:
#Sentences were tokenized during preprocessing, 
#so this function retrieves them from the dataframe, then flattens the list. 
def get_tokenized_sentences(df):
    clean_sentences = df.Tokenized_Body.tolist()
    #flatten list
    clean_sentences = [y for x in clean_sentences for y in x]
    return clean_sentences

In [7]:
#Create sentence_vectors for each tokenized sentence using the word_embeddings model. 
def create_sentence_vectors(clean_sentences, word_embeddings):
    sentence_vectors = []
    for i in clean_sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((300,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((300,))
        sentence_vectors.append(v)
    return sentence_vectors
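
The averaging in `create_sentence_vectors` can be sketched on a toy vocabulary. The two-dimensional embeddings and the `sentence_vector` helper below are hypothetical, and plain lists are used in place of NumPy arrays to keep the example dependency-free.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a whitespace-tokenized sentence."""
    words = sentence.split()
    total = [0.0] * dim
    for word in words:
        # Out-of-vocabulary words contribute a zero vector.
        vec = embeddings.get(word, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # The 0.001 term mirrors the notebook and avoids dividing by zero.
    return [t / (len(words) + 0.001) for t in total]

toy = {"good": [1.0, 0.0], "morning": [0.0, 1.0]}
vec = sentence_vector("good morning unknownword", toy, dim=2)
empty = sentence_vector("", toy, dim=2)
print(vec, empty)
```

Note that an unknown word still counts toward the denominator, so sentences with many out-of-vocabulary tokens end up with shorter vectors. Cosine similarity ignores vector length, so this should only matter when every word in a sentence is unknown.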

In [8]:
def rank_sentences(sentences, sentence_vectors):
    sim_mat = np.zeros([len(sentences), len(sentences)])
    #Fill the matrix with pairwise cosine similarity scores; the diagonal stays zero. 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,300), sentence_vectors[j].reshape(1,300))[0,0]
    #Build a weighted graph from the matrix and score each sentence with PageRank. 
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    #Pair each sentence with its PageRank score, then sort from highest to lowest. 
    ranked_sentences = sorted(((scores[i],s[0],s[1]) for i,s in enumerate(sentences)), reverse=True)
    return ranked_sentences
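
To show what the ranking step computes, here is a rough pure-Python sketch of cosine similarity followed by a PageRank power iteration on three hypothetical two-dimensional sentence vectors. It is not the NetworkX implementation: it assumes non-negative weights, uses a damping factor of 0.85 (the `nx.pagerank` default), and does not redistribute the score of nodes without edges.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def pagerank(weights, damping=0.85, iterations=100):
    """Power iteration on a symmetric, non-negative weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each node i passes on its score in proportion to the weight i -> j.
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n) if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical sentence vectors: the middle one is similar to both others.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
n = len(vectors)
sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
       for i in range(n)]
scores = pagerank(sim)
ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The sentence that is most similar to the rest of the collection receives the highest score, which is the property the summarizer uses to pick representative sentences.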

In [9]:
#Function to wrap up summarization process
def summarize_emails(word_embeddings, masked_df):
    print("Total number of emails to summarize: " + str(len(masked_df)))
    sentences = get_extractive_sentences(masked_df)
    clean_sentences = get_tokenized_sentences(masked_df)
    #Generate sentence vectors
    sentence_vectors = create_sentence_vectors(clean_sentences, word_embeddings)
    #Create a list of ranked sentences. 
    ranked_sentences = rank_sentences(sentences, sentence_vectors)
    #return enron_masked_df, ranked_sentences
    #display_summary(masked_df, ranked_sentences)
    return ranked_sentences

In [10]:
#Color scheme to help distinguish summarization text. 
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

def display_summary(enron_masked_df, ranked_sentences):
    #Number of summary sentences: one per ten emails, at least one, and never more than are available. 
    sn = min(len(ranked_sentences), max(1, len(enron_masked_df) // 10))

    #Generate summary
    for i in range(sn):
        #Pull date, subject, and sender from the original email
        email_pos = ranked_sentences[i][1]
        email_date = str(enron_masked_df['Date'].iloc[email_pos])
        email_subject = str(enron_masked_df['Subject'].iloc[email_pos])
        email_from = str(enron_masked_df['From'].iloc[email_pos])
        print(bcolors.BOLD + "Date: " + email_date +
              " Subject: " + email_subject +
              " From: " + email_from + bcolors.ENDC +
              "\nSummary: " + str(ranked_sentences[i][2]))

### BC3 Corpus Summary 

In [15]:
#First look into what a single email summary looks like. 
masked_df = BC3_emails_df[:1]
#Select the human summaries that match this email's list number and email number. 
summary_mask = (
    (BC3_summary_df['Listno'] == masked_df['Listno'].iloc[0])
    & (BC3_summary_df['Email_num'] == str(masked_df['Email_num'].iloc[0]))
)
masked_summaries = BC3_summary_df['Summary'].loc[summary_mask]
ranked_sentences = summarize_emails(word_embeddings, masked_df)

Total number of emails to summarize: 1


In [16]:
display_summary(masked_df, ranked_sentences)

In [20]:
ranked_sentences

<generator object rank_sentences.<locals>.<genexpr> at 0x7fb753638ed0>

### BC3 Corpus ROUGE Evaluation

ROUGE (computed here with py-rouge, https://pypi.org/project/py-rouge/) is an evaluation metric that scores machine-generated summaries against a human "gold standard". The following evaluates the same TextRank summarization method used on the Enron dataset against the BC3 Corpus, one of the few email datasets that contain human-written summaries.
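
For intuition about what a ROUGE-1 score measures, here is a minimal hand-rolled sketch of unigram precision, recall, and F1. The `rouge_1` helper and the two example strings are hypothetical. It skips the stemming and length limiting that the evaluator in this notebook is configured with, so its numbers should not be expected to match py-rouge output.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram precision, recall, and F1 on lowercased whitespace tokens."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word matches at most as often as it appears in both.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

result = rouge_1("the meeting moved to friday",
                 "meeting is moved to friday afternoon")
print(result)
```

Here four of the five hypothesis words appear in the reference (precision 0.8) and four of the six reference words are covered (recall about 0.67). Because an extractive summary copies whole sentences from the email, it can score well on recall while still including words the human summary left out.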

In [None]:
#There are three different human summaries for the same email. 
masked_summaries

In [None]:
evaluator = rouge.Rouge(metrics=['rouge-n'],
                           max_n=1,
                           limit_length=True,
                           length_limit=100,
                           length_limit_type='words',
                           alpha=0.5, # Default F1_score
                           weight_factor=1.2,
                           stemming=True)
full_body = masked_df['Body'].iloc[0]
hypothesis = ranked_sentences[0][2]
reference = masked_summaries.iloc[0]

#scores = evaluator.get_scores(hypothesis, reference)
print("Full Email: " + full_body + '\n')
print("ML Summary: " + hypothesis + "\n") 
print("Human Summary: " + reference + "\n") 

The following shows the R-1 scores. Current benchmarks for text summarization can be found at: https://summari.es/

In [None]:
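#Scoring the reference against itself gives the best attainable score for comparison. 
perfect_score = evaluator.get_scores(reference, reference)
scores = evaluator.get_scores(hypothesis, reference)
print(perfect_score)
print(scores)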

### Enron Dataset Summary

In [21]:
#Outputs a subset of the given dataframe masked by the person and a timeframe. 
def subset_emails(df, start_date, end_date, person):
    #Build the mask from the dataframe passed in, not the global enron_df. 
    summarization_mask = (df['Date'] >= start_date) & (df['Date'] <= end_date) & (df['Employee'] == person)
    enron_masked_df = df.loc[summarization_mask]
    return enron_masked_df
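
The inclusive date window used by `subset_emails` can be sketched without pandas on a few hypothetical records. The `records` list below is made up for illustration, and the sketch assumes the `Date` values are datetimes.

```python
from datetime import datetime

# Hypothetical records standing in for rows of the Enron dataframe.
records = [
    {"Employee": "skilling-j", "Date": datetime(2001, 9, 30, 23, 59, 59)},
    {"Employee": "skilling-j", "Date": datetime(2001, 10, 2, 21, 51, 28)},
    {"Employee": "arnold-j", "Date": datetime(2001, 10, 3, 8, 0, 0)},
    {"Employee": "skilling-j", "Date": datetime(2001, 10, 14, 23, 59, 59)},
]

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2001-10-01 00:00:00", FMT)
end = datetime.strptime("2001-10-14 23:59:59", FMT)

# Both bounds are inclusive, matching the >= and <= comparisons in the mask.
selected = [
    r for r in records
    if r["Employee"] == "skilling-j" and start <= r["Date"] <= end
]
print(len(selected))
```

Only the two `skilling-j` records inside the window are kept, including the one that falls exactly on the end bound.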

In [22]:
#Define emails to be summarized. 
start_date = '2001-10-01 00:00:00'
end_date = '2001-10-14 23:59:59'
person = 'skilling-j'
masked_df = subset_emails(enron_df, start_date, end_date, person)
ranked_sentences = summarize_emails(word_embeddings, masked_df)

Total number of emails to summarize: 72


In [24]:
display_summary(masked_df, list(ranked_sentences))

Date: 2001-10-02 21:51:28 Subject: Test From: dalak@hotmail.com
Summary: Please ignore.
Date: 2001-10-02 21:51:28 Subject: Test From: dalak@hotmail.com
Summary: Get your FREE download of MSN Explorer at http://explorer.msn.com/intl.asp
Date: 2001-10-01 19:45:15 Subject: The Morning Market Call - Monday October 1st, 2001. From: david.morris@lehman.com
Summary: Good Monday Morning Comments
Date: 2001-10-04 22:45:18 Subject: Solar Migration - Third Notice - Time Change!!!!! From: bob.ambrocik@enron.com
Summary: During the weekend of October 6 and 7, 2001 the Enterprise Storage Team will be migrating all production users off the current hardware (Solar) that houses their home and application directories (no production databases are affected, but client software will be) to new hardware.
Date: 2001-10-04 22:45:18 Subject: Solar Migration - Third Notice - Time Change!!!!! From: bob.ambrocik@enron.com
Summary: This migration requires a total system outa

In [None]:
#Example of a full email: the one containing the top-ranked sentence. 
masked_df['Body'].iloc[ranked_sentences[0][1]]

In [None]:
#Summarization from another inbox
start_date = '2001-10-01 00:00:00'
end_date = '2001-10-14 23:59:59'
person = 'arnold-j'
masked_df = subset_emails(enron_df, start_date, end_date, person)
ranked_sentences = summarize_emails(word_embeddings, masked_df)

In [None]:
#Examples of a full email. 
masked_df['Body'].iloc[ranked_sentences[0][1]]

In [None]:
#One more example
start_date = '2001-10-01 00:00:00'
end_date = '2001-10-14 23:59:59'
person = 'lenhart-m'
masked_df = subset_emails(enron_df, start_date, end_date, person)
ranked_sentences = summarize_emails(word_embeddings, masked_df)

In [None]:
#Examples of a full email. 
masked_df['Body'].iloc[ranked_sentences[0][1]]