# Actionable Email Item Detection

In [1]:
import numpy as np
import pandas as pd
import re

import nltk

In [2]:
# Dataset directory, set this before your experiments
dataset = '/home/syed.b/emails.csv'

In [3]:
# Read a random sample of roughly 1% of the rows (row 0, the header, is always kept)
# to get a feel for the data before loading everything
df_sample = pd.read_csv(dataset, skiprows=lambda x: np.random.rand() > 0.01 and x > 0)

In [4]:
df_sample.shape

(5328, 2)

In [5]:
df_sample['message'][0]

"Message-ID: <19909580.1075855688684.JavaMail.evans@thyme>\nDate: Fri, 1 Sep 2000 06:08:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: mike.grigsby@enron.com, frank.ermis@enron.com\nSubject: FYI\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Allen\nX-To: Mike Grigsby, Frank Ermis\nX-cc: \nX-bcc: \nX-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail\nX-Origin: Allen-P\nX-FileName: pallen.nsf\n\n---------------------- Forwarded by Phillip K Allen/HOU/ECT on 09/01/2000 \n01:07 PM ---------------------------\n   \n\tEnron North America Corp.\n\t\n\tFrom:  Matt Motley                           09/01/2000 08:53 AM\n\t\n\nTo: Phillip K Allen/HOU/ECT@ECT\ncc:  \nSubject: FYI\n\n\n--\n\n\n\n - Ray Niles on Price Caps.pdf\n\n\n"

In [6]:
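# Establish the cleaning pattern: keep only the body (everything after the
# X-FileName header line) and split it into stripped lines, skipping empty ones

[x.strip() for x in df_sample['message'][11].split('FileName:')[1].split('\n', 1)[1].strip('\n').split('\n') if x]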


['---------------------- Forwarded by Phillip K Allen/HOU/ECT on 02/12/2001',
 '12:18 PM ---------------------------',
 'To: Phillip K Allen/HOU/ECT@ECT',
 'cc:',
 'Subject: Re: AEC Volumes at OPAL']
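
The split on `'FileName:'` above works because `X-FileName` is the last header in these messages. As a hedged alternative, the sketch below uses Python's standard `email` parser, which separates headers from the body at the first blank line. It assumes plain-text, non-multipart messages like the sample shown earlier; `extract_body_lines` and the shortened `sample` message are hypothetical names introduced only for this illustration.

```python
from email import message_from_string


def extract_body_lines(raw_message: str) -> list[str]:
    """Return the stripped, non-empty lines of an email's body."""
    # The parser splits headers from the body at the first blank line,
    # so no particular header (such as X-FileName) has to come last.
    # Assumption: the message is not multipart, so get_payload() is a string.
    body = message_from_string(raw_message).get_payload()
    return [line.strip() for line in body.split("\n") if line.strip()]


# Hypothetical, shortened message in the same shape as the sample above.
sample = (
    "Message-ID: <1@example>\n"
    "Subject: FYI\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "Please review the attached file.\n"
    "\n"
    "   Thanks, Phillip\n"
)

body_lines = extract_body_lines(sample)
print(body_lines)  # ['Please review the attached file.', 'Thanks, Phillip']
```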

In [7]:
# Load the whole data

df = pd.read_csv(dataset)

In [8]:
df.shape

(517401, 2)

In [9]:
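# Store the sentences of each email body in this list.
# The body is everything after the X-FileName header line; its lines are stripped,
# re-joined with spaces, and split into sentences with nltk's sentence tokenizer.
# nltk.sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand.
emails = []

for message in df['message']:
    body = message.split('FileName:')[1].split('\n', 1)[1].strip('\n')
    lines = [x.strip() for x in body.split('\n') if x]
    emails.append(nltk.sent_tokenize(' '.join(lines)))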

In [10]:
#Sanity check
emails[1]

['Traveling to have a business meeting takes the fun out of the trip.',
 'Especially if you have to prepare a presentation.',
 'I would suggest holding the business plan meetings here then take a trip without any formal business meetings.',
 'I would even try and get some honest opinions on whether a trip is even desired or necessary.',
 'As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.',
 'Too often the presenter speaks and the others are quiet just waiting for their turn.',
 'The meetings might be better if held in a round table discussion format.',
 'My suggestion for where to go is Austin.',
 "Play golf and rent a ski boat and jet ski's.",
 'Flying somewhere takes too much time.']

## First Filter


Check for action words. We use a curated list of action words; any sentence from the emails that contains at least one of them is stored for further filtering.

The action words file is in the same directory as this Jupyter Notebook.
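
The check described above can be sketched as a small helper. This is an illustrative variation, not the notebook's own cell: it assumes tokens are runs of letters and apostrophes, so punctuation attached to a word (as in `send,`) does not block a match, whereas plain whitespace splitting would miss it. The name `find_action_words` and the three-word list are hypothetical stand-ins for the contents of `action_words.txt`.

```python
import re


def find_action_words(sentence: str, action_words: list[str]) -> list[str]:
    """Return the action words that occur as whole tokens in the sentence."""
    # Assumption: a token is a run of lowercase letters or apostrophes,
    # so trailing punctuation ("send," or "call.") does not block a match.
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return [word for word in action_words if word in tokens]


# Hypothetical stand-in for the contents of action_words.txt.
sample_action_words = ["send", "call", "review"]

matches = find_action_words(
    "Randy, can you send, or at least review, the schedule?",
    sample_action_words,
)
print(matches)  # ['send', 'review']
```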

In [11]:
# Read in action verbs
action_verbs = []

with open('./action_words.txt', 'r') as f:
    for line in f:
        action_verbs.append(line.strip().lower())
        
        
# Store each actionable item as a 3-tuple: (text, matched action words, number of matched action words)
action_texts = []

# Also store non-actionable items.
non_action_texts = []

# Due to memory/time constraints, we do not process all emails; a subset of 100,000 is enough to show results
for item in emails[:100000]:
    for text in item:
        # Tokenize once per sentence; a set makes each membership test cheap
        tokens = set(text.lower().split())
        temp_verb_list = [verb for verb in action_verbs if verb in tokens]

        if temp_verb_list:
            action_texts.append((text, temp_verb_list, len(temp_verb_list)))
        else:
            non_action_texts.append(text)

In [12]:
action_texts[:20]

[('I would suggest holding the business plan meetings here then take a trip without any formal business meetings.',
  ['plan'],
  1),
 ('As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.',
  ['think'],
  1),
 ('My suggestion for where to go is Austin.', ['go', 'suggestion'], 2),
 ("Play golf and rent a ski boat and jet ski's.", ['play'], 1),
 ('Randy, Can you send me a schedule of the salary and level of everyone in the scheduling group.',
  ['send'],
  1),
 ('Plus your thoughts on any changes that need to be made.', ['need'], 1),
 ('Please cc the following distribution list with updates: Phillip Allen (pallen@enron.com) Mike Grigsby (mike.grigsby@enron.com) Keith Holst (kholst@enron.com) Monique Sanchez Frank Ermis John Lavorato Thank you for your help Phillip Allen',
  ['help', 'please'],
  2),
 ("1. login:  pallen pw: ke9davis I don't think these are required by the

In [13]:
non_action_texts[20:40]

['Therein comes Sempra energy > gas trading, truly you.',
 '> Store number varies because of installation hurdles face at small percent.',
 '> Let me assure you, this is real deal!!',
 '> > Buck Buckner, P.E., MBA > Manager, Business Development and Planning > Big Box Retail Sales > Honeywell Power Systems, Inc. > 8725 Pan American Frwy > Albuquerque, NM 87113 > 505-798-6424 > 505-798-6050x > 505-220-4129 > 888/501-3145 >',
 'Mr. Buckner, For delivered gas behind San Diego, Enron Energy Services is the appropriate Enron entity.',
 'Her phone number is 713-853-7107.',
 'Phillip Allen',
 '1.',
 'Although the meeting with Keith, on Wednesday,  was informative the solution of creating a infinitely dynamic consolidated position screen, will be extremely difficult and time consuming.',
 'What needs to happen on Monday from 3 - 5 is a effort to design a desired layout for the consolidated position screen, this is critical.',
 'I have been involved in most of the meetings and the discussions h

## Hierarchical Filters

**NOTE**: This section applies when we want high precision, i.e. we may miss some action items from the original list (lower recall), but most of what we do retrieve will be genuine action items.

### First Level:

**Filter A**: Keep a sentence only if it contains more than one action word and is fewer than 30 words long.


### Second Level:

**Filter B**: Keep a sentence only if it contains an object pronoun or a subject pronoun. The pronoun lists are specified in the code below.


### Third Level:


**Filter C**: Discard sentences that contain a negation word.
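
As a rough illustration of how the three levels combine, the sketch below folds them into one predicate and runs it on a few made-up sentences. The function name `passes_filters`, the example sentences, and their action-word counts are hypothetical; the pronoun and negation lists mirror the ones used in this notebook.

```python
PRONOUNS = {"me", "her", "him", "us", "them", "i", "we", "you", "he", "she", "they"}
NEGATIONS = {"shouldn't", "couldn't", "wouldn't"}


def passes_filters(sentence: str, n_action_words: int, max_words: int = 30) -> bool:
    """Return True if the sentence survives Filters A, B and C."""
    tokens = sentence.lower().split()
    # Filter A: more than one action word and fewer than max_words words
    if n_action_words <= 1 or len(tokens) >= max_words:
        return False
    # Filter B: at least one object or subject pronoun
    if not PRONOUNS.intersection(tokens):
        return False
    # Filter C: no negation word
    return not NEGATIONS.intersection(tokens)


# Hypothetical sentences paired with an assumed number of action words.
examples = [
    ("Please send me the report and call John.", 3),   # passes all three
    ("Please send the report and call John.", 3),      # fails B: no pronoun
    ("You shouldn't send or forward this file.", 2),   # fails C: negation
    ("Can you call me?", 1),                           # fails A: one action word
]

results = [passes_filters(text, n) for text, n in examples]
print(results)  # [True, False, False, False]
```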




In [14]:
# Filters A, B, C:

object_pronouns = ['me', 'her', 'him', 'us', 'them']
subject_pronouns = ['i', 'we', 'you', 'he', 'she', 'they']
negation_words = ["shouldn't", "couldn't", "wouldn't"]

pronouns = set(object_pronouns + subject_pronouns)

# Filters A and B: more than one action word, fewer than 30 words, at least one pronoun
filtered_action_texts = []
for text, _, n_action_words in action_texts:
    tokens = text.lower().split()
    if n_action_words > 1 and len(tokens) < 30 and pronouns.intersection(tokens):
        filtered_action_texts.append(text)

# Filter C: drop sentences that contain a negation word
final_action_texts = []
for text in filtered_action_texts:
    if not set(negation_words).intersection(text.lower().split()):
        final_action_texts.append(text)

In [15]:
final_action_texts[:20]

["1. login:  pallen pw: ke9davis I don't think these are required by the ISP 2.  static IP address IP: 64.216.90.105 Sub: 255.255.255.248 gate: 64.216.90.110 DNS: 151.164.1.8 3.",
 "Follow these steps so you don't misplace these files.",
 'We really need a single point of contact to help drive the trader requirements and help come to a consensus regarding the requirements.',
 'We really need a single point of contact to help drive the trader requirements and help come to a consensus regarding the requirements.',
 'Can you please make sure he has an active password.',
 'This would give you a total loan of $992,000, total cost of $1,232,645 for equity required of $241,000.',
 'This would give you a total loan of $992,000, total cost of $1,232,645 for equity required of $241,000.',
 'Please get back to me as soon as your schedule permits regarding the site visit and feel free to call at any time.',
 'I will follow up with an email and phone call about Cherry Creek.',
 'Jeff, I need to see
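
The output above contains some sentences twice, presumably because the same text appears in more than one email in the corpus. If repeats are unwanted, one possible approach is an order-preserving de-duplication. This is a sketch only; `drop_duplicates_keep_order` is a hypothetical helper and is not part of the notebook.

```python
def drop_duplicates_keep_order(items):
    """Remove repeated items while keeping the first occurrence of each."""
    # dict keys keep insertion order (Python 3.7+), so the first copy survives
    return list(dict.fromkeys(items))


# Two sentences taken from the output above, one of them repeated.
sample_texts = [
    "Can you please make sure he has an active password.",
    "I will follow up with an email and phone call about Cherry Creek.",
    "Can you please make sure he has an active password.",
]

unique_texts = drop_duplicates_keep_order(sample_texts)
print(len(unique_texts))  # 2
```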

## Supervised Classification Task

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_union
from sklearn.metrics import confusion_matrix, classification_report

import random


In [17]:
# Read in the predefined actions from the actions.csv file

predefined_actions = pd.read_csv('./actions.csv', header=None)

In [18]:
# Store in list

predefined_actions_list = predefined_actions[0].values.tolist()

In [19]:
len(predefined_actions_list)

1250

In [20]:
# We need a balanced dataset for our classification task. Hence select 1250 non-action items 
# from our list

non_actions_list = non_action_texts[:1250]

In [21]:
len(non_actions_list)

1250

In [22]:
# Create the labeled numpy array

labels = np.array([ 1 if x < 1250 else 0 for x in range(2500) ])

In [23]:
labels

array([1, 1, 1, ..., 0, 0, 0])

In [24]:
X = np.array(predefined_actions_list + non_actions_list)

#random.shuffle(X)

In [25]:
# Create a new dataframe
data = pd.DataFrame({'item':X,'label':labels})

In [26]:
data = data.sample(frac=1).reset_index(drop=True)

In [27]:
data.head()

|   | item | label |
|---|------|-------|
| 0 | Thus, the additional equity for the improvemen... | 0 |
| 1 | Do you have Yahoo Messenger or Hear Me turned on? | 0 |
| 2 | Please print and save file for me. | 1 |
| 3 | Please review and forward to the appropriate E... | 1 |
| 4 | Call my little EES buddies to get better under... | 1 |


### Use the TF-IDF vectorizer to construct a feature vector for each text item, at both the word and character level

This effectively functions as our character and word n-gram feature builder for the original items.
We take n-gram ranges of 1-5 for words and 2-4 for characters, matching the `ngram_range` settings in the code below.
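
To make the idea of word and character n-grams concrete, here is a small pure-Python sketch of the kind of units such a feature builder extracts. It is only an illustration: `word_ngrams` and `char_ngrams` are hypothetical helpers, they skip the TF-IDF weighting and other preprocessing that `TfidfVectorizer` performs, and the ranges are deliberately smaller than the notebook's so the output stays short.

```python
def word_ngrams(text, min_n, max_n):
    """All contiguous word sequences of length min_n..max_n."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, min_n, max_n):
    """All contiguous character sequences of length min_n..max_n."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


words = word_ngrams("Please call me", 1, 2)
chars = char_ngrams("call", 2, 4)
print(words)  # ['please', 'call', 'me', 'please call', 'call me']
print(chars)  # ['ca', 'al', 'll', 'cal', 'all', 'call']
```

Character n-grams let the model pick up on sub-word cues even when a whole word was never seen in training, which is one plausible reason for combining both levels.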

In [59]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 5),
    max_features=30000)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(2, 4),
    max_features=30000)
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=2)

In [60]:

all_text = data['item']

In [61]:
type(all_text)

pandas.core.series.Series

In [62]:
# Fit the vectorizer on all texts
vectorizer.fit(all_text)


FeatureUnion(n_jobs=2,
             transformer_list=[('tfidfvectorizer-1',
                                TfidfVectorizer(analyzer='word', binary=False,
                                                decode_error='strict',
                                                dtype=<class 'numpy.float64'>,
                                                encoding='utf-8',
                                                input='content', lowercase=True,
                                                max_df=1.0, max_features=30000,
                                                min_df=1, ngram_range=(1, 5),
                                                norm='l2', preprocessor=None,
                                                smooth_idf=True,
                                                stop_words=None,
                                                strip_accents='unicode',
                                                subl...
                                                dtype

In [63]:
# Train-test split: 75%/25% (train_test_split's default test size is 0.25)

train_text, test_text, train_labels, test_labels = train_test_split(data['item'], data['label'])

In [64]:
train_features = vectorizer.transform(train_text)
test_features = vectorizer.transform(test_text)
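
Note that the vectorizer in this notebook is fitted on all texts before the split, so its vocabulary and IDF statistics also reflect the test items. One common alternative, sketched here under stated assumptions, wraps the vectorizer and the classifier in a single pipeline that is fitted on the training split only. The toy sentences, the smaller n-gram ranges, and the names `toy_texts`, `toy_labels`, and `model` are hypothetical and chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union

# Hypothetical toy data: 1 = action item, 0 = not an action item.
toy_texts = [
    "Please send me the report.",
    "Call John about the contract.",
    "Please review the attached file.",
    "Forward this to the team.",
    "Her phone number is on file.",
    "The meeting was informative.",
    "Flying somewhere takes too much time.",
    "Phillip Allen",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_text, test_text, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# The pipeline fits the TF-IDF vocabulary on the training texts only,
# then applies the same fitted transformation to the test texts.
model = make_pipeline(
    make_union(
        TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    ),
    LogisticRegression(),
)
model.fit(train_text, train_labels)
toy_predictions = model.predict(test_text)
print(len(toy_predictions))  # 2
```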

In [72]:
# use a simple logistic regression classifier for 2-class classification

classifier = LogisticRegression()

In [66]:
classifier.fit(train_features, train_labels)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [67]:
#Accuracy
classifier.score(test_features, test_labels)

0.9504

In [68]:
predictions = classifier.predict(test_features)

In [69]:
# Get the confusion matrix
confusion_matrix(np.array(test_labels.tolist()),predictions)

array([[296,  11],
       [ 20, 298]])

In [70]:
# Get the Precision, Recall, F-1 scores for the test sets

print(classification_report(np.array(test_labels.tolist()),predictions))

              precision    recall  f1-score   support

           0       0.94      0.96      0.95       307
           1       0.96      0.94      0.95       318

    accuracy                           0.95       625
   macro avg       0.95      0.95      0.95       625
weighted avg       0.95      0.95      0.95       625
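
The figures in the report can be re-derived from the confusion matrix shown earlier, which makes it easier to see what each number means. The sketch below uses scikit-learn's convention that rows are true labels and columns are predicted labels; `precision_recall` is a helper introduced only for this illustration.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
cm = [[296, 11],
      [20, 298]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision_recall(cm, k):
    """Precision and recall for class k of a square confusion matrix."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    return true_positives / predicted_k, true_positives / actual_k


p0, r0 = precision_recall(cm, 0)
p1, r1 = precision_recall(cm, 1)
print(round(accuracy, 4))          # 0.9504
print(round(p0, 2), round(r0, 2))  # 0.94 0.96
print(round(p1, 2), round(r1, 2))  # 0.96 0.94
```

For the action class (label 1), 11 non-action items were wrongly flagged and 20 action items were missed.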



## Closing Comments

We have seen how a filter-based mechanism with a hierarchical schema (Objective 1) enables unsupervised text classification, weeding out non-actionable items from our corpus at the data pipeline stage.

Finally, once some labeled data is available, even a simple classifier such as Logistic Regression can harness character and word n-grams to build a reasonable solution to the actionable-item identification problem: about 95% accuracy on the 625-item held-out test set.

A few challenges remain: scaling this to an imbalanced real-world setting and assessing its performance against a larger labeled dataset. The core idea, though, is promising!