# Email body content analysis 

*For context:* Current analyses in bigbang focus on headers. Many of them examine the email headers to identify the people and organizations involved in the discussions. Only a few analyze content, and those focus on searching for the first occurrence of keywords and/or finding the most-used words per user.

This notebook analyzes email body contents with Huggingface Named Entity Recognition (NER) models, which systematically label the entities in the email bodies together with their types (PER, ORG, LOC, and MISC are currently supported). This can help researchers better understand the email conversations.

In [1]:
# import necessary packages
from bigbang.archive import Archive
from bigbang.archive import load as load_archive
import pandas as pd

# hide warnings
import warnings
warnings.filterwarnings('ignore')

First, use the script ```bin/collect_mail.py``` to collect web archives. For details, see https://bigbang-py.readthedocs.io/en/latest/data-sources.html#id1 .

<!-- Here, we use an example of the [scipy-dev](https://mail.python.org/pipermail/scipy-dev/) mailing list page.

Scipy-dev mailing list contains 149,718 emails From June 2001 - September 2021. -->


In [25]:
mailing_list = "3gv6"
archive_path = "../../archives/{}/".format(mailing_list)

archive = Archive(archive_path,mbox=True)
# archive data in pandas dataframe format
archive_data = archive.data

In [37]:
# inspect data
print(len(archive_data))
archive_data.head(2)

398


| Message-ID | From | Subject | Date | In-Reply-To | References | Body |
|---|---|---|---|---|---|---|
| `<BF345F63074F8040B58C00A186FCA57F1C65FB27A8@NALASEXMB04.na.qualcomm.com>` | `"Laganier, Julien" <julienl@qualcomm.com>` | Re: [3gv6] was draft report - now Ipv6 trans... | 2009-11-20 19:01:16+00:00 | `<C72C35BD.314D0%basavaraj.patil@nokia.com>` | `<4B0504D8.1070304@piuha.net> <C72C35BD.314D0%b...` | `Basavaraj Patil wrote:=20\n>=20\n> One other p...` |
| `<C72C5914.314E1%basavaraj.patil@nokia.com>` | `<Basavaraj.Patil@nokia.com>` | [3gv6] Is this the Shanghai followup ML? | 2009-11-20 20:32:40+00:00 | | | `Hello,\n\nIs this ML setup for continuing disc...` |


In [4]:
# example of one email body
list(archive_data['Body'].iloc[[1]])

['Hello,\n\nIs this ML setup for continuing discussions pertaining to IPv6 transition i=\nn\n3GPP networks (followup to the meeting in Shanghai)?\n\n-Raj']

In [5]:
# uncomment the lines below to install transformers with pytorch
# (and the other required packages) inside your current python environment

# !pip install transformers[torch]
# !pip install contractions
# !pip install email_reply_parser

In pre-processing, we want to:
- remove punctuation
- remove links
- expand contractions
- remove digits
- tokenize the words
- [Optional] lowercase the words
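
The steps listed above can be sketched with the standard library alone. This is only an illustration and not the implementation behind `recognizer.pre_processing`: the function name `preprocess_sketch`, the tiny `CONTRACTIONS` table, and the regular expressions are all assumptions made for this example. The steps are applied in a slightly different order than listed, so that links and contractions are handled before punctuation is stripped.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (illustration only).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "I am"}

URL_PATTERN = re.compile(r"(https?://|www\.)\S+")
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def preprocess_sketch(text, lowercase=False):
    """Apply the listed pre-processing steps and return a list of tokens."""
    # remove links first, while "://" and "." are still present
    text = URL_PATTERN.sub(" ", text)
    # expand contractions before the apostrophes are stripped
    text = CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    # remove punctuation and digits in one pass
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # optionally lowercase
    if lowercase:
        text = text.lower()
    # tokenize on whitespace
    return text.split()
```

For example, `preprocess_sketch("I don't use IPv6, see https://example.org/a?b=1 now!")` returns `['I', 'do', 'not', 'use', 'IPv', 'see', 'now']`. Note that removing digits turns `IPv6` into `IPv`, which is worth keeping in mind when reading the extracted entities.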

In [6]:
# import functions for analyzing
from bigbang.analysis.entity_recognition import EntityRecognizer, SpanVisualizer

The list of available models can be found at https://huggingface.co/ . You can also train your own model and upload it to Huggingface.

Examples for possible model names include:
['dslim/bert-base-NER', 'dslim/bert-base-NER-cased', ...]

In [7]:
# the model is loaded and inference is run in the back-end of the bigbang package
# you can pass the name of any model you are interested in to the constructor
model_name = "EffyLi/bert-base-NER-finetuned-ner-cerec"

recognizer = EntityRecognizer(model_name)

In [27]:
# parameters
# taking one row as an example
index = 6
lowercase = False

body = archive_data['Body'].iloc[index]
print("Text body before pre-processing--------------------:\n")
print(body)
body = recognizer.pre_processing(body, lowercase=lowercase)
print("Text body after pre-processing--------------------:\n")
print(body)

Text body before pre-processing--------------------:

Raj, 

Excuse me for late reply, Inline.
> Hui,
> On 11/19/09 12:53 AM, "Hui Deng" <denghui@chinamobile.com> wrote:
> > Hi, Sri,
> >
> > Just a short cut in here.
> > DS-Lite =  Encapsulation +NAT44.
> > PNAT (464) = Header change +NAT44,
> > Principally, there is no big difference, but PNAT supports more other
> > scenarios at the same time.
> 
> The comparison sounds overly simplistic. PNAT (464) requires a shim in the
> host and state. And the scenarios themselves need to be really reviewed in
> terms of whether they can only be solved by one approach. I am yet to see
a
> scenario that can only be solved by one PNAT and not by DS-Lite. Maybe the
> DS-Lite approach requires you to go through a GW and a NAT in order to
> communicate with another host on the same network. But that in itself is
not
> an issue (IMO).
Are u saying 4-6 or 4-4?
even 4-4, they have overlapped address issue?

-Hui
> 
> Cheers,
> -Raj
> >
> > -Hui
> >
Text 

In [28]:
tokens = recognizer.tokenizer.tokenize(body)
labels = recognizer.recognize(body)
entities = recognizer.get_entities(labels)
for entity in entities:
    print(entity)

{'entity': 'B-PER', 'word': 'Raj'}
{'entity': 'B-PER', 'word': 'me'}
{'entity': 'B-PER', 'word': 'you'}
{'entity': 'B-PER', 'word': 'they'}
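
The `B-` prefix in the output above comes from the BIO tagging scheme: `B-X` marks the first word of an entity of type `X`, `I-X` marks a continuation of it, and `O` marks words outside any entity. The sketch below shows how such tags can be merged into whole entities. It is not the code behind `recognizer.get_entities`; the function name `group_bio` and the input format (a list of `(word, tag)` pairs) are assumptions for illustration, and the tagged sentence in the usage note is hand-written.

```python
def group_bio(tagged_words):
    """Merge (word, tag) pairs tagged in the BIO scheme into (text, type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_words:
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and ent_type == current_type:
            # continuation of the entity that is currently open
            current_words.append(word)
            continue
        # any other tag closes the open entity, if there is one
        if current_words:
            entities.append((" ".join(current_words), current_type))
        if prefix in ("B", "I"):
            # "B-" starts a new entity; a stray "I-" is treated as a start too
            current_words, current_type = [word], ent_type
        else:
            # "O": outside any entity
            current_words, current_type = [], None
    # close an entity that runs to the end of the input
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities
```

With the hand-tagged input `[("Dan", "B-PER"), ("Wing", "I-PER"), ("met", "O"), ("IETF", "B-ORG")]`, the function returns `[('Dan Wing', 'PER'), ('IETF', 'ORG')]`.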


*[Optional]* We can also visualize the results with spaCy.

In [29]:
# uncomment the following lines to install the spacy package and its English model
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [30]:
import spacy
from spacy import displacy
from spacy.tokens import Span, Doc

# # define a score threshold for the recognized entities; only entities that score above the threshold are shown
# threshold = 0.0
find_all_caps = True

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab

visualizer = SpanVisualizer()
merged_tokens = visualizer.merge_tokens(tokens)
doc = Doc(vocab=vocab, words=merged_tokens)
doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)


displacy.render(doc, style='ent', jupyter=True)

In [31]:
visualizer.get_list_per_type()
entity_type_list = visualizer.entity_type
for typ, ent_list in entity_type_list.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)

entity type:  PER
	- Raj
	- me
	- you
	- they


## Processing the whole mailing list

Finally, we show an example of how to pass a list of emails and get back a list of entities with their types. We save them in a csv file for further processing.

In [32]:
stop_words = ['i', 'you', 'me', 'my', 'mine', 'myself', 'your', 'yours',
              'yourself', 'we', 'us', 'our', 'ours', 'ourselves', 'yourselves',
              'he', 'him', 'himself', 'his', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
              'they', 'them', 'their', 'theirs', 'themself', 'themselves', 'this', 'that', 'something',
              'these', 'those', 'someone', 'somebody', 'who', 'whom', 'whose', 'which', 'what']

In [24]:
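from collections import defaultdict

# spacy is needed here even if the optional visualization cells above were skipped
import spacy
from spacy.tokens import Doc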

# taking a list of indexes as examples
num_data = len(archive_data)
# num_data = 50
print("Process {} emails in total".format(num_data))
indexes = list(range(0, num_data))
lowercase = False
find_all_caps = True

model_name = "EffyLi/bert-base-NER-finetuned-ner-cerec"
# model_name = "dslim/bert-base-NER"
recognizer = EntityRecognizer(model_name)

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab
save_file_path = "../../archives/"
save_file_name = archive_path.split("/")[-2] + '-entities.csv'
columns_names = ['email_id', 'entity', 'type']
df = pd.DataFrame(columns=columns_names)

email_entity_types = defaultdict(list)

# print('Process emails with id: ', indexes)
for index in indexes:
    if index % 200 == 0:
        print("{} emails processed, {} emails left.".format(index, (num_data-index)))
    body = archive_data['Body'].iloc[index]
    body = recognizer.pre_processing(body, lowercase=lowercase)
#     show email bodies after pre-processing
#     print(body)

    visualizer = SpanVisualizer()
    # get labels from recognizer first
    tokens = recognizer.tokenizer.tokenize(body)  
    labels = recognizer.recognize(body)
    # merge tokens and spans in visualizer
    merged_tokens = visualizer.merge_tokens(tokens)
    doc = Doc(vocab=vocab, words=merged_tokens)
    doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)
    visualizer.get_list_per_type()
    entity_type = visualizer.entity_type
    for k, v in entity_type.items():
        email_entity_types[k].extend(v)
        for v_i in v:
            # remove pronouns
            if v_i.lower() not in stop_words:
                # DataFrame.append was removed in pandas 2.0, so add the row by label
                # (values follow the column order: email_id, entity, type)
                df.loc[len(df)] = [index, v_i, k]
df.to_csv(save_file_name)
print("Extracted entities saved!")

# for typ, ent_list in email_entity_types.items():
#     print("entity type: ", typ)
#     for ent in ent_list:
#         print("\t-", ent)  

Process 398 emails in total
0 emails processed, 398 emails left.


Token indices sequence length is longer than the specified maximum sequence length for this model (4022 > 512). Running this sequence through the model will result in indexing errors


200 emails processed, 198 emails left.
Extracted entities saved!
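
The warning in the output above says that at least one email tokenizes to 4022 tokens, well beyond the model's 512-token limit. One common workaround is to split the token sequence into overlapping windows and run the model on each window separately. The sketch below illustrates only the splitting step; the function name `split_into_windows` and the defaults `window_size=510` and `overlap=50` are assumptions for this example (510 leaves room for the two special tokens that BERT-style models add), and whether bigbang handles long inputs this way is not shown in this notebook.

```python
def split_into_windows(tokens, window_size=510, overlap=50):
    """Split a token list into windows of at most `window_size` tokens.

    Consecutive windows share `overlap` tokens, so that a short entity cut
    at a window boundary still appears whole in one of the two windows.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows
```

With these defaults, a 4022-token email is split into 9 windows. Entities found in the overlapping regions are predicted twice and would need to be de-duplicated when the per-window results are merged.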


## Run the cell below only to display the previously extracted entities

In [26]:
# load pre-processed csv file to dataframe and display
file_path = 'extracted_entities/3gv6-entities.csv'
df = pd.read_csv(file_path)

# get top 10 frequent entities for each category
categories = list(set(recognizer.model.config.id2label.values()))
categories = list(set([c.split('-')[-1] if '-' in c else c for c in categories ]))

for c in categories:
    if c != "O":
        if c =="PER":
            print("Top 10 occurence (pronouns excluded) for type: ", c)
        else:
            print("Top 10 occurence for type: ", c)
        df_c = df.loc[df['type'] == c]
        display_df = df_c['entity'].value_counts().rename_axis('entity').reset_index(name='counts')
        display(display_df.head(10))
        print()

Top 10 occurence for type:  LOC


| | entity | counts |
|---|---|---|
| 0 | San Francisco | 9 |
| 1 | USA | 3 |
| 2 | Shanghai | 2 |
| 3 | China | 2 |
| 4 | Anaheim | 1 |
| 5 | Tower Hui Hui Deng denghuigmailcom | 1 |
| 6 | Vista level | 1 |
| 7 | Vista Room at the Hilton San Francisco The Vis... | 1 |
| 8 | Vista level of Tower | 1 |
| 9 | the Vista Room at the Hilton San Francisco The... | 1 |



Top 10 occurence (pronouns excluded) for type:  PER


| | entity | counts |
|---|---|---|
| 0 | Teemu | 20 |
| 1 | Cameron | 15 |
| 2 | Jari | 11 |
| 3 | Dan | 10 |
| 4 | Jouni | 9 |
| 5 | Cameron Byrne | 9 |
| 6 | David Crowe | 9 |
| 7 | Brian | 8 |
| 8 | Julien | 7 |
| 9 | Dan Wing | 6 |



Top 10 occurence for type:  MISC


| | entity | counts |
|---|---|---|
| 0 | Internet | 4 |
| 1 | Windows | 2 |
| 2 | RFC | 1 |
| 3 | Internet Protocol | 1 |
| 4 | MacOS | 1 |
| 5 | Windows OS | 1 |
| 6 | IGI | 1 |



Top 10 occurence for type:  DIG


| | entity | counts |
|---|---|---|
| 0 | IPv | 3 |
| 1 | DHCPv | 1 |
| 2 | teemusavolainennokiacom | 1 |
| 3 | PGW | 1 |
| 4 | IHdpdGggREhDUFYIHNlcnZlciwgdGhlbiBaGVzZSBdgREh... | 1 |
| 5 | ba sis | 1 |
| 6 | withIETFDocs | 1 |
| 7 | listA | 1 |
| 8 | STUNTURN | 1 |
| 9 | PNAT | 1 |



Top 10 occurence for type:  ORG


| | entity | counts |
|---|---|---|
| 0 | UE | 34 |
| 1 | IETF | 24 |
| 2 | GPP | 20 |
| 3 | UEs | 17 |
| 4 | IPvonly | 16 |
| 5 | DS | 9 |
| 6 | PDN | 8 |
| 7 | RFC | 8 |
| 8 | IMHO | 7 |
| 9 | GPP EPC | 7 |



