# Email body content analysis 

*For context:* Current analysis in BigBang focuses on email headers. Many analyses use the headers to study the people and organizations involved in the discussions. The few existing content analyses search for the first occurrence of keywords and/or the most-used words per user.

This notebook analyzes email bodies with Hugging Face Named Entity Recognition (NER) models, which systematically label the entities in the text and their types (currently PER, ORG, LOC, and MISC). This can help researchers better understand the email conversations.

In [1]:
# import necessary packages
from bigbang.archive import Archive
from bigbang.archive import load as load_archive

# hide warnings
import warnings
warnings.filterwarnings('ignore')

First, use the script ```bin/collect_mail.py``` to collect the mailing list archives. See https://bigbang-py.readthedocs.io/en/latest/data-sources.html#id1 for details.

Here, we use the [scipy-dev](https://mail.python.org/pipermail/scipy-dev/) mailing list as an example.

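The scipy-dev archive used here covers June 2001 to September 2021. The 149,718 figure reported below is the DataFrame's cell count (rows times columns); with six columns, that corresponds to 24,953 emails.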

In [2]:
archive_path = "../../archives/scipy-dev/"

archive = Archive(archive_path, mbox=True)
# archive data in pandas dataframe format
archive_data = archive.data

In [3]:
# inspect the data; .size counts cells (rows x columns), so 149718 = 24953 emails x 6 columns
print(archive_data.size)
archive_data.head(2)

149718


| Message-ID | From | Subject | Date | In-Reply-To | References | Body |
|---|---|---|---|---|---|---|
| `<NEBBIECAMLMAAKHEGPCGKEBHCLAA.travis@vaught.net>` | travis at vaught.net (Travis N. Vaught) | [SciPy-dev] SciPy Developer mailing list now o... | 2001-06-11 02:10:51+00:00 | | | The link:\n\nhttp://scipy.net/mailman/listinfo... |
| `<Pine.LNX.4.33.0107231957590.15960-100000@oliphant.ee.byu.edu>` | oliphant at ee.byu.edu (Travis Oliphant) | [SciPy-dev] RPMs and source distribution | 2001-07-24 02:01:00+00:00 | `<02f001c111bf$2e78a9d0$777ba8c0@190xb01>` | | I've been playing for hours and finally have i... |


In [4]:
# example of one email body
list(archive_data['Body'].iloc[[1]])

["I've been playing for hours and finally have it so that\n\npython setup.py sdist\npython setup.by bdist_rpm\n\nwork as expected.\n\nI have distributions and RPM's that I need to put somewhere.\n\nThanks,\n\n-Travis"]

In [5]:
# uncomment the lines below to install transformers (with PyTorch)
# and the other dependencies inside your current Python environment

# !pip install transformers[torch]
# !pip install contractions
# !pip install email_reply_parser

In pre-processing, we want to:
- remove punctuation
- remove links
- expand contractions
- remove digits
- tokenize the words
- [Optional] lowercase the words
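
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code behind `recognizer.pre_processing`: the function name `preprocess`, the tiny `CONTRACTIONS` table, and the URL pattern are all assumptions made for this example. Links and contractions are handled before punctuation is stripped, because both are only recognizable while their punctuation is still present.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table for illustration;
# the `contractions` package installed above covers far more cases.
CONTRACTIONS = {
    "i'd": "I would",
    "i'm": "I am",
    "i've": "I have",
    "it's": "it is",
    "doesn't": "does not",
}

# Assumed URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def preprocess(text, lowercase=False):
    """Return a list of cleaned word tokens for one email body."""
    # remove links first, while "://" and "." still mark them
    text = URL_RE.sub(" ", text)

    # expand contractions found in the table, leave unknown ones alone
    def expand(match):
        return CONTRACTIONS.get(match.group(0).lower(), match.group(0))

    text = re.sub(r"\b\w+'\w+\b", expand, text)

    # remove digits
    text = re.sub(r"\d+", " ", text)

    # replace every punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)

    # tokenize on whitespace, optionally lowercasing
    tokens = text.split()
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return tokens


print(preprocess("I've seen http://scipy.net/x and it doesn't work in 3.0!"))
```

Lowercasing is kept optional because cased NER models often rely on capitalization to spot names.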

In [6]:
# import functions for analyzing
from bigbang.analysis.entity_recognition import EntityRecognizer, SpanVisualizer

Available models are listed at https://huggingface.co/ . You can also train your own model and upload it to Hugging Face.

Examples for possible model names include:
['dslim/bert-base-NER', 'dslim/bert-base-NER-cased', ...]

In [7]:
# the bigbang package loads the model and runs inference behind the scenes;
# pass the name of the model you are interested in to EntityRecognizer
model_name = "dslim/bert-base-NER"

recognizer = EntityRecognizer(model_name)

In [8]:
# parameters
# take one row as an example
index = 20
lowercase = False

body = archive_data['Body'].iloc[index]
print("Text body before pre-processing--------------------:\n")
print(body)
body = recognizer.pre_processing(body, lowercase=lowercase)
print("Text body after pre-processing--------------------:\n")
print(body)

Text body before pre-processing--------------------:

I'd say the latter of the two.  I started linalg months ago, and Travis O.
put
a lot of effort into over the last several weeks.

I'm not really familiar with 3.0 -- we are really focusing on ATLAS cause
it is so dang fast on most platforms.  It doesn't provide a full LAPACK
though,
so you have to merge it with another LAPACK to get everything.

If you can figure out how to write a generic interface (not to hard, but
only
partially documented in /linalg/docs/more_notes), then have at it.
The actual f2py interfaces are generated from a python script.

The more interfaces the merrier, but the compatibility issue has to be
addressed.
On Unix, we could use 'nm' to check if the function is there.  On windows it
ain't so easy.  Maybe it should just be an optional function for now (i.e.
defaults
to being commented out) for the widest compatibility.

eric




----- Original Message -----
From: "Robert Kern" <kern at caltech.edu>
To: <scipy-

In [9]:
tokens = recognizer.tokenizer.tokenize(body)
labels = recognizer.recognize(body)
entities = recognizer.get_entities(labels)
for entity in entities:
    print(entity)

{'entity': 'B-PER', 'word': 'Travis'}
{'entity': 'I-PER', 'word': 'O'}
{'entity': 'B-ORG', 'word': 'AT'}
{'entity': 'B-MISC', 'word': 'Unix'}


*[Optional]* We can also visualize the results with spaCy.

In [10]:
# uncomment the following lines to install spaCy and its small English model
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [11]:
import spacy
from spacy import displacy
from spacy.tokens import Span, Doc

# # score threshold for recognized entities; only entities scoring above it are shown
# threshold = 0.0
find_all_caps = True

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab

visualizer = SpanVisualizer()
merged_tokens = visualizer.merge_tokens(tokens)
doc = Doc(vocab=vocab, words=merged_tokens)
doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)


displacy.render(doc, style='ent', jupyter=True)

In [12]:
visualizer.get_list_per_type()
entity_type_list = visualizer.entity_type
for typ, ent_list in entity_type_list.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)

entity type:  PER
	- Travis O
entity type:  ORG
	- ATLAS
entity type:  MISC
	- Unix
entity type:  ALLCAPS
	- LAPACK


Finally, we show how to process a list of emails and collect the recognized entities by type.

In [13]:
from collections import defaultdict

# taking a list of indexes as examples
indexes = [40, 50, 60]
lowercase = False
find_all_caps = True

model_name = "dslim/bert-base-NER"
recognizer = EntityRecognizer(model_name)

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab

email_entity_types = defaultdict(list)

print('Process emails with id: ', indexes)
for index in indexes:
    body = archive_data['Body'].iloc[index]
    body = recognizer.pre_processing(body, lowercase=lowercase)
    # uncomment to show the email body after pre-processing
    # print(body)

    visualizer = SpanVisualizer()
    # get labels from recognizer first
    tokens = recognizer.tokenizer.tokenize(body)  
    labels = recognizer.recognize(body)
    # merge tokens and spans in visualizer
    merged_tokens = visualizer.merge_tokens(tokens)
    doc = Doc(vocab=vocab, words=merged_tokens)
    doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)
    visualizer.get_list_per_type()
    entity_type = visualizer.entity_type
    for k, v in entity_type.items():
        email_entity_types[k].extend(v)

for typ, ent_list in email_entity_types.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)  

Process emails with id:  [40, 50, 60]
entity type:  ORG
	- SHA
	- University of North Carolina
	- Department of Chemistry
	- Venable Hall
	- Kenan
	- Chapel Hill NC
	- Mailcrypt
	- Matlabs
	- C
	- Emacs
	- Freiheit
entity type:  LOC
	- USA
entity type:  ALLCAPS
	- BEGIN
	- PGP
	- SIGNED
	- MESSAGE
	- BCCDE
	- SIGNATURE
	- END
	- BCCDE
entity type:  MISC
	- CVS
entity type:  PER
	- Eric
	- Einigkeit
