# Email body content analysis 

*For context:* Current analysis in bigbang relies heavily on manual work. Many analyses examine the email headers to identify the people and organizations involved in the discussions. Only a few analyze content, and these focus on finding the first occurrence of keywords and/or the most-used words per user.

This notebook analyzes email body contents with Hugging Face NER models, which systematically label the entities and their types (currently PER, ORG, LOC, and MISC) in the email bodies. This can help researchers better understand the email conversations.

In [1]:
# import necessary packages
from bigbang.archive import Archive
from bigbang.archive import load as load_archive

# hide warnings
import warnings
warnings.filterwarnings('ignore')

First, use the script ```bin/collect_mail.py``` to collect web archives. For details, see https://bigbang-py.readthedocs.io/en/latest/data-sources.html#id1 .

Here, we use the [scipy-dev](https://mail.python.org/pipermail/scipy-dev/) mailing list as an example.

The scipy-dev mailing list contains 149,718 emails from June 2001 to September 2021.

In [2]:
archive_path = "../../archives/scipy-dev/"

archive = Archive(archive_path,mbox=True)
# archive data in pandas dataframe format
archive_data = archive.data

In [3]:
# inspect the data (Message-ID index plus 6 columns)
# note: DataFrame.size counts cells (rows x columns); use len(archive_data) for the number of rows
print(archive_data.size)
archive_data.head(2)

149718


| Message-ID | From | Subject | Date | In-Reply-To | References | Body |
|---|---|---|---|---|---|---|
| `<NEBBIECAMLMAAKHEGPCGKEBHCLAA.travis@vaught.net>` | travis at vaught.net (Travis N. Vaught) | [SciPy-dev] SciPy Developer mailing list now o... | 2001-06-11 02:10:51+00:00 | | | The link:\n\nhttp://scipy.net/mailman/listinfo... |
| `<Pine.LNX.4.33.0107231957590.15960-100000@oliphant.ee.byu.edu>` | oliphant at ee.byu.edu (Travis Oliphant) | [SciPy-dev] RPMs and source distribution | 2001-07-24 02:01:00+00:00 | `<02f001c111bf$2e78a9d0$777ba8c0@190xb01>` | | I've been playing for hours and finally have i... |


In [4]:
# example of one email body
list(archive_data['Body'].iloc[[1]])

["I've been playing for hours and finally have it so that\n\npython setup.py sdist\npython setup.by bdist_rpm\n\nwork as expected.\n\nI have distributions and RPM's that I need to put somewhere.\n\nThanks,\n\n-Travis"]

In [5]:
# uncomment the lines below to install transformers (with PyTorch) and contractions
# inside your current Python environment

# !pip install transformers[torch]
# !pip install contractions

In pre-processing, we want to:
- remove punctuation
- remove links
- expand contractions
- remove digits
- tokenize the words
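
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the implementation behind `recognizer.pre_processing`: the function names, the `URL_RE` pattern, and the tiny `CONTRACTIONS` table are all hypothetical, and the sketch removes links and expands contractions before stripping punctuation so that URLs and apostrophes are still intact when those steps run.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table for illustration only;
# the notebook installs the `contractions` package for this step.
CONTRACTIONS = {
    "i've": "I have",
    "what's": "what is",
    "can't": "cannot",
    "don't": "do not",
}

# Assumed link pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def expand_contractions(text):
    """Replace contractions found in the small table above."""
    def repl(match):
        return CONTRACTIONS.get(match.group(0).lower(), match.group(0))
    return re.sub(r"\b\w+'\w+\b", repl, text)


def preprocess(text, lowercase=False):
    """Apply the listed steps and return a list of word tokens."""
    text = URL_RE.sub(" ", text)        # remove links
    text = expand_contractions(text)    # expand contractions
    text = re.sub(r"\d+", " ", text)    # remove digits
    text = text.translate(PUNCT_TABLE)  # remove punctuation
    if lowercase:
        text = text.lower()
    return text.split()                 # tokenize on whitespace


print(preprocess("I've been playing for 3 hours, see http://scipy.net/x !"))
```

Whitespace tokenization is used here only to keep the sketch short; the NER model applies its own tokenizer later in the notebook.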

In [6]:
# import functions for analyzing
from bigbang.analysis.entity_recognition import EntityRecognizer, SpanVisualizer

In [7]:
# the model is loaded and inference is run inside the bigbang package
# you can pass the name of any model of interest to EntityRecognizer
model_name = "dslim/bert-base-NER"

recognizer = EntityRecognizer(model_name)

In [8]:
import itertools

# parameters
# take one row as an example
index = 40
lowercase = False


# select the body of the row at the given position
body = archive_data['Body'].iloc[index]
print("Text body before pre-processing--------------------:\n")
print(body)
body = recognizer.remove_ori_message(body)
body = recognizer.pre_processing(body, lowercase=lowercase)
print("Text body after pre-processing--------------------:\n")
print(body)

Text body before pre-processing--------------------:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

What's going on with the repository? I am unable to check out stuff,
e.g. 
,----
| cvs server: failed to create lock directory in repository `/home/cvsroot/world/scipy/gui_thread/tests': Permission denied
| cvs server: failed to obtain dir lock in repository `/home/cvsroot/world/scipy/gui_thread/tests'
| cvs [server aborted]: read lock failed - giving up
`----


Greetings,
Jochen
- -- 
University of North Carolina                       phone: +1-919-962-4403
Department of Chemistry                            phone: +1-919-962-1579
Venable Hall CB#3290 (Kenan C148)                    fax: +1-919-843-6041
Chapel Hill, NC 27599, USA                            GnuPG key: 44BCCD8E
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6-cygwin-fcn-1 (Cygwin)
Comment: Processed by Mailcrypt and GnuPG <http://www.gnupg.org/>

iD8DBQE70GkmiJ/aUUS8zY4RAlCMAJsH4pN3CIaLiG/LmJXef3Cq7KV9qgCcDbuO
S6RYodIz

A list of available models can be found at https://huggingface.co/ . You can also train your own model and upload it to Hugging Face.

Examples for possible model names include:
['dslim/bert-base-NER', 'dslim/bert-base-NER-cased', ...]

In [9]:
tokens = recognizer.tokenizer.tokenize(body)
labels = recognizer.recognize(body)
entities = recognizer.get_entities(labels)
for entity in entities:
    print(entity)

{'entity': 'I-ORG', 'word': 'SH'}
{'entity': 'B-ORG', 'word': 'University'}
{'entity': 'I-ORG', 'word': 'of'}
{'entity': 'I-ORG', 'word': 'North'}
{'entity': 'I-ORG', 'word': 'Carolina'}
{'entity': 'B-ORG', 'word': 'Department'}
{'entity': 'I-ORG', 'word': 'of'}
{'entity': 'I-ORG', 'word': 'Chemistry'}
{'entity': 'B-ORG', 'word': 'V'}
{'entity': 'I-ORG', 'word': '##ena'}
{'entity': 'I-ORG', 'word': '##ble'}
{'entity': 'I-ORG', 'word': 'Hall'}
{'entity': 'I-ORG', 'word': 'Ken'}
{'entity': 'B-ORG', 'word': 'Chapel'}
{'entity': 'I-ORG', 'word': 'Hill'}
{'entity': 'I-ORG', 'word': 'NC'}
{'entity': 'I-LOC', 'word': 'USA'}
{'entity': 'I-ORG', 'word': '##c'}
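
The output above is per token: words are split into pieces marked with `##`, and the `B-`/`I-` prefixes mark the beginning and inside of an entity. A minimal sketch of how such a sequence can be merged into whole entities follows. It is not the code used by `SpanVisualizer`; the function names are hypothetical, and the input assumes a full token sequence in which non-entity tokens carry the tag `O` (the `phone` token and its tag are added for illustration).

```python
def merge_word_pieces(tokens, tags):
    """Join '##' continuation pieces onto the preceding token.

    A merged word keeps the tag of its first piece.
    """
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
            word_tags.append(tag)
    return words, word_tags


def group_entities(words, tags):
    """Group consecutive B-/I- tagged words of the same type into spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and ent_type != current_type)
        if tag == "O" or starts_new:
            # close the span that was open, if any
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            if tag == "O":
                current_type, current_words = None, []
            else:
                current_type, current_words = ent_type, [word]
        else:
            # I- tag continuing the current span
            current_words.append(word)
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


tokens = ["Department", "of", "Chemistry", "V", "##ena", "##ble", "Hall", "phone", "USA"]
tags = ["B-ORG", "I-ORG", "I-ORG", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "I-LOC"]

words, word_tags = merge_word_pieces(tokens, tags)
print(group_entities(words, word_tags))
```

An `I-` tag with no open span of the same type is treated as the start of a new span, which matches how `I-LOC` for `USA` appears on its own in the output above.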


*[Optional]* We can also visualize the results with spaCy.

In [10]:
# uncomment the following lines to install the spaCy package and its English model
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [11]:
visualizer = SpanVisualizer()
merged_tokens, maps = visualizer.merge_tokens(tokens)

In [12]:
import spacy
from spacy import displacy
from spacy.tokens import Span, Doc

# score threshold for recognized entities: only entities scoring above it are meant to be shown
threshold = 0.0

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab

merged_labels = visualizer.merge_labels(labels, maps)
doc = Doc(vocab=vocab, words=merged_tokens)
doc = visualizer.get_doc_for_visualization(merged_labels, maps, doc)


displacy.render(doc, style='ent', jupyter=True)
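
The idea of a score threshold can be sketched as a simple filter over entity records. This is an illustration under assumptions: the helper `filter_by_score` is hypothetical, each record is assumed to carry a `score` field alongside `entity` and `word`, and the scores below are made up for the example.

```python
def filter_by_score(entities, threshold=0.0):
    """Keep only entities whose score is strictly above the threshold."""
    return [ent for ent in entities if ent["score"] > threshold]


# Illustrative records; the scores are invented for this example.
sample = [
    {"entity": "B-ORG", "word": "University", "score": 0.98},
    {"entity": "I-ORG", "word": "SH", "score": 0.41},
    {"entity": "I-LOC", "word": "USA", "score": 0.99},
]

kept = filter_by_score(sample, threshold=0.5)
print([ent["word"] for ent in kept])
```

With a threshold of `0.0`, as set in the cell above, every entity with a positive score is kept.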

In [13]:
visualizer.get_list_per_type()
entity_type_list = visualizer.entity_type
for typ, ent_list in entity_type_list.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)

entity type:  ORG
	- SHA
	- University of North Carolina
	- Department of Chemistry
	- Venable Hall
	- Kenan
	- Chapel Hill NC
	- Mailcrypt
entity type:  LOC
	- USA


Finally, we show how to process a list of emails and collect the recognized entities by type.

In [14]:
# taking a list of indexes as examples
indexes = [40, 50, 60]
lowercase = False

model_name = "dslim/bert-base-NER"
recognizer = EntityRecognizer(model_name)

bodies = [] 

print('Process emails with id: ', indexes)
for index in indexes:
    # select the body of the row at the given position
    body = archive_data['Body'].iloc[index]
    body = recognizer.remove_ori_message(body)
    body = recognizer.pre_processing(body, lowercase=lowercase)
    # uncomment to show email bodies after pre-processing
    # print(body)
    bodies.append(body)

visualizer = SpanVisualizer()
nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab
for body in bodies:
    # get labels from recognizer first
    tokens = recognizer.tokenizer.tokenize(body)  
    labels = recognizer.recognize(body)
    entities = recognizer.get_entities(labels)
    # merge tokens and spans in visualizer
    merged_tokens, maps = visualizer.merge_tokens(tokens)
    merged_labels = visualizer.merge_labels(labels, maps)
    doc = Doc(vocab=vocab, words=merged_tokens)
    _ = visualizer.get_doc_for_visualization(merged_labels, maps, doc)

visualizer.get_list_per_type()
entity_type_list = visualizer.entity_type
for typ, ent_list in entity_type_list.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)  

Process emails with id:  [40, 50, 60]
entity type:  ORG
	- SHA
	- University of North Carolina
	- Department of Chemistry
	- Venable Hall
	- Kenan
	- Chapel Hill NC
	- Mailcrypt
	- Matlabs
	- Emacs
	- Recht
	- Freiheit
	- BCCDE
entity type:  LOC
	- USA
entity type:  MISC
	- CVS
entity type:  PER
	- Eric
	- Einigkeit
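
The per-type lists above accumulate entities across all processed emails. A minimal sketch of that kind of aggregation follows, assuming each email has already been reduced to a list of `(type, text)` pairs. The helper `collect_by_type` is hypothetical, the de-duplication is a design choice of this sketch rather than documented behavior of `SpanVisualizer`, and the assignment of entities to emails below is illustrative.

```python
from collections import defaultdict


def collect_by_type(spans_per_email):
    """Merge per-email (type, text) spans into one de-duplicated list per type."""
    by_type = defaultdict(list)
    for spans in spans_per_email:
        for ent_type, text in spans:
            # keep first-seen order and skip repeats
            if text not in by_type[ent_type]:
                by_type[ent_type].append(text)
    return dict(by_type)


# Illustrative input: which email each entity came from is assumed.
spans_per_email = [
    [("ORG", "University of North Carolina"), ("LOC", "USA")],
    [("ORG", "Emacs"), ("ORG", "University of North Carolina"), ("PER", "Eric")],
]

for ent_type, ent_list in collect_by_type(spans_per_email).items():
    print("entity type: ", ent_type)
    for ent in ent_list:
        print("\t-", ent)
```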
