# Email body content analysis 

*For context:* Current analyses in bigbang focus on email headers, for example to identify the people and organizations involved in the discussions. Only a few analyses look at content, and those are limited to searching for the first occurrence of keywords and/or finding the most-used words per user.

This notebook analyzes email body contents with Hugging Face Named Entity Recognition (NER) models, which systematically label entities and their types (currently PER, ORG, LOC, and MISC) in the email bodies. This can help researchers better understand the email conversations.

In [1]:
# import necessary packages
from bigbang.archive import Archive
from bigbang.archive import load as load_archive
import pandas as pd

# hide warnings
import warnings

warnings.filterwarnings("ignore")

First, use the script `bin/collect_mail.py` to collect web archives. See https://bigbang-py.readthedocs.io/en/latest/data-sources.html#id1 for details.

<!-- Here, we use an example of the [scipy-dev](https://mail.python.org/pipermail/scipy-dev/) mailing list page.

Scipy-dev mailing list contains 149,718 emails From June 2001 - September 2021. -->


In [4]:
mailing_list = "scipy-dev"
archive_path = "../../archives/{}/".format(mailing_list)
archive = Archive(archive_path, mbox=True)
# archive data in pandas dataframe format
archive_data = archive.data

In [5]:
# inspect data
print(len(archive_data))
archive_data.head(2)

24953


| Message-ID | From | Subject | Date | In-Reply-To | References | Body |
|---|---|---|---|---|---|---|
| `<NEBBIECAMLMAAKHEGPCGKEBHCLAA.travis@vaught.net>` | travis at vaught.net (Travis N. Vaught) | [SciPy-dev] SciPy Developer mailing list now o... | 2001-06-11 02:10:51+00:00 | | | The link:\n\nhttp://scipy.net/mailman/listinfo... |
| `<Pine.LNX.4.33.0107231957590.15960-100000@oliphant.ee.byu.edu>` | oliphant at ee.byu.edu (Travis Oliphant) | [SciPy-dev] RPMs and source distribution | 2001-07-24 02:01:00+00:00 | `<02f001c111bf$2e78a9d0$777ba8c0@190xb01>` | | I've been playing for hours and finally have i... |


In [6]:
# example of one email body
list(archive_data["Body"].iloc[[1]])

["I've been playing for hours and finally have it so that\n\npython setup.py sdist\npython setup.by bdist_rpm\n\nwork as expected.\n\nI have distributions and RPM's that I need to put somewhere.\n\nThanks,\n\n-Travis"]

In [7]:
# uncomment the lines below to install transformers with pytorch
# and the other dependencies inside your current python environment

# !pip install transformers[torch]
# !pip install contractions
# !pip install email_reply_parser

In pre-processing, we want to:
- remove punctuation
- remove links 
- expand contractions 
- remove digits
- tokenize the words
- [Optional] Lowercase the words
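
As a rough illustration of the steps listed above, here is a minimal sketch using only the standard library. It is not the implementation behind `recognizer.pre_processing`: the function name `preprocess_sketch` and the tiny `CONTRACTIONS` table are hypothetical (the notebook installs the `contractions` package for that step), and this sketch strips all punctuation, which may differ from what the package does.

```python
import re

# Hypothetical, deliberately tiny contraction table for illustration only.
CONTRACTIONS = {"i've": "I have", "i'll": "I will", "don't": "do not", "it's": "it is"}


def preprocess_sketch(text, lowercase=False):
    """Apply the listed pre-processing steps and return a list of tokens."""
    # Remove links first so their punctuation does not leave fragments behind.
    text = re.sub(r"https?://\S+", " ", text)
    # Expand contractions (case-insensitive lookup in the small table above).
    text = re.sub(
        r"\b\w+'\w+\b",
        lambda m: CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
        text,
    )
    # Remove digits.
    text = re.sub(r"\d+", "", text)
    # Remove punctuation (anything that is not a word character or whitespace).
    text = re.sub(r"[^\w\s]", " ", text)
    if lowercase:
        text = text.lower()
    # Tokenize on whitespace.
    return text.split()


tokens = preprocess_sketch("I've patched f2c, see http://scipy.net/ for 2 details.")
print(tokens)
```

Links are removed before punctuation so that a URL does not survive as a string of word fragments. Note that removing digits turns `f2c` into `fc`, the same effect visible in the pre-processed email shown later in this notebook.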

In [8]:
# import functions for analyzing
from bigbang.analysis.entity_recognition import EntityRecognizer, SpanVisualizer

The list of available models can be found at https://huggingface.co/ . You can also train your own model and upload it to Hugging Face.

Examples for possible model names include:
['dslim/bert-base-NER', 'dslim/bert-base-NER-cased', ...]

In [9]:
# we load the model and apply inference in the back-end of the bigbang package
# you can pass the model name of your interest to the function
model_name = "EffyLi/bert-base-NER-finetuned-ner-cerec"

recognizer = EntityRecognizer(model_name)

In [10]:
# parameters
# taking one row as an example
index = 6
lowercase = False

body = archive_data["Body"].iloc[index]
print("Text body before pre-processing--------------------:\n")
print(body)
body = recognizer.pre_processing(body, lowercase=lowercase)
print("Text body after pre-processing--------------------:\n")
print(body)

Text body before pre-processing--------------------:

All,

As I mentioned in my previous message, I've been trying to patch
Fortran compilation to support f2c. Unfortunately, after some
work on patching both build_flib.py and the fc f2c script, I ran
into several problems.

1. fc puts the files into the current directory.
2. The build process runs into problems with the space in the
platform name which contains "Power Macintosh". In particular,
ar has problems with the space.

I think, I'll wait until gcc 3.0 on OS X. That will have g77 support.

Cheers,

Tim Lahey
Text body after pre-processing--------------------:

All,  As I mentioned in my previous message, I have been trying to patch Fortran compilation to support fc. Unfortunately, after some work on patching both build_flib.py and the fc fc script, I ran into several problems.  . fc puts the files into the current directory. . The build process runs into problems with the space in the platform name which contains "Power Macinto

In [11]:
tokens = recognizer.tokenizer.tokenize(body)
labels = recognizer.recognize(body)
entities = recognizer.get_entities(labels)
for entity in entities:
    print(entity)

{'entity': 'B-PER', 'word': 'I'}
{'entity': 'B-PER', 'word': 'I'}
{'entity': 'B-ORG', 'word': 'Fort'}
{'entity': 'I-PER', 'word': '##ran'}
{'entity': 'B-PER', 'word': 'build'}
{'entity': 'I-PER', 'word': 'fl'}
{'entity': 'I-PER', 'word': '##ib'}
{'entity': 'I-PER', 'word': 'p'}
{'entity': 'B-PER', 'word': 'I'}
{'entity': 'I-PER', 'word': 'f'}
{'entity': 'I-PER', 'word': '##c'}
{'entity': 'B-MISC', 'word': 'Power'}
{'entity': 'I-MISC', 'word': 'Macintosh'}
{'entity': 'B-PER', 'word': 'a'}
{'entity': 'B-PER', 'word': 'I'}
{'entity': 'B-PER', 'word': 'Tim'}
{'entity': 'I-PER', 'word': 'La'}
{'entity': 'I-PER', 'word': '##hey'}
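
The output above is at the sub-word level: a `##` prefix marks a WordPiece continuation of the previous token, and the `B-`/`I-` prefixes mark the beginning and inside of an entity. The sketch below shows one way such a sequence could be merged into whole entities. It is an illustration only, not the code behind `SpanVisualizer.merge_tokens`; the helper name `merge_bio_tokens` and the sample sentence are hypothetical, and it assumes the full token sequence (including `O` labels) is available so that adjacency is known.

```python
def merge_bio_tokens(tokens, labels):
    """Group WordPiece tokens with BIO labels into (text, type) entity spans."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Close the currently open entity, if any.
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, label in zip(tokens, labels):
        if token.startswith("##"):
            # Sub-word piece: glue it onto the previous word of the open entity.
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
        elif label.startswith("B-") or current_type != label[2:]:
            # A B- tag (or a type change) starts a new entity.
            flush()
            current_words, current_type = [token], label[2:]
        else:
            # An I- tag of the same type continues the open entity.
            current_words.append(token)
    flush()
    return entities


tokens = ["Cheers", ",", "Tim", "La", "##hey", "uses", "Fort", "##ran",
          "on", "Power", "Macintosh"]
labels = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-PER",
          "O", "B-MISC", "I-MISC"]
merged = merge_bio_tokens(tokens, labels)
print(merged)
```

In this sketch a sub-word piece inherits the type of the word it belongs to, so `Fort` (`B-ORG`) followed by `##ran` (`I-PER`) becomes a single `Fortran` entity of type ORG.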


*[Optional]* We can also visualize the results with spaCy.

In [12]:
# uncomment the following lines to install the spacy package and its English model
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [13]:
import spacy
from spacy import displacy
from spacy.tokens import Span, Doc

# define a score threshold for the recognized entities; only entities scoring above the threshold are shown
# threshold = 0.0
find_all_caps = True

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab

visualizer = SpanVisualizer()
merged_tokens = visualizer.merge_tokens(tokens)
doc = Doc(vocab=vocab, words=merged_tokens)
doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)


displacy.render(doc, style="ent", jupyter=True)

In [14]:
visualizer.get_list_per_type()
entity_type_list = visualizer.entity_type
for typ, ent_list in entity_type_list.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)

entity type:  PER
	- I
	- build
	- flib
	- py
	- fc
	- ar
	- Tim Lahey
entity type:  ORG
	- Fortran
entity type:  MISC
	- Power Macintosh
entity type:  ALLCAPS
	- OS
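
The `ALLCAPS` group appears because `find_all_caps=True` was passed. The notebook does not show how these words are detected; one plausible approach, sketched below under that assumption, is a regular expression that matches words made of two or more capital letters. The helper `find_all_caps_words` is hypothetical and not part of bigbang.

```python
import re


def find_all_caps_words(text, min_length=2):
    """Return unique all-uppercase words of at least `min_length` letters, in order of appearance."""
    pattern = r"\b[A-Z]{%d,}\b" % min_length
    seen = []
    for word in re.findall(pattern, text):
        if word not in seen:
            seen.append(word)
    return seen


caps = find_all_caps_words("I will wait until gcc 3.0 on OS X. OS X will have g77 support.")
print(caps)
```

With a minimum length of two, single capitals such as `I` and `X` are skipped, which is consistent with only `OS` being listed above.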


## Processing the whole mailing list

Finally, we show an example of how to pass a list of emails and get back a list of entities with their types. We save them in a CSV file for further processing.

In [15]:
stop_words = [
    "i",
    "you",
    "me",
    "my",
    "mine",
    "myself",
    "your",
    "yours",
    "yourself",
    "we",
    "us",
    "our",
    "ours",
    "ourselves",
    "yourselves",
    "he",
    "him",
    "himself",
    "his",
    "she",
    "her",
    "hers",
    "herself",
    "it",
    "its",
    "itself",
    "they",
    "them",
    "their",
    "theirs",
    "themself",
    "themselves",
    "this",
    "that",
    "something",
    "these",
    "those",
    "someone",
    "somebody",
    "who",
    "whom",
    "whose",
    "which",
    "what",
]
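
The list above is compared case-insensitively against each recognized entity, so `I`, `You`, and `we` are all dropped before saving. A minimal sketch of that filter follows; the shortened stop-word collection and the sample entities are made up for illustration, and using a set rather than a list is a choice of this sketch, not of the notebook.

```python
# Shortened stand-in for the stop_words list above (a set gives fast lookups).
stop_words_sketch = {"i", "you", "me", "we", "he", "your"}

# Entities as they might come back for the PER type (made-up sample).
per_entities = ["I", "Tim Lahey", "You", "Travis Oliphant", "we", "Pearu"]

# Keep only entities whose lowercase form is not a stop word.
kept = [ent for ent in per_entities if ent.lower() not in stop_words_sketch]
print(kept)
```

Because the comparison is on the whole entity string, multi-word entities that merely contain a pronoun (for example `your member`) are not filtered out.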

In [17]:
from collections import defaultdict

# process only the first 50 emails as an example;
# set num_data = len(archive_data) to process the whole mailing list
num_data = 50
print("Process {} emails in total".format(num_data))
indexes = list(range(0, num_data))
lowercase = False
find_all_caps = True

model_name = "EffyLi/bert-base-NER-finetuned-ner-cerec"
# model_name = "dslim/bert-base-NER"
recognizer = EntityRecognizer(model_name)

nlp = spacy.load("en_core_web_sm")
vocab = nlp.tokenizer.vocab
save_file_path = "../../archives/"
save_file_name = archive_path.split("/")[-2] + "-entities.csv"
columns_names = ["email_id", "entity", "type"]
df = pd.DataFrame(columns=columns_names)

email_entity_types = defaultdict(list)

# print('Process emails with id: ', indexes)
for index in indexes:
    if index % 200 == 0:
        print("{} emails processed, {} emails left.".format(index, (num_data - index)))
    body = list(archive_data["Body"].iloc[[index]])[0]
    body = recognizer.pre_processing(body, lowercase=lowercase)
    #     show email bodies after pre-processing
    #     print(body)

    visualizer = SpanVisualizer()
    # get labels from recognizer first
    tokens = recognizer.tokenizer.tokenize(body)
    labels = recognizer.recognize(body)
    # merge tokens and spans in visualizer
    merged_tokens = visualizer.merge_tokens(tokens)
    doc = Doc(vocab=vocab, words=merged_tokens)
    doc = visualizer.get_doc_for_visualization(tokens, labels, body, doc, find_all_caps)
    visualizer.get_list_per_type()
    entity_type = visualizer.entity_type
    for k, v in entity_type.items():
        email_entity_types[k].extend(v)
        for v_i in v:
            # remove pronouns
            if v_i.lower() not in stop_words:
                new_row = {"email_id": index, "entity": v_i, "type": k}
                df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
# write the file into save_file_path (defined above)
df.to_csv(save_file_path + save_file_name)
print("Extracted entities saved!")

for typ, ent_list in email_entity_types.items():
    print("entity type: ", typ)
    for ent in ent_list:
        print("\t-", ent)

Process 50 emails in total
0 emails processed, 50 emails left.


Token indices sequence length is longer than the specified maximum sequence length for this model (1208 > 512). Running this sequence through the model will result in indexing errors


Extracted entities saved!
entity type:  PER
	- you
	- scipy - dev
	- scipy . net
	- I
	- I
	- py
	- I
	- Travis
	- I
	- I
	- SciPy
	- my
	- flib
	- py
	- Cephes
	- Tim Lahey
	- I
	- build
	- flib
	- py
	- fc
	- ar
	- Tim Lahey
	- I
	- us
	- de Boor
	- Tim Lahey
	- Tim
	- I
	- Travis Oliphant
	- he
	- you
	- Travis O .
	- I
	- your
	- I
	- splines
	- I
	- Carl de Boor
	- Joe
	- You
	- you
	- me
	- your
	- We
	- your member
	- I
	- everyone
	- eric
	- I
	- me
	- Joe rossini
	- you
	- washington . edu
	- A . J . Rossini
	- I
	- anyone
	- us
	- you
	- Travis Vaught
	- I
	- rossini
	- A . J . Rossini
	- I
	- me
	- you
	- JMR
	- I
	- Eric
	- he
	- Numpy
	- I
	- Rob
	- ps
	- my Fortran
	- I
	- linalg
	- Travis O .
	- we
	- you
	- lapack
	- pyf
	- I
	- I
	- I
	- I
	- cygwin
	- fpy
	- me
	- Pearu
	- I
	- Pearu
	- I
	- me
	- Jochen
	- Jason
	- I
	- me
	- Cygwin
	- Jason
	- Travis
	- cygwin
	- Pearu
	- Jochen
	- you
	- my
	- I
	- libpython
	- dll
	- You
	- distutils
	- sig
	- python . org
	- Jaso
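
The tokenizer warning in the output above (1208 > 512) means that some email bodies exceed the model's maximum sequence length. The notebook does not show how `EntityRecognizer` deals with this. One common workaround is to split the token sequence into overlapping windows and run the model on each window separately. The sketch below illustrates only the splitting step; `chunk_tokens`, the window size of 510 (512 minus the two special tokens of BERT-style models), and the overlap of 50 are illustrative assumptions.

```python
def chunk_tokens(tokens, max_len=510, overlap=50):
    """Split `tokens` into windows of at most `max_len` items.

    Consecutive windows share `overlap` items so that an entity cut at a
    window border still appears whole in the neighbouring window.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


# Dummy sequence with the length reported in the warning.
tokens = ["tok{}".format(i) for i in range(1208)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Predictions from overlapping regions would still need to be reconciled (for example by keeping the label from the window where the token is farthest from a border); that step is left out here.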

## Run the cell below only to display a pre-processed mailing list

In [None]:
import pandas as pd

# load pre-processed csv file to dataframe and display
file_path = "extracted_entities/3gv6-entities.csv"
df = pd.read_csv(file_path)

# get the 10 most frequent entities for each category
categories = set(recognizer.model.config.id2label.values())
# strip the B-/I- prefixes so that e.g. B-PER and I-PER both map to PER
categories = sorted({c.split("-")[-1] for c in categories})

for c in categories:
    if c != "O":
        if c == "PER":
            print("Top 10 occurrences (pronouns excluded) for type: ", c)
        else:
            print("Top 10 occurrences for type: ", c)
        df_c = df.loc[df["type"] == c]
        display_df = (
            df_c["entity"]
            .value_counts()
            .rename_axis("entity")
            .reset_index(name="counts")
        )
        display(display_df.head(10))
        print()
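
If pandas is not at hand, the same kind of top-N summary can be computed with the standard library. The sketch below assumes a CSV with the `entity` and `type` columns written earlier in this notebook; the sample rows are made up and `top_entities_per_type` is a hypothetical helper.

```python
import csv
import io
from collections import Counter, defaultdict


def top_entities_per_type(csv_file, n=10):
    """Return {type: [(entity, count), ...]} with the n most frequent entities per type."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        counts[row["type"]][row["entity"]] += 1
    return {typ: counter.most_common(n) for typ, counter in counts.items()}


# In-memory stand-in for an entities CSV file (made-up rows).
sample = io.StringIO(
    "email_id,entity,type\n"
    "0,Tim Lahey,PER\n"
    "1,Fortran,ORG\n"
    "1,Tim Lahey,PER\n"
    "2,Pearu,PER\n"
)
top = top_entities_per_type(sample, n=2)
for typ, ranked in top.items():
    print(typ, ranked)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`.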