Cleverly Zoning Corpus

Multilingual email zoning corpus, containing:

625 non-English emails from the Gmane corpus annotated with email zones, including:
- 215 French emails
- 210 Portuguese emails
- 200 Spanish emails

Files and folders

├── LICENSE                     -- File with licence description (Apache License 2.0.)
├── README.md                   -- This file with instructions
├── corpus/                     -- Folder with Cleverly Zoning Corpus
│   ├── es_annotator1.json      -- Spanish annotations done by our first annotator (A1)
│   ├── es_annotator2.json      -- Spanish annotations done by our second annotator (A2)
│   ├── fr_annotator1.json      -- French annotations done by our first annotator (A1)
│   ├── fr_annotator2.json      -- French annotations done by our second annotator (A2)
│   ├── pt_annotator1.json      -- Portuguese annotations done by our first annotator (A1)
│   └── pt_annotator2.json      -- Portuguese annotations done by our second annotator (A2)
└── paper/
    └── EACL_2021__Multilingual_Email_Zoning.pdf   -- PDF with the original paper presenting the Cleverly Zoning Corpus

Reference

If you use this corpus, please cite the original paper:

Bruno Jardim, Ricardo Rei and Mariana S. C. Almeida. "Multilingual Email Zoning". EACL 2021 Student Research Workshop, April 2021.

with the bibtex:

@inproceedings{BJardim2021,
 author = {Bruno Jardim and Ricardo Rei and Mariana S. C. Almeida},
 title = {Multilingual Email Zoning},
 booktitle = {EACL 2021 Student Research Workshop},
 publisher = {Association for Computational Linguistics},
 year = {2021},
 month = {April},
 location = {online},
 url = {https://arxiv.org/abs/2102.00461},
}

Abstract

The segmentation of emails into functional zones (also dubbed email zoning) is a relevant preprocessing step for most NLP tasks that deal with emails. However, despite the multilingual character of emails and their applications, previous literature regarding email zoning corpora and systems was developed essentially for English. In this paper, we analyse the existing email zoning corpora and propose Cleverly zoning corpus, a new multilingual benchmark composed of 625 emails in Portuguese, Spanish and French. Moreover, we introduce Okapi, the first multilingual email segmentation model based on a language agnostic sentence encoder. Besides generalizing well for unseen languages, our model is competitive with current English benchmarks, and reached new state-of-the-art performances for domain adaptation tasks in English.

Licence

Apache License 2.0.

More details are in the LICENSE file in the root of this project.

Corpus description

To build a multilingual email corpus, we searched the Gmane corpus (Bevendorff et al., 2020) for Portuguese, Spanish and French emails. Then, following the classification schema proposed in Bevendorff et al. (2020), we produced a total of 625 annotated emails that make up the Cleverly Email Zoning Corpus.

Email zones

paragraph: main content; new content from the current email sender.
closing: content that closes the email (e.g. Regards, Some Name).
inline_headers: lines with information that contextualizes a previous message in the thread, such as addresses, dates, recipients, etc.
log_data: information about configurations or events that occurred within a software application, such as error messages or file loading.
mua_signature: legal disclaimers, privacy statements, or other automatically generated texts advising the people that interact with the email. Also included are automatically generated lines stating which device or mail user agent the email was sent from (e.g. Sent from iPhone), as well as mailing list details and advertisements.
patch: pieces of code used to change a computer program or its supporting data in order to update, fix, or improve it, including fixes for security vulnerabilities and other bugs (source code diffs).
personal_signature: contact or other personal information such as name, job title, company, phone number, etc., that is automatically inserted at the end of an email message.
quotation: content quoted from previous messages in the same conversation thread and forwarded content from other conversations. Each line typically starts with a “>”.
quotation_marker: lines that introduce a quotation zone, stating the author and date of the quoted message.
raw_code: code written in any programming language (source code).
salutation: terms of address and recipient names at the beginning of a message.
section_heading: lines that are headers of other zones, mainly of paragraph zones.
tabular: text in a semi-structured tabular or matrix format.
technical: lines of automatic messages generated by Gmane regarding the start, end or removal of email parts; inline attachments or PGP signatures.
visual_separator: lines with no alphanumeric characters, used to divide the text between multiple zones.
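
For programmatic checks, the fifteen zone labels above can be collected into a set. The sketch below is not part of the corpus tooling: the names `ZONE_LABELS` and `unknown_labels` are made up for illustration, and the example input is invented.

```python
# Zone labels exactly as listed above (15 in total).
ZONE_LABELS = frozenset({
    "paragraph", "closing", "inline_headers", "log_data", "mua_signature",
    "patch", "personal_signature", "quotation", "quotation_marker",
    "raw_code", "salutation", "section_heading", "tabular", "technical",
    "visual_separator",
})


def unknown_labels(zones):
    """Return the labels in a "zones" list that are not in ZONE_LABELS.

    `zones` is a list of [label, start, end] triples.
    """
    return sorted({label for label, _start, _end in zones} - ZONE_LABELS)


# Invented example: "greeting" is not one of the corpus labels.
example = [["salutation", 0, 11], ["paragraph", 11, 601], ["greeting", 601, 610]]
print(unknown_labels(example))  # ['greeting']
```

An empty result means every label in the list belongs to the schema.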

Statistics

Although French has the most emails, Portuguese and Spanish emails tend to be longer, resulting in more lines and a higher number of zones per email. The distribution of zones is quite similar across the three languages. As a percentage of all lines in the corpus, the zones are: quotation (52.2%), paragraph (19.8%), MUA signature (8.8%), personal signature (3.7%), visual separator (2.7%), quotation marker (2.7%), closing (2.5%), raw code (2.0%), inline headers (1.9%), salutation (1.0%), tabular (0.4%), technical (0.3%), patch (0.2%) and section heading (0.1%). Spanish and French emails do not contain any section heading lines.

	pt	es	fr
emails	210	200	215
zones	15	14	14
lines	12366	9824	6958
lines per email	58.9	49.1	32.4
zones per email	8.6	6.5	5.9
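
The "lines per email" row follows directly from the counts in the table. As a quick arithmetic check (the variable names are only illustrative):

```python
# Counts copied from the table above.
emails = {"pt": 210, "es": 200, "fr": 215}
lines = {"pt": 12366, "es": 9824, "fr": 6958}

# Average number of lines per email, rounded to one decimal as in the table.
lines_per_email = {lang: round(lines[lang] / emails[lang], 1) for lang in emails}

print(sum(emails.values()))  # 625
print(lines_per_email)       # {'pt': 58.9, 'es': 49.1, 'fr': 32.4}
```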

The annotation was carried out in Tagtog by two annotators. The first annotator is a native Portuguese speaker and the second a native Spanish speaker; both have an academic background in French and are fluent in the third language. Each email was annotated by both annotators, with the following inter-annotator agreement:

	pt	es	fr
accuracy	0.93	0.92	0.96
F₁ A₁A₂	0.93	0.92	0.96
F₁ A₂A₁	0.94	0.92	0.96
k (Cohen's Kappa)	0.90	0.87	0.94
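
For readers who want to compute agreement figures of this kind on their own data, here is a minimal sketch of accuracy and Cohen's kappa over two parallel label sequences. It assumes one label per line from each annotator; the exact unit of comparison behind the table above is not stated here, and the toy labels are invented.

```python
from collections import Counter


def accuracy(labels_a, labels_b):
    """Fraction of positions where both annotators chose the same label."""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e). Undefined when p_e equals 1."""
    n = len(labels_a)
    p_o = accuracy(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels, one per line, for two hypothetical annotators.
a1 = ["salutation", "paragraph", "paragraph", "closing", "quotation", "quotation"]
a2 = ["salutation", "paragraph", "closing", "closing", "quotation", "quotation"]
print(round(accuracy(a1, a2), 3), round(cohen_kappa(a1, a2), 3))
```

Kappa is lower than raw accuracy because it discounts the agreement that two annotators with these label frequencies would reach by chance.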

Annotation format

The corpus consists of 6 files (one for each of the 3 languages and 2 annotators). Each annotation file contains a list of document annotations, stored as dictionaries with the following structure:

{"_id": "<urn:uuid:aa5e973e-65fc-4d32-a0d1-7848eaf405e8>", 
"zones": [["salutation", 0, 11], 
         ["paragraph", 11, 601], 
         ["closing", 601, 610], 
         ["personal_signature", 610, 677], 
         ["mua_signature", 677, 759], 
         ["quotation_marker", 759, 829], 
         ["quotation", 829, 1150]]}

The first key ("_id") holds the email id from the original Gmane corpus. The second key ("zones") contains the annotated zone spans of that email as a list of lists, where each sublist contains the name of the zone ("salutation", "paragraph", "closing", etc.), followed by the starting and ending character offsets of that zone in the email.
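
Since each zone is a character span, recovering its text is a plain string slice once the email body is available. The sketch below assumes the end offset is exclusive, which the contiguous spans in the example suggest (each zone starts where the previous one ends). The helper name `zone_texts`, the email text and the annotation are invented toy data.

```python
def zone_texts(text, zones):
    """Return (label, substring) pairs for [label, start, end] zone spans.

    Assumes `end` is exclusive, so consecutive zones do not overlap.
    """
    return [(label, text[start:end]) for label, start, end in zones]


# Invented toy email and annotation in the same shape as the corpus entries.
toy_text = "Bonjour,\nMerci pour le patch.\nCordialement\n"
toy_annotation = {
    "_id": "<urn:uuid:00000000-0000-0000-0000-000000000000>",  # placeholder id
    "zones": [["salutation", 0, 9], ["paragraph", 9, 30], ["closing", 30, 43]],
}

for label, chunk in zone_texts(toy_text, toy_annotation["zones"]):
    print(label, repr(chunk))
```

When the spans cover the whole email, joining the chunks in order reproduces the original text, which is a cheap sanity check on the offsets.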

How to use Cleverly Zoning Corpus

Load the email textual content

To retrieve the email content:

Request access to the full Gmane corpus from: https://webis.de/data/webis-gmane-19.html
Download the webis-gmane-19-part04 folder
Access the .ndjson file part-0051
Retrieve the first 300000 emails. (See instructions about how to extract file content from: https://webis.de/data/webis-gmane-19.html and Gmane README.md)
Match ids with our zone annotations

Load Cleverly corpus and match Ids (Python)

Open Cleverly corpus files

Open one of the annotation files (French):

import io
import json
with io.open('fr_annotator1.json', encoding="utf-8") as json_file:
    cleverly_fr = json.load(json_file)

Match Cleverly ids with the ones from the full Gmane corpus to get email text

After requesting access to the full Gmane corpus (as outlined before), open the part-0051.ndjson file and extract the first 3000000 lines:

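gmane = []
# Forward slash works on every platform and avoids the invalid "\p" escape.
with open('webis-gmane-19-part04/part-0051.ndjson', 'r', encoding="utf-8") as ndjson_file:
    for line in ndjson_file:
        # Stop once the first 3000000 lines have been read.
        if len(gmane) >= 3000000:
            break
        gmane.append(json.loads(line))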

Reassemble ids and email content

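gmane_id_email = []
for j, line in enumerate(gmane):
    if j % 2 == 0:
        # Even lines hold the index entry with the email id.
        _id = line['index']['_id']
    else:
        # Odd lines hold the email document that belongs to the preceding id.
        gmane_id_email.append({'_id': _id, 'lang': line['lang'], 'text': line['text_plain']})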

Filter for correct language

gmane_fr = [email for email in gmane_id_email if email['lang'] == 'fr']

Match ids

fr_email_zones = []
for g_mail in gmane_fr:
    for c_ann in cleverly_fr:
        if g_mail['_id'] == c_ann['_id']:
            fr_email_zones.append((c_ann['_id'], g_mail['text'], c_ann['zones']))
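
The nested loop above compares every Gmane email with every annotation. As an alternative sketch (not the authors' code), the annotations can be indexed by `_id` first so that each email needs a single dictionary lookup. The function name `match_zones` and the toy records are hypothetical; the record shapes mirror `gmane_fr` and `cleverly_fr` above.

```python
def match_zones(gmane_emails, annotations):
    """Join emails and annotations on "_id".

    Returns (id, text, zones) tuples in the order of `gmane_emails`.
    If two annotations share an id, the later one wins.
    """
    zones_by_id = {ann["_id"]: ann["zones"] for ann in annotations}
    return [
        (mail["_id"], mail["text"], zones_by_id[mail["_id"]])
        for mail in gmane_emails
        if mail["_id"] in zones_by_id
    ]


# Invented toy records with the same keys as the lists built above.
toy_gmane = [
    {"_id": "a", "lang": "fr", "text": "Bonjour"},
    {"_id": "b", "lang": "fr", "text": "Salut"},
]
toy_cleverly = [{"_id": "b", "zones": [["salutation", 0, 5]]}]
print(match_zones(toy_gmane, toy_cleverly))
```

With real data the call would be `match_zones(gmane_fr, cleverly_fr)`, and the result has the same tuple layout as `fr_email_zones`.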
