Multilingual email zoning corpus, containing:
- 625 non-English emails from the Gmane corpus annotated with email zones, including:
  - 215 French emails
  - 210 Portuguese emails
  - 200 Spanish emails
├── LICENSE -- File with licence description (Apache License 2.0.)
├── README.md -- This file with instructions
├── corpus/ -- Folder with Cleverly Zoning Corpus
│ ├── es_annotator1.json -- Spanish annotations done by our first annotator (A1)
│ ├── es_annotator2.json -- Spanish annotations done by our second annotator (A2)
│ ├── fr_annotator1.json -- French annotations done by our first annotator (A1)
│ ├── fr_annotator2.json -- French annotations done by our second annotator (A2)
│ ├── pt_annotator1.json -- Portuguese annotations done by our first annotator (A1)
│ └── pt_annotator2.json -- Portuguese annotations done by our second annotator (A2)
└── paper/
    └── EACL_2021__Multilingual_Email_Zoning.pdf -- PDF with the original paper presenting the Cleverly Zoning Corpus
If you use this corpus, please cite the original paper:
Bruno Jardim, Ricardo Rei and Mariana S. C. Almeida. "Multilingual Email Zoning". EACL 2021 Student Research Workshop, April 2021.
with the bibtex:
@inproceedings{BJardim2021,
author = {Bruno Jardim and Ricardo Rei and Mariana S. C. Almeida},
title = {Multilingual Email Zoning},
booktitle = {EACL 2021 Student Research Workshop},
publisher = {Association for Computational Linguistics},
year = {2021},
month = {April},
location = {online},
url = {https://arxiv.org/abs/2102.00461},
}
The segmentation of emails into functional zones (also dubbed email zoning) is a relevant preprocessing step for most NLP tasks that deal with emails. However, despite the multilingual character of emails and their applications, previous literature regarding email zoning corpora and systems was developed essentially for English. In this paper, we analyse the existing email zoning corpora and propose Cleverly zoning corpus, a new multilingual benchmark composed of 625 emails in Portuguese, Spanish and French. Moreover, we introduce Okapi, the first multilingual email segmentation model based on a language agnostic sentence encoder. Besides generalizing well for unseen languages, our model is competitive with current English benchmarks, and reached new state-of-the-art performances for domain adaptation tasks in English.
Apache License 2.0.
More details are in the LICENSE file in the root of this project.
To build a multilingual email corpus, we searched the Gmane corpus (Bevendorff et al., 2020) for Portuguese, Spanish and French emails. Then, following the classification schema proposed in Bevendorff et al. (2020), we produced a total of 625 annotated emails that compose the Cleverly Email Zoning Corpus. Each email is segmented into the following zones:
- paragraph: main content; new content from the current email sender.
- closing: content that closes the email (e.g. Regards, Some Name).
- inline_headers: lines with information used to contextualize the details of a previous message in the thread, such as addresses, dates, recipients, etc.
- log_data: contain information about configurations or events that have occurred within a software application like error messages or loading of files.
- mua_signature: include legal disclaimers, privacy statements, or other automatically generated texts advising the people that interact with the email. Also contained under MUA signature are automatically generated message lines that state which device the email was sent from, the mail user agent, and mailing list details or advertisements (e.g. Sent from iPhone).
- patch: contain text with pieces of code used to make changes to a computer program or its supporting data in order to update, fix, or improve it. This includes fixing security vulnerabilities and other bugs (source code diffs).
- personal_signature: consider content containing contact or other personal information such as name, job title, company, phone number, etc., that is automatically inserted at the end of an email message.
- quotation: include contents quoted from previous messages in the same conversation thread and forwarded content from other conversations. Each line typically starts with a “>”.
- quotation_marker: initiate quotation zones stating the quotation author and date of a quotation.
- raw_code: contain text regarding code being developed in any programming languages (source code).
- salutation: contain the terms of address and recipient names at the beginning of a message.
- section_heading: lines that are headers of other zones, mainly of paragraph zone.
- tabular: text in a semi-structured tabular or matrix format.
- technical: contain lines of automatic messages generated by Gmane regarding initiation, ending or elimination of email parts; inline attachments or PGP signatures.
- visual_separator: lines with no alphanumeric characters, used to divide the text between multiple zones.
While French is the language with the most emails, Portuguese and Spanish emails tend to be longer, resulting in a greater number of lines and an overall higher number of zones per email.
The distribution of zones is quite similar across the three languages, with the percentage of the total lines being: quotation (52.2% of all the lines in the corpus), paragraph (19.8%), MUA signature (8.8%), personal signature (3.7%), visual separator (2.7%), quotation marker (2.7%), closing (
| | pt | es | fr |
|---|---|---|---|
| emails | 210 | 200 | 215 |
| zones | 15 | 14 | 14 |
| lines | 12366 | 9824 | 6958 |
| lines per email | 58.9 | 49.1 | 32.4 |
| zones per email | 8.6 | 6.5 | 5.9 |
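The "lines per email" row follows directly from the "lines" and "emails" rows. The short sketch below only re-derives those averages from the totals in the table; the names `corpus_stats` and `lines_per_email` are introduced here for illustration and are not part of the corpus tooling.

```python
# Totals copied from the statistics table above.
corpus_stats = {
    "pt": {"emails": 210, "lines": 12366},
    "es": {"emails": 200, "lines": 9824},
    "fr": {"emails": 215, "lines": 6958},
}


def lines_per_email(stats):
    """Average number of lines per email for each language, rounded to one decimal."""
    return {
        lang: round(values["lines"] / values["emails"], 1)
        for lang, values in stats.items()
    }


averages = lines_per_email(corpus_stats)
total_emails = sum(values["emails"] for values in corpus_stats.values())
print(averages)
print(total_emails)
```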
The annotation was carried out using Tagtog by two annotators. The first annotator was a native Portuguese speaker and the second annotator a native Spanish speaker, both with an academic background in French and fluent in the third language. Each email was annotated by both annotators, with the following inter-annotator agreement:
| | pt | es | fr |
|---|---|---|---|
| accuracy | 0.93 | 0.92 | 0.96 |
| F1 A1A2 | 0.93 | 0.92 | 0.96 |
| F1 A2A1 | 0.94 | 0.92 | 0.96 |
| k (Cohen's Kappa) | 0.90 | 0.87 | 0.94 |
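For readers who want to compute similar agreement figures, the sketch below calculates accuracy (observed agreement) and Cohen's kappa from two aligned label sequences. It assumes one zone label per line from each annotator, which may differ from the exact procedure used to produce the table; the function name `agreement_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def agreement_scores(labels_a, labels_b):
    """Return (accuracy, Cohen's kappa) for two aligned label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[zone] * counts_b[zone] for zone in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return observed, 1.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy example: four lines labelled by two annotators (made-up data).
a1 = ["salutation", "paragraph", "paragraph", "closing"]
a2 = ["salutation", "paragraph", "closing", "closing"]
accuracy, kappa = agreement_scores(a1, a2)
print(accuracy, kappa)
```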
The corpus consists of 6 files (one for each combination of the 3 languages and 2 annotators). Each annotation file contains a list of document annotations, which are stored as dictionaries with the following structure:
{"_id": "<urn:uuid:aa5e973e-65fc-4d32-a0d1-7848eaf405e8>",
"zones": [["salutation", 0, 11],
["paragraph", 11, 601],
["closing", 601, 610],
["personal_signature", 610, 677],
["mua_signature", 677, 759],
["quotation_marker", 759, 829],
["quotation", 829, 1150]]}
The first key ("_id") holds the email ID from the original Gmane corpus. The second key ("zones") contains the annotated zone spans of that email as a list of lists, where each sublist contains the name of the zone ("salutation", "paragraph", "closing", etc.), followed by the starting and ending character offsets of that zone in the email.
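As an illustration of how these spans can be applied to an email body, the sketch below slices a text into its zones. It assumes the end offset is exclusive, which the contiguous spans in the example above suggest but which is not stated explicitly. The helper `split_into_zones`, the toy email and its id are hypothetical.

```python
def split_into_zones(text, zones):
    """Return (zone_name, segment) pairs, treating each span as [start, end)."""
    return [(name, text[start:end]) for name, start, end in zones]


# Toy email and annotation (made-up data in the corpus format).
email_text = "Hi Ana,\nSee you Monday.\nBest,\nRui\n"
annotation = {
    "_id": "<urn:uuid:example>",
    "zones": [["salutation", 0, 8], ["paragraph", 8, 24], ["closing", 24, 34]],
}

segments = split_into_zones(email_text, annotation["zones"])
for name, segment in segments:
    print(name, repr(segment))
```

Because the toy spans are contiguous, joining the segments reproduces the original text; real annotations may leave gaps between spans.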
To retrieve the email content:
- Request access to the full Gmane corpus from: https://webis.de/data/webis-gmane-19.html
- Download the webis-gmane-19-part04 folder
- Access the .ndjson file part-0051
- Retrieve the first 300000 emails. (See instructions about how to extract file content from: https://webis.de/data/webis-gmane-19.html and Gmane README.md)
- Match ids with our zone annotations
Open one of the annotation files (French)
import io
import json

with io.open('fr_annotator1.json', encoding="utf-8") as json_file:
    cleverly_fr = json.load(json_file)
After requesting access to the full Gmane corpus (as outlined above), open the part-0051.ndjson file and extract the first 3000000 lines
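from itertools import islice

gmane = []
# Read only the first 3000000 lines; each line is one JSON object
with open('webis-gmane-19-part04/part-0051.ndjson', 'r', encoding="utf-8") as ndjson_file:
    for line in islice(ndjson_file, 3000000):
        gmane.append(json.loads(line))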
Reassemble ids and email content
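gmane_id_email = []
for j, line in enumerate(gmane):
    if j % 2 == 0:
        # Even lines carry the index metadata with the email id
        _id = line['index']['_id']
    else:
        # Odd lines carry the email itself (language and plain text)
        dic = {'_id': _id, 'lang': line['lang'], 'text': line['text_plain']}
        gmane_id_email.append(dic)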
Filter for correct language
gmane_fr = []
for email in gmane_id_email:
    if email['lang'] == 'fr':
        gmane_fr.append(email)
Match ids
fr_email_zones = []
for g_mail in gmane_fr:
    for c_ann in cleverly_fr:
        if g_mail['_id'] == c_ann['_id']:
            fr_email_zones.append((c_ann['_id'], g_mail['text'], c_ann['zones']))
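Since the corpus statistics are reported per line, it can be useful to turn the character spans of a matched email into one label per line. The sketch below is one possible way to do this, assuming exclusive end offsets and assigning each line the zone that contains its first character. The helper `line_labels` and the toy data are hypothetical and not part of the original pipeline.

```python
def line_labels(text, zones):
    """Assign each line the zone whose [start, end) span contains its first character."""
    labels = []
    offset = 0
    for line in text.splitlines(keepends=True):
        label = None  # stays None if no zone covers the start of this line
        for name, start, end in zones:
            if start <= offset < end:
                label = name
                break
        labels.append(label)
        offset += len(line)
    return labels


# Toy email in the same (id, text, zones) shape as the matched tuples above.
toy_id = "<urn:uuid:example>"
toy_text = "Hi Ana,\nSee you Monday.\nBest,\nRui\n"
toy_zones = [["salutation", 0, 8], ["paragraph", 8, 24], ["closing", 24, 34]]

labels = line_labels(toy_text, toy_zones)
print(labels)
```

The same function could be applied to each `(_id, text, zones)` tuple in `fr_email_zones` to obtain line-level labels for the French emails.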