cleverly-ai/multilingual-email-zoning

Cleverly Zoning Corpus

Multilingual email zoning corpus, containing:

  • 625 non-English emails from the Gmane corpus annotated with email zones, including:

    • 215 French emails
    • 210 Portuguese emails
    • 200 Spanish emails

Table of contents

  1. Files and folders
  2. Reference
  3. Licence
  4. Corpus description
    1. Email zones
    2. Statistics
    3. Annotation format
  5. How to use Cleverly Zoning Corpus
    1. Load the email textual content
    2. Load Cleverly corpus and match Ids (Python)

Files and folders

├── LICENSE                     -- File with licence description (Apache License 2.0.)
├── README.md                   -- This file with instructions
├── corpus/                     -- Folder with Cleverly Zoning Corpus
│   ├── es_annotator1.json      -- Spanish annotations done by our first annotator (A1)
│   ├── es_annotator2.json      -- Spanish annotations done by our second annotator (A2)
│   ├── fr_annotator1.json      -- French annotations done by our first annotator (A1)
│   ├── fr_annotator2.json      -- French annotations done by our second annotator (A2)
│   ├── pt_annotator1.json      -- Portuguese annotations done by our first annotator (A1)
│   └── pt_annotator2.json      -- Portuguese annotations done by our second annotator (A2)
└── paper/
    └── EACL_2021__Multilingual_Email_Zoning.pdf   -- PDF with the original paper presenting the Cleverly Zoning Corpus 

Reference

If you use this corpus, please cite the original paper:

Bruno Jardim, Ricardo Rei and Mariana S. C. Almeida. "Multilingual Email Zoning". EACL 2021 Student Research Workshop, April 2021.

with the bibtex:

@inproceedings{BJardim2021,
 author = {Bruno Jardim and Ricardo Rei and Mariana S. C. Almeida},
 title = {Multilingual Email Zoning},
 booktitle = {EACL 2021 Student Research Workshop},
 publisher = {Association for Computational Linguistics},
 year = {2021},
 month = {April},
 location = {online},
 url = {https://arxiv.org/abs/2102.00461},
} 

Abstract

The segmentation of emails into functional zones (also dubbed email zoning) is a relevant preprocessing step for most NLP tasks that deal with emails. However, despite the multilingual character of emails and their applications, previous literature regarding email zoning corpora and systems was developed essentially for English. In this paper, we analyse the existing email zoning corpora and propose Cleverly zoning corpus, a new multilingual benchmark composed of 625 emails in Portuguese, Spanish and French. Moreover, we introduce Okapi, the first multilingual email segmentation model based on a language agnostic sentence encoder. Besides generalizing well for unseen languages, our model is competitive with current English benchmarks, and reached new state-of-the-art performances for domain adaptation tasks in English.

Licence

Apache License 2.0.

More details are in the LICENSE file in the root of this project.

Corpus description

To build a multilingual email corpus, we searched the Gmane corpus (Bevendorff et al., 2020) for Portuguese, Spanish and French emails. Then, following the classification schema proposed in Bevendorff et al. (2020), we produced a total of 625 annotated emails that make up the Cleverly Email Zoning Corpus.

Email zones

  • paragraph: main content; new content from the current email sender.
  • closing: content that closes the email (e.g. Regards, Some Name).
  • inline_headers: lines with information used to contextualize a previous message in the thread, such as addresses, dates, recipients, etc.
  • log_data: information about configurations or events that occurred within a software application, such as error messages or file loading.
  • mua_signature: legal disclaimers, privacy statements, or other automatically generated text advising the people who interact with the email. Also includes automatically generated lines stating the device the email was sent from, the mail user agent, and mailing list details or advertisements (e.g. Sent from iPhone).
  • patch: pieces of code used to change a computer program or its supporting data in order to update, fix, or improve it, including fixes for security vulnerabilities and other bugs (source code diffs).
  • personal_signature: contact or other personal information such as name, job title, company, phone number, etc., that is automatically inserted at the end of an email message.
  • quotation: content quoted from previous messages in the same conversation thread and forwarded content from other conversations. Each line typically starts with a “>”.
  • quotation_marker: lines that introduce a quotation zone, stating the author and date of the quoted message.
  • raw_code: code being developed in any programming language (source code).
  • salutation: terms of address and recipient names at the beginning of a message.
  • section_heading: lines that are headers of other zones, mainly of the paragraph zone.
  • tabular: text in a semi-structured tabular or matrix format.
  • technical: automatic messages generated by Gmane about the start, end or removal of email parts; inline attachments or PGP signatures.
  • visual_separator: lines with no alphanumeric characters, used to divide the text between multiple zones.

Statistics

French has the most emails, but Portuguese and Spanish emails tend to be longer, resulting in more lines and more zones per email. The distribution of zones is quite similar across the three languages. As a percentage of all lines in the corpus, the zones are: quotation (52.2%), paragraph (19.8%), MUA signature (8.8%), personal signature (3.7%), visual separator (2.7%), quotation marker (2.7%), closing (2.5%), raw code (2.0%), inline headers (1.9%), salutation (1.0%), tabular (0.4%), technical (0.3%), patch (0.2%) and section heading (0.1%). Spanish and French emails do not contain any section heading lines.

                      pt      es      fr
emails               210     200     215
zones                 15      14      14
lines              12366    9824    6958
lines per email     58.9    49.1    32.4
zones per email      8.6     6.5     5.9

The annotation was carried out in Tagtog by two annotators. The first annotator was a native Portuguese speaker and the second a native Spanish speaker, both with an academic background in French and fluent in the third language. Each email was annotated by both annotators, with the following inter-annotator agreement:

                      pt      es      fr
accuracy            0.93    0.92    0.96
F1 A1A2             0.93    0.92    0.96
F1 A2A1             0.94    0.92    0.96
κ (Cohen's kappa)   0.90    0.87    0.94
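
For reference, accuracy and Cohen's kappa can be computed from two aligned label sequences as sketched below. This is a generic illustration, not the authors' evaluation script: the unit of comparison (one label per line is assumed here), the function names and the toy labels are all assumptions.

```python
from collections import Counter


def accuracy(labels_a, labels_b):
    """Fraction of positions where both annotators chose the same label."""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two aligned label sequences."""
    n = len(labels_a)
    p_o = accuracy(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected chance agreement, from each annotator's own label distribution
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy per-line labels, invented for illustration only
a1 = ["salutation", "paragraph", "paragraph", "closing", "quotation", "quotation"]
a2 = ["salutation", "paragraph", "closing", "closing", "quotation", "quotation"]
print(accuracy(a1, a2), cohen_kappa(a1, a2))
```

Kappa is undefined when the expected agreement equals 1 (both annotators always use the same single label), so a production version would need to guard against that case.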

Annotation format

The corpus consists of 6 files (one for each of the 3 languages and 2 annotators). Each annotation file contains a list of document annotations, stored as dictionaries with the following structure:

{"_id": "<urn:uuid:aa5e973e-65fc-4d32-a0d1-7848eaf405e8>", 
"zones": [["salutation", 0, 11], 
         ["paragraph", 11, 601], 
         ["closing", 601, 610], 
         ["personal_signature", 610, 677], 
         ["mua_signature", 677, 759], 
         ["quotation_marker", 759, 829], 
         ["quotation", 829, 1150]]}

The first key ("_id") holds the email id from the original Gmane corpus. The second key ("zones") holds the annotated zone spans of that email as a list of lists, where each sublist contains the name of the zone ("salutation", "paragraph", "closing", etc.), followed by the start and end character offsets of that zone in the email.
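
The spans can be turned back into text by slicing the email body. The sketch below assumes the offsets are 0-based with an exclusive end, which is consistent with the contiguous spans in the example but is not stated explicitly. The helper `extract_zones` and the toy email are hypothetical.

```python
def extract_zones(text, zones):
    """Return (zone_name, zone_text) pairs for one annotated email.

    Assumes 0-based character offsets with an exclusive end, i.e. text[start:end].
    """
    return [(name, text[start:end]) for name, start, end in zones]


# Toy email and annotation, invented for illustration only
toy_text = "Bonjour,\nMerci pour le patch.\nCordialement,\nAnne\n"
toy_zones = [["salutation", 0, 9], ["paragraph", 9, 30], ["closing", 30, 49]]

for name, zone_text in extract_zones(toy_text, toy_zones):
    print(f"{name}: {zone_text!r}")
```

If the offsets turn out to be inclusive, or counted in bytes rather than characters, the slice would need to be adjusted accordingly.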

How to use Cleverly Zoning Corpus

Load the email textual content

To retrieve the email content:

  1. Request access to the full Gmane corpus from: https://webis.de/data/webis-gmane-19.html

  2. Download the webis-gmane-19-part04 folder

  3. Access the .ndjson file part-0051

  4. Retrieve the first 300000 emails. (See the instructions on how to extract file content at https://webis.de/data/webis-gmane-19.html and in the Gmane README.md)

  5. Match ids with our zone annotations

Load Cleverly corpus and match Ids (Python)

Open Cleverly corpus files

Open one of the annotation files (French)

import json

# The annotation files live in the corpus/ folder; adjust the path to your working directory
with open('fr_annotator1.json', encoding="utf-8") as json_file:
    cleverly_fr = json.load(json_file)
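
To work with all six files at once, a small loader along the following lines may be convenient. It is only a sketch: the helper name `load_annotations` is hypothetical, and it assumes the `<lang>_annotator<n>.json` naming shown in the file tree above.

```python
import json
from pathlib import Path


def load_annotations(corpus_dir="corpus"):
    """Load every <lang>_annotator<n>.json file into a dict keyed by (lang, annotator)."""
    annotations = {}
    for path in sorted(Path(corpus_dir).glob("*_annotator*.json")):
        # "fr_annotator1" -> ("fr", "annotator1")
        lang, annotator = path.stem.split("_", 1)
        with path.open(encoding="utf-8") as json_file:
            annotations[(lang, annotator)] = json.load(json_file)
    return annotations
```

For example, `load_annotations("corpus")[("fr", "annotator1")]` would give the same list as `cleverly_fr` above.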

Match Cleverly ids with the ones from the full Gmane corpus to get email text

After requesting access to the full Gmane corpus (as outlined above), open the part-0051.ndjson file and extract the first 3000000 lines

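from itertools import islice

gmane = []
# Parse only the first 3000000 lines of the NDJSON file (one JSON object per line)
with open('webis-gmane-19-part04/part-0051.ndjson', 'r', encoding="utf-8") as ndjson_file:
    for line in islice(ndjson_file, 3000000):
        gmane.append(json.loads(line))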

Reassemble ids and email content

gmane_id_email = []
# Even-numbered entries carry the id; the odd-numbered entry that follows carries the email fields
for j, line in enumerate(gmane):
    if j % 2 == 0:
        _id = line['index']['_id']
    else:
        gmane_id_email.append({'_id': _id, 'lang': line['lang'], 'text': line['text_plain']})

Filter for correct language

# Keep only the emails whose detected language is French
gmane_fr = [email for email in gmane_id_email if email['lang'] == 'fr']

Match ids

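# Index the annotations by id so each Gmane email is matched with a single dictionary lookup
zones_by_id = {c_ann['_id']: c_ann['zones'] for c_ann in cleverly_fr}
fr_email_zones = []
for g_mail in gmane_fr:
    if g_mail['_id'] in zones_by_id:
        fr_email_zones.append((g_mail['_id'], g_mail['text'], zones_by_id[g_mail['_id']]))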
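
Each tuple in `fr_email_zones` holds the id, the email text and the character-level zone spans. Since the corpus statistics are reported in lines, it can be useful to convert the spans into one label per line. The sketch below is one possible conversion, not the authors' procedure: it assumes 0-based offsets with an exclusive end and labels each line with the zone that covers its first character. The helper `line_labels` and the toy email are hypothetical.

```python
def line_labels(text, zones):
    """Label each line of `text` with the zone covering the line's first character.

    Assumes 0-based offsets with an exclusive end; lines not covered by any
    zone get the label None.
    """
    labels = []
    offset = 0
    # keepends=True keeps line terminators so that offsets stay aligned with `text`
    for line in text.splitlines(keepends=True):
        label = None
        for name, start, end in zones:
            if start <= offset < end:
                label = name
                break
        labels.append((label, line.rstrip("\r\n")))
        offset += len(line)
    return labels


# Toy email and annotation, invented for illustration only
toy_text = "Bonjour,\nMerci pour le patch.\nCordialement,\nAnne\n"
toy_zones = [["salutation", 0, 9], ["paragraph", 9, 30], ["closing", 30, 49]]

for label, line in line_labels(toy_text, toy_zones):
    print(label, "|", line)
```

With real data, the same function could be applied as `line_labels(text, zones)` for each `(_id, text, zones)` tuple in `fr_email_zones`.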
