Fetching and Processing Emails

Panagiotis Antoniadis edited this page Jul 21, 2019 · 4 revisions

Tools

The extraction.py tool fetches all of the user's sent emails, processes them into the desired format, and saves them. Each email is saved in a sentence-per-line format, which makes it easier to manipulate individual sentences later during clustering.

Usage:

$ python extraction.py -h
usage: extraction.py [-h] --output OUTPUT [--reload] [--info] [--sentence]

Tool for extracting sent emails from a user's account

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --output OUTPUT  Output directory

optional arguments:
  --reload         If true, remove any existing account.
  --info           If true, create an info file containing the headers.
  --sentence       If true, save each sentence of the emails in separate
                   files.

A token.pickle file is created automatically when the authorization flow completes for the first time. To fetch all sent emails from a new email account and save them in the emails directory, pass the --reload flag, which removes any existing account and takes no value, as follows:

$ python extraction.py --out emails --reload
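
The help text above looks like argparse output, so the command-line interface can be approximated as follows. This is only a sketch under that assumption, not the tool's actual source: the group names and help strings are copied from the usage message, and build_parser is a hypothetical name. It also illustrates why --out is accepted in place of --output, since argparse resolves unambiguous prefixes by default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild a parser whose options match the usage message shown above."""
    parser = argparse.ArgumentParser(
        prog="extraction.py",
        description="Tool for extracting sent emails from a user's account",
    )
    required = parser.add_argument_group("required arguments")
    required.add_argument("--output", required=True, help="Output directory")

    optional = parser.add_argument_group("optional arguments")
    # store_true flags take no value: the presence of the flag sets it to True.
    optional.add_argument("--reload", action="store_true",
                          help="If true, remove any existing account.")
    optional.add_argument("--info", action="store_true",
                          help="If true, create an info file containing the headers.")
    optional.add_argument("--sentence", action="store_true",
                          help="If true, save each sentence of the emails in separate files.")
    return parser


# "--out" is an unambiguous prefix of "--output", so argparse accepts it.
args = build_parser().parse_args(["--out", "emails", "--reload"])
print(args.output, args.reload, args.info, args.sentence)
```

With flags defined this way, a value such as True after --reload would be rejected as an unrecognized argument.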

Libraries and APIs

Connection

The tool connects to the email account through the Gmail API, which provides flexible RESTful access to the emails of a Gmail account. As a result, only Gmail accounts are currently supported, although the tool could be extended to other email providers.
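
As a rough illustration of the fetching step, the sketch below pages through the messages that carry the SENT label. It assumes that service is a Gmail client created with googleapiclient.discovery.build("gmail", "v1", credentials=...) after the authorization flow. The function name iter_sent_message_ids is hypothetical, and the real extraction.py may be organised differently.

```python
from typing import Any, Iterator


def iter_sent_message_ids(service: Any, user_id: str = "me") -> Iterator[str]:
    """Yield the id of every message with the SENT label, one page at a time."""
    page_token = None
    while True:
        params = {"userId": user_id, "labelIds": ["SENT"]}
        if page_token:
            # Only send pageToken when the previous page announced a next one.
            params["pageToken"] = page_token
        response = service.users().messages().list(**params).execute()
        for message in response.get("messages", []):
            yield message["id"]
        page_token = response.get("nextPageToken")
        if not page_token:
            break
```

The list call returns only message ids, so the body of each message still has to be requested separately, for example with users().messages().get.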

Processing

After fetching, the body of each email still contains a lot of unwanted content that must be removed. The clean body should contain only Greek words, since it is used as input to the language model tool. To achieve this, we use:

  • BeautifulSoup library to remove all HTML markup.
  • num2words library to convert numbers to English words. These words are then translated into Greek using convert_num.py.
  • alphabet-detector library to detect and keep Greek words.
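
To make two of the steps above concrete, here is a minimal sketch that strips HTML markup and keeps only Greek words. It deliberately swaps in the Python standard library (html.parser and unicodedata) for BeautifulSoup and alphabet-detector so that it runs without extra dependencies. The function names are hypothetical, and number conversion is left out because convert_num.py is not shown here.

```python
import unicodedata
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, dropping every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(html: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_greek_word(word: str) -> bool:
    """True if every character of the word is a Greek letter."""
    # NFC turns letter + combining accent pairs into single precomposed letters.
    word = unicodedata.normalize("NFC", word)
    return bool(word) and all(
        ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK") for ch in word
    )


def keep_greek_words(text: str) -> List[str]:
    """Keep the whitespace-separated tokens that are entirely Greek."""
    return [w for w in text.split() if is_greek_word(w)]


words = keep_greek_words(strip_html("<p>μάθημα <b>Machine Learning</b> με κωδικό</p>"))
print(words)
```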

Also, some emails contain the whole history of the conversation between sender and receiver. Since we need only the newly sent email, the quoted earlier messages are removed. Finally, we remove all punctuation and non-alphabetic characters and convert all characters to lowercase. An example follows:

Before:

Καλησπέρα σας,

Θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα Machine Learning με κωδικό 12345.

--
Αντωνιάδης Παναγιώτης 

After:

['καλησπέρα σας', 'θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα με κωδικό δώδεκα χιλιάδες τριακόσια σαράντα πέντε ']

We can see that the signature has been removed, the non-Greek words Machine Learning have been dropped, and the course code has been converted into Greek words. Note that the salutation καλησπέρα σας is treated as a separate sentence for semantic reasons.
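
The transformation in this example can be approximated with the sketch below. It is an illustration under stated assumptions rather than the tool's code: the signature is assumed to start at a line containing only --, sentences are assumed to end at line breaks or at ., ; or !, and Greek words are recognised by a Unicode range check. Number conversion and removal of quoted history are not covered, so the course code is simply dropped here. The names drop_signature and clean_sentences are hypothetical.

```python
import re
import unicodedata
from typing import List

# Greek and Coptic block plus the Greek Extended (polytonic) block.
GREEK_WORD = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]+")
# A line consisting only of "--" (optionally followed by blanks) starts the signature.
SIGNATURE_DELIMITER = re.compile(r"^--[ \t]*$", re.MULTILINE)


def drop_signature(body: str) -> str:
    """Keep only the text before the first signature delimiter line."""
    return SIGNATURE_DELIMITER.split(body, maxsplit=1)[0]


def clean_sentences(body: str) -> List[str]:
    """Return lowercase, punctuation-free sentences made of Greek words only."""
    body = unicodedata.normalize("NFC", body)
    sentences = []
    # Line breaks and . ; ! are treated as sentence boundaries
    # (";" is the question mark in Greek).
    for raw in re.split(r"[.;!\n]+", drop_signature(body)):
        words = [w.lower() for w in re.findall(r"\w+", raw) if GREEK_WORD.fullmatch(w)]
        if words:
            sentences.append(" ".join(words))
    return sentences


example = (
    "Καλησπέρα σας,\n\n"
    "Θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα Machine Learning με κωδικό 12345.\n\n"
    "--\n"
    "Αντωνιάδης Παναγιώτης\n"
)
print(clean_sentences(example))
```

Because the salutation sits on its own line, it comes out as a separate sentence, as in the example above.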

Finally, each clean email is saved in the output directory as email_{id} (one sentence per line). In addition, passing the --info flag saves an info file that contains the headers of the emails in the following format: sender | receiver | subject.
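
The output layout described above could be written as in the following sketch. The email_{id} naming and the sender | receiver | subject format come from this page, while the function names and the info file name "info" are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, Tuple


def save_email(out_dir: Path, email_id: str, sentences: Iterable[str]) -> Path:
    """Write one clean email as email_{id}, one sentence per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"email_{email_id}"
    path.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return path


def save_info(out_dir: Path, headers: Iterable[Tuple[str, str, str]]) -> Path:
    """Write one 'sender | receiver | subject' line per email."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "info"  # hypothetical file name
    lines = [" | ".join(header) for header in headers]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For instance, save_email(Path("emails"), "0", ["καλησπέρα σας"]) would create emails/email_0 containing a single line.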