Scans mail folders and returns a count of repeated words
Python
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
LICENSE
README.md
getMailTopKeywords.py

README.md

getMailTopKeywords

Scans mail folders and returns a count of repeated words. I'm using it with a backup of Gmail I got using "Backup Gmail" https://code.launchpad.net/~cfraire/backup-gmail/devel

Requires pyzmail library: http://www.magiksys.net/pyzmail/

usage: getMailTopKeywords.py [-h] [-c COUNT] [-a] [-l LANG] [-n] folder

positional arguments:
  folder                the folder to process

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        show words with count greater or equal than this
                        number. Default: 10
  -a, --aspell          whether to check the words using Aspell. Default: no
                        Requires python-aspell library. See
                        https://github.com/WojciechMula/aspell-python.
  -l LANG, --lang LANG  check words in that language (only if -a is set).
                        Needs aspell dictionary in that language installed.
                        Default: en
  -n, --nltk            use Natural Language Toolkit. Default: no
                        Requires nltk library.
                        See http://nltk.org/index.html. 

Example:

./getMailTopKeywords.py -a -c 10 -l es /home/esalgado/GmailBackup/2005-12
Reading from /home/USER/GmailBackup/2005-12 ... 
Analyzing 42 mails...
[==================================================] 100%
Building a list of the top keywords...
Top keywords:
  mas 21
  correo 20
  foto 18
  infantil 15
  seguridad 13
  enero 13
  parque 11
  servicios 11
  nuevos 11

Example using NLTK (extracts the roots of the words):

./getMailTopKeywords.py -a -c 10 -l es -n /home/esalgado/GmailBackup/2005-12
Reading from /home/USER/GmailBackup/2005-12 ... 
Analyzing 42 mails...
[==================================================] 100%
Building a list of the top keywords...
Processing using Natural Language Toolkit...
Top keywords:
  mas 21
  envi 20
  corre 20
  fot 20
  nuev 19
  infantil 15
  servici 14
  activ 14
  ener 13
  segur 13
  mes 12
  monitor 12
  parqu 11