# Extracting information from email data using Regular Expressions

* With material from the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)

* Regular expressions (REs, regexes) are a mechanism by which we can describe strings.

* This way we can search for strings that have a specific format, rather than being limited to searching for specific strings.

* For example, let's say we don't want to find one specific phone number in some data, but all the phone numbers in the data. How are we going to do this?

* A regular expression consists of characters and symbols.

* Most characters *match* themselves.

 * The regular expression `test` matches the string `test`.

 * The regular expression `banana` matches the string `banana`.

* But there are exceptions (and that's where things get interesting).

* Some characters are *metacharacters*, and do not match themselves.

* They signal that something other than a literal character should be matched, or they change how other parts of the expression match.

* Metacharacters are as follows:

 `. ^ $ * + ? { } [ ] \ | ( )`

* The metacharacters `[` and `]` are used to define a *character class*, a set of characters we want to match.

* For example, the regular expression `[abc]` matches `a`, `b`, or `c`.

* The regular expression `[fgm]ood` matches the words `food`, `good`, `mood`.

* We can define a range of characters using `-`.

* Instead of writing `[abc]` we can write `[a-c]`.

* So if we want to match any lowercase Latin character we can use the regular expression `[a-z]`.
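* The character classes above can be tried out with Python's standard `re` module. This is only a minimal sketch: the word list and the sample strings are made up for illustration.

```python
import re

# [fgm]ood: one of f, g or m, followed by the literal characters "ood".
words = ["food", "good", "mood", "wood"]  # made-up sample words
ood_matches = [w for w in words if re.fullmatch(r"[fgm]ood", w)]
print(ood_matches)  # ['food', 'good', 'mood']

# [a-c] is shorthand for [abc].
abc_hits = re.findall(r"[a-c]", "abcdef")
print(abc_hits)  # ['a', 'b', 'c']

# [a-z] matches any single lowercase Latin letter.
lowercase = re.findall(r"[a-z]", "Hello World")
print(lowercase)  # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
```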

* Most metacharacters lose their special meaning and behave like ordinary characters inside a class.

* `[akm$]` will match any of `a`, `k`, `m`, or `$`.

* `$` is usually a metacharacter, but not inside a class description.

* If we want to match the characters that are *not* included in a class, we can *complement* it.

* We do this by putting `^` as the first character of the class.

* The expression `[^5]` will match any character *except* `5`.

* The expression `[^m]iss` will match `kiss`, `hiss`, `diss`, as well as any other four-character string that ends in `iss` and does not begin with `m`.

* If `^` is not at the beginning of a class it loses its special meaning.

* `[5^]` will match `5` or `^`.
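* A short sketch of how metacharacters and `^` behave inside a class, using Python's `re` module. The sample strings are invented for the example.

```python
import re

# Inside a class, $ is an ordinary character.
dollar_hits = re.findall(r"[akm$]", "make $5")
print(dollar_hits)  # ['m', 'a', 'k', '$']

# A leading ^ complements the class: everything except 5.
not_five = re.findall(r"[^5]", "1525")
print(not_five)  # ['1', '2']

# [^m]iss: any character other than m, followed by "iss".
words = ["kiss", "hiss", "diss", "miss"]
iss_matches = [w for w in words if re.fullmatch(r"[^m]iss", w)]
print(iss_matches)  # ['kiss', 'hiss', 'diss']

# When ^ is not the first character, it is just a literal ^.
five_or_caret = re.findall(r"[5^]", "5^2")
print(five_or_caret)  # ['5', '^']
```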

* Perhaps the most important metacharacter is the `\` (backslash).

* With this we describe specific classes of characters.

* `\d` describes any decimal digit: `[0-9]`.

* `\D` describes any character that is *not* a decimal digit: `[^0-9]`.

* `\s` describes whitespace characters: `[ \t\n\r\f\v]`:

 * `\f` is the form feed (new page) character
 * `\n` is the new line character (linefeed)
 * `\r` is the carriage return character
 * `\t` is the tab
 * `\v` is the vertical tab

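* `\S` describes any non-whitespace character: `[^ \t\n\r\f\v]`.

* `\w` matches any alphanumeric character or underscore: approximately `[a-zA-Z0-9_]` (in Python 3 string patterns it also matches non-ASCII letters and digits, unless the `re.ASCII` flag is used).

* `\W` matches any character that is not alphanumeric or underscore: approximately `[^a-zA-Z0-9_]`.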

* A special class is *any character*, `.`.

* This matches every character except the newline.
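* The predefined classes can be explored in the same way. The sketch below uses an invented sample string. Its last lines illustrate why `[a-zA-Z0-9_]` is only an approximation of `\w` in Python 3, using the `re.ASCII` flag, which the text above does not otherwise cover.

```python
import re

sample = "Room 42,\tfloor_3\n"  # made-up sample text

digits = re.findall(r"\d", sample)
print(digits)  # ['4', '2', '3']

whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', '\t', '\n']

word_runs = re.findall(r"\w+", sample)
print(word_runs)  # ['Room', '42', 'floor_3']

non_word = re.findall(r"\W", sample)
print(non_word)  # [' ', ',', '\t', '\n']

# The dot matches every character except the newline.
dot_hits = re.findall(r".", "a\nb")
print(dot_hits)  # ['a', 'b']

# For str patterns, \w also accepts non-ASCII letters unless re.ASCII is set.
unicode_word = re.fullmatch(r"\w", "é") is not None
ascii_word = re.fullmatch(r"\w", "é", re.ASCII) is not None
print(unicode_word, ascii_word)  # True False
```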

* Describing character classes isn't the only thing we can do with regular expressions.

* They also let us state that parts of an expression can be repeated.

* The `*` metacharacter means that the preceding character can match zero or more times. It matches as many repetitions as possible, i.e. it is *greedy*.

* The expression `ca*t` can match `ct` (0 `a`s), `cat` (1 `a`), `caaat` (3 `a`s), etc.

* A related metacharacter is `+`, which greedily matches the preceding character *one* or more times.

* The expression `ca+t` can match `cat` (1 `a`), `caaat` (3 `a`s), but not `ct`.
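* A small sketch comparing `*` and `+` with Python's `re` module. The candidate strings are chosen for illustration, and the last lines show what greedy matching means in practice.

```python
import re

candidates = ["ct", "cat", "caaat", "cbt"]  # made-up test strings

# ca*t: zero or more a's between c and t.
star = [s for s in candidates if re.fullmatch(r"ca*t", s)]
print(star)  # ['ct', 'cat', 'caaat']

# ca+t: at least one a between c and t.
plus = [s for s in candidates if re.fullmatch(r"ca+t", s)]
print(plus)  # ['cat', 'caaat']

# Greedy: a* takes as many a's as it can, not just the first one.
greedy = re.search(r"a*", "aaab").group()
print(greedy)  # 'aaa'
```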

* The question mark, `?`, matches the preceding character zero times or once.

* That is, it states that something is optional.

* The expression `home-?brew` matches `homebrew` or `home-brew`.

* We can ask for a specific number of repetitions by writing `{m,n}`, where `m` and `n` are non-negative integers (as we will see, `n` can be omitted).

* This means we want from `m` to `n` repetitions.

* The expression `a/{1,3}b` matches `a/b`, `a//b`, and `a///b`.

* `{0,}` is the same as `*`.

* `{1,}` is the same as `+`.

* `{0,1}` is the same as `?`.
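* The optional and counted repetitions can be checked as follows. This is a sketch with made-up test strings. The equivalence check at the end only samples a few strings, so it illustrates the equivalences rather than proving them.

```python
import re

# -? makes the hyphen optional.
brew_candidates = ["homebrew", "home-brew", "home--brew"]
brew = [s for s in brew_candidates if re.fullmatch(r"home-?brew", s)]
print(brew)  # ['homebrew', 'home-brew']

# /{1,3}: between one and three slashes.
slash_candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
slashes = [s for s in slash_candidates if re.fullmatch(r"a/{1,3}b", s)]
print(slashes)  # ['a/b', 'a//b', 'a///b']

def same_matches(pattern_a, pattern_b, samples):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )

samples = ["ct", "cat", "caat", "caaat"]
star_equiv = same_matches(r"ca{0,}t", r"ca*t", samples)
plus_equiv = same_matches(r"ca{1,}t", r"ca+t", samples)
optional_equiv = same_matches(r"ca{0,1}t", r"ca?t", samples)
print(star_equiv, plus_equiv, optional_equiv)  # True True True
```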

* The metacharacter `|` is the logical "OR".

* The expression `Cat|Dog` matches `Cat` or `Dog`.
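* A brief sketch of alternation in Python, with an invented sentence. Note that `|` applies to the whole expression on each side, so `Cat|Dog` means `Cat` or `Dog`, not `Ca`, then `t` or `D`, then `og`.

```python
import re

# Find every occurrence of either alternative.
pets = re.findall(r"Cat|Dog", "Cat, Dog, Cow, Dog")
print(pets)  # ['Cat', 'Dog', 'Dog']

# "CaDog" would only match if | bound to single characters; it does not.
binds_loosely = re.fullmatch(r"Cat|Dog", "CaDog") is None
print(binds_loosely)  # True
```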

* The `^` metacharacter matches the beginning of the string.

* For example, if we want to match the word `From` only at the beginning of the string, we write `^From`.

* So `^From` matches `From Here to Eternity`.

* But it *doesn't* match `Reciting From Memory`.

* Symmetrical to `^` is the metacharacter `$` which matches the end of the string.

* So `fear$` matches `do not fear`.

* But it doesn't match `fear not`.
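* The anchors can be checked with `re.search`, which looks for a match anywhere in the string. The last part of this sketch goes slightly beyond the text above: it uses the `re.MULTILINE` flag, with which `^` also matches at the start of each line. That is handy for header lines inside a longer message. The two-line `headers` string is a shortened, illustrative stand-in for a real message.

```python
import re

titles = ["From Here to Eternity", "Reciting From Memory"]
starts = [bool(re.search(r"^From", t)) for t in titles]
print(starts)  # [True, False]

phrases = ["do not fear", "fear not"]
ends = [bool(re.search(r"fear$", p)) for p in phrases]
print(ends)  # [True, False]

# By default ^ only matches at the very start of the string...
headers = "Date: Fri, 4 May 2001\nFrom: phillip.allen@enron.com\n"
default_hit = re.search(r"^From", headers) is not None
# ...but with re.MULTILINE it also matches right after each newline.
multiline_hit = re.search(r"^From", headers, re.MULTILINE) is not None
print(default_hit, multiline_hit)  # False True
```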

* If we want to strip a metacharacter of its special meaning, we put `\` in front of it.

* So, to match a literal `$` we write `\$`.
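* A minimal sketch of escaping in Python. The price string is made up. Raw strings (`r"..."`) are used so that Python itself does not interpret the backslash before the `re` module sees it.

```python
import re

# \$ matches a literal dollar sign, here followed by one or more digits.
prices = re.findall(r"\$\d+", "cost: $15 or $20, not 30")
print(prices)  # ['$15', '$20']

# Unescaped, $ is the end-of-string anchor: it matches the empty string
# at the end and never the literal "$" character in the middle.
bare_dollar = re.findall(r"$", "a$b")
print(bare_dollar)  # ['']
```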

* So how do we match phone numbers?

* If our number consists of seven digits we will write:

 `\d{7}`

* If we want to match numbers in Athens, which start with the area code `210`, we write:

 `210 \d{7}`

* But the whitespace may be missing, so it is better to write:

 `210\s?\d{7}`
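* The phone number pattern can be tried on a short piece of text. The numbers below are invented for the example.

```python
import re

# Made-up text containing two Athens-style numbers and one other number.
text = "Call 210 1234567 or 2107654321; fax: 2310 123456."

# 210, an optional whitespace character, then exactly seven digits.
athens = re.findall(r"210\s?\d{7}", text)
print(athens)  # ['210 1234567', '2107654321']
```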

* Let's look at a realistic example.

* The example concerns a set of more than 500,000 emails that were related to the [Enron Scandal](https://en.wikipedia.org/wiki/Enron_scandal).

* These have been compiled into a single CSV file, distributed as a zip archive.

* The file can be downloaded from [Kaggle](https://www.kaggle.com/wcukierski/enron-email-dataset).

In [8]:
import pandas as pd

enron = pd.read_csv('emails.csv.zip')
enron

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...
...,...,...
517396,zufferli-j/sent_items/95.,Message-ID: <26807948.1075842029936.JavaMail.e...
517397,zufferli-j/sent_items/96.,Message-ID: <25835861.1075842029959.JavaMail.e...
517398,zufferli-j/sent_items/97.,Message-ID: <28979867.1075842029988.JavaMail.e...
517399,zufferli-j/sent_items/98.,Message-ID: <22052556.1075842030013.JavaMail.e...


In [9]:
print(enron.loc[1].message)

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the

* Suppose we want to extract the senders of all the messages.

* This can be easily done with a regular expression (which we can find ready-made).

In [10]:
matches = enron.message.str.extract(r'From: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
matches

Unnamed: 0,0
0,phillip.allen@enron.com
1,phillip.allen@enron.com
2,phillip.allen@enron.com
3,phillip.allen@enron.com
4,phillip.allen@enron.com
...,...
517396,john.zufferli@enron.com
517397,john.zufferli@enron.com
517398,john.zufferli@enron.com
517399,john.zufferli@enron.com


* The result is a `DataFrame` in which the extracted sender addresses are in column `0`.

* We can name the group we extract in the regular expression, by writing `(?P<name>...)`, so that the `DataFrame` column gets a friendlier name.

In [11]:
matches = enron.message.str.extract(r'From: (?P<from>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
matches

Unnamed: 0,from
0,phillip.allen@enron.com
1,phillip.allen@enron.com
2,phillip.allen@enron.com
3,phillip.allen@enron.com
4,phillip.allen@enron.com
...,...
517396,john.zufferli@enron.com
517397,john.zufferli@enron.com
517398,john.zufferli@enron.com
517399,john.zufferli@enron.com
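* To see what the named group does outside pandas, here is a minimal sketch with the standard `re` module, applied to the first header lines of the sample message printed earlier. The variable names are illustrative.

```python
import re

# Header lines copied from the sample message shown above.
message = (
    "Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>\n"
    "Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: john.lavorato@enron.com\n"
)

# (?P<from>...) is a capturing group that can be retrieved by name.
pattern = re.compile(
    r"From: (?P<from>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)"
)

match = pattern.search(message)
sender = match.group("from")    # look the group up by its name
by_name = match.groupdict()     # all named groups as a dictionary
print(sender)   # phillip.allen@enron.com
print(by_name)  # {'from': 'phillip.allen@enron.com'}
```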


* Then we can count the messages sent and find the most frequent senders.

In [14]:
matches['from'].value_counts()[:10]

kay.mann@enron.com               16735
vince.kaminski@enron.com         14368
jeff.dasovich@enron.com          11411
pete.davis@enron.com              9149
chris.germany@enron.com           8801
sara.shackleton@enron.com         8777
enron.announcements@enron.com     8587
tana.jones@enron.com              8490
steven.kean@enron.com             6759
kate.symes@enron.com              5438
Name: from, dtype: int64

* `kay.mann@enron.com` is Kay Mann, who was legal counsel at Enron.

* `vince.kaminski@enron.com` is Vincent Kaminski, director of research at Enron. He opposed the practices that led to the bankruptcy (but was not heeded).
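* The extract-and-count steps can also be sketched without pandas, using `re` and `collections.Counter`. The four messages and their `example.com` addresses are made up, and this is an illustration of the idea rather than the code used to produce the counts above.

```python
import re
from collections import Counter

# Made-up miniature "mailbox"; the last entry has no sender header.
messages = [
    "From: alice@example.com\nTo: bob@example.com\n\nHi",
    "From: bob@example.com\nTo: alice@example.com\n\nHello",
    "From: alice@example.com\nTo: carol@example.com\n\nBye",
    "No sender header here",
]

pattern = re.compile(
    r"From: (?P<from>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)"
)

# Keep the named group from every message in which the pattern is found.
senders = [m.group("from") for m in map(pattern.search, messages) if m]

# Count how often each sender appears, most frequent first.
top_senders = Counter(senders).most_common(2)
print(top_senders)  # [('alice@example.com', 2), ('bob@example.com', 1)]
```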