### Discover Python Regular Expressions
<p style='text-align:justify;'>- Regular expressions are used to identify whether a pattern exists in a given sequence of characters (string) or not. They help in manipulating textual data, which is often a pre-requisite for data science projects that involve text mining. You must to have come across some application of regular expressions: they are used at the server side to validate the format of email addresses or password during registration, used for parsing text data files to find, replace or delete certain string, etc.

__Regular Expression in Python__
- In Python, regular expressions are supported by the `re` module. That means that if you want to start using them in your Python scripts, you have to import this module with the help of `import`:

In [1]:
# Import `re`
import re

__Basic Patterns: Ordinary Characters__
- You can easily tackle many basic patterns in Python using the ordinary characters. Ordinary characters are the simplest regular expressions. They match themselves exactly and do not have a speacial meaning in their regular expression syntax.

Example are 'A', 'a', 'X', '5'.

- Ordinary characters can be used to perform simple exact matches:

In [2]:
pattern = r'Cookie'
sequence = 'Cookie'
if re.match(pattern, sequence):
    print('Match!')
else:
    print('Not a match!')

Match!


The `match()` function returns a match object if the text matches the pattern. Otherwise it returns None. The re module also contains several other functions

In [3]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_alphanum_bytes',
 '_alphanum_str',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pattern_type',
 '_pickle',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']

__Wild Card Characters: Special Characters__
- Special characters are characters which do not match themselves as seen but actually have a special meaning when used in a regular expression.
- The most widely used special characters are:

`.`- A period. Matches any single charater except newline character.

In [4]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

`\w`- Lowercase w. Matches any single letter, digit or underscore.

In [5]:
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

-------------------------------------------------------------------------
__Introducing our data set__
- __Context:__

    - Fraudulent e-mails contain criminally deceptive information, usually with the intent of convincing the recipient to give the sender a large amount of money. Perhaps the best known type of fraudulent e-mails is the __Nigerian Letter or “419”__ Fraud.

- __Content:__

    - This dataset is a collection of more than 2,500 "Nigerian" Fraud Letters, dating from 1998 to 2007.<br>
    - These emails are in a single text file. Each e-mail has a header which includes the following information:

        - Return-Path: address the email was sent from
        - X-Sieve: the X-Sieve host (always cmu-sieve 2.0)
        - Message-Id: a unique identifier for each message
        - From: the message sender (sometimes blank)
        - Reply-To: the email address to which replies will be sent
        - To: the email address to which the e-mail was originally set (some are truncated for anonymity)
        - Date: Date e-mail was sent
        - Subject: Subject line of e-mail
        - X-Mailer: The platform the e-mail was sent from
        - MIME-Version: The Multipurpose Internet Mail Extension version
        - Content-Type: type of content & character encoding
        - Content-Transfer-Encoding: encoding in bits
        - X-MIME-Autoconverted: the type of autoconversion done
        - Status: r (read) and o (opened)

__Introducing Python's regex module__
- First, prepare the data set by opening the text file, setting it to read-only, and reading it. We also assign it to a variable, `fh` ('file handle').

In [6]:
fh = open(r'test_emails.txt', 'r').read()

Notice that we precede the directory path with an `r`. This technique converts a string into a raw string, which helps to avoid conflicts caused by how some machines read characters, such as backslashes in directory paths on Windows.

Now, suppose we want to find out who the emails are from. We could try raw Python on its own:
```python
for line in fh.split('\n'):
    if 'From:' in line:
        print(line)
```
Or, we could use regex:

In [7]:
for line in fh.split('\n'):
    if 'From:' in line:
        print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>


In [8]:
import re

for line in re.findall('From: .*', fh):
    print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>


`re.findall()`returns a list of all instances of the pattern in the string. It's one of the most popular functions in Python's built-in `re` module. Let's break it down. The function takes two arguments in the form of `re.findall(pattern, string)`. Here, `pattern` represents the substring we want to find, and `string` represents the main string we want to find it in. The main string can consist of multiple lines.

`.*` are shorthand for string patterns. We'll explain them in detail very, very soon. Suffice to say for now that they match the name and email address in the `From:` field.

Let's take our first look at some common regex patterns before we dive deeper.

__Common regex patterns__<br>

The pattern we used with `re.findall()` above contains a fully spelt out string, `'From:'`. This is useful when we know precisely what we're looking for, right down to the actual letters and whether or not they're upper or lower case. If we don't know the exact format of the strings we want, we'd be lost. Fortunately, regex has basic patterns that account for this scenario. Let's look at the ones we use in this tutorial:

- `\w` matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, and the dash, -.

- `\d` matches digits, which means 0-9.

- `\s` matches whitespace characters, which include the tab, new line, carriage return, and space characters.

- `\S`  matches non-whitespace characters.

- `.` matches any character except the new line character `\n`.

With these regex patterns in hand, you'll quickly understand our code above as we go on to explain it.

__Working with regex patterns__<br>

We can now explain the use of `.*` in the line `re.findall('From: .*', text)` above. Let's look at `.` first:

In [9]:
for line in re.findall('From:.',fh):
    print(line)

From: 
From: 
From: 
From: 
From: 


By adding a `.` next to `From:`, we look for one additional character next to it. Because `.` looks for any character except `\n`, it captures the space character, which we cannot see. We can try more dots to verify this.

In [10]:
for line in re.findall('From:............', fh):
    print(line)

From: "MR. JAMES 
From: "Mr. Ben Su
From: "PRINCE OBO
From: "PRINCE OBO
From: "Maryam Aba


- It looks like adding dots does acquire the rest of the line for us. But, it's tedious and we don't know how many dots to add. This is where the asterisk symbol, `*`, plays a very useful role.
- Let's construct a greedy search for `.` with `*`.

In [11]:
for line in re.findall('From: .*', fh):
    print(line)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
From: "Maryam Abacha" <m_abacha03@www.com>


- Because `*` matches zero or more instances of the pattern indicated on its left, and `.` is on its left here, we are able to acquire all the characters in the `From:` field till the end of the line. This prints out the full line with beautifully succinct code.

- We might even go further and isolate only the name:

In [12]:
match = re.findall('From: .*', fh)

for line in match:
    print(re.findall('\".*\"', line))

['"MR. JAMES NGOLA."']
['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']
['"PRINCE OBONG ELEME"']
['"Maryam Abacha"']


- We iterate through the list. In each cycle, we perform `re.findall` again. This time, the function starts by matching the first quotation mark.
- Notice that we use a backslash next to the first quotation mark. The backslash is a special character used for escaping other special characters. For instance, when we want to use a quotation mark as a string literal instead of a special character, we escape it with a backslash like this: `\"`. If we do not escape the pattern above with backslashes, it would become `"".*""`, which the Python interpreter would read as a period and an asterisk between two empty strings. It would produce an error and break the script. Hence, it's crucial that we escape the quotation marks here with backslashes.
- After the first quotation mark is matched, `.*` acquires all the characters in the line until the next quotation mark, also escaped in the pattern. This gets us just the name, within quotation marks. Each name is also printed within square brackets because `re.findall` returns matches in a list.
- What if we want the email address instead?

In [13]:
for line in match:
    print(re.findall('\w\S*@.*\w', line))

['james_ngola2002@maktoob.com']
['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']
['obong_715@epatra.com']
['m_abacha03@www.com']


Here's how we match just the front part of the email address:

In [14]:
for line in match:
    print(re.findall('\w\S*@', line))

['james_ngola2002@']
['bensul2004nng@']
['obong_715@']
['obong_715@']
['m_abacha03@']


- Emails always contain an `@` symbol, so we start with it. The part of the email before the `@` symbol might contain alphanumeric characters, which means `\w` is required. However, because some emails contain a period or a dash, that's not enough. We add `\S` to look for non-whitespace characters. But, `\w\S` will get only two characters. Add `*` to look for repetitions. The front part of the pattern thus looks like this: `\w\S*@`.
- Now for the pattern behind the `@` symbol:

In [15]:
for line in match:
     print(re.findall('@.*', line))

['@maktoob.com>']
['@spinfinder.com>']
['@epatra.com>']
['@epatra.com>']
['@www.com>']


- The domain name usually contains alphanumeric characters, periods, and a dash sometimes. This is simple, a `.` would do. To make it greedy, we extend the search with a `*`. This allows us to match any character till the end of the line.
- If we look at the line closely, we see that each email is encapsulated within angle brackets, < and >. Our pattern, `.*`, includes the closing bracket, >.
- Let's remedy it:

In [16]:
for line in match:
    print(re.findall('@.*\w', line))

['@maktoob.com']
['@spinfinder.com']
['@epatra.com']
['@epatra.com']
['@www.com']


- Email addresses end with an alphanumeric character, so we cap the pattern with `\w`. Hence, the rear of the @ symbol is `.*\w`, which means that the pattern we want is a group of any type of characters that ends with an alphanumeric character. This excludes `>`.
- Our full email address pattern thus looks like this: `\w\S*@.*\w`.

__Common regex functions__<br>
`re.findall()` is undeniably useful, and the `re` module provides more equally convenient functions. These include:
- `re.search()`
- `re.split()`
- `re.sub()`

We'll take a gander at these one by one before using them to bring some order to the unwieldy mass of the Corpus.

__re.search()__<br>
While `re.findall()` matches all instances of a pattern in a string and returns them in list, `re.search()` matches the first instance of a pattern in a string, and returns it as a `re` match object.

In [17]:
match = re.search('From:.*', fh)

In [18]:
type(match)

_sre.SRE_Match

In [19]:
type(match.group())

str

In [20]:
print(match)

<_sre.SRE_Match object; span=(190, 244), match='From: "MR. JAMES NGOLA." <james_ngola2002@maktoob>


In [21]:
print(match.group())

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>


- Like `re.findall()`, `re.search()` also takes two arguments. The first is the pattern to match, and the second is the string to find it in. Here, we've assigned the results to the `match` variable for neatness.
- Because `re.search()` returns a `re` match object, we can't display the name and email address by printing it directly. Instead, we have to apply the `group()` function to it first. We've printed both their types out in the code above. As we can see, `group()` converts the match object into a string.
- We can also see that printing `match` displays properties beyond the string itself, whereas printing `match.group()` displays only the string.

__re.split()__<br>
Suppose we need a quick way to get the domain name of the email addresses. We could do it with three regex operations, like so:

In [22]:
address = re.findall('From: .*', fh)

for item in address:
    for line in re.findall('\w\S*@.*\w', item):
        username, domain_name = re.split('@', line)
        print('{}, {}'.format(username, domain_name))

james_ngola2002, maktoob.com
bensul2004nng, spinfinder.com
obong_715, epatra.com
obong_715, epatra.com
m_abacha03, www.com


The first line is familiar. We return a list of strings, each containing the contents of the `From:` field, and assign it to a variable. Next, we iterate through the list to find the email addresses. At the same time, we iterate through the email addresses and use the `re` module's `split()` function to snip each address in half, with the @ symbol as the delimiter. Finally, we print it.

__re.sub()__<br>
Another handy `re` function is `re.sub()`. As the function name suggests, it substitutes parts of a string. An example:

In [23]:
sender = re.search('From:.*', fh)
address = sender.group()
email = re.sub('From', 'Email', address)

In [24]:
print(address)

From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>


In [25]:
print(email)

Email: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>


- Here, we've already seen the tasks on the first and second lines performed before. On the third line, we apply `re.sub()` on address, which is the full `From:` field in the email header.
- `re.sub()` takes three arguments. The first is the substring to substitute, the second is a string we want in its place, and the third is the main string itself.

__Regex with pandas__
- Now that we have the basics of regex in hand, we can try something much more sophisticated. However, we need to combine regex with the pandas Python data analysis library.

__Sorting emails with regex and pandas__
- The Corpus is a single text file containing thousands of emails. We'll use regex and pandas to sort the parts of each email into appropriate categories so that the Corpus can be more easily read or analysed.
- We'll sort each email into the following categories:
    - `sender_name`
    - `sender_address`
    - `recipient_address`
    - `recipient_name`
    - `date_sent`
    - `subject`
    - `email_body`
- Each of these categories will become a column in our pandas dataframe or table. This is useful because it lets us work on each column on its own. For instance, we could write code to find out which domain names the emails come from, instead of coding to isolate the email addresses from the other parts first. Essentially, categorising the important parts of our data set allows us to write much more concise code to acquire granular information later on. In turn, concise code reduces the number of operations our machines have to do, which speeds up our analytical process, especially when working with massive data sets.

__Preparing the script__

In [26]:
import re
import pandas as pd
import email

email = []

fh = open(r'test_emails.txt', 'r').read()

- We first import the `re` and `pandas` modules as standard practice dictates, right at the top of the script. We import Python's `email` package as well, which is especially needed for the body of the email. The body of the email is rather complicated to work with using regex alone. It might even require enough cleaning up to warrant its own tutorial. So, we use the well-developed `email` package to save some time and let us focus on learning regex.
- Next, we create an empty list, `emails`, which will store dictionaries. Each dictionary will contain the details of each email.
- We print the results of our code to the screen frequently to illustrate where code goes right or wrong. However, because there are thousands of emails in the data set, this prints thousands of lines to the screen and clogs up this tutorial page. We certainly don't want to make you scroll down thousands of lines of results over and over again. Thus, as we've done at the beginning of the tutorial, we open and read a shorter version of the Corpus. We prepared it by hand just for the purposes of this tutorial. You can use the actual data set at home though. Every time we run a `print()` function, you'll print thousands of lines to the screen in barely a few seconds.

Now, we begin applying regex.

In [27]:
contents = re.split(r'From r', fh)
contents.pop(0)

''

We use the `re` module's split function to split the entire chunk of text in `fh` into a list of separate emails, which we assign to the variable `contents`. This is important because we want to work on the emails one by one, by iterating through the list with a for loop. But, how do we know to split by the string `"From r"`? We know this because we looked into the file before we wrote the script. We didn't have to peruse the thousands of emails in there. Just the first few, to see what the structure of the data looks like. As it so happens, each email is preceded by the string `"From r"`.

One reason we use the Fraudulent Email Corpus in this tutorial is to show that when data is disorganised, unfamiliar, and comes without documentation, we can't rely solely on code to sort it out. It would require a pair of human eyes. As we've just shown, we had to look into the Corpus itself to study its structure. In addition, such data may require a lot of cleaning up, as does this Corpus. For instance, even though we count 3977 emails in this set using the full script we're about to construct for this tutorial, there are actually more. Some emails are not preceded by `"From r"`, and so are not split into their own. We leave our data set as it is for now, though, lest this tutorial never ends.

Notice also that we use `contents.pop(0)` to get rid of the first element in the list. That's because a `"From r"` string precedes the first email. When that string is split, it produces an empty string at index 0. The script we're about to write is designed for emails. If it works on an empty string, it might throw up errors. Getting rid of the empty string lets us avoid these errors from breaking our script.

__Getting every name and address with a for loop__
```python
for item in contents:
    emails_dict = {}
```

In the code above, we use a `for` loop to iterate through `contents` so we can work with each email in turn. We create a dictionary, `email_dict`, that will hold all the details of each email, such as the sender's address and name. In fact, these are the first items we find.
- This is a three-step process. It begins by finding the `From:` field.

```python
for item in contents: # First two lines again so that Jupyter runs the code.
    emails_dict = {}

# Find sender's email address and name.

    # Step 1: find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)
```
With __Step 1__, we find the entire `From:` field using the `re.search()` function. The `.` means any character except `\n`, and `*` extends it to the end of the line. We then assign this to the variable `sender`.

But, data isn't always straightforward. It can contain surprises. For instance, what if there's no `From:` field? The script would throw an error and break. We pre-empt errors from this scenario in __Step 2__.

```python
# Step 2: find the email address and name.
if sender is not None:
    s_email = re.search(r"\w\S*@.*\w", sender.group())
    s_name = re.search(r":.*<", sender.group())
else:
    s_email = None
    s_name = None
```
- To avoid errors resulting from missing `From:` fields, we use an `if` statement to check that `sender` isn't `None`. If it is, we assign `s_email` and `s_name` the value of `None` so that the script can move on instead of breaking unexpectedly.
- In Step 2, we use a familiar regex pattern from before, `\w\S*@.*\w`, which matches the email address.
- We use a different tactic for the name. Each name is bounded by the colon, `:`, of the substring `"From:"` on the left, and by the opening angle bracket, `<`, of the email address on the right. Hence, we use `:.*<` to find the name. We get rid of `:` and `<` from each result in a moment.

```python
print("sender type: " + str(type(sender)))
print("sender.group() type: " + str(type(sender.group())))
print("sender: " + str(sender))
print("sender.group(): " + str(sender.group()))
print("\n")
```
Note that we're not using `sender` as the string to search for in each application of `re.search()`. We've printed out the types for `sender` and `sender.group()` so that we can see the difference. It looks like `sender` is an `re` match object, which we can't search with `re.search()`. However, `sender.group()` is a string, precisely what `re.search()` was built for.

Let's see what `s_email` and `s_name` look like.
```python
print(s_email)
print(s_name)
```
Again, we have match objects. Every time we apply `re.search()` to strings, it produces match objects. We have to turn them into string objects.

Before we do this, recall that if there is no `From:` field, `sender` would have the value of `None`, and so too would `s_email` and `s_name`. Hence, we have to check for this scenario again so that the script doesn't break unexpectedly. Let's see how to construct the code with `s_email` first.
```python
# Step 3A: assign email address as string to a variable.
if s_email is not None:
    sender_email = s_email.group()
else:
    sender_email = None

# Add email address to dictionary.
emails_dict["sender_email"] = sender_email
```
In __Step 3A__, we use an `if` statement to check that `s_email` is not `None`, otherwise it would throw an error and break the script.

Then, we simply convert the `s_email` match object into a string and assign it to the `sender_email` variable. We add this to the `emails_dict` dictionary, which will make it incredibly easy for us to turn the details into a pandas dataframe later on.

We do almost exactly the same for `s_name` in Step 3B.

```python
# Step 3B: remove unwanted substrings, assign to variable.
if s_name is not None:
    sender_name = re.sub("\s*<", "", re.sub(":\s*", "", s_name.group()))
else:
    sender_name = None

# Add sender's name to dictionary.
emails_dict["sender_name"] = sender_name
```
Just as we did before, we first check that `s_name` isn't `None` in Step 3B.

Then, we use the `re` module's `re.sub()` function twice before assigning the string to a variable. First, we remove the colon and any whitespace characters between it and the name. We do this by substituting `:\s*` with an empty string `""`. Then, we remove whitespace characters and the angle bracket on the other side of the name, again substituting it with an empty string. Finally, after assigning the string to `sender_name`, we add it to the dictionary.

Let's check out our results.
```python
print(sender_email)
print(sender_name)
```

Perfect. We've isolated the email address and the sender's name. We've also added them to the dictionary, which will come into play soon.

Now that we've found the sender's email address and name, we do exactly the same set of steps to acquire the recipient's email address and name for the dictionary.

First, we find the the `To:` field.

```python
recipient = re.search(r'To:.*', item)
```
Next, we pre-empt the scenario where `recipient` in `None`.
```python
if recipient is not None:
    r_email = re.search(r'\w\S*@.*\w', recipient.group())
    r_name = re.search(r':.*<', recipient.group())
else:
    r_email = None
    r_name = None
```
If `recipient` isn't `None`, we use `re.search()` to find the match object containing the email address and the recipient's name. Otherwise, we pass `r_email` and `r_name` the value of None.

Then, we turn the match objects into strings and add them to the dictionary.
```python
if r_email is not None:
    recipient_email = r_email.group()
else:
    recipient_email = None

emails_dict["recipient_email"] = recipient_email

if r_name is not None:
    recipient_name = re.sub("\s*<", "", re.sub(":\s*", "", r_name.group()))
else:
    recipient_name = None

emails_dict["recipient_name"] = recipient_name
```
Because the structure of the `From:` and `To:` fields are the same, we can use the same code for both. We need to tailor slightly different code for the other fields.

__Getting the date of the email__
```python
for item in contents:
    emails_dict = {}
    
    date_field = re.search(r'Date:.*', item)
```
We acquire the `Date:` field with the same code for the `From:` and `To:` fields.

And, just as we do for those two fields, we check that the `Date:` field, assigned to the `date_field` variable, is not `None`.

```python
if date_field is not None:
    date = re.search(r"\d+\s\w+\s\d+", date_field.group())
else:
    date = None

print(date_field.group())
```
We've printed out `date_field.group()` so that we can see the structure of the string more clearly. It includes the day, the date in DD MMM YYYY format, and the time. We want just the date. The code for the date is largely the same as for names and email addresses but simpler. Perhaps the only puzzler here is the regex pattern, `\d+\s\w+\s\d+`.

The date starts with a number. Hence, we use `\d` to account for it. However, as the DD part of the date, it could be either one or two digits. Here is where `+` becomes important. In regex, `+` matches 1 or more instances of a pattern on its left. `\d+` would thus match the DD part of the date no matter if it is one or two digits.

After that, there's a space. This is accounted for by `\s`, which looks for whitespace characters. The month is made up of three alphabetical letters, hence `\w+`. Then it hits another space, `\s`. The year is made up of numbers, so we use `\d+` once more.

The full pattern, `\d+\s\w+\s\d+`, works because it is a precise pattern bounded on both sides by whitespace characters.

Next, we do the same check for a value of `None` as before.

```python
if date is not None:
    date_sent = date.group()
    date_star = date_star_test.group()
else:
    date_sent = None

emails_dict["date_sent"] = date_sent
```
If `date` is not `None`, we turn it from a match object into a string and assign it to the variable `date_sent`. We then insert it into the dictionary.

Before we go on, we should note a crucial point. `+` and `*`seem similar but they can produce very different results. Let's use the date string here as an example.
```python
date = re.search(r"\d+\s\w+\s\d+", date_field.group())

# What happens when we use * instead?
date_star_test = re.search(r"\d*\s\w*\s\d*", date_field.group())

date_sent = date.group()
date_star = date_star_test.group()

print(date_sent)
print(date_star)
```
If we use `*`, we'd be matching zero or more occurrences. `+` matches one or more occurrences. We've printed the results for both scenarios. It's a big difference. As you can see, `+` acquires the full date whereas `*` gets a space and the digit `1`.

Next up, the subject line of the email.

__Getting the email subject__

As before, we use the same code and code structure to acquire the information we need.
```python
for item in contents: # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    subject_field = re.search(r"Subject: .*", item)

    if subject_field is not None:
        subject = re.sub(r"Subject: ", "", subject_field.group())
    else:
        subject = None

    emails_dict["subject"] = subject
```
We're becoming more familiar with the use of regex now, aren't we? It's largely the same code as before, except that we substitute `"Subject: "` with an empty string to get only the subject itself.

__Getting the body of the email__

The last item to insert into our dictionary is the body of the email.
```python
full_email = email.message_from_string(item)
body = full_email.get_payload()
emails_dict["email_body"] = body
```
Separating the header from the body of an email is an awfully complicated task, especially when many of the headers are different in one way or another. Consistency is seldom found in raw unorganised data. Luckily for us, the work's already been done. Python's `email` package is highly adept at this task.

Remember that we've already imported the package earlier. Now, we apply its `message_from_string()` function to `item`, to turn the full email into an `email` Message object. A Message object consists of a header and a payload, which correspond to the header and body of an email.

Next, we apply its `get_payload()` function on the Message object. This function isolates the body of the email. We assign it to the variable `body`, which we then insert into our `emails_dict` dictionary under the key `"email_body"`.

__Why the email package and not regex for the body__

You may ask, why use the `email` package rather than regex? This is because there's no good way to do it with regex at the moment that doesn't require significant amounts of cleaning up. It would mean another sheet of code that probably deserves its own tutorial.

It's worth checking out how we arrive at decisions like this one. However, we need to understand what square brackets, `[ ]`, mean in regex before we can do that.

`[ ]` matches any character placed inside them. For instance, if we want to find `"a"`, `"b"`, or `"c"` in a string, we can use `[abc]` as the pattern. The patterns we discussed above apply as well. `[\w\s]` would find either alphanumeric or whitespace characters. The exception is `.`, which becomes a literal period within square brackets.

Now, we can better understand how we made the decision to use the email package instead.

A peek at the data set reveals that email headers stop at the strings `"Status: 0"` or `"Status: R0"`, and end before the string `"From r"` of the next email. We could thus use `Status:\s*\w*\n*[\s\S]*From\sr*` to acquire only the email body. `[\s\S]*` works for large chunks of text, numbers, and punctuation because it searches for either whitespace or non-whitespace characters.

Unfortunately, some emails have more than one `"Status:"` string and others don't contain `"From r"`, which means that we would split the emails into more or less than the number of dictionaries in the emails list. They would not match with the other categories we already have. It becomes problematic when working with pandas. Hence, we elected to leverage the `email` package.

__Greate the list of dictionaries__

Finally, append the dictionary, `emails_dict`, to the `emails` list:
```python
emails.append(emails_dict)
```
You might want to print the `emails` list at this point to see how it looks. You can also run `print(len(emails_dict))` to see how many dictionaries, and therefore emails, are in the list. As we mentioned before, the full Corpus contains 3977. Our little test file contains seven. Here's the code in full:

In [28]:
import re
import pandas as pd
import email

emails = []

fh = open(r"test_emails.txt", "r").read()

contents = re.split(r"From r",fh)
contents.pop(0)

for item in contents:
    emails_dict = {}

    sender = re.search(r"From:.*", item)

    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None

    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None

    emails_dict["sender_email"] = sender_email

    if s_name is not None:
        sender_name = re.sub("\s*<", "", re.sub(":\s*", "", s_name.group()))
    else:
        sender_name = None

    emails_dict["sender_name"] = sender_name

    recipient = re.search(r"To:.*", item)

    if recipient is not None:
        r_email = re.search(r"\w\S*@.*\w", recipient.group())
        r_name = re.search(r":.*<", recipient.group())
    else:
        r_email = None
        r_name = None

    if r_email is not None:
        recipient_email = r_email.group()
    else:
        recipient_email = None

    emails_dict["recipient_email"] = recipient_email

    if r_name is not None:
        recipient_name = re.sub("\s*<", "", re.sub(":\s*", "", r_name.group()))
    else:
        recipient_name = None

    emails_dict["recipient_name"] = recipient_name

    date_field = re.search(r"Date:.*", item)

    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None

    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None

    emails_dict["date_sent"] = date_sent

    subject_field = re.search(r"Subject: .*", item)

    if subject_field is not None:
        subject = re.sub(r"Subject: ", "", subject_field.group())
    else:
        subject = None

    emails_dict["subject"] = subject

    # "item" substituted with "email content here" so full email not displayed.

    full_email = email.message_from_string(item)
    body = full_email.get_payload()
    emails_dict["email_body"] = "email body here"

    emails.append(emails_dict)

# Print number of dictionaries, and hence, emails, in the list.
print("Number of emails: " + str(len(emails_dict)))

print("\n")

# Print first item in the emails list to see how it looks.
for key, value in emails[0].items():
    print(str(key) + ": " + str(emails[0][key]))

Number of emails: 7


sender_email: james_ngola2002@maktoob.com
sender_name: "MR. JAMES NGOLA."
recipient_email: james_ngola2002@maktoob.com
recipient_name: None
date_sent: 31 Oct 2002
subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
email_body: email body here


We've printed out the first item in the `emails` list, and it's clearly a dictionary with key and value pairs. Because we used a `for` loop, every dictionary has the same keys but different values.

We've substituted `item` with `"email content here"` so that we don't print out the entire mass of the email and clog up our screens. If you're printing this at home using the actual data set, you'll see the entire email.

__Manipulating data with pandas__

With dictionaries in a list, we've made it infinitely easy for the pandas library to do its job. Each key will become a column title, and each value becomes a row in that column.

All we have to do is apply the following code:

In [29]:
import pandas as pd
emails_df = pd.DataFrame(emails)

Let's look at the first few rows.

In [30]:
emails_df.head()

Unnamed: 0,date_sent,email_body,recipient_email,recipient_name,sender_email,sender_name,subject
0,31 Oct 2002,email body here,james_ngola2002@maktoob.com,,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,31 Oct 2002,email body here,R@M,,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",URGENT ASSISTANCE /RELATIONSHIP (P)
2,31 Oct 2002,email body here,obong_715@epatra.com,,obong_715@epatra.com,"""PRINCE OBONG ELEME""",GOOD DAY TO YOU
3,31 Oct 2002,email body here,webmaster@aclweb.org,,obong_715@epatra.com,"""PRINCE OBONG ELEME""",GOOD DAY TO YOU
4,1 Nov 2002,email body here,m_abacha03@www.com,,m_abacha03@www.com,"""Maryam Abacha""",I Need Your Assistance.


The pipe symbol, `|`, looks for characters on either side of itself. For instance, `a|b` looks for either `a` or `b`.

`|` might seem to do the same as `[ ]`, but they really are different. Suppose we want to match either `"crab"`, `"lobster"`, or `"isopod"`. Using `crab|lobster|isopod` would make more sense than `[crablobsterisopod]`, wouldn't it? The former would look for each whole word, whereas the latter would look for every single letter.

Now, let's use `|` to find all the emails sent from one or another domain name.

In [31]:
emails_df[emails_df['sender_email'].str.contains('maktoob|spinfinder')]

Unnamed: 0,date_sent,email_body,recipient_email,recipient_name,sender_email,sender_name,subject
0,31 Oct 2002,email body here,james_ngola2002@maktoob.com,,james_ngola2002@maktoob.com,"""MR. JAMES NGOLA.""",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
1,31 Oct 2002,email body here,R@M,,bensul2004nng@spinfinder.com,"""Mr. Ben Suleman""",URGENT ASSISTANCE /RELATIONSHIP (P)


We can view emails from individual cells too. To do this, we go through four steps. In Step 1, we find the index of the row where the `"sender_email"` column contains the string `"@maktoob"`. Notice how we use regex to do this.

In [32]:
# Step 1: find the index where the "sender_email" column contains "@maktoob.com".
index = emails_df[emails_df["sender_email"].str.contains(r"\w\S*@maktoob.com")].index.values

In Step 2, we use the index to find the email address, which the `loc[]` method returns as a Series object with several different properties. We print it out below to see what it looks like.

In [33]:
# Step 2: use the index to find the value of the cell in the "sender_email" column.
# The result is returned as pandas Series object
address_Series = emails_df.loc[index]["sender_email"]
print(address_Series)
print(type(address_Series))

0    james_ngola2002@maktoob.com
Name: sender_email, dtype: object
<class 'pandas.core.series.Series'>


In Step 3, we extract the email address from the Series object as we would items from a list. You can see that its type is now class.

In [34]:
# Step 3: extract the email address, which is at index 0 in the Series object.
address_string = address_Series[0]
print(address_string)
print(type(address_string))

james_ngola2002@maktoob.com
<class 'str'>


Step 4 is where we extract the email body.

In [35]:
# Step 4: find the value of the "email_body" column where the "sender email" column is address_string.
print(emails_df[emails_df["sender_email"] == address_string]["email_body"].values)

['email body here']


In Step 4, `emails_df['sender_email'] == "james_ngola2002@maktoob.com"` finds the row where the `sender_email` column contains the value `"james_ngola2002@maktoob.com"`. Next, `['email_body'].values` finds the value of the `email_body` column in that same row. Finally, we print out the value.