# Email Pipeline

This notebook defines the pipeline for extracting the different components (header, body, attachments, etc.) of an email (`.eml` file). It contains both exploration code and the code that defines the API. Code cells marked with `# pipeline-api` are included in the API definition.

To demonstrate how off-the-shelf Unstructured Bricks extract meaningful data from complex source documents, we will apply a series of Bricks with explanations before defining the API.

#### Table of Contents

1. [Take a Look at a Raw EML File](#explore)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Staging Bricks](#staging)
1. [Define the Pipeline API](#pipeline)

## Section 1: Take a Look at a Raw EML File <a id="explore"></a>

Let's take a look at an email with an attachment. As you will see below, metadata about the email (sender, recipient, subject, etc.) appears at the top, and if you scroll down, you will see the different sections of the email and its metadata.

In [None]:
import os
import json


def get_filename(directory, filename):
    cwd = os.getcwd()
    local_directory = os.path.join(os.path.split(cwd)[0], directory)
    ci_directory = os.path.join(cwd, directory)

    if os.path.exists(local_directory) and filename in os.listdir(local_directory):
        return os.path.join(local_directory, filename)
    elif os.path.exists(ci_directory) and filename in os.listdir(ci_directory):
        return os.path.join(ci_directory, filename)
    else:
        raise FileNotFoundError(f"{filename} not found in {local_directory} or {ci_directory}")

In [None]:
filename = get_filename("sample-docs", "family-day.eml")

In [None]:
import email

# Take a look at the file family-day.eml
with open(filename) as f:
    msg = email.message_from_file(f)

In [None]:
# Take a look at the eml file with all the metadata and content
for part in msg.walk():
    print(part)

MIME-Version: 1.0
Date: Wed, 21 Dec 2022 10:28:53 -0600
Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com>
Subject: Family Day
From: Mallori Harrell <mallori@unstructured.io>
To: Mallori Harrell <mallori@unstructured.io>
Content-Type: multipart/alternative; boundary="0000000000005c115405f0590ce4"

--0000000000005c115405f0590ce4
Content-Type: text/plain; charset="UTF-8"

Hi All,

Get excited for our first annual family day!

There will be face painting, a petting zoo, funnel cake and more.

Make sure to RSVP!

Best.

-- 
Mallori Harrell
Unstructured Technologies
Data Scientist

--0000000000005c115405f0590ce4
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi All,<div><br></div><div>Get excited for our first annua=
l family day!=C2=A0</div><div><br></div><div>There will be face painting, =
a petting zoo, funnel cake and more.</div><div><br></div><div>Make sure to =
RSVP!</div><div><br></div><div>Best.<br

In [None]:
# Take a closer look at the header section of the eml file
for part in msg.raw_items():
    print(part)

('MIME-Version', '1.0')
('Date', 'Wed, 21 Dec 2022 10:28:53 -0600')
('Message-ID', '<CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com>')
('Subject', 'Family Day')
('From', 'Mallori Harrell <mallori@unstructured.io>')
('To', 'Mallori Harrell <mallori@unstructured.io>')
('Content-Type', 'multipart/alternative; boundary="0000000000005c115405f0590ce4"')


## Section 2: Custom Partitioning Bricks <a id="custom"></a>

Let's take a look at the body text of the `.eml` file. The `partition_email` brick splits the body into a list of elements.

In [None]:
from unstructured.partition.email import partition_email

elements = partition_email(filename=filename)

In [None]:
elements

[<unstructured.documents.html.HTMLText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>]

In [None]:
print(elements[0].text)

Hi All,


In [None]:
for element in elements:
    print(element)

Hi All,
Get excited for our first annual family day! 
There will be face painting, a petting zoo, funnel cake and more.
Make sure to RSVP!
Best.
Mallori Harrell
Unstructured Technologies
Data Scientist


We can use the same function with the `include_headers=True` parameter to also extract the header of the `.eml` file.

In [None]:
elements_with_header = partition_email(filename=filename, include_headers=True)
elements_with_header

[<unstructured.documents.email_elements.MetaData>,
 <unstructured.documents.email_elements.MetaData>,
 <unstructured.documents.email_elements.MetaData>,
 <unstructured.documents.email_elements.Subject>,
 <unstructured.documents.email_elements.Sender>,
 <unstructured.documents.email_elements.Recipient>,
 <unstructured.documents.email_elements.MetaData>,
 <unstructured.documents.html.HTMLText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLNarrativeText>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>,
 <unstructured.documents.html.HTMLTitle>]

## Section 3: Cleaning Bricks <a id="cleaning"></a>

In addition to partitioning bricks, the Unstructured library has
***cleaning*** bricks for removing unwanted content from text. In this
case, we'll solve our whitespace problem with the
`clean_extra_whitespace` brick. Other uses for cleaning bricks include
removing boilerplate, sentence fragments, and other segments
of text that could hurt labeling tasks or the accuracy of
machine learning models. As with partitioning bricks, users can
include custom cleaning bricks in a pipeline.
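
To make the idea of a cleaning brick concrete, here is a rough sketch of what whitespace cleaning involves. It is an approximation written for this notebook, not the library's implementation: `collapse_whitespace` is a hypothetical helper, and the sample string is one of the element texts from this email, which ends in a non-breaking space (`\xa0`).

```python
import re


def collapse_whitespace(text):
    """Turn non-breaking spaces and newlines into spaces, then squeeze runs."""
    text = text.replace("\xa0", " ").replace("\n", " ")
    # Collapse any run of two or more spaces and trim both ends.
    return re.sub(r" {2,}", " ", text).strip()


sample = "Get excited for our first annual family day!\xa0"
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```

A custom cleaning brick can be as simple as a function like this one: it takes a string and returns a cleaned string.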

In [None]:
# Inspect the raw text of the first element before cleaning
elements[0].text

'Hi All,'

In [None]:
from unstructured.cleaners.core import clean_extra_whitespace

clean_extra_whitespace(elements[0].text)

'Hi All,'

In [None]:
# Or split the text into paragraphs at newline characters
from unstructured.partition.text import split_by_paragraph

print(split_by_paragraph(elements[0].text))

['Hi All,']


## Section 4: Staging Bricks <a id="staging"></a>

In [None]:
elements[2].text

'There will be face painting, a petting zoo, funnel cake and more.'

In [None]:
from unstructured.staging.label_studio import stage_for_label_studio

label_studio_data = stage_for_label_studio(elements)
label_studio_data

[{'data': {'text': 'Hi All,', 'ref_id': 'db1ca22813f01feda8759ff04a844e56'}},
 {'data': {'text': 'Get excited for our first annual family day!\xa0',
   'ref_id': '9ec31559e889d2fd004f1911524143ba'}},
 {'data': {'text': 'There will be face painting, a petting zoo, funnel cake and more.',
   'ref_id': '1ed755f351e19ae96f0dae15b26fc9e3'}},
 {'data': {'text': 'Make sure to RSVP!',
   'ref_id': 'e945c67e6bca859e2d39c4ed33a02346'}},
 {'data': {'text': 'Best.', 'ref_id': '5550577db69c2c8aabcd90979698120a'}},
 {'data': {'text': 'Mallori Harrell',
   'ref_id': 'ca1c571d993b6c1ed8ef56a06c16ba22'}},
 {'data': {'text': 'Unstructured Technologies',
   'ref_id': 'd5b612de8cd918addd9569b0255b65b2'}},
 {'data': {'text': 'Data Scientist',
   'ref_id': 'd69b468e295fa01cdb3b7c3f0bd34114'}}]
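
The staged output is a plain list of dictionaries, so it can be serialized with the standard `json` module before being handed to a labeling tool. The sketch below assumes the structure shown in the output above; `staged` holds two entries copied from it rather than the full result, and the variable names are illustrative only.

```python
import json

# Two entries copied from the staged output above (assumption: same shape).
staged = [
    {"data": {"text": "Hi All,", "ref_id": "db1ca22813f01feda8759ff04a844e56"}},
    {"data": {"text": "Best.", "ref_id": "5550577db69c2c8aabcd90979698120a"}},
]

# Serialize to a JSON string, then parse it back to confirm nothing is lost.
serialized = json.dumps(staged, indent=2)
restored = json.loads(serialized)

print(serialized)
```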

## Section 5: Defining the Pipeline API<a id="pipeline"></a>

This API will be able to handle `.txt`, `.docx`, `.pptx`, `.jpg`, `.png`, `.eml`, `.html`, and `.pdf` documents. The following cells demonstrate this for a couple of file types. To learn how to use the specific partition functions (e.g. `partition_email`, `partition_html`, etc.), see the notebooks in the `exploration-notebooks` directory.

In [None]:
# pipeline-api
import os
import tempfile

from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_isd

In [None]:
# pipeline-api
def pipeline_api(file, filename='', response_type="application/json"):
    # NOTE(robinson) - This is a hacky solution due to
    # limitations in the SpooledTemporaryFile wrapper.
    # Specifically, it doesn't have a `seekable` attribute,
    # which is required for .pptx and .docx. See the link below.
    # ref: https://stackoverflow.com/questions/47160211
    # /why-doesnt-tempfile-spooledtemporaryfile-implement-readable-writable-seekable
    with tempfile.TemporaryDirectory() as tmpdir:
        _filename = os.path.join(tmpdir, filename.split('/')[-1])
        with open(_filename, "wb") as f:
            f.write(file.read())
        elements = partition(filename=_filename)
        
    return convert_to_isd(elements)
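
The temporary-file workaround in `pipeline_api` can be tried on its own with only the standard library. The following is a minimal sketch under stated assumptions: `copy_to_named_file` is a hypothetical helper, the upload is simulated with a `SpooledTemporaryFile`, and the path `uploads/family-day.eml` is made up for the example.

```python
import os
import tempfile


def copy_to_named_file(file, filename, directory):
    """Write a file-like object's bytes into `directory`; return the new path."""
    # Keep only the base name so the file lands directly inside `directory`.
    path = os.path.join(directory, os.path.basename(filename))
    with open(path, "wb") as out:
        out.write(file.read())
    return path


# Simulate an uploaded file held in a SpooledTemporaryFile.
with tempfile.SpooledTemporaryFile() as upload:
    upload.write(b"Subject: Family Day\n\nHi All,\n")
    upload.seek(0)
    with tempfile.TemporaryDirectory() as tmpdir:
        saved_path = copy_to_named_file(upload, "uploads/family-day.eml", tmpdir)
        saved_name = os.path.basename(saved_path)
        # Any processing of the file has to happen before tmpdir is removed.
        with open(saved_path, "rb") as saved:
            saved_bytes = saved.read()

print(saved_name)
```

Because the directory is deleted when the `with` block exits, the file must be read (or partitioned) inside the block, which is why `pipeline_api` calls `partition` before leaving it.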

In [None]:
with open(filename, 'rb') as file:
    email_data = pipeline_api(file=file, filename=filename)

In [None]:
email_data

[{'element_id': 'db1ca22813f01feda8759ff04a844e56',
  'coordinates': None,
  'text': 'Hi All,',
  'type': 'UncategorizedText',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmp94zpbz3n/family-day.eml'}},
 {'element_id': '9ec31559e889d2fd004f1911524143ba',
  'coordinates': None,
  'text': 'Get excited for our first annual family day!\xa0',
  'type': 'NarrativeText',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmp94zpbz3n/family-day.eml'}},
 {'element_id': '1ed755f351e19ae96f0dae15b26fc9e3',
  'coordinates': None,
  'text': 'There will be face painting, a petting zoo, funnel cake and more.',
  'type': 'NarrativeText',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmp94zpbz3n/family-day.eml'}},
 {'element_id': 'e945c67e6bca859e2d39c4ed33a02346',
  'coordinates': None,
  'text': 'Make sure to RSVP!',
  'type': 'NarrativeText',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w00

## Now let's use the API for a text file

In [None]:
filename_txt = get_filename("sample-docs", "fake-text.txt")

In [None]:
with open(filename_txt, 'rb') as file:
    text_elements = pipeline_api(file=file, filename=filename_txt)

In [None]:
text_elements

[{'element_id': '1df8eeb8be847c3a1a7411e3be3e0396',
  'coordinates': None,
  'text': 'This is a test document to use for unit tests.',
  'type': 'NarrativeText',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmpkmhjgfzz/fake-text.txt'}},
 {'element_id': '9c218520320f238595f1fde74bdd137d',
  'coordinates': None,
  'text': 'Important points:',
  'type': 'Title',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmpkmhjgfzz/fake-text.txt'}},
 {'element_id': '39a3ae572581d0f1fe7511fd7b3aa414',
  'coordinates': None,
  'text': 'Hamburgers are delicious',
  'type': 'ListItem',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmpkmhjgfzz/fake-text.txt'}},
 {'element_id': 'fc1adcb8eaceac694e500a103f9f698f',
  'coordinates': None,
  'text': 'Dogs are the best',
  'type': 'ListItem',
  'metadata': {'filename': '/var/folders/01/tgx6znjs3p3bfwkbj591hg1w0000gn/T/tmpkmhjgfzz/fake-text.txt'}},
 {'element_id': '0b61e