# Email Pipeline

This notebook defines the pipeline for extracting the different components (header, body, attachments, etc.) of an email (`.eml` file). This notebook contains both exploration code and the code for defining the API. Code cells marked with `#pipeline-api` are included in the API definition.

To demonstrate how off-the-shelf Unstructured Bricks extract meaningful data from complex source documents, we will apply a series of Bricks with explanations before defining the API.

#### Table of Contents

1. [Take a Look at a Raw EML File](#explore)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Staging Bricks](#staging)
1. [Define the Pipeline API](#pipeline)

## Section 1: Take a Look at a Raw EML File <a id="explore"></a>

Let's take a look at an email with an attachment. As you will see below, metadata about the email (sender, recipient, subject, etc.) appears at the top, and if you scroll down you will see the different sections of the email and its metadata. One header, `X-MS-Has-Attach: yes`, indicates that this email has an attachment.

In [2]:
filename = "../../sample_documents/sample-eml-from-breaches/2135.eml"

In [3]:
import email

# Take a look at file 2135.eml
with open(filename) as f:
    msg = email.message_from_file(f)

In [4]:
# Take a look at the eml file with all the metadata and content
for part in msg.walk():
    print(part)

Received: from NPFEXH01.NPF.local ([fe80::49b9:5557:61d2:64a3]) by
 NPFEXH01.NPF.local ([fe80::49b9:5557:61d2:64a3%15]) with mapi id
 14.03.0169.001; Sun, 27 Aug 2017 23:21:10 +1200
From: Bragga Namaduk <Bragga.Namaduk@npf.gov.nr>
To: Alice Fritz <Alice.Fritz@npf.gov.nr>
Subject: FW: General Duty ROSTER
Thread-Topic: General Duty ROSTER
Thread-Index: AdMe/Yp7tBOYUpQ3QneK1CllQ+thjwAKV+MQ
X-MS-Exchange-MessageSentRepresentingType: 1
Date: Sun, 27 Aug 2017 11:21:09 +0000
Message-ID: <D20A1AD0162D424E8C5694CBD12A5F34011B2C2B15@NPFEXH01.NPF.local>
References: <F5753948A6B5A3478619939081EBFA62011B27F3D5@NPFEXH01.NPF.local>
In-Reply-To: <F5753948A6B5A3478619939081EBFA62011B27F3D5@NPFEXH01.NPF.local>
Accept-Language: en-US
Content-Language: en-US
X-MS-Exchange-Organization-AuthAs: Internal
X-MS-Exchange-Organization-AuthMechanism: 04
X-MS-Exchange-Organization-AuthSource: NPFEXH01.NPF.local
X-MS-Has-Attach: yes
X-MS-Exchange-Organization-SCL: -1
X-MS-TNEF-Correlator: 
X-MS-Exchange-Organizatio

In [5]:
# Take a closer look at the header section of the eml file
for part in msg.raw_items():
    print(part)

('Received', 'from NPFEXH01.NPF.local ([fe80::49b9:5557:61d2:64a3]) by\n NPFEXH01.NPF.local ([fe80::49b9:5557:61d2:64a3%15]) with mapi id\n 14.03.0169.001; Sun, 27 Aug 2017 23:21:10 +1200')
('From', 'Bragga Namaduk <Bragga.Namaduk@npf.gov.nr>')
('To', 'Alice Fritz <Alice.Fritz@npf.gov.nr>')
('Subject', 'FW: General Duty ROSTER')
('Thread-Topic', 'General Duty ROSTER')
('Thread-Index', 'AdMe/Yp7tBOYUpQ3QneK1CllQ+thjwAKV+MQ')
('X-MS-Exchange-MessageSentRepresentingType', '1')
('Date', 'Sun, 27 Aug 2017 11:21:09 +0000')
('Message-ID', '<D20A1AD0162D424E8C5694CBD12A5F34011B2C2B15@NPFEXH01.NPF.local>')
('References', '<F5753948A6B5A3478619939081EBFA62011B27F3D5@NPFEXH01.NPF.local>')
('In-Reply-To', '<F5753948A6B5A3478619939081EBFA62011B27F3D5@NPFEXH01.NPF.local>')
('Accept-Language', 'en-US')
('Content-Language', 'en-US')
('X-MS-Exchange-Organization-AuthAs', 'Internal')
('X-MS-Exchange-Organization-AuthMechanism', '04')
('X-MS-Exchange-Organization-AuthSource', 'NPFEXH01.NPF.local')
('X-MS
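
The header inspection above can also be done programmatically. The following is a minimal sketch using only the standard library `email` module; the message text is made up for illustration (it is not `2135.eml`), and the variable names are assumptions.

```python
import email

# A tiny, made-up message used only for illustration; it is not 2135.eml.
raw_message = (
    "From: Sender Example <sender@example.com>\n"
    "To: Recipient Example <recipient@example.com>\n"
    "Subject: FW: Example subject\n"
    "X-MS-Has-Attach: yes\n"
    "\n"
    "Body text.\n"
)

msg = email.message_from_string(raw_message)

# Header lookup is case-insensitive and returns None when the header is absent.
has_attachment_flag = (msg.get("X-MS-Has-Attach") or "").strip().lower() == "yes"

# items() yields (name, value) pairs in the order they appear in the message.
headers = dict(msg.items())

print(has_attachment_flag)
print(headers["Subject"])
```

Note that `X-MS-Has-Attach` is a header written by Microsoft Exchange, so messages from other mail systems may not carry it even when they have attachments.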

## Section 2: Custom Partitioning Bricks <a id="custom"></a>

Let's take a look at only the body text of the eml file.

In [6]:
from unstructured.partition.email import partition_email

elements = partition_email(filename=filename)

In [7]:
elements

[<unstructured.documents.html.HTMLTitle at 0x11389e5e0>,
 <unstructured.documents.html.HTMLTitle at 0x11389e820>,
 <unstructured.documents.html.HTMLNarrativeText at 0x11389e880>,
 <unstructured.documents.html.HTMLTitle at 0x11389ea90>,
 <unstructured.documents.html.HTMLTitle at 0x11679b4c0>]

In [8]:
for element in elements:
    print(element)

From: Jachin BopSent: Sunday, August 27, 2017 6:35 PMTo: Corey CalebCc: Kalinda Blake; Imran Scotty; Rory Detageouwa2; thubalkain dabuae; Czarist Daniel; Brown Capelle; Bragga Namaduk; Kirsty Karl; Shannon Scotty; Jacaranda Akibwib; John Deidenang; Kempson Detenamo; Francine Dekarube; Mick SerbatoioSubject: General Duty ROSTER
Sir
Submit new ROSTER for front line that you recommended to be change, FYI Monday, Tuesday and Thursday morning shift 0700-1500 will be man by DRILL training team and OIC will be me Insp Bop see the roster attached as well, hope you are satisfy with it or anything you needed to be change just send an email and explain what need to be change, HR please arrange new time sheet for the change been made, unable to be print due to OPS copy out of toner.thank you.
Insp Jachin Bop
Inspector Operation


We can call the same function with an extra parameter, `include_headers=True`, to also extract the headers of the eml file.

In [9]:
elements_with_header = partition_email(filename=filename, include_headers=True)

Let's also extract the attachment from the eml file. We can extract the file metadata and payload. You can also save the actual attachment to your local drive by specifying a directory for the `output_dir` parameter.

In [10]:
from unstructured.partition.email import extract_attachment_info
with open(filename) as f:
    msg = email.message_from_file(f)
    
attachments = extract_attachment_info(msg)

In [11]:
attachments

[{'filename': 'Specila Unit 28th August to 10th September 2017.docx',
  'size': '55999',
  'creation-date': 'Sun, 27 Aug 2017 06:27:19 GMT',
  'modification-date': 'Sun, 27 Aug 2017 06:35:16 GMT',
  'payload': b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00O\xb7\x01i\xa5\x01\x00\x00\xc2\x06\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x
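
To make the idea of attachment metadata and payload concrete, here is a minimal sketch that uses only the standard library. It is not how `extract_attachment_info` is implemented; the helper `list_attachments`, the sample message, and the file name `example.bin` are all hypothetical.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Build a small, made-up message with one attachment (illustration only).
original = EmailMessage()
original["From"] = "sender@example.com"
original["To"] = "recipient@example.com"
original["Subject"] = "Example with attachment"
original.set_content("See the attached file.")
original.add_attachment(
    b"example payload",
    maintype="application",
    subtype="octet-stream",
    filename="example.bin",
)

# Round-trip through bytes, as if the message had been read from an .eml file.
msg = BytesParser(policy=policy.default).parsebytes(original.as_bytes())


def list_attachments(message):
    """Collect the filename and decoded payload of each attachment part."""
    found = []
    for part in message.walk():
        if part.get_content_disposition() == "attachment":
            found.append(
                {
                    "filename": part.get_filename(),
                    # decode=True reverses the transfer encoding (base64 here).
                    "payload": part.get_payload(decode=True),
                }
            )
    return found


attachments = list_attachments(msg)
print(attachments[0]["filename"], len(attachments[0]["payload"]))
```

Writing each `payload` to disk with `open(path, "wb")` would be the standard-library analogue of passing `output_dir`.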

## Section 3: Cleaning Bricks <a id="cleaning"></a>

In addition to partitioning bricks, the Unstructured library has
***cleaning*** bricks for removing unwanted content from text. In this
case, we'll address extra whitespace with the
`clean_extra_whitespace` brick. Other uses for cleaning bricks include
removing boilerplate, sentence fragments, and other segments
of text that could degrade labeling tasks or the accuracy of
machine learning models. As with partitioning bricks, users can
include custom cleaning bricks in a pipeline.
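
As an illustration of what a custom cleaning brick could look like, here is a minimal sketch of a whitespace cleaner written with the standard library `re` module. The function `collapse_whitespace` is hypothetical and is not the library's implementation of `clean_extra_whitespace`.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample text with doubled spaces, a tab, and blank lines.
sample = "From:  Example Sender\n\nSent:\tSunday  "
cleaned = collapse_whitespace(sample)
print(cleaned)
```

A cleaning brick of this shape takes a string and returns a string, so it can be applied to `element.text` for each element in a pipeline.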

In [12]:
#This element has a lot of new line characters
elements[0].text

'From: Jachin BopSent: Sunday, August 27, 2017 6:35 PMTo: Corey CalebCc: Kalinda Blake; Imran Scotty; Rory Detageouwa2; thubalkain dabuae; Czarist Daniel; Brown Capelle; Bragga Namaduk; Kirsty Karl; Shannon Scotty; Jacaranda Akibwib; John Deidenang; Kempson Detenamo; Francine Dekarube; Mick SerbatoioSubject: General Duty ROSTER'

In [13]:
from unstructured.cleaners.core import clean_extra_whitespace

clean_extra_whitespace(elements[0].text)

'From: Jachin BopSent: Sunday, August 27, 2017 6:35 PMTo: Corey CalebCc: Kalinda Blake; Imran Scotty; Rory Detageouwa2; thubalkain dabuae; Czarist Daniel; Brown Capelle; Bragga Namaduk; Kirsty Karl; Shannon Scotty; Jacaranda Akibwib; John Deidenang; Kempson Detenamo; Francine Dekarube; Mick SerbatoioSubject: General Duty ROSTER'

In [14]:
# Alternatively, split the text into paragraphs at newline characters
from unstructured.cleaners.extract import extract_text_before, extract_text_after
from unstructured.partition.text import split_by_paragraph

print(split_by_paragraph(elements[0].text))

['From: Jachin BopSent: Sunday, August 27, 2017 6:35 PMTo: Corey CalebCc: Kalinda Blake; Imran Scotty; Rory Detageouwa2; thubalkain dabuae; Czarist Daniel; Brown Capelle; Bragga Namaduk; Kirsty Karl; Shannon Scotty; Jacaranda Akibwib; John Deidenang; Kempson Detenamo; Francine Dekarube; Mick SerbatoioSubject: General Duty ROSTER']


## Section 4: Staging Bricks <a id="staging"></a>

In [15]:
elements[2].text

'Submit new ROSTER for front line that you recommended to be change, FYI Monday, Tuesday and Thursday morning shift 0700-1500 will be man by DRILL training team and OIC will be me Insp Bop see the roster attached as well, hope you are satisfy with it or anything you needed to be change just send an email and explain what need to be change, HR please arrange new time sheet for the change been made, unable to be print due to OPS copy out of toner.thank you.'

In [16]:
from unstructured.staging.label_studio import stage_for_label_studio

label_studio_data = stage_for_label_studio(elements)
label_studio_data

[{'data': {'text': 'From: Jachin BopSent: Sunday, August 27, 2017 6:35 PMTo: Corey CalebCc: Kalinda Blake; Imran Scotty; Rory Detageouwa2; thubalkain dabuae; Czarist Daniel; Brown Capelle; Bragga Namaduk; Kirsty Karl; Shannon Scotty; Jacaranda Akibwib; John Deidenang; Kempson Detenamo; Francine Dekarube; Mick SerbatoioSubject: General Duty ROSTER',
   'ref_id': 'f8655433f0452831f1f4462f7f872c8c'}},
 {'data': {'text': 'Sir', 'ref_id': '31bc41c1dbb7212df18845ac71f9669b'}},
 {'data': {'text': 'Submit new ROSTER for front line that you recommended to be change, FYI Monday, Tuesday and Thursday morning shift 0700-1500 will be man by DRILL training team and OIC will be me Insp Bop see the roster attached as well, hope you are satisfy with it or anything you needed to be change just send an email and explain what need to be change, HR please arrange new time sheet for the change been made, unable to be print due to OPS copy out of toner.thank you.',
   'ref_id': '7fc421755047d7a794297d27151ce
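
The staged output above is a list of tasks, each with a `data` dictionary holding `text` and `ref_id`. The sketch below builds the same shape by hand so the structure is easier to see. The helper `to_label_studio_tasks` is hypothetical, and deriving `ref_id` from an MD5 hash of the text is an assumption made for illustration; it is not necessarily how the library computes it.

```python
import hashlib
import json


def to_label_studio_tasks(texts):
    """Wrap each text in the {"data": {"text", "ref_id"}} shape shown above."""
    tasks = []
    for text in texts:
        # Assumption: any stable identifier would do; MD5 is used here
        # only because it yields a 32-character hex string.
        ref_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        tasks.append({"data": {"text": text, "ref_id": ref_id}})
    return tasks


tasks = to_label_studio_tasks(["Sir", "Insp Jachin Bop"])
print(json.dumps(tasks, indent=2))
```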

## Section 5: Defining the Pipeline API <a id="pipeline"></a>

In [40]:
# pipeline-api
import email
import signal
from unstructured.partition.email import partition_email, extract_attachment_info
from unstructured.staging.base import convert_to_isd

In [18]:
# pipeline-api
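class timeout:
    """Context manager that raises TimeoutError if its block runs longer than `seconds`.

    Relies on SIGALRM, which can only be set from the main thread. Elsewhere
    signal.signal raises ValueError, which is swallowed, so no timeout is enforced.
    """

    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message

    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)

    def __enter__(self):
        try:
            # Schedule SIGALRM to fire after `seconds`
            signal.signal(signal.SIGALRM, self.handle_timeout)
            signal.alarm(self.seconds)
        except ValueError:
            pass

    def __exit__(self, type, value, traceback):
        try:
            # Cancel any pending alarm
            signal.alarm(0)
        except ValueError:
            pass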
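
To show how a SIGALRM-based timeout behaves in use, here is a self-contained sketch. It swaps in a `contextlib` generator and `signal.setitimer` (so that sub-second limits work) instead of the class above; the names `time_limit` and `run_with_limit` are hypothetical. Like the class, it assumes a Unix system and must run in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds, error_message="Timeout"):
    """Raise TimeoutError if the with-block runs longer than `seconds`."""

    def handle_timeout(signum, frame):
        raise TimeoutError(error_message)

    # Install the handler and start a one-shot real-time timer.
    previous = signal.signal(signal.SIGALRM, handle_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore whatever handler was there before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def run_with_limit(func, seconds):
    """Call func(), returning its result or a message if it times out."""
    try:
        with time_limit(seconds, "took too long"):
            return func()
    except TimeoutError as exc:
        return f"timed out: {exc}"


fast = run_with_limit(lambda: "done", 1.0)
slow = run_with_limit(lambda: time.sleep(1.0), 0.1)
print(fast)
print(slow)
```

Restoring the previous handler in `finally` keeps the sketch from interfering with other code that uses SIGALRM.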

In [34]:
# pipeline-api
def pipeline_api(file, response_type="application/json", m_include_headers=[], m_extract_attachment=[], m_output_dir=[]):
    # Partition the email body (and headers, if requested) into elements
    elements = partition_email(filename=file, include_headers=m_include_headers)

    # Default to an empty list so the return works when attachments are not requested
    attachment = []
    if m_extract_attachment:
        with open(file) as f:
            msg = email.message_from_file(f)
        attachment = extract_attachment_info(msg, output_dir=m_output_dir)
    return elements, attachment

In [36]:
email_data, attachment = pipeline_api(file=filename, m_include_headers=True, m_extract_attachment=True)

In [37]:
email_data

[<unstructured.documents.email_elements.ReceivedInfo at 0x137e8d220>,
 <unstructured.documents.email_elements.ReceivedInfo at 0x137e7a9a0>,
 <unstructured.documents.email_elements.ReceivedInfo at 0x137e8d4c0>,
 None,
 <unstructured.documents.email_elements.Sender at 0x137ec3880>,
 <unstructured.documents.email_elements.Recipient at 0x137e7a610>,
 <unstructured.documents.email_elements.Subject at 0x137e7a7c0>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a100>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a4f0>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a760>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a700>,
 <unstructured.documents.email_elements.MetaData at 0x137e7ab50>,
 <unstructured.documents.email_elements.MetaData at 0x137e7aa60>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a790>,
 <unstructured.documents.email_elements.MetaData at 0x137e7a190>,
 <unstructured.documents.email_elements.MetaData at 0x137e7

In [38]:
print(email_data[0])
print(email_data[5])
print(email_data[10])

NPFEXH01.NPF.local: fe80::49b9:5557:61d2:64a3
Alice Fritz: alice.fritz@npf.gov.nr
Date: Sun, 27 Aug 2017 11:21:09 +0000


In [39]:
attachment

[{'filename': 'Specila Unit 28th August to 10th September 2017.docx',
  'size': '55999',
  'creation-date': 'Sun, 27 Aug 2017 06:27:19 GMT',
  'modification-date': 'Sun, 27 Aug 2017 06:35:16 GMT',
  'payload': b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00O\xb7\x01i\xa5\x01\x00\x00\xc2\x06\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x