In [53]:
%%html
<!-- 
If you can see this code, this cell's output is not trusted.
Please execute this cell and save the notebook, or click File -> Trust Notebook
-->
<script>
var shown = true;

function filter_cells_by_tag(tag) {
    var out = [];
    var all_cells = Jupyter.notebook.get_cells();
    for (var i=0; i<all_cells.length; i++) {
        var curr_cell = all_cells[i];
        var tags = curr_cell._metadata.tags;
        if (tags != undefined) {
            for (var j=0; j<tags.length; j++) {
                var curr_tag = tags[j];
                if (curr_tag == tag) {
                    out.push(curr_cell);
                    break;
                }
            }
        }
    }
    return out;
}

function set_cell_visibility(tag, show, input_only) {
    var marked_cells = filter_cells_by_tag(tag);
    for (var i=0; i<marked_cells.length; i++) {
        var curr_cell = marked_cells[i];
        var obj;
        if (input_only) {
            obj = curr_cell.input;
        } else {
            obj = curr_cell.element;
        }
        if (show) {
            obj.show();
        } else {
            obj.hide();
        }
    }
}

function toggle_cell_visibility(tag) {
    // Flip the state first so the first click hides the initially visible cells.
    shown = !shown;
    set_cell_visibility(tag, shown, false);
}

set_cell_visibility('execution_cell', false, true);
</script>
To toggle the visibility of explanatory cells, click <a href="javascript:toggle_cell_visibility('explanatory_cell')">here</a>.


# Email Pipeline

This notebook defines the pipeline for extracting narrative text
from emails stored as `.eml` files. It contains both exploration code
and the code that defines the API. Code cells marked
with `# pipeline-api` are included in the API definition.

To demonstrate how off-the-shelf Unstructured Bricks extract
meaningful data from complex source documents, we will apply
a series of Bricks with explanations before defining the API.

#### Table of Contents

1. [Pulling and Reading the Emails](#reading)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Define the API](#pipeline)

## Section 1: Pulling and Reading the Emails <a id="reading"></a>

First, let's pull in the `.eml` files from a local directory.

In [222]:
import os

def get_filenames(directory, file_type, cwd=None):
    """Return paths of files in a sibling `directory` that end in `.file_type`."""
    if not cwd:
        cwd = os.getcwd()

    # The target directory sits next to the current working directory.
    local_directory = os.path.join(os.path.split(cwd)[0], directory)

    files = []
    for file in os.listdir(local_directory):
        # Keep only files with the requested extension
        if file.endswith(f".{file_type}"):
            files.append(os.path.join(local_directory, file))
    return files
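
`os.listdir` returns entries in arbitrary order, so positional lookups such as `filenames[1]` may select a different file on another machine. Below is a minimal sketch of a deterministic variant. `list_files_sorted` is a hypothetical helper introduced only for illustration, and it takes the full directory path rather than resolving a sibling of the working directory.

```python
from pathlib import Path


def list_files_sorted(directory, file_type):
    """Return paths in `directory` ending in `.file_type`, sorted by name.

    Hypothetical helper for illustration; sorting makes indexing reproducible.
    """
    base = Path(directory)
    return sorted(str(path) for path in base.glob(f"*.{file_type}") if path.is_file())
```

With sorted paths, an index like `[1]` always refers to the same file as long as the directory contents do not change.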

In [234]:
import email

filenames = get_filenames("sample-docs", "eml")
with open(filenames[1], "r") as f:
    msg = email.message_from_file(f)

In [235]:
for item in msg.walk():
    print(item)

MIME-Version: 1.0
Date: Wed, 21 Dec 2022 10:28:53 -0600
Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com>
Subject: Family Day
From: Mallori Harrell <mallori@unstructured.io>
To: Mallori Harrell <mallori@unstructured.io>
Content-Type: multipart/alternative; boundary="0000000000005c115405f0590ce4"

--0000000000005c115405f0590ce4
Content-Type: text/plain; charset="UTF-8"

Hi All,

Get excited for our first annual family day!

They'll be face painting, a petting zoo, funnel cake and more.

Make sure to RSVP!

Best.

-- 
Mallori Harrell
Unstructured Technologies
Data Scientist

--0000000000005c115405f0590ce4
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi All,<div><br></div><div>Get excited for our first annua=
l family day!=C2=A0</div><div><br></div><div>They&#39;ll be face painting, =
a petting zoo, funnel cake and more.</div><div><br></div><div>Make sure to =
RSVP!</div><div><br></div><div>Best.<br c
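
The output above shows a `multipart/alternative` message: `walk()` yields the container first, then the `text/plain` part and the quoted-printable `text/html` part. As a rough sketch using only the standard library, a single part can be selected by content type and decoded. `get_body` and `RAW_MESSAGE` are hypothetical names, and the message is a tiny stand-in rather than one of the sample documents.

```python
import email

# Tiny stand-in message for illustration, not one of the sample documents.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    "Subject: Example\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "Hello from the plain part.\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="UTF-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<div>Hello from the =\n"
    "HTML part.</div>\n"
    "--XYZ--\n"
)


def get_body(message, content_type):
    """Return the decoded text of the first part with the given content type."""
    for part in message.walk():
        if part.get_content_type() == content_type:
            # decode=True undoes quoted-printable or base64 transfer encoding
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


msg_example = email.message_from_string(RAW_MESSAGE)
plain = get_body(msg_example, "text/plain")
html = get_body(msg_example, "text/html")
```

Decoding the payload removes the `=` soft line breaks and `=C2=A0`-style escapes visible in the raw HTML part.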

## Section 2: Custom Partitioning Bricks <a id="custom"></a>

In [227]:
from unstructured.partition.email import partition_email

msg_elements = []
for filename in filenames:
    msg_elements.append(partition_email(filename=filename))

In [228]:
msg_elements

[[<unstructured.documents.html.HTMLNarrativeText at 0x13cddde50>,
  <unstructured.documents.html.HTMLListItem at 0x13cdddd00>,
  <unstructured.documents.html.HTMLListItem at 0x13cddd550>,
  <unstructured.documents.html.HTMLListItem at 0x13cdc6b20>,
  <unstructured.documents.html.HTMLTitle at 0x150381430>,
  <unstructured.documents.html.HTMLTitle at 0x1503816a0>,
  <unstructured.documents.html.HTMLTitle at 0x150381b50>],
 [<unstructured.documents.html.HTMLNarrativeText at 0x13cdc3a30>,
  <unstructured.documents.html.HTMLNarrativeText at 0x150381f70>,
  <unstructured.documents.html.HTMLTitle at 0x150381d60>,
  <unstructured.documents.html.HTMLTitle at 0x1503819a0>],
 [<unstructured.documents.html.HTMLTitle at 0x13cdc3970>,
  <unstructured.documents.html.HTMLNarrativeText at 0x1503816d0>,
  <unstructured.documents.html.HTMLNarrativeText at 0x1503811f0>,
  <unstructured.documents.html.HTMLNarrativeText at 0x150381fd0>,
  <unstructured.documents.html.HTMLTitle at 0x150381520>,
  <unstructur
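
The default repr only shows each element's class and memory address. One way to get a more readable overview is to count element types and preview their text. The sketch below assumes only that each element exposes a `.text` attribute, as the cells in this notebook do. `summarize_elements` and the `Fake*` classes are hypothetical stand-ins so the example runs without the library installed.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in classes so the sketch runs on its own; in the notebook,
# pass one of the inner lists from `msg_elements` instead.
@dataclass
class FakeTitle:
    text: str


@dataclass
class FakeNarrativeText:
    text: str


def summarize_elements(elements, preview_chars=40):
    """Return (type counts, list of 'TypeName: text preview' strings)."""
    counts = Counter(type(element).__name__ for element in elements)
    previews = [
        f"{type(element).__name__}: {element.text[:preview_chars]}"
        for element in elements
    ]
    return counts, previews


sample = [FakeTitle("Hi,"), FakeNarrativeText("Make sure to RSVP!")]
counts, previews = summarize_elements(sample)
```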

In [229]:
# Print the text of the first four elements of the third email
for element in msg_elements[2][:4]:
    print(element.text)

Hi,
It has come to our attention that as of 9:00am this morning, Harold's lunch is missing. If this was done in error please return the lunch immediately to the fridge on the 2nd floor by noon.
If the lunch has not been returned by noon, we will be reviewing camera footage to determine who stole Harold's lunch.
The perpetrators will be PUNISHED to the full extent of our employee code of conduct handbook.


## Section 3: Cleaning Bricks <a id="cleaning"></a>

In [230]:
from unstructured.cleaners.core import clean_extra_whitespace

for element in msg_elements[2]:
    print(clean_extra_whitespace(element.text))

Hi,
It has come to our attention that as of 9:00am this morning, Harold's lunch is missing. If this was done in error please return the lunch immediately to the fridge on the 2nd floor by noon.
If the lunch has not been returned by noon, we will be reviewing camera footage to determine who stole Harold's lunch.
The perpetrators will be PUNISHED to the full extent of our employee code of conduct handbook.
Thank you for your time,
Data Scientist
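
To make the effect of whitespace cleaning concrete, here is a rough standard-library approximation. It is a stand-in written under assumptions, not the library's implementation: it treats non-breaking spaces and newlines as spaces, collapses runs of spaces, and trims both ends. `collapse_whitespace` is a hypothetical name.

```python
import re


def collapse_whitespace(text):
    """Approximate whitespace cleanup for illustration.

    Treats non-breaking spaces and newlines as spaces, collapses runs
    of spaces into one, and trims both ends. Not the library's code.
    """
    text = text.replace("\xa0", " ").replace("\n", " ")
    return re.sub(r" {2,}", " ", text).strip()


cleaned = collapse_whitespace("  Hi,\xa0 there \n  friend ")
```

The sample emails above contain little stray whitespace, which is why the cleaned output looks the same as the raw element text.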


## Section 4: Define the API <a id="pipeline"></a>

In [240]:
# pipeline-api
from unstructured.cleaners.core import clean_extra_whitespace
from unstructured.partition.email import partition_email

In [241]:
# pipeline-api
def pipeline_api(file):
    """Partition an `.eml` file and strip extra whitespace from each element."""
    if not file.endswith(".eml"):
        raise ValueError("This file type is not supported at the moment. Use an `.eml` file.")

    elements = partition_email(filename=file)

    results = []
    for element in elements:
        element.text = clean_extra_whitespace(element.text)
        results.append(element)
    return results

In [242]:
msg = pipeline_api(filenames[2])

In [243]:
msg[0].text

'Hi,'
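
The extension guard sketched in `pipeline_api` relies on `str.endswith`, which is case-sensitive, so a file named `Report.EML` would be rejected. A possible case-insensitive check is shown below. `is_eml_file` is a hypothetical helper, not part of the pipeline as defined.

```python
from pathlib import Path


def is_eml_file(filename):
    """Return True if `filename` has an .eml extension, ignoring case."""
    return Path(filename).suffix.lower() == ".eml"
```

Whether uppercase extensions should be accepted is a design choice. The sketch only shows how the check could be relaxed.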