In [53]:
%%html
<!-- 
If you can see this code, this cell's output is not trusted.
Please execute this cell and save the notebook, or click File -> Trust Notebook
-->
<script>
var shown = true;

function filter_cells_by_tag(tag) {
    var out = [];
    var all_cells = Jupyter.notebook.get_cells();
    for (var i=0; i<all_cells.length; i++) {
        var curr_cell = all_cells[i];
        var tags = curr_cell._metadata.tags;
        if (tags != undefined) {
            for (var j=0; j<tags.length; j++) {
                var curr_tag = tags[j];
                if (curr_tag == tag) {
                    out.push(curr_cell);
                    break;
                }
            }
        }
    }
    return out;
}

function set_cell_visibility(tag, show, input_only) {
    var marked_cells = filter_cells_by_tag(tag);
    for (var i=0; i<marked_cells.length; i++) {
        var curr_cell = marked_cells[i];
        var obj;
        if (input_only) {
            obj = curr_cell.input;
        } else {
            obj = curr_cell.element;
        }
        if (show) {
            obj.show();
        } else {
            obj.hide();
        }
    }
}

function toggle_cell_visibility(tag) {
    // Flip the state first: the cells start out shown, so the first click should hide them
    shown = !shown;
    set_cell_visibility(tag, shown, false);
}

set_cell_visibility('execution_cell', false, true);
</script>
To toggle visibility of explanation cells click <a href="javascript:toggle_cell_visibility('explanatory_cell')">here</a>


# Email Pipeline

This notebook defines the pipeline for extracting the narrative text sections
from emails stored as `.eml` files. It contains both exploration code and the
code that defines the API. Code cells marked with `#pipeline-api` are included
in the API definition.

To demonstrate how off-the-shelf Unstructured Bricks extract
meaningful data from complex source documents, we will apply
a series of Bricks with explanations before defining the API.

#### Table of Contents

1. [Pulling and Reading the Emails](#reading)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Define the API](#pipeline)

## Section 1: Pulling and Reading the Emails <a id="reading"></a>

First, let's pull in the `.eml` files from a local directory.

In [173]:
import os

def get_filenames(directory, file_type, cwd=None):
    """Return paths of files ending in `.file_type` inside `directory`."""
    if not cwd:
        cwd = os.getcwd()

    # `directory` is resolved relative to the parent of the working directory
    local_directory = os.path.join(os.path.split(cwd)[0], directory)

    files = []
    for file in os.listdir(local_directory):
        # keep only files with the requested extension
        if file.endswith(f'.{file_type}'):
            files.append(os.path.join(local_directory, file))
    return files
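
`os.listdir` returns entries in arbitrary order, so which file ends up first in the list can differ between machines. If a reproducible order matters, a `pathlib`-based variant with sorting is one option. The sketch below is an illustration only; `list_files` is a hypothetical helper, not part of this notebook's pipeline, and it takes the directory path directly instead of resolving it against the working directory.

```python
from pathlib import Path
import tempfile

def list_files(directory, file_type):
    """Return the sorted paths of files in `directory` with suffix `.file_type`."""
    matches = Path(directory).glob(f"*.{file_type}")
    return sorted(str(p) for p in matches if p.is_file())

# Demonstrate on a throwaway directory so the example does not depend on local files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.eml", "a.eml", "notes.txt"):
        (Path(tmp) / name).write_text("")
    found = [Path(p).name for p in list_files(tmp, "eml")]

print(found)
```

Sorting changes which file is read first, so it would also change the order of the results in the later cells.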

In [174]:
import email

filenames = get_filenames("sample-docs", "eml")
# Parse the first email into a Message object from the standard library
with open(filenames[0], "r") as f:
    msg = email.message_from_file(f)

In [175]:
for item in msg.walk():
    print(item)

MIME-Version: 1.0
Date: Wed, 21 Dec 2022 11:09:08 -0600
Message-ID: <CAPgNNXR+x-xiszwFdZx59eFHz9syApFyODPbAUHT7YVgNtF-fA@mail.gmail.com>
Subject: ANNOUNCEMENT: The holidays are coming!
From: Mallori Harrell <mallori@unstructured.io>
To: Mallori Harrell <mallori@unstructured.io>
Content-Type: multipart/alternative; boundary="00000000000054448805f0599c48"

--00000000000054448805f0599c48
Content-Type: text/plain; charset="UTF-8"

To All,

As the holiday approaches, be sure to let your manager and team know the
following:

   - Your days off
   - The location of your work's documentation
   - How to reach you or your secondary in case of an emergency


Hope you all have a Happy Holidays!

Best,

-- 
Mallori Harrell
Unstructured Technologies
Data Scientist

--00000000000054448805f0599c48
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">To All,<div><br></div><div>As the holiday approaches, be s=
ure to let your manager and team know the f
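
`walk()` yields the multipart container first and then each of its parts, which is why the message body appears more than once in the printed output. To look at the parts separately, one approach is to skip the container and key each leaf part by its content type. The following is a minimal sketch using only the standard library; the inline `RAW` message and the `get_parts` helper are made up for illustration and are not part of the pipeline.

```python
import email
from email import policy

# A tiny stand-in message with the same multipart/alternative layout as the sample email
RAW = """\
MIME-Version: 1.0
Subject: ANNOUNCEMENT: The holidays are coming!
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="UTF-8"

To All,

--BOUNDARY
Content-Type: text/html; charset="UTF-8"

<div>To All,</div>

--BOUNDARY--
"""

def get_parts(message):
    """Map each content type to the decoded text of its first leaf part."""
    parts = {}
    for part in message.walk():
        if part.is_multipart():
            continue  # the container itself carries no body text
        parts.setdefault(part.get_content_type(), part.get_content())
    return parts

# policy.default makes the parser return EmailMessage objects, which provide get_content()
msg = email.message_from_string(RAW, policy=policy.default)
parts = get_parts(msg)

for content_type, body in parts.items():
    print(content_type, "->", body.strip())
```

Note that `get_content()` is only available when the message is parsed with a modern policy such as `policy.default`; the cell above uses the default (legacy) policy, where `get_payload(decode=True)` would be the equivalent call.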

## Section 2: Custom Partitioning Bricks <a id="custom"></a>

In [177]:
from unstructured.partition.email import partition_email

msg_elements = []
for filename in filenames:
    print(filename)
    msg_elements.append(partition_email(filename=filename))

/Users/mallori/pipeline-emails/sample-docs/announcement.eml
/Users/mallori/pipeline-emails/sample-docs/family_day.eml
/Users/mallori/pipeline-emails/sample-docs/alert.eml


In [179]:
msg_elements

[[<unstructured.documents.html.HTMLNarrativeText at 0x13cd117c0>,
  <unstructured.documents.html.HTMLListItem at 0x13cd36b80>,
  <unstructured.documents.html.HTMLListItem at 0x13cd36370>,
  <unstructured.documents.html.HTMLListItem at 0x13cd3b790>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5b880>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5b5e0>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5bdf0>],
 [<unstructured.documents.html.HTMLNarrativeText at 0x13cd363a0>,
  <unstructured.documents.html.HTMLNarrativeText at 0x13cd5b670>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5b160>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5b220>],
 [<unstructured.documents.html.HTMLTitle at 0x13cd36a60>,
  <unstructured.documents.html.HTMLNarrativeText at 0x13cd5b8e0>,
  <unstructured.documents.html.HTMLNarrativeText at 0x13cd5b970>,
  <unstructured.documents.html.HTMLNarrativeText at 0x13cd5baf0>,
  <unstructured.documents.html.HTMLTitle at 0x13cd5b1f0>,
  <unstructur

In [181]:
print(msg_elements[0][1].text)
print(msg_elements[0][2].text)
print(msg_elements[0][3].text)

Your days off
The location of your work's documentation
How to reach you or your secondary in case of an emergency
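
The default repr of the element lists shows only class names and memory addresses, so it can help to pair each element's type with its text when inspecting results. Below is a small sketch of that idea. The `NarrativeText` and `ListItem` dataclasses are stand-ins defined here so the example runs on its own; they are not the library's classes, and `describe` is a hypothetical helper that only assumes each element has a `.text` attribute.

```python
from collections import Counter
from dataclasses import dataclass

# Stand-in element classes for illustration only (assumption: real elements expose `.text`)
@dataclass
class NarrativeText:
    text: str

@dataclass
class ListItem:
    text: str

def describe(elements):
    """Return (type name, text) pairs for a list of elements."""
    return [(type(el).__name__, el.text) for el in elements]

sample = [
    NarrativeText("As the holiday approaches, be sure to let your manager and team know the following:"),
    ListItem("Your days off"),
    ListItem("The location of your work's documentation"),
]

pairs = describe(sample)
counts = Counter(name for name, _ in pairs)

for name, text in pairs:
    print(f"{name}: {text}")
print(dict(counts))
```

The same helper should work on `msg_elements[0]` from the cell above, since those elements are accessed through `.text` as well.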


## Section 3: Cleaning Bricks <a id="cleaning"></a>

In [183]:
from unstructured.cleaners.core import clean_extra_whitespace

# Print the text of each element in the first email with extra whitespace removed
for element in msg_elements[0]:
    print(clean_extra_whitespace(element.text))

As the holiday approaches, be sure to let your manager and team know the following:
Your days off
The location of your work's documentation
How to reach you or your secondary in case of an emergency
Hope you all have a Happy Holidays!
Best,
Data Scientist
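
To give a feel for what a whitespace-cleaning step does to text like this, here is a rough standard-library approximation. It is a sketch under assumptions: `collapse_whitespace` is a hypothetical function written for this example, and it is not claimed to match the library's `clean_extra_whitespace` in every case.

```python
import re

def collapse_whitespace(text):
    """Replace newlines and non-breaking spaces with spaces, then squeeze repeated spaces."""
    text = re.sub(r"[\xa0\n]", " ", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

# Hard-wrapped text, similar to the plain-text body of the sample email
raw = "As the holiday   approaches, be sure to let your manager and team know the\nfollowing:  "
cleaned = collapse_whitespace(raw)
print(cleaned)
```

A step like this is useful for email bodies because plain-text parts are usually hard-wrapped, which leaves line breaks in the middle of sentences.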


## Section 4: Define the API <a id="pipeline"></a>

In [None]:
# pipeline-api