# Loading Data into spaCy

This notebook shows how to start a spaCy project from the `Element` objects produced by `unstructured`. Once an element's text is in a spaCy `Doc`, you can run standard NLP steps on it, such as noun-chunk extraction and part-of-speech tagging.

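Make sure spaCy is installed on your local machine before running this notebook. If it is not, you can find the installation instructions [here](https://spacy.io/usage). The notebook also loads the `en_core_web_sm` model, which you can download with `python -m spacy download en_core_web_sm`.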

# Preprocess Documents with Unstructured

First, we'll pre-process a few documents using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll have a list of `Element` objects whose text we can pass to spaCy.

In [16]:
import os

from unstructured.partition.auto import partition

In [17]:
# NOTE: Update this path if it does not point to the example-docs
# directory of the unstructured repo from where you run the notebook
EXAMPLE_DOCS_FOLDER = "../unstructured/example-docs/"

In [18]:
documents_to_process = [
    "fake-email.eml",
    "fake.docx",
    "layout-parser-paper-fast.pdf",
]

In [19]:
elements = []
for document in documents_to_process:
    filename = os.path.join(EXAMPLE_DOCS_FOLDER, document)
    elements.extend(partition(filename=filename, strategy="fast"))
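
If the folder path does not match your checkout, the loop above stops at the first file it cannot find. As a minimal sketch, a helper like the hypothetical `split_existing` below (it is not part of `unstructured`) reports which documents are present before you partition them. The temporary folder only stands in for `EXAMPLE_DOCS_FOLDER`.

```python
import os
import tempfile


def split_existing(folder, names):
    """Split file names into full paths that exist and names that are missing."""
    found, missing = [], []
    for name in names:
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            found.append(path)
        else:
            missing.append(name)
    return found, missing


# Demo on a throwaway folder standing in for EXAMPLE_DOCS_FOLDER.
with tempfile.TemporaryDirectory() as demo_folder:
    # Create one empty placeholder file so that exactly one name resolves.
    open(os.path.join(demo_folder, "fake.docx"), "w").close()
    found, missing = split_existing(demo_folder, ["fake.docx", "fake-email.eml"])

print("found:", [os.path.basename(p) for p in found])
print("missing:", missing)
```

In the notebook you could call `split_existing(EXAMPLE_DOCS_FOLDER, documents_to_process)` and pass only the found paths to `partition`.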

In [20]:
elements[0].text

'This is a test email to use for unit tests.'

In [21]:
elements[0].metadata.to_dict()

{'filename': 'fake-email.eml',
 'date': '2022-12-16T17:04:16-05:00',
 'sent_from': ['Matthew Robinson <mrobinson@unstructured.io>'],
 'sent_to': ['Matthew Robinson <mrobinson@unstructured.io>'],
 'subject': 'Test Email'}

# Convert to spaCy Object

In [22]:
import spacy

In [23]:
nlp = spacy.load("en_core_web_sm")

In [34]:
doc = nlp(elements[0].text)
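
The cell above converts only the first element. To process every element while keeping its metadata attached, one option is spaCy's `nlp.pipe` with `as_tuples=True`, which passes a context object through next to each text. The sketch below is an assumption about how you might wire this up: `to_text_context_pairs`, `docs_with_metadata`, and `fake_element` are hypothetical helper names, and the stand-in objects only mimic the `.text` and `.metadata.to_dict()` attributes used earlier.

```python
from types import SimpleNamespace


def to_text_context_pairs(elements):
    """Pair each element's text with its metadata dict, skipping blank text."""
    pairs = []
    for element in elements:
        text = (element.text or "").strip()
        if not text:
            continue
        pairs.append((text, element.metadata.to_dict()))
    return pairs


def docs_with_metadata(nlp, elements):
    """Yield (Doc, metadata) tuples; `nlp` is a loaded spaCy pipeline."""
    # as_tuples=True makes nlp.pipe accept (text, context) pairs and
    # return (Doc, context) pairs with the context unchanged.
    yield from nlp.pipe(to_text_context_pairs(elements), as_tuples=True)


def fake_element(text, metadata):
    """Hypothetical stand-in for an unstructured Element (demo only)."""
    return SimpleNamespace(
        text=text, metadata=SimpleNamespace(to_dict=lambda: metadata)
    )


sample = [
    fake_element("This is a test email.", {"filename": "fake-email.eml"}),
    fake_element("   ", {}),  # blank text is dropped
]
pairs = to_text_context_pairs(sample)
print(pairs)
```

With the objects from this notebook, `for doc, meta in docs_with_metadata(nlp, elements): ...` would give you each `Doc` together with the metadata of the element it came from.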

In [37]:
for chunk in doc.noun_chunks:
    print(chunk)

This
a test email
unit tests


In [39]:
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Verbs: ['use']


In [43]:
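# token.pos is the integer ID of the part-of-speech tag;
# token.pos_ (trailing underscore) is its string label, e.g. "VERB"
for token in doc:
    print(token.pos, token.lemma_)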

95 this
87 be
90 a
92 test
92 email
94 to
100 use
85 for
92 unit
92 test
97 .
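
The integers in the output above are spaCy's internal IDs for part-of-speech tags, and the string label is available as `token.pos_`. As a small sketch of how the labels might be used, the hypothetical helper `lemmas_by_pos` below groups lemmas by label. The stand-in tokens expose only `pos_` and `lemma_` so the example runs without a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def lemmas_by_pos(tokens):
    """Group lemmas under their string part-of-speech label (token.pos_)."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token.pos_].append(token.lemma_)
    return dict(groups)


# Stand-in tokens for the demo; with spaCy loaded you would
# call lemmas_by_pos(doc) on a real Doc instead.
demo_tokens = [
    SimpleNamespace(pos_="NOUN", lemma_="email"),
    SimpleNamespace(pos_="VERB", lemma_="use"),
    SimpleNamespace(pos_="NOUN", lemma_="test"),
]
grouped = lemmas_by_pos(demo_tokens)
print(grouped)
```

Iterating over a spaCy `Doc` yields tokens with these same attributes, so the helper can be applied to `doc` directly.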
