# Loading Data into spaCy

This notebook shows how to start a spaCy project from the `Element` objects produced by `unstructured`, so that text extracted from your documents can feed directly into an NLP pipeline.

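Make sure spaCy is installed on your local machine before running this notebook. If it is not, follow the [installation instructions](https://spacy.io/usage). The code below also loads the `en_core_web_sm` model, which is downloaded separately with `python -m spacy download en_core_web_sm`.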
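
As a quick sanity check before running the cells below, something like the following can report which of the required packages are missing. This is a minimal sketch: `missing_packages` is a hypothetical helper, not part of spaCy or `unstructured`, and it assumes the `en_core_web_sm` model was installed as a regular Python package (which is how `spacy download` installs it).

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this notebook imports, plus the spaCy model it loads
missing = missing_packages(["spacy", "unstructured", "en_core_web_sm"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```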

# Preprocess Documents with Unstructured

First, we'll pre-process a document using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll end up with a list of `Element` objects whose text we can pass to spaCy.

In [3]:
import os

from unstructured.partition.auto import partition

In [8]:
# NOTE: Update this directory if you are running the notebook
# from somewhere other than the examples/spacy folder in the
# unstructured repo
EXAMPLE_DOCS_FOLDER = "../../example-docs/"

In [9]:
document_to_process = "fake-memo.pdf"
filename = os.path.join(EXAMPLE_DOCS_FOLDER, document_to_process)
elements = partition(filename=filename, strategy="fast")

In [10]:
elements[0].text

'May 5, 2023'

In [11]:
elements[0].metadata.to_dict()

{'filename': 'fake-memo.pdf',
 'file_directory': '../../example-docs',
 'filetype': 'application/pdf',
 'page_number': 1}
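
To decide which element to hand to spaCy, it can help to scan a short preview of each one. The sketch below is one way to do that; `preview_elements` is a hypothetical helper (not part of `unstructured`), and it assumes only that each element exposes a `.text` attribute, as the `Element` objects above do.

```python
def preview_elements(elements, width=60):
    """Return (index, one-line preview) pairs for a list of elements."""
    previews = []
    for index, element in enumerate(elements):
        # Tolerate elements whose text is missing or None
        text = getattr(element, "text", "") or ""
        # Collapse runs of whitespace so each preview fits on one line
        one_line = " ".join(text.split())
        if len(one_line) > width:
            one_line = one_line[:width] + "..."
        previews.append((index, one_line))
    return previews
```

Calling `preview_elements(elements)` and printing each pair gives a compact index of the document's contents.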

# Extract Numbers Using spaCy


Now let's import `spacy` and create a function to extract noun phrases with numbers. First we'll try it on a simple example, then on the text extracted by `unstructured`.

The function first runs the text through the spaCy pipeline to get a `Doc`, then iterates over its tokens to find number-like tokens (`token.like_num`) that act as numeric modifiers (`nummod`) of a noun. For each match, it records the number, the noun it modifies, and the two joined into a short context string.

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_numbers_with_context(text):
    """Return (number, noun, context) tuples for numbers that modify a noun."""
    doc = nlp(text)
    numbers = []

    for token in doc:
        # Keep number-like tokens that are numeric modifiers of a noun
        if token.like_num and token.dep_ == 'nummod' and token.head.pos_ == 'NOUN':
            number = token.text
            noun = token.head.text
            context = ' '.join([number, noun])
            numbers.append((number, noun, context))

    return numbers

# Example usage
text = "I bought 10 apples and 5 oranges yesterday."
numbers_with_context = extract_numbers_with_context(text)

for number, noun, context in numbers_with_context:
    print(f"Number: {number}, Noun: {noun}, Context: {context}")

Number: 10, Noun: apples, Context: 10 apples
Number: 5, Noun: oranges, Context: 5 oranges


### Using the Data Extracted with Unstructured's Library

In [28]:
numbers_with_context = extract_numbers_with_context(elements[2].text)

In [29]:
for number, noun, context in numbers_with_context:
    print(f"Number: {number}, Noun: {noun}, Context: {context}")

Number: 20,000, Noun: bottles, Context: 20,000 bottles
Number: 10,000, Noun: blankets, Context: 10,000 blankets
Number: 200, Noun: laptops, Context: 200 laptops
Number: 3, Noun: trucks, Context: 3 trucks
Number: 15, Noun: hours, Context: 15 hours
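
To run the extraction over every element rather than a single one, a loop along these lines could work. This is a sketch under stated assumptions: `extract_from_elements` is a hypothetical helper, not part of `unstructured` or spaCy. It assumes each element has a `.text` attribute and that the extractor passed in returns `(number, noun, context)` tuples, as `extract_numbers_with_context` does.

```python
def extract_from_elements(elements, extractor):
    """Apply `extractor` to each element's text and tag results with the element index."""
    results = []
    for index, element in enumerate(elements):
        text = getattr(element, "text", "") or ""
        if not text.strip():
            continue  # skip elements that carry no text
        for number, noun, context in extractor(text):
            results.append(
                {
                    "element_index": index,
                    "number": number,
                    "noun": noun,
                    "context": context,
                }
            )
    return results
```

With the objects defined earlier, the call would be `extract_from_elements(elements, extract_numbers_with_context)`. Keeping the element index alongside each match makes it possible to trace a number back to the element it came from.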
