## Loading Data into Weaviate with `unstructured`

This notebook shows a basic workflow for uploading document elements into Weaviate using the `unstructured` library. To get started, install the dependencies with `pip install -r requirements.txt` and start the Weaviate Docker container with `docker-compose up`.

In [1]:
import json

import tqdm
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate
import weaviate
from weaviate.util import generate_uuid5

The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function.

In [None]:
filename = "../../example-docs/layout-parser-paper-fast.pdf"
elements = partition_pdf(filename=filename, strategy="fast")
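
For other file types, the `partition` function mentioned above detects the document type and dispatches to the matching partitioner. The wrapper below is a minimal sketch: `partition_any` is a hypothetical helper name, and the import sits inside the function only so the helper can be defined before the optional per-format dependencies are installed.

```python
def partition_any(filename: str):
    """Partition a document, letting `unstructured` detect the file type."""
    # Deferred import: `unstructured` is only required when the helper is called.
    from unstructured.partition.auto import partition

    return partition(filename=filename)


# Example call (requires `unstructured` and the example file to be present):
# elements = partition_any("../../example-docs/layout-parser-paper-fast.pdf")
```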

Next, we'll set up a collection (class) in Weaviate. The `unstructured` library provides the `create_unstructured_weaviate_class` helper function, which generates a schema that includes all of the fields in the `ElementMetadata` object from `unstructured`, such as the filename and the page number of the document element. In this notebook the helper is left commented out: we connect to the database with the Weaviate client library, create an empty collection, and let Weaviate's auto-schema infer the properties from the uploaded data. You can change the name of the class by updating the `unstructured_class_name` variable.

In [3]:
unstructured_class_name = "UnstructuredDocument"

In [None]:
# not used, we are creating the schema from the provided data
# unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)
# schema = {"classes": [unstructured_class]}

In [4]:
# Connecting to Weaviate
# https://weaviate.io/developers/weaviate/starter-guides/connect
client = weaviate.connect_to_local()

In [5]:
# remove any existing collection with this name so the notebook starts from a clean state
client.collections.delete(unstructured_class_name)
collection = client.collections.create(
    name=unstructured_class_name
)
# we can get our collection at any time:
collection = client.collections.get(unstructured_class_name)

Next, we stage the elements for Weaviate using the `stage_for_weaviate` function and batch upload the results. `stage_for_weaviate` outputs a list of dictionaries, one per element, each holding the element's text, category, and metadata fields. Once the data is staged, we use the Weaviate client library to batch upload it to Weaviate.

In [6]:
data_objects = stage_for_weaviate(elements)

In [7]:
# this is one of our objects
data_objects[0]

{'file_directory': '../../example-docs',
 'filename': 'layout-parser-paper-fast.pdf',
 'languages': ['eng'],
 'last_modified': '2024-06-04T17:26:18',
 'page_number': 1,
 'filetype': 'application/pdf',
 'text': '1 2 0 2',
 'category': 'UncategorizedText'}
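
To make the shape of a staged record concrete, here is a minimal sketch of the kind of flattening that produces a dictionary like the one above. It works on plain dictionaries rather than real `unstructured` elements; the `flatten_element` helper and the input layout are assumptions for illustration, not the library's implementation.

```python
import json


def flatten_element(element: dict) -> dict:
    """Merge an element's metadata, text, and category into one flat record."""
    properties = dict(element.get("metadata", {}))
    # Serialize nested dictionaries to JSON strings so every value is flat.
    for key, value in properties.items():
        if isinstance(value, dict):
            properties[key] = json.dumps(value)
    properties["text"] = element["text"]
    properties["category"] = element["category"]
    return properties


# Hypothetical input that mirrors the fields shown in the output above.
example = {
    "text": "1 2 0 2",
    "category": "UncategorizedText",
    "metadata": {
        "filename": "layout-parser-paper-fast.pdf",
        "page_number": 1,
        "languages": ["eng"],
    },
}
print(flatten_element(example))
```

Keeping every record flat means each key can map directly to one collection property.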

In [8]:
with collection.batch.dynamic() as batch:
    for data_object in tqdm.tqdm(data_objects):
        batch.add_object(
            properties=data_object
        )
# failed objects are reported on the collection once the batch context has exited
failed_objs_a = collection.batch.failed_objects
print("FAILED: ", failed_objs_a)

100%|██████████| 25/25 [00:00<00:00, 26620.36it/s]


FAILED:  []
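
The notebook imports `generate_uuid5` but never uses it. One plausible use is to give each object a deterministic ID, so that re-running the upload overwrites existing objects instead of adding duplicates under new random UUIDs. The sketch below illustrates the idea with the standard library's `uuid.uuid5` as a stand-in; the `stable_id` helper and the namespace string are hypothetical.

```python
import json
import uuid

# Hypothetical namespace for this example; any fixed UUID would work.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "unstructured-weaviate-example")


def stable_id(properties: dict) -> str:
    """Derive a deterministic UUID from a staged record's contents."""
    # Sort keys so dictionaries with the same content map to the same ID.
    payload = json.dumps(properties, sort_keys=True, default=str)
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, payload))


record = {"filename": "layout-parser-paper-fast.pdf", "page_number": 1, "text": "1 2 0 2"}
same_record = {"text": "1 2 0 2", "page_number": 1, "filename": "layout-parser-paper-fast.pdf"}

print(stable_id(record) == stable_id(same_record))  # True: key order does not matter
```

Inside the batch loop, the ID could then be passed along as `batch.add_object(properties=data_object, uuid=stable_id(data_object))`. Note that two elements with identical properties would share an ID under this scheme, so only one of them would be stored.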


Now that the documents are in Weaviate, we're able to run queries against Weaviate!

In [9]:
# let's just get a single object
obj = collection.query.fetch_objects(limit=1).objects[0]
print(obj)

Object(uuid=_WeaviateUUIDInt('117e4b2d-1222-4d2e-9a40-2e761ecdafe8'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'text': '2', 'languages': ['eng'], 'page_number': 2.0, 'category': 'UncategorizedText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'filename': 'layout-parser-paper-fast.pdf', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'file_directory': '../../example-docs'}, references=None, vector={}, collection='UnstructuredDocument')


In [None]:
# We relied on Weaviate's auto-schema to generate our collection.
# You can get the collection schema as a dict like this:
# collection.config.get().to_dict()
# The same dict can be used to create a collection:
# new_collection = client.collections.create_from_dict(collection.config.get().to_dict())

In [10]:
results = collection.query.bm25(
    query="document understanding",
    limit=2,
    return_metadata=weaviate.classes.query.MetadataQuery(score=True)
)
for obj in results.objects:
    print(obj.metadata.score, obj.properties)

0.36298108100891113 {'text': 'Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,', 'languages': ['eng'], 'page_number': 1.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}
0.3443584442138672 {'text': 'LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', 'languages': ['eng'], 'page_number': 1.0, 'category': 'Title', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': None, 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}


In [11]:
# We can also perform similarity search
results = collection.query.near_text(
    query="document understanding",
    limit=4
)
for obj in results.objects:
    print(obj.properties)

{'text': 'Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,', 'languages': ['eng'], 'page_number': 1.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}
{'text': 'Z. Shen et al.', 'languages': ['eng'], 'page_number': 2.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}
{'text': 'The library implements simple and intuitive Python APIs without sacriﬁcing generalizability and versatility, and can be easily installed via pip. Its convenient functions for handling document image data can be seamlessly

In [12]:
client.close()