## Hack Day Quick Start challenge: try out this workflow (or a modification) on one of your own files!

## Embed your Local Docs in a Weaviate Vector Database with Unstructured!


Author: Nina Lopatina from Unstructured

Nina's X handle: [@NinaLopatina](https://x.com/ninalopatina)

Nina's LinkedIn: https://www.linkedin.com/in/ninalopatina

Last updated: 09.06.24

Weaviate content sections borrowed from @MariaKhalusova

Do you have local files that you want to parse, embed, and import into your Weaviate vector database for RAG? If so, this notebook will guide you through all the steps!

Here are the initial non-code steps:

A. Sign up for your [Unstructured API key](https://app.unstructured.io/) with a two-week free trial for up to 1000 documents. You can find your API credentials in your dashboard.

B. Decide on your [source connector](https://docs.unstructured.io/api-reference/ingest/source-connectors/overview). This notebook uses the [Local](https://docs.unstructured.io/api-reference/ingest/source-connectors/local) connector, but feel free to use the connector of your choice.

C. Sign up for [Weaviate](https://weaviate.io/), create a cluster, and get its URL and API key. Our documentation for the [Weaviate destination connector](https://docs.unstructured.io/api-reference/ingest/destination-connector/weaviate) has more info.

D. Decide on which embeddings to use, and obtain the appropriate API Token as needed (in this notebook we are using OpenAI for embedding generation).

Set up any private API keys in Google Colab [Secrets](https://www.youtube.com/watch?v=LPa51KxqUAw) (or adapt the notebook to work with .env instead).

_______________




1. Starting with the code below, install all the necessary libraries.

In [None]:
!pip install -U -q "unstructured-ingest[weaviate]" "unstructured[openai]"

2. Set the variables below:

A. Pull in your secrets

In [None]:
from google.colab import userdata
import os

os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')
os.environ['UNSTRUCTURED_API_URL'] = userdata.get('UNSTRUCTURED_API_URL')
os.environ['WEAVIATE_API_KEY'] = userdata.get('WEAVIATE_API_KEY')
os.environ['WEAVIATE_URL'] = userdata.get('WEAVIATE_URL')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
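
If you are not running in Colab, the same variables could come from a `.env` file instead. The sketch below is one minimal way to do that with only the standard library. The helper names (`parse_env_text`, `load_env_file`, `missing_keys`) are made up for this example, and the parser only handles simple `KEY=VALUE` lines; a dedicated package such as python-dotenv would be a more robust choice.

```python
import os
from pathlib import Path

# The variable names this notebook reads later on.
REQUIRED_KEYS = [
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
    "WEAVIATE_API_KEY",
    "WEAVIATE_URL",
    "OPENAI_API_KEY",
]


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env"):
    """Copy variables from a .env file into os.environ, if the file exists."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text())
    os.environ.update(values)
    return values


def missing_keys(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not environ.get(key)]
```

Calling `load_env_file()` and then `missing_keys()` before running the rest of the notebook gives an early warning if a secret was not picked up.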

B. Set up your local file. If you are running in Colab, make sure you have uploaded a file to the path you specify (click the folder icon on the left to upload). I am using [constitution.pdf](https://constitutioncenter.org/media/files/constitution.pdf) in this example, but you can use any file for your quick start!

In [None]:
os.environ['LOCAL_FILE_INPUT_DIR'] = '/content/constitution.pdf'
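
Before running the pipeline, it can help to confirm that the path really points at something. This is a small optional check, not part of the Unstructured library; `describe_input` is a hypothetical helper name.

```python
import os
from pathlib import Path


def describe_input(path_str):
    """Report whether the input path is a file, a directory, or missing."""
    path = Path(path_str)
    if path.is_file():
        return f"file: {path.name} ({path.stat().st_size} bytes)"
    if path.is_dir():
        count = sum(1 for p in path.iterdir() if p.is_file())
        return f"directory: {count} file(s) at top level"
    return "missing"


# Falls back to the example path used in this notebook if the variable is unset.
print(describe_input(os.environ.get("LOCAL_FILE_INPUT_DIR", "/content/constitution.pdf")))
```

If this prints `missing`, upload the file (or fix the path) before continuing.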

C. Set additional, non-secret parameters here so that they are easy to modify in the notebook.

In [None]:
os.environ['WEAVIATE_COLLECTION_CLASS_NAME'] = 'UnstructuredOAI'
os.environ['EMBEDDING_MODEL'] = '002'
os.environ['EMBEDDING_NAME'] = 'ada'
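
The two embedding variables are the name and version parts of a single OpenAI model. The sketch below shows how they are assumed to combine, based on how Weaviate's `text2vec-openai` documentation describes the `ada` / `002` pairing. The helper `openai_embedding_model_id` is hypothetical and only covers that naming pattern.

```python
import os


def openai_embedding_model_id(name, version):
    """Compose an OpenAI embedding model id from its name and version parts.

    Assumption: the "text-embedding-<name>-<version>" pattern, which matches
    the ada / 002 pairing used in this notebook.
    """
    return f"text-embedding-{name}-{version}"


name = os.environ.get("EMBEDDING_NAME", "ada")
version = os.environ.get("EMBEDDING_MODEL", "002")
print(openai_embedding_model_id(name, version))
```

Queries such as `near_text` are embedded by Weaviate's vectorizer, so it should point at the same model that produced the vectors stored in the collection.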

3. Connect to Weaviate using [Weaviate Cloud](https://console.weaviate.cloud/), [Weaviate Embedded](https://weaviate.io/developers/weaviate/installation/embedded), or [locally](https://weaviate.io/developers/weaviate/installation/docker-compose) and configure your Weaviate Schema.

In [None]:
# Weaviate Cloud

import weaviate

# Set these environment variables
URL = os.getenv("WEAVIATE_URL")
APIKEY = os.getenv("WEAVIATE_API_KEY")

# Connect to your WCD instance
client = weaviate.connect_to_wcs(
    cluster_url=URL,
    auth_credentials=weaviate.auth.AuthApiKey(APIKEY),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Replace with your OpenAI key
    }
)

client.is_ready()



True

In [None]:
import weaviate.classes.config as wc

client.collections.create(
    name=os.getenv("WEAVIATE_COLLECTION_CLASS_NAME"),

    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( # specify the vectorizer and model type you're using
        model=os.getenv("EMBEDDING_NAME"),
        model_version=os.getenv("EMBEDDING_MODEL"),
        type_="text",
    ),
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4"  # Optional - Defaults to `gpt-3.5-turbo`
    ),


    # Weaviate can infer schema, but it is considered best practice to define it upfront
    properties=[
        wc.Property(name="type", data_type=wc.DataType.TEXT),
        wc.Property(name="element_id", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="embeddings", data_type=wc.DataType.NUMBER_ARRAY, skip_vectorization=True),
        wc.Property(name="metadata", data_type=wc.DataType.OBJECT, nested_properties=[
            wc.Property(name="filename", data_type=wc.DataType.TEXT),
            wc.Property(name="filetype", data_type=wc.DataType.TEXT),
            wc.Property(name="languages", data_type=wc.DataType.TEXT_ARRAY),
            wc.Property(name="page_number",  data_type=wc.DataType.TEXT, skip_vectorization=True),

        ])
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7e984c09ee60>
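
To make the schema concrete, here is a sketch of what one chunk could look like once it is shaped for this collection. All values are hypothetical placeholders, and `unexpected_keys` is an illustrative helper, not a Weaviate or Unstructured function.

```python
# Property names taken from the collection definition above.
TOP_LEVEL = {"type", "element_id", "text", "embeddings", "metadata"}
METADATA = {"filename", "filetype", "languages", "page_number"}

# Hypothetical example object; the values are placeholders, not real output.
sample_object = {
    "type": "CompositeElement",
    "element_id": "hypothetical-id-123",
    "text": "Example chunk text",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": ["eng"],
        "page_number": "1",  # declared as TEXT in the schema, so a string here
    },
}


def unexpected_keys(obj):
    """Return keys that the schema above does not declare."""
    extra = set(obj) - TOP_LEVEL
    extra |= {f"metadata.{k}" for k in set(obj.get("metadata", {})) - METADATA}
    return sorted(extra)


print(unexpected_keys(sample_object))
```

An empty list means every key in the object maps to a declared property.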

4. Set up Unstructured API access and process the documents as per our [Weaviate destination connector](https://docs.unstructured.io/api-reference/ingest/destination-connector/weaviate) with a local source.

  At the end of this workflow, your unstructured documents have been extracted, chunked, embedded, and loaded into your Weaviate DB!

In [None]:
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig

from unstructured_ingest.v2.processes.connectors.weaviate import (
    WeaviateConnectionConfig,
    WeaviateAccessConfig,
    WeaviateUploaderConfig,
    WeaviateUploadStagerConfig
)
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

In [None]:
Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="langchain-openai", embedding_api_key=os.getenv("OPENAI_API_KEY")),
        destination_connection_config=WeaviateConnectionConfig(
            access_config=WeaviateAccessConfig(
                api_key=os.getenv("WEAVIATE_API_KEY")
            ),
            host_url=os.getenv("WEAVIATE_URL"),
            class_name=os.getenv("WEAVIATE_COLLECTION_CLASS_NAME")
        ),
        stager_config=WeaviateUploadStagerConfig(),
        uploader_config=WeaviateUploaderConfig()
    ).run()

2024-09-07 05:16:59,541 MainProcess INFO     Created index with configs: {"input_path": "/content/constitution.pdf", "recursive": false}, connection configs: {"access_config": "**********"}
2024-09-07 05:16:59,547 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-09-07 05:16:59,551 MainProcess INFO     Created partition with configs: {"strategy": "fast", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": true, "split_pdf_allow_failed": true, "split_pdf_concurrency_level": 15}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructuredapp.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-09-07 05:16:59,553 MainProcess INFO     Created chunk with 
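
To build intuition for `chunking_strategy="by_title"`, here is a toy version of the idea: start a new chunk whenever a title element appears, or when the current chunk would grow past a size limit. This is a simplified sketch and not Unstructured's implementation; the function name, the element dictionaries, and the `max_characters` default are assumptions for illustration. The real chunker has more options, which may explain why the search results later in this notebook show Amendment I and Amendment II in one chunk.

```python
def chunk_by_title(elements, max_characters=500):
    """Group element texts into chunks, breaking at titles and size limits."""
    chunks, current, current_len = [], [], 0
    for el in elements:
        starts_section = el["type"] == "Title"
        too_long = current_len + len(el["text"]) > max_characters
        # Close the running chunk before starting a new section or overflowing.
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(el["text"])
        current_len += len(el["text"])
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical partitioned elements, shaped like simplified Unstructured output.
elements = [
    {"type": "Title", "text": "Amendment I."},
    {"type": "NarrativeText", "text": "Congress shall make no law."},
    {"type": "Title", "text": "Amendment II."},
    {"type": "NarrativeText", "text": "A well regulated Militia."},
]

for chunk in chunk_by_title(elements):
    print(repr(chunk))
```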

## 5. Time to Search! Put in your own questions here


### Aggregate query

In [None]:
# count how many chunks are in the database

documents = client.collections.get(os.getenv("WEAVIATE_COLLECTION_CLASS_NAME"))
response = documents.aggregate.over_all(total_count=True)

print(response.total_count)

INFO: HTTP Request: POST https://gmm4kvzs82lcafga8vmng.c0.us-west3.gcp.weaviate.cloud/v1/graphql "HTTP/1.1 200 OK"


145


### Hybrid search (mix of keyword and vector search)

In [None]:
import json

documents = client.collections.get(os.getenv("WEAVIATE_COLLECTION_CLASS_NAME"))

response = documents.query.hybrid(
    query="what's the first amendment?",
    alpha=0.5, # equal weighting of BM25 and vector search
    return_properties=['text'],
    auto_limit=2  # autocut after 2 jumps
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "ARTICLES in addition to, and Amendment of the Constitution of the United States of America, proposed by Congress, and ratified by the Legislatures of the several States, pursuant to the fifth Article of the original Constitution.\n\n(Note: The first 10 amendments to the Constitution were ratified December 15, 1791, and form what is known as the \u201cBill of Rights.\u201d)"
}
{
  "text": "Amendment I.\n\nCongress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridg- ing the freedom of speech, or of the press, or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.\n\nAmendment II.\n\nA well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed."
}
{
  "text": "Ratification may be proposed by the Congress; Provided that no Amendment which may be made prior to the Year One thousand 
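
The `alpha` parameter controls how the keyword and vector rankings are blended. The sketch below illustrates the general idea with min-max normalized scores and a weighted sum. It is a simplified illustration, not Weaviate's actual fusion code, and the scores are made-up numbers.

```python
def min_max(scores):
    """Rescale a dict of scores to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(keyword, vector, alpha=0.5):
    """Blend scores: alpha=1 is pure vector search, alpha=0 is pure keyword."""
    kw, vec = min_max(keyword), min_max(vector)
    ids = set(kw) | set(vec)
    fused = {
        i: alpha * vec.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0) for i in ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three chunks.
keyword = {"a": 2.0, "b": 1.0, "c": 0.0}
vector = {"a": 0.1, "b": 0.9, "c": 0.5}

for doc_id, score in hybrid_scores(keyword, vector, alpha=0.5):
    print(doc_id, round(score, 2))
```

Try changing `alpha` toward 0 or 1 to see the ranking shift toward the keyword or the vector ordering.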

### Vector Search

In [None]:
documents = client.collections.get(os.getenv("WEAVIATE_COLLECTION_CLASS_NAME"))

response = documents.query.near_text(
    query="what's the first amendment?",
    return_properties=['text'],
    limit=5  # limit to 5
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

{
  "text": "Amendment I.\n\nCongress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridg- ing the freedom of speech, or of the press, or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.\n\nAmendment II.\n\nA well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed."
}
{
  "text": "ARTICLES in addition to, and Amendment of the Constitution of the United States of America, proposed by Congress, and ratified by the Legislatures of the several States, pursuant to the fifth Article of the original Constitution.\n\n(Note: The first 10 amendments to the Constitution were ratified December 15, 1791, and form what is known as the \u201cBill of Rights.\u201d)"
}
{
  "text": "Ratification may be proposed by the Congress; Provided that no Amendment which may be made prior to the Year One thousand 
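
Conceptually, vector search embeds the query and returns the stored chunks whose vectors are closest to it. The brute-force sketch below uses cosine similarity on tiny made-up vectors. Weaviate itself uses an approximate nearest-neighbor index rather than a full scan, so treat this only as an illustration of the ranking idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical 2-dimensional embeddings; real ones have many more dimensions.
doc_vecs = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
print(top_k([1.0, 0.0], doc_vecs, k=2))
```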

### Generative Search

In [None]:
generateTask = "Please write a short summary of amendment 1"

documents = client.collections.get(os.getenv("WEAVIATE_COLLECTION_CLASS_NAME"))
response = documents.generate.near_text(
    query="amendment 1",
    limit=5,
    grouped_task=generateTask
)

In [None]:
import textwrap

# Assuming response.generated is a long string
wrapped_text = textwrap.fill(response.generated, width=80)  # Set width to desired character limit per line
print(wrapped_text)


Amendment I of the United States Constitution, part of the Bill of Rights,
prohibits Congress from making any law respecting an establishment of religion,
or prohibiting the free exercise thereof; or abridging the freedom of speech, or
of the press, or the right of the people peaceably to assemble, and to petition
the Government for a redress of grievances. This amendment was ratified on
December 15, 1791.
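
With `grouped_task`, the retrieved chunks are passed to the generative model together, and a single answer comes back for the whole result set. The sketch below shows the general shape of that idea: one prompt built from the task plus all retrieved texts. The prompt layout and the `build_grouped_prompt` helper are assumptions for illustration, not Weaviate's actual template.

```python
def build_grouped_prompt(task, chunks):
    """Combine a task and several retrieved chunks into one prompt string."""
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{task}\n\nContext:\n{numbered}"


# Hypothetical retrieved texts standing in for the search results.
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_grouped_prompt("Please write a short summary of amendment 1", retrieved))
```

Because every retrieved chunk ends up in the same prompt, a larger `limit` gives the model more context but also a longer prompt.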
