# Using Unstructured with LangChain & AstraDB

In this notebook, we walk through a basic RAG-style example: we use the Unstructured API to parse a PDF document, store the parsed document in a vector store (`AstraDB`), and then run some basic queries against that store. The notebook is modeled after the quick start notebooks and is meant as a starting point for using Unstructured backed by a vector database.

### Requirements

In [None]:
# First, install the required dependencies
!pip install --quiet ragstack-ai

### Configuration

In [None]:
import os
from getpass import getpass

os.environ["UNSTRUCTURED_API_KEY"] = getpass("Enter your Unstructured API Key: ")
os.environ["ASTRA_DB_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

In [1]:
# Alternatively, load the same variables from a local .env file
import os
from dotenv import load_dotenv

load_dotenv()

True
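Before calling any of the services, it can help to confirm that all four variables set above are actually present. The following is a minimal sketch, not part of the original notebook; the helper name `missing_env_vars` is hypothetical, and the variable names are simply the ones used in the configuration cells.

```python
import os

# Variable names used by the configuration cells in this notebook.
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "ASTRA_DB_ENDPOINT",
    "ASTRA_DB_TOKEN",
    "OPENAI_API_KEY",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing configuration:", ", ".join(missing))
else:
    print("All required variables are set.")
```

If anything is reported as missing, re-run one of the configuration cells above before continuing.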

### Using the Unstructured API to parse a PDF

Note that we use a simple, single-page PDF generated by the RAGStack team. The free version of the Unstructured API is limited to 1,000 pages per month, and we don't want this example notebook to use up too much of your credit.

In [None]:
# Download a simple PDF
import requests

# The URL of the file you want to download
url = "https://raw.githubusercontent.com/datastax/ragstack-ai/e0d91b269113bb4c26c8d68b1255a1f6b060f9a9/ragstack-e2e-tests/e2e_tests/resources/tree.pdf"
# The local path where you want to save the file
file_path = "./tree.pdf"

# Perform the HTTP request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Open the file in binary write mode and save the content
    with open(file_path, "wb") as file:
        file.write(response.content)
    print("Download complete.")
else:
    print("Error downloading the file.")

In [3]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader

loader = UnstructuredAPIFileLoader(
    file_path="./tree.pdf",
    mode="single",
    strategy="auto",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
)
documents = loader.load()

In [4]:
# Take a quick look at the parsed text from the PDF:
print(documents[0].page_content)

Attribute

Value

Type

Broadleaf

Height

Can reach up to 100 ft

Age

Several centuries possible

Leaf Color

Green in Summer, Yellow/Brown in Autumn

Once upon a time, in a lush meadow bordered by the whispering woods, stood a magnificent oak tree. This grand oak was known by all as Eldenroot, a sentinel that had watched over the land for countless generations. Eldenroot's branches stretched out like the arms of a wise elder, offering shade to travelers and a home to the creatures of the forest. Its leaves, a vibrant green throughout the warm months, turned into a fiery tapestry of reds, oranges, and yellows with the arrival of autumn. Legends spoke of its mystical beginnings, planted by nature spirits on a night when the moon shone brightest. Eldenroot was not just a tree; it was a timeless guardian, a keeper of secrets, and a symbol of the enduring beauty of nature.


### Storing into Astra DB

In [5]:
from langchain_community.vectorstores import AstraDB
from langchain_openai import OpenAIEmbeddings

astra_db_store = AstraDB(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_ENDPOINT")
)

In [6]:
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=0)
astra_db_store.add_documents(splitter.split_documents(documents))

['823a261e3c6242aaac1fd9066a9145a6']
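The single ID returned above suggests that the whole document fit into one 512-token chunk. To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a simplified sliding-window sketch in plain Python. It is an illustration only, not the `TokenTextSplitter` implementation: the function name `chunk_tokens` is hypothetical, and integers stand in for the token IDs that a real tokenizer would produce.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token sequence into windows of at most `chunk_size` items.

    Consecutive windows share `chunk_overlap` items. This is a simplified
    illustration of token-based chunking, not LangChain's own code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))  # stand-in for real token IDs
print(chunk_tokens(tokens, chunk_size=4, chunk_overlap=0))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunk_tokens(tokens, chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With `chunk_overlap=0`, as used in this notebook, chunks do not share any tokens; a non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant passage straddles a chunk boundary.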

### Simple RAG Example

In [13]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)
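In the chain above, the retriever's output (a list of documents) is passed straight into the `{context}` slot, so the prompt receives the string representation of that list. A common refinement is to join only the text of each retrieved document. The sketch below shows such a helper; the name `format_docs` is an assumption, and `SimpleNamespace` objects stand in for LangChain documents so the example runs on its own. The only thing it relies on is a `page_content` attribute.

```python
from types import SimpleNamespace


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string.

    Each item only needs a `page_content` attribute, which LangChain
    documents provide.
    """
    return separator.join(doc.page_content for doc in docs)


# Stand-in objects for illustration; real documents would come from the retriever.
example_docs = [
    SimpleNamespace(page_content="Eldenroot was a magnificent oak tree."),
    SimpleNamespace(page_content="It stood in a lush meadow."),
]
print(format_docs(example_docs))
```

To use it in the chain, the context entry could be written as `astra_db_store.as_retriever() | format_docs`, since a plain function composed with a runnable is wrapped automatically. Whether this changes the answers for a document this small is untested here; the main benefit is a cleaner prompt.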

In [14]:
chain.invoke("What was Eldenroot?")

'Eldenroot was a magnificent oak tree.'

In [None]:
# Take a look at one of the source documents retrieved for the question
source_docs = astra_db_store.similarity_search("What was Eldenroot?", k=1)
source_docs[0].page_content

In [9]:
# This question cannot be answered because Astra DB holds no relevant context
chain.invoke("When was George Washington born?")

"I don't know."