# Using Unstructured with LangChain & AstraDB

In this notebook, we walk through a basic RAG-style example: we use the Unstructured API to parse a PDF document, store the parsed content in a vector store (`AstraDB`), and then run some basic queries against that store. The notebook is modeled after the quick start notebooks and is meant as a starting point for using Unstructured with a vector database.

To use Unstructured, you need an API key. Sign up for one here: https://unstructured.io/api-key-hosted. A key will be emailed to you.

### Requirements

In [None]:
# First, install the required dependencies
! pip install --quiet ragstack-ai

In [None]:
# Next, download the test pdf
import requests

url = 'https://raw.githubusercontent.com/datastax/ragstack-ai/main/examples/notebooks/resources/attention_pages_9_10.pdf'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
with open('attention_pages_9_10.pdf', 'wb') as file:
    file.write(response.content)


### Configuration

In [None]:
import os
from getpass import getpass

os.environ["UNSTRUCTURED_API_KEY"] = getpass("Enter your Unstructured API Key: ")
os.environ["UNSTRUCTURED_API_URL"] = getpass("Enter your Unstructured API URL: ")
os.environ["ASTRA_DB_API_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")
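
Before moving on, it can help to confirm that none of the values entered above is empty, since a blank key tends to surface later as a less obvious authentication error. The following is only a minimal sketch: the helper name `missing_settings` is hypothetical and is not part of any library used in this notebook.

```python
import os

# The five variables set in the configuration cell above.
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
]


def missing_settings(env, names=REQUIRED_VARS):
    """Return the names that are absent from env or hold only whitespace."""
    return [name for name in names if not (env.get(name) or "").strip()]


# Hypothetical helper: report anything that was left blank.
missing = missing_settings(os.environ)
if missing:
    print("Missing or empty settings:", ", ".join(missing))
```

Passing the mapping in as an argument, rather than reading `os.environ` inside the function, keeps the check easy to try against a plain dictionary.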

### Using the Unstructured API to parse a PDF

To limit API usage, this example notebook focuses on pages 9 and 10 of the paper "Attention Is All You Need", available at https://arxiv.org/pdf/1706.03762.pdf.

#### Simple Parsing

First we will start with the most basic parsing mode. This works well if your document doesn't contain any complex formatting or tables.

In [None]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader
import os

loader = UnstructuredAPIFileLoader(
    file_path="attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    url=os.getenv("UNSTRUCTURED_API_URL"),
)
simple_docs = loader.load()
len(simple_docs)

By default, the parser returns one document per PDF file. Let's examine some of the contents of the document:

In [None]:
print(simple_docs[0].page_content[0:400])

This sample of the document contents shows the first table's description, and the start of a very poorly formatted table.

#### Advanced Parsing

By changing the processing strategy and response mode, we can get more detailed document structure. Unstructured can break the document into elements of different types, which can be helpful for improving your RAG system.

For example, the `Table` element type includes the table formatted as simple HTML, which can help the LLM answer questions from the table data. We could also exclude elements of type `Footer` from our vector store.

A list of all the different element types can be found here: https://unstructured-io.github.io/unstructured/introduction/overview.html#id1

Returned metadata can also be helpful: for example, the `page_number` of the PDF input, and a `parent_id` property that helps define the nesting of text sections.

In [None]:
from langchain_community.document_loaders import unstructured

elements = unstructured.get_elements_from_api(
    file_path="attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    api_url=os.getenv("UNSTRUCTURED_API_URL"),
    strategy="hi_res", # default "auto"
    pdf_infer_table_structure=True,
)

len(elements)

Instead of a single document returned from the pdf, we now have 27 elements. Below, we use element type and `parent_id` to show a clearer representation of the document structure.

In [None]:
from IPython.display import display, HTML

parents = {}

for el in elements:
    parents[el.id] = el.text

for el in elements:
    if el.category == "Table":
        display(HTML(el.metadata.text_as_html))
    elif el.metadata.parent_id:
        print(f"parent: '{parents[el.metadata.parent_id]}' content: {el.text}")
    else:
        print(el)

Here we clearly see that Unstructured is parsing both table and document structure.
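
A quick tally of element categories, and a filter that drops the ones we do not want to embed, can make this structure easier to work with. The sketch below is illustrative only: `make_element` builds stand-in objects exposing the attributes read in the loop above (`id`, `category`, `text`, `metadata.parent_id`), the sample values are made up, and the helper names `count_categories` and `drop_categories` are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def make_element(id, category, text, parent_id=None):
    """Stand-in exposing the same attributes the loop above reads."""
    return SimpleNamespace(
        id=id,
        category=category,
        text=text,
        metadata=SimpleNamespace(parent_id=parent_id),
    )


def count_categories(elements):
    """Tally how many elements of each category were returned."""
    return Counter(el.category for el in elements)


def drop_categories(elements, unwanted=("Header", "Footer")):
    """Keep only the elements whose category is not in unwanted."""
    return [el for el in elements if el.category not in unwanted]


# Illustrative sample only; real elements come from the API call above.
sample = [
    make_element("t1", "Title", "Example heading"),
    make_element("n1", "NarrativeText", "Example paragraph.", parent_id="t1"),
    make_element("f1", "Footer", "9"),
]

print(count_categories(sample))
print([el.text for el in drop_categories(sample)])
```

Assuming the real elements expose the same attributes, both helpers could be called on `elements` directly.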

### Storing into Astra DB

Now we continue with the RAG process by creating embeddings for the PDF and storing them in Astra DB.

In [None]:
from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

astra_db_store = AstraDBVectorStore(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT")
)

We will create LangChain Documents by splitting the text after `Table` elements and before `Title` elements, skipping `Header` and `Footer` elements. Additionally, we use the HTML output format for table data.

In [None]:
from langchain_core.documents import Document

documents = []
current_doc = None

for el in elements:
    if el.category in ["Header", "Footer"]:
        continue # skip these
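    if el.category == "Title" and current_doc is not None:
        # a new section starts: store the document built so far
        documents.append(current_doc)
        current_doc = None
    if current_doc is None:
        current_doc = Document(page_content="", metadata=el.metadata.to_dict())
    current_doc.page_content += el.metadata.text_as_html if el.category == "Table" else el.text
    if el.category == "Table":
        # a table closes the current document
        documents.append(current_doc)
        current_doc = None

# store any text that follows the last Table or Title
if current_doc is not None:
    documents.append(current_doc)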

astra_db_store.add_documents(documents)
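
To see the grouping rule in isolation, the sketch below applies it to a few stand-in elements and collects plain strings instead of LangChain Documents. It is an illustration under stated assumptions: `fake` and `group_element_text` are hypothetical names, the sample text is made up, and the stand-ins expose only the attributes the loop reads. The sketch also avoids emitting an empty chunk when a `Title` arrives with nothing pending, and keeps any text left over after the last element.

```python
from types import SimpleNamespace


def fake(category, text, html=None):
    """Stand-in element with category, text and metadata.text_as_html."""
    return SimpleNamespace(
        category=category,
        text=text,
        metadata=SimpleNamespace(text_as_html=html),
    )


def group_element_text(elements):
    """Group element text into chunks: a Title opens a chunk, a Table closes one."""
    chunks = []
    current = None
    for el in elements:
        if el.category in ("Header", "Footer"):
            continue  # skipped, as in the cell above
        if el.category == "Title" and current is not None:
            chunks.append(current)  # flush the chunk before the new section
            current = None
        if current is None:
            current = ""
        current += el.metadata.text_as_html if el.category == "Table" else el.text
        if el.category == "Table":
            chunks.append(current)  # a table ends the chunk it belongs to
            current = None
    if current is not None:
        chunks.append(current)  # keep the trailing chunk
    return chunks


# Illustrative sample only.
sample = [
    fake("Title", "Heading A"),
    fake("NarrativeText", "Intro text."),
    fake("Table", "a b", html="<table><tr><td>a</td><td>b</td></tr></table>"),
    fake("Footer", "9"),
    fake("Title", "Heading B"),
    fake("NarrativeText", "Closing text."),
]

for chunk in group_element_text(sample):
    print(repr(chunk))
```

Like the cell above, the sketch concatenates element text without a separator; joining with a newline instead would be a reasonable variation.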

### Querying

Now that we have populated our vector store, we will build a RAG pipeline and execute some queries.

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)
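
In the chain above, the retriever's output (a list of documents) is placed into `{context}` as is, so the prompt receives the string form of that list, metadata included. One common alternative is to join only the page contents before they reach the prompt. The sketch below shows the idea with a stand-in document class so that it runs without any services; `FakeDocument` and `format_docs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only page_content is used."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into a single context string."""
    return separator.join(doc.page_content for doc in docs)


# Illustrative sample only.
retrieved = [
    FakeDocument("First chunk."),
    FakeDocument("Second chunk.", {"page_number": 9}),
]
context = format_docs(retrieved)
print(context)
```

Assuming the usual LCEL behavior of coercing plain functions into runnables, the helper could be wired in as `"context": astra_db_store.as_retriever() | format_docs`.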

First we can ask a question about some text in the document:

In [None]:
chain.invoke("What does reducing the attention key size do?")

Next we can try to get a value from the second table:

In [None]:
chain.invoke("For the transformer to English constituency results, what was the 'WSJ 23 F1' value for 'Dyer et al. (2016) (5]'?")

And finally, we can ask a question that our content cannot answer, to confirm that the LLM declines to answer as instructed.

In [None]:
# Query fails to be answered due to lack of context in Astra DB
chain.invoke("When was George Washington born?")