# Using Unstructured with LangChain & AstraDB

In this notebook, we walk through a basic RAG-style example that uses the Unstructured API to parse a PDF document, stores the parsed content in a vector store (`AstraDB`), and finally runs some basic queries against that store. The notebook is modeled after the quick start notebooks and is meant as a way to get started with Unstructured, backed by a vector database.

### Requirements

In [None]:
# First, install the required dependencies
!pip install --quiet ragstack-ai

### Configuration

In [32]:
import os
from getpass import getpass

os.environ["UNSTRUCTURED_API_KEY"] = getpass("Enter your Unstructured API Key:")
os.environ["ASTRA_DB_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

### Using the Unstructured API to parse a PDF

In this example notebook, we'll focus our analysis on pages 9 and 10 of the referenced paper, available at https://arxiv.org/pdf/1706.03762.pdf. 

We're doing this to save your usage credits, as the free version of Unstructured's API is limited to 1,000 pages per month.

#### Simple Parsing

First we will start with the most basic parsing mode. This works well if your document doesn't contain any complex formatting or tables.

In [2]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader

loader = UnstructuredAPIFileLoader(
    file_path="./resources/attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
)
documents = loader.load()
len(documents)

1

By default, the parser returns one document per PDF file. Let's examine the contents of the document:

In [3]:
print(documents[0].page_content)

Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.

base

(A)

(B)

(C)

(D)

N dmodel

6

512

2 4 8

256 1024

dff

2048

1024 4096

h

8 1 4 16 32

dk

64 512 128 32 16 16 32

32 128

dv

64 512 128 32 16

32 128

Pdrop

0.1

0.0 0.2

✏ls

0.1

0.0 0.2

PPL train steps (dev) 100K 4.92 5.29 5.00 4.91 5.01 5.16 5.01 6.11 5.19 4.88 5.75 4.66 5.12 4.75 5.77 4.95 4.67 5.47 4.92 300K 4.33

BLEU params 106 (dev) 25.8 65 24.9 25.5 25.8 25.4 25.1 25.4 23.7 25.3 25.5 24.5 26.0 25.4 26.2 24.6 25.5 25.3 25.7 25.7 26.4

⇥

58 60 36 50 80 28 168 53 90

(E) big

6

positional embedding instead of sinusoids

1024

4096

16

0.3

213

development set, newstest2013. We used beam search as described in the previous section, but no ch

Scrolling through the text above, you can see that the table data isn't well formatted.  

#### Advanced Parsing

By changing the processing strategy and response mode, we can get more detailed document structure. Unstructured can break the document into elements of different types, which can be helpful for improving your RAG system.  For example, you might want to exclude elements of type `Footer` from your vector store.

A list of all the different element types can be found here: https://unstructured-io.github.io/unstructured/introduction/overview.html#id1
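
As a minimal sketch of that kind of filtering, the snippet below drops unwanted element types before anything is indexed. It uses plain dictionaries as stand-ins for parsed elements; the helper name `drop_categories` and the sample records are illustrative assumptions, not part of Unstructured or LangChain.

```python
# Stand-in records: each dict mimics the category and text of one parsed element.
sample_elements = [
    {"category": "Title", "text": "7 Conclusion"},
    {"category": "NarrativeText", "text": "In this work, we presented the Transformer."},
    {"category": "Footer", "text": "10"},
    {"category": "Header", "text": "A running page header"},
]


def drop_categories(elements, excluded=("Header", "Footer")):
    """Return the elements whose category is not in `excluded` (hypothetical helper)."""
    return [el for el in elements if el.get("category") not in excluded]


kept = drop_categories(sample_elements)
print([el["category"] for el in kept])
```

With documents loaded in `elements` mode, the same test could be applied to `doc.metadata.get("category")` instead of the dictionary key used here.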

In [5]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader

loader = UnstructuredAPIFileLoader(
    file_path="./resources/attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    mode="elements", # default: "single"
    strategy="hi_res", # default: "auto"
    pdf_infer_table_structure=True,
)
documents = loader.load()
len(documents)

27

Now we have 27 documents returned from the PDF. We will use the following script to examine the new contents.

Note that we will use the `text_as_html` property on Table elements to show the 2 tables.

In [10]:
from IPython.display import display, HTML

for doc in documents:
    # dict.get returns None when the key is missing
    category = doc.metadata.get("category")
    page_number = doc.metadata.get("page_number")
    parent_id = doc.metadata.get("parent_id")

    if category == "Table":
        display(HTML(doc.metadata["text_as_html"]))
    else:
        print(f"category: {category}, page_number: {page_number}, parent_id: {parent_id}, content: {doc.page_content}")

category: FigureCaption, page_number: 1, parent_id: None, content: Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.


Unnamed: 0,N,dwss,dn,b,di,Pug,as,gon,| 08,BORT,PR,Unnamed: 12
base,| 6,512.0,2048,8,64,0.1,1.0,100K,| 492,258.0,65.0,
,,,,1,512,,,,529,249.0,,
,,,,4,128,,,,500,255.0,,
(A),,,,16,32,,,,491,258.0,,
,,,,32,16,,,,501,254.0,,
,,,,,16,,,,516,251.0,58.0,
®),,,,,32,,,,501,254.0,60.0,
©),2,,,,,,,,6.11,237.0,36.0,
©),,4.0,,,,,,,,519.0,253.0,50.0
©),,8.0,,,,,,,,488.0,255.0,80.0


category: NarrativeText, page_number: 1, parent_id: None, content: development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
category: NarrativeText, page_number: 1, parent_id: None, content: In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
category: NarrativeText, page_number: 1, parent_id: None, content: In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneﬁcial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in 

Parser,Training,WSJ 23 F1
Vinyals & Kaiser el al. (2014) B2,"T WST only, discriminative",88.3
Petrov et al. (2006),"WSIJ only, discriminative",90.4
Zhu et al. (2013) [A0],"WSIJ only, discriminative",90.4
Dyer et al. (2016) (5],"WSJ only, discriminative",91.7
Transformer (4 layers),"WSIJ only, discriminative",91.3
Zhu et al. (2013) [A0],semi-supervised,91.3
Vinyals Transformer (4 layers),semi-supervised semi-supervised,92.7
Luong et al. (2015) 23],multi-task,93.0
Dyer et al. (2016),generative,93.3


category: NarrativeText, page_number: 2, parent_id: None, content: increased the maximum output length to input length + 300. We used a beam size of 21 and ↵ = 0.3 for both WSJ only and the semi-supervised setting.
category: NarrativeText, page_number: 2, parent_id: None, content: Our results in Table 4 show that despite the lack of task-speciﬁc tuning our model performs sur- prisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
category: NarrativeText, page_number: 2, parent_id: None, content: In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley- Parser [29] even when training only on the WSJ training set of 40K sentences.
category: Title, page_number: 2, parent_id: None, content: 7 Conclusion
category: NarrativeText, page_number: 2, parent_id: 1acda5750c5f47875b86ebb114a0bb80, content: In this work, we presented the Transformer, the ﬁrst sequence transduction

Here we can see that the table structure was parsed fairly well. The second (simpler) table keeps its rows and columns intact, although some cell text still shows recognition errors. We also see hints of document structure in the `parent_id` field, but unfortunately the LangChain `Document` type hides the `element_id` of each document.
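
Rendering a table is handy for a visual check, but the `text_as_html` string can also be turned into rows for programmatic inspection. The sketch below does this with Python's standard `html.parser` module. The class `_TableRows`, the function `html_table_to_rows`, and the shortened sample string are assumptions made for illustration; the sample only imitates the `<thead>`/`<tr>` layout seen in this notebook's output.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <thead> or <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tr"):
            if self._row:  # keep a non-empty row that was never closed
                self.rows.append(self._row)
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            if self._row is None:
                self._row = []
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("thead", "tr"):
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written, shortened sample in the same shape as `text_as_html`.
sample_html = (
    "<table><thead><th>Parser</th><th>Training</th><th>WSJ 23 F1</th></thead>"
    "<tr><td>Petrov et al. (2006)</td><td>WSJ only, discriminative</td><td>90.4</td></tr>"
    "<tr><td>Dyer et al. (2016)</td><td>generative</td><td>93.3</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
for row in rows:
    print(row)
```

This handles only flat tables; nested tables or cells spanning several rows or columns would need more care.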

We can instead use the Unstructured API directly to build up the element structure:

In [2]:
from langchain_community.document_loaders import unstructured

elements = unstructured.get_elements_from_api(
    file_path="./resources/attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    strategy="hi_res",
    pdf_infer_table_structure=True,
)

In [3]:
from IPython.display import display, HTML

parents = {}

for el in elements:
    parents[el.id] = el.text

for el in elements:
    if el.category == "Table":
        display(HTML(el.metadata.text_as_html))
    elif el.metadata.parent_id:
        print(f"parent: '{parents[el.metadata.parent_id]}' content: {el.text}")
    else:
        print(el)

Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.


Unnamed: 0,N,dwss,dn,b,di,Pug,as,gon,| 08,BORT,PR,Unnamed: 12
base,| 6,512.0,2048,8,64,0.1,1.0,100K,| 492,258.0,65.0,
,,,,1,512,,,,529,249.0,,
,,,,4,128,,,,500,255.0,,
(A),,,,16,32,,,,491,258.0,,
,,,,32,16,,,,501,254.0,,
,,,,,16,,,,516,251.0,58.0,
®),,,,,32,,,,501,254.0,60.0,
©),2,,,,,,,,6.11,237.0,36.0,
©),,4.0,,,,,,,,519.0,253.0,50.0
©),,8.0,,,,,,,,488.0,255.0,80.0


development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneﬁcial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-ﬁtting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
6.3 English Constituency P

Parser,Training,WSJ 23 F1
Vinyals & Kaiser el al. (2014) B2,"T WST only, discriminative",88.3
Petrov et al. (2006),"WSIJ only, discriminative",90.4
Zhu et al. (2013) [A0],"WSIJ only, discriminative",90.4
Dyer et al. (2016) (5],"WSJ only, discriminative",91.7
Transformer (4 layers),"WSIJ only, discriminative",91.3
Zhu et al. (2013) [A0],semi-supervised,91.3
Vinyals Transformer (4 layers),semi-supervised semi-supervised,92.7
Luong et al. (2015) 23],multi-task,93.0
Dyer et al. (2016),generative,93.3


increased the maximum output length to input length + 300. We used a beam size of 21 and ↵ = 0.3 for both WSJ only and the semi-supervised setting.
Our results in Table 4 show that despite the lack of task-speciﬁc tuning our model performs sur- prisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley- Parser [29] even when training only on the WSJ training set of 40K sentences.
7 Conclusion
parent: '7 Conclusion' content: In this work, we presented the Transformer, the ﬁrst sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
parent: '7 Conclusion' content: For translation tasks, the Transformer can be trained signiﬁcantly faster than architectures based on recurrent or convolutional layers. O

Here we clearly see that Unstructured is parsing both table and document structure.
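
The parent lookup used above can be taken one step further to collect each section's paragraphs under its title. The following is a minimal sketch on stand-in dictionaries whose `id`, `text` and `parent_id` keys mirror the fields read from the elements above; `group_by_parent` and the abbreviated sample texts are hypothetical.

```python
from collections import defaultdict

# Stand-in elements with abbreviated text.
sample = [
    {"id": "t1", "text": "7 Conclusion", "parent_id": None},
    {"id": "n1", "text": "In this work, we presented the Transformer.", "parent_id": "t1"},
    {"id": "n2", "text": "For translation tasks, the Transformer trains faster.", "parent_id": "t1"},
    {"id": "n3", "text": "A paragraph without a parent.", "parent_id": None},
]


def group_by_parent(elements):
    """Map each parent's text to the texts of its child elements, in order."""
    texts = {el["id"]: el["text"] for el in elements}
    sections = defaultdict(list)
    for el in elements:
        parent_id = el.get("parent_id")
        if parent_id in texts:  # skips elements with no (or an unknown) parent
            sections[texts[parent_id]].append(el["text"])
    return dict(sections)


outline = group_by_parent(sample)
for title, children in outline.items():
    print(title, "->", len(children), "child element(s)")
```

Keying by the parent's text keeps the example short; keying by `id` would be safer if two titles could share the same text.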

### Storing into Astra DB

Now we will continue with the RAG process by creating embeddings for the PDF and storing them in Astra DB.

In [26]:
from langchain_community.vectorstores import AstraDB
from langchain_openai import OpenAIEmbeddings

astra_db_store = AstraDB(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_ENDPOINT")
)

We will create LangChain Documents by splitting the text after `Table` elements and before `Title` elements. Additionally, we will use the HTML output format for table data.

In [15]:
from langchain_core.documents import Document

documents = []
current_doc = None

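for el in elements:
    if el.category in ["Header", "Footer"]:
        continue  # skip page headers and footers
    # A Title starts a new document, so close the one being built (if any);
    # the guard avoids appending None when a Title directly follows a Table
    if el.category == "Title" and current_doc is not None:
        documents.append(current_doc)
        current_doc = None
    if current_doc is None:
        current_doc = Document(page_content="", metadata=el.metadata.to_dict())
    # Tables are stored as HTML, everything else as plain text
    current_doc.page_content += el.metadata.text_as_html if el.category == "Table" else el.text
    # A Table closes the current document
    if el.category == "Table":
        documents.append(current_doc)
        current_doc = None
# Note: a document still being built when the loop ends is not added to `documents`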

documents

[Document(page_content='Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.<table><thead><th></th><th>N</th><th>dwss</th><th>dn</th><th>b</th><th>di</th><th>Pug</th><th>as</th><th>gon</th><th>| 08</th><th>BORT</th><th>PR</th></thead><tr><td>base</td><td>| 6</td><td>512</td><td>2048</td><td>8</td><td>64</td><td>0.1</td><td>01</td><td>100K</td><td>| 492</td><td>258</td><td>65</td></tr><tr><td></td><td></td><td></td><td></td><td>1</td><td>512</td><td></td><td></td><td></td><td>529</td><td>249</td><td></td></tr><tr><td></td><td></td><td></td><td></td><td>4</td><td>128</td><td></td><td></td><td></td><td>500</td><td>255</td><td></td></tr><tr><td>(A)</td><td></td><td></td><td></td><td>16</td><td>32</td><td></td><td></td><td><

In [27]:
astra_db_store.add_documents(documents)

['124f9da7f5794806bbb681b1277c5d45',
 '19757da3ce71413899b2a78eff6657ee',
 'd9b8862819e14b469416a4f6f087e53e',
 'd1f98d9d7db34e45b081c3c0416901a0']
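
The splitting rule used to build these documents can be exercised without calling the API. The sketch below applies the same rule (close a chunk after each `Table`, start a new one at each `Title`) to stand-in `(category, text)` tuples. It differs from the notebook cell in two ways that are assumptions of this sketch: element texts are joined with newlines, and any chunk still open when the loop ends is flushed. The name `split_elements` and the sample data are hypothetical.

```python
def split_elements(elements):
    """Group (category, text) pairs into chunks: break after Table, before Title."""
    chunks, current = [], []
    for category, text in elements:
        if category in ("Header", "Footer"):
            continue  # skip page headers and footers
        if category == "Title" and current:
            chunks.append(current)  # a Title closes the chunk in progress
            current = []
        current.append(text)
        if category == "Table":
            chunks.append(current)  # a Table ends its chunk
            current = []
    if current:
        chunks.append(current)  # flush the chunk still open at the end
    return ["\n".join(chunk) for chunk in chunks]


sample = [
    ("FigureCaption", "Table 3: Variations on the Transformer architecture."),
    ("Table", "<table><tr><td>base</td></tr></table>"),
    ("NarrativeText", "We present these results in Table 3."),
    ("Title", "A section title"),
    ("Footer", "page-footer"),
    ("Table", "<table><tr><td>Parser</td></tr></table>"),
    ("Title", "7 Conclusion"),
    ("NarrativeText", "In this work, we presented the Transformer."),
]

chunks = split_elements(sample)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```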

### Simple RAG Example

Now that we have populated our vector store, we will build a RAG pipeline and run some queries against it.

In [28]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)
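
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` directly, so the prompt most likely receives the string form of that list, metadata included. A common variation is to join only the page text first. The sketch below shows such a formatter on a stand-in `Doc` class, so that it runs without LangChain installed; both `Doc` and `format_docs` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: only the fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    Doc("Reducing the attention key size hurts model quality.", {"page_number": 1}),
    Doc("7 Conclusion", {"page_number": 2}),
]
context = format_docs(retrieved)
print(context)
```

In an LCEL chain, a function like this is typically piped after the retriever, for example `astra_db_store.as_retriever() | format_docs` as the value of `"context"`.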

First we can ask a question about some text in the document:

In [29]:
chain.invoke("What does reducing the attention key size do?")

'Reducing the attention key size hurts model quality.'

Next we can try to get a value from the 2nd table:

In [31]:
chain.invoke("For the transformer to English constituency results, what was the 'WSJ 23 F1' value for 'Dyer et al. (2016) (5]'?")

"The 'WSJ 23 F1' value for 'Dyer et al. (2016) (5]' was 91.7."

Finally, we can ask a question whose answer is not in our content, to confirm that the LLM declines to answer when the context does not support one.

In [23]:
# Query fails to be answered due to lack of context in Astra DB
chain.invoke("When was George Washington born?")

"I don't know. The supplied context does not provide any information about George Washington's birthdate."