# Structured Chunking

from https://blog.langchain.dev/a-chunk-by-any-other-name/

In [1]:
# Required for csv to markdown conversion
!pip install tabulate



In [2]:
from dotenv import load_dotenv

### Workaround for local package imports

To import from the `app.preprocessing` package at the root of the repo, we need to change the working directory to the repo root.

Note: this code is only necessary in a Jupyter notebook. In a regular Python script, the imports work fine.

In [3]:
import os
import sys

# This assumes your working directory is in the experiments/<name>/ dir
repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))

# Check if app/ dir exists in current working directory along with .env (to make this cell re-runnable)
if os.path.exists(os.path.join(os.getcwd(), "app")) and os.path.exists(os.path.join(os.getcwd(), ".env")):
    app_path = os.path.join(os.getcwd(), "app")
    sys.path.append(app_path)
    print("app/ dir and .env found in current working directory, keeping CWD as is.")
else:
    os.chdir(repo_root)
    print("app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.")

# Load .env
load_dotenv()

app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.


True

## Load structured document 

Steps to get a `Document` object from a PDF path and split it into chunks:
1. create an `AdobeExtractAPIManager` instance
2. call its `get_document` method with the path to the document
3. call `DocumentSplitter().document_to_chunks(document)` on the result

In [4]:
from app.preprocessing.adobe.manager import AdobeExtractAPIManager

adobe_manager = AdobeExtractAPIManager(
    os.getenv("ADOBE_CLIENT_ID"),
    os.getenv("ADOBE_CLIENT_SECRET"),
    extract_dir_path="data/interim/000-adobe-extract/"
)

####
# Edit the path below to point to a PDF file on your local machine
####
document = adobe_manager.get_document(
    "/Users/dvdblk/Downloads/pdf_files_complete/UK_47.pdf"
)

2023-11-19 03:37:46.278 app.preprocessing.adobe.manager INFO     Initialized AdobeExtractAPIManager (with extract_dir_path=data/interim/000-adobe-extract/)


#### Visualize the sections of the document

In [5]:
def print_hierarchy(section, level=0):
    print("\t" * level, section.section_type, section.title, section.starting_page, section.id)
    for subsection in section.subsections:
        print_hierarchy(subsection, level + 1)

# Print document section title hierarchy
print_hierarchy(document)

 document None 1 
	 H1 CONTENTS 2 1
	 H1 EXECUTIVE SUMMARY 4 2
	 H1 1. INTRODUCTION1 8 3
	 H1 2.THE UK SPACE INDUSTRY:AN OVERVIEW 10 4
		 H2 FIGURE 1:THE STRUCTURE OF THE SPACE SECTOR 11 4.1
	 H1 3. RESEARCH METHODOLOGY 15 5
	 H1 4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS 17 6
		 H2 4.1.THE SIZE OF THE TECHNICIAN WORKFORCE 17 6.1
		 H2 4.2 TYPES OF TECHNICIAN AND THE NATURE OF TECHNICAL SUPPORT 18 6.2
			 H3 4.2.1: Mechanical Engineering Technicians 19 6.2.1
				 H4 4.2.1 (a) Machinists 19 6.2.1.1
					 H5 4.2.1 (b) Mechanical Assembly Technician 20 6.2.1.1.1
						 H6 4.2.1 (c) Composites Laminators 21 6.2.1.1.1.1
						 H6 4.2.1 (d) Manufacturing or Production Engineers 21 6.2.1.1.1.2
						 H6 4.2.2 Electrical and Electronics Engineering Technicians 22 6.2.1.1.1.3
						 H6 4.2.2 (a) PCB (Electronics) Assembly and Inspection Technician 23 6.2.1.1.1.4
					 H5 4.2.2 (b) Electronics and Electrical Assembly Technicians 24 6.2.1.1.2
						 H6 

## Create chunks from sections

In [6]:
from app.preprocessing.adobe.splitter import DocumentSplitter

# Get chunks
docs = DocumentSplitter().document_to_chunks(document)

In [7]:
# Print avg chunk length
avg_chunk_len = sum([len(doc.page_content) for doc in docs]) / len(docs)
avg_chunk_len

3863.242424242424
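
An average on its own can hide very short or very long chunks. The following is a minimal sketch for looking at the spread of chunk lengths, assuming only that each chunk exposes its text as a string. `chunk_length_stats` and the toy data are hypothetical and not part of `app.preprocessing`.

```python
import statistics


def chunk_length_stats(texts):
    """Return basic length statistics (in characters) for a list of chunk texts."""
    lengths = [len(t) for t in texts]
    if not lengths:
        raise ValueError("no chunks given")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.fmean(lengths),
        "max": max(lengths),
    }


# Toy data standing in for [doc.page_content for doc in docs]
sample_texts = ["a" * 10, "b" * 200, "c" * 4000]
stats = chunk_length_stats(sample_texts)
print(stats)
```

In the notebook, the same helper could be called with `[doc.page_content for doc in docs]`.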

## Structured chunking methods below

In [8]:
def print_result(response_obj):
    print("SOURCES: \n")
    cnt = 1
    for source_doc in response_obj["source_documents"]:
        print(f"Chunk #{cnt}")
        cnt += 1
        print("Source Metadata: ", source_doc.metadata)
        print("Source Text:")
        print(source_doc.page_content)
        print("\n")
    print("RESULT: \n")
    print(response_obj["result"] + "\n\n")

In [9]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA


llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

In [9]:
query = "Are there any mentions of skills or technologies?"
response = retriever({"query": query})
print_result(response)

SOURCES: 

Chunk #1
Source Metadata:  {'H1': '4. Talent and Skills'}
Source Text:
Vision: The UK has a large, varied base of skilled, technical and entrepreneurial talent which is agile and quickly responds to the needs of industry, academia and government. This includes talent in STEM, digital and data, commercialisation and national security.


Chunk #2
Source Metadata:  {'H1': '8. Access to Physical and Digital Infrastructure'}
Source Text:
Vision: Accessibility and coordination of infrastructure attracts talent and investment, establishes anchors for innovation clusters and enables companies to scale. The UK has diverse, agile and resilient facilities to support its technology choices and works with partners globally to deliver major science and technology projects.


Chunk #3
Source Metadata:  {'H2': 'Outcomes – by 2030 we will have:', 'H1': '4. Talent and Skills'}
Source Text:
● Created an agile and responsive skills system, which delivers the skills needed to support a world-cla

# Summarizing each section

(structured chunking ends with the cell above)

In [10]:
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

"""
    Your answer will be exactly in the following format:
    ```json
    {{
        "summary": "<your summary here>"
    }}
    ```
"""

class SectionSummaryOutput(BaseModel):
    """Contains summary of a given section"""
    summary: Optional[str] = Field(None, description="the summary of the section")

# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
prompt_template_structured = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
You're an expert policy analyst that is analyzing an economic policy document. Your goal is to summarize a given section text of a document in no more than 15-20 sentences. Don't make a longer summary than the original text.

The section text will be given to you in the following json format:
```json
{{
    "section": {{
        "title": "<section title>",
        "text": "<section text to summarize>"
    }}
}}
```

Make sure to follow these rules while summarizing (as if your life depended on it):
1. absolutely make sure that you don't skip any mentions of technologies, skills, capabilities or investments related to any of these topics: Advanced Computing, Battery Technologies, Semiconductors, Clean Energy.
2. pay attention to the intention of the section especially with regards to sentiment towards adoption or promotion of any skills.
3. if the text mentions or discusses policy initiatives related to inclusion, health, digital, green resilience make sure to include them in the summary.
4. mention any discussion of funding, investments or budget allocations.
5. in the summary, make sure to mention whether there is a certain future need for any skills or technologies
6. mention any explicit skill needs that are mentioned in the text.
7. if the section is a table of contents or an index, just return "table of contents" as the summary
8. if the entire section contains only publication citations, don't summarize it just return "references" as the summary.
            """,
        ),
        (
            "human",
            """
Here is the section json of a document to summarize:
```json
{{
    "section": {{
        "title": {section_title},
        "text": {section_text}
    }}
}}
```
            """,
        ),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)

In [11]:
from langchain.chains.openai_functions import create_structured_output_runnable
import json

def get_all_document_sections(document):
    sections = []

    if subsections := document.subsections:
        for subsection in subsections:
            sections.append(subsection)
            if s := get_all_document_sections(subsection):
                sections.extend(s)

    return sections


all_sections = get_all_document_sections(document)

runnable = create_structured_output_runnable(SectionSummaryOutput, llm, prompt_template_structured)

section_summaries = []

for section in all_sections:
    title, text = section.title, "\n".join([p.text for p in section.paragraphs])

    if len(text) > 0:
        # Response now has no content, it's the pydantic object instead
        response = runnable.invoke({"section_title": title, "section_text": text})
        pretty_resp = json.dumps(response.dict(), indent=2)

        section_summaries.append((section, response))

        print(section.section_type, section.title, section.starting_page, section.id)
        print(pretty_resp)
        print("\n\n")
    else:
        section_summaries.append((section, SectionSummaryOutput(summary=None)))
        print("Not querying section with no text:", section.title)
        print("Saved summary as None")
        print("\n\n")


H1 CONTENTS 2 1
{
  "summary": "table of contents"
}



H1 EXECUTIVE SUMMARY 4 2
{
  "summary": "The section discusses the government's goal of creating a modern class of technicians to address skills shortages and the aging workforce in the UK economy. It investigates the role of technicians in the space industry, their duties, required skills, and how employers obtain them. The report focuses on the space sector's upstream and downstream activities, highlighting the sector's significant contribution to the UK GDP and employment. It also emphasizes the rapid growth and high labor productivity of the space sector. The data collected indicates that most technicians are employed by upstream manufacturers and possess qualifications such as HNC in engineering. The report highlights the shift towards in-house training and apprenticeships to meet the growing demand for technicians. It discusses the advantages of apprenticeship training, the qualifications apprentices are studying for, and th

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=15.0).


H2 4.1.THE SIZE OF THE TECHNICIAN WORKFORCE 17 6.1
{
  "summary": "The section discusses the employment of technicians in the space sector, particularly in upstream manufacturing roles. It highlights that technicians account for around 20% of the total workforce in space primes and specialist upstream manufacturers, while in organizations combining upstream manufacturing with downstream services, the share of technicians is much smaller, averaging under 5% of the total workforce. The study indicates that downstream firms and satellite operators accounted for fewer than 5% of the total number of technicians employed. The highly demanding and specialized nature of work in software, IT engineering services, and consultancy implies a preference for hiring individuals with at least a degree level qualification. Exceptions include a small number of vocationally educated IT technicians in database management and front-line helpdesk roles. The report suggests that the industry relies heavily o

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=15.0).


H3 5.2.2 Rationale 34 7.2.2
{
  "summary": "The section discusses the rationale behind organizations taking on apprenticeships, with a focus on the need to acquire specialist technician skills due to a declining availability of such skills in the external labor market. It highlights the use of apprenticeships for meeting medium-term skills needs, workforce planning, and addressing recruitment problems. Additionally, it emphasizes the role of apprenticeships in planning for the orderly succession of an aging technician workforce. The text also mentions the need for more training and apprenticeships to fill jobs arising from planned growth in manufacturing and services, as indicated in a recent report on the space sector. Furthermore, the advantage of apprenticeship training is noted as an opportunity for employers to socialize young people into the organization's culture, shaping their habits and modes of thought. The section emphasizes the value of starting with individuals who have no

### PDFTriage style OpenAI functions

1. initial prompt question (context=document section hierarchy + summaries)
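
The context for that initial prompt is a nested description of the document: each section contributes its ID, title and summary, and nests its subsections. A minimal sketch of that structure, using a hypothetical `ToySection` dataclass in place of the notebook's section objects (the field names are assumptions chosen for illustration):

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToySection:
    """Hypothetical stand-in for the notebook's section objects."""
    id: str
    title: str
    summary: Optional[str] = None
    subsections: List["ToySection"] = field(default_factory=list)


def to_metadata(section: ToySection) -> dict:
    """Recursively convert a section tree into nested, JSON-serializable metadata."""
    node = {"id": section.id, "title": section.title, "summary": section.summary}
    # Only add the "sections" key when there is something to nest
    if section.subsections:
        node["sections"] = [to_metadata(s) for s in section.subsections]
    return node


toy_root = ToySection("1", "RESULTS", "Overview of results", [
    ToySection("1.1", "Workforce size", "Share of technicians"),
    ToySection("1.2", "Technician types", "Roles and duties"),
])
meta = to_metadata(toy_root)
context = json.dumps(meta, indent=2)
print(context)
```

Serializing with `json.dumps` keeps the context in the JSON form the prompts describe, rather than a Python `dict` repr.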


The following is an implementation of OpenAI functions with LangChain (https://python.langchain.com/docs/modules/chains/how_to/openai_functions#getting-structured-outputs):

In [12]:
from typing import Optional

from langchain.chains.openai_functions import (
    create_openai_fn_chain,
    create_openai_fn_runnable,
    create_structured_output_chain,
    create_structured_output_runnable,
)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel, Field

In [67]:
class Person(BaseModel):
    """Identifying information about a person."""

    name: str = Field(..., description="The person's name")
    age: int = Field(..., description="The person's age")
    fav_food: Optional[str] = Field(None, description="The person's favorite food")

# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a world class algorithm for extracting information in structured formats.",
        ),
        (
            "human",
            "Use the given format to extract information from the following input: {input}",
        ),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)

runnable = create_structured_output_runnable(Person, llm, prompt)
runnable.invoke({"input": "Sally is 13"})

In [15]:
from typing import Sequence


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person] = Field(..., description="The people in the text")


runnable = create_structured_output_runnable(People, llm, prompt)
runnable.invoke(
    {
        "input": "Sally is 13, Joey just turned 12 and loves spinach. Caroline is 10 years older than Sally."
    }
)

People(people=[Person(name='Sally', age=13, fav_food=None), Person(name='Joey', age=12, fav_food='spinach'), Person(name='Caroline', age=23, fav_food=None)])

## Tools as OpenAI Functions

https://python.langchain.com/docs/modules/agents/tools/tools_as_openai_functions

In [48]:
from langchain.schema import HumanMessage
from langchain.tools import format_tool_to_openai_function, BaseTool
from typing import Type, Sequence
from langchain.prompts import HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain.callbacks.manager import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)


# Fetch Section tool
class FetchSectionsSchema(BaseModel):
    reasoning: str = Field(description="the reasoning behind the selection of a section to fetch")
    section_ids: Sequence[str] = Field(description="the exact ID(s) of the section(s) to fetch")

class FetchSectionsTool(BaseTool):
    name = "fetch_sections"
    description = "fetches an entire section or sections from a document that might contain an answer to the question"
    args_schema: Type[FetchSectionsSchema] = FetchSectionsSchema

    def _run(
        self,
        reasoning: str,
        section_ids: Sequence[str],
        run_manager: Optional[CallbackManagerForToolRun] = None,
        **kwargs,
    ) -> Sequence[str]:
        """Use the tool."""
        sections = []
        # get full section text from document
        for section, _ in section_summaries:
            if section.id in section_ids:
                result = {
                    "title": section.title_clean,
                    "id": section.id,
                    "text": "\n".join([p.text for p in section.paragraphs])
                }
                sections.append(result)
        return sections


# Fetch pages tool
class FetchPagesSchema(BaseModel):
    page_numbers: Sequence[int] = Field(description="the page numbers to fetch")

class FetchPagesTool(BaseTool):
    name = "fetch_pages"
    description = "useful when you need to fetch specific pages from a document, for example to fetch multiple sections at a time"
    args_schema: Type[FetchPagesSchema] = FetchPagesSchema

# Create langchain tools
tools = [
    FetchSectionsTool(),
    #FetchPagesTool()
]
# Transform to openai functions
openai_functions = [format_tool_to_openai_function(t) for t in tools]

"""
<< Example structure of the document >>
```json
{{
    "document": {{
        "title": <title of the document>,
        "sections": [
            {{
                "title": <title of the section>,
                "pages": <list of page numbers this section spans over>,
                "summary": <brief summary of the section>,
                "sections": <list of nested sections in this section, same structure as above>
            }}
        ]
    }}
}}
```

"""
# Structured metadata prompt (initial for most questions)
structured_metadata_system_prompt = SystemMessagePromptTemplate.from_template(
"""
You're an expert policy analyst that needs to find the appropriate sections of an economic policy document that answers the given question.
Your task is to look at the summaries in the following structural metadata json and find the appropriate section IDs of the document that might contain the answer.

Strictly adhere to these rules under all circumstances:
1. if you can't find the answer from the summaries, just fetch the most relevant sections to the question
2. for questions that are similar to "what is the document about?" or "what is the summary of the document?": try to fetch initial or final sections with "summary" or "conclusion" in their title.
3. always make sure to return all sections that might be relevant to the question as their respective IDs
""")

structured_metadata_prompt = HumanMessagePromptTemplate.from_template("""
<< Question >>
{question}

<< Document >>
{document_structural_metadata}
""")
structured_metadata_prompt

HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['document_structural_metadata', 'question'], template='\n<< Question >>\n{question}\n\n<< Document >>\n{document_structural_metadata}\n'))

In [49]:
def document_to_structured_metadata(section):
    """Convert document to structured metadata"""
    # Check if the document is the root node
    if section.section_type == "document":
        return {
            "document": {
                "title": section.title,
                "sections": [
                    document_to_structured_metadata(subsection) for subsection in section.subsections
                ]
            }
        }
    else:
        # find section from section summaries
        section, summary_response = next(filter(lambda x: x[0] == section, section_summaries))
        result = {
            "title": section.title_clean,
            "id": section.id,
            "pages": sorted(section.pages),
            "summary": summary_response.summary
        }
        if subsections := [document_to_structured_metadata(subsection) for subsection in section.subsections]:
            result["sections"] = subsections

        return result


print(json.dumps(document_to_structured_metadata(document), indent=2))


{
  "document": {
    "title": null,
    "sections": [
      {
        "title": "CONTENTS",
        "id": "1",
        "pages": [
          2
        ],
        "summary": "table of contents"
      },
      {
        "title": "EXECUTIVE SUMMARY",
        "id": "2",
        "pages": [
          4,
          5,
          6
        ],
        "summary": "The section discusses the government's goal of creating a modern class of technicians to address skills shortages and the aging workforce in the UK economy. It investigates the role of technicians in the space industry, their duties, required skills, and how employers obtain them. The report focuses on the space sector's upstream and downstream activities, highlighting the sector's significant contribution to the UK GDP and employment. It also emphasizes the rapid growth and high labor productivity of the space sector. The data collected indicates that most technicians are employed by upstream manufacturers and possess qualifications such

In [50]:
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser

# bind() returns a new runnable rather than modifying the model in place, so keep the result
llm_with_fns = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15).bind(
    functions=openai_functions
)

fns_prompt = ChatPromptTemplate.from_messages(
    [
        structured_metadata_system_prompt,
        structured_metadata_prompt,
    ]
)

fns_agent = (
    fns_prompt
    | llm_with_fns
)

# fns_agent.invoke({
#     "document_structural_metadata": document_to_structured_metadata(document),
#     "question": "Which section of the document mentions funding?"
# })

In [58]:
# This works fairly well
question = "Which sections talk about Machinists?"
fns_response_message = llm.predict_messages(
    [
        structured_metadata_system_prompt.format(),
        structured_metadata_prompt.format(
            document_structural_metadata=document_to_structured_metadata(document),
            question=question
        )
    ],
    functions=openai_functions,
)

In [59]:
fns_response_message

AIMessage(content='', additional_kwargs={'function_call': {'name': 'fetch_sections', 'arguments': '{"reasoning":"The sections that talk about Machinists are likely to provide information on the tasks, qualifications, and roles of machinists in the space industry.","section_ids":["6.2.1.1","6.2.1.1.1","6.2.1.1.1.1","6.2.1.1.1.2","6.2.1.1.1.3","6.2.1.1.1.4","6.2.1.1.2","6.2.1.1.2.1","6.2.1.1.3"]}'}})

In [60]:
import json

# Run the tool requested by the model's function call; otherwise return the plain message content.
# Returns the tool output (a list of section dicts for fetch_sections) or a string.
def parse_function_output(response):
    # Get the function call
    fn_call = response.additional_kwargs.get("function_call")

    # Check if the response content is empty and that there is a function call
    if response.content == "" and fn_call is not None:
        # Get the attributes of the function call
        tool_name = fn_call["name"]
        tool_args = json.loads(fn_call["arguments"])
        # Get the correct tool from the tools list
        tool = next(filter(lambda x: x.name == tool_name, tools))
        fn_output = tool._run(**tool_args)
        return fn_output
    else:
        # Otherwise return the content
        return response.content

# Fetched sections
fetched_sections = parse_function_output(fns_response_message)

{'reasoning': 'The sections that talk about Machinists are likely to provide information on the tasks, qualifications, and roles of machinists in the space industry.', 'section_ids': ['6.2.1.1', '6.2.1.1.1', '6.2.1.1.1.1', '6.2.1.1.1.2', '6.2.1.1.1.3', '6.2.1.1.1.4', '6.2.1.1.2', '6.2.1.1.2.1', '6.2.1.1.3']}
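
The dispatch step can be tried without calling the API. The sketch below assumes a `function_call` payload shaped like the one in the `AIMessage` above (a `name` plus `arguments` as a JSON string). `dispatch_function_call`, the toy `fetch_sections` and `toy_store` are hypothetical names for illustration only.

```python
import json


def fetch_sections(reasoning, section_ids, store):
    """Toy lookup: return the stored sections whose ID was requested."""
    return [store[sid] for sid in section_ids if sid in store]


def dispatch_function_call(function_call, registry, **extra):
    """Decode a function_call payload and invoke the matching Python callable."""
    name = function_call["name"]
    if name not in registry:
        raise KeyError(f"model requested unknown function: {name}")
    # `arguments` arrives as a JSON string, not as a dict
    args = json.loads(function_call["arguments"])
    return registry[name](**args, **extra)


toy_store = {
    "6.2.1.1": {"id": "6.2.1.1", "title": "(a) Machinists"},
    "6.2.2": {"id": "6.2.2", "title": "Another section"},
}
toy_call = {
    "name": "fetch_sections",
    "arguments": '{"reasoning": "mentions machinists", "section_ids": ["6.2.1.1", "9.9"]}',
}
result = dispatch_function_call(toy_call, {"fetch_sections": fetch_sections}, store=toy_store)
print(result)
```

Requested IDs that do not exist in the store (here `"9.9"`) are skipped silently, which mirrors the `if section.id in section_ids` check in `FetchSectionsTool._run`.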


In [69]:
fetched_sections

[{'title': ' (a) Machinists',
  'id': '6.2.1.1',
  'text': 'One prominent task carried out by some – though not, as we shall elaborate below, all – mechanical engineering technicians who work in the space industry is machining.Typically, this involves the technician carrying out both manual and CNC milling, turning and drilling in order to produce a variety of components and sub-assemblies for satellites, including: the aluminium and composite panels that constitute a satellite’s basic structure; various optical components used for the thermal control of spacecraft; and parts for the scientific instruments, detectors, and optical devices that make up the satellite’s payload. Much of this work must be carried out to very fine tolerances, in some cases requiring the technicians to view the part being machined under a microscope. In the case of CNC devices, the operators will typically work from 3-D CAD drawings provided either by technicians occupying the role of manufacturing or product

### Map re-rank chain

In [62]:
# Map re-rank all sections into one answer and return the id of the sections
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import PydanticOutputFunctionsParser
from langchain.prompts import PromptTemplate
from langchain.pydantic_v1 import BaseModel, Field
from langchain.schema.prompt_template import format_document
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

# Chain to apply to each individual document. Chain
# provides an answer to the question based on the document
# and scores its confidence in the answer.
map_prompt = PromptTemplate.from_template(
    "You're a world class policy analyst that is analyzing an economic policy document. Your goal is to answer the question based on the given context. "
    "\n\n<< Context >>:\n\n{context}\n\n<<Question>>: {question}"
)


class AnswerAndScore(BaseModel):
    """Return the answer to the question and a relevance score."""

    answer: str = Field(
        description="The answer to the question, which is based ONLY on the provided context."
    )
    score: float = Field(
        description="A 0.0-1.0 relevance score, where 1.0 indicates the provided context answers the question completely and 0.0 indicates the provided context does not answer the question at all."
    )


function = convert_pydantic_to_openai_function(AnswerAndScore)
map_chain = (
    map_prompt
    | ChatOpenAI().bind(
        temperature=0, functions=[function], function_call={"name": "AnswerAndScore"}
    )
    | PydanticOutputFunctionsParser(pydantic_schema=AnswerAndScore)
).with_config(run_name="Map")

# Final chain, which after answer and scoring based on
# each doc return the answer with the highest score
def top_answer(scored_answers):
    return max(scored_answers, key=lambda x: x.score).answer


document_prompt = PromptTemplate.from_template("{page_content}")
map_rerank_chain = (
    (
        lambda x: [
            {
                "context": section_json_str,
                "question": x["question"],
            }
            for section_json_str in x["fetched_sections"]
        ]
    )
    | map_chain.map()
    | top_answer
).with_config(run_name="Map rerank")

In [63]:
# Runs the map-rerank chain
# map_rerank_chain.invoke({"fetched_sections": fetched_sections, "question": question})

'Section 6.2.1.1 talks about Machinists.'
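
The map and re-rank steps can be illustrated without an LLM: answer and score every section independently, then keep the highest-scoring answer. In this sketch `keyword_scorer` is a deliberately crude, hypothetical stand-in for the `map_chain` call, and the toy sections are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    answer: str
    score: float


def map_rerank(sections, question, answer_fn):
    """Score every section independently (map), then keep the best answer (re-rank)."""
    scored = [answer_fn(section, question) for section in sections]
    if not scored:
        raise ValueError("no sections to rank")
    # On ties, max() keeps the first section encountered
    return max(scored, key=lambda s: s.score)


def keyword_scorer(section, question):
    """Stand-in for the LLM call: fraction of question words found in the section text."""
    words = {w.strip("?.,").lower() for w in question.split()}
    text = section["text"].lower()
    hits = sum(1 for w in words if w in text)
    return ScoredAnswer(answer=f"Section {section['id']}", score=hits / len(words))


toy_sections = [
    {"id": "6.1", "text": "The size of the technician workforce"},
    {"id": "6.2.1.1", "text": "Machinists carry out milling and turning"},
]
best = map_rerank(toy_sections, "Which sections talk about Machinists?", keyword_scorer)
print(best)
```

Because each section is scored in isolation, this approach returns a single best section and cannot combine evidence spread across several sections.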

### Refine chain

Recursive summary with intermediate steps over all the answers while keeping the relevant sections
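
The control flow of a refine loop is a fold over the sections: the state (answer so far plus the IDs of the sections that contributed) is passed into each step and replaced by that step's output. A minimal sketch with the LLM call replaced by a hypothetical keyword-based step (`RefineState`, `refine_over_sections` and `make_keyword_step` are illustrative names, not LangChain APIs):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RefineState:
    intermediate_answer: str = ""
    section_ids: List[str] = field(default_factory=list)


def refine_over_sections(sections, question, refine_step):
    """Feed each section and the previous state into refine_step, carrying the result forward."""
    state = RefineState()
    for section in sections:
        state = refine_step(state, section, question)
    return state


def make_keyword_step(keyword):
    """Build a stand-in for the LLM call; it ignores the question and matches a fixed keyword."""
    def step(state, section, question):
        if keyword not in section["text"].lower():
            return state  # irrelevant section: carry the answer forward unchanged
        answer = (state.intermediate_answer + " " + section["text"]).strip()
        return RefineState(answer, state.section_ids + [section["id"]])
    return step


toy_sections = [
    {"id": "a", "text": "Machinists do CNC milling."},
    {"id": "b", "text": "Apprenticeships are growing."},
    {"id": "c", "text": "Machining needs fine tolerances."},
]
final_state = refine_over_sections(
    toy_sections, "Which sections talk about Machinists?", make_keyword_step("machin")
)
print(final_state)
```

Unlike map re-rank, the final answer here can draw on several sections, at the cost of one sequential call per section.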

In [92]:
class RefineIO(BaseModel):
    intermediate_answer: str = Field(description="your previous intermediate answer that might need to be refined with the additional context")
    section_ids: Sequence[str] = Field(description="the exact ID(s) of the sections that were used to generate the intermediate answer")

refine_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
You're a world class policy analyst that is analyzing an economic policy document by going section over section. Your goal is to answer the question based on the given section text and intermediate_answer.
            """
        ),
        (
            "human",
            """
            Here is the intermediate_answer you generated along with the section IDs that were used to generate it: \n{refine_io}
            Use the given format to refine your previous intermediate_answer with the following section: \n{section}
            Here is the question that you need to answer in intermediate_answer: {question}
            """
        ),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)

temp_prompt = PromptTemplate.from_template("This is just a test: {section}")

refine_runnable = create_structured_output_runnable(RefineIO, llm, refine_prompt)

# Run the refine runnable for each fetched section while feeding the previous answer as the intermediate answer
initial_refine_io = RefineIO(intermediate_answer="", section_ids=[])
for section in fetched_sections:
    refine_io = refine_runnable.invoke({
        "refine_io": initial_refine_io.json(),
        "section": section,
        "question": question
    })
    print(refine_io)
    initial_refine_io = refine_io

refine_result = initial_refine_io

ValidationError: 1 validation error for _OutputFormatter
output
  field required (type=value_error.missing)

---
#### Utils

In [240]:
for section, _ in section_summaries:
    if section.title == "5. RESULTS II:THE FUTURE TECHNICIAN WORKFORCE":
        for para in section.paragraphs:
            print(para.text)
            print()

Having discussed the origins of the case study organisations’ current technicians, we move on now to consider how the organisations in question propose to satisfy their future need for technicians.That is to say, we shall consider in this section the workforce planning strategies adopted by those space companies that employ technicians.This is an interesting and important issue, for a number of reasons.The first is that, as noted above, the increasingly difficulty of recruiting experienced technicians means that the approach most commonly adopted hitherto, namely recruitment-plus-upgrading, may not be as sustainable in the future as it was in the past. Second, many of the case study organisations, including seven of the 12 who are either currently training apprentices or are planning to do so – are growing, often very rapidly, and require increasing numbers of technicians. Of course, this reflects the more general point – made above – that the space industry is growing rapidly, both in