# Structured Chunking

from https://blog.langchain.dev/a-chunk-by-any-other-name/

In [1]:
from dotenv import load_dotenv

### Workaround for local package imports

To import from app.preprocessing package in the root of the repo we need to CD into the root.

Note: this code is only necessary in a Jupyter notebook. In a regular python script, the imports work fine.

In [2]:
import os
import sys

# This assumes your working directory is in the experiments/<name>/ dir
repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))

# Check if app/ dir exists in current working directory along with .env (to make this cell re-runnable)
if os.path.exists(os.path.join(os.getcwd(), "app")) and os.path.exists(os.path.join(os.getcwd(), ".env")):
    app_path = os.path.join(os.getcwd(), "app")
    sys.path.append(app_path)
    print("app/ dir and .env found in current working directory, keeping CWD as is.")
else:
    os.chdir(repo_root)
    print("app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.")

# Load .env
load_dotenv()

app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.


True

## Load structured document 

Steps to get `Document` object from a pdf path:
1. create `AdobeExtractAPIManager` instance
2. call `doc = extract_document` method with path to document
3. call `AdobeDocumentSplitter().document_to_chunks(doc)` 

In [3]:
from app.preprocessing.adobe.manager import AdobeExtractAPIManager

adobe_manager = AdobeExtractAPIManager(
    os.getenv("ADOBE_CLIENT_ID"),
    os.getenv("ADOBE_CLIENT_SECRET"),
    extract_dir_path="data/interim/000-adobe-extract/"
)

####
# Edit the path below to point to a PDF file on your local machine
####
document = adobe_manager.get_document(
    "/Users/dvdblk/Downloads/pdf_files_complete/UK_47.pdf"
)

2023-11-17 18:32:19.560 app.preprocessing.adobe.manager INFO     Initialized AdobeExtractAPIManager (with extract_dir_path=data/interim/000-adobe-extract/)


#### Visualize the sections of the document

In [76]:
def get_starting_page_nr(section):
    if pages := sorted(section.pages):
        return pages[0]
    else:
        return None

def print_hierarchy(section, level=0):
    print("\t" * level, section.section_type, section.title, get_starting_page_nr(section))
    for subsection in section.subsections:
        print_hierarchy(subsection, level + 1)

# Print document section title hierarchy
print_hierarchy(document)

 document None 1
	 H1 CONTENTS 2
	 H1 EXECUTIVE SUMMARY 4
	 H1 1. INTRODUCTION1 8
	 H1 2.THE UK SPACE INDUSTRY:AN OVERVIEW 10
		 H2 FIGURE 1:THE STRUCTURE OF THE SPACE SECTOR 11
	 H1 3. RESEARCH METHODOLOGY 15
	 H1 4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS 17
		 H2 4.1.THE SIZE OF THE TECHNICIAN WORKFORCE 17
		 H2 4.2 TYPES OF TECHNICIAN AND THE NATURE OF TECHNICAL SUPPORT 18
			 H3 4.2.1: Mechanical Engineering Technicians 19
				 H4 4.2.1 (a) Machinists 19
					 H5 4.2.1 (b) Mechanical Assembly Technician 20
						 H6 4.2.1 (c) Composites Laminators 21
						 H6 4.2.1 (d) Manufacturing or Production Engineers 21
						 H6 4.2.2 Electrical and Electronics Engineering Technicians 22
						 H6 4.2.2 (a) PCB (Electronics) Assembly and Inspection Technician 23
					 H5 4.2.2 (b) Electronics and Electrical Assembly Technicians 24
						 H6 4.2.3 Test Technicians and Engineers 26
						 H6 4.2.3 (a) Test technicians 26
					 H5 4.2.3 (b) Test eng

## Create chunks from sections

In [5]:
from app.preprocessing.splitter import AdobeDocumentSplitter

# Get chunks
docs = AdobeDocumentSplitter().document_to_chunks(document)

In [6]:
# Print avg chunk length
avg_chunk_len = sum([len(doc.page_content) for doc in docs]) / len(docs)
avg_chunk_len

3863.242424242424

## Structured chunking methods below

In [7]:
def print_result(response_obj):
    print("SOURCES: \n")
    cnt = 1
    for source_doc in response_obj["source_documents"]:
        print(f"Chunk #{cnt}")
        cnt += 1
        print("Source Metadata: ", source_doc.metadata)
        print("Source Text:")
        print(source_doc.page_content)
        print("\n")
    print("RESULT: \n")
    print(response_obj["result"] + "\n\n")

In [81]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA


llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

In [9]:
query = "Are there any mentions of skills or technologies?"
response = retriever({"query": query})
print_result(response)

SOURCES: 

Chunk #1
Source Metadata:  {'H1': '4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS'}
Source Text:
This section of the report outlines what the research carried out for this project reveals about issues such as: the size of the technician workforce; the types of roles that are typically undertaken by technicians in the space industry and the kinds of duties that are associated with those roles; the kind (level, and subject-matter) of qualifications those technicians typically possess; and how organisations in the space industry have until now gone about satisfying their need for technicians, in particular the balance they have struck between recruitment and (various forms of) training as a means of acquiring the technicians they need.


Chunk #2
Source Metadata:  {'H1': 'REFERENCES'}
Source Text:
ADS Group (2012). ‘Apprenticeships: Written Evidence Submitted by ADS Group Limited.’  House of Commons Select Committee on Business, Innovatio

# Summarizing each section

(structured chunking ends with the cell above)

In [100]:
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel, Field

"""
    Your answer will be exactly in the following format:
    ```json
    {{
        "summary": "<your summary here>"
    }}
    ```
"""

class SectionSummaryOutput(BaseModel):
    """Contains summary of a given section"""
    summary: str = Field(..., description="the summary of the section")

# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
prompt_template_structured = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
You're an expert policy analyst that is analyzing an economic policy document. Your goal is to summarize a given section text of a document in 15-20 sentences.

The section text will be given to you in the following json format:
```json
{{
    "section": {{
        "title": "<section title>",
        "text": "<section text to summarize>"
    }}
}}
```

Make sure to follow these rules while summarizing (as if your life depended on it):
1. absolutely make sure that you don't skip any mentions of technologies, skills, capabilities or investments related to any of these topics: Advanced Computing, Battery Technologies, Semiconductors, Clean Energy.
2. pay attention to the intention of the section especially with regards to sentiment towards adoption or promotion of any skills.
3. if the text mentions or discusses policy initiatives related to inclusion, health, digital, green resilience make sure to include them in the summary.
4. mention any discussion of funding, investments or budget allocations.
5. in the summary, make sure to mention whether there is a certain future need for any skills or technologies
6. mention any explicit skill needs that are mentioned in the text.
7. if the section is a table of contents or an index, just return "table of contents" as the summary
8. if the vast majority of the section is about references or sources, don't summarize it just return "references" as the summary.
            """,
        ),
        (
            "human",
            """
Here is the section json of a document to summarize:
```json
{{
    "section": {{
        "title": {section_title},
        "text": {section_text}
    }}
}}
```
            """,
        ),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)

prompt_template = PromptTemplate.from_template(
    """
    You're an expert policy analyst that is analyzing an economic policy document. Your goal is to summarize a given section text of a document in 15-20 sentences.

    The section text will be given to you in the following json format:
    ```json
    {{
        "section": {{
            "title": "<section title>",
            "text": "<section text to summarize>"
        }}
    }}
    ```

    Make sure to follow these rules while summarizing (as if your life depended on it):
    1. absolutely make sure that you don't skip any mentions of technologies, skills, capabilities or investments related to any of these topics: Advanced Computing, Battery Technologies, Semiconductors, Clean Energy.
    2. pay attention to the intention of the section especially with regards to sentiment towards adoption or promotion of any skills.
    3. if the text mentions or discusses policy initiatives related to inclusion, health, digital, green resilience make sure to include them in the summary.
    4. mention any discussion of funding, investments or budget allocations.
    5. in the summary, make sure to mention whether there is a certain future need for any skills or technologies
    6. mention any explicit skill needs that are mentioned in the text.
    7. if the section is a table of contents or an index, just return "table of contents" as the summary
    8. if the vast majority of the section is about references or sources, don't summarize it just return "references" as the summary.

    Here is the section json of a document to summarize:
    ```json
    {{
        "section": {{
            "title": {section_title},
            "text": {section_text}
        }}
    }}
    ```
    """
)


selected_section = document.subsections[0]
title, text = selected_section.title, "\n".join([p.text for p in selected_section.paragraphs])
prompt = prompt_template.format(section_title=title, section_text=text)

In [101]:
#llm.invoke(prompt).content
import json
from json.decoder import JSONDecodeError

def get_summary(llm_response_content):
    """Remove markdown and return summary json"""
    return json.loads(llm_response_content.replace("```json", "").replace("```", ""))

In [105]:
from langchain.chains.openai_functions import create_structured_output_runnable


def get_all_document_sections(document):
    sections = []

    if subsections := document.subsections:
        for subsection in subsections:
            sections.append(subsection)
            if s := get_all_document_sections(subsection):
                sections.extend(s)

    return sections


all_sections = get_all_document_sections(document)

runnable = create_structured_output_runnable(SectionSummaryOutput, llm, prompt_template_structured)

section_summaries = []

for section in all_sections:
    title, text = section.title, "\n".join([p.text for p in section.paragraphs])

    if len(text) > 0:
        prompt = prompt_template.format(section_title=title, section_text=text)
        #response = llm.invoke(prompt)

        # Response now has no content, it's the pydantic object instead
        response = runnable.invoke({"section_title": title, "section_text": text})
        pretty_resp = json.dumps(response.dict(), indent=2)

        section_summaries.append((section, response))

        print(section.section_type, section.title, get_starting_page_nr(section))
        print(pretty_resp)
        print("\n\n")
    else:
        section_summaries.append((section, SectionSummaryOutput(summary=None)))
        print("Not querying section with no text:", section.title)
        print("Saved summary as None")
        print("\n\n")


H1 CONTENTS 2
{
  "summary": "table of contents"
}



H1 EXECUTIVE SUMMARY 4
{
  "summary": "The section discusses the government's goal of creating a modern class of technicians to address skills shortages and the aging workforce in the UK economy. It investigates the role of technicians in the space industry, emphasizing the sector's significant contribution to the UK GDP and employment. The report aims to inform government policy by examining technician duties, required skills, and training in the space sector, as part of a wider research program into strategically important sectors. The data collected indicates that most technicians are employed by upstream manufacturers and possess HNC qualifications. The report highlights the shift towards in-house training and apprenticeships to meet the growing demand for technicians, especially in the context of an aging workforce. It also discusses the challenges faced by employers in establishing apprenticeship training programs and the impo

### PDFTriage style OpenAI functions

1. initial prompt question (context=document section hierarchy + summaries)


the following is an impl of openai functions with langchain (https://python.langchain.com/docs/modules/chains/how_to/openai_functions#getting-structured-outputs):

In [78]:
from typing import Optional

from langchain.chains.openai_functions import (
    create_openai_fn_chain,
    create_openai_fn_runnable,
    create_structured_output_chain,
    create_structured_output_runnable,
)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel, Field

In [82]:
class Person(BaseModel):
    """Identifying information about a person."""

    name: str = Field(..., description="The person's name")
    age: int = Field(..., description="The person's age")
    fav_food: Optional[str] = Field(None, description="The person's favorite food")

# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a world class algorithm for extracting information in structured formats.",
        ),
        (
            "human",
            "Use the given format to extract information from the following input: {input}",
        ),
        ("human", "Tip: Make sure to answer in the correct format"),
    ]
)

runnable = create_structured_output_runnable(Person, llm, prompt)
runnable.invoke({"input": "Sally is 13"})

Person(name='Sally', age=13, fav_food=None)

In [85]:
from typing import Sequence


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person] = Field(..., description="The people in the text")


runnable = create_structured_output_runnable(People, llm, prompt)
runnable.invoke(
    {
        "input": "Sally is 13, Joey just turned 12 and loves spinach. Caroline is 10 years older than Sally."
    }
)

People(people=[Person(name='Sally', age=13, fav_food=None), Person(name='Joey', age=12, fav_food='spinach'), Person(name='Caroline', age=23, fav_food=None)])

## Tools as OpenAI Functions

https://python.langchain.com/docs/modules/agents/tools/tools_as_openai_functions

In [199]:
from langchain.schema import HumanMessage
from langchain.tools import format_tool_to_openai_function, BaseTool
from typing import Type
from langchain.prompts import HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain.callbacks.manager import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)


# Fetch Section tool
class FetchSectionSchema(BaseModel):
    reasoning: str = Field(description="the reasoning behind the selection of a section to fetch")
    section_title: str = Field(description="the exact title of the section to fetch")

class FetchSectionTool(BaseTool):
    name = "fetch_section"
    description = "fetches an entire section from a document that might contain an answer to the question"
    args_schema: Type[FetchSectionSchema] = FetchSectionSchema

    def _run(
        self,
        reasoning: str,
        section_title: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> str:
        """Use the tool."""
        # get full section text from document
        for section, _ in section_summaries:
            if section.title == section_title:
                return "\n".join([p.text for p in section.paragraphs])

        return None


# Fetch pages tool
class FetchPagesSchema(BaseModel):
    page_numbers: Sequence[int] = Field(description="the page numbers to fetch")

class FetchPagesTool(BaseTool):
    name = "fetch_pages"
    description = "useful when you need to fetch specific pages from a document, for example to fetch multiple sections at a time"
    args_schema: Type[FetchPagesSchema] = FetchPagesSchema

# Create langchain tools
tools = [
    FetchSectionTool(),
    #FetchPagesTool()
]
# Transform to openai functions
openai_functions = [format_tool_to_openai_function(t) for t in tools]

"""

"""
# Structured metadata prompt (initial for most questions)
structured_metadata_system_prompt = SystemMessagePromptTemplate.from_template(
"""
You're an expert policy analyst needs to fetch the appropriate section of an economic policy document to answer the given question.
Your task is to fetch the appropriate section or pages of the document that might contain the answer. If you can't find the answer from the summaries, just fetch the most relevant section to the question.

<< Example structure of the document >>
```json
{{
    "document": {{
        "title": <title of the document>,
        "sections": [
            {{
                "title": <title of the section>,
                "pages": <list of page numbers this section spans over>,
                "summary": <brief summary of the section>,
                "sections": <list of nested sections in this section, same structure as above>
            }}
        ]
    }}
}}
```

Available functions: fetch_section
""")

structured_metadata_prompt = HumanMessagePromptTemplate.from_template("""
<< Question >>
{question}

<< Document >>
{document_structural_metadata}
""")
structured_metadata_prompt

HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['document_structural_metadata', 'question'], template='\n<< Question >>\n{question}\n                                                                      \n<< Document >>\n{document_structural_metadata}\n'))

In [200]:
def document_to_structured_metadata(section):
    """Convert document to structured metadata"""
    # Check if the document is the root node
    if section.section_type == "document":
        return {
            "document": {
                "title": document.title,
                "sections": [
                    document_to_structured_metadata(section) for section in document.subsections
                ]
            }
        }
    else:
        # find section from section summaries
        section, summary_response = next(filter(lambda x: x[0] == section, section_summaries))
        result = {
            "title": section.title,
            "pages": sorted(section.pages),
            "summary": summary_response.summary
        }
        if subsections := [document_to_structured_metadata(subsection) for subsection in section.subsections]:
            result["sections"] = subsections

        return result


print(json.dumps(document_to_structured_metadata(document), indent=2))


{
  "document": {
    "title": null,
    "sections": [
      {
        "title": "CONTENTS",
        "pages": [
          2
        ],
        "summary": "table of contents"
      },
      {
        "title": "EXECUTIVE SUMMARY",
        "pages": [
          4,
          5,
          6
        ],
        "summary": "The section discusses the government's goal of creating a modern class of technicians to address skills shortages and the aging workforce in the UK economy. It investigates the role of technicians in the space industry, emphasizing the sector's significant contribution to the UK GDP and employment. The report aims to inform government policy by examining technician duties, required skills, and training in the space sector, as part of a wider research program into strategically important sectors. The data collected indicates that most technicians are employed by upstream manufacturers and possess HNC qualifications. The report highlights the shift towards in-house training and

In [204]:
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser

llm_with_fns = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15)
llm_with_fns.bind(functions=openai_functions)

# message = llm.predict_messages(
#     [
#         structured_metadata_system_prompt.format(),
#         structured_metadata_prompt.format(
#             document_structural_metadata=document_to_structured_metadata(document),
#             question="Does the document discuss the funding of the programme to support the development of new technicians?"
#         )
#     ],
#     functions=openai_functions,
# )

fns_prompt = ChatPromptTemplate.from_messages(
    [
        #structured_metadata_system_prompt,
        structured_metadata_prompt,
    ]
)

fns_agent = (
    fns_prompt
    | llm_with_fns
    | OpenAIFunctionsAgentOutputParser()
)

# from langchain.agents import AgentExecutor

# agent_executor = AgentExecutor(agent=fns_agent, tools=tools, verbose=True)
# agent_executor.invoke()

fns_agent.invoke({
    "document_structural_metadata": document_to_structured_metadata(document),
    "question": "Does the document discuss funding?"
})

AgentFinish(return_values={'output': 'Based on the provided document, the section that might contain information about funding is not explicitly mentioned in the summaries. However, the section that discusses "APPRENTICESHIP" under "RESULTS II: THE FUTURE TECHNICIAN WORKFORCE" might contain information related to funding, as it discusses the involvement of manufacturers of space hardware in apprenticeship training due to internal and external pressures, and the distinction between \'apprenticeships\' and \'Apprenticeship\' in terms of public funding and state regulation. Therefore, the section "5.2 APPRENTICESHIP" under "RESULTS II: THE FUTURE TECHNICIAN WORKFORCE" is the most relevant section to explore for information related to funding.'}, log='Based on the provided document, the section that might contain information about funding is not explicitly mentioned in the summaries. However, the section that discusses "APPRENTICESHIP" under "RESULTS II: THE FUTURE TECHNICIAN WORKFORCE" mi

In [176]:
for section, _ in section_summaries:
    if section.title == "5.2.4 Impediments to the use apprenticeships":
        for para in section.paragraphs:
            print(para.text)
            print()

Neither those organisations that are currently take apprentices, nor the ones thinking of doing so, have found in the process of running on apprenticeship to be entirely straightforward. Finding an FE college to provide the off-the-job training, and in many cases also the assessment for the on-the-job element of the apprenticeship, has been a particular source of difficulty, both for those organisations that currently take apprentices and also for those intending to do so. More specifically, four of the eight space sector organisations that currently take apprentices highlighted problems in their relations with FE colleges, including apprentices being put on the wrong technical certificates, inadequate support and feedback being provided by external NVQ assessors, and concerns about the quality of the practical, hand-skills training being provided in college workshops for apprentices spending their first year on block release. While, as noted above, all the organisations taking apprent