# Structured Chunking

Based on https://blog.langchain.dev/a-chunk-by-any-other-name/

In [1]:
from dotenv import load_dotenv

### Workaround for local package imports

To import from the `app.preprocessing` package at the root of the repo, the working directory has to be changed to the repo root.

Note: this workaround is only needed in a Jupyter notebook. In a regular Python script, the imports work without it.

In [2]:
import os
import sys

# This assumes your working directory is in the experiments/<name>/ dir
repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))

# Check if app/ dir exists in current working directory along with .env (to make this cell re-runnable)
if os.path.exists(os.path.join(os.getcwd(), "app")) and os.path.exists(os.path.join(os.getcwd(), ".env")):
    app_path = os.path.join(os.getcwd(), "app")
    sys.path.append(app_path)
    print("app/ dir and .env found in current working directory, keeping CWD as is.")
else:
    os.chdir(repo_root)
    print("app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.")

# Load .env
load_dotenv()

app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.


True
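The cell above assumes the notebook lives exactly two levels below the repo root. A minimal sketch of a less position-dependent alternative is to walk up the directory tree until both markers are found. The helper name `find_repo_root` and its `markers` parameter are hypothetical and not part of the repo.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, markers=("app", ".env")) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains every marker."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if all((candidate / marker).exists() for marker in markers):
            return candidate
    return None


# Possible usage in the notebook (assumption, not run here):
#   root = find_repo_root(Path.cwd())
#   if root is not None:
#       os.chdir(root)
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing `.env` should stop the notebook.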

## Load structured document 

Steps to go from a PDF path to a `Document` object and its chunks:
1. create an `AdobeExtractAPIManager` instance
2. call its `get_document` method with the path to the document
3. call `AdobeDocumentSplitter().document_to_chunks(document)`

In [3]:
from app.preprocessing.adobe.manager import AdobeExtractAPIManager

adobe_manager = AdobeExtractAPIManager(
    os.getenv("ADOBE_CLIENT_ID"),
    os.getenv("ADOBE_CLIENT_SECRET"),
    extract_dir_path="data/interim/000-adobe-extract/"
)

####
# Edit the path below to point to a PDF file on your local machine
####
document = adobe_manager.get_document(
    "/Users/dvdblk/Downloads/pdf_files_complete/UK_47.pdf"
)

2023-11-17 18:32:19.560 app.preprocessing.adobe.manager INFO     Initialized AdobeExtractAPIManager (with extract_dir_path=data/interim/000-adobe-extract/)


#### Visualize the sections of the document

In [14]:
def get_starting_page_nr(section):
    if pages := sorted(section.pages):
        return pages[0]
    else:
        return None

# Print document section title hierarchy
for section in document.subsections:
    print(section.section_type, section.title, get_starting_page_nr(section))
    for subsection in section.subsections:
        print("\t", subsection.section_type, subsection.title, get_starting_page_nr(subsection))
        for subsubsection in subsection.subsections:
            print("\t\t", subsubsection.section_type, subsubsection.title, get_starting_page_nr(subsubsection))

H1 CONTENTS 2
H1 EXECUTIVE SUMMARY 4
H1 1. INTRODUCTION1 8
H1 2.THE UK SPACE INDUSTRY:AN OVERVIEW 10
	 H2 FIGURE 1:THE STRUCTURE OF THE SPACE SECTOR 11
H1 3. RESEARCH METHODOLOGY 15
H1 4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS 17
	 H2 4.1.THE SIZE OF THE TECHNICIAN WORKFORCE 17
	 H2 4.2 TYPES OF TECHNICIAN AND THE NATURE OF TECHNICAL SUPPORT 18
		 H3 4.2.1: Mechanical Engineering Technicians 19
	 H2 4.3 QUALIFICATIONS 27
	 H2 4.4 DIVISION OF KNOWLEDGE AND EXPERTISE BETWEEN TECHNICIANS AND GRADUATES 27
	 H2 4.5 SOURCES OF TECHNICIANS 28
H1 5. RESULTS II:THE FUTURE TECHNICIAN WORKFORCE 32
	 H2 5.1 RECRUITMENT 32
	 H2 5.2 APPRENTICESHIP None
		 H3 5.2.1 Definition and involvement 33
		 H3 5.2.2 Rationale 34
		 H3 5.2.3 Organisation 35
		 H3 5.2.4 Impediments to the use apprenticeships 38
	 H2 5.3 CAREERS: ONGOING TRAINING AND PROFESSIONAL DEVELOPMENT 39
H1 6. CONCLUSIONS 41
H1 REFERENCES 43
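The nested loops above stop at three heading levels. A recursive walker handles any depth; the sketch below assumes only the attributes already used in that cell (`section_type`, `title`, `pages`, `subsections`). `DemoSection` is a hypothetical stand-in so the example runs without the Adobe extraction output.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DemoSection:
    """Hypothetical stand-in exposing the attributes used in the cell above."""
    section_type: str
    title: str
    pages: Set[int] = field(default_factory=set)
    subsections: List["DemoSection"] = field(default_factory=list)


def outline_lines(section, depth=0):
    """Return one line per section, tab-indented by nesting depth."""
    first_page = min(section.pages) if section.pages else None
    lines = ["\t" * depth + f"{section.section_type} {section.title} {first_page}"]
    for child in section.subsections:
        lines.extend(outline_lines(child, depth + 1))
    return lines


demo = DemoSection("H1", "Top", {1, 2}, [
    DemoSection("H2", "Middle", set(), [
        DemoSection("H3", "Leaf", {2}),
    ]),
])
print("\n".join(outline_lines(demo)))
```

For the real document, the same function could be called once per entry in `document.subsections`.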


## Create chunks from sections

In [5]:
from app.preprocessing.splitter import AdobeDocumentSplitter

# Get chunks
docs = AdobeDocumentSplitter().document_to_chunks(document)

In [6]:
# Print avg chunk length
avg_chunk_len = sum([len(doc.page_content) for doc in docs]) / len(docs)
avg_chunk_len

3863.242424242424
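A mean on its own can hide a skewed distribution, for example a few very long chunks next to many short ones. The sketch below reports a few more summary statistics using only the standard library. `chunk_length_stats` is a hypothetical helper, and the sample strings are made up for illustration.

```python
import statistics


def chunk_length_stats(texts):
    """Summarise the character lengths of a list of chunk texts."""
    lengths = [len(text) for text in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
        "max": max(lengths),
    }


# Made-up sample; in the notebook one could pass
# [doc.page_content for doc in docs] instead.
sample = ["a" * 10, "b" * 20, "c" * 300]
print(chunk_length_stats(sample))
```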

## Structured chunking methods below

In [7]:
def print_result(response_obj):
    """Print the retrieved source chunks followed by the chain's answer."""
    print("SOURCES: \n")
    for cnt, source_doc in enumerate(response_obj["source_documents"], start=1):
        print(f"Chunk #{cnt}")
        print("Source Metadata: ", source_doc.metadata)
        print("Source Text:")
        print(source_doc.page_content)
        print("\n")
    print("RESULT: \n")
    print(response_obj["result"] + "\n\n")

In [8]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA


llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

In [9]:
query = "Are there any mentions of skills or technologies?"
response = retriever({"query": query})
print_result(response)

SOURCES: 

Chunk #1
Source Metadata:  {'H1': '4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS'}
Source Text:
This section of the report outlines what the research carried out for this project reveals about issues such as: the size of the technician workforce; the types of roles that are typically undertaken by technicians in the space industry and the kinds of duties that are associated with those roles; the kind (level, and subject-matter) of qualifications those technicians typically possess; and how organisations in the space industry have until now gone about satisfying their need for technicians, in particular the balance they have struck between recruitment and (various forms of) training as a means of acquiring the technicians they need.


Chunk #2
Source Metadata:  {'H1': 'REFERENCES'}
Source Text:
ADS Group (2012). ‘Apprenticeships: Written Evidence Submitted by ADS Group Limited.’  House of Commons Select Committee on Business, Innovatio
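The second retrieved chunk comes from the REFERENCES section, which is unlikely to help answer a question about skills. One possible mitigation, sketched under the assumption that every chunk carries its top-level heading under the `'H1'` metadata key as in the output above, is to drop such chunks before indexing. `drop_sections` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real chunk objects.

```python
from types import SimpleNamespace


def drop_sections(chunks, excluded_h1=("CONTENTS", "REFERENCES")):
    """Keep only chunks whose 'H1' metadata value is not in `excluded_h1`."""
    excluded = {title.upper() for title in excluded_h1}
    return [
        chunk for chunk in chunks
        if str(chunk.metadata.get("H1", "")).strip().upper() not in excluded
    ]


# Stand-in chunks with the same attributes as the ones printed above.
demo_chunks = [
    SimpleNamespace(page_content="body text", metadata={"H1": "3. RESEARCH METHODOLOGY"}),
    SimpleNamespace(page_content="citations", metadata={"H1": "REFERENCES"}),
    SimpleNamespace(page_content="no heading", metadata={}),
]
kept = drop_sections(demo_chunks)
print([chunk.metadata.get("H1") for chunk in kept])
```

In the notebook, the filtered list could then be passed to `FAISS.from_documents` in place of `docs`.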

In [48]:
from langchain.prompts import PromptTemplate


prompt_template = PromptTemplate.from_template(
    """
    You're an expert policy analyst that is analyzing an economic policy document. Your goal is to summarize a given section text of a document in 15-20 sentences.

    The section text will be given to you in the following json format:
    ```json
    {{
        "section": {{
            "title": "<section title>",
            "text": "<section text to summarize>"
        }}
    }}
    ```

    Make sure to follow these rules while summarizing (as if your life depended on it):
    1. absolutely make sure that you don't skip any mentions of technologies, skills, capabilities or investments related to any of these topics: Advanced Computing, Battery Technologies, Semiconductors, Clean Energy.
    2. pay attention to the intention of the section especially with regards to sentiment towards adoption or promotion of any skills.
    3. if the text mentions or discusses policy initiatives related to inclusion, health, digital, green resilience make sure to include them in the summary.
    4. mention any discussion of funding, investments or budget allocations.
    5. in the summary, make sure to mention whether there is a certain future need for any skills or technologies
    6. mention any explicit skill needs that are mentioned in the text.
    7. if the section is a table of contents or an index, only mention the keywords related to the rules above in the summary or "table of contents" as the summary
    8. if the vast majority of the section is about references or sources, don't summarize it just return "references" as the summary.

    Your answer will be exactly in the following format:
    ```json
    {{
        "summary": "<your summary here>"
    }}
    ```

    Here is the section json of a document to summarize:
    ```json
    {{
        "section": {{
            "title": "{section_title}",
            "text": "{section_text}"
        }}
    }}
    ```
    """
)


selected_section = document.subsections[0]
title, text = selected_section.title, "\n".join([p.text for p in selected_section.paragraphs])
prompt = prompt_template.format(section_title=title, section_text=text)
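The template interpolates the raw title and text, so a quotation mark or line break inside a section would make the embedded block invalid JSON. A minimal sketch of one way around this is to serialise the section with `json.dumps` first and substitute the result as a whole. `build_section_payload` is a hypothetical helper, and the template would need a single placeholder (for example `{section_json}`, also an assumption) instead of the two used above.

```python
import json


def build_section_payload(title, text):
    """Serialise a section so quotes and newlines are escaped as valid JSON."""
    return json.dumps({"section": {"title": title, "text": text}}, indent=4)


# Made-up section containing characters that would break naive interpolation.
payload = build_section_payload('A "quoted" title', "First line.\nSecond line.")
print(payload)
```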

In [69]:
import json

def get_summary(llm_response_content):
    """Strip the Markdown code fences and parse the remaining JSON."""
    return json.loads(llm_response_content.replace("```json", "").replace("```", ""))

get_summary(llm.invoke(prompt).content)

{'summary': 'table of contents'}
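`get_summary` raises if the model adds text around the fence or returns something that is not JSON. A slightly more defensive variant, given here as a sketch rather than the notebook's method, extracts the first fenced block when one exists and returns `None` on a parse failure. `parse_summary` and `FENCE_RE` are hypothetical names.

```python
import json
import re

# Matches the body of the first Markdown fence, with or without a "json" tag.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def parse_summary(llm_response_content):
    """Parse the JSON object from a response, with or without a Markdown fence."""
    match = FENCE_RE.search(llm_response_content)
    candidate = match.group(1) if match else llm_response_content
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Signal a malformed response instead of aborting a long loop.
        return None
```

Returning `None` lets a loop over many sections skip or retry a malformed response without stopping.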

In [70]:

def get_all_document_sections(document):
    sections = []

    if subsections := document.subsections:
        for subsection in subsections:
            sections.append(subsection)
            if s := get_all_document_sections(subsection):
                sections.extend(s)

    return sections


all_sections = get_all_document_sections(document)
# for sec in all_sections:
#     print(sec.title)

for sec in all_sections:
    title, text = sec.title, "\n".join([p.text for p in sec.paragraphs])
    prompt = prompt_template.format(section_title=title, section_text=text)
    pretty_resp = json.dumps(get_summary(llm.invoke(prompt).content), indent=2)
    print(sec.section_type, sec.title)
    print(pretty_resp)
    print("\n\n")
