# Structured Chunking

Source: https://blog.langchain.dev/a-chunk-by-any-other-name/

In [1]:
from dotenv import load_dotenv

### Workaround for local package imports

To import from the `app.preprocessing` package at the root of the repo, the working directory must be the repo root, so the next cell changes into it when needed.

In [2]:
import os
import sys

# This assumes your working directory is in the experiments/<name>/ dir
repo_root = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))

# Check if app/ dir exists in current working directory along with .env (to make this cell re-runnable)
if os.path.exists(os.path.join(os.getcwd(), "app")) and os.path.exists(os.path.join(os.getcwd(), ".env")):
    app_path = os.path.join(os.getcwd(), "app")
    sys.path.append(app_path)
    print("app/ dir and .env found in current working directory, keeping CWD as is.")
else:
    os.chdir(repo_root)
    print("app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.")

# Load .env
load_dotenv()

app/ dir and/or .env not found in current working directory, changing CWD to assumed repo root.


True

## Load structured document

In [3]:
from app.preprocessing.adobe.manager import AdobeExtractAPIManager

adobe_manager = AdobeExtractAPIManager(
    os.getenv("ADOBE_CLIENT_ID"),
    os.getenv("ADOBE_CLIENT_SECRET"),
    extract_dir_path="data/interim/000-adobe-extract/"
)

document = adobe_manager.get_document("/Users/dvdblk/Downloads/pdf_files_complete/UK_47.pdf")

2023-11-16 09:43:53.469 app.preprocessing.adobe.manager INFO     Initialized AdobeExtractAPIManager (with extract_dir_path=data/interim/000-adobe-extract/)


In [4]:
# Print document section title hierarchy
for section in document.subsections:
    print(section.section_type, section.title)
    for subsection in section.subsections:
        print("\t", subsection.section_type, subsection.title)
        for subsubsection in subsection.subsections:
            print("\t\t", subsubsection.section_type, subsubsection.title)

H1 CONTENTS
H1 EXECUTIVE SUMMARY
H1 1. INTRODUCTION1
H1 2.THE UK SPACE INDUSTRY:AN OVERVIEW
	 H2 FIGURE 1:THE STRUCTURE OF THE SPACE SECTOR
H1 3. RESEARCH METHODOLOGY
H1 4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS
	 H2 4.1.THE SIZE OF THE TECHNICIAN WORKFORCE
	 H2 4.2 TYPES OF TECHNICIAN AND THE NATURE OF TECHNICAL SUPPORT
		 H3 4.2.1: Mechanical Engineering Technicians
	 H2 4.3 QUALIFICATIONS
	 H2 4.4 DIVISION OF KNOWLEDGE AND EXPERTISE BETWEEN TECHNICIANS AND GRADUATES
	 H2 4.5 SOURCES OF TECHNICIANS
H1 5. RESULTS II:THE FUTURE TECHNICIAN WORKFORCE
	 H2 5.1 RECRUITMENT
	 H2 5.2 APPRENTICESHIP
		 H3 5.2.1 Definition and involvement
		 H3 5.2.2 Rationale
		 H3 5.2.3 Organisation
		 H3 5.2.4 Impediments to the use apprenticeships
	 H2 5.3 CAREERS: ONGOING TRAINING AND PROFESSIONAL DEVELOPMENT
H1 6. CONCLUSIONS
H1 REFERENCES
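
The nested loops above stop at three heading levels. A recursive walk handles any depth; the following is only a sketch, and the `Section` dataclass is a hypothetical stand-in that mirrors the three attributes used above (`section_type`, `title`, `subsections`), not the actual class from `app.preprocessing`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Hypothetical stand-in exposing the three attributes used above."""
    section_type: str
    title: str
    subsections: List["Section"] = field(default_factory=list)


def format_hierarchy(sections, depth=0):
    """Return one tab-indented line per section, recursing into subsections."""
    lines = []
    for section in sections:
        lines.append("\t" * depth + f"{section.section_type} {section.title}")
        lines.extend(format_hierarchy(section.subsections, depth + 1))
    return lines


# Small demo tree built from titles that appear in the output above
demo = [
    Section("H1", "5. RESULTS II:THE FUTURE TECHNICIAN WORKFORCE", [
        Section("H2", "5.2 APPRENTICESHIP", [
            Section("H3", "5.2.2 Rationale"),
        ]),
    ]),
]
print("\n".join(format_hierarchy(demo)))
```

If the real document object exposes the same attributes, `format_hierarchy(document.subsections)` should produce a listing like the one above, including any H4 or deeper headings.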


## Create chunks from sections

In [5]:
from app.preprocessing.splitter import AdobeDocumentSplitter

# Get chunks
docs = AdobeDocumentSplitter().document_to_chunks(document)

avg_chunk_len = sum([len(doc.page_content) for doc in docs]) / len(docs)
avg_chunk_len

3863.242424242424
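
The mean alone does not show how much chunk sizes vary, which matters when chunks are stuffed into a single prompt. A minimal sketch of a fuller summary follows; `chunk_length_stats` is a hypothetical helper and the sample strings are toy data, not chunks from this document.

```python
from statistics import mean, median


def chunk_length_stats(texts):
    """Summarise the character lengths of a list of chunk texts."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Toy strings standing in for [doc.page_content for doc in docs]
sample_texts = ["a" * 100, "b" * 400, "c" * 4000]
stats = chunk_length_stats(sample_texts)
print(stats)
```

In the notebook this could be called as `chunk_length_stats([doc.page_content for doc in docs])`.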

In [7]:
def print_result(response_obj):
    print("SOURCES: \n")
    cnt = 1
    for source_doc in response_obj["source_documents"]:
        print(f"Chunk #{cnt}")
        cnt += 1
        print("Source Metadata: ", source_doc.metadata)
        print("Source Text:")
        print(source_doc.page_content)
        print("\n")
    print("RESULT: \n")
    print(response_obj["result"] + "\n\n")

In [14]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.chains.query_constructor.base import (
    AttributeInfo,
    get_query_constructor_prompt,
    load_query_constructor_runnable,
)

llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, request_timeout=15)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

In [9]:
query = "Are there any mentions of skills or technologies?"
response = retriever({"query": query})
print_result(response)

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=10.0).


SOURCES: 

Chunk #1
Source Metadata:  {'H1': '4. RESULTS I:THE CURRENT TECHNICIAN WORKFORCE: SIZE, ROLES, QUALIFICATIONS,AND ORIGINS'}
Source Text:
This section of the report outlines what the research carried out for this project reveals about issues such as: the size of the technician workforce; the types of roles that are typically undertaken by technicians in the space industry and the kinds of duties that are associated with those roles; the kind (level, and subject-matter) of qualifications those technicians typically possess; and how organisations in the space industry have until now gone about satisfying their need for technicians, in particular the balance they have struck between recruitment and (various forms of) training as a means of acquiring the technicians they need.


Chunk #2
Source Metadata:  {'H1': 'REFERENCES'}
Source Text:
ADS Group (2012). ‘Apprenticeships: Written Evidence Submitted by ADS Group Limited.’  House of Commons Select Committee on Business, Innovatio

# Self-Querying

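Note: this does not work yet because the few-shot examples and field descriptions below still refer to philosophy articles and have not been adapted to policy documents.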

In [16]:
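from typing import Dict, List

# Alias used in the type hints below (harmless if already imported in an earlier cell)
from langchain.schema import Document as LangchainDocument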

headers_to_split_on = [
    ("H1", "H1_section"),
    ("H2", "H2_subsection"),
    ("H3", "H3_subsection"),
    ("H4", "H4_subsection"),
]

def build_index_schema(documents: List[LangchainDocument]) -> Dict:
    schema = {"text": []}
    for doc in documents:
        for key in doc.metadata:
            name_dict = {"name": f"{key}"}
            if name_dict not in schema["text"]:
                schema["text"].append(name_dict)
    return schema

index_schema = build_index_schema(docs)
print(index_schema)

document_content_description = "an economic policy document discussing various skills and technologies in one country"

def generate_metadata_desc(field_name: str, header_info: List[tuple], qa_retriever: RetrievalQA) -> str:
    query = f"""
    given a list of tuples indicating all possible metadata fields for this document
    provide a very brief description (15 words or less) of the specified field
    including its position in the hierarchy of headers (h1, h2, h3, etc)

    all fields {header_info}
    specified field: {field_name}
    """

    return qa_retriever({"query": query})["result"]

def build_metadata_field_info(schema: Dict, header_info: List[tuple], qa_retriever: RetrievalQA) -> List[AttributeInfo]:
    filter_instructions = \
        "ALWAYS filter with one or more CONTAINS comparators, and use the OR operator to check ALL other fields."\
        "if the value of this field contains a word or phrase that is very similar to a word or phrase in the query, " \
        "filter for the exact string from the value rather than the query."
    filter_h1 = \
        "this field contains the name of a philosopher. "\
        "if the query contains a misspelling or an alternate spelling of this name, "\
        "filter for the value of this field rather than the name in the query."\
        "the H1-level filter should ALWAYS be combined with subsection filters using an AND operator. \n"
    filter_exclusion = \
        " NEVER filter this field on the value of 'H1_section'."
    attr_info_list = []
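    for field in schema["text"]:
        desc = generate_metadata_desc(field["name"], header_info, qa_retriever) + filter_instructions
        # NOTE: the schema printed below uses the keys 'H1'..'H6', so this
        # "H1_section" check never matches and every field takes the else branch.
        if field["name"] == "H1_section":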
            new_attr = AttributeInfo(
                name=field["name"],
                description = filter_h1 + desc,
                type="string"
            )
        else:
            new_attr = AttributeInfo(
                name=field["name"],
                description = desc + filter_exclusion,
                type="string"
            )
        attr_info_list.append(new_attr)
    return attr_info_list

metadata_field_info = build_metadata_field_info(index_schema, [headers_to_split_on], retriever)

{'text': [{'name': 'H1'}, {'name': 'H2'}, {'name': 'H3'}, {'name': 'H4'}, {'name': 'H5'}, {'name': 'H6'}]}


Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=15.0).


In [17]:
metadata_field_info

[AttributeInfo(name='H1', description="I don't know the answer to this question.ALWAYS filter with one or more CONTAINS comparators, and use the OR operator to check ALL other fields.if the value of this field contains a word or phrase that is very similar to a word or phrase in the query, filter for the exact string from the value rather than the query. NEVER filter this field on the value of 'H1_section'.", type='string'),
 AttributeInfo(name='H2', description="I don't have that information.ALWAYS filter with one or more CONTAINS comparators, and use the OR operator to check ALL other fields.if the value of this field contains a word or phrase that is very similar to a word or phrase in the query, filter for the exact string from the value rather than the query. NEVER filter this field on the value of 'H1_section'.", type='string'),
 AttributeInfo(name='H3', description="I don't have that information.ALWAYS filter with one or more CONTAINS comparators, and use the OR operator to chec
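
In the output above, the QA chain answers "I don't know" for each field, and the instruction text is appended without a separating space. One possible alternative, sketched here under the assumption that static descriptions are acceptable, is to derive each description from the heading level instead of asking the LLM. `describe_header_field` and `build_static_field_info` are hypothetical helpers, not part of LangChain.

```python
def describe_header_field(name):
    """Build a static description for a heading-level metadata key such as 'H2'."""
    level = int(name[1:])  # assumes keys look like 'H1' ... 'H6'
    if level == 1:
        position = "top-level section heading"
    else:
        position = f"level-{level} heading nested under an H{level - 1} heading"
    return f"Title of the {position} that the chunk belongs to."


def build_static_field_info(schema, extra_instructions=""):
    """Return name/description/type dicts, one per metadata key in the schema."""
    info = []
    for field in schema["text"]:
        parts = [describe_header_field(field["name"]), extra_instructions.strip()]
        info.append({
            "name": field["name"],
            # Join with a space so sentences do not run together
            "description": " ".join(p for p in parts if p),
            "type": "string",
        })
    return info


schema = {"text": [{"name": "H1"}, {"name": "H2"}, {"name": "H3"}]}
for entry in build_static_field_info(schema, "Filter with CONTAINS comparators."):
    print(entry)
```

Each dict carries the same three fields that the notebook passes to `AttributeInfo` (`name`, `description`, `type`), so it could be expanded with `AttributeInfo(**entry)`.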

In [18]:
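# example queries contain misspellings / imprecise phrases
# example filters use the spelling / phrasing from the metadata values rather than the query
# NOTE: these examples are about philosophy articles and use attribute names
# (article_h1_main, article_h2_subsection, ...) that do not match this
# notebook's metadata keys (H1..H6); they still need adapting to policy documents.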
examples = [
    (
        "what does russell say about descriptors?",
        {"query": "russell, descriptors",
         "filter": 'and(contain("article_h1_main", "Bertrand Russell"), or(contain("article_h2_subsection", "descriptions"), '
                   'contain("article_h3_subsection", "descriptions"), contain("article_h4_subsection", "descriptions")))'}
    ),
    (
        "explain leibniz's idea of sufficient reason.",
        {"query": "leibniz, idea of sufficient reason",
         "filter": 'and(contain("article_h1_main", "Gottfried Wilhelm Leibniz"), or(contain("article_h2_subsection", "Principle of Sufficient Reason"), '
                   'contain("article_h3_subsection", "Principle of Sufficient Reason"), contain("article_h4_subsection", "Principle of Sufficient Reason")))'}
    ),
    (
        "what was goodel's continuum theory?",
        {"query": "goodel, continuum theory",
         "filter": 'and(contain("article_h1_main", "Kurt Gödel"), or(contain("article_h2_subsection", "Continuum Hypothesis"), '
                   'contain("article_h3_subsection", "Continuum Hypothesis"), contain("article_h4_subsection", "Continuum Hypothesis")))'}
    ),
]
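
The examples above target philosophy articles. As a rough sketch of what adapted examples might look like, the block below builds filters against the `H2` and `H3` keys using section titles from the hierarchy printed earlier. `contain_any`, `SUBSECTION_FIELDS`, and `policy_examples` are hypothetical names, the queries are invented for illustration, and whether the target vector store supports the `contain` comparator has not been checked here.

```python
def contain_any(fields, phrase):
    """Build a filter string checking `phrase` against each metadata field."""
    clauses = [f'contain("{name}", "{phrase}")' for name in fields]
    # A single clause needs no or(...) wrapper
    return clauses[0] if len(clauses) == 1 else "or(" + ", ".join(clauses) + ")"


# Assumption: subsection titles live under these metadata keys
SUBSECTION_FIELDS = ["H2", "H3"]

# Queries contain deliberate misspellings; filters use the titles' own wording
policy_examples = [
    (
        "what qualifcations do technicians hold?",
        {"query": "technician qualifications",
         "filter": contain_any(SUBSECTION_FIELDS, "QUALIFICATIONS")},
    ),
    (
        "why do firms use aprenticeships?",
        {"query": "reasons for using apprenticeships",
         "filter": contain_any(SUBSECTION_FIELDS, "Rationale")},
    ),
]

for question, structured in policy_examples:
    print(question, "->", structured["filter"])
```

The resulting list has the same `(query, {"query": ..., "filter": ...})` shape as `examples`, so it could be passed in its place to `get_query_constructor_prompt`.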

In [19]:
prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
    examples=examples
)
print(prompt.format(query="{query}"))

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not

In [20]:
chain = load_query_constructor_runnable(
    llm=llm,
    attribute_info=metadata_field_info,
    document_contents=document_content_description,
    examples=examples,
    fix_invalid=True
)
query = "explain goedel's 1st incompleteness theory"
chain.invoke(({"query": query}))

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=15.0).


StructuredQuery(query='goedel, 1st incompleteness theory', filter=None, limit=None)

# PDFTriage