## Exploring various PDF chunking strategies

In [4]:
from unstructured.partition.pdf import partition_pdf

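elements = partition_pdf(
    filename="raw/ap_history_guide.pdf",
    strategy="fast",  # extract the embedded text layer; no layout model or OCR
    languages=["eng"],
    # Per the unstructured docs, table structure is only inferred with
    # strategy="hi_res", so this flag likely has no effect under "fast".
    infer_table_structure=True,
    include_metadata=True,
)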

In [30]:
for element in elements[:20]:
    print("----element----")
    print(element.text)


----element----
I. CHAPTER OVERVIEW There’s a lot of stuf in this chapter—nine thousand years’ worth. So before you begin, read through the outline below so you’ll know where to find what you’re looking for when you return to this chapter for a mini-review. (Remember, the key to doing well is to go through the chapter once, delve into the areas you are clueless or semi-clueless about, then return here for a mini-review.) I. Chapter Overview You’re reading it!
----element----
II. Stay Focused on the Big Picture
----element----
Organize the zillions of facts from the 9,000 years covered in this chapter into some big-picture concepts.
----element----
III. History Review through 600 C.E.
----element----
This is the bulk of the chapter, where we plow through the major civilizations, people, and events. Again, we suggest that if you’re totally clueless on a section, review the corresponding section in your textbook. Here’s a list of the major sections.
----element----
A. Nomads: Follow the F

## Cleaning Strategies

In [None]:
import re


def clean_cid_chars(text: str) -> str:
    """
    Replace known (cid:N) glyph codes and Unicode ligature characters with
    plain ASCII letters, then strip any (cid:N) codes that remain.

    This is a heuristic: CID numbers are font-specific, so the mapping below
    only holds for fonts that happen to use the same codes.
    """
    if not isinstance(text, str):
        return text  # Return non-string inputs as-is

    cid_replacements = {
        "(cid:35)": "ff",
        "(cid:69)": "f", 
        "(cid:44)": "fi", 
        "(cid:48)": "fi", 
        "(cid:49)": "fi", 
        "(cid:59)": "fi", 
        # Add other common ones if you find them consistently:
        # "(cid:80)": "i",
        # "(cid:81)": "l",
        # "(cid:82)": "t",
        # "(cid:83)": "s",
        # "(cid:84)": "r",
        # "(cid:85)": "e",
    }

    # 1. Replace the known (cid:N) codes with their assumed letters
    for cid_pattern, replacement_char in cid_replacements.items():
        text = text.replace(cid_pattern, replacement_char)

    # 2. Expand Unicode ligature characters (fi, ff, fl, ffi, ffl) into plain letters
    text = text.replace("ﬁ", "fi")
    text = text.replace("ﬀ", "ff")
    text = text.replace("ﬂ", "fl")
    text = text.replace("ﬃ", "ffi")
    text = text.replace("ﬄ", "ffl")

    # 3. Remove any remaining (cid:X) patterns (fallback for unknown CIDs)
    # This regex looks for (cid: followed by one or more digits and a closing parenthesis
    text = re.sub(r'\(cid:\d+\)', '', text)

    # 4. Collapse runs of whitespace (including any left behind by the removals)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Note: each element is mutated in place, so `elements` and
# `elements_cleaned` end up referencing the same cleaned objects.
elements_cleaned = []
for element in elements:
    if hasattr(element, "text"):
        element.text = clean_cid_chars(element.text)
    elements_cleaned.append(element)
print("Finished cleaning elements.")

for element in elements_cleaned[:20]:
    print(f"element --> {element}")


Finished cleaning elements.
element --> I. CHAPTER OVERVIEW There’s a lot of stuf in this chapter—nine thousand years’ worth. So before you begin, read through the outline below so you’ll know where to find what you’re looking for when you return to this chapter for a mini-review. (Remember, the key to doing well is to go through the chapter once, delve into the areas you are clueless or semi-clueless about, then return here for a mini-review.) I. Chapter Overview You’re reading it!
element --> II. Stay Focused on the Big Picture
element --> Organize the zillions of facts from the 9,000 years covered in this chapter into some big-picture concepts.
element --> III. History Review through 600 C.E.
element --> This is the bulk of the chapter, where we plow through the major civilizations, people, and events. Again, we suggest that if you’re totally clueless on a section, review the corresponding section in your textbook. Here’s a list of the major sections.
element --> A. Nomads: Follow t
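
The mapping in `clean_cid_chars` is hand-picked, and the comment there suggests adding codes "if you find them consistently". A minimal sketch of one way to find them: count the `(cid:N)` codes in the raw text and look at a few surrounding snippets to guess the missing letters. The helper names `count_cid_codes` and `cid_contexts` and the sample strings are hypothetical, made up for illustration. Because the cleaning loop above overwrites `element.text` in place, this would have to run on freshly partitioned elements, before cleaning.

```python
import re
from collections import Counter

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def count_cid_codes(texts):
    """Return a Counter of CID numbers found across an iterable of strings."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):
            counts.update(int(n) for n in CID_PATTERN.findall(text))
    return counts


def cid_contexts(texts, cid, width=15, limit=5):
    """Collect up to `limit` short snippets around occurrences of one CID code."""
    token = f"(cid:{cid})"
    snippets = []
    for text in texts:
        if not isinstance(text, str):
            continue
        start = text.find(token)
        while start != -1 and len(snippets) < limit:
            lo = max(0, start - width)
            hi = start + len(token) + width
            snippets.append(text[lo:hi])
            start = text.find(token, start + 1)
    return snippets


# Made-up sample strings standing in for raw element text
sample = ["a lot of stu(cid:35) here", "e(cid:35)ort and (cid:44)nd"]
print(count_cid_codes(sample).most_common())  # [(35, 2), (44, 1)]
print(cid_contexts(sample, 35, width=3))      # ['stu(cid:35) he', 'e(cid:35)ort']
```

On the real data, something like `count_cid_codes(el.text for el in elements)` would give the frequency table, and the snippets for the most frequent codes show which letters are missing.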

## Chunking Strategies

In [16]:
from unstructured.chunking.basic import chunk_elements

chunks = chunk_elements(elements_cleaned, max_characters=1000, new_after_n_chars=800, overlap=100)
for chunk in chunks[50:100]:
    print(f"Chunk -> {chunk}")


Chunk -> ndividual communities. It just goes to show, once again, that not all human societies have followed the same path toward sophistication, and that urbanization doesn’t necessarily mean centralization.
Chunk -> Focus On: Migrations Why do people migrate? People migrate for the same reason animals do: to find food and a hospitable environment in which to live. Nomadic peoples by definition are migratory, moving from place to place with the seasons to follow food sources. Agricultural peoples also migrated, following the seasons and therefore agricultural cycles. To maintain a stable home, people also migrated to avoid natural disasters or climatic changes that permanently change the environment, making it too hot and dry (the Sahara Desert’s expansion), too cold (Ice Ages), or too wet (flooding cycles of major rivers such as the Yellow River in China). Migration isn’t always solely the result of random environmental change. Overpopulation of a particular area can exhaust the food
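
The first chunk printed above starts mid-word ("ndividual"), which may be an effect of `overlap=100`: a tail of the previous chunk is prefixed to the next one, and a fixed-length tail can land inside a word. The toy function below illustrates that idea on plain strings. It is a simplified sketch, not unstructured's implementation, and `split_with_overlap` is a hypothetical name. As far as the unstructured docs describe it, `overlap` by default only applies when a single oversized element is split, and `overlap_all=True` extends it to every chunk boundary.

```python
def split_with_overlap(text, max_characters=1000, overlap=100):
    """Cut `text` into windows of at most `max_characters`, each one
    starting with the last `overlap` characters of the previous window."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    chunks = []
    step = max_characters - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk after the first repeats the final character of its predecessor, the same way a 100-character tail would be repeated with the notebook's settings.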

In [25]:
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(elements_cleaned, combine_text_under_n_chars=50, max_characters=1000, new_after_n_chars=800)
for chunk in chunks[200:250]:
    print(f"Chunk -> {chunk}")


Chunk -> empire-building, especially among those in the church. Under Charlemagne, a strong focus was placed on the arts and education, but not surprisingly with a much more religious bent—much of this eort centered in the monasteries under the direction of the church. And though Charlemagne was very powerful, his rule was not absolute. Society was structured around Feudalism (more on feudalism shortly). Thus Charlemagne had overall control of the empire, but the local lords held power over the local territories, answering to Charlemagne only on an as-needed basis. And because Charlemagne did not levy taxes, he failed to build a strong and united empire. After his death, and the death of his son Louis, the empire was divided among his three grandsons according to the Treaty of Verdun in 843. The Vikings: Raiders from the Norse During this time, western Europe continued to be attacked by powerful invaders, notably the Vikings from Scandinavia and the Magyars from Hungary. Although the V
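
Printing slices of chunks makes it hard to compare the two strategies. A minimal sketch of a length summary that could be run on the output of both `chunk_elements` and `chunk_by_title`; the helper name `summarize_chunk_lengths` is hypothetical. It relies only on `str(chunk)` returning the chunk text, which is what the `print(f"Chunk -> {chunk}")` calls above already depend on.

```python
from statistics import mean, median


def summarize_chunk_lengths(chunks):
    """Return count, min, max, mean and median character length of chunks."""
    lengths = [len(str(chunk)) for chunk in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        "median": median(lengths),
    }


# Plain strings stand in for chunk objects here
demo = ["a" * 10, "b" * 20, "c" * 30]
print(summarize_chunk_lengths(demo))
```

Calling it once per strategy with the same `max_characters` and `new_after_n_chars` shows how many chunks each produces and how close they sit to the size limits.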