## Exploring Various PDF Chunking Strategies

In [1]:
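from unstructured.partition.pdf import partition_pdf

# "fast" reads the PDF's embedded text layer directly (no OCR, no layout model).
elements = partition_pdf(
    filename="raw/ap_history_guide.pdf",
    strategy="fast",
    languages=["eng"],
    # Per the unstructured docs, table inference only applies with strategy="hi_res".
    infer_table_structure=True,
    include_metadata=True,
)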



In [2]:
for element in elements[:20]:
    print("----element----")
    print(element.text)


----element----
I. CHAPTER OVERVIEW There’s a lot of stu(cid:35) in this chapter—nine thousand years’ worth. So before you begin, read through the outline below so you’ll know where to (cid:44)nd what you’re looking for when you return to this chapter for a mini-review. (Remember, the key to doing well is to go through the chapter once, delve into the areas you are clueless or semi-clueless about, then return here for a mini-review.)    I.  Chapter Overview You’re reading it!
----element----
II.  Stay Focused on the Big Picture
----element----
Organize the zillions of facts from the 9,000 years covered in this chapter into some big-picture concepts.
----element----
III.  History Review through 600 C.E.
----element----
This is the bulk of the chapter, where we plow through the major civilizations, people, and events. Again, we suggest that if you’re totally clueless on a section, review the corresponding section in your textbook. Here’s a list of the major sections.
----element----
A.  
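The output above shows `(cid:35)` and `(cid:44)` where the PDF font's ligature glyphs could not be decoded. Before writing a replacement table, it can help to see which codes occur and inside which word fragments. The following is a minimal sketch under stated assumptions: the helper name `survey_cid_codes` and the `prefix_suffix` fragment format are illustrative, not part of `unstructured`, and the sample strings are excerpts of the output above.

```python
import re
from collections import Counter, defaultdict

# Capture the word characters immediately before and after each (cid:N) code.
CID_IN_WORD = re.compile(r"(\w*)\(cid:(\d+)\)(\w*)")


def survey_cid_codes(texts):
    """Map each CID code to a Counter of 'prefix_suffix' fragments it appears in."""
    contexts = defaultdict(Counter)
    for text in texts:
        for prefix, code, suffix in CID_IN_WORD.findall(text):
            contexts[int(code)][f"{prefix}_{suffix}"] += 1
    return contexts


# Excerpts from the first element printed above.
sample = [
    "There’s a lot of stu(cid:35) in this chapter",
    "so you’ll know where to (cid:44)nd what you’re looking for",
]
for code, fragments in sorted(survey_cid_codes(sample).items()):
    print(code, fragments.most_common(3))
```

With the real data, `survey_cid_codes(e.text for e in elements)` would do the same over every element; fragments such as `stu_` and `_nd` make it easier to guess that a code stands for `ff` or `fi`.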

## Cleaning Strategies

In [3]:
import re


def clean_cid_chars(text: str) -> str:
    """
    Replaces known (cid:X) codes and Unicode ligature characters in extracted text.
    This is a heuristic: CID codes are font-specific, so the mapping may not cover all cases.
    """
    if not isinstance(text, str):
        return text  # Return non-string inputs as-is

    # 1. Replace the (cid:X) codes observed in this PDF with the letters they stand for
    cid_replacements = {
        "(cid:35)": "ff",
        "(cid:69)": "f", 
        "(cid:44)": "fi", 
        "(cid:48)": "fi", 
        "(cid:49)": "fi", 
        "(cid:59)": "fi", 
        # Add other common ones if you find them consistently:
        # "(cid:80)": "i",
        # "(cid:81)": "l",
        # "(cid:82)": "t",
        # "(cid:83)": "s",
        # "(cid:84)": "r",
        # "(cid:85)": "e",
    }

    for cid_pattern, replacement_char in cid_replacements.items():
        text = text.replace(cid_pattern, replacement_char)

    # 2. Expand Unicode ligature characters (ff, fi, fl, ffi, ffl) into plain letters
    text = text.replace("ﬁ", "fi")
    text = text.replace("ﬀ", "ff")
    text = text.replace("ﬂ", "fl")
    text = text.replace("ﬃ", "ffi")
    text = text.replace("ﬄ", "ffl")

    # 3. Remove any remaining (cid:X) patterns (fallback for unknown CIDs)
    # This regex looks for (cid: followed by one or more digits and a closing parenthesis
    text = re.sub(r'\(cid:\d+\)', '', text)

    # Optional: Clean up extra spaces that might result from removals
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Clean each element's text in place; elements_cleaned holds the same objects as elements.
elements_cleaned = []
for element in elements:
    if hasattr(element, "text"):
        element.text = clean_cid_chars(element.text)
    elements_cleaned.append(element)
print("Finished cleaning elements.")

for element in elements_cleaned[:20]:
    print(f"element --> {element}")


Finished cleaning elements.
element --> I. CHAPTER OVERVIEW There’s a lot of stuff in this chapter—nine thousand years’ worth. So before you begin, read through the outline below so you’ll know where to find what you’re looking for when you return to this chapter for a mini-review. (Remember, the key to doing well is to go through the chapter once, delve into the areas you are clueless or semi-clueless about, then return here for a mini-review.) I. Chapter Overview You’re reading it!
element --> II. Stay Focused on the Big Picture
element --> Organize the zillions of facts from the 9,000 years covered in this chapter into some big-picture concepts.
element --> III. History Review through 600 C.E.
element --> This is the bulk of the chapter, where we plow through the major civilizations, people, and events. Again, we suggest that if you’re totally clueless on a section, review the corresponding section in your textbook. Here’s a list of the major sections.
element --> A. Nomads: Follow 
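Step 2 of `clean_cid_chars` replaces five ligature characters one at a time. As an alternative sketch, Python's standard `unicodedata.normalize` with the `NFKC` form expands all of them in one call. The helper name `expand_ligatures` is illustrative. Note that NFKC also rewrites other compatibility characters (for example, superscript digits become plain digits), so it is broader than the explicit replacements and may not suit every document.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters, including the ff/fi/fl/ffi/ffl ligatures."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("ﬁnd the stuﬀ in the oﬃce"))
# prints: find the stuff in the office
```

Curly apostrophes such as the one in `you’re` have no compatibility decomposition, so they pass through unchanged.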

## Chunking Strategies

In [16]:
from unstructured.chunking.basic import chunk_elements

chunks = chunk_elements(elements_cleaned, max_characters=750, new_after_n_chars=650, overlap=100)
for chunk in chunks[50:100]:
    print(f"Chunk -> {chunk}")


Chunk -> lives. They also believed that they would be able to use their bodies in the afterlife, and this led to the invention of mummification, a process of preserving dead bodies (although this was only available to the elite members of Egyptian society). The pharaohs, as you know, built huge pyramids to house their mummified bodies and earthly treasures. Egyptian Women, Hear Them Roar The first female ruler known in history was Queen Hatshepsut, who ruled for 22 years during the New Kingdom. She is credited with greatly expanding Egyptian trade expeditions. The relatively high status of women extended beyond royalty with most Egyptian women enjoying more rights and opportunities to express individuality than their counterparts in Mesopotamia.
Chunk -> ying more rights and opportunities to express individuality than their counterparts in Mesopotamia. During the New Kingdom in particular, women could buy and sell property, inherit property, and choose to will their property how they p
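The second chunk above starts mid-word (`ying more rights ...`) because it repeats the last 100 characters of the previous chunk. To make the interaction between `max_characters` and `overlap` concrete, here is a simplified character-window sketch. It is an illustration only, not the `unstructured` implementation, which also respects element boundaries and `new_after_n_chars`; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, max_characters: int = 750, overlap: int = 100) -> list[str]:
    """Cut text into windows of at most max_characters that share `overlap` characters."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    step = max_characters - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A sentence taken from the chunk output above, repeated to get enough text.
sample = (
    "most Egyptian women enjoying more rights and opportunities to express "
    "individuality than their counterparts in Mesopotamia. "
) * 5
pieces = split_with_overlap(sample, max_characters=120, overlap=30)
for previous, current in zip(pieces, pieces[1:]):
    # Each window begins with the tail of the one before it.
    assert current[:30] == previous[-30:]
print(len(pieces), "windows")
```

Because the window boundaries are counted in characters, the repeated tail can begin in the middle of a word, which matches what the output above shows.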

In [15]:
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(elements_cleaned, combine_text_under_n_chars=50, max_characters=750, new_after_n_chars=650, overlap=100)
for chunk in chunks[300:350]:
    print(f"Chunk -> {chunk}")


Chunk -> captured by the French, tried by the English, and burned at the stake by the French. Nevertheless, she had a signicant impact on the Hundred Years’ War (1337–1453) between England and France, which eventually resulted in England’s withdrawal from France. After the Hundred Years’ War, royal power in France became more centralized. Under a series of monarchs known as Bourbons, France was unified and became a major power on the European continent. At around the same time, Spain was united by Queen Isabella, the ruler of Castille (present-day central Spain). Power in the Spanish- speaking region of Europe had been divided for two reasons: rst, Castille was one of three independent Spanish kingdoms, and therefore no single ruler controlled the
Chunk -> astille was one of three independent Spanish kingdoms, and therefore no single ruler controlled the region, and second, the peasants were split along religious lines (mostly Christian and Muslim), due to the lasting inuences of the M
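Words such as `signicant`, `rst`, and `inuences` in the output above are missing a ligature, which suggests that step 3 of `clean_cid_chars` deleted `(cid:X)` codes that had no entry in `cid_replacements`. One possible alternative, sketched below, is to leave a visible marker for unmapped codes and count them, so the gaps can be found and the mapping extended. The name `mark_unmapped_cids`, the integer-keyed `known` mapping, and code `99` in the example are hypothetical.

```python
import re
from collections import Counter

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def mark_unmapped_cids(text, known, marker="\N{REPLACEMENT CHARACTER}"):
    """Replace known CID codes; mark unknown ones instead of deleting them.

    Returns the processed text and a Counter of the unmapped codes.
    """
    unmapped = Counter()

    def substitute(match):
        code = int(match.group(1))
        if code in known:
            return known[code]
        unmapped[code] += 1  # remember the gap so the mapping can be extended
        return marker

    return CID_PATTERN.sub(substitute, text), unmapped


# 35 -> "ff" comes from the mapping above; 99 is a made-up unmapped code.
cleaned, unmapped = mark_unmapped_cids("signi(cid:99)cant stu(cid:35)", {35: "ff"})
print(cleaned)
print(unmapped.most_common())
```

Searching the cleaned text for the marker then points directly at the words that still need a mapping entry.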