# Pdf exploration and preparation test

Main goal is to check if we can read the pdf, extract only relevant content, see how we can post-process the extracted text and finally have some informations about the extracted text.

Links to `Simple Local RAG Tutorial` :
* [GitHub](https://github.com/mrdbourke/simple-local-rag) ;
* [YouTube](https://youtu.be/qN_2fnOPY-M?si=APnkpsGY0z_scJ9Z).

In [2]:
from pathlib import Path
from pprint import pprint
import re

import pdfplumber
import pandas as pd
import nltk

## Extract the pdf pages

### Set the file path

In [29]:
PDF_FILENAME = "source.pdf"

p = Path()
p = p.resolve() / "pdf"
q = p / PDF_FILENAME

print(type(q))

if q.is_file():
    print(f"Pdf file path : '{q}'.")
else:
    print("No pdf file found.")

<class 'pathlib.PosixPath'>
Pdf file path : '/home/anquetos/gcp-professional-data-engineer-rag/pdf/source.pdf'.


### Read the pdf

Let's see if the number of pages found is the right one.

In [4]:
with pdfplumber.open(q) as pdf:
    print(
        f"* Expected number of pages : \t355\n* Number of pages found : \t{len(pdf.pages)}"
    )

* Expected number of pages : 	355
* Number of pages found : 	355


That's ok, we can try to extract text from a random test page.

In [5]:
with pdfplumber.open(q) as pdf:
    page = pdf.pages[101]
    text = page.extract_text()
    print(text[:90])

Data pipelines are sequences of operations that copy, trans-
form, load, and analyze data.


The extraction works but the text doesn't correspond to the one in the selected page above. First thing to take in account is the fact that the first item in a list is at index 0. So when we write `page = pdf.pages[101]`, in fact it is the page 102 which is extracted.
But it is still not ok : the extracted text correpond to page 62 which means page 1 in the pdf is actually the page 41 (index 40). The reason is all the "About", "Introduction", etc. sections are not numbered the same way in the pdf file.
This is something to take in account to extract the desired content.

### Target relevant text

Documents can have several information which are not relevant to build a RAG :
* headers and footers ;
* tables ;
* hyperlink ;
* figures ;
* etc..

We only want to keep the body of the document but also the code samples even if a part of this last is not always relevant. Since each document is different, there is not a unique method to determine what is relevant or not. The only way to handle this is to take time to inspect the document structure, layout, etc..

In my case, it appears that the **font** will be the best way to help me target the body and the code.

> Take note that working with fonts means we will extract the text character by character to access its properties thanks to the [`chars` object](https://github.com/jsvine/pdfplumber?tab=readme-ov-file#objects) available for each instance of `pdfplumber.PDF` and `pdfplumber.Page`.

In [6]:
with pdfplumber.open(q) as pdf:
    page = pdf.pages[43]
    header_font = page.chars[3].get("fontname")
    body_font = page.chars[103].get("fontname")
    print(f"* Header fontname : \t{header_font}\n* Body fontname : \t{body_font}")

* Header fontname : 	GHSRZR+UniversLTStd
* Body fontname : 	GHSRZR+SabonLTStd-Roman


In [7]:
with pdfplumber.open(q) as pdf:
    page = pdf.pages[50]
    code_font = page.extract_text_lines(return_chars=True)[8]["chars"][0].get(
        "fontname"
    )
    print(f"* Sample code fontname : \t{code_font}")

* Sample code fontname : 	GHSRZR+SourceCodePro-Regular


Header, body and code have different fonts which is of great help. The last thing to take care of is the fact that the text we want to target can be *italic* or **bold**. So let's make a list of all available fonts in the file.

In [8]:
# Extract all fonts in the document
fontname_list = []
with pdfplumber.open(q) as pdf:
    for page in pdf.pages:
        [
            fontname_list.append(char.get("fontname"))
            for char in page.chars
            if char.get("fontname") not in fontname_list
        ]

In [11]:
# List only the necessary fonts
body_fontname_list = [
    fontname
    for fontname in fontname_list
    if "Sabon" in fontname or "SourceCode" in fontname
]
print(body_fontname_list)

['GHSRZR+SabonLTStd-Roman', 'GHSRZR+SourceCodePro-Regular', 'GHSRZR+SabonLTStd-Bold', 'GHSRZR+SabonLTStd-Italic', 'URTXBU+SourceCodePro-Bold']


Last step for the font part : we will create a helper function to filter the extracted text by font using the fontname of each character.

In [12]:
# Font filter helper funtion
def filter_text_by_font(chars: list[dict], target_fonts: list[str]) -> str:
    """Filters extracted text and, more precisely, its letters by their fonts.

    Args:
        chars (list[dict]): chars object from pdfplumber.
        target_fonts (list[str]): list of fontnames for which we want to keep the characters/text.

    Returns:
        str: filtered text.
    """
    char_text = [char["text"] for char in chars if char.get("fontname") in target_fonts]
    text = "".join(char_text)
    return text

### Text post-processing

#### Basic formatting

The goal is to have the cleanest text as possible for further steps. We will remove uppercase and unecessary spaces. In addition to that, we will also replace *fifi* string by *fi*. This is a specific error I noticed after the extraction of my document which shows how important it is to inspect each document carefully to identify the best way to process it.
Here is a sample text.

In [23]:
# Basic text formatter function
def basic_text_formatter(text: str) -> str:
    """Applies different operations to format and clean the text.

    Args:
        text (str): original text.

    Returns:
        str: formatted text.
    """
    formatted_text = " ".join(
        text.casefold().replace("\n", " ").replace("fifi", "fi").split()
    )
    return formatted_text

In [26]:
basic_text_sample = " I'm a Basic   text sample. "

print(
    f"* Before : \t{basic_text_sample}\n* After : \t{basic_text_formatter(basic_text_sample)}"
)

* Before : 	 I'm a Basic   text sample. 
* After : 	i'm a basic text sample.


#### Hyphens

Hyphens are used to break words so that the appearance of the page is nicer but it will interfere in the words recognition.

In [27]:
with pdfplumber.open(q) as pdf:
    page = pdf.pages[237]
    text = page.extract_text()
    hyphen_text_sample = text[1066:1078]
    print(hyphen_text_sample)

con-
necting


In [17]:
def remove_hyphens(text: str) -> str:
    """Removes hyphens from text.

    Args:
        text (str): original text.

    Returns:
        str: processed text.
    """
    lines = [line.rstrip() for line in text.split("\n")]

    # Find dashes
    line_numbers = []
    for line_no, line in enumerate(lines[:-1]):
        if line.endswith("-"):
            line_numbers.append(line_no)

    # Replace
    for line_no in line_numbers:
        lines = dehyphenate(lines, line_no)

    return " ".join(lines)


def dehyphenate(lines: list[str], line_no: int) -> list[str]:
    """Rebuilds lines (words) separated by hyphen.

    Args:
        lines (list[str]): lines to process.
        line_no (int): index of lines to process.

    Returns:
        list[str]: list of modified lines.
    """
    next_line = lines[line_no + 1]
    word_suffix = next_line.split(" ")[0]

    lines[line_no] = lines[line_no][:-1] + word_suffix
    lines[line_no + 1] = lines[line_no + 1][len(word_suffix) :]
    return lines

In [18]:
print(
    f"* Before : \t{hyphen_text_sample}\n* After : \t{remove_hyphens(hyphen_text_sample)}"
)

* Before : 	con-
necting
* After : 	connecting 


### Text extraction

We now have all our "tools"" to extract the pdf pages correctly and in a relevant way. To refine a bit more our target will remove the pages we don't want to keep (like introduction, glossary, etc.) and we will skip the blank pages (with no content).

Do do this, We will write a final function to process our whole document. Pages will be stored in a list of dictionnaries where we will be able to add information like page number, number of characters, tokens, sentences, etc.. This will help us append different information an make further analysis by converting it to a DataFrame.

For more information about tokens [see here](https://python.langchain.com/docs/concepts/tokens/) and [here](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

In [54]:
def extract_and_process_pdf(path: Path) -> list[dict]:
    """Open a pdf file with pdfplumber, extracts and formats relevant pages then append 
    their content and statistics in a list.

    Args:
        path (Path): Pathlib path of the document.

    Returns:
        list[dict]: Extracted content and informations of pages.
    """
    extracted_pages = []

    with pdfplumber.open(path) as pdf:
        for page_idx, page in enumerate(pdf.pages):
            page_number = page_idx - 39
            lines = page.extract_text_lines(return_chars=True, keep_blank_chars=True)

            kept_lines = []
            for line in lines:
                kept_lines.append(
                    filter_text_by_font(line["chars"], body_fontname_list)
                )
            text = "\n".join(kept_lines)

            text = remove_hyphens(text)
            text = basic_text_formatter(text)

            if 0 < page_number <= 305 and text:
                extracted_pages.append(
                    {
                        "page_number": page_number,
                        "page_chars_count": len(text),
                        "page_words_count": len(text.split(" ")),
                        "page_raw_sentences_count": len(re.split(r'[.?!]', text)),
                        "page_raw_tokens_count": len(text) // 4,
                        "page_text": text
                    }
                )

    return extracted_pages

In [55]:
extracted_pages = extract_and_process_pdf(q)

In [56]:
list(filter(lambda d: d.get('page_number') == 2, extracted_pages))

[{'page_number': 2,
  'page_chars_count': 2022,
  'page_words_count': 314,
  'page_raw_sentences_count': 18,
  'page_raw_tokens_count': 505,
  'page_text': 'data engineers choose how to store data for many different situations. sometimes data is written to a temporary staging area, where it stays only seconds or less before it is read by an application and deleted. in other cases, data engineers arrange long-term archival storage for data that needs to be retained for years. data engineers are increasingly called on to work with data that streams into storage constantly and in high volumes. internet of things (iot) devices are an example of streaming data. another common use case is storing large volumes of data for batch processing, including using data to train machine learning models. data engineers also consider the range of variety in the structure of data. some data, like the kind found in online transaction processing, is highly structured and varies little from one datum to the

In [57]:
df = pd.DataFrame(extracted_pages)

In [74]:
df.describe().drop(columns=["page_number"]).loc[["mean", "min", "max"]]

Unnamed: 0,page_chars_count,page_words_count,page_raw_sentences_count,page_raw_tokens_count
mean,2082.933333,331.968421,21.315789,520.382456
min,121.0,22.0,1.0,30.0
max,3705.0,624.0,55.0,926.0


## WIP

In [74]:
words_nltk = nltk.tokenize.word_tokenize(filtered_text)
len_words_nltk = len(nltk.tokenize.word_tokenize(filtered_text))
sentences_nltk = nltk.tokenize.sent_tokenize(filtered_text)
page_sentences_count_nltk = len(sentences_nltk)

In [75]:
pprint(sentences_nltk)

['catalog that lists both appliances and furniture.',
 'Here is an example of how a dishwasher and a chair might be represented:In '
 'addition to document databases, wide-column databases, such as Bigtable and '
 'Cassandra, are also used with datasets with varying attributes.Data is '
 'accessed in different ways for different use cases.',
 'Some time-series data points may be read immediately after they are written, '
 'but they are not likely to be read once they are more than a day old.',
 'Customer order data may be read repeatedly as an order is processed.',
 'Archived data may be accessed less than once a year.',
 'Four metrics to consider about data access are as follows:How much data is '
 'retrieved in a read operation?How much data is written in an insert '
 'operation?How often is data written?How often is data read?Some read and '
 'write operations apply to small amounts of data.',
 'Reading or writing a single piece of telemetry data is an example.',
 'Writing an e-comm

In [None]:
def text_formatter(text: str) -> str:
    formatted_text = text.casefold()
    formatted_text = formatted_text.replace("\n", " ").strip()
    formatted_text = " ".join(formatted_text.split())
    return formatted_text


def get_text_from_pdf(filepath: str, page_offset: int = 0) -> list[dict]:
    pages_information = []
    reader = PdfReader(filepath)
    for page_number in range(40, 316):
        page = reader.pages[page_number]
        text = page.extract_text()
        text = text_formatter(text)
        pages_information.append(
            {
                "page_number": page_number - page_offset + 1,
                "page_characters_count": len(text),
                "page_words_count": len(text.split(" ")),
                "page_sentences_count_raw": len(text.split(". ")),
                "page_tokens_count": len(text) / 4,
                "text": text,
            }
        )

    return pages_information


pages_info = get_text_from_pdf(q, 40)

In [24]:
df = pd.DataFrame(pages_info)
df.head()

Unnamed: 0,page_number,page_characters_count,page_words_count,page_sentences_count_raw,page_tokens_count,text
0,1,406,54,2,101.5,chapter 1 selecting appropriate storage techno...
1,2,2094,336,16,523.5,data engineers choose how to store data for ma...
2,3,2130,354,13,532.5,from business requirements to storage systems ...
3,4,2823,474,18,705.75,4 chapter 1 ■ selecting appropriate storage te...
4,5,2598,436,19,649.5,from business requirements to storage systems ...


In [25]:
df.describe()

Unnamed: 0,page_number,page_characters_count,page_words_count,page_sentences_count_raw,page_tokens_count
count,276.0,276.0,276.0,276.0,276.0
mean,138.5,2028.518116,326.405797,18.40942,507.129529
std,79.818544,800.690886,132.07489,10.289401,200.172721
min,1.0,0.0,1.0,1.0,0.0
25%,69.75,1635.25,254.75,11.0,408.8125
50%,138.5,2173.0,349.5,17.5,543.25
75%,207.25,2596.5,418.25,25.0,649.125
max,276.0,3629.0,586.0,46.0,907.25


### Words and sentences with NLTK

In [None]:
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/anquetos/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [37]:
for item in pages_info:
    item["words_nltk"] = len(nltk.tokenize.word_tokenize(item["text"]))
    item["sentences_nltk"] = nltk.tokenize.sent_tokenize(item["text"])
    item["page_sentences_count_nltk"] = len(item["sentences_nltk"])

NameError: name 'pages_info' is not defined

In [27]:
pprint(pages_info[11])

{'page_characters_count': 2316,
 'page_number': 12,
 'page_sentences_count_nltk': 19,
 'page_sentences_count_raw': 18,
 'page_tokens_count': 579.0,
 'page_words_count': 382,
 'sentences_nltk': ['12 chapter 1 ■ selecting appropriate storage technologies '
                    'cloud storage supports ingesting large volumes of data in '
                    'bulk using tools such as the cloud transfer service and '
                    'transfer appliance.',
                    '(cloud storage also supports streaming transfers, but '
                    'bulk reads and writes are more common.)',
                    'data in cloud storage is read at the object or the file '
                    'level.',
                    'you typically don’t, for example, seek a particular block '
                    'within a file as you can when storing a file on a '
                    'filesystem.',
                    'it is common to read large volumes of data in bigquery as '
                    'we

In [91]:
df = pd.DataFrame(pages_info)
df.describe()

Unnamed: 0,page_number,page_characters_count,page_words_count,page_sentences_count_raw,page_tokens_count,words_nltk,page_sentences_count_nltk
count,276.0,276.0,276.0,276.0,276.0,276.0,276.0
mean,138.5,2028.518116,326.405797,18.40942,507.129529,363.884058,16.199275
std,79.818544,800.690886,132.07489,10.289401,200.172721,146.445923,8.170911
min,1.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,69.75,1635.25,254.75,11.0,408.8125,282.0,10.75
50%,138.5,2173.0,349.5,17.5,543.25,390.0,17.0
75%,207.25,2596.5,418.25,25.0,649.125,460.5,22.0
max,276.0,3629.0,586.0,46.0,907.25,650.0,40.0


### Chunking sentences

In [None]:
# Define split size to turn groups of sentences into chunks
# The default value (16) is based on the averagen number of sentences per page
sentence_chunk_size = 16


# Create a function that recursively splits a list into desired sizes
def split_list(input_list: list, slice_size: int) -> list[list[str]]:
    return [
        input_list[i : i + slice_size] for i in range(0, len(input_list), slice_size)
    ]


# Loop through pages and texts and split sentences into chunks
for item in pages_info:
    item["sentence_chunks"] = split_list(
        input_list=item["sentences_nltk"], slice_size=sentence_chunk_size
    )
    item["page_chunks_count"] = len(item["sentence_chunks"])

In [99]:
pprint(pages_info[11])

{'num_chunks': 2,
 'page_characters_count': 2316,
 'page_chunks_count': 2,
 'page_number': 12,
 'page_sentences_count_nltk': 19,
 'page_sentences_count_raw': 18,
 'page_tokens_count': 579.0,
 'page_words_count': 382,
 'sentence_chunks': [['12 chapter 1 ■ selecting appropriate storage '
                      'technologies cloud storage supports ingesting large '
                      'volumes of data in bulk using tools such as the cloud '
                      'transfer service and transfer appliance.',
                      '(cloud storage also supports streaming transfers, but '
                      'bulk reads and writes are more common.)',
                      'data in cloud storage is read at the object or the file '
                      'level.',
                      'you typically don’t, for example, seek a particular '
                      'block within a file as you can when storing a file on a '
                      'filesystem.',
                      'it is common to 

In [100]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_info)
df.describe().round(2)

Unnamed: 0,page_number,page_characters_count,page_words_count,page_sentences_count_raw,page_tokens_count,words_nltk,page_sentences_count_nltk,num_chunks,page_chunks_count
count,276.0,276.0,276.0,276.0,276.0,276.0,276.0,276.0,276.0
mean,138.5,2028.52,326.41,18.41,507.13,363.88,16.2,1.52,1.52
std,79.82,800.69,132.07,10.29,200.17,146.45,8.17,0.57,0.57
min,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,69.75,1635.25,254.75,11.0,408.81,282.0,10.75,1.0,1.0
50%,138.5,2173.0,349.5,17.5,543.25,390.0,17.0,2.0,2.0
75%,207.25,2596.5,418.25,25.0,649.12,460.5,22.0,2.0,2.0
max,276.0,3629.0,586.0,46.0,907.25,650.0,40.0,3.0,3.0


### Split chunk into its own item

In [None]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in pages_info:
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(
            r"\.([A-Z])", r". \1", joined_sentence_chunk
        )  # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(
            [word for word in joined_sentence_chunk.split(" ")]
        )
        chunk_dict["chunk_token_count"] = (
            len(joined_sentence_chunk) / 4
        )  # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

420

In [102]:
# View a random sample
pages_and_chunks[12]

{'page_number': 8,
 'sentence_chunk': 'cloud bigtable, which is used for telemetry data and large-volume analytic applications, can store up to 8 tb per node when using hard disk drives, and it can store up to 2.5 tb per node when',
 'chunk_char_count': 191,
 'chunk_word_count': 36,
 'chunk_token_count': 47.75}

In [104]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,420.0,420.0,420.0,420.0
mean,136.94,1323.03,204.85,330.76
std,80.16,687.61,107.44,171.9
min,1.0,33.0,6.0,8.25
25%,68.0,759.75,111.75,189.94
50%,133.5,1493.5,228.5,373.38
75%,207.25,1852.5,291.0,463.12
max,276.0,2979.0,475.0,744.75


In [None]:
# Show random chunks with under 30 tokens in length
min_token_length = 10
for row in df[df["chunk_token_count"] <= min_token_length].iterrows():
    print(
        f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}'
    )

Chunk token count: 8.25 | Text: cloud dataflow is based on apache
Chunk token count: 9.0 | Text: a. hipaa b. gdpr c. coppa d. fedramp


### Embedding text chunks

In [None]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(
    model_name_or_path="all-mpnet-base-v2", device=None
)  # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer.",
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07982659e-02  3.03164534e-02 -2.01218035e-02  6.86484650e-02
 -2.55255979e-02 -8.47684871e-03 -2.07209232e-04 -6.32377788e-02
  2.81607267e-02 -3.33353542e-02  3.02634221e-02  5.30721508e-02
 -5.03526740e-02  2.62288693e-02  3.33314016e-02 -4.51577567e-02
  3.63045111e-02 -1.37119880e-03 -1.20170908e-02  1.14946989e-02
  5.04510663e-02  4.70857024e-02  2.11913791e-02  5.14606349e-02
 -2.03746632e-02 -3.58889550e-02 -6.67788729e-04 -2.94394027e-02
  4.95859347e-02 -1.05639659e-02 -1.52014326e-02 -1.31760747e-03
  4.48197573e-02  1.56022962e-02  8.60379203e-07 -1.21391530e-03
 -2.37978660e-02 -9.09376249e-04  7.34485686e-03 -2.53933994e-03
  5.23370355e-02 -4.68043499e-02  1.66215003e-02  4.71579656e-02
 -4.15599309e-02  9.01947613e-04  3.60278152e-02  3.42214368e-02
  9.68227461e-02  5.94829284e-02 -1.64984372e-02 -3.51249054e-02
  5.92513569e-03 -7.07929139e-04 -2.4103