## **INTRODUCTION**

This interactive notebook performs PDF chunking in a generic way. It splits titles, paragraphs, links, images, tables, and text inside images into chunks, from which embeddings can then be created.

### **Used Packages**

The packages used in this notebook and their purposes are listed below.

* **PyMuPDF:** Extracting paragraphs, links, titles, and images
* **pdfplumber:** Extracting tables
* **pdfminer:** Analyzing the page layout
* **pytesseract:** Extracting text with OCR
* **transformers:** Tokenizing
* **sentence-transformers:** Creating embeddings

The source code of these packages is available on GitHub, and usage questions are commonly discussed on forums such as Stack Overflow.

### **Considerations for the Document**

Several points about the input document should be considered. They are grouped under the headings below.

* **Links**
  When extracting links, the only requirement is that the text is in a valid link format.

* **Titles**
  No artificial intelligence support is used to extract titles. Instead, title detection relies on parameters such as font size, font type, and boldness. For a piece of text in the PDF file to be recognized as a title, the following should be considered:
    * The title should be in bold or have a point size greater than twelve.
    * A title must follow Python title case rules. For example, "Great Day" is treated as a title, whereas "Great day" is classified as text even if it is bold or larger than twelve points.
    * Titles should be kept as short as possible. The code can detect long titles, but short titles reduce the risk of unexpected results.
    * Special characters other than Turkish characters should be avoided. If such a character is required, it must be added to the avoid_letters parameter alongside the Turkish characters. Otherwise, it is lost when characters outside the printable ASCII range are cleaned.

* **Texts**
  The code treats all text other than titles as paragraphs. For text to be classified as a paragraph, it must therefore not meet the title criteria.

* **Tables**
  Tables should be defined with **each column having a header**. A None check is performed for columns without headers, but tables should still be designed according to this rule to obtain correct output.

* **Images**
  Any image can be saved as a byte array; no other criteria apply. For extracting text from an image with OCR, however, image quality matters. Complex text should be avoided, and the text should contrast with the background color where possible. For example, white text on a black image gives better results.

* **PDF lines**
  In tests with the developed code, some pages divided in half caused no problems. On pages divided into three, however, pdfplumber interprets the text as tables. PDF files should therefore be prepared with pages divided into at most two parts.
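
The title rules listed above can be sketched as a small predicate. This is a minimal illustration of the rules as written in this section, not necessarily the exact check used in the notebook's classification cell; the function name `looks_like_title` and the `min_size` parameter are assumptions made for the example.

```python
def looks_like_title(text: str, font_size: float, is_bold: bool,
                     min_size: float = 12.0) -> bool:
    """Return True when a span meets the title rules described above."""
    stripped = text.strip()
    if not stripped:
        return False
    # Rule 1: the span is bold or has a point size greater than twelve
    prominent = is_bold or font_size > min_size
    # Rule 2: the text follows Python title case rules
    return prominent and stripped.istitle()


print(looks_like_title("Great Day", 11.0, True))   # True
print(looks_like_title("Great day", 14.0, True))   # False
```

Note that `str.istitle()` also rejects headings that contain lowercase function words, such as "Tutoring to Enhance Science Skills", so documents written for this rule need every word capitalized.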

### **User Interface**

Some form fields in the user interface work only in the **Google Colab** environment. When running locally, the required values must be written into the code manually. These fields are usually marked with a **"#@"** comment.
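
For local runs, one possible approach is to detect whether the code is running in Colab and otherwise fall back to values set by hand. The sketch below relies on the commonly used check for the `google.colab` module; the names `running_in_colab` and `LOCAL_DEFAULTS` are hypothetical, and the default values simply mirror the form fields used later in this notebook.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module has been loaded."""
    return "google.colab" in sys.modules


# Hypothetical local defaults that mirror the notebook's form fields
LOCAL_DEFAULTS = {
    "max_length": 512,
    "model_type": "Turkish",
    "avoid_letters": "çğıİöşüÇĞİÖŞÜ",
}

if not running_in_colab():
    # Outside Colab the "#@param" comments have no effect,
    # so the values are assigned explicitly.
    max_length = LOCAL_DEFAULTS["max_length"]
    model_type = LOCAL_DEFAULTS["model_type"]
    avoid_letters = LOCAL_DEFAULTS["avoid_letters"]
```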

### **Suggestions**

The cells could be refactored into functions and collected in a Python file, and a user interface could then be designed on top of them. The notebook includes a simple user interface, but it works well only in Google Colab and does not offer the ease of use of a desktop application. The code is not yet fully integrated with an interface, because it is only example code for the department's PDF chunking process. It would be better to start the interface work after a fully satisfactory PDF chunking implementation exists.

### **Results**
This study examined chunking operations on PDF files. The findings suggest that PDF chunking depends on many parameters, because PDF files lack the rich structural information available in docx or html files. The work reached the creation of embeddings, which is the last stage before data is sent to an LLM. Because of the bank's internal rules, development within the scope of the internship stopped at this point.

The results lead to the following conclusion: developing completely generic PDF chunking code without ready-made artificial intelligence models or complex algorithms is very costly in terms of time. It would therefore be more practical for the business unit to prepare its documents according to a fixed template.
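
As a rough illustration of the final embedding stage, the sketch below attaches a vector to every non-empty chunk. It assumes the encoder is passed in as a callable; with sentence-transformers this would typically be the `encode` method of a `SentenceTransformer` instance. The names `embed_chunks` and `toy_encode` are hypothetical, and `toy_encode` is only a stand-in so that the example runs without downloading a model.

```python
from typing import Callable, Dict, List, Sequence


def embed_chunks(
    chunks: List[Dict[str, str]],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Dict]:
    """Attach an embedding vector to every chunk that has text."""
    # Empty chunks carry no information, so they are skipped
    kept = [c for c in chunks if c.get("text", "").strip()]
    vectors = encode([c["text"] for c in kept])
    return [
        {"type": c["type"], "text": c["text"],
         "embedding": [float(x) for x in vec]}
        for c, vec in zip(kept, vectors)
    ]


def toy_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder (NOT a real model): text length and space count."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


sample = [
    {"type": "title", "text": "Sample Data for Data Tables"},
    {"type": "title", "text": ""},
]
embedded = embed_chunks(sample, toy_encode)
print(len(embedded), embedded[0]["embedding"])
```

In the notebook, `toy_encode` would be replaced by something like `embedder.encode`, where `embedder` is the model loaded in the parameter cell.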

### **References**
Since this study is not a scientific article, it does not include a references section in IEEE format. The articles used during the development of the code are listed below.

For a comprehensive guide:
[https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517](https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517)

Creating smaller pdf files from pdf files:
[https://medium.com/@mahedi154/automated-pdf-content-extraction-and-chunking-with-python-d8f8012defda](https://medium.com/@mahedi154/automated-pdf-content-extraction-and-chunking-with-python-d8f8012defda)

A different approach, chunking process using Spacy:
[https://medium.com/jina-ai/search-pdf-text-images-and-tables-with-python-clip-d5f5dd961c77](https://medium.com/jina-ai/search-pdf-text-images-and-tables-with-python-clip-d5f5dd961c77)


In [None]:
# @title INSTALLING REQUIRED PACKAGES
!pip install pdfminer.six
!pip install pdfplumber
!pip install pdf2image
!pip install Pillow
!pip install PyMuPDF
!pip install transformers
!pip install sentence-transformers
!pip install ipywidgets
!pip install pytesseract
!sudo apt-get install tesseract-ocr
!pip freeze > requirements.txt


In [None]:
# @title IMPORTING PACKAGES
#For basic data science operations
import pandas as pd
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image, ImageEnhance, ImageFilter
from pdf2image import convert_from_path
#To extract links
import re
#To convert the data to JSON format
import json
#Importing PyMuPDF
import fitz
# To remove the additional created files
import os
#To tokenize the output
from transformers import AutoTokenizer
#Imports for embedding
from sentence_transformers import SentenceTransformer
import numpy as np
#Importing UI elements
import ipywidgets as widgets
from IPython.display import display
#For file process
import io
#For OCR
import pytesseract

In [None]:
# @title UPLOAD FILE
# Import necessary libraries
import ipywidgets as widgets
from IPython.display import display
import os
import time

# Create a file upload widget
uploader = widgets.FileUpload(
    accept='.pdf',  # Accept only PDF files
    multiple=False,  # Single file upload
    layout=widgets.Layout(width='200px', height='50px')  # Resize the upload button
)

# Create a box to center the upload button
center_box = widgets.HBox([uploader], layout=widgets.Layout(justify_content='center'))

# Create a progress bar widget
progress = widgets.IntProgress(
    value=0,
    min=0,
    max=100,
    step=1,
    description='Uploading:',
    bar_style='',  # 'success', 'info', 'warning', 'danger' or ''
    orientation='horizontal'
)

# Function to simulate progress (for demo purposes)
def simulate_progress(progress_widget, duration=2):
    steps = 100
    delay = duration / steps
    for i in range(steps):
        time.sleep(delay)
        progress_widget.value = i + 1

# Define an event handler to save the uploaded file, update the progress bar, and return the file path
def on_upload_change(change):
    global file_path
    for filename, file_info in uploader.value.items():
        # Show the progress bar
        display(progress)

        # Simulate file processing time (for demonstration)
        simulate_progress(progress)

        # Save the uploaded file to /content (the working directory in Google Colab)
        file_path = os.path.join('/content', filename)
        with open(file_path, 'wb') as f:
            f.write(file_info['content'])

        # Update the progress bar to 100% once the file is saved
        progress.value = 100

        # Notify that the upload is complete
        print(f'File {filename} uploaded and saved to {file_path}')

        # Return the file path
        return file_path

# Attach the event handler to the uploader
uploader.observe(on_upload_change, names='value')

# Display the uploader widget
display(center_box)


HBox(children=(FileUpload(value={}, accept='.pdf', description='Upload', layout=Layout(height='50px', width='2…

IntProgress(value=0, description='Uploading:')

File ast_sci_data_tables_sample-1.pdf uploaded and saved to /content/ast_sci_data_tables_sample-1.pdf
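
When the notebook is run locally and `file_path` is set by hand instead of through the upload widget, a quick sanity check can catch wrong paths early. The sketch below assumes that a valid PDF starts with the `%PDF-` signature; the helper name `is_pdf_file` is hypothetical, and the two throwaway files exist only for the demonstration.

```python
import os
import tempfile


def is_pdf_file(path: str) -> bool:
    """Return True when the path exists and begins with the PDF signature."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demonstration with two temporary files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "sample.pdf")
    bad = os.path.join(tmp, "notes.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"plain text")
    pdf_ok = is_pdf_file(good)
    txt_ok = is_pdf_file(bad)

print(pdf_ok, txt_ok)  # True False
```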


In [None]:
# @title SETTING UP PARAMETERS
#Setting an empty list for pages
pages = list()
#Setting an empty list for text formats
format_list = list()
#Creating an empty list for rows of tables
all_rows = list()
#Creating an empty list for tables
tables = list()
#Setting up an empty array for table bounding boxes
table_bounding_boxes = []
#Setting an empty list for pdfplumber pages
pdfplumberpages = list()
#List of tokenized text fields
tokenized_data = list()
#List for text chunks
text_chunks = list()
#Max chunk size
max_length = 512 #@param {type: "integer"}
#Setting the model selector parameter
model_type = "Turkish" #@param ["English", "Turkish"]
#Setting up Turkish and English tokenizer models and embedders
if model_type == "English":
  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  embedder = SentenceTransformer('bert-base-nli-mean-tokens')
else:
  tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
  embedder = SentenceTransformer('dbmdz/distilbert-base-turkish-cased')
#Adding the characters that should be preserved during ASCII character cleaning
avoid_letters = "çğıİöşüÇĞİÖŞÜ" #@param {type: "string"}




In [None]:
# @title EXTRACTING PAGE INFO
#Setting up a pdf variable for pdf plumber
pdf = pdfplumber.open(file_path)
#Defining a doc parameter for PyMuPDF
doc = fitz.open(file_path)
#Iterating over each page and appending them to pages list
for pagenum, page in enumerate(extract_pages(file_path)):
  pages.append(page)
#Extracting page objects for pdfplumber
for page in pdf.pages:
  pdfplumberpages.append(page)
#Printing pages
print(pages)

[<LTPage(1) 0.000,0.000,612.000,792.000 rotate=0>, <LTPage(2) 0.000,0.000,612.000,792.000 rotate=0>]


**DEFINING REQUIRED METHODS**

In [None]:
# @title *Extracting Table Bounding Boxes*
for page in pdfplumberpages:
    # Finding tables
    avoid_tables = page.find_tables()
    # Extract bounding boxes of tables
    table_bounding_boxes.extend([table.bbox for table in avoid_tables])
#Printing the bounding boxes
print(table_bounding_boxes)

[(105.5, 379.5, 379.5, 469.5), (54.53125, 193.33333333333334, 558.54779, 473.5), (123.5, 588.5, 397.5, 696.69118)]


In [None]:
# @title *Extracting Text Without Table Bounding Boxes (Using PymuPDF)*
with fitz.open(file_path) as doc:
    for page_num, page in enumerate(doc):
        # Extract text blocks
        blocks = page.get_text("dict")["blocks"]
        # Iterate over blocks
        for block in blocks:
            # Check if the block is a text block
            if block["type"] == 0:  # Text block
                # Iterate over lines in the block
                for line in block["lines"]:
                    # Iterate over spans in the line
                    for span in line["spans"]:
                        #Find the bounding box of the span
                        bbox = tuple(span["bbox"])
                        # Check if the span is inside any table bounding box
                        is_in_table = any(
                            bbox[0] >= table_bbox[0]
                            and bbox[1] >= table_bbox[1]
                            and bbox[2] <= table_bbox[2]
                            and bbox[3] <= table_bbox[3]
                            for table_bbox in table_bounding_boxes
                        )
                        #If the span is outside every table and is not empty, append it to text_chunks
                        if not is_in_table and span["text"].strip():
                            #Reading the font name of the span
                            font_name = span["font"]
                            #Setting a flag for boldness based on the font name
                            is_bold = "bold" in font_name.lower()
                            #Appending data to text_chunks list
                            text_chunks.append({
                                "font_name": span["font"],
                                "text": span["text"].encode("utf-8").decode("utf-8"),
                                "font_size": span["size"],
                                "font_weight": "bold" if is_bold else "normal",
                                "bbox": bbox,
                                "color": span["color"],
                                "is_title": span["text"].istitle()
                            })

for chunk in text_chunks:
    print(chunk["text"] + " " + chunk["font_weight"])


Tutoring to Enhance Science Skills bold
Tutoring Two: normal
Learning to Make Data Tables normal
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . normal
Sample Data for Data Tables bold

  
	 bold
NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING bold
www.sedl.org/afterschool/toolkits normal
Use these data to create data tables following the Guidelines for Making a Data Table and  normal
Checklist for a Data Table. normal
Example 1: Pet Survey (GR 2–3) bold
Ms. Hubert’s afterschool students took a survey of the 600 students at Morales Elementary  normal
School. Students were asked to select their favorite pet from a list of eight animals. Here  normal
are the results.  normal
Example 2: Electromagnets—Increasing Coils (GR 3–5) bold
The following data were collected using an electromagnet with a 1.5 volt battery, a switch, 
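
The containment test used in the cell above can be written as a small helper. The sketch below adds an optional tolerance, because a span whose box touches or slightly crosses a table border would otherwise be treated as ordinary text. The `tol` parameter and the function names are assumptions for illustration; the sample table box is the first one printed earlier.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def is_inside(inner: BBox, outer: BBox, tol: float = 0.0) -> bool:
    """Return True when inner lies within outer, allowing tol points of slack."""
    return (
        inner[0] >= outer[0] - tol
        and inner[1] >= outer[1] - tol
        and inner[2] <= outer[2] + tol
        and inner[3] <= outer[3] + tol
    )


def in_any_table(span_bbox: BBox, table_bboxes: Iterable[BBox],
                 tol: float = 0.0) -> bool:
    """Return True when the span falls inside at least one table box."""
    return any(is_inside(span_bbox, t, tol) for t in table_bboxes)


sample_tables = [(105.5, 379.5, 379.5, 469.5)]
print(in_any_table((110.0, 380.0, 200.0, 395.0), sample_tables))           # True
print(in_any_table((105.0, 380.0, 200.0, 395.0), sample_tables))           # False
print(in_any_table((105.0, 380.0, 200.0, 395.0), sample_tables, tol=1.0))  # True
```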

In [None]:
# @title  { display-mode: "both" }
#@markdown Cleaning Text
def clean_text(text):
    # Keeping printable ASCII characters and the letters listed in avoid_letters
    cleaned_text = re.sub(r'[^\x20-\x7E' + re.escape(avoid_letters) + ']', '', text)
    return cleaned_text

# Iterating over each text chunk
for chunk in text_chunks:
    chunk["text"] = clean_text(chunk["text"])


In [None]:
# @title *Classifying Chunks*
#List of classified chunks
classified_chunks = []
#A temporary list for the current paragraph
current_paragraph = []
#Bottom coordinate of the previous text span
last_bottom = None
#In paragraph flag
in_paragraph = False
#Integer RGB value of pure red (0xFF0000)
red_color_value = 16711680
for chunk in text_chunks:
    #Extracting the chunk data
    text = chunk["text"].strip()
    font_size = chunk["font_size"]
    font_weight = chunk["font_weight"]
    bbox = chunk["bbox"]
    color = chunk["color"]
    is_title = chunk["is_title"]

    # Identify links
    if re.match(r'^https?://|^www\.', text):
        classified_chunks.append({
            "type": "link",
            "text": text
        })
        continue

    # Identify titles: bold spans are treated as titles (add a font-size threshold here if needed)
    if font_weight == "bold":
        if in_paragraph:
            # End the previous paragraph
            classified_chunks.append({
                "type": "paragraph",
                "text": ' '.join(current_paragraph)
            })
            current_paragraph = []
            in_paragraph = False
        classified_chunks.append({
            "type": "title",
            "text": text
        })
    else:
        # Determine if a new paragraph starts by checking vertical spacing
        if last_bottom is not None and (bbox[1] - last_bottom) > font_size * 0.5:
            if in_paragraph:
                classified_chunks.append({
                    "type": "paragraph",
                    "text": ' '.join(current_paragraph)
                })
            current_paragraph = []
            in_paragraph = True

        # Add current text to the paragraph
        current_paragraph.append(text)
        last_bottom = bbox[3]

# Append the last paragraph if it exists
if in_paragraph and current_paragraph:
    classified_chunks.append({
        "type": "paragraph",
        "text": ' '.join(current_paragraph)
    })

print(classified_chunks)

[{'type': 'title', 'text': 'Tutoring to Enhance Science Skills'}, {'type': 'title', 'text': 'Sample Data for Data Tables'}, {'type': 'title', 'text': ''}, {'type': 'title', 'text': 'NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING'}, {'type': 'link', 'text': 'www.sedl.org/afterschool/toolkits'}, {'type': 'paragraph', 'text': 'Use these data to create data tables following the Guidelines for Making a Data Table and Checklist for a Data Table.'}, {'type': 'title', 'text': 'Example 1: Pet Survey (GR 23)'}, {'type': 'paragraph', 'text': 'Ms. Huberts afterschool students took a survey of the 600 students at Morales Elementary School. Students were asked to select their favorite pet from a list of eight animals. Here are the results.'}, {'type': 'title', 'text': 'Example 2: ElectromagnetsIncreasing Coils (GR 35)'}, {'type': 'paragraph', 'text': 'The following data were collected using an electromagnet with a 1.5 volt battery, a switch, a piece of #20 insulated wire, and a nail. Three tr
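
The output above contains "GR 23" and "Huberts" because the cleaning step removes the en dash and the typographic apostrophe along with other non-ASCII characters. One possible refinement, sketched below, maps common typographic punctuation to ASCII equivalents before the filter runs. The mapping table and the function name `clean_text_keep_punctuation` are assumptions and are not part of the notebook.

```python
import re

# Typographic punctuation mapped to ASCII equivalents (illustrative selection)
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single quotation marks
    "\u201c": '"', "\u201d": '"',   # double quotation marks
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
})


def clean_text_keep_punctuation(text: str,
                                avoid_letters: str = "çğıİöşüÇĞİÖŞÜ") -> str:
    """Normalize punctuation, then drop everything outside the allowed set."""
    normalized = text.translate(PUNCTUATION_MAP)
    pattern = r"[^\x20-\x7E" + re.escape(avoid_letters) + "]"
    return re.sub(pattern, "", normalized)


print(clean_text_keep_punctuation("Example 1: Pet Survey (GR 2\u20133)"))
# Example 1: Pet Survey (GR 2-3)
print(clean_text_keep_punctuation("Ms. Hubert\u2019s afterschool students"))
# Ms. Hubert's afterschool students
```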

**EXTRACTING TABLES**

In [None]:
# @title *Extracting Table Rows*
#Iterating each pdfplumber page
for page in pdf.pages:
  #Extracting tables from each page
  page_tables = page.extract_tables()
  #Appending the tables to tables list
  tables.extend(page_tables)
#Printing tables
print(tables)

[[['Number of Coils', 'Number of Paperclips'], ['5', '3, 5, 4'], ['10', '7, 8, 6'], ['15', '11, 10, 12'], ['20', '15, 13, 14']], [['Speed (mph)', 'Driver', 'Car', 'Engine Date', None], ['407.447', 'Craig Breedlove', 'Spirit of America', 'GE J47', '8/5/63'], ['413.199', 'Tom Green', 'Wingfoot Express', 'WE J46', '10/2/64'], ['434.22', 'Art Arfons', 'Green Monster', 'GE J79', '10/5/64'], ['468.719', 'Craig Breedlove', 'Spirit of America', 'GE J79', '10/13/64'], ['526.277', 'Craig Breedlove', 'Spirit of America', 'GE J79', '10/15/65'], ['536.712', 'Art Arfons', 'Green Monster', 'GE J79', '10/27/65'], ['555.127', 'Craig Breedlove', 'Spirit of America, Sonic 1', 'GE J79', '11/2/65'], ['576.553', 'Art Arfons', 'Green Monster', 'GE J79', '11/7/65'], ['600.601', 'Craig Breedlove', 'Spirit of America, Sonic 1', 'GE J79', '11/15/65'], ['622.407', 'Gary Gabelich', 'Blue Flame', 'Rocket', '10/23/70'], ['633.468', 'Richard Noble', 'Thrust 2', 'RR RG 146', '10/4/83'], ['763.035', 'Andy Green', 'Thru

In [None]:
# @title *Null Check for Tables*
# Repairing cells that were extracted as None (for example, a merged column header)
for table in tables:
    #Iterating each row
    for row in table:
        #Iterating each row element
        for i, row_element in enumerate(row):
            #Checking if the row element is None; the first cell has no left neighbor to split
            if i > 0 and (row_element is None or row_element.lower() == "none"):
                previous_row_element = row[i - 1]
                #Splitting the left neighbor into words
                split_titles = (previous_row_element or "").split()
                #Skipping the repair when the left neighbor is empty
                if not split_titles:
                    continue
                row[i - 1] = split_titles[0]
                row[i] = ' '.join(split_titles[1:])

In [None]:
# @title *Creating a List of Rows*
#Iterating each table
for table in tables:
  #Creating an empty list for rows
  rows = list()
  #Iterating each row
  for row in table:
    #Checking that the row is not None and has more than one cell
    if row is not None and len(row) > 1:
      rows.append(row)
  #Appending the rows to all_rows list
  all_rows.append(rows)
#Printing all rows
print(all_rows)

[[['Number of Coils', 'Number of Paperclips'], ['5', '3, 5, 4'], ['10', '7, 8, 6'], ['15', '11, 10, 12'], ['20', '15, 13, 14']], [['Speed (mph)', 'Driver', 'Car', 'Engine', 'Date'], ['407.447', 'Craig Breedlove', 'Spirit of America', 'GE J47', '8/5/63'], ['413.199', 'Tom Green', 'Wingfoot Express', 'WE J46', '10/2/64'], ['434.22', 'Art Arfons', 'Green Monster', 'GE J79', '10/5/64'], ['468.719', 'Craig Breedlove', 'Spirit of America', 'GE J79', '10/13/64'], ['526.277', 'Craig Breedlove', 'Spirit of America', 'GE J79', '10/15/65'], ['536.712', 'Art Arfons', 'Green Monster', 'GE J79', '10/27/65'], ['555.127', 'Craig Breedlove', 'Spirit of America, Sonic 1', 'GE J79', '11/2/65'], ['576.553', 'Art Arfons', 'Green Monster', 'GE J79', '11/7/65'], ['600.601', 'Craig Breedlove', 'Spirit of America, Sonic 1', 'GE J79', '11/15/65'], ['622.407', 'Gary Gabelich', 'Blue Flame', 'Rocket', '10/23/70'], ['633.468', 'Richard Noble', 'Thrust 2', 'RR RG 146', '10/4/83'], ['763.035', 'Andy Green', 'Thrust 
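
The header repair visible in the output above, where the pair `'Engine Date', None` became `'Engine', 'Date'`, can be sketched as a standalone function that returns a new row instead of modifying the table in place. The function name `repair_none_cells` is hypothetical, and the heuristic assumes that the first word belongs to the left column and the remaining words to the empty one.

```python
from typing import List, Optional


def repair_none_cells(row: List[Optional[str]]) -> List[Optional[str]]:
    """Split the left neighbor of a None cell across both cells."""
    repaired = list(row)
    for i, cell in enumerate(repaired):
        # The first cell has no left neighbor, so it cannot be repaired
        if cell is None and i > 0 and repaired[i - 1]:
            parts = repaired[i - 1].split()
            if len(parts) > 1:
                repaired[i - 1] = parts[0]
                repaired[i] = " ".join(parts[1:])
    return repaired


header = ['Speed (mph)', 'Driver', 'Car', 'Engine Date', None]
print(repair_none_cells(header))
# ['Speed (mph)', 'Driver', 'Car', 'Engine', 'Date']
```

This heuristic fails for headers whose left column name has more than one word, which is one reason why the document guidelines ask for an explicit header on every column.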

In [None]:
# @title *Converting to Dataframe*
# Creating DataFrames for each table
dataframes = list()
for table in tables:
    if table:  # Check if the table is not empty
        # Assume the first row is the header
        df = pd.DataFrame(table[1:], columns=table[0])
        df.index = [f'row_{i+1}' for i in range(len(df))]
        dataframes.append(df)
#Printing dataframes
for df in dataframes:
  display(df)
  print("\n")

Unnamed: 0,Number of Coils,Number of Paperclips
row_1,5,"3, 5, 4"
row_2,10,"7, 8, 6"
row_3,15,"11, 10, 12"
row_4,20,"15, 13, 14"






Unnamed: 0,Speed (mph),Driver,Car,Engine,Date
row_1,407.447,Craig Breedlove,Spirit of America,GE J47,8/5/63
row_2,413.199,Tom Green,Wingfoot Express,WE J46,10/2/64
row_3,434.22,Art Arfons,Green Monster,GE J79,10/5/64
row_4,468.719,Craig Breedlove,Spirit of America,GE J79,10/13/64
row_5,526.277,Craig Breedlove,Spirit of America,GE J79,10/15/65
row_6,536.712,Art Arfons,Green Monster,GE J79,10/27/65
row_7,555.127,Craig Breedlove,"Spirit of America, Sonic 1",GE J79,11/2/65
row_8,576.553,Art Arfons,Green Monster,GE J79,11/7/65
row_9,600.601,Craig Breedlove,"Spirit of America, Sonic 1",GE J79,11/15/65
row_10,622.407,Gary Gabelich,Blue Flame,Rocket,10/23/70






Unnamed: 0,Time (drops of water),Distance (cm)
row_1,1,10119
row_2,2,"29, 31, 30"
row_3,3,"59, 58, 61"
row_4,4,"102, 100, 98"
row_5,5,"122, 125, 127"
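
For chunking, it can help to turn every table row into a short text that carries its column names, so that a row remains understandable on its own. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the sample rows are taken from the first table above.

```python
from typing import Dict, List


def table_to_records(table: List[List[str]]) -> List[Dict[str, str]]:
    """Pair every body row with the header row (assumed to be the first row)."""
    if not table:
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


def record_to_text(record: Dict[str, str]) -> str:
    """Render one row as 'column: value' pairs."""
    return "; ".join(f"{column}: {value}" for column, value in record.items())


sample_table = [
    ['Number of Coils', 'Number of Paperclips'],
    ['5', '3, 5, 4'],
    ['10', '7, 8, 6'],
]
records = table_to_records(sample_table)
print(record_to_text(records[0]))
# Number of Coils: 5; Number of Paperclips: 3, 5, 4
```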






In [None]:
# @title EXTRACTING IMAGE DATA
output_folder = "content/output_folder"
# Create the output folder if it does not exist
os.makedirs(output_folder, exist_ok=True)

def preprocess_image(image):
    # Convert to grayscale
    image = image.convert('L')
    # Enhance contrast
    enhancer = ImageEnhance.Contrast(image)
    image = enhancer.enhance(2)
    # Apply a slight blur to reduce noise
    image = image.filter(ImageFilter.MedianFilter())
    return image

def extract_images_from_pdf(file_path, output_folder):
    images_data = []  # List to store image byte arrays and metadata

    # Open the PDF file
    with fitz.open(file_path) as doc:
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            # Extract images
            image_list = page.get_images(full=True)
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = doc.extract_image(xref)
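                image_bytes = base_image["image"]
                # Use the image's own format as the file extension instead of assuming PNG
                image_ext = base_image["ext"]
                image_filename = f"{output_folder}/page_{page_num+1}_img_{img_index+1}.{image_ext}"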

                # Save the image to file
                with open(image_filename, "wb") as img_file:
                    img_file.write(image_bytes)
                print(f"Saved image {image_filename}")

                # Load image with PIL
                image = Image.open(io.BytesIO(image_bytes))

                # Preprocess the image
                image = preprocess_image(image)

                # Extract text from image using OCR
                extracted_text = pytesseract.image_to_string(image, lang='eng')

                # Store image as byte array and extracted text
                images_data.append({
                    "filename": image_filename,
                    "byte_array": image_bytes,
                    "extracted_text": extracted_text
                })

    return images_data

# Run the extraction function and get the images data
images_data = extract_images_from_pdf(file_path, output_folder)

# Optionally, you can work with the byte arrays and extracted text here
for img_data in images_data:
    print(f"Image file: {img_data['filename']}")
    print(f"Image byte array length: {len(img_data['byte_array'])}")
    print(f"Extracted text: {img_data['extracted_text']}")

In [None]:
# @title CONVERTING TO JSON - WITHOUT TOKENIZING
import base64
#Converting texts to JSON format
json_data_text = json.dumps(classified_chunks, indent=2, ensure_ascii=False)
#Converting tables to JSON format
json_data_tables = list()
for idx, df in enumerate(dataframes):
    json_output = df.to_json(orient='index', indent=4)
    # Create a tagged dictionary
    tagged_json = {
        f"Table{idx + 1}": json.loads(json_output)
    }
    json_data_tables.append(tagged_json)
formatted_json = json.dumps(json_data_tables, indent=4)
#Raw bytes are not JSON serializable, so the image bytes are encoded as Base64 text
serializable_images = [
    {**img, "byte_array": base64.b64encode(img["byte_array"]).decode("ascii")}
    for img in images_data
]
#Merging data
all_data = list()
all_data.append(classified_chunks)
all_data.append(json_data_tables)
all_data.append(serializable_images)
json_data_all = json.dumps(all_data, indent=4)
#Printing the JSON data
print(json_data_all)

[
    [
        {
            "type": "title",
            "text": "Tutoring to Enhance Science Skills"
        },
        {
            "type": "title",
            "text": "Sample Data for Data Tables"
        },
        {
            "type": "title",
            "text": ""
        },
        {
            "type": "title",
            "text": "NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING"
        },
        {
            "type": "link",
            "text": "www.sedl.org/afterschool/toolkits"
        },
        {
            "type": "paragraph",
            "text": "Use these data to create data tables following the Guidelines for Making a Data Table and Checklist for a Data Table."
        },
        {
            "type": "title",
            "text": "Example 1: Pet Survey (GR 23)"
        },
        {
            "type": "paragraph",
            "text": "Ms. Huberts afterschool students took a survey of the 600 students at Morales Elementary School. Students were asked to s

In [None]:
# @title  { vertical-output: true }
#@markdown TOKENIZING
#Creating an empty list for tokenized data
tokenized_data = []
#Iterating each classified chunk
for item in classified_chunks:
  #Extracting the chunk data
  text_type = item["type"]
  text = item["text"]
  #Tokenizing the text
  tokenized_text = tokenizer(
      text,
      truncation = True,
      padding = "max_length",
      max_length = max_length,
      return_tensors = "pt",
  )

  #Appending the tokenized data with the text type
  tokenized_data.append({
      "type": text_type,
      "input_ids": tokenized_text["input_ids"].tolist()[0],
      "attention_mask": tokenized_text["attention_mask"].tolist()[0],
      "tokens": tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"].tolist()[0])
  })

#Iterating each tagged table
for idx, tagged_table in enumerate(json_data_tables):
  table = tagged_table[f"Table{idx + 1}"]
  #For each row in table
  for row_name, row_values in table.items():
    #Joining the cells as "column: value" pairs so that the row content, not the row label, is tokenized
    row_text = ", ".join(f"{column}: {value}" for column, value in row_values.items())
    #Tokenizing the text
    tokenized_text = tokenizer(
        row_text,
        truncation = True,
        padding = "max_length",
        max_length = max_length,
        return_tensors = "pt",
    )

    #Appending one entry per table row
    tokenized_data.append({
      "type": "table",
      "input_ids": tokenized_text["input_ids"].tolist()[0],
      "attention_mask": tokenized_text["attention_mask"].tolist()[0],
      "tokens": tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"].tolist()[0])
    })

for item in images_data:
  #Only the OCR text is tokenized; the raw image bytes are not text
  extracted_text = item["extracted_text"]
  #Tokenizing the text
  tokenized_text = tokenizer(
      extracted_text,
      truncation = True,
      return_tensors = "pt",
  )

  #Appending the tokenized data with the text type
  tokenized_data.append({
      "type": "image",
      "input_ids": tokenized_text["input_ids"].tolist()[0],
      "attention_mask": tokenized_text["attention_mask"].tolist()[0],
      "tokens": tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"].tolist()[0])
  })

In [None]:
# @title *Sample Token Output*
# Print tokenized and chunked data
for i, item in enumerate(tokenized_data):
    print(f"Chunk {i + 1} - Type: {item['type']}")
    print(f"Input IDs: {item['input_ids']}")
    print(f"Attention Mask: {item['attention_mask']}\n")
    print(f"Tokens: {item['tokens']}\n")
#Saving the tokenized data to a JSON file
with open("tokenized_text.json", "w") as tokenized_file:
    json.dump(tokenized_data, tokenized_file, indent=4)

Chunk 1 - Type: title
Input IDs: [2, 8633, 6968, 29561, 6864, 2654, 2562, 2090, 26147, 11190, 18613, 8485, 1022, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
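
Most of the values in the output above are padding: with `padding="max_length"`, every chunk is filled with zeros up to `max_length` (512 here). The attention mask marks which positions hold real tokens. The sketch below shows how the padding could be removed again; the helper name `strip_padding` is hypothetical, and the mask is reconstructed under the assumption that it contains 1 for each real token and 0 for each padded position.

```python
from typing import List


def strip_padding(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Keep only the token ids whose attention mask entry is 1."""
    return [tid for tid, mask in zip(input_ids, attention_mask) if mask == 1]


# The 14 real token ids of "Chunk 1" from the output above, padded to 512
real_ids = [2, 8633, 6968, 29561, 6864, 2654, 2562, 2090,
            26147, 11190, 18613, 8485, 1022, 3]
ids = real_ids + [0] * (512 - len(real_ids))
mask = [1] * len(real_ids) + [0] * (512 - len(real_ids))  # assumed mask

unpadded = strip_padding(ids, mask)
print(len(ids), len(unpadded))  # 512 14
```

Storing only the unpadded ids keeps the saved JSON file much smaller; padding can be applied again when batches are built for a model.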

In [None]:
# @title OUTPUT OF TOKENIZED CHUNKS
from transformers import AutoTokenizer

# Initialize the tokenizer (note: this replaces the tokenizer selected through model_type)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define maximum tokens per chunk
max_tokens_per_chunk = 256  # Adjust based on your needs

# Function to chunk text by tokens
def chunk_paragraph_by_tokens(text, tokenizer, max_tokens):
    words = text.split()
    current_chunk = []
    chunks = []

    for word in words:
        current_chunk.append(word)
        # Check token length of the current chunk
        if len(tokenizer.tokenize(' '.join(current_chunk))) > max_tokens:
            # Remove last word and finalize chunk
            current_chunk.pop()
            # Skip the empty chunk left when the very first word exceeds max_tokens
            if current_chunk:
                chunks.append(' '.join(current_chunk))
            current_chunk = [word]  # Start new chunk with the current word

    # Append last chunk if it has words
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Chunked data output
chunked_paragraphs = []

# Process each item in JSON data
for item in classified_chunks:
    text_type = item["type"]
    text = item["text"]

    if text_type == "paragraph":
        # Chunk paragraph by tokens
        text_chunks = chunk_paragraph_by_tokens(text, tokenizer, max_tokens_per_chunk)

        # Append each chunk as a separate entry
        for chunk in text_chunks:
            chunked_paragraphs.append({
                "type": text_type,
                "text": chunk
            })
    else:
        # For titles and links, add them directly
        chunked_paragraphs.append(item)

# Print the chunked paragraphs
for item in chunked_paragraphs:
    print(f"Type: {item['type']}")
    print(f"Text: {item['text']}\n")
    print("="*40 + "\n")


Type: title
Text: Tutoring to Enhance Science Skills


Type: title
Text: Sample Data for Data Tables


Type: title
Text: 


Type: title
Text: NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING


Type: link
Text: www.sedl.org/afterschool/toolkits


Type: paragraph
Text: Use these data to create data tables following the Guidelines for Making a Data Table and Checklist for a Data Table.


Type: title
Text: Example 1: Pet Survey (GR 23)


Type: paragraph
Text: Ms. Huberts afterschool students took a survey of the 600 students at Morales Elementary School. Students were asked to select their favorite pet from a list of eight animals. Here are the results.


Type: title
Text: Example 2: ElectromagnetsIncreasing Coils (GR 35)


Type: paragraph
Text: The following data were collected using an electromagnet with a 1.5 volt battery, a switch, a piece of #20 insulated wire, and a nail. Three trials were run. repeating this experiment include using safety goggles or safety spectacles and avoid
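
The chunking function above re-tokenizes the whole growing chunk for every word. A minimal sketch of an incremental variant follows, which tokenizes each word once and keeps a running count. It assumes that per-word token counts add up to the count for the joined text, which should be checked for the tokenizer in use. `WhitespaceTokenizer` and `chunk_by_tokens_incremental` are hypothetical names used only for illustration. The stand-in tokenizer lets the example run without downloading a model.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    def tokenize(self, text):
        return text.split()


def chunk_by_tokens_incremental(text, tokenizer, max_tokens):
    """Greedy word-level chunking that tokenizes each word only once."""
    chunks, current, count = [], [], 0
    for word in text.split():
        n = len(tokenizer.tokenize(word))
        # Close the current chunk when this word would push it past the limit
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(word)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = chunk_by_tokens_incremental("a b c d e f g", WhitespaceTokenizer(), 3)
print(demo)  # ['a b c', 'd e f', 'g']
```

Any object with a `tokenize` method, such as the notebook's `tokenizer`, can be passed in place of the stand-in.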

In [None]:
# @title OUTPUT OF TOKENIZED CHUNKS

def chunk_paragraph_by_tokens(text, tokenizer, max_tokens):
    words = text.split()
    current_chunk = []
    chunks = []

    for word in words:
        current_chunk.append(word)
        # Check token length of the current chunk
        if len(tokenizer.tokenize(' '.join(current_chunk))) > max_tokens:
            # Remove last word and finalize chunk
            current_chunk.pop()
            # Skip the empty chunk left when the very first word exceeds max_tokens
            if current_chunk:
                chunks.append(' '.join(current_chunk))
            current_chunk = [word]  # Start new chunk with the current word

    # Append last chunk if it has words
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

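# Function to chunk tables by rows (simplified example).
# Relies on the global `tokenizer` and on `json` imported earlier in the notebook.
def chunk_table_by_rows(table, max_tokens):
    chunks = []
    current_chunk = []
    token_count = 0

    for key, value in table.items():
        row = {key: value}
        row_tokens = len(tokenizer.tokenize(json.dumps(row)))
        # Close the current chunk when adding this row would exceed the limit;
        # the current_chunk check avoids emitting an empty chunk for an oversized first row
        if current_chunk and token_count + row_tokens > max_tokens:
            chunks.append(current_chunk)
            current_chunk = [row]
            token_count = row_tokens
        else:
            current_chunk.append(row)
            token_count += row_tokens

    if current_chunk:
        chunks.append(current_chunk)

    return chunks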

# Function to handle images (placeholder)
def process_image(image_data):
    return {"type": "image", "data": image_data}

# Chunked data output
chunked_data = []

# Process each item in JSON data
for item in classified_chunks:
    text_type = item["type"]

    if text_type == "paragraph":
        # Chunk paragraph by tokens
        text_chunks = chunk_paragraph_by_tokens(item["text"], tokenizer, max_tokens_per_chunk)

        # Append each chunk as a separate entry
        for chunk in text_chunks:
            chunked_data.append({
                "type": text_type,
                "text": chunk
            })

    elif text_type.startswith("Table"):
        # Chunk table by rows
        table_chunks = chunk_table_by_rows(item[text_type], max_tokens_per_chunk)

        # Append each chunk as a separate table entry
        for i, chunk in enumerate(table_chunks):
            chunked_data.append({
                "type": f"{text_type}_chunk_{i + 1}",
                "table": chunk
            })

    elif text_type == "image":
        # Process image (if any specific processing is required)
        image_output = process_image(item["data"])
        chunked_data.append(image_output)

    else:
        # For other types like titles and links, add them directly
        chunked_data.append(item)

# Print the chunked data
for item in chunked_data:
    print(json.dumps(item, indent=2, ensure_ascii=False))
    print("=" * 40 + "\n")


{
  "type": "title",
  "text": "Tutoring to Enhance Science Skills"
}

{
  "type": "title",
  "text": "Sample Data for Data Tables"
}

{
  "type": "title",
  "text": ""
}

{
  "type": "title",
  "text": "NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING"
}

{
  "type": "link",
  "text": "www.sedl.org/afterschool/toolkits"
}

{
  "type": "paragraph",
  "text": "Use these data to create data tables following the Guidelines for Making a Data Table and Checklist for a Data Table."
}

{
  "type": "title",
  "text": "Example 1: Pet Survey (GR 23)"
}

{
  "type": "paragraph",
  "text": "Ms. Huberts afterschool students took a survey of the 600 students at Morales Elementary School. Students were asked to select their favorite pet from a list of eight animals. Here are the results."
}

{
  "type": "title",
  "text": "Example 2: ElectromagnetsIncreasing Coils (GR 35)"
}

{
  "type": "paragraph",
  "text": "The following data were collected using an electromagnet with a 1.5 volt battery, a s

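The sample output above contains no table entries, so the table branch is not exercised there. A minimal sketch of row-wise table chunking on made-up data follows, with the tokenizer passed as an explicit argument. The table contents, the `"Table 1"` type label, `WhitespaceTokenizer`, and `chunk_rows` are all hypothetical and used only for illustration.

```python
import json


class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated piece."""

    def tokenize(self, text):
        return text.split()


def chunk_rows(table, tokenizer, max_tokens):
    """Group {key: value} rows into chunks that stay within max_tokens."""
    chunks, current, count = [], [], 0
    for key, value in table.items():
        row = {key: value}
        n = len(tokenizer.tokenize(json.dumps(row)))
        # Start a new chunk when this row would exceed the budget
        if current and count + n > max_tokens:
            chunks.append(current)
            current, count = [], 0
        current.append(row)
        count += n
    if current:
        chunks.append(current)
    return chunks


# Made-up table in the key -> value shape the loop above iterates over
table = {"Dog": 120, "Cat": 95, "Fish": 40}
table_chunks = chunk_rows(table, WhitespaceTokenizer(), max_tokens=4)

entries = [
    {"type": f"Table 1_chunk_{i + 1}", "table": chunk}
    for i, chunk in enumerate(table_chunks)
]
print(json.dumps(entries, indent=2, ensure_ascii=False))
```

Each serialized row counts as two tokens with the stand-in tokenizer, so a budget of four tokens yields two rows in the first chunk and one in the second.
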
In [None]:
# @title EMBEDDING
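# Generate embeddings for each chunk; pass the text field rather than the whole dict
chunk_texts = [item["text"] for item in chunked_paragraphs]
chunk_embeddings = embedder.encode(chunk_texts)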
# Combine embeddings (e.g., by averaging)
combined_embedding = np.mean(chunk_embeddings, axis=0)
# Print the combined embedding
print(combined_embedding)

[-1.07453719e-01 -2.48700693e-01  1.87515044e+00  5.95990539e-01
  3.16657454e-01  2.59363890e-01 -3.81187260e-01  7.04434335e-01
  3.33562717e-02  1.40162721e-01 -7.06893682e-01  4.42370288e-02
  2.78677583e-01  7.30859697e-01  1.02466393e+00 -2.43157551e-01
 -8.84924769e-01  1.43513441e-01  2.93354630e-01 -1.13113332e+00
  2.29927953e-02  6.19321644e-01  8.00761059e-02 -8.86165917e-01
  1.29831284e-01 -8.88936996e-01  2.26764649e-01 -2.18836927e+00
 -7.58061349e-01 -1.13709413e-01 -1.72173098e-01 -6.25355721e-01
  6.75956368e-01  3.58773023e-01  3.32598925e-01 -7.07706437e-02
 -4.01967525e-01  7.61587694e-02  8.57401490e-02 -3.45007956e-01
  1.43000925e+00  1.66671678e-01  9.84617174e-01  1.45605698e-01
 -3.67988735e-01 -2.04998106e-02 -2.25178286e-01 -1.60633162e-01
 -2.65985906e-01 -1.11736238e+00 -1.16511619e+00 -5.04458904e-01
  7.62587726e-01  5.95828295e-01 -2.82601446e-01  2.78816789e-01
  5.33357143e-01 -2.97096103e-01  3.21188718e-01  5.92841685e-01
  7.96351731e-01 -7.53566