# 📄 Text Extraction from PDFs

This notebook automates the extraction of textual content from PDF files.  
The main steps include:

- Filtering and organizing input files.
- Processing each document to extract its textual data.
- Applying basic preprocessing and cleaning techniques.
- Saving the extracted and cleaned text for future analysis.

The goal is to create a structured and organized dataset of texts extracted from technical or academic PDF documents.

> 🛠️ This workflow is particularly useful for preparing data for Natural Language Processing (NLP) tasks, document classification, and information retrieval projects.
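
The "basic preprocessing and cleaning" step can be as small as whitespace normalization. The sketch below is one possible version, not the cleaning this notebook performs later; the function name `clean_text` and the three rules it applies are assumptions chosen for illustration.

```python
import re

def clean_text(text):
    """Normalize whitespace in extracted text without touching its wording."""
    # Unify Windows and old Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop spaces and tabs at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim the ends and finish with exactly one newline
    return text.strip() + "\n"
```

Stripping trailing spaces also removes Markdown hard line breaks, which is usually acceptable when the text is headed for NLP tooling rather than rendering.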


In [None]:
# install marker-pdf
%pip install marker-pdf

In [2]:
import os
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

input_folder = "pdfs_developers"
output_folder = "Processed_Files_Developers"

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Initialize the PdfConverter with the model dictionary
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)

file_names = os.listdir(input_folder)
file_names



Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


['Copia di GSD-Spec_2122_V51_Jul08.pdf',
 'Copia di iec61158-1{ed1.0}b.pdf',
 'Copia di iec61158-2{ed6.0}b.pdf',
 'Copia di iec61158-4-3{ed3.0}b.pdf',
 'Copia di iec61158-5-3{ed3.0}b.pdf',
 'Copia di iec61158-6-3{ed3.0}b.pdf',
 'Copia di iec61784-1{ed4.0}b.pdf',
 'Copia di iec61784-3-3{ed2.0}en.pdf',
 'Copia di iec61784-5-3{ed3.0}b.pdf',
 'Copia di PROFIdrive_3172_V42_Oct15.pdf',
 'Copia di Profile-PA-Devices_3042_V402MU1_Mar22.pdf',
 'Copia di PROFIsafe-Profile_3192b_V24_Mar07.pdf',
 'iec61158-3-3{ed2.0}b.pdf',
 'Perguntas - Desenvolvedores.docx']

In [3]:
# keep only pdf files
file_names = [file for file in file_names if file.endswith(".pdf")]
file_names

['Copia di GSD-Spec_2122_V51_Jul08.pdf',
 'Copia di iec61158-1{ed1.0}b.pdf',
 'Copia di iec61158-2{ed6.0}b.pdf',
 'Copia di iec61158-4-3{ed3.0}b.pdf',
 'Copia di iec61158-5-3{ed3.0}b.pdf',
 'Copia di iec61158-6-3{ed3.0}b.pdf',
 'Copia di iec61784-1{ed4.0}b.pdf',
 'Copia di iec61784-3-3{ed2.0}en.pdf',
 'Copia di iec61784-5-3{ed3.0}b.pdf',
 'Copia di PROFIdrive_3172_V42_Oct15.pdf',
 'Copia di Profile-PA-Devices_3042_V402MU1_Mar22.pdf',
 'Copia di PROFIsafe-Profile_3192b_V24_Mar07.pdf',
 'iec61158-3-3{ed2.0}b.pdf']
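
The suffix check above is case-sensitive, so a file named `REPORT.PDF` would be dropped, and a directory whose name ends in `.pdf` would be kept. A minimal sketch of a stricter filter, assuming the same flat input folder; the helper name `list_pdfs` is hypothetical and not part of the notebook:

```python
import os

def list_pdfs(folder):
    """Return the sorted names of regular files in `folder` ending in .pdf (any case)."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")                  # accept .pdf, .PDF, .Pdf
        and os.path.isfile(os.path.join(folder, name))    # skip directories
    )
```

For the folder listed above this should give the same 13 names, only in sorted order.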

In [4]:
# Process each PDF file
for filename in file_names:  # already filtered to PDF files above
    input_path = os.path.join(input_folder, filename)
    output_name = os.path.splitext(filename)[0]

    # Create a folder for the output files
    file_output_folder = os.path.join(output_folder, output_name)
    os.makedirs(file_output_folder, exist_ok=True)

    # Convert PDF
    rendered = converter(input_path)
    text, _, images = text_from_rendered(rendered)

    # Save the text to a markdown file (explicit UTF-8: the Windows default
    # encoding can fail on symbols and non-Latin characters in the text)
    with open(os.path.join(file_output_folder, f"{output_name}.md"), "w", encoding="utf-8") as md_file:
        md_file.write(text)

    # Save the images to the output folder
    for name, image in images.items():
        image_path = os.path.join(file_output_folder, name)
        image.save(image_path)

print("Done!")

Recognizing layout: 100%|██████████| 16/16 [00:08<00:00,  1.81it/s]
Running OCR Error Detection: 100%|██████████| 23/23 [00:00<00:00, 125.39it/s]
Detecting bboxes: 0it [00:00, ?it/s]
[WinError 2] The system cannot find the file specified
  File "c:\Users\FUNPEC\anaconda3\envs\tmp\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "c:\Users\FUNPEC\anaconda3\envs\tmp\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\FUNPEC\anaconda3\envs\tmp\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "c:\Users\FUNPEC\anaconda3\envs\tmp\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Detecting bboxes

Done!
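
Converting large standards documents is slow, so it can help if one failing PDF does not abort the batch and a rerun skips files that are already done. The sketch below shows that pattern under stated assumptions: `convert_all` is a hypothetical helper, and `convert` is a hypothetical callable that takes an input path and returns the extracted text.

```python
import os

def convert_all(file_names, input_folder, output_folder, convert):
    """Convert each file, skipping finished ones and collecting failures.

    Returns a list of (filename, error) pairs for the files that failed.
    """
    failed = []
    for filename in file_names:
        stem = os.path.splitext(filename)[0]
        target_dir = os.path.join(output_folder, stem)
        target = os.path.join(target_dir, f"{stem}.md")

        if os.path.exists(target):
            continue  # already converted in an earlier run

        try:
            text = convert(os.path.join(input_folder, filename))
        except Exception as exc:  # one bad PDF should not stop the batch
            failed.append((filename, repr(exc)))
            continue

        os.makedirs(target_dir, exist_ok=True)
        with open(target, "w", encoding="utf-8") as md_file:
            md_file.write(text)
    return failed
```

In this notebook, `convert` could be `lambda path: text_from_rendered(converter(path))[0]`; image saving is left out to keep the sketch short.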


In [5]:
import re

def extract_introduction_to_end(markdown_file, output_file="introduction_to_end.md"):
    """Copy `markdown_file` from its first introduction-like heading to the end into `output_file`."""
    with open(markdown_file, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # List of possible titles indicating the start of the desired section
    possible_titles = [
        r"Introduction", 
        r"General", 
        r"Mapping Model", 
        r"Guidelines for implementers and users", 
        r"INDUSTRIAL COMMUNICATION NETWORKS – FIELDBUS SPECIFICATIONS –",
        r"INDUSTRIAL COMMUNICATION NETWORKS – PROFILES –",
        r"IO-Link master integration strategy \(system description\)",
        r"Concepts"
    ]

    # Create regex patterns to match different formats
    title_patterns = [
        rf"^#+\s*<.*?>?\s*\*\*\d*\.?\s*({title})\*\*" for title in possible_titles  # With ID and bold
    ] + [
        rf"^#+\s*\*\*\d*\.?\s*({title})\*\*" for title in possible_titles  # Bold only
    ] + [
        rf"^#+\s*({title})" for title in possible_titles  # Plain text only
    ]

    compiled_patterns = [re.compile(pattern, re.IGNORECASE) for pattern in title_patterns]

    start_index = None
    for i, line in enumerate(lines):
        if any(pattern.search(line) for pattern in compiled_patterns):
            start_index = i
            break

    if start_index is None:
        print("Introduction section (or equivalent) not found.")
        return

    # Extract everything from the found section to the end of the document
    intro_to_end = lines[start_index:]

    # Save the extracted content into a new file
    with open(output_file, "w", encoding="utf-8") as out_file:
        out_file.writelines(intro_to_end)

    print(f"Extracted content saved to '{output_file}'.")
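
To see which heading shapes the three pattern families accept, the sketch below rebuilds them for two of the titles and tries them on a few heading lines. The sample lines and the helper name `is_start_heading` are illustrative assumptions, not taken from the converted files.

```python
import re

titles = ["Introduction", "General"]  # subset of the list used in the function

patterns = [
    re.compile(pattern, re.IGNORECASE)
    for title in titles
    for pattern in (
        rf"^#+\s*<.*?>?\s*\*\*\d*\.?\s*({title})\*\*",  # HTML anchor, then bold title
        rf"^#+\s*\*\*\d*\.?\s*({title})\*\*",            # bold title only
        rf"^#+\s*({title})",                              # plain heading text
    )
]

def is_start_heading(line):
    """True if the line matches any of the heading patterns."""
    return any(pattern.search(line) for pattern in patterns)

samples = [
    '# <span id="page-10-0"></span>**1 Introduction**',
    "## **2. General**",
    "# introduction",
    "## 1 Introduction",
    "## **Foreword**",
    "Introduction",
]
for line in samples:
    print(f"{is_start_heading(line)!s:5}  {line}")
```

The first three samples match and the last three do not. In particular, a numbered heading without bold markers, such as `## 1 Introduction`, fits none of the three forms, which is one possible reason a file ends up reported as "not found".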


In [6]:
intro_folder = "Processed_Files_Introduction_to_End"
os.makedirs(intro_folder, exist_ok=True)

# Run the extraction on every markdown file in each per-document output folder
for folder in os.listdir(output_folder):
    for file in os.listdir(f"{output_folder}/{folder}"):
        if file.endswith(".md"):
            extract_introduction_to_end(f"{output_folder}/{folder}/{file}", f"{intro_folder}/{file}")

Introduction section (or equivalent) not found.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61158-1{ed1.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61158-2{ed6.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61158-4-3{ed3.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61158-5-3{ed3.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61158-6-3{ed3.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61784-1{ed4.0}b.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61784-3-3{ed2.0}en.md'.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia di iec61784-5-3{ed3.0}b.md'.
Introduction section (or equivalent) not found.
Introduction section (or equivalent) not found.
Extracted content saved to 'Processed_Files_Introduction_to_End/Copia