# **E.T.L.**

This notebook demonstrates how to use two PDF-parsing libraries, **PyMuPDF** (through its `pymupdf4llm` helper) and **Docling**, to convert PDF documents into Markdown files as the extraction step of an ETL pipeline.

It processes every PDF in a project folder and saves each result as a Markdown file for downstream tasks such as document indexing, text analysis, and retrieval-augmented generation (RAG) applications.
 
---

In [1]:
!pip install pymupdf pymupdf4llm docling -q




In [2]:
print("Please enter the following information...")

# Strip stray whitespace so it does not end up in directory names
project = input("\t- Project name:").strip()
topic = input("\t- Topic name:").strip()

Please enter the following information...


	- Project name: papers
	- Topic name: LLMs


## [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/)

In [3]:
import pymupdf
from pymupdf4llm import to_markdown

In [4]:
# Function to convert a PDF to Markdown with PyMuPDF

def pymupdf_pdf_to_md(input_pdf_path, output_md_path):
    try:
        # Open the PDF; the context manager closes it even if conversion fails
        with pymupdf.open(input_pdf_path) as doc:
            page_list = list(range(len(doc)))  # All pages; narrow this list to convert a subset
            md_text = to_markdown(doc, pages=page_list)  # Convert to Markdown

        # Write the Markdown string to a UTF-8 file
        with open(output_md_path, "w", encoding="utf-8") as output:
            output.write(md_text)

    except Exception as e:
        print(f"Error processing {input_pdf_path}: {e}")

In [5]:
import os

# Input and output directories
input_dir = f"projects/{project}/pdfs/{topic}"
output_dir = f"projects/{project}/mds_pymupdf/{topic}"

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Process each PDF file in the input directory
for file_name in os.listdir(input_dir):
    if file_name.lower().endswith(".pdf"):  # Accept both .pdf and .PDF
        input_pdf_path = os.path.join(input_dir, file_name)
        output_md_path = os.path.join(output_dir, os.path.splitext(file_name)[0] + ".md")

        print(f"\033[1mProcessing {file_name}...\033[0m")
        pymupdf_pdf_to_md(input_pdf_path, output_md_path)

print("----- \033[1mConversion completed!\033[0m -----")

Processing Scaling Laws for Neural Language Models.pdf...
Processing projects/papers/pdfs/LLMs/Scaling Laws for Neural Language Models.pdf...
Processing Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.pdf...
Processing projects/papers/pdfs/LLMs/Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.pdf...
Processing BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.pdf...
Processing projects/papers/pdfs/LLMs/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.pdf...
Processing ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.pdf...
Processing projects/papers/pdfs/LLMs/ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.pdf...
Processing Improving Language Understanding by Generative Pre-Training.pdf...
Processing projects/papers/pdfs/LLMs/Improving Language Understanding by Generative Pre-Tr
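
The directory loop above can also be factored into a small reusable helper, so the same driver serves any converter with the `(input_pdf_path, output_md_path)` signature used in this notebook. The sketch below is only an illustration and not part of either library: `convert_directory` is a hypothetical name, and it assumes the converter raises on failure. The notebook's own functions catch and print errors instead, so they would need to re-raise for failures to be collected here.

```python
from pathlib import Path


def convert_directory(input_dir, output_dir, convert_fn):
    """Run convert_fn(pdf_path, md_path) for every PDF in input_dir.

    Hypothetical helper: convert_fn is any callable taking two path strings.
    Returns (converted, failed), where failed holds (file name, error message).
    """
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    converted, failed = [], []
    # sorted() gives a stable order; the suffix check is case-insensitive
    for pdf_path in sorted(in_dir.iterdir()):
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        md_path = out_dir / (pdf_path.stem + ".md")
        try:
            convert_fn(str(pdf_path), str(md_path))
            converted.append(md_path.name)
        except Exception as exc:  # keep going so one bad PDF does not stop the batch
            failed.append((pdf_path.name, str(exc)))
    return converted, failed
```

A call such as `convert_directory(input_dir, output_dir, pymupdf_pdf_to_md)` would then replace the explicit loop, and the returned `failed` list gives a summary of files that need attention.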

## [Docling](https://github.com/DS4SD/docling)

In [6]:
from docling.document_converter import DocumentConverter


In [7]:
# Function to convert a PDF to Markdown with Docling

# Create the converter once and reuse it, so it is not rebuilt for every file
converter = DocumentConverter()

def docling_pdf_to_md(input_pdf_path, output_md_path):
    try:
        # Convert to Markdown
        result = converter.convert(input_pdf_path)
        md_text = result.document.export_to_markdown()

        # Write the Markdown string to a UTF-8 file
        with open(output_md_path, "w", encoding="utf-8") as output:
            output.write(md_text)

    except Exception as e:
        print(f"Error processing {input_pdf_path}: {e}")
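
Since the two backends may differ noticeably in runtime, it can be useful to time each conversion. The following is a minimal standard-library sketch; `time_conversion` is a hypothetical helper name, and it works with any function that has the notebook's `(input_pdf_path, output_md_path)` signature.

```python
import time


def time_conversion(convert_fn, input_pdf_path, output_md_path):
    """Call convert_fn on one file and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    convert_fn(input_pdf_path, output_md_path)
    return time.perf_counter() - start
```

For example, `seconds = time_conversion(docling_pdf_to_md, input_pdf_path, output_md_path)` could be used inside the processing loop and the value printed next to each file name.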

In [8]:
import os

# Input and output directories
input_dir = f"projects/{project}/pdfs/{topic}"
output_dir = f"projects/{project}/mds_docling/{topic}"

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Process each PDF file in the input directory
for file_name in os.listdir(input_dir):
    if file_name.lower().endswith(".pdf"):  # Accept both .pdf and .PDF
        input_pdf_path = os.path.join(input_dir, file_name)
        output_md_path = os.path.join(output_dir, os.path.splitext(file_name)[0] + ".md")

        print(f"\033[1mProcessing {file_name}...\033[0m")
        docling_pdf_to_md(input_pdf_path, output_md_path)

print("----- \033[1mConversion completed!\033[0m -----")

Processing Scaling Laws for Neural Language Models.pdf...
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Processing Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.pdf...
Processing BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.pdf...
Processing ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.pdf...
Processing Improving Language Understanding by Generative Pre-Training.pdf...
Processing Attention is all you need.pdf...
Processing Language Models are Few-Shot Learners.pdf...
Processing Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.pdf...
----- Conversion completed! -----
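
With both backends run over the same PDFs, a quick way to spot conversions that diverge is to compare the resulting Markdown files pairwise. The sketch below is one simple approach under stated assumptions: `md_stats` and `compare_outputs` are hypothetical helpers that match files by name across two output directories (such as the `mds_pymupdf` and `mds_docling` folders used above) and report character counts and a rough count of `#` heading lines. It says nothing about which output is more faithful to the PDF.

```python
from pathlib import Path


def md_stats(md_path):
    """Return (character count, number of lines starting with '#') for a file."""
    text = Path(md_path).read_text(encoding="utf-8")
    # Rough heading count: '#' lines inside code blocks are counted too
    headings = sum(1 for line in text.splitlines() if line.lstrip().startswith("#"))
    return len(text), headings


def compare_outputs(dir_a, dir_b):
    """Compare same-named .md files in two directories.

    Returns {file name: ((chars_a, headings_a), (chars_b, headings_b))}
    for files present in both directories.
    """
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    names_a = {p.name for p in dir_a.glob("*.md")}
    names_b = {p.name for p in dir_b.glob("*.md")}
    return {
        name: (md_stats(dir_a / name), md_stats(dir_b / name))
        for name in sorted(names_a & names_b)
    }
```

Calling `compare_outputs(f"projects/{project}/mds_pymupdf/{topic}", f"projects/{project}/mds_docling/{topic}")` returns one entry per paper converted by both backends; large gaps in either number are a hint to inspect that file by hand.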


---