<a href="https://colab.research.google.com/github/UmangJoshi4/marker/blob/master/Marker_In_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone "https://github.com/VikParuchuri/marker.git"

fatal: destination path 'marker' already exists and is not an empty directory.
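
The `fatal` message above is what `git clone` prints when the cell is re-run in a session where `marker/` already exists. A minimal sketch of a guard that skips the clone in that case follows. `clone_if_missing` is a hypothetical helper, not part of marker, and its `run` parameter exists only so the function can be exercised without network access.

```python
import subprocess
from pathlib import Path


def clone_if_missing(repo_url: str, dest: str, run=subprocess.run) -> bool:
    """Clone repo_url into dest unless dest already exists and is non-empty.

    Returns True if a clone was attempted, False if it was skipped.
    """
    dest_path = Path(dest)
    # git refuses to clone into an existing non-empty directory, so skip early
    if dest_path.is_dir() and any(dest_path.iterdir()):
        print(f"'{dest}' already exists; skipping clone")
        return False
    run(["git", "clone", repo_url, dest], check=True)
    return True
```

In the notebook this would be called as `clone_if_missing("https://github.com/VikParuchuri/marker.git", "marker")` in place of the `!git clone` line.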


In [2]:
!cd marker && pip install -e .

Obtaining file:///content/marker
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: marker-pdf
  Building editable for marker-pdf (pyproject.toml) ... done
  Created wheel for marker-pdf: filename=marker_pdf-1.4.0-py3-none-any.whl size=22808 sha256=434449917c902de65f9be7dc9dd7f76d76979e745ca65f7e19ce4c907c3d91f2
  Stored in directory: /tmp/pip-ephem-wheel-cache-n3mqsl_m/wheels/26/fd/a2/30fec3befe0eb33ac82866839cb61d98e998e76d7638df075c
Successfully built marker-pdf
Installing collected packages: marker-pdf
  Attempting uninstall: marker-pdf
    Found existing installation: marker-pdf 1.4.0
    Uninstalling marker-pdf-1.4.0:
      Successfully uninstalled marker-pdf-1.4.0
Successfully installed marker-pdf-1.4.0
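
The log above shows an existing `marker-pdf 1.4.0` being uninstalled and replaced by the editable build. To confirm which version the running kernel actually sees, a small check along these lines may help. `installed_version` is a hypothetical helper name; it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version("marker-pdf")` after the install cell should return the version string reported by pip, and `None` if the install did not take effect in the current environment.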


In [1]:
!ls -la

total 11584
drwxr-xr-x  1 root root     4096 Feb 12 10:53  .
drwxr-xr-x  1 root root     4096 Feb 12 10:22  ..
-rw-r--r--  1 root root 11828746 Feb 12 10:32 'annual_report_2020 from bse.pdf'
drwxr-xr-x  4 root root     4096 Feb 10 14:24  .config
drwxr-xr-x  3 root root     4096 Feb 12 10:53  data
drwxr-xr-x  2 root root     4096 Feb 12 10:33  .ipynb_checkpoints
drwxr-xr-x 10 root root     4096 Feb 12 10:30  marker
drwxr-xr-x  3 root root     4096 Feb 12 10:35  outputs
drwxr-xr-x  1 root root     4096 Feb 10 14:25  sample_data


In [None]:
!marker_single "annual_report_2020 from bse.pdf"

2025-02-12 11:03:15.938592: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739358195.959204   10560 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739358195.965512   10560 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-12 11:03:15.987174: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loaded layout model datalab-to/surya_layout on device cuda with dtype torch.float16
Loaded texify model datalab-to/texify on 

In [None]:
from typing import Any, Dict
from pathlib import Path
import torch
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
import gc

class PDFParser:
    """Parse PDFs to markdown using Marker with GPU optimization"""

    def __init__(self, input_dir: str = "data/raw", output_dir: str = "data/preprocessed", batch_size: int = 4):
        self.input_dir = Path(input_dir)
        self.output_dir = Path(output_dir)
        self.batch_size = batch_size

        # Initialize device and model with optimizations
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"Using device: {self.device}")

        if torch.cuda.is_available():
            # Enable TF32 for better performance on Ampere GPUs
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True
            torch.cuda.empty_cache()
            torch.backends.cudnn.benchmark = True

        # Create model dictionary - note that create_model_dict() doesn't take precision param
        self.model_dict = create_model_dict(device=self.device)

        # Initialize converter with optimizations
        self.converter = PdfConverter(
            artifact_dict=self.model_dict,
            config={
                'batch_size': self.batch_size,
                'force_ocr': False,  # Set to True if needed
                'debug': False
            }
        )

    @torch.amp.autocast(device_type='cuda')  # Updated autocast syntax
    def process_batch(self, file_path: str) -> tuple:
        """Convert one PDF; return (markdown, metadata), or (None, None) on failure"""
        try:
            with torch.no_grad():  # Disable gradient computation
                rendered = self.converter(file_path)
                # text_from_rendered returns (text, file extension, images);
                # the document metadata is an attribute of the rendered output
                markdown, _, _ = text_from_rendered(rendered)
                metadata = rendered.metadata

                # Clear GPU cache after batch
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()

                return markdown, metadata

        except Exception as e:
            print(f"Error in batch processing: {e}")
            return None, None

    def parse_pdf(self, company_code: str, file_path: str) -> Dict[str, Any]:
        """Parse a single PDF to markdown; return a result dict, or None on failure"""
        try:
            print(f"\nParsing PDF: {file_path}")

            # Create output directory
            output_dir = self.output_dir / company_code / "annual_reports"
            output_dir.mkdir(parents=True, exist_ok=True)

            # Process PDF
            markdown, metadata = self.process_batch(file_path)

            if markdown:
                # Save markdown
                output_file = output_dir / f"{Path(file_path).stem}.md"
                with open(output_file, "w", encoding="utf-8") as f:
                    f.write(markdown)

                print(f"Successfully parsed and saved to: {output_file}")

                # Clear memory
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()

                return {
                    "output_file": str(output_file),
                    "metadata": metadata,
                    "status": "success"
                }

            return None

        except Exception as e:
            print(f"Error parsing PDF {file_path}: {str(e)}")
            return None

if __name__ == "__main__":
    # Test the parser
    parser = PDFParser(batch_size=1)

    # Example: Parse annual report
    result = parser.parse_pdf(
        company_code="500470",
        file_path="annual_report_2020 from bse.pdf"
    )

    if result:
        print("\nParsing completed:")
        print(f"Output file: {result['output_file']}")
        print(f"Status: {result['status']}")

Using device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loaded layout model datalab-to/surya_layout on device cuda with dtype torch.float16
Loaded texify model datalab-to/texify on device cuda with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device cuda with dtype torch.float16
Loaded table recognition model datalab-to/surya_tablerec on device cuda with dtype torch.float16
Loaded detection model vikp/surya_det3 on device cuda with dtype torch.float16

Parsing PDF: annual_report_2020 from bse.pdf
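
`PDFParser` stores an `input_dir` (default `data/raw`), but the cell above parses only one hard-coded file. A minimal sketch of enumerating PDFs for a batch run follows. It assumes one sub-directory per company code under `input_dir` (for example `data/raw/500470/report.pdf`); that layout and the helper name `find_pdfs` are assumptions, not something the notebook defines.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_pdfs(input_dir: str) -> Iterator[Tuple[str, Path]]:
    """Yield (company_code, pdf_path) pairs found under input_dir.

    Assumed layout: <input_dir>/<company_code>/<anything>.pdf
    """
    root = Path(input_dir)
    # Sort for a deterministic processing order across runs
    for company_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for pdf in sorted(company_dir.glob("*.pdf")):
            yield company_dir.name, pdf
```

Each pair can then be passed to the existing method, for example `parser.parse_pdf(company_code=code, file_path=str(pdf))` inside a `for code, pdf in find_pdfs("data/raw")` loop. Because `parse_pdf` returns `None` on failure, a batch run can collect the failed paths and continue rather than stop.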
