PROCESS TO RUN DOCLING OCR ON THE GPU -- Use the GPU only for image-based (scanned) PDFs, where OCR is required. Text-based PDFs need no OCR, so their extraction runs on the CPU.
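
The rule above (GPU only when OCR is needed) can be sketched as a small decision helper. This is a minimal illustration, not part of Docling: `needs_ocr`, `choose_device`, and the `min_chars_per_page` threshold are hypothetical names and values, and how the per-page character counts are obtained depends on the PDF library in use.

```python
def needs_ocr(page_char_counts, min_chars_per_page=20):
    """Return True if any page has too little embedded text to skip OCR.

    page_char_counts: characters found in each page's text layer.
    min_chars_per_page: hypothetical threshold; tune it for your documents.
    """
    return any(count < min_chars_per_page for count in page_char_counts)


def choose_device(page_char_counts, cuda_available):
    """Pick "cuda" only when OCR is needed and a GPU is present."""
    if needs_ocr(page_char_counts) and cuda_available:
        return "cuda"
    return "cpu"


print(choose_device([1520, 1833], cuda_available=True))  # text-based PDF
print(choose_device([0, 3], cuda_available=True))        # scanned PDF
```

In the notebook, `cuda_available` would come from `torch.cuda.is_available()`.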

In [22]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Looking in indexes: https://download.pytorch.org/whl/cu121


In [23]:
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU device:", torch.cuda.get_device_name(0))

CUDA available: True
GPU device: Tesla T4


In [24]:
!pip install docling



In [25]:
!pip install "docling-ocr-onnxtr[gpu]"



In [26]:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, InputFormat, PdfFormatOption
from docling_ocr_onnxtr import OnnxtrOcrOptions
import torch

# Step 4a: Verify GPU for PyTorch models
print("PyTorch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("PyTorch GPU device:", torch.cuda.get_device_name(0))

# Step 4b: Configure OCR plugin
ocr_options = OnnxtrOcrOptions(
    det_arch="db_mobilenet_v3_large",           # detection model
    reco_arch="Felix92/onnxtr-parseq-multilingual-v1",  # recognition model
    auto_correct_orientation=False
)

pipeline_options = PdfPipelineOptions(
    ocr_options=ocr_options,
)
pipeline_options.allow_external_plugins = True  # enable external plugin usage

# Step 4c: Create converter with PDF input
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    },
)

# Step 4d: Run conversion
conversion_result = converter.convert(source="Finance.pdf")
print("Conversion finished! Check nvidia-smi to see GPU usage during OCR/model execution.")

PyTorch CUDA available: True
PyTorch GPU device: Tesla T4
Conversion finished! Check nvidia-smi to see GPU usage during OCR/model execution.


In [28]:
# Step 5: Extract plain text from conversion_result
all_text = "\n".join([
    "".join(cell.text for cell in cluster.cells)
    for page in conversion_result.pages
    for cluster in page.predictions.layout.clusters
])

print("\n===== Extracted Text =====\n")
print(all_text)



===== Extracted Text =====

Infosys Limited (CIN: L85110KA1981PLC013115, PAN: AAACI4798L, GSTIN: 29AAACI4798L1ZU) issued Tax Invoice No. INF/INV/2025/204 on 15/02/2025 the consulting charges. Payments are to be made to State Bank of India, 
with a payment due date of 28/02/2025. The invoice included IGST @ 18% on Account Number 112233445566, IFSC Code SBIN0000456. If payment is delayed beyond the due date, a penalty interest of 10% per annum will apply. Kotak Mahindra Bank issued its 8.20% Fixed Rate Senior Secured Bonds (ISIN: INE237A08765) on 12 January 2023. Each bond has a face value of ₹5,00,000 and carries a fixed coupon of 8.20% per annum, payable semi-annually. The maturity date is set for 12 January 2028. On 20 March 2022, Standard Chartered India executed its first foreign trade finance deal linked to the SOFR benchmark, marking a shift from LIBOR-based transactions. Punjab National Bank, under the SARFAESI Act, issued a notice on 10-09-2023 against borrower loan accounts LA
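
The extracted text above contains a stray line break in the middle of a sentence. A light cleanup pass before extraction might look like the sketch below. `join_cluster_texts` is a hypothetical helper that works on plain strings. It only normalizes whitespace and does not repair the reading order of the OCR output.

```python
import re


def join_cluster_texts(cluster_texts):
    """Join per-cluster strings with single spaces and collapse whitespace runs."""
    joined = " ".join(t.strip() for t in cluster_texts if t and t.strip())
    return re.sub(r"\s+", " ", joined)


sample = [
    "issued Tax Invoice No. INF/INV/2025/204 on 15/02/2025 ",
    "\nwith a payment due date of 28/02/2025.",
]
print(join_cluster_texts(sample))
```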

In [29]:
print(torch.cuda.is_available())        # True
print(torch.cuda.current_device())      # Should return 0

True
0


LANGEXTRACT

In [30]:
!pip install langextract python-dotenv



[94m[1mLangExtract[0m: Saving to [92mfinancial_data.jsonl[0m: 0 docs [15:33, ? docs/s]


In [31]:
import langextract as lx
import textwrap
import os
from dotenv import load_dotenv
from docling.document_converter import DocumentConverter

In [32]:
# Load environment variables from .env file
load_dotenv()
api_key = os.getenv("LANGEXTRACT_API_KEY")
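
`os.getenv` returns `None` when the variable is missing, and the failure then surfaces only later, during the extraction call. A minimal sketch of failing early instead, where `require_env` is a hypothetical helper and `DEMO_API_KEY` is a placeholder variable used only for the demonstration:

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value


# Demonstration with a placeholder variable, not a real key
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In this notebook the call would be `api_key = require_env("LANGEXTRACT_API_KEY")`, placed after `load_dotenv()`.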

In [33]:
# Define the extraction task
prompt = textwrap.dedent("""\
Extract financial details in a structured way using the categories below:

1. Parties & Identification
- Capture names of companies, institutions, regulators, or counterparties.
- Record identifiers such as CIN, PAN, GST/VAT, Tax IDs, or any registration codes.
- Include account references like bank accounts, loan numbers, or investment account IDs.

2. Monetary Information
- Principal sums: invoice totals, loan values, transaction amounts.
- Charges & fees: late charges, service/processing fees, management costs.
- Interest terms: fixed/floating interest rates, APR values, or benchmark-linked references (e.g., SOFR, LIBOR).
- Taxes: GST, VAT, withholding, or similar levies.
- Penalties/Fines: early exit fee, defaults, or other financial penalties.

3. Dates & Timeframes
- Effective/Start dates: when agreements or transactions take effect.
- Maturity/Closing dates: final settlement or loan closure.
- Due dates: payment deadlines or installment schedules.
- Duration/tenure: repayment term, lock-in period, or ramp-up.
- Historical dates: transaction execution, invoice issue, settlement.

Instructions:
- Always extract **exact text spans** from the input (no rephrasing).
- Each extracted item must have contextual attributes (e.g., type of ID, kind of date, nature of monetary value).
""")


In [34]:
# Examples to guide extraction
examples = [
    lx.data.ExampleData(
        text="Tata Consultancy Services Limited (CIN: L99999MH1995PLC084781, PAN: AAACM8654F, GSTIN: 29AAACT1924F1Z9) issued Invoice No. TCS/INV/2024/987 on 15/09/2024 with a payment due date of 30/09/2024. The invoice carried IGST @ 18% on professional service fees. Payments must be credited to HDFC Bank, Account Number 123456789012, IFSC Code HDFC0001234. Any overdue payment will attract an additional 10% per annum as penalty interest.",
        extractions=[
            lx.data.Extraction(extraction_class="party", extraction_text="Tata Consultancy Services Limited", attributes={"type":"company"}),
            lx.data.Extraction(extraction_class="identifier", extraction_text="CIN: L99999MH1995PLC084781", attributes={"id_type":"CIN"}),
            lx.data.Extraction(extraction_class="identifier", extraction_text="PAN: AAACM8654F", attributes={"id_type":"PAN"}),
            lx.data.Extraction(extraction_class="identifier", extraction_text="GSTIN: 29AAACT1924F1Z9", attributes={"id_type":"GSTIN"}),
            lx.data.Extraction(extraction_class="party", extraction_text="HDFC Bank", attributes={"type":"bank"}),
            lx.data.Extraction(extraction_class="monetary", extraction_text="professional service fees", attributes={"value_type":"service_fee"}),
            lx.data.Extraction(extraction_class="interest_rate", extraction_text="10% per annum", attributes={"type":"penalty_interest"}),
            lx.data.Extraction(extraction_class="tax", extraction_text="IGST @ 18%", attributes={"tax_type":"GST"}),
            lx.data.Extraction(extraction_class="account", extraction_text="123456789012", attributes={"account_type":"bank"}),
            lx.data.Extraction(extraction_class="date", extraction_text="15/09/2024", attributes={"date_type":"invoice_date"}),
            lx.data.Extraction(extraction_class="date", extraction_text="30/09/2024", attributes={"date_type":"due_date"}),
        ]
    ),
]
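
Because the prompt asks for exact text spans, it can help to confirm that every `extraction_text` in an example occurs verbatim in that example's `text`. The sketch below uses plain strings so that it runs without LangExtract installed; `find_missing_spans` is a hypothetical helper, not a LangExtract function.

```python
def find_missing_spans(text, spans):
    """Return the spans that do not occur verbatim in text."""
    return [span for span in spans if span not in text]


demo_text = "The invoice carried IGST @ 18% (CIN: L99999MH1995PLC084781)."
demo_spans = ["CIN: L99999MH1995PLC084781", "IGST at 18%"]
print(find_missing_spans(demo_text, demo_spans))  # the second span is not verbatim
```

For the `examples` list above, one way to apply it is `find_missing_spans(ex.text, [e.extraction_text for e in ex.extractions])` for each `ex`; an empty list means every span matches.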


In [43]:
### Alternative: set all_text manually instead of taking it from the OCR output
## Input
# all_text = """
# Infosys Limited (CIN: L85110KA1981PLC013115, PAN: AAACI4798L, GSTIN: 29AAACI4798L1ZU) issued Tax Invoice No. INF/INV/2025/204 on 15/02/2025 with a payment due date of 28/02/2025. The invoice included IGST @ 18% on the consulting charges. Payments are to be made to State Bank of India, Account Number 112233445566, IFSC Code SBIN0000456. If payment is delayed beyond the due date, a penalty interest of 10% per annum will apply.

# Kotak Mahindra Bank issued its 8.20% Fixed Rate Senior Secured Bonds (ISIN: INE237A08765) on 12 January 2023. Each bond has a face value of ₹5,00,000 and carries a fixed coupon of 8.20% per annum, payable semi-annually. The maturity date is set for 12 January 2028.

# On 20 March 2022, Standard Chartered India executed its first foreign trade finance deal linked to the SOFR benchmark, marking a shift from LIBOR-based transactions.

# Punjab National Bank, under the SARFAESI Act, issued a notice on 10-09-2023 against borrower loan accounts LAC00987651234 and LAC00987659876. The secured property was taken into possession on 05-01-2024. Auction was scheduled with earnest money deposit due by 25-03-2024, and bids opening on 26-03-2024. As per Section 194-IA of the Income Tax Act, the buyer is liable to deduct TDS.

# Bajaj Finance Limited launched Secured Redeemable Non-Convertible Debentures (NCDs) with a face value of ₹10,000 each. The issue had series with tenors of 24 months, 48 months, and 84 months. Coupon rates ranged between 8.50% and 9.40% per annum depending on the chosen series.
# """



In [36]:
### Alternative: default Docling conversion (no custom OCR plugin)
## Input
# source = "Finance.pdf"  # document per local path or URL
# converter = DocumentConverter()
# result = converter.convert(source)
# input_text = result.document.export_to_markdown()  # document text as a Markdown string

In [38]:
all_text

'Infosys Limited (CIN: L85110KA1981PLC013115, PAN: AAACI4798L, GSTIN: 29AAACI4798L1ZU) issued Tax Invoice No. INF/INV/2025/204 on 15/02/2025 the consulting charges. Payments are to be made to State Bank of India, \nwith a payment due date of 28/02/2025. The invoice included IGST @ 18% on Account Number 112233445566, IFSC Code SBIN0000456. If payment is delayed beyond the due date, a penalty interest of 10% per annum will apply. Kotak Mahindra Bank issued its 8.20% Fixed Rate Senior Secured Bonds (ISIN: INE237A08765) on 12 January 2023. Each bond has a face value of ₹5,00,000 and carries a fixed coupon of 8.20% per annum, payable semi-annually. The maturity date is set for 12 January 2028. On 20 March 2022, Standard Chartered India executed its first foreign trade finance deal linked to the SOFR benchmark, marking a shift from LIBOR-based transactions. Punjab National Bank, under the SARFAESI Act, issued a notice on 10-09-2023 against borrower loan accounts LAC00987651234 and LAC0098765

In [39]:
# Run extraction
result = lx.extract(
    text_or_documents=all_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,
    max_workers=20,
    max_char_buffer=1000,
    api_key=api_key
)
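
Once extraction finishes, the items can be grouped by class for a quick review. The sketch below assumes each item exposes `extraction_class` and `extraction_text` attributes, as the `lx.data.Extraction` objects in the examples do. It uses `SimpleNamespace` stand-ins so that it runs on its own, and `group_by_class` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Map each extraction class to the list of text spans found for it."""
    grouped = defaultdict(list)
    for item in extractions:
        grouped[item.extraction_class].append(item.extraction_text)
    return dict(grouped)


# Stand-in objects for illustration only
demo = [
    SimpleNamespace(extraction_class="party", extraction_text="Infosys Limited"),
    SimpleNamespace(extraction_class="date", extraction_text="15/02/2025"),
    SimpleNamespace(extraction_class="date", extraction_text="28/02/2025"),
]
for cls, texts in group_by_class(demo).items():
    print(cls, texts)
```

With a real run this would be called as `group_by_class(result.extractions or [])`.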




In [40]:
# Save results to JSONL
lx.io.save_annotated_documents([result], output_name="financial_data.jsonl", output_dir=".")

[94m[1mLangExtract[0m: Saving to [92mfinancial_data.jsonl[0m: 1 docs [00:00, 404.74 docs/s]

[92m✓[0m Saved [1m1[0m documents to [92mfinancial_data.jsonl[0m





In [41]:
# Generate interactive visualization
html_content = lx.visualize("financial_data.jsonl")

[94m[1mLangExtract[0m: Loading [92mfinancial_data.jsonl[0m: 100%|██████████| 12.7k/12.7k [00:00<00:00, 14.0MB/s]

[92m✓[0m Loaded [1m1[0m documents from [92mfinancial_data.jsonl[0m





In [42]:
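# Save visualization to HTML file
# In Jupyter/Colab the result is an HTML object with a .data attribute; otherwise it is a string.
# UTF-8 is set explicitly because the text contains non-ASCII characters such as ₹.
with open("financial_data_visualization.html", "w", encoding="utf-8") as f:
    if hasattr(html_content, "data"):
        f.write(html_content.data)
    else:
        f.write(html_content)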
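
The saved JSONL file can also be inspected without LangExtract. The sketch below counts extractions per class using only the standard library. It assumes each line is a JSON object with an `extractions` list whose entries carry an `extraction_class` key; that layout is an assumption about the saved file and should be checked against the actual output. `count_classes` is a hypothetical helper.

```python
import json
from collections import Counter


def count_classes(jsonl_path):
    """Count extractions per class across all documents in a JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            # "extractions" / "extraction_class" are assumed key names
            for item in doc.get("extractions") or []:
                counts[item.get("extraction_class", "unknown")] += 1
    return counts
```

For example, `count_classes("financial_data.jsonl")` returns a `Counter` keyed by class name.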