This Python script extracts structured data from Bill of Lading (BOL) PDF documents and converts it to JSON format.

- Extracts key information from Bill of Lading PDFs, including:
  - BOL number
  - Shipper information
  - Consignee information
  - Notify party information
  - Vessel and voyage details
  - Container information
  - Dates (issue date, shipped date)
  - Port information
  - Cargo details
- Handles layout variations in PDFs
- Provides both structured data and raw text for each section
- Outputs data in JSON format
- Optional OCR support for PDFs with poor text extraction

Requirements:

- Python 3.6+
- PyMuPDF (fitz)
- For OCR support:
  - Tesseract OCR (must be installed separately)
  - pytesseract
  - Pillow

Installation:

- Clone this repository or download the script files
- Install the required dependencies: `pip install -r requirements.txt`
- For OCR support, install Tesseract OCR:
  - On macOS: `brew install tesseract`
  - On Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
  - On Windows: download and install from the Tesseract GitHub page

Basic usage:

    python extract_bol.py path/to/your/bol.pdf

This will extract data from the PDF and save it to a JSON file with the same name as the PDF.

If the PDF has poor text extraction, you can use the OCR version:

    python extract_bol_with_ocr.py path/to/your/bol.pdf --ocr

To write to a specific output file:

    python extract_bol.py path/to/your/bol.pdf --output output.json

Example:

    python extract_bol.py ./public/065-2024\ MBL\ MEDUP1966175.pdf

With OCR:

    python extract_bol_with_ocr.py 065-2024\ MBL\ MEDUP1966175.pdf --ocr --lang eng

The script generates a JSON file with the following structure:

    {
      "document_type": "Bill of Lading",
      "filename": "example.pdf",
      "bol_number": "MEDUP1966175",
      "shipper": {
        "company_name": "COMPANY NAME",
        "address": "Full address",
        "raw_text": "Raw text from the shipper section"
      },
      "consignee": {
        "company_name": "CONSIGNEE COMPANY",
        "address": "Full address",
        "raw_text": "Raw text from the consignee section"
      },
      "notify_party": {
        "company_name": "NOTIFY PARTY",
        "address": "Full address",
        "raw_text": "Raw text from the notify party section"
      },
      "vessel": {
        "name": "VESSEL NAME",
        "voyage": "VOYAGE NUMBER"
      },
      "containers": [
        {
          "container_number": "ABCD1234567",
          "seal_number": "SEAL123",
          "package_count": "44",
          "weight": "25000.00",
          "context": "Surrounding text for context"
        }
      ],
      "issue_date": "28-Nov-2024",
      "shipped_date": "24-Nov-2024",
      "port_of_loading": "PARANAGUA, PR, BRAZIL",
      "port_of_discharge": "JEBEL ALI, DUBAI",
      "place_of_receipt": "Place information",
      "place_of_delivery": "Delivery information",
      "cargo": {
        "package_count": "88",
        "gross_weight_kg": "50000.00",
        "description": "Description of cargo"
      }
    }

The script uses multiple strategies to extract data:
- Regular expression pattern matching on the full text
- Region-based extraction for specific areas of the document
- Context-based extraction for related information
If you encounter PDFs with different layouts, you may need to adjust the extraction patterns or regions in the script.
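
As a rough illustration of the pattern-matching and context-based strategies, here is a minimal sketch that works on plain text only. The patterns, the names `BOL_PATTERN`, `CONTAINER_PATTERN`, `SEAL_PATTERN`, and the `extract_fields` helper are hypothetical and are not taken from `extract_bol.py`; the sample text is made up to resemble the output structure shown earlier.

```python
import json
import re

# Hypothetical patterns for illustration; the real script's patterns may differ.
# Assumes the BOL number follows a label such as "B/L No." or "BILL OF LADING No.".
BOL_PATTERN = re.compile(
    r"(?:B/L|BILL OF LADING)\s*No\.?\s*:?\s*([A-Z0-9]{8,16})", re.IGNORECASE
)
# Container numbers: four letters followed by seven digits.
CONTAINER_PATTERN = re.compile(r"\b([A-Z]{4}\d{7})\b")
# Seal numbers are searched only in the text near a container match.
SEAL_PATTERN = re.compile(
    r"Seal\s*(?:No\.?|Number)?\s*:?\s*([A-Z0-9]+)", re.IGNORECASE
)


def extract_fields(text, context_chars=80):
    """Extract a BOL number and container entries from plain text."""
    result = {"bol_number": None, "containers": []}

    # Strategy 1: pattern matching on the full text.
    bol = BOL_PATTERN.search(text)
    if bol:
        result["bol_number"] = bol.group(1)

    # Strategy 3: context-based extraction around each container number.
    for match in CONTAINER_PATTERN.finditer(text):
        context = text[match.end():match.end() + context_chars]
        seal = SEAL_PATTERN.search(context)
        result["containers"].append({
            "container_number": match.group(1),
            "seal_number": seal.group(1) if seal else None,
            "context": context.strip(),
        })
    return result


sample = "BILL OF LADING No. MEDUP1966175\nABCD1234567 Seal Number: SEAL123 44 BAGS\n"
print(json.dumps(extract_fields(sample), indent=2))
```

For a PDF with a different layout, adjusting patterns like these (for example, the label text before the BOL number) is the kind of change the note above refers to.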

- Use `extract_bol.py` for PDFs with good text extraction (faster)
- Use `extract_bol_with_ocr.py` for:
  - Scanned PDFs
  - PDFs with poor text extraction
  - PDFs with unusual layouts or formatting
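
One way to make this choice automatically is to check how much usable text the normal extraction yields. The sketch below is an assumption, not part of either script: `needs_ocr` and its thresholds are hypothetical, and it takes already-extracted page text as input so that it does not depend on a PDF library.

```python
def needs_ocr(page_texts, min_chars_per_page=50, min_alnum_ratio=0.5):
    """Return True if the extracted text looks too sparse or garbled to parse.

    page_texts: list of strings, one per page, from the normal text extraction.
    The thresholds are illustrative guesses and should be tuned on real files.
    """
    if not page_texts:
        return True

    # Drop all whitespace so that empty layout padding does not count as text.
    stripped = "".join("".join(page_texts).split())

    # Scanned pages usually yield little or no text.
    if not stripped or len(stripped) < min_chars_per_page * len(page_texts):
        return True

    # Text that is mostly symbols is likely a broken text layer.
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


print(needs_ocr(["", ""]))
print(needs_ocr(["BILL OF LADING No. MEDUP1966175 " * 5]))
```

With PyMuPDF, the per-page strings could come from calling `page.get_text()` on each page; a `True` result would then suggest running `extract_bol_with_ocr.py` with `--ocr` instead.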
- The script is designed for a specific Bill of Lading format and may require adjustments for different formats
- OCR accuracy depends on the quality of the PDF
- Complex tables or unusual layouts may not be fully captured

License: MIT