This project extracts structured data from PDF documents and writes the results into Excel workbooks.
It supports:
- rule-based extraction for known table and figure patterns
- OCR fallback for scanned/image-based PDFs
- per-PDF extractor selection through extract_requests.txt
- lightweight history-based extractor recommendation
- human-in-the-loop feedback through reviewed Excel files
- extract_production_process_table.py: main extraction script
- extract_requests.txt: per-PDF extractor configuration
- keyword_aliases.json: canonical keywords and their aliases
- extraction_history.json: learned historical pattern memory
- feedback_history.json: learned human feedback corrections
Install the main Python packages:
python -m pip install pdfplumber pypdfium2 openpyxl pytesseract
For scanned PDFs, install the Tesseract OCR engine as well.
Expected Windows install path:
C:\Program Files\Tesseract-OCR\tesseract.exe
The script auto-detects that location.
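As an illustration of what such auto-detection could look like, here is a minimal sketch using only the standard library. The function name `find_tesseract` and its parameters are hypothetical and are not taken from the script; the default path is the one listed above.

```python
import os
import shutil

# Default location taken from the expected Windows install path above.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_WINDOWS_PATH,), search_path=True):
    """Return the first existing Tesseract executable, or None if none is found."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    if search_path:
        # Fall back to whatever "tesseract" resolves to on PATH, if anything.
        return shutil.which("tesseract")
    return None
```

If a path is found, it can be handed to pytesseract by assigning it to `pytesseract.pytesseract.tesseract_cmd`.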
Current extractor names:
- Production Process
- Table 1
- Table 3 ASTM
- Table 6
- CI-4 Hardness
- ISO VG 10
- Fig 1 Temperatures
- Product Tables
- Greases
- OXX
- ABB Greasing
These names are the same names used in extract_requests.txt and as Excel sheet names.
Run with the default request file:
python extract_production_process_table.py
Run for a specific PDF:
python extract_production_process_table.py myfile.pdf
Run for a specific PDF and request file:
python extract_production_process_table.py myfile.pdf extract_requests.txt
Run while also passing reviewed Excel workbooks explicitly:
python extract_production_process_table.py myfile.pdf extract_requests.txt reviewed_output.xlsx
The script reads extraction requests from extract_requests.txt.
Requests can use either:
- exact extractor names such as Greases
- canonical business keywords such as Grease
If a canonical keyword is used, the script resolves it through keyword_aliases.json and maps it to the appropriate extractor.
You can define:
- Global options for a single PDF:
Greases, OXX
- Per-PDF options for batch processing:
testLubricant.pdf: Grease
prodTable.pdf: Product Tables, OXX
ABB.pdf: ABB Greasing
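To make the two request forms concrete, here is a rough sketch of how such a file could be parsed. It assumes that a line whose part before the first colon ends in `.pdf` is a per-PDF entry and that every other non-empty line is a global list; the real script's syntax rules may differ, and `parse_requests` is a hypothetical name.

```python
def parse_requests(text):
    """Split request text into (global_names, per_pdf) under the assumed syntax."""
    global_names = []
    per_pdf = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        head, sep, tail = line.partition(":")
        if sep and head.strip().lower().endswith(".pdf"):
            # Per-PDF entry: "name.pdf: Extractor A, Extractor B"
            names = [n.strip() for n in tail.split(",") if n.strip()]
            per_pdf.setdefault(head.strip(), []).extend(names)
        else:
            # Global entry: "Extractor A, Extractor B"
            global_names.extend(n.strip() for n in line.split(",") if n.strip())
    return global_names, per_pdf
```

With the examples above, the global line yields `["Greases", "OXX"]` and the per-PDF lines map each file name to its own list of requests.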
If no explicit request is available for a PDF, the history-based recommender may suggest extractors automatically.
The file keyword_aliases.json stores canonical keywords and all of the aliases that should be searched in the PDF.
Example:
{
  "Grease": {
    "extractor": "Greases",
    "aliases": [
      "special ball bearing grease",
      "high performance greases",
      "greases"
    ]
  }
}
How it is used:
- You put Grease in extract_requests.txt
- The script maps Grease to the Greases extractor
- The Greases extractor searches the PDF for all configured aliases
- If any of those aliases are found, the matching pages are processed
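The lookup described above can be sketched roughly as follows. The function names, the short `EXTRACTOR_NAMES` set, and the case-insensitive substring match are assumptions made for illustration. As a simplification, an exact extractor name returns no aliases here.

```python
import json

# Illustrative subset of extractor names; the real list is longer.
EXTRACTOR_NAMES = {"Greases", "OXX", "Product Tables", "ABB Greasing"}


def resolve_request(name, alias_config, extractor_names=EXTRACTOR_NAMES):
    """Return (extractor, aliases) for a request, or None if it is unknown."""
    if name in extractor_names:
        return name, []
    entry = alias_config.get(name)
    if entry is None:
        return None
    return entry["extractor"], list(entry.get("aliases", []))


def pages_matching(page_texts, aliases):
    """Return 1-based page numbers whose text contains any alias, ignoring case."""
    lowered = [alias.lower() for alias in aliases]
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if any(alias in text.lower() for alias in lowered)
    ]
```

Here `alias_config` is the dictionary obtained by loading keyword_aliases.json with `json.load`.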
For each PDF, the script creates one Excel workbook:
<pdf_stem>_tables.xlsx
Examples:
- testLubricant_tables.xlsx
- prodTable_tables.xlsx
- ABB_tables.xlsx
If the workbook is open and locked, the script saves instead to:
<pdf_stem>_tables_updated.xlsx
The script prints progress to the console, including:
- selected extractor count
- start/completion message for each extractor
- row count per extractor
- total time taken
Example:
PDF: D:\App\pdfscrapper\testLubricant.pdf
Selected extractors: 1
Starting Greases...
Completed Greases in 0.24s with 13 rows.
Extracted 13 rows across 1 sheets to: D:\App\pdfscrapper\testLubricant_tables.xlsx
Total time taken: 0.27s
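A progress log in this style could be produced by a small driver loop like the sketch below. `run_extractors` and the `(name, callable)` pairing are hypothetical; only the message wording follows the example above, and the summary line omits the output path.

```python
import time


def run_extractors(pdf_path, extractors):
    """Run each (name, callable) pair and print progress messages for it."""
    started = time.perf_counter()
    print(f"PDF: {pdf_path}")
    print(f"Selected extractors: {len(extractors)}")
    results = {}
    for name, extract in extractors:
        print(f"Starting {name}...")
        t0 = time.perf_counter()
        rows = extract(pdf_path)  # each extractor is assumed to return a list of rows
        elapsed = time.perf_counter() - t0
        print(f"Completed {name} in {elapsed:.2f}s with {len(rows)} rows.")
        results[name] = rows
    total_rows = sum(len(rows) for rows in results.values())
    print(f"Extracted {total_rows} rows across {len(results)} sheets")
    print(f"Total time taken: {time.perf_counter() - started:.2f}s")
    return results
```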
The script first tries native PDF text extraction where possible.
If a page is scanned or image-based, it falls back to OCR.
This is used for:
- scanned ABB regreasing cards
- image-heavy figure extraction
- scanned table lookups where text extraction is weak
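The fallback decision can be sketched as follows, assuming a page counts as scanned when native extraction returns almost no text. The 20-character threshold and the function names are assumptions. `page` is a pdfplumber page object, and pytesseract is imported lazily so that PDFs with native text do not require OCR to be installed.

```python
def needs_ocr(text, min_chars=20):
    """Treat a page as scanned when native extraction yields almost no text."""
    return len((text or "").strip()) < min_chars


def page_text_with_fallback(page, ocr_resolution=300):
    """Return native text for a pdfplumber page, or OCR text if that looks empty."""
    text = page.extract_text() or ""
    if not needs_ocr(text):
        return text
    import pytesseract  # only needed for scanned or image-based pages

    # Render the page to a PIL image and run Tesseract on it.
    image = page.to_image(resolution=ocr_resolution).original
    return pytesseract.image_to_string(image)
```

A typical call would iterate over `pdfplumber.open(path).pages` and pass each page to `page_text_with_fallback`.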
The script maintains a lightweight memory in extraction_history.json.
How it works:
- Successful extractions store page snippets and extractor names.
- When a new PDF has no explicit request entry, the script scores its pages against historical snippets.
- The best-matching extractors are recommended and can be run automatically.
This is a lightweight ML-style similarity layer, not a trained neural network.
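One simple way to build such a similarity layer is token-set (Jaccard) overlap between page text and stored snippets, sketched below. The similarity measure, the 0.3 threshold, and the assumed history shape (a list of `{"extractor", "snippet"}` entries) are illustrative choices, not the script's documented behavior.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_extractors(page_texts, history, threshold=0.3):
    """Return extractor names ranked by their best page/snippet similarity."""
    best = {}
    for entry in history:  # assumed shape: {"extractor": str, "snippet": str}
        snippet_tokens = tokens(entry["snippet"])
        for text in page_texts:
            score = jaccard(tokens(text), snippet_tokens)
            if score > best.get(entry["extractor"], 0.0):
                best[entry["extractor"]] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]
```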
Every output row now includes:
- Review Field
- Human Evaluation
After the workbook is generated:
- Open the Excel file.
- Review the extracted values.
- Leave Review Field as-is unless you want to target a different column.
- In Human Evaluation:
  - write Correct if the extracted value is right
  - write the corrected value if the extracted value is wrong
If a row looks like this:
| Production Process | Rt (mm) | Ra (mm) | Review Field | Human Evaluation |
|---|---|---|---|---|
| grinding | 2.00-6.0 | 0.400-0.8 | Ra (mm) | 0.750-3.5 |
then on the next run the script can learn that correction and apply it automatically.
The script reads reviewed workbooks from:
- explicitly passed .xlsx arguments
- any *_tables.xlsx files in the current folder
- any *_tables_updated.xlsx files in the current folder
It stores:
- confirmations in feedback_history.json
- corrections in feedback_history.json
- safe pattern-aware rules in feedback_history.json
Then, during future extraction runs, matching values are auto-corrected before writing the next workbook.
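A minimal sketch of that auto-correction step, assuming corrections are held in memory as a dictionary keyed by (sheet, field, extracted value). The key layout and the name `apply_value_feedback` are assumptions; the actual structure of feedback_history.json is not described here.

```python
def apply_value_feedback(sheet, row, corrections):
    """Return a copy of row with any stored corrections applied."""
    fixed = dict(row)
    for field, value in row.items():
        key = (sheet, field, str(value))
        if key in corrections:
            fixed[field] = corrections[key]
    return fixed
```

Using the grinding row above, a stored correction for `("Production Process", "Ra (mm)", "0.400-0.8")` would replace that cell with `0.750-3.5` and leave the other columns untouched.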
The feedback loop now works at two levels:
- Value-level feedback
  - exact corrected values are reused when the same extracted value appears again
- Pattern-aware feedback
  - when a reviewed correction looks like a safe OCR-style character substitution in the same row context, the script stores a reusable pattern rule
  - those rules are applied only when:
    - the same sheet is involved
    - the same review field is involved
    - the same row context/signature is present
    - the value shape matches
This helps with repeated OCR-style mistakes across similar PDFs without rewriting the Python code itself.
Important note:
- the pattern-aware layer is conservative
- it is intended for safe OCR-like substitutions and not for inventing arbitrary new numeric values
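To illustrate what a conservative pattern rule might look like, the sketch below derives a character-substitution rule from a same-length correction and applies it only to values with the same shape. Everything here is an assumption for illustration: the shape encoding, the rule layout, and the choice to reject any rule that would rewrite a digit. The sheet, review field, and row context checks are left out.

```python
def value_shape(value):
    """Collapse a value to its shape: digits become 9, letters become A."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def derive_substitutions(extracted, corrected):
    """Return {wrong_char: right_char} for a safe same-length swap, else None."""
    if len(extracted) != len(corrected) or extracted == corrected:
        return None
    swaps = {}
    for wrong, right in zip(extracted, corrected):
        if wrong == right:
            continue
        if wrong.isdigit():
            return None  # rewriting digits could invent new numeric values
        if swaps.setdefault(wrong, right) != right:
            return None  # one character mapped to two targets: not a safe rule
    return swaps


def apply_rule(value, rule):
    """Apply a stored rule only when the value has the shape it was learned from."""
    if value_shape(value) != rule["shape"]:
        return value
    return "".join(rule["swaps"].get(c, c) for c in value)
```

For example, a review that corrects `1O.5` to `10.5` yields the swap `O` to `0`, which would later turn `2O.7` into `20.7`. A correction such as `0.400-0.8` to `0.750-3.5` produces no rule in this sketch and would be handled by value-level feedback only.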
- Run extraction:
  python extract_production_process_table.py
- Open the generated workbook and fill Human Evaluation.
- Run the script again:
  python extract_production_process_table.py
- The script learns from the reviewed workbook and applies matching corrections.

- Correct means the value is confirmed as-is.
- Any other non-empty Human Evaluation text is treated as the corrected value.
- Blank Human Evaluation means not reviewed yet.
- Corrections are safest when Review Field points to the exact column being reviewed.
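These review rules can be summarized in a small helper like the one sketched below. `interpret_evaluation` is a hypothetical name, and treating `Correct` case-insensitively and ignoring surrounding whitespace are assumptions.

```python
def interpret_evaluation(extracted, evaluation):
    """Classify one reviewed cell as unreviewed, confirmed, or corrected."""
    text = "" if evaluation is None else str(evaluation).strip()
    if not text:
        return "unreviewed", extracted
    if text.lower() == "correct":
        return "confirmed", extracted
    # Any other non-empty text is taken as the corrected value.
    return "corrected", text
```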
ABB Greasing writes one section per detected ABB regreasing card occurrence.
Each occurrence includes:
- bearings
- amount of grease
- greased in factory with
- the 4-column grease table
The sheet includes occurrence and page markers such as:
Occurrence 1 | Page 1
Occurrence 2 | Page 2
This means the selected extractor did not find its expected pattern in that PDF.
No selected extractor found usable content in the PDF.
Check:
- pytesseract is installed
- Tesseract OCR engine is installed
- tesseract.exe exists in a standard Windows path
If the Excel file is open, the script writes to a fallback filename ending in _updated.xlsx.
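The fallback can be sketched as a retry on `PermissionError`, which is what a write typically raises on Windows when the target workbook is open in Excel. The helper names are hypothetical, and `save` stands for any callable that writes the workbook to a given path.

```python
from pathlib import Path


def output_paths(pdf_path):
    """Return the primary and fallback workbook paths for a PDF."""
    pdf = Path(pdf_path)
    primary = pdf.with_name(f"{pdf.stem}_tables.xlsx")
    fallback = pdf.with_name(f"{pdf.stem}_tables_updated.xlsx")
    return primary, fallback


def save_with_fallback(save, pdf_path):
    """Call save(path) on the primary name, retrying on the fallback if locked."""
    primary, fallback = output_paths(pdf_path)
    try:
        save(primary)
        return primary
    except PermissionError:
        save(fallback)
        return fallback
```

With openpyxl, `save` could be something like `lambda path: workbook.save(str(path))`, where `workbook` is the `Workbook` being written.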
For repeated vendor PDFs:
- Add the PDF name and extractor names to extract_requests.txt.
- Run the script.
- Review the workbook.
- Fill Human Evaluation.
- Re-run the script so it learns from the review.
- Let the history and feedback files improve future extraction quality.