PDF Extraction Tool

This project extracts structured data from PDF documents and writes the results into Excel workbooks.

It supports:

rule-based extraction for known table and figure patterns
OCR fallback for scanned/image-based PDFs
per-PDF extractor selection through extract_requests.txt
lightweight history-based extractor recommendation
human-in-the-loop feedback through reviewed Excel files

Files

extract_production_process_table.py: main extraction script
extract_requests.txt: per-PDF extractor configuration
keyword_aliases.json: canonical keywords and their aliases
extraction_history.json: learned historical pattern memory
feedback_history.json: learned human feedback corrections

Requirements

Install the main Python packages:

python -m pip install pdfplumber pypdfium2 openpyxl pytesseract

For scanned PDFs, install the Tesseract OCR engine as well.

Expected Windows install path:

C:\Program Files\Tesseract-OCR\tesseract.exe

The script auto-detects that location.
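
A minimal sketch of how this kind of auto-detection could work, assuming the script checks PATH first and then the expected install path. The names find_tesseract, configure_pytesseract, and DEFAULT_CANDIDATES are hypothetical and are not taken from the script.

```python
import shutil
from pathlib import Path

# Only this path is documented in the README; the list itself is an assumption.
DEFAULT_CANDIDATES = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]


def find_tesseract(candidates=DEFAULT_CANDIDATES, use_path=True):
    """Return the first usable tesseract executable, or None if none is found."""
    if use_path:
        on_path = shutil.which("tesseract")
        if on_path:
            return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract():
    """Point pytesseract at the detected executable (requires pytesseract)."""
    import pytesseract

    found = find_tesseract()
    if found is None:
        raise RuntimeError("Tesseract OCR engine not found")
    # pytesseract reads the executable location from this module attribute.
    pytesseract.pytesseract.tesseract_cmd = found
    return found
```

If Tesseract is installed somewhere else, passing that location in the candidates list is enough for this sketch to pick it up.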

Supported Extractors

Current extractor names:

Production Process
Table 1
Table 3 ASTM
Table 6
CI-4 Hardness
ISO VG 10
Fig 1 Temperatures
Product Tables
Greases
OXX
ABB Greasing

These names are used verbatim in extract_requests.txt and as the Excel sheet names.

Basic Execution

Run with the default request file:

python extract_production_process_table.py

Run for a specific PDF:

python extract_production_process_table.py myfile.pdf

Run for a specific PDF and request file:

python extract_production_process_table.py myfile.pdf extract_requests.txt

Run while also passing reviewed Excel workbooks explicitly:

python extract_production_process_table.py myfile.pdf extract_requests.txt reviewed_output.xlsx
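
The command forms above could be handled by sorting arguments by file extension. This is only a sketch under that assumption; split_args is a hypothetical helper, and the real script may rely on argument position instead.

```python
def split_args(argv):
    """Sort command-line arguments into (pdf, request_file, reviewed_workbooks)."""
    pdf = None
    request_file = "extract_requests.txt"  # default request file named in this README
    reviewed = []
    for arg in argv:
        lower = arg.lower()
        if lower.endswith(".pdf"):
            pdf = arg
        elif lower.endswith(".xlsx"):
            reviewed.append(arg)
        elif lower.endswith(".txt"):
            request_file = arg
    return pdf, request_file, reviewed


if __name__ == "__main__":
    import sys

    print(split_args(sys.argv[1:]))
```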

Request File Format

The script reads extraction requests from extract_requests.txt.

Requests can use either:

exact extractor names such as Greases
canonical business keywords such as Grease

If a canonical keyword is used, the script resolves it through keyword_aliases.json and maps it to the appropriate extractor.

You can define:

Global options for a single PDF:

Greases, OXX

Per-PDF options for batch processing:

testLubricant.pdf: Grease
prodTable.pdf: Product Tables, OXX
ABB.pdf: ABB Greasing
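
A minimal sketch of reading both forms from one request file, assuming a line with a "name.pdf:" prefix is a per-PDF entry and any other non-empty line is a global list. parse_requests is a hypothetical name; the script's own parser may differ.

```python
def parse_requests(text):
    """Split request text into (global_names, per_pdf_names)."""
    global_names = []
    per_pdf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # rpartition keeps Windows drive letters such as "D:\..." in the PDF name.
        head, sep, tail = line.rpartition(":")
        if sep and head.strip().lower().endswith(".pdf"):
            names = [n.strip() for n in tail.split(",") if n.strip()]
            per_pdf[head.strip()] = names
        else:
            global_names.extend(n.strip() for n in line.split(",") if n.strip())
    return global_names, per_pdf
```

The returned names are still raw request strings; exact extractor names and canonical keywords are told apart in a later step.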

If no explicit request is available for a PDF, the history-based recommender may suggest extractors automatically.

Canonical Keyword Aliases

The file keyword_aliases.json stores canonical keywords and all of the aliases that should be searched in the PDF.

Example:

{
  "Grease": {
    "extractor": "Greases",
    "aliases": [
      "special ball bearing grease",
      "high performance greases",
      "greases"
    ]
  }
}

How it is used:

You put Grease in extract_requests.txt
The script maps Grease to the Greases extractor
The Greases extractor searches the PDF for all configured aliases
If any of those aliases are found, the matching pages are processed
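
The steps above can be sketched as follows. The JSON layout is the one shown in the example; the function names and the case-insensitive substring match are assumptions made for illustration.

```python
import json

ALIASES_JSON = """
{
  "Grease": {
    "extractor": "Greases",
    "aliases": ["special ball bearing grease", "high performance greases", "greases"]
  }
}
"""


def resolve_request(name, alias_map, extractor_names):
    """Map a request string to (extractor name, aliases to search for)."""
    if name in extractor_names:
        return name, []  # already an exact extractor name
    entry = alias_map.get(name)
    if entry is None:
        raise KeyError(f"Unknown extractor or keyword: {name}")
    return entry["extractor"], entry["aliases"]


def pages_matching(page_texts, aliases):
    """Return 1-based page numbers whose text contains any alias."""
    lowered = [alias.lower() for alias in aliases]
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if any(alias in text.lower() for alias in lowered)
    ]


alias_map = json.loads(ALIASES_JSON)
extractor, aliases = resolve_request("Grease", alias_map, {"Greases", "OXX"})
```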

Output Files

For each PDF, the script creates one Excel workbook:

<pdf_stem>_tables.xlsx

Examples:

testLubricant_tables.xlsx
prodTable_tables.xlsx
ABB_tables.xlsx

If the workbook is open and locked, the script saves instead to:

<pdf_stem>_tables_updated.xlsx
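
A minimal sketch of the naming and fallback behavior, assuming a locked workbook surfaces as PermissionError when saving (as it typically does on Windows while the file is open in Excel). output_paths and save_with_fallback are hypothetical helpers; save can be any callable that takes a path, such as the save method of an openpyxl Workbook.

```python
from pathlib import Path


def output_paths(pdf_path):
    """Return (primary, fallback) workbook paths next to the PDF."""
    pdf = Path(pdf_path)
    primary = pdf.with_name(f"{pdf.stem}_tables.xlsx")
    fallback = pdf.with_name(f"{pdf.stem}_tables_updated.xlsx")
    return primary, fallback


def save_with_fallback(save, pdf_path):
    """Save to the primary path; retry on the fallback path if it is locked."""
    primary, fallback = output_paths(pdf_path)
    try:
        save(primary)
        return primary
    except PermissionError:
        save(fallback)
        return fallback
```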

Progress Output

The script prints progress to the console, including:

selected extractor count
start/completion message for each extractor
row count per extractor
total time taken

Example:

PDF: D:\App\pdfscrapper\testLubricant.pdf
Selected extractors: 1
Starting Greases...
Completed Greases in 0.24s with 13 rows.
Extracted 13 rows across 1 sheets to: D:\App\pdfscrapper\testLubricant_tables.xlsx
Total time taken: 0.27s
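
Per-extractor lines like these can be produced by a small timing wrapper. This is a sketch, not the script's code; run_extractor is a hypothetical name and extract stands for any callable that returns a list of rows.

```python
import time


def run_extractor(name, extract):
    """Run one extractor and print start and completion messages."""
    print(f"Starting {name}...")
    started = time.perf_counter()
    rows = extract()
    elapsed = time.perf_counter() - started
    print(f"Completed {name} in {elapsed:.2f}s with {len(rows)} rows.")
    return rows
```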

OCR Behavior

The script first tries native PDF text extraction where possible.

If a page is scanned or image-based, it falls back to OCR.

This is used for:

scanned ABB regreasing cards
image-heavy figure extraction
scanned table lookups where text extraction is weak
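
The native-first, OCR-second decision can be sketched as below. The min_chars threshold and the function name are assumptions; the README does not say how the script decides that a page is scanned.

```python
def page_text_with_fallback(native_text, run_ocr, min_chars=20):
    """Return (text, source), where source is "native" or "ocr".

    native_text: text layer of the page, possibly None or empty.
    run_ocr: zero-argument callable that performs OCR only when needed.
    """
    text = (native_text or "").strip()
    if len(text) >= min_chars:
        return text, "native"
    return run_ocr().strip(), "ocr"


# One possible wiring with the packages listed under Requirements:
#   native_text = page.extract_text()            # pdfplumber page
#   run_ocr = lambda: pytesseract.image_to_string(
#       page.to_image(resolution=300).original)  # PIL image of the page
```

Passing OCR as a callable keeps the expensive step lazy, so pages with a usable text layer never trigger it.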

History-Based Recommendation

The script maintains a lightweight memory in extraction_history.json.

How it works:

Successful extractions store page snippets and extractor names.
When a new PDF has no explicit request entry, the script scores its pages against historical snippets.
The best-matching extractors are recommended and can be run automatically.

This is a lightweight ML-style similarity layer, not a trained neural network.
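
To illustrate the idea, the sketch below scores pages against stored snippets with Jaccard token overlap. The similarity measure, the threshold, and the history layout (a list of extractor/snippet records) are assumptions; the README does not describe the actual scoring or the structure of extraction_history.json.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(page_texts, history, threshold=0.3):
    """Return extractor names whose snippets match some page, best first."""
    page_tokens = [tokens(text) for text in page_texts]
    best = {}
    for entry in history:
        snippet_tokens = tokens(entry["snippet"])
        score = max((jaccard(p, snippet_tokens) for p in page_tokens), default=0.0)
        if score >= threshold:
            name = entry["extractor"]
            best[name] = max(score, best.get(name, 0.0))
    return sorted(best, key=best.get, reverse=True)
```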

Human Feedback Loop

Every output row includes two review columns:

Review Field
Human Evaluation

How to Review

After the workbook is generated:

Open the Excel file.
Review the extracted values.
Leave Review Field as-is unless you want to target a different column.
In Human Evaluation:

write Correct if the extracted value is right
write the corrected value if the extracted value is wrong

Example

If a row looks like this:

Production Process	Rt (mm)	Ra (mm)	Review Field	Human Evaluation
grinding	2.00-6.0	0.400-0.8	Ra (mm)	0.750-3.5

then on the next run the script can learn that correction and apply it automatically.

How Feedback Is Learned

The script reads reviewed workbooks from:

explicitly passed .xlsx arguments
any *_tables.xlsx files in the current folder
any *_tables_updated.xlsx files in the current folder
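
A sketch of collecting those sources, assuming simple glob patterns in one folder. find_reviewed_workbooks is a hypothetical helper, and skipping Excel's "~$" lock files is an added assumption rather than documented behavior.

```python
from pathlib import Path


def find_reviewed_workbooks(explicit_args=(), folder="."):
    """Return reviewed workbook paths: explicit .xlsx arguments first, then matches in folder."""
    found = [Path(arg) for arg in explicit_args if str(arg).lower().endswith(".xlsx")]
    for pattern in ("*_tables.xlsx", "*_tables_updated.xlsx"):
        found.extend(sorted(Path(folder).glob(pattern)))
    unique, seen = [], set()
    for path in found:
        if path.name.startswith("~$"):
            continue  # temporary lock file created while a workbook is open
        key = path.resolve()
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```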

It stores the following in feedback_history.json:

confirmations
corrections
safe pattern-aware rules

Then, during future extraction runs, matching values are auto-corrected before writing the next workbook.
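
Value-level auto-correction can be pictured as a lookup keyed by sheet, field, and extracted value. The in-memory layout below is hypothetical; the README does not document how feedback_history.json is structured.

```python
def apply_corrections(corrections, sheet, row):
    """Return a copy of row with known corrections applied.

    corrections maps (sheet, field, extracted_value) to the corrected value.
    """
    fixed = dict(row)
    for field, value in row.items():
        fixed[field] = corrections.get((sheet, field, value), value)
    return fixed


corrections = {("Production Process", "Ra (mm)", "0.400-0.8"): "0.750-3.5"}
row = {"Production Process": "grinding", "Rt (mm)": "2.00-6.0", "Ra (mm)": "0.400-0.8"}
fixed = apply_corrections(corrections, "Production Process", row)
```

Keying on the sheet and field as well as the value keeps a correction from leaking into unrelated columns that happen to contain the same text.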

Pattern-Aware Feedback

The feedback loop works at two levels:

Value-level feedback

exact corrected values are reused when the same extracted value appears again

Pattern-aware feedback

when a reviewed correction looks like a safe OCR-style character substitution in the same row context, the script stores a reusable pattern rule
those rules are applied only when:
- the same sheet is involved
- the same review field is involved
- the same row context/signature is present
- the value shape matches

This helps with repeated OCR-style mistakes across similar PDFs without rewriting the Python code itself.

Important note:

the pattern-aware layer is conservative
it is intended for safe OCR-like substitutions and not for inventing arbitrary new numeric values
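
The sketch below shows one way a conservative rule of this kind could be derived and applied. The confusable character pairs, the shape encoding, and all function names are assumptions for illustration; the sheet, review field, and row-signature checks listed above are left out and would be additional key comparisons.

```python
# Hypothetical set of (wrong, right) pairs treated as safe OCR confusions.
CONFUSABLE = {("O", "0"), ("l", "1"), ("I", "1"), ("S", "5"), ("B", "8")}


def shape(value):
    """Collapse a value to its shape: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def ocr_substitutions(extracted, corrected):
    """Return the (wrong, right) pairs if every difference is a known confusion, else None."""
    if len(extracted) != len(corrected):
        return None
    pairs = [(a, b) for a, b in zip(extracted, corrected) if a != b]
    if pairs and all(pair in CONFUSABLE for pair in pairs):
        return pairs
    return None


def apply_pairs(value, pairs, rule_shape):
    """Apply the substitutions only when the value has the same shape as the learned case."""
    if shape(value) != rule_shape:
        return value
    table = dict(pairs)
    return "".join(table.get(c, c) for c in value)
```

Under these assumptions, a review that changes 2.O0-6.0 to 2.00-6.0 yields a reusable rule, while a correction such as 0.400-0.8 to 0.750-3.5 does not, because its differences are new numeric values rather than character confusions.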

Feedback Workflow

Run extraction:

python extract_production_process_table.py

Open the generated workbook and fill Human Evaluation.
Run the script again:

python extract_production_process_table.py

The script learns from the reviewed workbook and applies matching corrections.

Notes

Correct means the value is confirmed as-is.
Any other non-empty Human Evaluation text is treated as the corrected value.
Blank Human Evaluation means not reviewed yet.
Corrections are safest when Review Field points to the exact column being reviewed.
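
These rules can be summarized in a small function. It is a sketch: classify_evaluation is a hypothetical name, and treating Correct case-insensitively is an assumption.

```python
def classify_evaluation(extracted, human_evaluation):
    """Return (status, value) for one reviewed cell.

    status is "unreviewed", "confirmed", or "corrected".
    """
    text = (human_evaluation or "").strip()
    if not text:
        return "unreviewed", extracted
    if text.lower() == "correct":
        return "confirmed", extracted
    return "corrected", text
```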

ABB Greasing Output

ABB Greasing writes one section per detected ABB regreasing card occurrence.

Each occurrence includes:

bearings
amount of grease
greased in factory with
the 4-column grease table

The sheet includes occurrence and page markers such as:

Occurrence 1 | Page 1
Occurrence 2 | Page 2

Troubleshooting

`Skipping <name>: no details were found.`

This means the selected extractor did not find its expected pattern in that PDF.

`No details were found.`

No selected extractor found usable content in the PDF.

OCR is not working

Check:

pytesseract is installed
Tesseract OCR engine is installed
tesseract.exe exists in a standard Windows path

Workbook is locked

If the Excel file is open, the script writes to a fallback filename ending in _updated.xlsx.

Recommended Usage Pattern

For repeated vendor PDFs:

Add the PDF name and extractor names to extract_requests.txt.
Run the script.
Review the workbook.
Fill Human Evaluation.
Re-run the script so it learns from the review.
Let the history and feedback files improve future extraction quality.
