# Prerequisites

**Poppler**

To install Poppler on Windows, you can follow these steps:

1. **Download Poppler for Windows**: Visit the official Poppler for Windows repository on GitHub. Look for the latest release and download the pre-built binaries. You can find the repository here: [Poppler for Windows](https://github.com/oschwartz10612/poppler-windows/releases)

2. **Extract the downloaded files**: Once the download is complete, extract the contents of the downloaded ZIP file to a location on your computer. You can use any extraction tool like 7-Zip or WinRAR.

3. **Set up Environment Variables (Optional)**: To make it easier to use Poppler from the command line, you can add the path to the extracted Poppler directory to your system's PATH environment variable. This step is optional but recommended. Here's how you can do it:
   - Right-click on "This PC" or "My Computer" and select "Properties".
   - Click on "Advanced system settings" on the left side.
   - In the System Properties window, click on the "Environment Variables" button.
   - In the Environment Variables window, find the "Path" variable in the System Variables section and click "Edit".
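   - Add the path to the "bin" directory inside the extracted Poppler directory to the list of paths. For example, if you extracted Poppler to "C:\poppler-xx.xx.xx", add "C:\poppler-xx.xx.xx\bin" to the list. In some releases the binaries sit under "Library\bin" instead; the scripts later in this document use "C:\poppler-24.02.0\Library\bin".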
   - Click "OK" to save your changes.

4. **Verify Installation**: To verify that Poppler is installed correctly, open Command Prompt and type `pdftotext -v`. This command should display the version information of Poppler if it's installed correctly.

That's it! Poppler should now be installed on your Windows system, and you can use it to work with PDF files from the command line or integrate it into your applications.
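The check in step 4 can also be scripted. The following is a minimal sketch, assuming Poppler's `bin` directory is on PATH; `tool_version` is a hypothetical helper name and not part of Poppler.

```python
import shutil
import subprocess


def tool_version(name):
    """Run `<name> -v` and return its output, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    # The version banner may be written to stderr or stdout, so return both
    return (result.stdout + result.stderr).strip()


version = tool_version("pdftotext")
print(version if version is not None else "pdftotext not found; check your PATH")
```

If the helper returns `None`, the PATH entry from step 3 is the first thing to re-check.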

**Modules**

To install Pytesseract OCR on Windows, you can follow these steps:

1. **Install Tesseract OCR Engine**: Pytesseract is a Python wrapper for the Tesseract OCR engine, so the engine itself must be installed first. You can find the Windows installer through the official GitHub repository: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract). Follow the instructions provided in the repository to install Tesseract on your Windows system, and note the installation path, because step 6 may need it.

2. **Install Python**: If you haven't already, download and install Python on your Windows system. You can download the latest version of Python from the official website: [Python Downloads](https://www.python.org/downloads/). Make sure to check the option to add Python to your system PATH during installation.

3. **Install Pytesseract**: Once Python is installed, you can install Pytesseract using pip, the Python package manager. Open Command Prompt and run the following command:
   ```
   pip install pytesseract
   ```

4. **Install Pillow (PIL Fork)**: Pytesseract requires the Pillow library (Python Imaging Library) to work with images. You can install Pillow using pip:
   ```
   pip install pillow
   ```
   The scripts later in this document also import `pdf2image`, which can be installed the same way with `pip install pdf2image`. It relies on the Poppler binaries set up above.

5. **Verify Installation**: After installation, you can verify that Pytesseract is installed correctly by opening a Python interpreter and trying to import it:
   ```python
   import pytesseract
   ```

6. **Set Tesseract Path (Optional)**: If Tesseract is installed in a non-standard location or if Pytesseract is unable to find it, you can specify the path to the Tesseract executable using the `pytesseract.pytesseract.tesseract_cmd` variable in your Python script. For example:
   ```python
   import pytesseract
   pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
   ```

That's it! Pytesseract OCR should now be installed on your Windows system, and you can use it in your Python projects to perform OCR (Optical Character Recognition) on images and PDFs.
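Step 6 can be made a little more robust by looking for Tesseract on PATH before falling back to a fixed location. This is only a sketch: `locate_tesseract` and `configure_pytesseract` are hypothetical helper names, and the default path is simply the one used in the example above.

```python
import os
import shutil

# Default install location from the example above; adjust it if yours differs.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def locate_tesseract(default=DEFAULT_TESSERACT):
    """Prefer a tesseract executable on PATH, then fall back to the default path."""
    found = shutil.which("tesseract")
    if found:
        return found
    return default if os.path.isfile(default) else None


def configure_pytesseract():
    """Point pytesseract at the located executable; return the path used, or None."""
    exe = locate_tesseract()
    if exe is not None:
        import pytesseract  # imported lazily so locate_tesseract works without it

        pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

Calling `configure_pytesseract()` once at the top of a script would replace the hard-coded assignment; a return value of `None` means Tesseract was not found in either place.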

**EasyOCR and Keras-OCR**

To install EasyOCR and Keras-OCR on Windows, you can follow these steps:

**1. EasyOCR Installation:**

EasyOCR is a Python package for performing Optical Character Recognition (OCR) easily. Here's how you can install it:

1. **Install Python**: If you haven't already, download and install Python on your Windows system from the official website: [Python Downloads](https://www.python.org/downloads/). Make sure to check the option to add Python to your system PATH during installation.

2. **Install EasyOCR**: Open Command Prompt and run the following command to install EasyOCR using pip:
   ```
   pip install easyocr
   ```

3. **Verify Installation**: After installation, you can verify that EasyOCR is installed correctly by opening a Python interpreter and trying to import it:
   ```python
   import easyocr
   ```

**2. Keras-OCR Installation:**

Keras-OCR is another Python package for OCR based on Keras and TensorFlow. Here's how you can install it:

1. **Install TensorFlow and Keras**: Keras-OCR depends on TensorFlow and Keras. Install these libraries using pip:
   ```
   pip install tensorflow keras
   ```

2. **Install Keras-OCR**: After installing TensorFlow and Keras, you can install Keras-OCR using pip:
   ```
   pip install keras-ocr
   ```

3. **Verify Installation**: After installation, you can verify that Keras-OCR is installed correctly by opening a Python interpreter and trying to import it:
   ```python
   import keras_ocr
   ```

That's it! You have now installed EasyOCR and Keras-OCR on your Windows system. You can use these libraries in your Python projects to perform Optical Character Recognition on images and documents.
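The individual import checks above can be combined into one pass. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the list holds import names (for example `PIL` and `keras_ocr`), not pip package names.

```python
import importlib.util

# Import names (not pip names) used by the scripts in this document.
REQUIRED = ["pdf2image", "PIL", "pytesseract", "easyocr", "keras_ocr"]


def missing_modules(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED) or "none")
```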

**Important**

Keras-OCR is an OCR library based on Keras and TensorFlow, so its TensorFlow and CUDA requirements are the same as those of the TensorFlow library. Here are the requirements for Keras-OCR for TensorFlow and CUDA:

TensorFlow version requirements:

- TensorFlow 2.x version is the preferred version for Keras-OCR. It is recommended to use the latest TensorFlow 2.x version as it usually contains the latest features and improvements.
- TensorFlow 1.x versions may no longer be supported by Keras-OCR because Keras-OCR is built on the Keras API, and TensorFlow 2.x has integrated Keras as its main high-level API.

CUDA requirements:

If you plan to run Keras-OCR on a GPU, you need to ensure that your system meets the following CUDA requirements:

- CUDA Toolkit version: The TensorFlow version supported by Keras-OCR is generally compatible with a specific CUDA Toolkit version. You need to check the CUDA Toolkit version required by your installed version of TensorFlow. Generally speaking, the official TensorFlow documentation will provide this information.
- GPU driver: You need to install a GPU driver that is compatible with the version of CUDA Toolkit you are using.
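Whether the installed TensorFlow, CUDA Toolkit, and driver actually work together can be checked from Python. A minimal sketch, assuming TensorFlow 2.x; `visible_gpus` is a hypothetical helper name.

```python
def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed")
elif not gpus:
    print("TensorFlow sees no GPU; check the CUDA Toolkit and driver versions")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list usually points to a version mismatch rather than a Keras-OCR problem, since Keras-OCR simply runs on whatever device TensorFlow exposes.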

# Pytesseract OCR

## pyttesseract_final_version_with_extract_for_the_first_page

```python
from pdf2image import convert_from_path
from PIL import Image
import pytesseract

# Specify the installation paths of Tesseract and Poppler for Windows users
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
poppler_path = r"C:\poppler-24.02.0\Library\bin"

# Path to your PDF document
pdf_path = r'pdf_files\236-020-WAS-120_1.pdf'

# Convert the PDF to images (all pages), using the Poppler binaries configured above
images = convert_from_path(pdf_path, poppler_path=poppler_path)

# Define coordinates for cropping specific regions
regions = {
    "TITLE": ((200, 175), (2200, 420)),
    "REV_TABLE": ((2504, 4160), (4133, 4520)),
    "DETAILS": ((4133, 4162), (4338, 4678)),
    "CUSTOMER": ((4338, 4162), (4890, 4518)),
    "Information": ((5480, 4162), (6464, 4519))
}

# Define NOTES section coordinates
notes_coords = (180, 4165, 2450, 4500)

# Function to extract text from given image regions
def extract_text(images, regions, config=''):
    extracted_texts = {}
    for label, ((x1, y1), (x2, y2)) in regions.items():
        for image in images:
            cropped_image = image.crop((x1, y1, x2, y2))
            text = pytesseract.image_to_string(cropped_image, config=config if label == 'Information' else '')
            if text.strip():
                extracted_texts[label] = text.strip()
                break  # Stop after finding text for a region
    return extracted_texts

# Extract text from specified regions on the first page
extracted_texts = extract_text(images[:1], regions)

# Print extracted text
for label, text in extracted_texts.items():
    print(f"Text from {label}:")
    print(text)
    print("-" * 50)

# Process NOTES sections from all pages
unique_notes = set()
for page_num, image in enumerate(images, start=1):
    notes_section = image.crop(notes_coords)
    notes_text = pytesseract.image_to_string(notes_section).strip()

    if notes_text and notes_text not in unique_notes:
        unique_notes.add(notes_text)
        print(f"Page {page_num} has unique NOTES:")
        print(notes_text)
        print("-" * 50)
```

```
Text from TITLE:
ADBRI BIRKENHEAD 1000T RMS SILO
SILO RING BEAM WELDMENT
--------------------------------------------------
Text from REV_TABLE:
1__| 28/06/2023 | LOAD CELL MOUNTS CHANGED AZ | DH | DP | HV
0 | 20/04/2023 | ISSUED FOR CONSTRUCTION MS | DH | DP | HV
REV| DATE REVISION HISTORY DRN|CHK|ENG| PM
--------------------------------------------------
Text from DETAILS:
Al

DO NOT SCALE

©

tT
ALL DIMENSIONS IN MILLIMETRES
DRAWING PRACTICE TO AS1100

IF IN DOUBT ASK

Tolerances U.N.O.

On dimensions
QODECPLACE 20.5
1DECPLACE © #0.2
2DECPLACE £0.10
All Angles 405°

| is not to be copied, t
--------------------------------------------------
Text from CUSTOMER:
Customer: MCMAHON SERVICES/ABC.
Reference : 236-020
--------------------------------------------------
Text from Information:
Status :

ISSUED FOR CONSTRUCTION

Drawn : AZ Date : 25/07/2022 | A3 Scale = 1.5x | A1 Sheet Scale=| 1:50
Desorption : 1000T RMS SILO
SILO RING BEAM
GENERAL ASSEMBLY
Client Code | Job Number Type Drawin
```

## Documentation

**Introduction:**

This section documents a Python script that uses `pdf2image`, Pillow (`PIL`), and `pytesseract` to convert PDF pages into images and extract text from predefined regions of those images.

**1. Prerequisites:**
   - Python installation on your system.
   - Essential Python libraries installed (`pdf2image`, Pillow (imported as `PIL`), `pytesseract`).
   - Tesseract OCR and Poppler installed on your system. For Windows users, ensure the correct installation paths are specified.

**2. Setting Up Tesseract for Windows Users:**
   - For Windows users, specify the installation path of Tesseract using `pytesseract.pytesseract.tesseract_cmd`.

**3. Usage:**
   - Define the path to your PDF document (`pdf_path`).
   - Specify coordinates for cropping specific regions within the document.
   - Optionally, define coordinates for extracting the "NOTES" section from each page.
   - Run the provided Python script to extract text from the specified regions and NOTES sections.

**4. Code Explanation:**

   1. **Importing Libraries:**
      - Necessary libraries are imported, including `pdf2image`, `PIL`, and `pytesseract`.

   2. **Setting Tesseract Path (For Windows Users):**
      - Specifies the path to the Tesseract executable for Windows users.

   3. **Defining PDF Path:**
      - Specifies the path to the PDF document for processing.

   4. **Converting PDF to Images:**
      - Uses `convert_from_path` to convert PDF pages into images.

   5. **Defining Coordinates for Cropping Specific Regions:**
      - Defines coordinates for cropping specific regions within the PDF.

   6. **Defining NOTES Section Coordinates:**
      - Defines coordinates for cropping the NOTES section from each page.

   7. **Function to Extract Text from Image Regions:**
      - Defines `extract_text` function to extract text from specified regions using Tesseract OCR.

   8. **Extracting Text from Specified Regions on the First Page:**
      - Extracts text from specified regions on the first page using `extract_text`.

   9. **Printing Extracted Text:**
      - Prints extracted text from each region on the first page.

   10. **Processing NOTES Sections from All Pages:**
       - Iterates over each page to extract NOTES sections using Tesseract OCR.

   11. **Printing Unique NOTES Sections:**
       - Prints unique NOTES sections from each page.
   
**5. Output:**
   - Extracted text from specified regions on the first page is printed.
   - Unique NOTES sections from each page are printed.

**6. Conclusion:**
   - The provided Python script offers a convenient method for text extraction from PDF documents, facilitating the analysis of specific sections programmatically.

**Note:** Ensure all dependencies are installed and configured correctly before executing the script.


1. Import the required libraries:
```python
from pdf2image import convert_from_path # Used to convert PDF to image
from PIL import Image # Python Imaging Library, used for image processing
import pytesseract # used to perform OCR operations
```

2. Set the installation path of Tesseract (for Windows users only):
```python
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```
This line of code specifies the installation path of Tesseract OCR so that the `pytesseract` library can find and use it.

3. Specify the path to the PDF file:
```python
pdf_path = r'pdf_files\236-020-WAS-120_1.pdf'
```
You need to replace `pdf_files\236-020-WAS-120_1.pdf` with your actual PDF file path.

4. Convert PDF to image using `convert_from_path` function:
```python
images = convert_from_path(pdf_path)
```
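This line converts every page of the PDF into a PIL image object and stores the resulting list in the `images` variable, one image per page. If Poppler is not on your PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at Poppler's `bin` directory.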

5. Define the coordinates of the specific area to be extracted:
```python
regions = {
    "TITLE": ((200, 175), (2200, 420)),
    "REV_TABLE": ((2504, 4160), (4133, 4520)),
    "DETAILS": ((4133, 4162), (4338, 4678)),
    "CUSTOMER": ((4338, 4162), (4890, 4518)),
    "Information": ((5480, 4162), (6464, 4519))
}
```
These are the regions to be read from the first page. Each region is given by the pixel coordinates of its upper left and lower right corners in the rendered page image, so the values depend on the page size and on the resolution used for the conversion.
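Because the coordinates are pixel positions, they need rescaling if the pages are rendered at a different resolution. The sketch below shows one way to do that; `to_box` and `scale_box` are hypothetical helpers, and the 200 DPI starting point is an assumption about how the original coordinates were measured.

```python
def to_box(corners):
    """Convert ((x1, y1), (x2, y2)) into the (left, upper, right, lower) tuple used by Image.crop."""
    (x1, y1), (x2, y2) = corners
    return (x1, y1, x2, y2)


def scale_box(box, from_dpi, to_dpi):
    """Rescale a pixel box measured at from_dpi to the equivalent box at to_dpi."""
    factor = to_dpi / from_dpi
    return tuple(round(value * factor) for value in box)


title = to_box(((200, 175), (2200, 420)))
print(title)                       # (200, 175, 2200, 420)
print(scale_box(title, 200, 400))  # (400, 350, 4400, 840)
```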

6. Define the coordinates of the NOTES area:
```python
notes_coords = (180, 4165, 2450, 4500)
```
This tuple holds the NOTES area in the order `(left, upper, right, lower)`, which is the box format that `Image.crop` expects.
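A box that extends past the page may yield a partly empty crop instead of an error, so it can be worth checking boxes against the page size first. A minimal sketch with a hypothetical `box_fits` helper; the page size shown is made up for illustration, and in practice `image.size` would supply it.

```python
def box_fits(box, image_size):
    """Return True if a (left, upper, right, lower) box lies fully inside an image of (width, height)."""
    left, upper, right, lower = box
    width, height = image_size
    return 0 <= left < right <= width and 0 <= upper < lower <= height


notes_coords = (180, 4165, 2450, 4500)
page_size = (7000, 5000)  # hypothetical (width, height); use image.size in practice

print(box_fits(notes_coords, page_size))     # True
print(box_fits(notes_coords, (2000, 3000)))  # False
```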

7. Define a function `extract_text` to extract text from a specified area in the image:
```python
def extract_text(images, regions, config=''):
    extracted_texts = {}
    for label, ((x1, y1), (x2, y2)) in regions.items():
        for image in images:
            cropped_image = image.crop((x1, y1, x2, y2))
            text = pytesseract.image_to_string(cropped_image, config=config if label == 'Information' else '')
            if text.strip():
                extracted_texts[label] = text.strip()
                break  # Stop after finding text for a region
    return extracted_texts
```
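This function accepts a list of images, a region dictionary, and an optional Tesseract configuration string, and returns a dictionary mapping each region label to the text extracted from it. For each region it stops at the first image that yields non-empty text, and the `config` string is applied only to the `Information` region.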
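The special case for `Information` could be generalized to a per-label configuration table. The following is a sketch of that idea and not the script's own function: `extract_regions` is a hypothetical name, and the OCR call is passed in as `ocr` so the function can be exercised without Tesseract.

```python
def extract_regions(image, regions, ocr, configs=None):
    """Crop each region from one image and OCR it, using an optional per-label config."""
    configs = configs or {}
    results = {}
    for label, ((x1, y1), (x2, y2)) in regions.items():
        cropped = image.crop((x1, y1, x2, y2))
        # Labels without an entry in configs fall back to an empty config string
        text = ocr(cropped, config=configs.get(label, "")).strip()
        if text:
            results[label] = text
    return results
```

With Tesseract installed, a call could look like `extract_regions(images[0], regions, pytesseract.image_to_string, {"Information": "--psm 6"})`, where `--psm 6` asks Tesseract to treat the crop as a single block of text. Whether that mode helps for a given title block is something to test rather than assume.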

8. Use the `extract_text` function to extract the text of the specified area from the first page:
```python
extracted_texts = extract_text(images[:1], regions)
```
Only the first page is processed here: `images[:1]` is a list containing just the first page's image.

9. Print the extracted text:
```python
for label, text in extracted_texts.items():
    print(f"Text from {label}:")
    print(text)
    print("-" * 50)
```
This code prints the extracted text from each region, separated by a separator line.
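Printed text is hard to reuse, so a natural next step is to turn `Key : value` lines into a dictionary. This is a rough sketch with a hypothetical `parse_fields` helper; the sample string is taken from the CUSTOMER output shown earlier, and noisier OCR output will need a more forgiving pattern.

```python
import re

# A label made of letters and spaces, a colon, then the value
FIELD = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def parse_fields(text):
    """Collect 'Key : value' lines from OCR text into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


sample = "Customer: MCMAHON SERVICES/ABC.\nReference : 236-020"
print(parse_fields(sample))
# {'Customer': 'MCMAHON SERVICES/ABC.', 'Reference': '236-020'}
```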

10. Process NOTES areas in all pages:
```python
unique_notes = set()
for page_num, image in enumerate(images, start=1):
    notes_section = image.crop(notes_coords)
    notes_text = pytesseract.image_to_string(notes_section).strip()

    if notes_text and notes_text not in unique_notes:
        unique_notes.add(notes_text)
        print(f"Page {page_num} has unique NOTES:")
        print(notes_text)
        print("-" * 50)
```
This code iterates over all pages, extracts the text of the NOTES area from each one, and prints it only if the same text has not already appeared on an earlier page.
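The uniqueness check compares exact strings, so two pages whose notes differ only in spacing or letter case count as different. One possible refinement is to compare a normalized form, as in the sketch below. The helper names and the sample notes are hypothetical.

```python
def normalize(text):
    """Collapse whitespace runs and ignore case so near-identical OCR results compare equal."""
    return " ".join(text.split()).casefold()


def unique_by_page(page_texts):
    """Yield (page_number, text) for the first occurrence of each distinct non-empty text."""
    seen = set()
    for page_num, text in enumerate(page_texts, start=1):
        key = normalize(text)
        if key and key not in seen:
            seen.add(key)
            yield page_num, text.strip()


# Hypothetical NOTES text for four pages
notes = ["NOTES:\n1. WELD ALL ROUND", "notes:  1. weld all round", "", "2. PAINT"]
print(list(unique_by_page(notes)))
# [(1, 'NOTES:\n1. WELD ALL ROUND'), (4, '2. PAINT')]
```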

## pyttesseract_final_version_with_extract_for_each_page

```python
from pdf2image import convert_from_path
from PIL import Image
import pytesseract

# Specify the installation paths of Tesseract and Poppler for Windows users
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
poppler_path = r"C:\poppler-24.02.0\Library\bin"

# Path to your PDF document
pdf_path = r'pdf_files\236-020-STR-001_D.pdf'  # Update this to your actual PDF file path

# Convert the PDF to images (all pages), using the Poppler binaries configured above
images = convert_from_path(pdf_path, poppler_path=poppler_path)

# Function to check if a region contains any of the specified keywords
def contains_keywords(image, coords, keywords):
    cropped_image = image.crop(coords)
    text = pytesseract.image_to_string(cropped_image)
    return any(keyword in text for keyword in keywords)

# Extract text from TITLE, DETAILS, and CUSTOMER on the first page
regions = {
    "TITLE": (200, 175, 2200, 420),
    "DETAILS": (4133, 4162, 4338, 4678),
    "CUSTOMER": (4338, 4162, 4890, 4518)
}
first_page_texts = {key: pytesseract.image_to_string(images[0].crop(value)) for key, value in regions.items()}

# Check and extract REV_TABLE if "REV" is present, indicate page number
rev_table_coords = (2504, 4160, 4133, 4520)
rev_texts = [(page_num+1, pytesseract.image_to_string(image.crop(rev_table_coords))) for page_num, image in enumerate(images) if contains_keywords(image, rev_table_coords, ['REV'])]

# Process NOTES sections from all pages without uniqueness check
notes_coords = (180, 4165, 2450, 4500)
notes_texts = [(page_num+1, pytesseract.image_to_string(image.crop(notes_coords)).strip()) for page_num, image in enumerate(images)]

# Extract INFORMATION from every page and indicate page number
information_coords = (5480, 4162, 6464, 4519)
information_texts = [(page_num+1, pytesseract.image_to_string(image.crop(information_coords)).strip()) for page_num, image in enumerate(images)]

# # Output extracted texts
# for label, text in first_page_texts.items():
#     print(f"{label}:\n{text}\n{'-'*50}")

# print("REV_TABLES:")
# for page_num, text in rev_texts:
#     print(f"Page {page_num}:\n{text}\n{'-'*50}")

# print("NOTES:")
# for page_num, note in notes_texts:
#     print(f"Page {page_num}:\n{note}\n{'-'*50}")

# print("INFORMATION:")
# for page_num, info in information_texts:
#     print(f"Page {page_num}:\n{info}\n{'-'*50}")


# Create a dictionary to store all extracted content
extracted_content = {
    "TITLE": first_page_texts.get("TITLE", ""),
    "DETAILS": first_page_texts.get("DETAILS", ""),
    "CUSTOMER": first_page_texts.get("CUSTOMER", ""),
    "REV_TABLES": rev_texts,
    "NOTES": notes_texts,
    "INFORMATION": information_texts
}

# Output all contents stored in the dictionary
print("Extracted Content Summary:")
for key, value in extracted_content.items():
    print(f"{key}:")
    if isinstance(value, list):
        # If the value is a list (such as REV_TABLES, NOTES, INFORMATION), print each item in the list separately
        for item in value:
            if isinstance(item, tuple):
                # If the items in the list are tuples (page number, text), print them separately
                print(f"Page {item[0]}:")
                print(item[1])  # Print the entire text content
                print("-" * 50)  # Separator line
            else:
                print(item)  # Print the item directly
                print("-" * 50)  # Separator line
    else:
        # If the value is not a list (such as TITLE, DETAILS, CUSTOMER), print directly
        print(value)
        print("-" * 50)  # Separator line
```

```
Extracted Content Summary:
TITLE:
ADBRI BIRKENHEAD

--------------------------------------------------
DETAILS:
© NOT SCALE
€
ac
ALL DIMENSIONS IN |
MILLIMETRES
ING PRACTICE TO AS1100 |
FINDOUBT ASK :

folerances U.N.O.
Dn dimensions

)DEC PLACE 40.5
}DEC PLACE 40.2

2DEC PLACE 40.10
\LLANGLES 40.5"

OR USED FC


--------------------------------------------------
CUSTOMER:
Customer : McMahon Services
Reference : 236-020

IR MANUFACTURING OR TENDERI


--------------------------------------------------
REV_TABLES:
Page 1:
01/03/2023 |ISSUED FOR CLIENT REVIEW BM | DH | DZ HV
20/01/2023 ISSUED FOR REVIEW SB | DH | DZ HV
16/12/2022 |ISSUED FOR REVIEW BM | DH | DZ HV
| 11/11/2022 |ISSUED FOR REVIEW AT | DH | DZ HV
V DATE REVISION HISTORY DRN | CHK | ENG PM

°-ROPERTY OF INGENIA LTD. NEITHER THE WHOLE NOR ANY EXTRACT MAY BE DISCLOSED COPIED

--------------------------------------------------
Page 2:
01/03/2023 |ISSUED FOR CLIENT REVIEW BM | DH | DZ HV
20/01/2023 ISSUED FOR REVIEW SB | DH | DZ
```

## Documentation

**Introduction:**

This section documents the per-page variant of the script. It uses `pdf2image` and `pytesseract` to convert every PDF page into an image, reads TITLE, DETAILS, and CUSTOMER from the first page only, and reads REV_TABLE, NOTES, and INFORMATION from every page together with the page number.

**1. Prerequisites:**
   - Python installation on your system.
   - Essential Python libraries installed (`pdf2image`, Pillow (imported as `PIL`), `pytesseract`).
   - Tesseract OCR and Poppler installed on your system. For Windows users, ensure the correct installation paths are specified.

**2. Setting Up Tesseract for Windows Users:**
   - For Windows users, specify the installation path of Tesseract using `pytesseract.pytesseract.tesseract_cmd`.

**3. Usage:**
   - Define the path to your PDF document (`pdf_path`).
   - Adjust the crop boxes (`regions`, `rev_table_coords`, `notes_coords`, `information_coords`) to match your drawing template.
   - Run the script to print the extracted content.

**4. Code Explanation:**

   1. **Imports, Paths, and Conversion:**
      - Same as in the first-page version: the libraries are imported, the Tesseract path is set, and `convert_from_path` turns each PDF page into an image.

   2. **Keyword Check:**
      - `contains_keywords` crops the image to `coords`, runs OCR on the crop, and returns `True` if any of the keywords occurs in the recognized text.

   3. **First-Page Regions:**
      - `regions` holds `(left, upper, right, lower)` boxes for TITLE, DETAILS, and CUSTOMER.
      - `first_page_texts` maps each label to the OCR text of that box on `images[0]`.

   4. **REV_TABLE on Each Page:**
      - `rev_texts` collects a `(page number, text)` pair for every page whose REV_TABLE box contains the string `REV`.

   5. **NOTES and INFORMATION on Each Page:**
      - `notes_texts` and `information_texts` hold a `(page number, text)` pair for every page. Unlike the first-page version, there is no uniqueness check on NOTES.

   6. **Collecting and Printing:**
      - `extracted_content` gathers all results in one dictionary, and the final loop prints each entry, page by page for the list-valued keys.

**5. Output:**
   - TITLE, DETAILS, and CUSTOMER from the first page are printed.
   - REV_TABLES, NOTES, and INFORMATION are printed for each page with the page number.

**Note:** Ensure all dependencies are installed and configured correctly before executing the script.
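In the per-page script, the `rev_texts` comprehension runs OCR on the REV_TABLE box twice for each matching page: once inside `contains_keywords` and once to collect the text. A possible single-pass alternative is sketched below. `collect_if_keyword` is a hypothetical helper, and the OCR call is passed in as `ocr`.

```python
def collect_if_keyword(images, coords, keywords, ocr):
    """OCR one region per page once and keep (page_number, text) when a keyword occurs."""
    results = []
    for page_num, image in enumerate(images, start=1):
        text = ocr(image.crop(coords))
        if any(keyword in text for keyword in keywords):
            results.append((page_num, text))
    return results
```

With Tesseract installed, `rev_texts = collect_if_keyword(images, rev_table_coords, ['REV'], pytesseract.image_to_string)` should give the same pairs as the comprehension while calling OCR once per page.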


# EasyOCR

```python
from pdf2image import convert_from_path
import easyocr
import numpy as np

# Initialize easyOCR reader
reader = easyocr.Reader(['en'])
poppler_path = r"C:\poppler-24.02.0\Library\bin"

def extract_text_from_pdf(pdf_path):
    # Convert the PDF to images, using the Poppler binaries configured above
    images = convert_from_path(pdf_path, poppler_path=poppler_path)
    # Initialize storage structure
    first_page_texts = {}
    rev_texts = []
    notes_texts = []
    information_texts = []

    for i, image in enumerate(images):
        image_np = np.array(image)

        if i == 0:  # Extract only from the first page
            # Extract TITLE area text
            title_region = image_np[175:420, 200:2200]
            title_text = ' '.join(reader.readtext(title_region, detail=0, paragraph=True))
            first_page_texts['TITLE'] = title_text

            # Extract DETAILS area text
            details_region = image_np[4162:4678, 4133:4338]
            details_text = ' '.join(reader.readtext(details_region, detail=0, paragraph=True))
            first_page_texts['DETAILS'] = details_text

            # Extract CUSTOMER area text
            customer_region = image_np[4162:4518, 4338:4890]
            customer_text = ' '.join(reader.readtext(customer_region, detail=0, paragraph=True))
            first_page_texts['CUSTOMER'] = customer_text

        # Check for "REV" (note: this OCRs the whole page, not just the REV_TABLE region)
        rev_text = reader.readtext(image_np, detail=0)
        if any('REV' in word for word in rev_text):
            rev_texts.append((i + 1, ' '.join(rev_text)))

        # Extract NOTES area text
        notes_region = image_np[4165:4500, 180:2450]
        notes_text = ' '.join(reader.readtext(notes_region, detail=0, paragraph=True))
        if notes_text:
            notes_texts.append((i + 1, notes_text))

        # Extract INFORMATION area text
        information_region = image_np[4162:4519, 5480:6464]
        information_text = ' '.join(reader.readtext(information_region, detail=0, paragraph=True))
        information_texts.append((i + 1, information_text))

    # Create dictionary to store all extracted content
    extracted_content = {
        "TITLE": first_page_texts.get("TITLE", ""),
        "DETAILS": first_page_texts.get("DETAILS", ""),
        "CUSTOMER": first_page_texts.get("CUSTOMER", ""),
        "REV_TABLES": rev_texts,
        "NOTES": notes_texts,
        "INFORMATION": information_texts
    }

    return extracted_content

# This is a sample PDF path, you need to change it according to the actual situation
pdf_path = 'pdf_files/236-020-STR-001_D.pdf'
extracted_content = extract_text_from_pdf(pdf_path)

# Output all contents stored in the dictionary
print("Extracted Content Summary:")
for key, value in extracted_content.items():
    print(f"{key}:")
    if isinstance(value, list):
        for item in value:
            print(f"Page {item[0]}:")
            print(item[1])  # Print the entire text content
            print("-" * 50)  # Separator line
    else:
        print(value)
        print("-" * 50)  # Separator line
```

```
Extracted Content Summary:
TITLE:
ADBRI BIRKENHEAD
--------------------------------------------------
DETAILS:
0 NOT SCALE ALL DIMENSIONS IN MILLIMETRES IING PRACTICE TO AS11OO F IN DOUBT ASK [olerances UNQ Jn dimensions DEC PLACE DEC PLACE DEC PLACE ILL ANGLES 10.5 10.2 10.10 L0.5" OR USED FC
--------------------------------------------------
CUSTOMER:
McMAHON 5 E R V 1 0 E 5 Customer McMahon Services Reference 236-020 R MANUFACTURING OR TENDERIE
--------------------------------------------------
REV_TABLES:
Page 1:
ADBRI BIRKENHEAD DRAWING LIST STRUCTURAL Sheet Number Sheet Name Current Revision 1OO0T RMS SILO TR-OOI-00? COVBERHCHEDULE R-001-100 PLAN BASE ATE PLANS 1000 EWS 103 1000 SILO PLAN VIEWS  1000 SILO PLAN VIEWS = LOWER PLATFORM PLAN VIEW VIFW -00 SECTION OWER ATFORM SECTIO R-001-303 LOWER PLATFORM SECTION SHEET R-OO1 LOWER PLATFORM SECTION SHEE R-001-305 UPPER PLATFORM SECTION VIEWS R-001-800 CAP F
```

## Documentation

**Introduction:**

This document provides guidance on extracting text from PDF documents using Python with the assistance of the `pdf2image` and `easyocr` libraries. The script extracts text from predefined regions within the PDF pages and outputs the extracted content.

**1. Prerequisites:**
   - Python installed on your system.
   - Necessary Python libraries installed (`pdf2image`, `easyocr`, `numpy`), plus Poppler, which `pdf2image` requires.
   - PDF files to be processed available.

**2. Usage:**
   - Define the path to the PDF document (`pdf_path`).
   - Run the provided Python script to extract text from the specified regions within the PDF document.

**3. Code Explanation:**

   1. **Importing Libraries:**
      - `pdf2image` is used to convert PDF pages into images.
      - `easyocr` is used for optical character recognition (OCR) to extract text from images.

   2. **Initializing easyOCR Reader:**
      - `easyocr.Reader()` is initialized with the language list set to English (`['en']`).

   3. **Function to Extract Text from PDF:**
      - `extract_text_from_pdf()` takes the PDF path as input and returns the extracted content.

   4. **Extracting Text from PDF Pages:**
      - The function iterates through each page of the PDF.
      - On the first page, text is extracted from the TITLE, DETAILS, and CUSTOMER regions using easyOCR.
      - On every page, the NOTES and INFORMATION regions are read and the page is checked for a revision table (REV_TABLES).

   5. **Storing Extracted Text:**
      - Extracted text is stored in a dictionary with keys representing the different regions.
      - For REV_TABLES, NOTES, and INFORMATION, each entry is paired with its page number.

   6. **Outputting Extracted Content:**
      - Extracted content is printed to the console, including the text from each region and the page numbers for REV_TABLES, NOTES, and INFORMATION.

**4. Sample Usage:**
   - Define the path to the PDF document to be processed (`pdf_path`).
   - Call the `extract_text_from_pdf()` function with the PDF path as input.
   - Extracted content is printed to the console.

**5. Conclusion:**
   - The provided Python script offers a straightforward method for extracting text from specific regions within PDF documents, facilitating data extraction and analysis tasks.

**Note:** Ensure all dependencies are installed and configured correctly before executing the script. Additionally, customize the script according to specific requirements such as adjusting region coordinates or language settings.
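The region coordinates are pixel positions in the rendered page image, so they depend on the resolution at which `pdf2image` renders the PDF (200 DPI by default, which is what the script uses since it passes no `dpi` argument). As a minimal sketch, assuming the regions were measured at 200 DPI, a hypothetical helper `scale_region` (not part of the script) could rescale them for another resolution:

```python
def scale_region(region, source_dpi=200, target_dpi=300):
    """Rescale a (top, bottom, left, right) pixel region between render resolutions.

    Hypothetical helper for illustration; not part of the original script.
    """
    factor = target_dpi / source_dpi
    return tuple(round(value * factor) for value in region)


# TITLE region from the script, written as (top, bottom, left, right)
title_region_200 = (175, 420, 200, 2200)
title_region_300 = scale_region(title_region_200, source_dpi=200, target_dpi=300)
print(title_region_300)
```

The rescaled region would then be applied to pages rendered with `convert_from_path(pdf_path, dpi=300)`.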

The script extracts the text content of specific areas from a PDF document and prints it to the console. The steps are as follows:

1. Import the required libraries:
```python
from pdf2image import convert_from_path # Used to convert PDF to image
import easyocr # Used to perform OCR operations
import numpy as np # Used to process image data
```

2. Initialize the easyOCR reader:
```python
reader = easyocr.Reader(['en'])
```
This line of code initializes the easyOCR reader, specifying that the language to be recognized is English (`'en'`).

3. Define a function `extract_text_from_pdf` to extract text from PDF:
```python
def extract_text_from_pdf(pdf_path):
      images = convert_from_path(pdf_path)
      #Initialize storage structure
      first_page_texts = {}
      rev_texts = []
      notes_texts = []
      information_texts = []
```
This function accepts the path to a PDF file and returns the text extracted from it. It first converts each page of the PDF to an image with `convert_from_path`, then initializes the structures that hold the extracted text.

4. Go through each page in the PDF and extract text from a specific area:
```python
      for i, image in enumerate(images):
          image_np = np.array(image)

          if i == 0: # Extract only from the first page
              #Extract TITLE area text
              title_region = image_np[175:420, 200:2200]
              title_text = ' '.join(reader.readtext(title_region, detail=0, paragraph=True))
              first_page_texts['TITLE'] = title_text

              # Extract DETAILS area text
              details_region = image_np[4162:4678, 4133:4338]
              details_text = ' '.join(reader.readtext(details_region, detail=0, paragraph=True))
              first_page_texts['DETAILS'] = details_text

              # Extract CUSTOMER area text
              customer_region = image_np[4162:4518, 4338:4890]
              customer_text = ' '.join(reader.readtext(customer_region, detail=0, paragraph=True))
              first_page_texts['CUSTOMER'] = customer_text
```
The loop visits each page image and converts it to a NumPy array for processing. On the first page only (`i == 0`), it crops the TITLE, DETAILS, and CUSTOMER regions and runs OCR on each crop.
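The slicing in this step follows NumPy's row-first convention, `image_np[top:bottom, left:right]`. A small self-contained sketch, using a synthetic array in place of a real page image (the array size and the helper name `crop_region` are assumptions for illustration):

```python
import numpy as np


def crop_region(image_np, top, bottom, left, right):
    """Return the sub-image covering rows top..bottom and columns left..right."""
    return image_np[top:bottom, left:right]


# Synthetic "page": 500 rows (height) x 2500 columns (width), 3 colour channels.
# A real rendered page is larger; this is only big enough to hold the TITLE region.
page = np.zeros((500, 2500, 3), dtype=np.uint8)

# Same numbers as the TITLE region in the script
title = crop_region(page, 175, 420, 200, 2200)
print(title.shape)  # (rows, columns, channels)
```

Rows come first, so the vertical range (175 to 420) is written before the horizontal range (200 to 2200).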

5. Extract the text of the REV_TABLES, NOTES, and INFORMATION areas:
```python
          # Check if REV_TABLE exists
          rev_text = reader.readtext(image_np, detail=0)
          if any('REV' in word for word in rev_text):
              rev_texts.append((i + 1, ' '.join(rev_text)))

          #Extract NOTES area text
          notes_region = image_np[4165:4500, 180:2450]
          notes_text = ' '.join(reader.readtext(notes_region, detail=0, paragraph=True))
          if notes_text:
              notes_texts.append((i + 1, notes_text))

          # Extract INFORMATION area text
          information_region = image_np[4162:4519, 5480:6464]
          information_text = ' '.join(reader.readtext(information_region, detail=0, paragraph=True))
          information_texts.append((i + 1, information_text))
```
Here, OCR is run on the whole page, and the page is recorded under REV_TABLES if any recognized string contains `REV`. The NOTES and INFORMATION regions are then cropped from every page and their text is stored, together with the page number, in the corresponding lists; NOTES is stored only when the text is not empty.
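The `any(...)` test can be tried in isolation. The sketch below uses made-up strings rather than real `reader.readtext` output, and the helper name `page_mentions_rev` is hypothetical:

```python
def page_mentions_rev(ocr_strings, keyword="REV"):
    """Return True if any recognized string contains the keyword (case-sensitive)."""
    return any(keyword in text for text in ocr_strings)


# Illustrative strings only, not real OCR output
with_table = ["DATE", "REVISION HISTORY", "DRN", "CHK"]
without_table = ["GENERAL ARRANGEMENT", "SECTION A"]

print(page_mentions_rev(with_table))
print(page_mentions_rev(without_table))
```

Because this is a substring match, a word such as `REVIEW` also counts, so the check may flag pages that contain no revision table.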

6. Integrate the extracted text content into a dictionary:
```python
      # Create dictionary to store all extracted content
      extracted_content = {
          "TITLE": first_page_texts.get("TITLE", ""),
          "DETAILS": first_page_texts.get("DETAILS", ""),
          "CUSTOMER": first_page_texts.get("CUSTOMER", ""),
          "REV_TABLES": rev_texts,
          "NOTES": notes_texts,
          "INFORMATION": information_texts
      }

      return extracted_content
```
Here the text extracted from different regions is integrated into a dictionary and the dictionary is returned as the extracted content.

7. Execute the function to extract and print the text content in the PDF:
```python
# This is a sample PDF path, you need to change it according to the actual situation
pdf_path = 'pdf_files/236-020-STR-001_D.pdf'
extracted_content = extract_text_from_pdf(pdf_path)

# Output all contents stored in the dictionary
print("Extracted Content Summary:")
for key, value in extracted_content.items():
      print(f"{key}:")
      if isinstance(value, list):
          for item in value:
              print(f"Page {item[0]}:")
              print(item[1]) # Print the entire text content
              print("-" * 50) # Separator line
      else:
          print(value)
          print("-" * 50) # Separator line
```
Here the function to extract text is executed and the extracted content is printed to the console.

# keras_ocr_test

In [1]:
import numpy as np
import keras_ocr
from pdf2image import convert_from_path
from PIL import Image

# Prepare the pipeline of keras_ocr, which will automatically download the pre-trained model
poppler_path = r"C:\poppler-24.02.0\Library\bin"
pipeline = keras_ocr.pipeline.Pipeline()

# Path to your PDF document
pdf_path = 'pdf_files/236-020-STR-001_D.pdf' # Update this to your actual PDF file path

# Define the coordinates of each specific area
regions = {
     "TITLE": [(200, 175), (2200, 420)],
     "DETAILS": [(4133, 4162), (4338, 4678)],
     "CUSTOMER": [(4338, 4162), (4890, 4518)],
     "NOTES": [(180, 4165), (2450, 4500)],
     "INFORMATION": [(5480, 4162), (6464, 4519)]
}

# Function checks whether the box is within the specified area
def box_in_region(box, region):
     ((x1, y1), (x2, y2)) = region
     (box_x1, box_y1, box_x2, box_y2) = box
     return box_x1 >= x1 and box_y1 >= y1 and box_x2 <= x2 and box_y2 <= y2

#Extract text from a specific area
def extract_text_by_region(ocr_result, region):
     texts = [text for text, box in ocr_result if box_in_region(box, region)]
     return " ".join(texts)

# Get the total number of pages in the PDF
total_pages = len(convert_from_path(pdf_path))

# Store the extracted content
extracted_content = {
     "TITLE": "",
     "DETAILS": "",
     "CUSTOMER": "",
     "INFORMATION": [],
     "NOTES": [],
     "REV_TABLE": []
}

# Convert and process PDF page by page
for page_number in range(1, total_pages + 1):
     # Convert single page from PDF to an image
     image = convert_from_path(pdf_path, first_page=page_number, last_page=page_number, poppler_path=poppler_path)[0]
     image_for_ocr = np.array(image)
    
     # Perform OCR on the current page
     ocr_result = pipeline.recognize([image_for_ocr])[0]

     # Special processing of TITLE, DETAILS, CUSTOMER on the first page
     if page_number == 1:
         for key in ["TITLE", "DETAILS", "CUSTOMER"]:
             region = regions[key]
             extracted_content[key] = extract_text_by_region(ocr_result, region)

     # Extract the INFORMATION of each page
     extracted_content["INFORMATION"].append((page_number, extract_text_by_region(ocr_result, regions["INFORMATION"])))
    
     # Check and extract NOTES
     notes_text = extract_text_by_region(ocr_result, regions["NOTES"])
     if "NOTES" in notes_text or "IMPORTANT NOTE" in notes_text:
         extracted_content["NOTES"].append((page_number, notes_text))
    
     # Check and extract REV_TABLE if "REV" exists
     # REV_TABLE area given as [(x1, y1), (x2, y2)]
     rev_text = extract_text_by_region(ocr_result, [(2504, 4160), (4133, 4520)])
     if "REV" in rev_text:
         extracted_content["REV_TABLE"].append((page_number, rev_text))

#Print the extracted content
for key, value in extracted_content.items():
     print(f"{key}: {value}\n{'-'*100}")

Looking for C:\Users\wangj\.keras-ocr\craft_mlt_25k.h5



ValueError: Unrecognized keyword arguments passed to Dense: {'weights': [array([[0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.]], dtype=float32), array([1., 0., 0., 0., 1., 0.], dtype=float32)]}

Checking the installed CUDA version after that error:

```shell

C:\Users\wangj>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:30:10_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

```

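This CUDA release (12.4) is not supported by TensorFlow at the moment: https://discuss.tensorflow.org/t/tensorflow-2-14-and-cuda-12-status/19725

So CUDA 11.x should be used to run keras_ocr.

Additionally, if the TensorFlow version is higher than 2.10, GPU support is available only on Linux or WSL2, not on native Windows.

Note that the error message itself concerns the `weights` keyword argument passed to `Dense`, so the installed Keras version may also be too new for keras_ocr, independently of CUDA.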
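Before retrying, it may help to confirm which TensorFlow version is installed. A minimal sketch using only the standard library; the helper names are hypothetical, and the 2.10 threshold is based on TensorFlow 2.10 being the last release with GPU support on native Windows:

```python
from importlib import metadata


def parse_version(version):
    """Turn a version string such as '2.16.1' into (2, 16) for numeric comparison."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def native_windows_gpu_supported(tf_version):
    """Return True if this TensorFlow version is 2.10 or older."""
    return parse_version(tf_version) <= (2, 10)


def installed_tensorflow_version():
    """Return the installed TensorFlow version string, or None if it is not installed."""
    try:
        return metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return None


version = installed_tensorflow_version()
if version is None:
    print("TensorFlow is not installed in this environment")
else:
    print(version, native_windows_gpu_supported(version))
```

Reading the version through `importlib.metadata` avoids importing TensorFlow itself, which is slow and may fail in a misconfigured environment.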

In [4]:
import numpy as np
import keras_ocr
from pdf2image import convert_from_path
from PIL import Image

#Initialize the model
detector = keras_ocr.detection.Detector()
recognizer = keras_ocr.recognition.Recognizer(alphabet=r'\.\/\RI-\*\+')

# Preparation
poppler_path = r"C:\poppler-24.02.0\Library\bin"
pipeline = keras_ocr.pipeline.Pipeline()

# PDF file path
pdf_path = 'pdf_files/236-020-STR-001_D.pdf'

# Define area coordinates
regions = {
      "TITLE": [(200, 175), (2200, 420)],
      "DETAILS": [(4133, 4162), (4338, 4678)],
      "CUSTOMER": [(4338, 4162), (4890, 4518)],
      "NOTES": [(180, 4165), (2450, 4500)],
      "INFORMATION": [(5480, 4162), (6464, 4519)]
}

# Determine whether the box is within the specified area
def box_in_region(box, region):
      ((x1, y1), (x2, y2)) = region
      (box_x1, box_y1, box_x2, box_y2) = box
      return box_x1 >= x1 and box_y1 >= y1 and box_x2 <= x2 and box_y2 <= y2

# Extract text from the specified area
def extract_text_by_region(ocr_result, region):
      texts = [text for text, box in ocr_result if box_in_region(box, region)]
      return " ".join(texts)

# Get the total number of PDF pages
total_pages = len(convert_from_path(pdf_path))

# Store the extracted content
extracted_content = {
      "TITLE": "",
      "DETAILS": "",
      "CUSTOMER": "",
      "INFORMATION": [],
      "NOTES": [],
      "REV_TABLE": []
}

# Process PDF page by page
for page_number in range(1, total_pages + 1):
      # Convert PDF pages to images
      image = convert_from_path(pdf_path, first_page=page_number, last_page=page_number, poppler_path=poppler_path)[0]
      image_for_ocr = np.array(image)
    
      # Use detector and recognizer for OCR
      detection_results = detector.detect(images=[image_for_ocr])
      # Recognize the text inside every detected box of the page
      recognition_results = recognizer.recognize_from_boxes(
          images=[image_for_ocr], box_groups=detection_results
      )
      ocr_result = list(zip(recognition_results[0], detection_results[0]))

      # Special processing of TITLE, DETAILS, CUSTOMER on the first page
      if page_number == 1:
          for key in ["TITLE", "DETAILS", "CUSTOMER"]:
              region = regions[key]
              extracted_content[key] = extract_text_by_region(ocr_result, region)

      # Extract INFORMATION from each page
      extracted_content["INFORMATION"].append((page_number, extract_text_by_region(ocr_result, regions["INFORMATION"])))
    
      # Check and extract NOTES
      notes_text = extract_text_by_region(ocr_result, regions["NOTES"])
      if "NOTES" in notes_text or "IMPORTANT NOTE" in notes_text:
          extracted_content["NOTES"].append((page_number, notes_text))
    
      # Check and extract REV_TABLE
      # REV_TABLE area given as [(x1, y1), (x2, y2)]
      rev_text = extract_text_by_region(ocr_result, [(2504, 4160), (4133, 4520)])
      if "REV" in rev_text:
          extracted_content["REV_TABLE"].append((page_number, rev_text))

#Print the extracted content
for key, value in extracted_content.items():
      print(f"{key}: {value}\n{'-'*100}")

Looking for C:\Users\wangj\.keras-ocr\craft_mlt_25k.h5


ValueError: Unrecognized keyword arguments passed to Dense: {'weights': [array([[0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.]], dtype=float32), array([1., 0., 0., 0., 1., 0.], dtype=float32)]}
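One further thing to check once the model loads: `box_in_region` unpacks each box as four numbers, whereas keras_ocr is documented to return each box as an array of four corner points (shape 4×2). Assuming that format, a hypothetical adapter `quad_to_bounds` could convert a box to `(x1, y1, x2, y2)` before the region test:

```python
import numpy as np


def quad_to_bounds(box):
    """Convert a 4x2 array of (x, y) corner points to (x_min, y_min, x_max, y_max)."""
    points = np.asarray(box, dtype=float).reshape(-1, 2)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)


def box_in_region(bounds, region):
    """Same containment test as in the notebook cell, applied to converted bounds."""
    (x1, y1), (x2, y2) = region
    box_x1, box_y1, box_x2, box_y2 = bounds
    return box_x1 >= x1 and box_y1 >= y1 and box_x2 <= x2 and box_y2 <= y2


# Made-up quadrilateral that lies inside the TITLE region from the cell above
sample_box = np.array([[300, 200], [900, 200], [900, 260], [300, 260]])
title_region = [(200, 175), (2200, 420)]
print(box_in_region(quad_to_bounds(sample_box), title_region))
```

Taking the minimum and maximum over the corner points also handles slightly rotated boxes, at the cost of treating them as their axis-aligned bounding rectangle.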

# Easy_OCR re-try

In [2]:
from pdf2image import convert_from_path
import easyocr
import numpy as np

# Initialize easyOCR reader
reader = easyocr.Reader(['en'])
poppler_path = r"C:\poppler-24.02.0\Library\bin"

def extract_text_from_pdf(pdf_path):
    images = convert_from_path(pdf_path)
    # Initialize storage structure
    first_page_texts = {}
    rev_texts = []
    notes_texts = []
    information_texts = []

    for i, image in enumerate(images):
        image_np = np.array(image)

        if i == 0:  # Extract from the first page
            # Define regions and extract text similar to Pytesseract code
            regions = {
                "TITLE": (175, 200, 420, 2200),
                "DETAILS": (4162, 4133, 4678, 4338),
                "CUSTOMER": (4162, 4338, 4518, 4890)
            }
            for label, (top, left, bottom, right) in regions.items():
                region = image_np[top:bottom, left:right]
                text = ' '.join(reader.readtext(region, detail=0, paragraph=True))
                first_page_texts[label] = text

        # Check and extract REV_TABLE if "REV" is present, indicate page number
        rev_table_coords = (4160, 2504, 4520, 4133)
        rev_table_region = image_np[rev_table_coords[0]:rev_table_coords[2], rev_table_coords[1]:rev_table_coords[3]]
        rev_table_text = reader.readtext(rev_table_region, detail=0)
        if any('REV' in word for word in rev_table_text):
            rev_texts.append((i + 1, ' '.join(rev_table_text)))

        # Process NOTES sections from all pages
        notes_coords = (4165, 180, 4500, 2450)
        notes_region = image_np[notes_coords[0]:notes_coords[2], notes_coords[1]:notes_coords[3]]
        notes_text = ' '.join(reader.readtext(notes_region, detail=0, paragraph=True))
        notes_texts.append((i + 1, notes_text))

        # Extract INFORMATION from every page
        information_coords = (4162, 5480, 4519, 6464)
        information_region = image_np[information_coords[0]:information_coords[2], information_coords[1]:information_coords[3]]
        information_text = ' '.join(reader.readtext(information_region, detail=0, paragraph=True))
        information_texts.append((i + 1, information_text))

    # Create dictionary to store all extracted content
    extracted_content = {
        "TITLE": first_page_texts.get("TITLE", ""),
        "DETAILS": first_page_texts.get("DETAILS", ""),
        "CUSTOMER": first_page_texts.get("CUSTOMER", ""),
        "REV_TABLES": rev_texts,
        "NOTES": notes_texts,
        "INFORMATION": information_texts
    }

    return extracted_content

# This is a sample PDF path, you need to change it according to the actual situation
pdf_path = 'pdf_files/236-020-STR-001_D.pdf'
extracted_content = extract_text_from_pdf(pdf_path)

# Output all contents stored in the dictionary
print("Extracted Content Summary:")
for key, value in extracted_content.items():
    print(f"{key}:")
    if isinstance(value, list):
        for item in value:
            print(f"Page {item[0]}:")
            print(item[1])  # Print the entire text content
            print("-" * 50)  # Separator line
    else:
        print(value)
        print("-" * 50)  # Separator line

Extracted Content Summary:
TITLE:
ADBRI BIRKENHEAD
--------------------------------------------------
DETAILS:
0 NOT SCALE ALL DIMENSIONS IN MILLIMETRES IING PRACTICE TO AS11OO F IN DOUBT ASK [olerances UNQ Jn dimensions DEC PLACE DEC PLACE DEC PLACE ILL ANGLES 10.5 10.2 10.10 L0.5" OR USED FC
--------------------------------------------------
CUSTOMER:
McMAHON 5 E R V 1 0 E 5 Customer McMahon Services Reference 236-020 R MANUFACTURING OR TENDERIE
--------------------------------------------------
REV_TABLES:
Page 1:
01/03/2023 ISSUED FOR CLIENT REVIEW BM DH DZ HV 20/01/2023 ISSUED FOR REVIEW SB DH DZ HV DRAV 16/12/2022 ISSUED FOR REVIEW BM DH DZ HV 11/11/2022 ISSUED FOR REVIEW AT DH DZ HV V DATE REVISION HISTORY DRN CHK ENG PM PROPERTY OF INGENIA LTD. NEITHER THE WHOLE NOR ANY EXTRACT MAY BE DISCLOSED COPIED
--------------------------------------------------
Page 2:
01/03/2023 ISSUED FOR CLIENT REVIEW BM DH DZ HV 20/01/2023 ISSUED FOR REVIEW SB DH DZ HV DRAV 16/12/2022 ISSUED FOR REVI
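The cells in this notebook write regions in two different orders: the easyOCR cell uses `(top, left, bottom, right)` for NumPy slicing, while PIL's `Image.crop` expects `(left, upper, right, lower)`. A minimal sketch of converting between the two, with hypothetical helper names and a synthetic array standing in for a page:

```python
import numpy as np


def slice_order_to_crop_box(region):
    """Convert (top, left, bottom, right) to PIL-style (left, upper, right, lower)."""
    top, left, bottom, right = region
    return (left, top, right, bottom)


def crop_box_to_slices(box):
    """Convert PIL-style (left, upper, right, lower) to NumPy row and column slices."""
    left, upper, right, lower = box
    return slice(upper, lower), slice(left, right)


# TITLE region as written in the easyOCR cell
title_region = (175, 200, 420, 2200)
title_box = slice_order_to_crop_box(title_region)
print(title_box)

# Both forms select the same pixels on a synthetic page array
page = np.zeros((500, 2500), dtype=np.uint8)
rows, cols = crop_box_to_slices(title_box)
print(page[rows, cols].shape == page[175:420, 200:2200].shape)
```

Keeping one canonical order and converting at the point of use makes it harder to swap x and y by accident when moving regions between OCR libraries.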

# paddle ocr

In [10]:
from pdf2image import convert_from_path
from PIL import Image
from paddleocr import PaddleOCR
import numpy as np

#Initialize PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='en')

#Specify PDF files and pages
pdf_path = r'pdf_files\236-020-STR-001_D.pdf' # Please update to your PDF file path
page_number = 0 # Specify the page (counting from 0)

# Convert PDF specified pages to images
pages = convert_from_path(pdf_path)
page_image = pages[page_number] # Get the image of a specific page

# Define the area to be recognized
regions = {
     "TITLE": (200, 175, 2200, 420),
     "DETAILS": (4133, 4162, 4338, 4678),
     "CUSTOMER": (4338, 4162, 4890, 4518)
}

# Perform OCR on each area and extract text directly
for label, coords in regions.items():
     # Crop the image to the specified area
     cropped_image = page_image.crop(coords)
    
     # Use PaddleOCR to identify cropped images
     result = ocr.ocr(np.array(cropped_image), cls=True)
    
     # Check the format and completeness of each recognition result
     recognized_texts = []
     for line in result:
         if line and len(line) > 1 and line[1] and isinstance(line[1], list) and line[1][0]:
             if isinstance(line[1][0], list) and len(line[1][0]) > 0 and isinstance(line[1][0][0], str):
                 recognized_texts.append(line[1][0][0])
             elif isinstance(line[1][0], str):
                 recognized_texts.append(line[1][0])

     # Concatenate the recognized text lists into a string
     recognized_text = ' '.join(recognized_texts)

     #Print recognition results
     print(f"{label} Text:\n{recognized_text}\n{'-'*50}")

[2024/05/16 23:47:11] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='C:\\Users\\wangj/.paddleocr/whl\\det\\en\\en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='C:\\Users\\wangj/.paddleocr/whl\\rec\\en\\en_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_len

In [11]:
# Perform OCR on each area and extract text directly
for label, coords in regions.items():
     # Crop the image to the specified area
     cropped_image = page_image.crop(coords)
    
     # Save the cropped image for inspection
     cropped_image.save(f"{label}_cropped.png")

     # Use PaddleOCR to identify cropped images
     result = ocr.ocr(np.array(cropped_image), cls=True)
    
     #Extract recognition results
     recognized_texts = []
     for line in result:
         if line and len(line) > 1 and line[1] and isinstance(line[1], list) and line[1][0]:
             if isinstance(line[1][0], list) and len(line[1][0]) > 0 and isinstance(line[1][0][0], str):
                 recognized_texts.append(line[1][0][0])
             elif isinstance(line[1][0], str):
                 recognized_texts.append(line[1][0])

     recognized_text = ' '.join(recognized_texts)
     print(f"{label} Text:\n{recognized_text}\n{'-'*50}")

[2024/05/16 23:54:02] ppocr DEBUG: dt_boxes num : 1, elapsed : 0.47309255599975586
[2024/05/16 23:54:02] ppocr DEBUG: cls num  : 1, elapsed : 0.013784170150756836
[2024/05/16 23:54:02] ppocr DEBUG: rec_res num  : 1, elapsed : 0.1611495018005371
TITLE Text:

--------------------------------------------------
[2024/05/16 23:54:02] ppocr DEBUG: dt_boxes num : 16, elapsed : 0.2407066822052002
[2024/05/16 23:54:03] ppocr DEBUG: cls num  : 16, elapsed : 0.04245185852050781
[2024/05/16 23:54:03] ppocr DEBUG: rec_res num  : 16, elapsed : 0.8090565204620361
DETAILS Text:

--------------------------------------------------
[2024/05/16 23:54:03] ppocr DEBUG: dt_boxes num : 13, elapsed : 0.059751272201538086
[2024/05/16 23:54:04] ppocr DEBUG: cls num  : 13, elapsed : 0.17111444473266602
[2024/05/16 23:54:04] ppocr DEBUG: rec_res num  : 13, elapsed : 0.7976350784301758
CUSTOMER Text:

--------------------------------------------------


In [13]:
# Perform OCR on each area and extract text directly
for label, coords in regions.items():
     # Crop the image to the specified area
     cropped_image = page_image.crop(coords)
    
     # Save the cropped image for inspection
     cropped_image.save(f"{label}_cropped.png")

     # Use PaddleOCR to identify cropped images
     result = ocr.ocr(np.array(cropped_image), cls=True)
    
     #Extract recognition results
     recognized_texts = []
     for line in result:
         if line and len(line) > 1 and line[1] and isinstance(line[1], list) and line[1][0]:
             if isinstance(line[1][0], list) and len(line[1][0]) > 0 and isinstance(line[1][0][0], str):
                 recognized_texts.append(line[1][0][0])
             elif isinstance(line[1][0], str):
                 recognized_texts.append(line[1][0])

     recognized_text = ' '.join(recognized_texts)
     print(f"{label} Text:\n{recognized_text}\n{'-'*50}")


IndexError: list index out of range

In [14]:
from pdf2image import convert_from_path
from PIL import Image
from paddleocr import PaddleOCR
import numpy as np

#Initialize PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='en')

#Specify PDF files and pages
pdf_path = r'pdf_files\236-020-STR-001_D.pdf' # Please update to your PDF file path
page_number = 0

# Convert PDF specified pages to images
pages = convert_from_path(pdf_path)
page_image = pages[page_number]

# Define the area to be recognized
regions = {
     "TITLE": (200, 175, 2200, 420),
     "DETAILS": (4133, 4162, 4338, 4678),
     "CUSTOMER": (4338, 4162, 4890, 4518)
}

for label, coords in regions.items():
     cropped_image = page_image.crop(coords)
     result = ocr.ocr(np.array(cropped_image), cls=True)

     # Print each recognized text area and its contents
     for line in result:
         if line and len(line) > 1 and line[1]:
             text = line[1][0] if line[1] and isinstance(line[1], list) and len(line[1]) > 0 else "No text found"
             print(f"{label} Detected Box: {line[0]}, Text: {text}")
         else:
             print(f"{label} Detected Box: {line[0]}, Text: No valid text detected")

     # Check if the result contains text
     if not result:
         print(f"No text detected in {label}")
     else:
         recognized_texts = []
         for line in result:
             if line and len(line) > 1 and line[1] and isinstance(line[1], list) and line[1][0]:
                 recognized_texts.append(line[1][0])
         recognized_text = ' '.join(recognized_texts)
         print(f"{label} Text: {recognized_text}\n{'-'*50}")


TypeError: sequence item 0: expected str instance, list found
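The empty results and the `TypeError` above are consistent with how the result is indexed. In the PaddleOCR 2.x releases, `ocr.ocr(...)` is understood to return one entry per input image, where each entry is a list of `[box, (text, confidence)]` items (or `None` when nothing is detected). Under that assumption, `for line in result` iterates over images rather than text lines, so `line[1][0]` is a bounding box, not a string. A minimal sketch of a flattening helper follows; the name `texts_from_paddle_result` is hypothetical and the sample data is made up to mirror the assumed structure:

```python
def texts_from_paddle_result(result):
    """Collect recognized strings from a nested PaddleOCR-style result.

    Assumes result is a list with one entry per image, and each entry is
    either None or a list of [box, (text, confidence)] items.
    """
    texts = []
    for page in result or []:
        for box, (text, confidence) in page or []:
            texts.append(text)
    return texts


# Made-up data shaped like the assumed return value for a single image
sample_result = [[
    [[[10, 5], [120, 5], [120, 30], [10, 30]], ("ADBRI", 0.98)],
    [[[130, 5], [300, 5], [300, 30], [130, 30]], ("BIRKENHEAD", 0.97)],
]]

print(" ".join(texts_from_paddle_result(sample_result)))
print(texts_from_paddle_result([None]))
```

In the notebook cells, the equivalent call would be `' '.join(texts_from_paddle_result(result))` in place of the nested `isinstance` checks, provided the installed PaddleOCR version uses this result layout.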