This script uses Pillow (the maintained fork of the Python Imaging Library, PIL), pytesseract, and python-dotenv to extract text from an image file and print it to the console. It reads the image path and the path to the Tesseract executable from environment variables.
- Python Imaging Library (PIL) or Pillow
- pytesseract
- python-dotenv
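Whether these requirements are already available can be checked from Python. The following is a minimal sketch, not part of the original script. The helper name `missing_modules` is hypothetical. It assumes the packages are imported as `PIL`, `pytesseract`, and `dotenv`, and that the Tesseract executable is named `tesseract` when it is on the PATH.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of Pillow, pytesseract, and python-dotenv.
    missing = missing_modules(["PIL", "pytesseract", "dotenv"])
    print("Missing Python modules:", missing or "none")

    # shutil.which returns None when no `tesseract` executable is on the PATH;
    # in that case PATH_TO_TESSERACT must point at the executable explicitly.
    print("Tesseract on PATH:", shutil.which("tesseract") or "not found")
```

An empty list of missing modules means the three Python packages can be imported. The Tesseract engine itself is a separate program, which is why it is checked separately.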
- Install Python Imaging Library (PIL) or Pillow. Check the Pillow documentation for installation instructions.
- Install Google Tesseract OCR. Refer to the Tesseract installation documentation for instructions on Linux, macOS, and Windows.
- Install pytesseract and python-dotenv using pip:
pip install pytesseract python-dotenv
- Create a .env file in the same directory as the script and set the environment variables SAMPLE_1 and PATH_TO_TESSERACT:
SAMPLE_1=/path/to/image/file
PATH_TO_TESSERACT=/path/to/tesseract/executable
- Run the script:
python ocr_script.py
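The two settings from the .env step can be validated before any OCR work starts, so that a missing or misspelled variable fails with a clear message. This is a minimal sketch, assuming the values have already been loaded into the environment (the script does this with load_dotenv()). The function name `read_ocr_settings` and its error message are hypothetical and not part of the original script.

```python
import os


def read_ocr_settings(env=None):
    """Return (image_path, tesseract_path) from a mapping of environment variables.

    Raises ValueError if either variable is unset or empty.
    """
    env = os.environ if env is None else env
    settings = []
    for name in ("SAMPLE_1", "PATH_TO_TESSERACT"):
        value = (env.get(name) or "").strip()
        if not value:
            raise ValueError(f"{name} is not set; add it to the .env file")
        settings.append(value)
    return tuple(settings)


# Example with an explicit mapping instead of the real environment.
image_path, tesseract_path = read_ocr_settings(
    {
        "SAMPLE_1": "/path/to/image/file",
        "PATH_TO_TESSERACT": "/path/to/tesseract/executable",
    }
)
print(image_path)
print(tesseract_path)
```

Passing a mapping keeps the check easy to try out without touching the real environment. Called with no argument, it reads from os.environ.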
The script will load the image specified by the SAMPLE_1 environment variable, use pytesseract to extract text from the image, and print the extracted text to the console.
Here is the complete script:
from PIL import Image
from pytesseract import pytesseract
from dotenv import load_dotenv
import os

# Load SAMPLE_1 and PATH_TO_TESSERACT from the .env file into the environment.
load_dotenv()
image_path = os.getenv("SAMPLE_1")
path_to_tesseract = os.getenv("PATH_TO_TESSERACT")

# Point pytesseract at the Tesseract executable.
pytesseract.tesseract_cmd = path_to_tesseract

# Open the image, run OCR on it, and print the extracted text.
img = Image.open(image_path)
text = pytesseract.image_to_string(img)
print(text)
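OCR output often includes trailing whitespace and blank lines. If cleaner console output is wanted, the text can be tidied before printing. This is a sketch only: the helper name `tidy_ocr_text` is hypothetical, it is not part of the original script, and it uses nothing beyond the Python standard library.

```python
def tidy_ocr_text(text):
    """Strip trailing whitespace from each line and drop blank lines."""
    lines = (line.rstrip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Stand-in for the string returned by the OCR step.
sample = "Hello  \n\n  World\n\n"
print(tidy_ocr_text(sample))
```

Leading whitespace is kept on purpose, since indentation in the recognized text may be meaningful.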
For more information and usage examples of pytesseract, refer to the official pytesseract GitHub repository.