# Invoice Data Extraction using PyTesseract
This notebook demonstrates how to extract text from invoice images using pytesseract, a Python wrapper for the Tesseract OCR engine. The process has three steps: load the invoice image, pre-process it to improve OCR accuracy, and run Tesseract on the result to extract the text.

In [1]:
# Import necessary libraries
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
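
Because pytesseract only wraps the Tesseract executable, the OCR cells fail if that executable is not installed. The check below is a minimal sketch using only the standard library; the helper name `tesseract_available` is hypothetical, and it assumes the executable is called `tesseract` and is expected on the `PATH`.

```python
import shutil


def tesseract_available(executable="tesseract"):
    """Return True if an executable with this name is found on the PATH."""
    return shutil.which(executable) is not None


if not tesseract_available():
    print("Tesseract executable not found on PATH; install it before running the OCR cells.")
```

If Tesseract is installed somewhere that is not on the `PATH`, pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.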

## Pre-processing the Image
Pre-processing can significantly improve OCR results by making the text easier to distinguish from the background. The function below converts the image to grayscale, applies a median filter to reduce noise, and increases the contrast.

In [2]:
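def preprocess_image(image_path):
    """Load an image and return a grayscale, denoised, contrast-enhanced copy."""
    # Open the image
    img = Image.open(image_path)

    # Convert to grayscale ('L' is 8-bit, single channel)
    img = img.convert('L')

    # Apply a median filter (default 3x3 window) to suppress speckle noise
    img = img.filter(ImageFilter.MedianFilter())

    # Enhance contrast (a factor of 1.0 leaves the image unchanged)
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2)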
    
    return img
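
Scanned invoices with uneven lighting sometimes benefit from an additional binarization step, which is not part of the function above. The following is a sketch of a fixed global threshold using Pillow's `Image.point`; the helper name `binarize` and the default threshold of 128 are assumptions chosen for illustration, and a suitable value depends on the scan.

```python
from PIL import Image


def binarize(img, threshold=128):
    """Map pixels brighter than `threshold` to white (255) and the rest to black (0)."""
    gray = img.convert('L')  # ensure a single 8-bit channel
    return gray.point(lambda p: 255 if p > threshold else 0)
```

It could be applied to the output of `preprocess_image` before OCR, for example `binarize(preprocess_image(image_path))`. Whether it helps should be checked on real samples, since a poor threshold can erase faint text.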

## Extracting Text from the Pre-processed Image
Once the image has been pre-processed, we can pass it to `pytesseract.image_to_string`, which runs Tesseract on the image and returns the recognized text as a single string.

In [3]:
def extract_text_from_image(image_path):
    """Pre-process the image at `image_path` and return the OCR text as a string."""
    # Preprocess the image for better OCR results
    img = preprocess_image(image_path)
    
    # Use PyTesseract to extract text
    text = pytesseract.image_to_string(img)
    
    return text
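
The call above relies on Tesseract's defaults. `image_to_string` also accepts a `lang` argument and a `config` string of Tesseract command-line options, such as the page segmentation mode (`--psm`) and OCR engine mode (`--oem`). The sketch below shows one way to pass them; the helper names are hypothetical, and `--psm 6` (treat the page as a single uniform block of text) is an assumption about the invoice layout rather than a recommendation.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build a Tesseract option string; oem=3 selects the default available engine."""
    return f"--oem {oem} --psm {psm}"


def extract_text_with_options(img, lang="eng", psm=6):
    """Run OCR on a PIL image with an explicit language and page segmentation mode."""
    # Imported here so the config helper above can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        img, lang=lang, config=build_tesseract_config(psm=psm)
    )
```

For multi-column or table-heavy invoices, other modes such as `--psm 4` (a single column of text of variable sizes) may work better; comparing the output of a few modes on sample invoices is the most reliable way to choose.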

## Running the OCR Process
Specify the path to your invoice image and run the OCR process to extract the text.

In [4]:
# Specify the path to your invoice image
image_path = 'path/to/your/invoice/image.jpg'

# Extract text from the image
extracted_text = extract_text_from_image(image_path)

# Print the extracted text
print(extracted_text)
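
The OCR output is unstructured text, so pulling out specific invoice fields needs a further parsing step. The following is a rough sketch using regular expressions; the function name, field names, and patterns are all hypothetical and assume labels such as "Invoice No:" and "Total:" appear in the text. Real invoices vary widely, so the patterns would need adapting.

```python
import re

# Hypothetical patterns for illustration; adjust them to the invoices at hand.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"\btotal\b\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def parse_invoice_fields(text):
    """Return a dict of field name to the first matching value, or None if absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields
```

Calling `parse_invoice_fields(extracted_text)` returns a dictionary with one entry per pattern. OCR errors (for example `O` read as `0`) can cause matches to fail or return wrong values, so extracted fields should be validated before use.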