#CHATGPT
You are correct that the OCR process typically reads from left to right and may not be able to properly recognize text in separate columns of a newspaper. Here are a few suggestions for how you can modify the OCR process to better handle multi-column text in a newspaper:

Split the image into separate columns: One option is to split the image into separate columns and perform OCR on each column individually. You can use image processing techniques such as edge detection or thresholding to detect the column boundaries and then split the image into separate images for each column.

Use OCR software that is designed for multi-column text: Some OCR software, such as ABBYY FlexiCapture, is specifically designed to handle multi-column text and may provide better results in this case. You can try using one of these specialized OCR tools instead of pytesseract.

Train a custom OCR model: Another option is to train a custom OCR model using machine learning techniques. You can use a dataset of scanned newspaper images with corresponding transcriptions to train a model that is able to recognize multi-column text. This approach requires a more in-depth understanding of machine learning and OCR, but it may provide the best results for your specific use case.

I hope these suggestions help! Let me know if you have any further questions.

In [35]:
import cv2
import pytesseract

# Read the PDF file
pdf_file = r"C:\Users\Cheng\Downloads\1-19-1938 Leela Menon(scan).png"
image = cv2.imread(pdf_file)

# Convert the image to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Perform OCR and extract the bounding boxes of the words
boxes = pytesseract.image_to_data(gray, output_type='box')

# Split the boxes into lines based on the y-coordinate
lines = []
for b in boxes:
    # Extract the bounding box coordinates and text
    x, y, w, h = b['x'], b['y'], b['w'], b['h']
    x, y, w, h = int(x), int(y), int(w), int(h)
    
    # Check if the word belongs to a new line
    if len(lines) == 0 or y > lines[-1][-1][1]:
        # Start a new line
        lines.append([(x, y, w, h)])
    else:
        # Add the word to the current line
        lines[-1].append((x, y, w, h))

# Identify the boundaries of different text sections
section_boundaries = []
for i, line in enumerate(lines):
    # Calculate the mean y-coordinate of the line
    y = int(sum([b[1] for b in line]) / len(line))
    
    # Check if the line is the first or last line in the page
    if i == 0 or i == len(lines) - 1:
        # Add the line as a boundary
        section_boundaries.append(y)
    else:
        # Check if the line is significantly different in y-coordinate from the previous line
        if abs(y - lines[i-1][0][1]) > 20:
            # Add the line as a boundary
            section_boundaries.append(y)

# Draw the section boundaries on the image
for y in section_boundaries:
    cv2.line(image, (0, y), (image.shape[1], y), (0, 0, 255), 2)

# Display the image with the bounding boxes
cv2.imshow("Text Layout", image)
cv2.waitKey(0)
cv2.destroyAllWindows()


KeyError: 'box'

In [7]:
!tesseract --help-psm

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.


In [11]:
import pytesseract
import cv2


pdf_file = r"C:\Users\Cheng\Downloads\1-19-1938 Leela Menon(scan).png"
image = cv2.imread(pdf_file)


data = pytesseract.image_to_data(image, output_type='data.frame', config="--psm 1")


for i, row in data.iterrows():
    
    if row['text'] != '':
        
        x, y, w, h = row['left'], row['top'], row['width'], row['height']
        
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)


cv2.imwrite(r"C:\Users\Cheng\Downloads\Open Peeps - Avatar.png", image)

True