# Text extraction using OCR

In [1]:
# text extraction using OCR
import cv2
import pytesseract
import os
import numpy as np

# Set up Tesseract-OCR path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

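# Read image
image_path = r"E:\Ducat\AiProjects\Text_OCR\Real_text_image.png"
img = cv2.imread(image_path)
if img is None:  # cv2.imread returns None instead of raising on an unreadable path
    raise FileNotFoundError(f"Could not read image: {image_path}")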

# Get image directory
image_dir = os.path.dirname(image_path)
output_text_path = os.path.join(image_dir, "extracted_text.txt")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply Bilateral Filter (preserves edges better than Gaussian Blur)
gray = cv2.bilateralFilter(gray, 9, 75, 75)

# Use Otsu's Thresholding instead of Adaptive (better for printed text)
_, processed_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Upscale the image (larger glyphs generally help Tesseract)
scale_factor = 2  # Doubles width and height
processed_img = cv2.resize(processed_img, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_CUBIC)

# Save processed image for debugging
cv2.imwrite(os.path.join(image_dir, "preprocessed_img.png"), processed_img)

# Run OCR. --oem 3 selects Tesseract's default engine mode; --psm 4 assumes a
# single column of text of variable sizes.
custom_config = r'--oem 3 --psm 4'  # Try --psm 11 (sparse text) if output is still distorted
extracted_text = pytesseract.image_to_string(processed_img, config=custom_config, lang='eng')

# Print extracted text
print("\nExtracted Text:\n" + "-" * 30)
print(extracted_text.strip())

# Save text to file
with open(output_text_path, "w", encoding="utf-8") as file:
    file.write(extracted_text.strip())

print("-" * 30 + f"\nRecognized text saved to: {output_text_path}")



Extracted Text:
------------------------------
B aimol.txt x + ~ a x

File Edit View Ww. £33
IAI/ML Notes

1.1 Artificial Intelligence (AI)

* Definition: Creating machines that simulate human intelligence (learning, reasoning, problem-solving, perception, decision-
making).

* Examples: Chatbots (ChatGPT, Siri), Computer Vision (Face recognition), Autonomous Vehicles (Tesla), Healthcare AI (Disease
prediction), Finance AI (Stock market prediction).

1.2 Machine Learning (ML)

* Definition: Subset of AI where machines learn from data without explicit programming.

* Types: 1. Supervised (labeled data, e.g., spam detection), 2. Unsupervised (unlabeled data, e.g., customer segmentation),
3. Reinforcement (trial and error, e.g., AlphaGo).

* Examples: Recommendation Systems (Netflix, YouTube), Fraud Detection, Self-Driving Cars, Medical Diagnosis.

1.3 AI vs. ML
* AT is the broader concept; ML is a way to achieve AI.
* ML models learn from data; AI can involve pre-programmed logic.

II. 

In [1]:
# Check OCR accuracy against the ground-truth text

from difflib import SequenceMatcher, get_close_matches
import Levenshtein  # third-party package (e.g. `pip install Levenshtein`)
import re

actual_text_path = r"E:\Ducat\Deep Learning\Text_OCR\actual_text.txt"
extracted_text_path = r"E:\Ducat\Deep Learning\Text_OCR\extracted_text.txt"

# Load texts
with open(actual_text_path, "r", encoding="utf-8") as f:
    actual_text = f.read().strip()

with open(extracted_text_path, "r", encoding="utf-8") as f:
    extracted_text = f.read().strip()

# === Improved Preprocessing ===
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Normalize spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.strip()

actual_text = clean_text(actual_text)
extracted_text = clean_text(extracted_text)

# === Character-Level Accuracy (Levenshtein Distance) ===
lev_distance = Levenshtein.distance(actual_text, extracted_text)
char_accuracy = (1 - lev_distance / max(len(actual_text), len(extracted_text))) * 100

# === Word-Level Accuracy (Using Better Matching) ===
ground_words = actual_text.split()
extracted_words = extracted_text.split()

matched_words = 0
for word in extracted_words:
    if word in ground_words or get_close_matches(word, ground_words, n=1, cutoff=0.8):
        matched_words += 1

word_accuracy = (matched_words / max(len(ground_words), len(extracted_words))) * 100

# === Sequence Matching Score ===
sequence_match = SequenceMatcher(None, actual_text, extracted_text).ratio() * 100

# === Print Results ===
print(f"Character-Level Accuracy: {char_accuracy:.2f}%")
print(f"Word-Level Accuracy: {word_accuracy:.2f}%")
print(f"Sequence Matching Score: {sequence_match:.2f}%")


Character-Level Accuracy: 93.54%
Word-Level Accuracy: 90.55%
Sequence Matching Score: 96.63%


# **Text Extraction and Accuracy Evaluation Using Tesseract OCR**

## **Objective**
The objective of this project is to extract text from an image using Tesseract OCR and evaluate its accuracy by comparing it with the actual text. The accuracy is measured using character-level, word-level, and sequence matching techniques.

---

## **Steps Involved**

### **1. Image Preprocessing**
- Load the image using OpenCV.
- Convert the image to grayscale.
- Apply a bilateral filter to reduce noise while preserving edges.
- Use Otsu’s thresholding to binarize the image.
- Upscale the binarized image by a factor of 2 with bicubic interpolation, since Tesseract tends to perform better on larger glyphs.
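
To make the binarization step less of a black box, here is a minimal pure-Python sketch of how Otsu's method chooses a threshold: it tries every cut-off and keeps the one that maximizes the variance between the two resulting pixel classes. The helper names `otsu_threshold` and `binarize` and the toy pixel values are illustrative assumptions, not part of the notebook, which relies on `cv2.threshold` with `cv2.THRESH_OTSU`.

```python
def otsu_threshold(pixels):
    """Return the threshold t in 0..255 that maximizes between-class variance.

    Pixels with value <= t form one class; pixels with value > t form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the lower class
    sum0 = 0  # intensity sum of the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # lower class still empty
        if w1 == 0:
            break     # upper class is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map pixels above t to 255 and the rest to 0 (like cv2.THRESH_BINARY)."""
    return [255 if p > t else 0 for p in pixels]


# Hypothetical toy "image": dark text pixels near 30, light background near 220.
pixels = [28, 30, 32, 30, 218, 220, 222, 220, 221, 219]
t = otsu_threshold(pixels)
print(t, binarize(pixels, t))
```

Because the threshold is derived from the histogram of the whole image, this approach suits evenly lit, printed text; unevenly lit photos are the usual case for adaptive thresholding instead.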

### **2. Text Extraction Using Tesseract OCR**
- Run Tesseract with `--oem 3 --psm 4` (the default OCR engine mode, and a page segmentation mode that assumes a single column of text of variable sizes).
- Extract text from the preprocessed image.
- Save the extracted text to a file.

### **3. Accuracy Evaluation**
- Load the actual text and extracted text from files.
- Preprocess the texts by:
  - Converting to lowercase.
  - Removing extra spaces and punctuation.
- Compute accuracy using three different metrics:
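  - **Character-Level Accuracy:** `1 - distance / max(len(actual), len(extracted))`, where `distance` is the Levenshtein distance between the cleaned texts.
  - **Word-Level Accuracy:** The number of extracted words that appear in the actual text, either exactly or as a close match (`get_close_matches` with `cutoff=0.8`), divided by the larger of the two word counts.
  - **Sequence Matching Score:** The similarity ratio returned by `SequenceMatcher` from `difflib`.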
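
The character-level and word-level metrics can be tried on a single line without any third-party package. The sketch below is an illustration under stated assumptions: `levenshtein` is a plain dynamic-programming stand-in for `Levenshtein.distance`, the helpers `char_accuracy` and `word_accuracy` are hypothetical names, and the ground-truth sentence is assumed to read "AI" where the OCR output shows "AT".

```python
import re
from difflib import get_close_matches


def clean_text(text):
    """Lowercase, collapse whitespace, and drop punctuation (same steps as the notebook)."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()


def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(actual, extracted):
    longest = max(len(actual), len(extracted))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return (1 - levenshtein(actual, extracted) / longest) * 100


def word_accuracy(actual, extracted, cutoff=0.8):
    ground_words = actual.split()
    extracted_words = extracted.split()
    longest = max(len(ground_words), len(extracted_words))
    if longest == 0:
        return 100.0
    matched = sum(
        1 for w in extracted_words
        if w in ground_words or get_close_matches(w, ground_words, n=1, cutoff=cutoff)
    )
    return matched / longest * 100


# Assumed ground truth vs. the line as it appears in the OCR output.
actual = clean_text("AI is the broader concept; ML is a way to achieve AI.")
extracted = clean_text("AT is the broader concept; ML is a way to achieve AI.")

print(f"Edit distance: {levenshtein(actual, extracted)}")
print(f"Character-level accuracy: {char_accuracy(actual, extracted):.2f}%")
print(f"Word-level accuracy: {word_accuracy(actual, extracted):.2f}%")
```

A single wrong character costs little at the character level but invalidates a whole word at the word level, which is one reason the word-level score tends to be the lower of the two.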

---

## **Results and Conclusion**
- The extracted text was compared with the actual text, and the following accuracy scores were obtained:
  - **Character-Level Accuracy:** 93.54%
  - **Word-Level Accuracy:** 90.55%
  - **Sequence Matching Score:** 96.63%
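- These scores come from a single test image of clean, printed text (a text-editor window), so they indicate that the pipeline works well on that kind of input rather than on images in general.
- Some errors remain in the output: "AI" was read as "AT" in section 1.3, and the window's title bar and menu were picked up as stray text (for example `File Edit View Ww. £33`).
- Further improvements can be achieved by fine-tuning the preprocessing steps and testing other Tesseract parameters, such as the `--psm 11` mode suggested in the code.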

---
