# Text Extraction Using OCR

In [13]:
import cv2
import pytesseract
import os
import numpy as np

# Set up Tesseract-OCR path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

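# Read image
image_path = r"E:\Ducat\Deep Learning\Text_OCR\text_image.png"
img = cv2.imread(image_path)
if img is None:  # cv2.imread returns None (it does not raise) when the file cannot be read
    raise FileNotFoundError(f"Could not read image: {image_path}")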

# Get image directory
image_dir = os.path.dirname(image_path)
output_text_path = os.path.join(image_dir, "extracted_text.txt")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply Bilateral Filter (preserves edges better than Gaussian Blur)
gray = cv2.bilateralFilter(gray, 9, 75, 75)

# Use Otsu's Thresholding instead of Adaptive (better for printed text)
_, processed_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Upscale 2x so characters span more pixels (similar in effect to a higher-DPI scan, which helps Tesseract)
scale_factor = 2
processed_img = cv2.resize(processed_img, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_CUBIC)

# Save processed image for debugging
cv2.imwrite(os.path.join(image_dir, "preprocessed2.png"), processed_img)

# Apply OCR: --oem 3 selects the default engine, --psm 4 assumes a single column of text of variable sizes
custom_config = r'--oem 3 --psm 4'  # Try --psm 11 (sparse text) if output is still distorted
extracted_text = pytesseract.image_to_string(processed_img, config=custom_config, lang='eng')

# Print extracted text
print("\nExtracted Text:\n" + "-" * 30)
print(extracted_text.strip())

# Save text to file
with open(output_text_path, "w", encoding="utf-8") as file:
    file.write(extracted_text.strip())

print("-" * 30 + f"\nRecognized text saved to: {output_text_path}")



Extracted Text:
------------------------------
B aimol.txt x + ~ a x

File Edit View Ww. £33
IAI/ML Notes

1.1 Artificial Intelligence (AI)

* Definition: Creating machines that simulate human intelligence (learning, reasoning, problem-solving, perception, decision-
making).

* Examples: Chatbots (ChatGPT, Siri), Computer Vision (Face recognition), Autonomous Vehicles (Tesla), Healthcare AI (Disease
prediction), Finance AI (Stock market prediction).

1.2 Machine Learning (ML)

* Definition: Subset of AI where machines learn from data without explicit programming.

* Types: 1. Supervised (labeled data, e.g., spam detection), 2. Unsupervised (unlabeled data, e.g., customer segmentation),
3. Reinforcement (trial and error, e.g., AlphaGo).

* Examples: Recommendation Systems (Netflix, YouTube), Fraud Detection, Self-Driving Cars, Medical Diagnosis.

1.3 AI vs. ML
* AT is the broader concept; ML is a way to achieve AI.
* ML models learn from data; AI can involve pre-programmed logic.

II. 

In [26]:
# Check OCR accuracy against the ground-truth text

from difflib import SequenceMatcher, get_close_matches
import Levenshtein
import re

actual_text_path = r"E:\Ducat\Deep Learning\Text_OCR\actual_text.txt"
extracted_text_path = r"E:\Ducat\Deep Learning\Text_OCR\extracted_text.txt"

# Load texts
with open(actual_text_path, "r", encoding="utf-8") as f:
    actual_text = f.read().strip()

with open(extracted_text_path, "r", encoding="utf-8") as f:
    extracted_text = f.read().strip()

# === Improved Preprocessing ===
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Normalize spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.strip()

actual_text = clean_text(actual_text)
extracted_text = clean_text(extracted_text)

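# === Character-Level Accuracy (Levenshtein Distance) ===
lev_distance = Levenshtein.distance(actual_text, extracted_text)
max_len = max(len(actual_text), len(extracted_text))
# Guard against division by zero when both texts are empty after cleaning
char_accuracy = (1 - lev_distance / max_len) * 100 if max_len else 100.0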

# === Word-Level Accuracy (exact match, or close match via difflib with cutoff 0.8) ===
ground_words = actual_text.split()
extracted_words = extracted_text.split()

matched_words = 0
for word in extracted_words:
    if word in ground_words or get_close_matches(word, ground_words, n=1, cutoff=0.8):
        matched_words += 1

word_accuracy = (matched_words / max(len(ground_words), len(extracted_words))) * 100

# === Sequence Matching Score ===
sequence_match = SequenceMatcher(None, actual_text, extracted_text).ratio() * 100

# === Print Results ===
print(f"Character-Level Accuracy: {char_accuracy:.2f}%")
print(f"Word-Level Accuracy: {word_accuracy:.2f}%")
print(f"Sequence Matching Score: {sequence_match:.2f}%")


Character-Level Accuracy: 93.54%
Word-Level Accuracy: 90.55%
Sequence Matching Score: 96.63%


## Optical Character Recognition (OCR) Project

### **Objective**
This project extracts text from an image with Tesseract OCR and measures how closely the output matches a ground-truth transcription. Accuracy is reported with three metrics: character-level, word-level, and a sequence-matching score.

---

### **Steps Followed**

#### **1. Image Preprocessing**
- Load the image using OpenCV.
- Convert the image to grayscale.
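- Apply a bilateral filter (`cv2.bilateralFilter`) to reduce noise while preserving edges.
- Binarize the image with Otsu's thresholding, which picks a single global threshold automatically.
- Upscale the image 2x with bicubic interpolation so that characters span more pixels, similar in effect to a higher-DPI scan.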
- Save the preprocessed image for debugging.
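
To make the thresholding step more concrete, here is a minimal pure-Python sketch of the idea behind Otsu's method: try every threshold and keep the one that maximizes the between-class variance of the two resulting pixel groups. The notebook itself relies on `cv2.threshold` with `cv2.THRESH_OTSU`; the function names `otsu_threshold` and `binarize` and the sample pixel values below are illustrative assumptions, and OpenCV's implementation may differ in details such as tie-breaking.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance.

    `pixels` is a flat iterable of integer grayscale values in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold, max_value=255):
    """Map values above the threshold to max_value and the rest to 0."""
    return [max_value if p > threshold else 0 for p in pixels]


# Hypothetical sample: four dark pixels (text) and four bright pixels (paper)
sample = [12, 15, 20, 18, 200, 210, 220, 205]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

On a clearly bimodal image such as dark printed text on a light background, the chosen threshold falls between the two intensity clusters, which is why a single global threshold can work well here.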

#### **2. Text Extraction using Tesseract OCR**
- Define the Tesseract-OCR path.
- Use the `pytesseract.image_to_string()` function to extract text from the processed image.
- Run Tesseract with `--oem 3 --psm 4`: the default OCR engine mode, and a page segmentation mode that assumes a single column of text of variable sizes.
- Save the extracted text to a file.
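
The code comment in the extraction cell suggests trying `--psm 11` if the output is distorted. One way to compare page segmentation modes is sketched below, under some assumptions: `build_config` and `ocr_with_psm_modes` are hypothetical helper names, the mode list (4, 6, 11) is only an example, and `pytesseract` is imported lazily so the config helper can be used on its own.

```python
# Short descriptions of a few Tesseract page segmentation modes
PSM_DESCRIPTIONS = {
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, found in no particular order",
}


def build_config(psm, oem=3):
    """Build the Tesseract config string used by pytesseract."""
    if psm not in range(14):  # Tesseract defines PSM values 0 through 13
        raise ValueError(f"Unsupported page segmentation mode: {psm}")
    return f"--oem {oem} --psm {psm}"


def ocr_with_psm_modes(image, psm_modes=(4, 6, 11), lang="eng"):
    """Run OCR once per mode and return a {psm: extracted_text} mapping.

    `image` is expected to be the preprocessed image (e.g. `processed_img`).
    """
    import pytesseract  # lazy import: only needed when OCR is actually run

    results = {}
    for psm in psm_modes:
        results[psm] = pytesseract.image_to_string(
            image, config=build_config(psm), lang=lang
        ).strip()
    return results


print(build_config(4))  # --oem 3 --psm 4
```

Each result could then be scored against the ground-truth text with the same metrics used in the evaluation step to pick the mode that suits the image.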

#### **3. Accuracy Evaluation**
- Load both the extracted text and ground truth text from files.
- Perform text preprocessing (lowercasing, space normalization, punctuation removal).
- Calculate different accuracy metrics:
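  - **Character-Level Accuracy**: `1 - distance / max(len(actual), len(extracted))`, where `distance` is the Levenshtein distance between the cleaned texts.
  - **Word-Level Accuracy**: the number of extracted words that appear in the ground truth, either exactly or as a close match (`difflib.get_close_matches` with a cutoff of 0.8), divided by the larger of the two word counts.
  - **Sequence Matching Score**: the similarity ratio from `difflib.SequenceMatcher` over the cleaned texts.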
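
The character-level and word-level formulas can be illustrated on a single short sentence. The sketch below uses only the standard library: `levenshtein` is a plain dynamic-programming stand-in for `Levenshtein.distance` from the notebook, and the helper names `char_accuracy` and `word_accuracy` are illustrative assumptions. The sample pair mirrors the `AT`/`AI` misread visible in the OCR output, after lowercasing.

```python
from difflib import get_close_matches


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_accuracy(actual, extracted):
    """Percentage derived from edit distance over the longer text length."""
    longest = max(len(actual), len(extracted))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(actual, extracted) / longest) * 100


def word_accuracy(actual, extracted, cutoff=0.8):
    """Share of extracted words found in the ground truth (exact or close)."""
    ground_words = actual.split()
    extracted_words = extracted.split()
    longest = max(len(ground_words), len(extracted_words))
    if longest == 0:
        return 100.0
    matched = sum(
        1 for word in extracted_words
        if word in ground_words
        or get_close_matches(word, ground_words, n=1, cutoff=cutoff)
    )
    return matched / longest * 100


actual = "ai is the broader concept"
extracted = "at is the broader concept"  # one misread character
print(f"{char_accuracy(actual, extracted):.2f}")  # 96.00
print(f"{word_accuracy(actual, extracted):.2f}")  # 80.00
```

A single wrong character costs 1 of 25 characters but 1 of 5 words, which is why the word-level score tends to sit below the character-level score.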

---

### **Results**
| Metric | Score |
|--------|-------|
| Character-Level Accuracy | **93.54%** |
| Word-Level Accuracy | **90.55%** |
| Sequence Matching Score | **96.63%** |

---

### **Conclusion**
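- Tesseract performs well on this image, scoring above 90% on all three metrics.
- Preprocessing steps such as thresholding and resizing are intended to improve recognition. This notebook does not compare against the unprocessed image, so the size of that gain is not measured here.
- The character-level accuracy reflects small discrepancies visible in the output, such as `AT` for `AI` and stray characters picked up from the window's menu bar (`File Edit View Ww. £33`). Because both texts are lowercased and stripped of punctuation before comparison, case and punctuation errors are not counted.
- Word-level accuracy is lower than character-level accuracy because one misread character can cause a whole word to miss, and extra or omitted words also reduce the score.
- Further improvements could come from tuning the preprocessing parameters or the page segmentation mode (the code suggests `--psm 11` as an alternative), or from using deep learning-based OCR models.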

---
