<a href="https://colab.research.google.com/github/Vishal5051/TextSight/blob/main/TextSight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install necessary libraries
!pip install transformers torch easyocr gradio


Collecting easyocr
  Downloading easyocr-1.7.2-py3-none-any.whl.metadata (10 kB)
Collecting gradio
  Downloading gradio-4.44.0-py3-none-any.whl.metadata (15 kB)
Collecting python-bidi (from easyocr)
  Downloading python_bidi-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting pyclipper (from easyocr)
  Downloading pyclipper-1.3.0.post5-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (9.0 kB)
Collecting ninja (from easyocr)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0 (from gradio)
  Downloading fastapi-0.115.0-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadat

In [2]:
# Import the necessary libraries
import easyocr
import gradio as gr
import re

In [3]:

# Initialize EasyOCR Reader with English and Hindi support
reader = easyocr.Reader(['en', 'hi'])  # English and Hindi language support




Progress: |██████████████████████████████████████████████████| 100.0% Complete



Progress: |██████████████████████████████████████████████████| 100.0% Complete

In [4]:

def clean_and_format_text(text_pieces):
    """
    This function cleans and formats the extracted text pieces
    into proper sentences with appropriate spacing and punctuation.
    """
    # Combine text into a single string
    combined_text = ' '.join(text_pieces)

    # Basic cleaning: attach punctuation to the preceding word, then collapse extra spaces
    combined_text = re.sub(r'\s([?.!,"](?:\s|$))', r'\1', combined_text)  # Remove the space before punctuation
    combined_text = re.sub(r'\s+', ' ', combined_text).strip()  # Remove extra spaces

    # Capitalize the first character of the text
    if combined_text:
        combined_text = combined_text[0].upper() + combined_text[1:]

    return combined_text
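
To make the cleaning steps concrete, here is a small worked example. The function body is copied from the cell above so the snippet runs on its own; the `pieces` list is invented sample input, not real EasyOCR output.

```python
import re

def clean_and_format_text(text_pieces):
    # Same logic as the notebook cell: join, tidy punctuation, collapse spaces
    combined_text = ' '.join(text_pieces)
    combined_text = re.sub(r'\s([?.!,"](?:\s|$))', r'\1', combined_text)
    combined_text = re.sub(r'\s+', ' ', combined_text).strip()
    if combined_text:
        combined_text = combined_text[0].upper() + combined_text[1:]
    return combined_text

# Invented fragments, similar in shape to what readtext(..., detail=0) returns
pieces = ["hello", ",", "world", "!", "how  are", "you", "?"]
print(clean_and_format_text(pieces))  # Hello, world! how are you?
```

Note that only the very first character is capitalized; later sentences in the same text keep their original casing.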

In [5]:

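def highlight_keywords(text, keywords):
    """
    This function highlights the matching keywords in the extracted text.
    """
    # Escape each keyword so it is matched literally, and skip empty entries
    terms = [re.escape(keyword) for keyword in keywords if keyword]
    if not terms:
        return text
    # Match all keywords in one pass so inserted HTML is never re-scanned
    pattern = re.compile("|".join(terms), flags=re.IGNORECASE)
    # Use HTML <span> with red color to highlight each match
    return pattern.sub(
        lambda match: f"<span style='color: red; font-weight: bold;'>{match.group(0)}</span>",
        text,
    )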
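
Keywords typed by a user are plain text, while the `re` functions interpret their pattern as a regular expression. The sketch below shows the difference on an invented sample string; `text` and `keyword` are illustrative values only.

```python
import re

text = "Version 3.5 replaced build 345"
keyword = "3.5"

# Unescaped: "." acts as a wildcard, so "345" is matched as well
unescaped = re.findall(keyword, text)

# Escaped: re.escape makes the keyword match literally
escaped = re.findall(re.escape(keyword), text)

print(unescaped)  # ['3.5', '345']
print(escaped)    # ['3.5']
```

Escaping matters most for keywords that contain characters such as `.`, `(`, or `*`, which would otherwise change the meaning of the pattern.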

In [6]:

def ocr_and_search(image, keyword_input):
    """
    Perform OCR on the uploaded image, search for keywords, and return the results.
    """
    # Return early when no image was uploaded
    if image is None:
        return ""

    # Perform OCR using EasyOCR
    result = reader.readtext(image, detail=0)  # detail=0 returns only text pieces

    # Clean and format the extracted text into proper sentences
    formatted_text = clean_and_format_text(result)

    # Split keywords by comma, strip whitespace, and drop empty entries
    keywords = [keyword.strip() for keyword in keyword_input.split(',') if keyword.strip()] if keyword_input else []

    # Highlight keywords in the text
    highlighted_text = highlight_keywords(formatted_text, keywords)

    return highlighted_text
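
The keyword parsing step can be checked in isolation. The helper name `parse_keywords` below is hypothetical and not part of the notebook; it is a minimal sketch of splitting the comma-separated input while ignoring blank entries such as those left by a trailing comma.

```python
def parse_keywords(keyword_input):
    # Hypothetical helper: an empty or missing input yields no keywords
    if not keyword_input:
        return []
    # Split on commas, trim whitespace, and skip entries that end up empty
    return [part.strip() for part in keyword_input.split(",") if part.strip()]

print(parse_keywords(" invoice, total ,, "))  # ['invoice', 'total']
print(parse_keywords(""))                     # []
```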

In [7]:

# Create Gradio interface for the web application
interface = gr.Interface(
    fn=ocr_and_search,  # Function to call for OCR and search
    inputs=[
        gr.Image(type="filepath", label="Upload Image"),  # Input: Image file path for OCR
        gr.Textbox(label="Enter keywords to search (comma-separated, optional)",
                   placeholder="Type your keywords here...")  # Input: Textbox for keyword search
    ],
    outputs=gr.HTML(label="Extracted Text with Search Highlights"),  # Output: Display the text with highlights
    title="TextSight",  # Title for the app
    description="""Instructions:
1. **Upload an image** containing the text you want to extract.
2. **Languages**: The application reads both Hindi and English text; no language selection is needed.
3. **Enter keywords**: Type your keywords in the search box, separated by commas.
4. **View results**: Extracted text will be displayed with your keywords highlighted in red.
""",  # Properly formatted description

    css="""
    .gradio-container {
        max-width: 800px; /* Set a maximum width for better readability */
        margin: auto;     /* Center the interface on the page */
        padding: 20px;    /* Add some padding around the content */
    }
    h1, h2 {
        text-align: center; /* Center the titles for better appearance */
    }
    .input-textbox {
        margin: 10px 0;    /* Add margin for better spacing */
    }
    """,  # Add CSS for responsive design and styling
)


In [8]:

# Launch the interface
if __name__ == "__main__":
    interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://0eccf3fb2e63c1091e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
