# Text Detection with Surya

This notebook demonstrates how to use `Surya` to detect and recognize text in an image, and then how to extract structured fields from the recognized text with an LLM. Surya is named for the Hindu sun god, who has universal vision. Find out more about Surya [here](https://github.com/VikParuchuri/surya).

## Installation

First, install the `surya-ocr` library with pip.

In [None]:
# Install Surya
!pip install surya-ocr

## Initialize Surya Detector and Recognizer

We first initialize the detector and the recognizer. The detector locates text regions in an image, and the recognizer reads the text in those regions. On first use, the library downloads the pre-trained models.

In [None]:
from surya.model.detection.model import load_model as load_detector
from surya.model.detection.model import load_processor as load_detector_processor
from surya.model.recognition.model import load_model as load_recognizer
from surya.model.recognition.processor import load_processor as load_recognizer_processor

# Load the recognition model and its processor
recognizer = load_recognizer()
recognizer_processor = load_recognizer_processor()

# Load the detection model and its processor
detector = load_detector()
detector_processor = load_detector_processor()

## Load and Display the Image

We will load our document image and display it using the Jupyter notebook's built-in `display` function.

In [None]:
from PIL import Image

# Load the image from a path
image_path = "path/to/image.jpg"
image = Image.open(image_path)

display(image)

## Perform Text Detection and Recognition

We will use the `run_ocr` function to detect and recognize text in the image. It returns a list of `OCRResult` objects, one per input image. This is the structure of `OCRResult`:

```py
# Document level result
class OCRResult(BaseModel):
    text_lines: List[TextLine]
    languages: List[str]
    image_bbox: List[float]
```

We will use only the `text_lines` attribute. Here is the structure of `TextLine`:

```py
# Text line level result
class TextLine:
    text: str                   # Recognized text
    confidence: float           # 0.0 to 1.0
    polygon: List[List[float]]  # [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
```

In [None]:
from surya.ocr import run_ocr
from surya.schema import OCRResult, TextLine

# Define languages to recognize
langs = ["en", "th"]
predictions: list[OCRResult] = run_ocr([image], [langs], detector, detector_processor, recognizer, recognizer_processor)

# Because we only have one image, our result will be in the first element
prediction = predictions[0]
# Unpack the prediction
text_lines: list[TextLine] = prediction.text_lines

# Let's print the first 5 text lines
for idx, text_line in enumerate(text_lines[:5]):
    print(f"Textline {idx}: {text_line.text}")
    print(f"Confidence: {text_line.confidence}")
    print(f"Bounding box: {text_line.polygon}")
    print("-" * 80)
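
OCR output often contains a few low-confidence lines that are worth filtering out before further processing. The sketch below shows one way to do that. `Line` is a stand-in dataclass with the same `text` and `confidence` fields as `TextLine`, and both `filter_confident` and the 0.5 threshold are hypothetical choices for illustration, not part of Surya.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Stand-in for Surya's TextLine (assumption: only the fields used here)."""
    text: str
    confidence: float


def filter_confident(lines, threshold=0.5):
    """Keep lines whose confidence is at or above the threshold."""
    return [line for line in lines if line.confidence >= threshold]


# Made-up sample data for illustration
sample = [Line("Hello", 0.92), Line("w0rld", 0.31), Line("Surya", 0.50)]
kept = filter_confident(sample)
print([line.text for line in kept])  # ['Hello', 'Surya']
```

Because the function only reads the `confidence` attribute, it should also work on `prediction.text_lines`, assuming every line has a confidence value set.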

## Draw Bounding Boxes and Display the Image with Predictions

We will draw the bounding boxes around the detected text and display the image with the text predictions.

In [None]:
from PIL import ImageDraw, ImageFont
from surya.schema import OCRResult

FONT = ImageFont.truetype("../assets/THSarabun.ttf", size=20)


def draw_boxes(image: Image.Image, result: OCRResult) -> None:
    """Draw each text line's bounding box, text, and confidence on the image (in place)."""
    # Create a drawing object
    draw = ImageDraw.Draw(image)

    # Draw each result on the image.
    for text_line in result.text_lines:
        # Unpack the result.
        text = text_line.text
        confidence_score = text_line.confidence
        bbox = text_line.polygon

        # bbox holds the four corner points of the box: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]].
        # PIL draws a rectangle from two points, top-left and bottom-right: [x1, y1, x2, y2],
        # so we take the first and third points of the polygon.

        pil_bbox = [bbox[0][0], bbox[0][1], bbox[2][0], bbox[2][1]]

        # 1. Draw the bounding box
        draw.rectangle(pil_bbox, outline="blue", width=2)
        # 2. Draw the text and confidence score e.g. 'Hello (0.72)'
        draw_text = f"{text} ({confidence_score:.2f})"

        # Place text at the top-left of the bounding box and shift it up by 20 pixels
        x = pil_bbox[0]
        y = pil_bbox[1] - 20
        draw.text((x, y), draw_text, fill="red", font=FONT)


# Create a copy of the image to draw on
image_with_boxes = image.copy()
draw_boxes(image_with_boxes, prediction)

display(image_with_boxes)
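
The conversion above takes the first and third polygon points as the top-left and bottom-right corners, which assumes the points are ordered from the top-left and the box is not rotated. A more defensive sketch takes the minimum and maximum over all four points. `polygon_to_bbox` is a hypothetical helper, not a Surya function.

```python
def polygon_to_bbox(polygon):
    """Return the axis-aligned (left, top, right, bottom) box enclosing a polygon."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly skewed quadrilateral, as might come from a tilted scan (made-up values)
skewed = [[10, 5], [50, 7], [48, 20], [8, 18]]
print(polygon_to_bbox(skewed))  # (8, 5, 50, 20)
```

The resulting tuple can be passed to `draw.rectangle` or `image.crop` in place of `pil_bbox`.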

## Advanced

We can also run the `surya` detector and the `surya` recognizer separately. This is useful when we want to combine one of them with our own text detection or text recognition model.

### Text Detection with Surya

The `batch_text_detection` function returns a list of `TextDetectionResult` objects, one per input image. This is the structure of `TextDetectionResult`:

```py
class TextDetectionResult(BaseModel):
    bboxes: List[PolygonBox]
    vertical_lines: List[ColumnLine]
    heatmap: Any
    affinity_map: Any
    image_bbox: List[float]
```

We will use only the `bboxes` attribute. Here is the structure of `PolygonBox`:

```py
class PolygonBox(BaseModel):
    polygon: List[List[float]]
    confidence: Optional[float] = None
```

In [None]:
from surya.detection import batch_text_detection
from surya.schema import TextDetectionResult

bbox_predictions: list[TextDetectionResult] = batch_text_detection([image], detector, detector_processor)

# Because we only have one image, our result will be in the first element
bbox_prediction = bbox_predictions[0]

# Unpack the prediction
bboxes = bbox_prediction.bboxes

# Let's print the first 5 bounding boxes
for idx, bbox in enumerate(bboxes[:5]):
    print(f"Bounding box {idx}: {bbox.polygon}")
    print(f"Confidence: {bbox.confidence}")
    print("-" * 80)

Now we can draw the bounding boxes.

In [None]:
from PIL import ImageDraw

# Create a copy of the image to draw on
image_with_boxes = image.copy()

# Create a drawing object
draw = ImageDraw.Draw(image_with_boxes)

# Draw each region on the image (FONT is defined in the earlier drawing cell)
for box in bboxes:
    # Unpack the result.
    confidence_score = box.confidence
    polygon = box.polygon

    # polygon holds the four corner points: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]].
    # PIL draws a rectangle from two points, top-left and bottom-right: [x1, y1, x2, y2],
    # so we take the first and third points of the polygon.
    pil_bbox = [polygon[0][0], polygon[0][1], polygon[2][0], polygon[2][1]]

    # 1. Draw the bounding box
    draw.rectangle(pil_bbox, outline="blue", width=2)
    # 2. Draw the confidence score e.g. '0.72'
    draw_text = f"{confidence_score:.2f}"
    # Place the text 20 pixels above the top-left corner of the box to keep it visible
    draw.text((pil_bbox[0], pil_bbox[1] - 20), draw_text, fill="red", font=FONT)


display(image_with_boxes)
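
Placing the label 20 pixels above the box puts it off-canvas when a box sits near the top edge of the image. One possible fix is to fall back to drawing at the top edge of the box itself. `label_position` and its `offset` parameter are hypothetical names used for illustration.

```python
def label_position(pil_bbox, offset=20):
    """Return (x, y) for a label above the box, or at the box's top edge
    when there is not enough room above it."""
    x, top = pil_bbox[0], pil_bbox[1]
    y = top - offset if top - offset >= 0 else top
    return (x, y)


print(label_position([30, 100, 200, 130]))  # (30, 80)
print(label_position([30, 5, 200, 35]))     # (30, 5)
```

The returned pair can be passed as the first argument to `draw.text`.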

Now let's crop the detected text regions and perform text recognition using the `surya` recognizer.

In [None]:
# Crop the textboxes from the image
textboxes = []
for bbox in bboxes:
    polygon = bbox.polygon
    # polygon holds the four corner points: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]].
    # PIL crops with a (left, upper, right, lower) box, so we take the first and third points.
    left, upper = polygon[0]
    right, lower = polygon[2]
    textboxes.append(image.crop((left, upper, right, lower)))

### Text Recognition with Surya
The `batch_recognition` function returns a tuple of `List[str]` and `List[float]`: the first element is the list of recognized text for each crop, and the second is the list of corresponding confidence scores.

In [None]:
from surya.recognition import batch_recognition

# The number of languages must match the number of textboxes
# So we need to pass a list of languages for each textbox
languages = [["en", "th"]] * len(textboxes)
text_predictions, confidence_scores = batch_recognition(textboxes, languages, recognizer, recognizer_processor)

# Let's print the first 5 text predictions
for idx, (text, confidence_score, textbox) in enumerate(zip(text_predictions, confidence_scores, textboxes)):
    if idx == 5:
        break

    print(f"Text: {text}")
    print(f"Confidence: {confidence_score}")
    display(textbox)
    print("-" * 80)

# Extract Information from the Recognized Text

## Create the Extraction Prompt

In [15]:
extract_question = """
You are provided with text recognized by an OCR system from a Thai vehicle registration book (สมุดทะเบียนรถ). The recognized text segments are separated by tab characters.
Your task is to extract the following information from the text.
The value is typically located to the right of its key in the document.
Some of the text may be corrupted, missing diacritics, or misread; please correct obvious errors where possible.
Extract these details:

1. วันจดทะเบียน (date_of_registration)
2. เลขทะเบียน (registration_no)
3. จังหวัด (car_province)
4. ประเภท (vehicle_use)
5. รย. (type)
6. ลักษณะ (body_style)
7. ยี่ห้อรถ (manufacturer)
8. แบบ (model)
9. รุ่นปี คศ (year)
10. สี (color)
11. เลขตัวรถ (chassis_number)
12. อยู่ที่ (chassis_location)
13. ยี่ห้อเครื่องยนต์ (engine_manufacturer)
14. เลขเครื่องยนต์ (engine_number)
15. อยู่ที่ (engine_location)
16. เชื้อเพลิง (fuel_type)
17. เลขถังแก๊ส (fuel_tank_number)
18. จำนวน (cylinders)
19. ซีซี (cubic_capacity)
20. แรงม้า (horse_power)
21. จำนวนเพลาและล้อ (axles_wheels_no)
22. น้ำหนักรถ (unladen_weight)
23. น้ำหนักบรรทุก/น้ำหนักเพลา (load_capacity)
24. น้ำหนักรวม (gross_weight)
25. ที่นั่ง (seats)

Instructions:

Carefully examine the text and locate each piece of information.
If a particular field is not present in the text, use the value "N/A" for that field.
Ensure all extracted text is in its original language (Thai or English) as it appears in the document.
Return the extracted information in a JSON format, using the English key names provided in parentheses.
Only return the JSON output, without any additional explanation or text.

Example of expected output in JSON format:
{
  "date_of_registration": "1 ม.ค. 2566",
  "registration_no": "กข 1234",
  "car_province": "กรุงเทพมหานคร",
  ...
  "seats": "4"
}
"""

## Perform Information Extraction with Llama 3.2

In [None]:
# Stack recognized text into a single string for prompting
recognized_text = "\t".join(text_predictions)
recognized_text

In [17]:
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM

# Stop generation at Llama's end-of-turn token
llm = OllamaLLM(model="llama3.2", stop=["<|eot_id|>"])

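def get_model_response(user_prompt: str, system_prompt: str) -> str:
    # NOTE: This is a plain string, not an f-string. PromptTemplate fills in the
    # {placeholders}, which must not contain whitespace inside the curly braces.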
    template = """
        <|begin_of_text|>
        <|start_header_id|>system<|end_header_id|>
        {system_prompt}
        <|eot_id|>
        <|start_header_id|>user<|end_header_id|>
        {user_prompt}
        <|eot_id|>
        <|start_header_id|>assistant<|end_header_id|>
        """

    # Build a prompt template with the two input variables
    prompt = PromptTemplate(
        input_variables=["system_prompt", "user_prompt"],
        template=template,
    )

    # Fill in the template and send the formatted prompt to the model
    return llm.invoke(prompt.format(system_prompt=system_prompt, user_prompt=user_prompt))

In [18]:
# Run the extraction: the OCR text is the user prompt, the instructions are the system prompt
user_prompt = recognized_text
system_prompt = extract_question
answer = get_model_response(user_prompt, system_prompt)

In [None]:
print(answer)
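
The prompt asks for JSON only, but language models sometimes wrap the object in extra text or omit fields. The following sketch shows one way to recover a dictionary from the response. `parse_extraction` and the shortened `EXPECTED_KEYS` list are hypothetical, and the sample string is made up for illustration rather than real model output.

```python
import json

# Assumption: a shortened subset of the keys named in the prompt
EXPECTED_KEYS = ["date_of_registration", "registration_no", "car_province", "seats"]


def parse_extraction(answer, expected_keys=EXPECTED_KEYS):
    """Parse the span from the first '{' to the last '}' in the answer
    and fill any missing keys with "N/A"."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        return {key: "N/A" for key in expected_keys}
    try:
        data = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        # Malformed JSON: fall back to an empty result
        data = {}
    return {key: str(data.get(key, "N/A")) for key in expected_keys}


sample_answer = 'Here is the result:\n{"registration_no": "กข 1234", "seats": 4}'
print(parse_extraction(sample_answer))
```

With the real response, the call would be `parse_extraction(answer, expected_keys=...)` with the full list of 25 keys from the prompt. Values are converted to strings so that downstream code sees one consistent type.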