# Surya OCR Text Extraction

## What is Surya?

Surya is a tool for performing Optical Character Recognition (OCR) on images or documents with text. It is designed specifically with structured documents in mind, and performs well across dozens of languages. Its results compare favourably with those of other openly available OCR tools, including Google's widely used Tesseract.

Although Surya's output resembles that of a conventional OCR engine in the sense that it will reproduce text strings from an input image, it is based on a machine learning model called [DONUT](https://arxiv.org/abs/2111.15664) (Document Understanding Transformer), which was released in 2021. Unlike traditional OCR engines, the DONUT architecture utilises a Transformer-based model which simultaneously analyses a document's layout, structure and content, without performing OCR as a separate step.



## Before you begin

Before you start processing your images or documents, it is a good idea to check that they are in the correct format. If you are using images, check that the text is clearly visible, and that any background has been cropped out of the picture.
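
As a rough pre-flight check, the sketch below lists which files in an input folder have an extension you expect to process. The `SUPPORTED_EXTENSIONS` set and the `split_by_extension` helper are illustrative assumptions, not part of Surya; adjust the set to match your own material. It only looks at file names, so you still need to inspect image quality by eye.

```python
from pathlib import Path

# Assumed set of image/PDF extensions (hypothetical; edit to suit your files).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}

def split_by_extension(folder, allowed=SUPPORTED_EXTENSIONS):
    """Return (accepted, rejected) lists of file names found in `folder`."""
    accepted, rejected = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip sub-folders
        target = accepted if path.suffix.lower() in allowed else rejected
        target.append(path.name)
    return accepted, rejected

input_dir = Path("/content/surya_input")  # folder used later in this notebook
if input_dir.is_dir():
    ok, skipped = split_by_extension(input_dir)
    print(f"{len(ok)} file(s) look usable; check these before running OCR: {skipped}")
```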


**If you simply want to test the code**, the cells below will pull a volume of the [Journal of Fabrics and Textile Industries](https://archive.org/details/journaloffabrics03unse/page/n7/mode/2up) from 1882, and will save two pages as images in a folder.

In [None]:
!pip install PyPDF2 pdf2image

In [None]:
!apt-get install -y poppler-utils

In [None]:
import requests

In [None]:
response = requests.get('https://dn790000.ca.archive.org/0/items/journaloffabrics03unse/journaloffabrics03unse.pdf', stream=True)
response.raise_for_status()  # Stop here if the download failed

save_path = "/content/journaloffabrics03unse.pdf"

# Write the PDF to disk in chunks so the whole file is never held in memory
with open(save_path, 'wb') as file:
    for chunk in response.iter_content(chunk_size=1024):
        file.write(chunk)

print(f"PDF successfully downloaded and saved to {save_path}")

In [None]:
import os
from pdf2image import convert_from_path

def extract_pages_as_images(pdf_path, pages, output_dir):
    # Create the output folder if it does not already exist
    os.makedirs(output_dir, exist_ok=True)
    # Render only the span of pages that covers the requested page numbers
    images = convert_from_path(pdf_path, first_page=min(pages), last_page=max(pages))

    # Pair each rendered image with its 1-based page number and save the requested ones
    for page_num, image in zip(range(min(pages), max(pages) + 1), images):
        if page_num in pages:
            image_path = os.path.join(output_dir, f"page_{page_num}.png")
            image.save(image_path, "PNG")
            print(f"Saved page {page_num} as {image_path}")

pdf_path = "/content/journaloffabrics03unse.pdf"  # Path to the PDF file
pages = [12, 13]  # Pages to extract (1-based index)
output_dir = "/content/surya_input"  # Directory to save images

extract_pages_as_images(pdf_path, pages, output_dir)

## Usage

To make the most of Surya's capabilities, run it through a hosted runtime on Google Colab. You can choose a hosted runtime by navigating to the top right-hand corner of the notebook and clicking on the triangle to the right of the RAM/Disk indicator (see image below). Make sure that you choose a GPU runtime when selecting from the options available.

While it is possible to run Surya on a CPU, the processing time is likely to be long. If you own a recent MacBook, you may also be able to run it locally on your machine, depending on your specifications. For example, a test using four images as input on an M1 MacBook Air took 32 minutes. Running the same process on an L4 GPU hosted runtime took under a minute. To find out more, please refer to the [Surya documentation](https://github.com/VikParuchuri/surya).

***Warning***: while Colab has a number of free options for hosted runtimes, processing large OCR batches will require you to purchase compute units from Google.

<figure>
<center>
<img width=350 src='https://drive.google.com/uc?id=1-1kVQjE2Zco6kG6NLVZsg0L5vzqm3e5i'/>
</center>
</figure>
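
To confirm that the runtime you selected actually exposes a GPU, one option is to look for the `nvidia-smi` command-line tool, which NVIDIA GPU runtimes normally provide. The `gpu_visible` helper below is a hypothetical convenience, not part of Surya, and it only detects NVIDIA GPUs, so it will report `False` on a MacBook even when Apple Silicon acceleration is available.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if the `nvidia-smi` tool is present and exits cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0

if gpu_visible():
    print("An NVIDIA GPU is visible to this runtime.")
else:
    print("No NVIDIA GPU found; OCR is likely to be slow on this runtime.")
```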

#### Install Surya

The first step in setting up Surya is to install the package via pip. Run the following cell to do this. Once Surya has successfully installed, you will see the following warning:

**WARNING: The following packages were previously imported in this runtime: [PIL]
You must restart the runtime in order to use newly installed versions.**

You will be prompted by Colab to restart your session and delete the runtime. Click on 'restart session' and scroll down to the next cell.

In [None]:
!pip install surya-ocr

#### Upload your documents

Surya accepts PDFs, images, or folders containing collections of either as input. For the purposes of this notebook, navigate to the left-hand pane in Colab and select the 'Files' icon. A tab will open. Right-click to show the available options, select 'New folder', and name the new folder 'surya_input'.

**If you extracted the example pages from archive.org above, you don't need to do this. The images will already be saved in a folder called 'surya_input'.**

<figure>
<center>
<img width=350 src='https://drive.google.com/uc?id=1D9INA-PvN-ZNpxWieiaouqq1fdDDaAGe'/>
</center>
</figure>




You can now drag your individual files into this folder ready for processing.

#### Run Surya in the CLI

While you can use Surya in Python for greater flexibility, we have found that the CLI is by far the easiest way to run the tool. To perform OCR on the files uploaded to the 'surya_input' folder, run the following code:

In [None]:
!surya_ocr '/content/surya_input' --langs en --results_dir '/content/surya_output'

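This will save a JSON file named 'results.json' in a subfolder of the results directory named after the input folder (here, '/content/surya_output/surya_input'). Its top-level keys correspond to your input documents/images, and each value is a list of pages. Each page holds a `text_lines` list with the following fields for each text line detected:

* text
* confidence
* polygon
* bbox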

You can now download the json file for use in other applications.
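
Before reusing the output elsewhere, it can help to see which lines the model was least sure about. The sketch below assumes the JSON maps each input name to a list of pages, each with a `text_lines` list carrying `text` and `confidence`, as in the extraction code later in this notebook. The `low_confidence_lines` helper and the 0.8 threshold are illustrative choices, not Surya defaults.

```python
import json
from pathlib import Path

def low_confidence_lines(results, threshold=0.8):
    """Collect (document, page_number, confidence, text) for lines below `threshold`."""
    flagged = []
    for name, pages in results.items():
        for page_number, page in enumerate(pages, start=1):
            for line in page.get("text_lines", []):
                confidence = line.get("confidence")
                if confidence is not None and confidence < threshold:
                    flagged.append((name, page_number, confidence, line.get("text", "")))
    return flagged

# Assumed output location, matching the paths used elsewhere in this notebook.
results_path = Path("/content/surya_output/surya_input/results.json")
if results_path.exists():
    with open(results_path, encoding="utf-8") as handle:
        results = json.load(handle)
    for name, page_number, confidence, text in low_confidence_lines(results):
        print(f"{name} p.{page_number} ({confidence:.2f}): {text}")
```

Lines flagged this way are good candidates for manual checking against the source image.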

#### Extract text fields as .txt files

If you want a .txt file for each JSON file created, you can use the following code (generated by ChatGPT).

The code creates the 'text_files' output folder if it does not already exist, so there is no need to create it by hand.

In [None]:
import json
import os

def save_text_lines_to_individual_files(json_directory, output_directory):
    # Ensure the output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Iterate over all JSON files in the given directory
    for json_file_name in os.listdir(json_directory):
        if json_file_name.endswith('.json'):
            json_file_path = os.path.join(json_directory, json_file_name)

            # Load the JSON data from the file
            with open(json_file_path, 'r', encoding='utf-8') as file:
                data = json.load(file)

            # Create a corresponding .txt file name
            txt_file_name = f"{os.path.splitext(json_file_name)[0]}.txt"
            txt_file_path = os.path.join(output_directory, txt_file_name)

            # Open the text file for writing
            with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
                # Iterate through the JSON structure and write each text line to the file
                for image_name, pages in data.items():
                    for page in pages:
                        text_lines = page.get("text_lines", [])
                        for text_line in text_lines:
                            text = text_line.get('text', '')
                            txt_file.write(text + '\n')

json_directory = '/content/surya_output/surya_input'  # Replace with the path to your directory containing JSON files
output_directory = '/content/text_files'  # Replace with the desired output directory for text files
save_text_lines_to_individual_files(json_directory, output_directory)
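
The function above writes one .txt file per JSON file, so when a single JSON file covers several images their text ends up in the same .txt file. If you would rather have one .txt file per image, a variant along the following lines may help. It is a sketch that assumes the same JSON structure as the code above; `save_text_per_image` and the 'text_files_per_image' folder are hypothetical names.

```python
import json
import os

def save_text_per_image(results_path, output_directory):
    """Write one .txt file per top-level key in a Surya results file."""
    os.makedirs(output_directory, exist_ok=True)
    with open(results_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    written = []
    for image_name, pages in data.items():
        txt_path = os.path.join(output_directory, f"{image_name}.txt")
        with open(txt_path, "w", encoding="utf-8") as txt_file:
            for page in pages:
                for text_line in page.get("text_lines", []):
                    txt_file.write(text_line.get("text", "") + "\n")
        written.append(txt_path)
    return written

# Assumed location of the OCR output; only runs if the file is present.
results_path = "/content/surya_output/surya_input/results.json"
if os.path.exists(results_path):
    save_text_per_image(results_path, "/content/text_files_per_image")
```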



# Layout Detection

In [None]:
from PIL import Image
from surya.detection import batch_text_detection
from surya.layout import batch_layout_detection
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.layout.model import load_model as load_layout_model
from surya.model.layout.processor import load_processor as load_layout_processor

IMAGE_PATH = '/content/PART II 5.jpeg'

image = Image.open(IMAGE_PATH)
model = load_layout_model()
processor = load_layout_processor()
det_model = load_det_model()
det_processor = load_det_processor()

# layout_predictions is a list of dicts, one per image
line_predictions = batch_text_detection([image], det_model, det_processor)
layout_predictions = batch_layout_detection([image], model, processor, line_predictions)