# OCR Application using Python, OpenCV, and Tesseract in Google Colab

This notebook allows you to upload **multiple images** and extract text from them using Optical Character Recognition (OCR). You can also specify the **language** of the text for more accurate OCR results.

---

## Table of Contents

- [Install Required Libraries](#install-required-libraries)
- [Import Necessary Libraries](#import-necessary-libraries)
- [Select the Language for OCR](#select-the-language-for-ocr)
- [Upload Multiple Images](#upload-multiple-images)
- [Process Uploaded Images](#process-uploaded-images)
- [Conclusion](#conclusion)

---

## Install Required Libraries

First, we need to install the necessary libraries and dependencies. Google Colab already has some libraries installed, but we'll ensure everything we need is set up.

In [None]:
# Install Tesseract OCR and language data
!apt-get update
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev

# Install additional language packs (optional)
# Uncomment and modify the following line to install language packs you need
# !apt-get install -y tesseract-ocr-[lang_code]

# Install Python libraries
!pip install pytesseract
!pip install opencv-python
!pip install Pillow

**Explanation:**

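- **Tesseract OCR:** We install the Tesseract OCR engine and its development files. `pytesseract` calls the `tesseract` command-line binary, so `libtesseract-dev` is only needed if you build other software against the Tesseract library.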
- **Language Packs:** You can install additional language packs as needed.
- **Python Libraries:** We install `pytesseract`, `opencv-python`, and `Pillow` for OCR and image processing.
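Before moving on, it can help to confirm that the install cell actually put the engine on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `tesseract_version` is hypothetical and not part of the notebook or of `pytesseract`.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    # Some Tesseract builds print the version to stderr, so merge both streams.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


version = tesseract_version()
print(version or "Tesseract not found; re-run the install cell.")
```

If the message reports that Tesseract was not found, re-running the `apt-get` lines above is usually the first thing to try.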

## Import Necessary Libraries

In [None]:
import cv2
import pytesseract
from google.colab import files
from IPython.display import display
import numpy as np
from PIL import Image
import io

**Explanation:**

- **cv2:** OpenCV library for image processing.
- **pytesseract:** Python wrapper for Tesseract OCR.
- **files:** For uploading files in Colab.
- **display:** To display images and outputs.
- **numpy:** For numerical operations on image data.
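- **PIL (Image):** For opening and converting images.
- **io:** Standard-library module used to wrap the uploaded bytes in an in-memory file object.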

## Select the Language for OCR

Interactive widgets such as `ipywidgets.Dropdown` can behave differently across notebook environments, so to keep things simple we prompt the user to type the language code.

In [None]:
# List of common language codes
common_languages = {
    'English': 'eng',
    'Spanish': 'spa',
    'French': 'fra',
    'German': 'deu',
    'Italian': 'ita',
    'Portuguese': 'por',
    'Russian': 'rus',
    'Chinese Simplified': 'chi_sim',
    'Chinese Traditional': 'chi_tra',
    'Japanese': 'jpn',
    'Korean': 'kor',
    'Hindi': 'hin',
    'Arabic': 'ara',
}

# Display the language options
print("Select the language for OCR from the list below:")
for lang in common_languages:
    print(f"- {lang} ({common_languages[lang]})")

# Prompt the user to input the language code
lang_code = input("Enter the language code (e.g., 'eng' for English): ").strip()

# Fall back to English when nothing is entered (the code itself is not checked here)
if not lang_code:
    lang_code = 'eng'  # Default to English
    print("No language code entered. Defaulting to English ('eng').")
else:
    print(f"Using language code: '{lang_code}'")

**Explanation:**

- We provide a list of common languages and their Tesseract language codes.
- We prompt the user to input the desired language code.
- If the user doesn't input a code, we default to English (`'eng'`).

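**Note:** Ensure that the language data for the chosen language is installed in Tesseract. If not, you can install it using `!apt-get install -y tesseract-ocr-[lang_code]`. The package names use hyphens where the language code has underscores; for example, `chi_sim` is packaged as `tesseract-ocr-chi-sim`.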
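The cell above accepts whatever the user types, so a typo only surfaces later as an OCR error. As a rough sketch, the input could be checked against the installed languages first. The function `resolve_lang_code` and its parameters are hypothetical; in the notebook, `installed` might come from `pytesseract.get_languages()`, assuming your `pytesseract` version provides it. Tesseract accepts several codes joined with `+` (for example `eng+spa`), which the sketch also handles.

```python
def resolve_lang_code(user_input, installed, names=None, default="eng"):
    """Map user input to a Tesseract `lang` string, or raise ValueError.

    Accepts a code ('spa'), a display name ('Spanish'), or several of
    either joined with '+' ('eng+spa').
    """
    names = names or {}
    by_name = {name.lower(): code for name, code in names.items()}
    text = user_input.strip() or default  # empty input falls back to the default
    codes = []
    for part in text.split("+"):
        part = part.strip()
        code = by_name.get(part.lower(), part)
        if code not in installed:
            raise ValueError(f"Language data for {code!r} is not installed")
        codes.append(code)
    return "+".join(codes)


# Illustrative values only; replace with the real dictionary and installed set.
example_names = {"English": "eng", "Spanish": "spa"}
example_installed = {"eng", "spa", "osd"}
print(resolve_lang_code("Spanish", example_installed, example_names))
print(resolve_lang_code("eng + spa", example_installed, example_names))
```

Raising `ValueError` lets the calling cell decide whether to re-prompt or stop, rather than silently continuing with a code Tesseract cannot load.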

## Upload Multiple Images

In [None]:
# Prompt the user to upload multiple image files
print("Please upload the images for OCR.")
uploaded = files.upload()

# Check if any files were uploaded
if not uploaded:
    print("No files uploaded. Please upload at least one image.")
else:
    print(f"{len(uploaded)} file(s) uploaded successfully.")

**Explanation:**

- We use `files.upload()` to prompt the user to upload one or more image files.
- The uploaded files are stored in the `uploaded` dictionary.
- We check if any files were uploaded and provide feedback.

## Process Uploaded Images

In [None]:
# Iterate over the uploaded files and process each image
for filename, content in uploaded.items():
    print(f"\nProcessing '{filename}'...")
    try:
        # Open the image
        image = Image.open(io.BytesIO(content)).convert('RGB')
        img = np.array(image)

        # Display the original image
        print("**Original Image:**")
        display(image)

        # Preprocess the image
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        _, thresh = cv2.threshold(
            gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
        )
        denoised = cv2.medianBlur(thresh, 3)

        # Perform OCR
        custom_config = r'--oem 3 --psm 6'
        text = pytesseract.image_to_string(
            denoised, lang=lang_code, config=custom_config
        )

        # Print the extracted text
        print("**Extracted Text:**")
        print(text)
    except Exception as e:
        print(f"An error occurred while processing '{filename}': {e}")

**Explanation:**

- **Image Loading:** We open each uploaded image from its in-memory bytes and convert it to RGB so that the later grayscale conversion always receives three channels.
- **Display Original Image:** We display the original image for reference.
- **Preprocessing:**
  - **Grayscale Conversion:** Simplifies the image data.
  - **Thresholding:** Binarizes the image to separate text from the background.
  - **Noise Removal:** Uses median blur to reduce noise.
- **OCR Processing:**
  - We use `pytesseract.image_to_string()` with the specified language code.
  - The custom configuration `--oem 3 --psm 6` selects the default engine mode and tells Tesseract to treat the image as a single uniform block of text.
- **Error Handling:** We catch exceptions to prevent the entire script from stopping if one image fails.
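To make the thresholding step less of a black box, here is a small pure-Python illustration of the idea behind Otsu's method: pick the gray level that maximizes the between-class variance of the two resulting pixel groups. It is a teaching sketch only, not a replacement for `cv2.threshold`, and the function name `otsu_threshold` is hypothetical.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, 0.0
    background_count = 0
    background_sum = 0
    for level, count in enumerate(histogram):
        background_count += count
        if background_count == 0:
            continue  # no pixels at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        background_sum += level * count
        mean_bg = background_sum / background_count
        mean_fg = (weighted_total - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# A toy "image": dark text pixels (10) on a light background (200).
toy_pixels = [10] * 50 + [200] * 50
t = otsu_threshold(toy_pixels)
binary = [255 if p > t else 0 for p in toy_pixels]
print(t, binary.count(0), binary.count(255))
```

Pixels above the threshold become white and the rest black, which is the same convention `cv2.THRESH_BINARY` uses in the cell above.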

## Conclusion

We've successfully set up an OCR application in Google Colab that:

- Allows us to **upload multiple images**.
- Lets us **specify the language** for OCR.
- **Processes each image** to extract and display the text.

You can also modify the preprocessing steps or OCR configuration to improve accuracy for your specific use case.

---

## Additional Notes

- **Installing Language Data:**
  - Ensure that the Tesseract OCR engine has the language data files installed for the languages you want to use.
  - To install additional languages, run:
    ```python
    # Replace '[lang_code]' with the actual language code
    !apt-get install -y tesseract-ocr-[lang_code]
    ```
    For example, to install Spanish:
    ```python
    !apt-get install -y tesseract-ocr-spa
    ```
- **Improving OCR Accuracy:**
  - Experiment with different preprocessing techniques like adaptive thresholding, dilation, erosion, etc.
  - Adjust the OCR configurations (`--oem`, `--psm`) as needed.
- **Processing Non-Image Files:**
  - Ensure that the uploaded files are images. The script may fail if non-image files are uploaded.
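One way to guard against non-image uploads is to look at the first few bytes of each file before handing it to the OCR loop. The sketch below checks a handful of common image signatures using only the standard library. The names `sniff_image_type` and `split_uploads` are hypothetical, and the signature table is deliberately short, so formats not listed here would be rejected.

```python
# Leading "magic bytes" for a few common image formats.
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
    "tiff": (b"II*\x00", b"MM\x00*"),
}


def sniff_image_type(content):
    """Return a format name if the bytes start with a known signature, else None."""
    for kind, prefixes in IMAGE_SIGNATURES.items():
        if content.startswith(prefixes):
            return kind
    return None


def split_uploads(uploaded):
    """Split a {filename: bytes} dict into (images, rejected_filenames)."""
    images, rejected = {}, []
    for filename, content in uploaded.items():
        if sniff_image_type(content) is None:
            rejected.append(filename)
        else:
            images[filename] = content
    return images, rejected


# Stand-in for the dict returned by files.upload(); contents are fake.
fake_uploaded = {
    "scan.png": b"\x89PNG\r\n\x1a\n" + b"\x00" * 8,
    "notes.txt": b"plain text, not an image",
}
images, rejected = split_uploads(fake_uploaded)
print(sorted(images), rejected)
```

In the notebook, the loop could then iterate over `images.items()` and report the `rejected` names to the user.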

# Full Code in One Cell (for Easy Execution)

Below is the complete code for the OCR application. You can run this cell directly after opening your Google Colab notebook.


In [None]:
# Install Required Libraries and Dependencies
!apt-get update
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev

# Install additional language packs if needed
# For example, to install Spanish, uncomment the following line:
# !apt-get install -y tesseract-ocr-spa

# Install Python libraries
!pip install pytesseract
!pip install opencv-python
!pip install Pillow

# Import Necessary Libraries
import cv2
import pytesseract
from google.colab import files
from IPython.display import display
import numpy as np
from PIL import Image
import io

# Select the Language for OCR
common_languages = {
    'English': 'eng',
    'Spanish': 'spa',
    'French': 'fra',
    'German': 'deu',
    'Italian': 'ita',
    'Portuguese': 'por',
    'Russian': 'rus',
    'Chinese Simplified': 'chi_sim',
    'Chinese Traditional': 'chi_tra',
    'Japanese': 'jpn',
    'Korean': 'kor',
    'Hindi': 'hin',
    'Arabic': 'ara',
}

print("Select the language for OCR from the list below:")
for lang in common_languages:
    print(f"- {lang} ({common_languages[lang]})")

lang_code = input("Enter the language code (e.g., 'eng' for English): ").strip()

if not lang_code:
    lang_code = 'eng'
    print("No language code entered. Defaulting to English ('eng').")
else:
    print(f"Using language code: '{lang_code}'")

# Upload Multiple Images
print("Please upload the images for OCR.")
uploaded = files.upload()

if not uploaded:
    print("No files uploaded. Please upload at least one image.")
else:
    print(f"{len(uploaded)} file(s) uploaded successfully.")

# Process Uploaded Images
for filename, content in uploaded.items():
    print(f"\nProcessing '{filename}'...")
    try:
        # Open the image
        image = Image.open(io.BytesIO(content)).convert('RGB')
        img = np.array(image)

        # Display the original image
        print("**Original Image:**")
        display(image)

        # Preprocess the image
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        _, thresh = cv2.threshold(
            gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
        )
        denoised = cv2.medianBlur(thresh, 3)

        # Perform OCR
        custom_config = r'--oem 3 --psm 6'
        text = pytesseract.image_to_string(
            denoised, lang=lang_code, config=custom_config
        )

        # Print the extracted text
        print("**Extracted Text:**")
        print(text)
    except Exception as e:
        print(f"An error occurred while processing '{filename}': {e}")


# Instructions For Execution

1. **Open a New Colab Notebook:**

   - Go to [Google Colab](https://colab.research.google.com/) and create a new notebook.

2. **Copy and Paste the Code:**

   - Copy the entire code block above into a cell in your Colab notebook.

3. **Run the Cell:**

   - Click the "Run" button or press `Shift + Enter` to execute the cell.
   - The notebook will install the required packages and prompt you to select the language and upload images.

4. **Select Language:**

   - When prompted, enter the language code corresponding to the language in your images.
   - If you don't enter a code, it defaults to English (`'eng'`).

5. **Upload Images:**

   - A file dialog will appear. Select one or more image files containing text.

6. **View Results:**

   - The notebook will process each image, display it, and print out the extracted text.

---

# Optional Enhancements

## Display Preprocessed Images

To see how the image looks after preprocessing, add the following lines inside the processing loop, right after `denoised` is computed (indented to match the surrounding `try` block):

```python
processed_image = Image.fromarray(denoised)
print("**Processed Image:**")
display(processed_image)
```

## Install Additional Language Packs

If the language you need is not installed by default, install it using:

```python
# Replace '[lang_code]' with the actual language code
!apt-get install -y tesseract-ocr-[lang_code]
```

For example, to install Polish:

```python
!apt-get install -y tesseract-ocr-pol
```

## Adjust OCR Configurations

You can experiment with different OCR configurations to improve accuracy:

```python
custom_config = r'--oem 3 --psm 6'
```

- **`--oem` (OCR Engine Mode):**
  - `0`: Original Tesseract only.
  - `1`: Neural nets LSTM engine only.
  - `2`: Tesseract with LSTM.
  - `3`: Default, based on what is available.

- **`--psm` (Page Segmentation Mode):**
  - `3`: Fully automatic page segmentation.
  - `6`: Assume a single uniform block of text.
  - `11`: Sparse text.
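When experimenting with these options, a tiny helper can build the config string and catch out-of-range values early. This is a sketch under the assumption that your Tesseract build uses engine modes 0 to 3 and page segmentation modes 0 to 13, as Tesseract 4 and 5 do; `build_tesseract_config` is a hypothetical name.

```python
def build_tesseract_config(oem=3, psm=6):
    """Return a config string such as '--oem 3 --psm 6' after range checks."""
    if oem not in range(4):
        raise ValueError(f"--oem must be between 0 and 3, got {oem!r}")
    if psm not in range(14):
        raise ValueError(f"--psm must be between 0 and 13, got {psm!r}")
    return f"--oem {oem} --psm {psm}"


# Try sparse-text segmentation with the LSTM engine.
custom_config = build_tesseract_config(oem=1, psm=11)
print(custom_config)
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string()` in place of the hard-coded value.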

---

# Troubleshooting

- **No Text Extracted:**
  - Ensure the image quality is good (not blurry or low-resolution).
  - Check that the correct language code is used and the language pack is installed.

- **Errors During OCR:**
  - Verify that Tesseract is correctly installed and accessible.
  - Confirm that the language code is valid and supported.

- **Non-Image Files:**
  - Make sure only image files are uploaded. The script may fail with non-image files.
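For the language-related checks above, listing what Tesseract actually has installed is often quicker than guessing. The sketch below runs `tesseract --list-langs` and parses its output. The helper names are hypothetical, and the parser assumes the output is a header line followed by one language code per line, which may vary between Tesseract versions.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...").
        if not line or line.lower().startswith("list of"):
            continue
        codes.append(line)
    return sorted(codes)


def installed_languages(binary="tesseract"):
    """Return installed language codes, or an empty list if the binary is missing."""
    path = shutil.which(binary)
    if path is None:
        return []
    result = subprocess.run(
        [path, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # older builds print the list to stderr
        text=True,
        check=False,
    )
    return parse_list_langs(result.stdout)


print(installed_languages() or "No languages found; is Tesseract installed?")
```

If the code you entered is missing from the printed list, install the matching `tesseract-ocr-...` package and run the check again.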

---

# Conclusion

You've now adapted the OCR application to work in Google Colab, allowing for multiple image uploads and language selection. This setup is convenient for collaborative work and doesn't require any local installations beyond what is handled in the notebook.

Feel free to experiment with different preprocessing techniques and OCR configurations to optimize performance for your specific needs.

**Happy Coding!**

---

# References

- **Tesseract OCR Documentation:** [GitHub Repository](https://github.com/tesseract-ocr/tesseract)
- **pytesseract Documentation:** [GitHub Repository](https://github.com/madmaze/pytesseract)
- **OpenCV Documentation:** [Official Site](https://opencv.org/)

