# OCR Baseline for License Plate Recognition (PaddleOCR)

This notebook implements the OCR baseline using **PaddleOCR**.

**Steps:**
1. **Setup**: Install PaddlePaddle and PaddleOCR.
2. **Data Preparation**: Convert the dataset to PaddleOCR's recognition label format.
3. **Inference (Zero-shot)**: Test the pre-trained English model.
4. **Training (Fine-tuning)**: Fine-tune a recognition model on the dataset.
5. **Evaluation**: Evaluate the best fine-tuned checkpoint.

## 1. Setup

In [None]:
# Install PaddlePaddle (GPU version recommended if available)
# Check https://www.paddlepaddle.org.cn/en/install/quick for specific version
# !python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
# !pip install paddleocr

import paddle
print(f"PaddlePaddle Version: {paddle.__version__}")
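# is_compiled_with_cuda() reports whether this build supports CUDA, not whether a GPU is present
print(f"Compiled with CUDA: {paddle.device.is_compiled_with_cuda()}")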

In [None]:
import os
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from paddleocr import PaddleOCR

# Define Paths
DATASET_DIR = "../datasets/IndonesianLiscenePlateDataset/plate_text_dataset"
IMAGES_DIR = os.path.join(DATASET_DIR, "dataset")
LABEL_FILE = os.path.join(DATASET_DIR, "label.csv")

# Output for PaddleOCR training format
PADDLE_DATA_DIR = "../datasets/paddle_ocr_dataset"
os.makedirs(PADDLE_DATA_DIR, exist_ok=True)

## 2. Data Preparation

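PaddleOCR's recognition trainer expects a label file in which each line holds an image path and its text label separated by a tab: `path/to/image.jpg\tlabel`. Image paths are resolved relative to the `data_dir` set in the training config.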

In [None]:
# Load Labels
df = pd.read_csv(LABEL_FILE)
print(f"Total samples: {len(df)}")

# Split Train/Val
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Train: {len(train_df)}, Val: {len(val_df)}")

def create_paddle_label_file(dataframe, output_path, relative_base_dir):
    """Write a PaddleOCR label file with image paths relative to relative_base_dir."""
    written = 0
    with open(output_path, 'w', encoding='utf-8') as f:
        for _, row in dataframe.iterrows():
            filename = row['filename']
            label = row['label']

            # Skip rows whose source image is missing
            src_path = os.path.join(IMAGES_DIR, filename)
            if not os.path.exists(src_path):
                print(f"Warning: Image not found {src_path}")
                continue

            # Point to the existing images instead of copying them.
            # The path is relative to relative_base_dir, which must match
            # `data_dir` in the training config, e.g.
            # ../IndonesianLiscenePlateDataset/plate_text_dataset/dataset/<filename>
            rel_path = os.path.relpath(src_path, relative_base_dir)
            # PaddleOCR uses / as separator
            rel_path = rel_path.replace("\\", "/")

            f.write(f"{rel_path}\t{label}\n")
            written += 1

    print(f"Created {output_path} ({written} samples)")

create_paddle_label_file(train_df, os.path.join(PADDLE_DATA_DIR, "rec_gt_train.txt"), PADDLE_DATA_DIR)
create_paddle_label_file(val_df, os.path.join(PADDLE_DATA_DIR, "rec_gt_val.txt"), PADDLE_DATA_DIR)
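
Before training, it can help to read the label files back and confirm that every line has the `path<TAB>label` shape and that each image resolves from the data directory. The helper below is a minimal sketch; `check_label_file` is a hypothetical name, not part of PaddleOCR.

```python
import os

def check_label_file(label_path, data_dir):
    """Return (n_ok, problems) for a tab-separated recognition label file.

    problems is a list of (line_number, message) tuples.
    """
    n_ok, problems = 0, []
    with open(label_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            parts = line.split("\t")
            if len(parts) != 2 or not parts[1]:
                problems.append((line_no, "expected '<path>\\t<label>'"))
                continue
            # Image paths are interpreted relative to data_dir
            if not os.path.exists(os.path.join(data_dir, parts[0])):
                problems.append((line_no, f"missing image: {parts[0]}"))
                continue
            n_ok += 1
    return n_ok, problems
```

For example, `check_label_file(os.path.join(PADDLE_DATA_DIR, "rec_gt_train.txt"), PADDLE_DATA_DIR)` should return an empty `problems` list if the label file is consistent with the images on disk.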

## 3. Zero-shot Inference (Pre-trained Model)

Before fine-tuning, check how the pre-trained English recognition model performs on Indonesian plates without any additional training.

In [None]:
# Initialize PaddleOCR
try:
    ocr = PaddleOCR(use_angle_cls=False, lang='en', use_gpu=True) 
    print("PaddleOCR initialized with GPU.")
except OSError as e:
    print(f"GPU Initialization failed: {e}")
    print("Falling back to CPU. Please check your PaddlePaddle and CUDA version compatibility.")
    ocr = PaddleOCR(use_angle_cls=False, lang='en', use_gpu=False)

def predict_sample(image_path):
    """Run recognition only (no detection) and return (text, confidence)."""
    result = ocr.ocr(image_path, cls=False, det=False)
    # With det=False the result is a list of lists, e.g. [[('text', score)]]
    if result and result[0]:
        text, score = result[0][0]
        return text, float(score)
    return None, 0.0

# Test on a few validation samples
sample_val = val_df.sample(5)

for _, row in sample_val.iterrows():
    img_path = os.path.join(IMAGES_DIR, row['filename'])
    gt_label = row['label']
    
    pred_label, score = predict_sample(img_path)
    
    img = cv2.imread(img_path)
    if img is None:
        print(f"Warning: Could not read {img_path}")
        continue
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.title(f"GT: {gt_label} | Pred: {pred_label} ({score:.2f})")
    plt.axis('off')
    plt.show()
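
Looking at five samples gives only a rough impression. To put a number on the zero-shot baseline, one option is to score predictions over the whole validation split with exact-match accuracy and character error rate (CER). The sketch below is an assumption about how to score, not something PaddleOCR provides: `normalize` (uppercasing and dropping whitespace) and `score_predictions` are hypothetical helpers, and whether spaces should count depends on how the labels were written.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalize(text):
    """Uppercase and drop whitespace so 'b 1234 xyz' matches 'B1234XYZ'."""
    return "".join(str(text).upper().split())

def score_predictions(pairs):
    """Score (ground_truth, prediction) pairs; a prediction may be None."""
    n = exact = total_edits = total_chars = 0
    for gt, pred in pairs:
        gt_n, pred_n = normalize(gt), normalize(pred or "")
        n += 1
        exact += gt_n == pred_n
        total_edits += edit_distance(gt_n, pred_n)
        total_chars += len(gt_n)
    return {
        "accuracy": exact / n if n else 0.0,
        "cer": total_edits / total_chars if total_chars else 0.0,
    }
```

With the notebook's variables, the pairs could be built as `[(row['label'], predict_sample(os.path.join(IMAGES_DIR, row['filename']))[0]) for _, row in val_df.iterrows()]` and passed to `score_predictions`.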

## 4. Fine-tuning PaddleOCR

To fine-tune, clone the PaddleOCR repository and use its training script (`tools/train.py`).
The shell commands below (`git`, `wget`, `tar`) assume a Linux or Git Bash environment; on Windows they may need adjustment.

In [None]:
# Clone PaddleOCR repo if not exists
if not os.path.exists("PaddleOCR"):
    !git clone https://github.com/PaddlePaddle/PaddleOCR.git
else:
    print("PaddleOCR repo already cloned.")

In [None]:
# Install requirements
!pip install -r PaddleOCR/requirements.txt

### Configuration

We need a YAML config file for training, based on `en_PP-OCRv3_rec.yml`.
Key changes:
- `data_dir`: Root directory against which image paths in the label files are resolved (set for both training and evaluation data)
- `label_file_list`: Paths to our train/val label files
- `character_dict_path`: Path to the character dictionary (use `en_dict.txt` or create a custom one)
- `Global.pretrained_model`: Path to the downloaded pre-trained weights
- `Global.epoch_num`: Number of epochs
- `Global.save_model_dir`: Output directory
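
If a custom character dictionary is preferred over `en_dict.txt`, it can be derived from the labels themselves. The sketch below writes one character per line, which is the layout PaddleOCR dictionary files use. `build_char_dict` and the output file name are hypothetical. Whitespace is left out on the assumption that spaces are handled by the `use_space_char` setting rather than by a dictionary entry. Note that fine-tuning with a different dictionary changes the size of the output layer, so the pre-trained head may not load cleanly.

```python
def build_char_dict(labels, output_path):
    """Write the sorted set of non-whitespace characters in labels, one per line."""
    chars = sorted({ch for label in labels for ch in str(label) if not ch.isspace()})
    with open(output_path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(ch + "\n")
    return chars
```

A call such as `build_char_dict(df['label'], os.path.join(PADDLE_DATA_DIR, "plate_dict.txt"))` would produce a dictionary covering exactly the characters that occur in the dataset.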

In [None]:
# Download pre-trained weights for fine-tuning
# English PP-OCRv3 Recognition Model (extracts to ./en_PP-OCRv3_rec_train/)
if not os.path.exists("en_PP-OCRv3_rec_train"):
    !wget -nc https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_train.tar
    !tar -xf en_PP-OCRv3_rec_train.tar

In [None]:
# Create a custom config file
# We will read the default config and modify it
import re

base_config_path = "PaddleOCR/configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml"
custom_config_path = "PaddleOCR/configs/rec/PP-OCRv3/lpr_finetune.yml"

if os.path.exists(base_config_path):
    with open(base_config_path, 'r') as f:
        config_content = f.read()

    # Use absolute paths so the config does not depend on the working directory
    abs_paddle_data_dir = os.path.abspath(PADDLE_DATA_DIR).replace("\\", "/")
    # The download cell extracts the weights next to this notebook
    abs_pretrained = os.path.abspath("en_PP-OCRv3_rec_train/best_accuracy").replace("\\", "/")

    # Simple string replacement for the demo (a YAML parser is more robust).
    # The patterns assume the base config's default paths, so warn on a miss.
    path_replacements = [
        ("data_dir: ./train_data", f"data_dir: {abs_paddle_data_dir}"),
        ("./train_data/train_list.txt", f"{abs_paddle_data_dir}/rec_gt_train.txt"),
        ("./train_data/val_list.txt", f"{abs_paddle_data_dir}/rec_gt_val.txt"),
    ]
    for old, new in path_replacements:
        if old not in config_content:
            print(f"Warning: '{old}' not found in base config.")
        config_content = config_content.replace(old, new)

    # Scalar settings: overwrite whatever value the key currently holds
    scalar_settings = {
        "epoch_num": "50",
        "save_model_dir": "../results/ocr/lpr_finetune",
        "pretrained_model": abs_pretrained,
    }
    for key, value in scalar_settings.items():
        config_content, n_subs = re.subn(rf"\b{key}:[^\n]*", lambda m: f"{key}: {value}", config_content, count=1)
        if n_subs == 0:
            print(f"Warning: '{key}:' not found in base config.")
    
    with open(custom_config_path, 'w') as f:
        f.write(config_content)
        
    print(f"Created custom config at {custom_config_path}")
else:
    print(f"Base config not found at {base_config_path}. Please check PaddleOCR structure.")
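
As the comment in the cell above notes, a YAML parser is a sturdier way to edit the config than string replacement. The sketch below loads the file into nested dictionaries and sets values by dotted key, failing loudly when a key is absent. `set_by_path` and `rewrite_config` are hypothetical helpers. The dotted keys you pass (for instance `Global.epoch_num` or `Train.dataset.label_file_list`) are assumptions about the base config's layout and should be checked against the actual file. PyYAML is assumed to be installed, and rewriting the file this way drops its comments.

```python
def set_by_path(config, dotted_key, value):
    """Set config['A']['b']['c'] = value for dotted_key 'A.b.c'.

    Raises KeyError if any part of the path is missing, so a typo or an
    unexpected config layout is reported instead of silently adding a key.
    """
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]
    if leaf not in node:
        raise KeyError(f"{dotted_key} not found in config")
    node[leaf] = value

def rewrite_config(base_path, out_path, overrides):
    """Load a YAML config, apply {dotted_key: value} overrides, write it out."""
    import yaml  # PyYAML, imported lazily so set_by_path works without it

    with open(base_path, encoding="utf-8") as f:
        config = yaml.safe_load(f)
    for dotted_key, value in overrides.items():
        set_by_path(config, dotted_key, value)
    with open(out_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(config, f, sort_keys=False, allow_unicode=True)
    return config
```

A call might look like `rewrite_config(base_config_path, custom_config_path, {"Global.epoch_num": 50, "Global.save_model_dir": "../results/ocr/lpr_finetune"})`, with the dataset paths added in the same way once their keys have been confirmed.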

In [None]:
# Run Training
# Note: This might take a while. You can monitor the output.
# Relative paths in the config (e.g. character_dict_path, save_model_dir) are resolved
# against the current working directory, so check them before launching from here.

print("Starting training... (Uncomment line below to run)")
# !python PaddleOCR/tools/train.py -c PaddleOCR/configs/rec/PP-OCRv3/lpr_finetune.yml

## 5. Evaluation

After training, evaluate the model on the validation set using the best checkpoint (`best_accuracy`) saved under `Global.save_model_dir`.

In [None]:
# Global.checkpoints must point inside the save_model_dir set in the custom config
# !python PaddleOCR/tools/eval.py -c PaddleOCR/configs/rec/PP-OCRv3/lpr_finetune.yml -o Global.checkpoints=../results/ocr/lpr_finetune/best_accuracy