# Generating OCR Synthetic Data

We'd like to generate some synthetic OCR data. The plan is to:

1. Load a relevant text-based dataset.
2. Generate the OCR data.
3. Test this on an OCR model.
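
Before any images can be rendered for step 2, the source text has to be cut into short strings that fit on one rendered line. The following is only a minimal sketch of that preparation step using the standard library. The helper name `to_label_lines` and the default width of 60 characters are assumptions made for illustration and do not come from this notebook.

```python
import textwrap


def to_label_lines(document, width=60):
    """Break a document into short, non-empty lines usable as OCR labels."""
    lines = []
    for paragraph in document.splitlines():
        # textwrap.wrap returns [] for a blank paragraph, so empty lines are skipped.
        lines.extend(textwrap.wrap(paragraph, width=width))
    return lines


# Sample text taken from the opening of the English reference document.
sample = (
    "COMMISSION DECISION\nof 6 March 2006\n\n"
    "establishing the classes of reaction-to-fire performance"
)
for line in to_label_lines(sample, width=30):
    print(repr(line))
```

Each returned string could then be handed to whichever generator renders the images, such as the `trdg` package imported below, with the string itself kept as the ground-truth label.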

In [5]:
import trdg
import datasets
from transformers import pipeline
import torch

In [29]:
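# French-to-English translation pipeline. No `device` argument is passed,
# so the model runs on CPU (see the warning below).
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
pipe = pipeline(
    "translation", model="ybanas/autotrain-fr-en-translate-51410121895", max_length=1200
)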

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [15]:
dataset = datasets.load_dataset(
    "multi_eurlex",
    "all_languages",
    label_level="level_3",
    trust_remote_code=True,
)

multi_eurlex.tar.gz:   0%|          | 0.00/2.77G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/55000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [20]:
# French source text and its English reference for the first training example.
task = dataset["train"][0]["text"]["fr"]
target = dataset["train"][0]["text"]["en"]

In [30]:
pipe(task)

[{'translation_text': 'COMMISSION Decision of 6 March 2006 establishing the classification of fire-reaction characteristics of certain construction products for the use of wood floors and masonry walls as a result of the Decision of the Commission of 21 December 1988 concerning the implementation of the Decision of the Council of 21 December 1988 on the classification of fire-reaction characteristics of construction (1), and especially the Regulation 20(2), which establishes a system of classifications for each of the essential requirements of the Decision of the Commission of 21 December 1988, which establishes a system of classifications for certain products and/or materials for the purpose of making a classification of fire-reaction characteristics of construction products, which is referred to in the Appendix.'}]

In [33]:
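def split_inputs(text, split_key):
    """Split text on split_key and re-attach the key to each piece."""
    # Skip blank pieces so a trailing split_key does not yield a lone "." input.
    split_rows = [row for row in text.split(split_key) if row.strip()]
    recovered_splits = [split + split_key for split in split_rows]
    return recovered_splits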
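
Splitting on every `.` also cuts inside decimal numbers and abbreviations. One possible alternative, shown here only as a sketch, is to split after a period that is followed by whitespace. The helper name `split_sentences` is hypothetical and is not used elsewhere in this notebook.

```python
import re


def split_sentences(text):
    """Split after any period that is followed by whitespace, keeping the period."""
    # The lookbehind keeps the period attached to the preceding sentence.
    pieces = re.split(r"(?<=\.)\s+", text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [piece for piece in pieces if piece]


print(split_sentences("Class 1.5 applies. See Annex."))
# → ['Class 1.5 applies.', 'See Annex.']
```

This still breaks after abbreviations that are followed by a space, so it is a heuristic rather than a full sentence segmenter.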

In [35]:
tasks = split_inputs(task, ".")

In [39]:
arr = [pipe(t) for t in tasks]

In [40]:
arr

[[{'translation_text': 'COMMISSION Decision of 6 March 2006 establishing the classification of fire-reaction characteristics of certain construction products for wood floors and mass-wood roofing [notified under the number C(2006) 655] (Text of Relevance for the EU) (2006/213/CE) THE COMMISSION OF THE COMPETITIVE UNIONS, having read the Treaty and taking note that for each essential requirement, the classification of products may be established in the interpretative document.'}],
 [{'translation_text': 'The following documents were published as a "communication of the Commission on the interpretative documents of the GATT directive (2)".'}],
 [{'translation_text': '(2) For the essential requirement for fire safety, the Interpretation Document 2 outlines a list of interdependent measures that together establish the appropriate fire safety strategy for each member state.'}],
 [{'translation_text': '(3) One of the measures identified in the interpretation document 2 is to limit the appear

In [51]:
output = [thing[0]["translation_text"] for thing in arr]

In [52]:
output = "\n".join(output)

In [53]:
print(output)

COMMISSION Decision of 6 March 2006 establishing the classification of fire-reaction characteristics of certain construction products for wood floors and mass-wood roofing [notified under the number C(2006) 655] (Text of Relevance for the EU) (2006/213/CE) THE COMMISSION OF THE COMPETITIVE UNIONS, having read the Treaty and taking note that for each essential requirement, the classification of products may be established in the interpretative document.
The following documents were published as a "communication of the Commission on the interpretative documents of the GATT directive (2)".
(2) For the essential requirement for fire safety, the Interpretation Document 2 outlines a list of interdependent measures that together establish the appropriate fire safety strategy for each member state.
(3) One of the measures identified in the interpretation document 2 is to limit the appearance and spread of fire and fume in a specific area by limiting the possible contribution of construction pr

In [54]:
target

'COMMISSION DECISION\nof 6 March 2006\nestablishing the classes of reaction-to-fire performance for certain construction products as regards wood flooring and solid wood panelling and cladding\n(notified under document number C(2006) 655)\n(Text with EEA relevance)\n(2006/213/EC)\nTHE COMMISSION OF THE EUROPEAN COMMUNITIES,\nHaving regard to the Treaty establishing the European Community,\nHaving regard to Directive 89/106/EEC of 21 December 1988, on the approximation of laws, regulations and administrative provisions of the Member States relating to construction products (1), and in particular Article 20(2) thereof,\nWhereas:\n(1)\nDirective 89/106/EEC envisages that in order to take account of different levels of protection for construction works at national, regional or local level, it may be necessary to establish in the interpretative documents classes corresponding to the performance of products in respect of each essential requirement. Those documents have been published as the 
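
Comparing `output` with `target` by eye only goes so far. As a rough sketch, a character-level similarity ratio from the standard library can put a number on the gap. The helper name `similarity` and the normalisation used (lowercasing and collapsing whitespace) are assumptions for illustration. This is not an established translation or OCR metric.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Return a ratio in [0, 1] after lowercasing and collapsing whitespace."""

    def normalise(text):
        # Newlines and repeated spaces differ between output and target,
        # so fold all whitespace runs into single spaces before comparing.
        return " ".join(text.lower().split())

    matcher = SequenceMatcher(None, normalise(candidate), normalise(reference))
    return matcher.ratio()


# Illustrative strings: the openings of the translated output and the reference.
print(similarity("COMMISSION Decision of 6 March 2006",
                 "COMMISSION DECISION\nof 6 March 2006"))
```

Calling `similarity(output, target)` on the full strings would give a single score for the whole document. A per-sentence comparison would need the two texts aligned first.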