
Conversation

@athewsey (Contributor) commented Jan 17, 2023

Issue #, if available: N/A

Description of changes:

This PR introduces trainable sequence-to-sequence models for generative tasks like OCR error correction or field re-formatting (e.g. date normalization).

  • Enable a new 'seq2seq' task type in the training script
  • Add a synthetic-data demo of date format normalization with T5/ByT5 (e.g. the prompt "Convert dates to YYYY-MM-DD: Dec 31 1999" should yield "1999-12-31"; see the sketch after this list)
  • Simplify the output format of the custom (boxes + OCR transcript reviews) Ground Truth task UI, and proof-of-concept training the seq2seq model from SMGT annotations instead of plain source/target text. Prompts are generated from the class name, raw text, and corrected text: e.g. "Normalize Agreement Effective Date: Dec 31 1999" should yield "1999-12-31" (or however labelers have standardised the field)
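
As a rough illustration of the prompting pattern (not the repo's actual training code; the checkpoint name and decoding settings here are assumptions, and an off-the-shelf checkpoint would need fine-tuning on such pairs before the output is reliable):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint for illustration; the demo may use a different variant/size.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

prompt = "Convert dates to YYYY-MM-DD: Dec 31 1999"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# After fine-tuning on (prompt, target) pairs this should print "1999-12-31";
# the raw pre-trained checkpoint will not.
```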

Some important caveats:

  • Since HF LayoutLM-family models don't have a generative decoder stack, this MVP experiments only with plain-text models (e.g. ByT5) and does not pull through raw OCR word bounding boxes / page images. Maybe users could hack together a model by splicing LayoutLM onto a pre-trained generator and fine-tuning? Or perhaps it wouldn't work without more extensive / from-scratch pre-training.
  • If you had a layout-aware/multi-modal model, you'd probably want to prompt it with more of the page context, rather than treating entity extraction and text normalization as two separate tasks.
  • The date normalization demo generates over 1,000 synthetic examples to reach its ~94% accuracy (a minimal sketch of this kind of generation follows this list)... The field normalization seq2seq task seems likely to be quite data-hungry compared to the effort of annotating pages with transcription reviews?
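
For a sense of how cheaply such pairs can be produced, a minimal sketch (the source formats and count here are assumptions, not the demo's exact generator):

```python
import random
from datetime import date, timedelta

# Assumed "messy" source formats; the actual demo covers more variants.
SOURCE_FORMATS = ["%b %d %Y", "%d %B %Y", "%m/%d/%Y", "%B %d, %Y"]

def make_pair(rng: random.Random) -> tuple:
    # Render a random date in a random source format; the target is always
    # the canonical ISO (YYYY-MM-DD) rendering.
    d = date(1970, 1, 1) + timedelta(days=rng.randrange(365 * 60))
    return d.strftime(rng.choice(SOURCE_FORMATS)), d.isoformat()

rng = random.Random(42)
pairs = [make_pair(rng) for _ in range(1000)]
# e.g. ("12/31/1999", "1999-12-31")
```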

Testing done:

Re-verified notebooks on an existing environment (didn't re-build from scratch)


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Fix enumeration of input files when no data channel manifest is provided.

Resolve an issue where mlm get_task() and the TextractLayoutLMDataCollatorForLanguageModelling class would throw errors if the model_param_names argument was missing, even though it was marked as optional. These cases now raise warnings instead.
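
Roughly the behaviour change, as a simplified sketch (the signature and message are illustrative, not the repo's exact code):

```python
import logging

logger = logging.getLogger(__name__)

def get_task(task_name: str, model_param_names=None):
    if model_param_names is None:
        # Previously this raised despite the argument being documented as
        # optional; now it warns and falls back to a default instead.
        logger.warning("model_param_names was not provided; continuing with defaults")
        model_param_names = []
    # ... task construction continues as before ...
```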
Add a text-only seq2seq field normalization model using T5 to normalize date fields to YYYY-MM-DD format. Includes integration into the pipeline (via the post-processing Lambda plus a field configuration SSM parameter) and setup in the Optional Extras.ipynb notebook, but no updates to the README/customization guide yet.

Allow non-fast tokenizers for seq2seq modelling tasks, which is necessary to use ByT5 in seq2seq mode. The previous restriction requiring fast tokenizers matters for the ner and mlm tasks, where custom data collation expects a particular API, but not for seq2seq.
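
ByT5 ships only a "slow" tokenizer (it has no Rust-backed fast implementation), so a blanket fast-tokenizer requirement would reject it. Something along these lines (illustrative, not the repo's exact check) scopes the requirement to the tasks that need it:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")  # a slow tokenizer

task = "seq2seq"  # would come from the training job's hyperparameters
if task in ("ner", "mlm") and not isinstance(tokenizer, PreTrainedTokenizerFast):
    # Only these tasks' custom data collators depend on fast-tokenizer APIs
    # (e.g. word_ids()); seq2seq works fine with slow tokenizers.
    raise ValueError(f"Task '{task}' requires a fast tokenizer")
```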
Re-weight the sample date re-formatting task from UK date formats towards US (month-first) formats, for better consistency with the CFPB credit card sample documents.

Enable and streamline creating an SMGT labelling job with the custom (transcription reviews) template from the data preparation notebook. With the addition of the seq2seq model, more users are likely to be interested in collecting text normalizations.

Fix date normalization generators that were not adding ordinal suffixes to day numbers in some formats as intended: they were just decorating the format name without changing the actual format string.
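
The gist of the fix, sketched below (the format strings are hypothetical examples, not the generator's actual ones):

```python
from datetime import date

def ordinal(day: int) -> str:
    # 1 -> "1st", 2 -> "2nd", 3 -> "3rd"; 11-13 are special-cased to "th"
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"

# The bug: "ordinal" format variants only changed the variant's name, so
# strftime still printed a plain day number. The fix renders the day through
# ordinal() instead, e.g.:
d = date(1999, 12, 31)
print(d.strftime("%B {day} %Y").format(day=ordinal(d.day)))  # December 31st 1999
```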
Explicitly set the Python logging level in the pre-/post-processing SMGT Lambda functions to ensure (only) the expected messages are generated. Fixes an issue where .info() calls were not coming through from the post-processing Lambda by default.
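
The usual pattern here (a sketch; the handler and message are illustrative): the Lambda Python runtime pre-configures a handler on the root logger, so logging.basicConfig() has no effect and the level must be set explicitly for .info() messages to reach CloudWatch:

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # without this, .info() is filtered out by default

def handler(event, context):
    logger.info("Received event: %s", event)
    # ... pre/post-processing logic ...
    return event
```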
Simplify and improve the output format of labelling jobs using the custom (bounding boxes + OCR transcript reviews) task UI. The new template avoids duplicating the bounding boxes and recording Textract word IDs (which are very verbose), and makes it easier to pull out the source/OCR text vs the target/corrected text for each field (useful for seq2seq).
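
The field names below are invented for illustration (check the template itself for the real schema), but the shape of the change is: one bounding box per field, no verbose Textract word-ID lists, and source vs corrected text side by side:

```python
# Hypothetical simplified output record for one reviewed field:
example_field = {
    "label": "Agreement Effective Date",
    "ocr_text": "Dec 31 1999",        # source text as transcribed by OCR
    "corrected_text": "1999-12-31",   # target text as corrected by the labeler
    "bbox": {"top": 0.12, "left": 0.40, "height": 0.03, "width": 0.15},
}
```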
@athewsey changed the title Plain → Plain-text seq2seq models (Jan 17, 2023)

Update the training script data loaders to support training the seq2seq model (still plain-text) from SMGT custom task UI entity OCR validation results. Update the seq2seq section of the Extras notebook to match the setup used in the demo, and mention the SMGT data usage option in that notebook.
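
With the simplified record shape sketched earlier, the loading logic boils down to mapping each reviewed field to a (source, target) pair; a sketch, reusing those hypothetical field names:

```python
field = {
    "label": "Agreement Effective Date",
    "ocr_text": "Dec 31 1999",
    "corrected_text": "1999-12-31",
}

def field_to_seq2seq_example(field: dict) -> dict:
    # Prompt is built from the class name plus the raw OCR text; the target
    # is the labeler-corrected text.
    return {
        "source": f"Normalize {field['label']}: {field['ocr_text']}",
        "target": field["corrected_text"],
    }

print(field_to_seq2seq_example(field))
# {'source': 'Normalize Agreement Effective Date: Dec 31 1999',
#  'target': '1999-12-31'}
```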
Update the README and CUSTOMIZATION_GUIDE to mention the new seq2seq entity text normalization training option.