
Conversation

@athewsey (Contributor) commented Jan 17, 2023

Issue #, if available: N/A

Description of changes:

This PR introduces trainable sequence-to-sequence models for generative tasks like OCR error correction or field re-formatting (e.g. date normalization).

  • Enable a new 'seq2seq' task type in the training script
  • Add a synthetic-data demo of date format normalization with T5/ByT5 (e.g. the prompt "Convert dates to YYYY-MM-DD: Dec 31 1999" should yield "1999-12-31"; see the sketch after this list)
  • Simplify the output format of the custom (boxes + OCR transcript reviews) Ground Truth task UI, and proof-of-concept training the seq2seq model from SMGT annotations instead of plain source/target text. Prompts are generated from the class name, raw text, and corrected text: e.g. "Normalize Agreement Effective Date: Dec 31 1999" should yield "1999-12-31" (or however labelers have standardised the field)
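
As a rough illustration of the prompting pattern (not the repo's actual training code; the checkpoint name and decoding settings here are assumptions, and an off-the-shelf checkpoint would need fine-tuning on such pairs before the output is reliable):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint for illustration; the demo may use a different variant/size.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

prompt = "Convert dates to YYYY-MM-DD: Dec 31 1999"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# After fine-tuning on (prompt, target) pairs this should print "1999-12-31";
# the raw pre-trained checkpoint will not.
```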

Some important caveats:

  • Since HF LayoutLM-family models don't have a generative decoder stack, this MVP experiments only with plain-text models (e.g. ByT5) and does not pull through raw OCR word bounding boxes / page images. Maybe users could hack together a model by splicing LayoutLM onto a pre-trained generator and fine-tuning? Or perhaps it wouldn't work without more extensive / from-scratch pre-training.
  • If you had a layout-aware/multi-modal model, you'd probably want to prompt it with more of the page context, rather than treating entity extraction and text normalization as two separate tasks.
  • The date normalization demo generates over 1,000 synthetic examples to reach its ~94% accuracy (a minimal sketch of this kind of generation follows this list)... The field normalization seq2seq task seems likely to be quite data-hungry compared to the effort of annotating pages with transcription reviews?
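
For a sense of how cheaply such pairs can be produced, a minimal sketch (the source formats and count here are assumptions, not the demo's exact generator):

```python
import random
from datetime import date, timedelta

# Assumed "messy" source formats; the actual demo covers more variants.
SOURCE_FORMATS = ["%b %d %Y", "%d %B %Y", "%m/%d/%Y", "%B %d, %Y"]

def make_pair(rng: random.Random) -> tuple:
    # Render a random date in a random source format; the target is always
    # the canonical ISO (YYYY-MM-DD) rendering.
    d = date(1970, 1, 1) + timedelta(days=rng.randrange(365 * 60))
    return d.strftime(rng.choice(SOURCE_FORMATS)), d.isoformat()

rng = random.Random(42)
pairs = [make_pair(rng) for _ in range(1000)]
# e.g. ("12/31/1999", "1999-12-31")
```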

Testing done:

Re-verified notebooks on an existing environment (didn't re-build from scratch)


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Fix enumeration of input files when no data channel manifest is provided.

Resolve an issue where mlm get_task() and the TextractLayoutLMDataCollatorForLanguageModelling class would throw errors if the model_param_names argument was missing, even though it was marked as optional. These cases now raise warnings instead.
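
Roughly the behaviour change, as a simplified sketch (the signature and message are illustrative, not the repo's exact code):

```python
import logging

logger = logging.getLogger(__name__)

def get_task(task_name: str, model_param_names=None):
    if model_param_names is None:
        # Previously this raised despite the argument being documented as
        # optional; now it warns and falls back to a default instead.
        logger.warning("model_param_names was not provided; continuing with defaults")
        model_param_names = []
    # ... task construction continues as before ...
```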
Add a text-only seq2seq field normalization model using T5 to normalize date fields to YYYY-MM-DD format. Includes integration into the pipeline (via the post-processing Lambda plus a field configuration SSM parameter) and setup in the Optional Extras.ipynb notebook, but no updates to the README/customization guide yet.

Allow non-fast tokenizers for seq2seq modelling tasks, which is necessary to use ByT5 in seq2seq mode. The previous restriction requiring fast tokenizers matters for the ner and mlm tasks, where custom data collation expects a particular API, but not for seq2seq.
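
ByT5 ships only a "slow" tokenizer (it has no Rust-backed fast implementation), so a blanket fast-tokenizer requirement would reject it. Something along these lines (illustrative, not the repo's exact check) scopes the requirement to the tasks that need it:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")  # a slow tokenizer

task = "seq2seq"  # would come from the training job's hyperparameters
if task in ("ner", "mlm") and not isinstance(tokenizer, PreTrainedTokenizerFast):
    # Only these tasks' custom data collators depend on fast-tokenizer APIs
    # (e.g. word_ids()); seq2seq works fine with slow tokenizers.
    raise ValueError(f"Task '{task}' requires a fast tokenizer")
```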
Re-weight the sample date re-formatting task from UK date formats towards US (month-first) formats, for better consistency with the CFPB credit card sample documents.

Enable and streamline creating an SMGT labelling job with the custom (transcription reviews) template from the data preparation notebook. With the addition of the seq2seq model, more users are likely to be interested in collecting text normalizations.

Fix date normalization generators that were not adding ordinal suffixes to day numbers in some formats as intended: they were just decorating the format name without changing the actual format string.
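
The gist of the fix, sketched below (the format strings are hypothetical examples, not the generator's actual ones):

```python
from datetime import date

def ordinal(day: int) -> str:
    # 1 -> "1st", 2 -> "2nd", 3 -> "3rd"; 11-13 are special-cased to "th"
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"

# The bug: "ordinal" format variants only changed the variant's name, so
# strftime still printed a plain day number. The fix renders the day through
# ordinal() instead, e.g.:
d = date(1999, 12, 31)
print(d.strftime("%B {day} %Y").format(day=ordinal(d.day)))  # December 31st 1999
```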
Explicitly set the Python logging level in the pre-/post-processing SMGT Lambda functions to ensure (only) the expected messages are generated. Fixes an issue where .info() calls were not coming through from the post-processing Lambda by default.
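
The usual pattern here (a sketch; the handler and message are illustrative): the Lambda Python runtime pre-configures a handler on the root logger, so logging.basicConfig() has no effect and the level must be set explicitly for .info() messages to reach CloudWatch:

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # without this, .info() is filtered out by default

def handler(event, context):
    logger.info("Received event: %s", event)
    # ... pre/post-processing logic ...
    return event
```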
Simplify and improve the output format of labelling jobs using the custom (bounding boxes + OCR transcript reviews) task UI. The new template avoids duplicating the bounding boxes and recording Textract word IDs (which are very verbose), and makes it easier to pull out the source/OCR text vs the target/corrected text for each field (useful for seq2seq).
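
The field names below are invented for illustration (check the template itself for the real schema), but the shape of the change is: one bounding box per field, no verbose Textract word-ID lists, and source vs corrected text side by side:

```python
# Hypothetical simplified output record for one reviewed field:
example_field = {
    "label": "Agreement Effective Date",
    "ocr_text": "Dec 31 1999",        # source text as transcribed by OCR
    "corrected_text": "1999-12-31",   # target text as corrected by the labeler
    "bbox": {"top": 0.12, "left": 0.40, "height": 0.03, "width": 0.15},
}
```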
@athewsey changed the title Plain → Plain-text seq2seq models (Jan 17, 2023)

Update the training script data loaders to support training the seq2seq model (still plain-text) from SMGT custom task UI entity OCR validation results. Update the seq2seq section of the Extras notebook to match the setup used in the demo, and mention the SMGT data usage option in that notebook.
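
With the simplified record shape sketched earlier, the loading logic boils down to mapping each reviewed field to a (source, target) pair; a sketch, reusing those hypothetical field names:

```python
field = {
    "label": "Agreement Effective Date",
    "ocr_text": "Dec 31 1999",
    "corrected_text": "1999-12-31",
}

def field_to_seq2seq_example(field: dict) -> dict:
    # Prompt is built from the class name plus the raw OCR text; the target
    # is the labeler-corrected text.
    return {
        "source": f"Normalize {field['label']}: {field['ocr_text']}",
        "target": field["corrected_text"],
    }

print(field_to_seq2seq_example(field))
# {'source': 'Normalize Agreement Effective Date: Dec 31 1999',
#  'target': '1999-12-31'}
```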
Update the README and CUSTOMIZATION_GUIDE to mention the new seq2seq entity text normalization training option.