feat: Support Document AI in WasmEdge #2356

Open · 3 of 6 tasks
sarrah-basta opened this issue Mar 12, 2023 · 12 comments

@sarrah-basta (Contributor) commented Mar 12, 2023

Motivation

The Hugging Face Hub is a platform hosting a collection of pre-trained models, datasets, and demos of machine learning projects. Their blog post gives a concise overview of the state-of-the-art models available for Document AI, which spans several data science tasks: Optical Character Recognition (OCR), Document Image Classification, Document Layout Analysis, Document Parsing, and Document Visual Question Answering.
WasmEdge would like to enable easy integration of these Document AI tasks into WasmEdge applications by creating the necessary pre- and post-processing functions in Rust and using the fine-tuned models available on the Hugging Face Model Hub.

Details

Document AI tasks use multimodal models, i.e., models that unify document text (from OCR), layout (from tokens), and visual information (from spatial information in the image) in a single end-to-end framework that can learn cross-modal interactions. Each Document AI task has a description page that describes its expected output and the datasets for the task. The corresponding models fine-tuned on these datasets are available in PyTorch format, which is supported by the WASI-NN plugin.

This project aims to

  1. Make a set of generalized pre-processing and post-processing functions that can be used across all Document AI tasks (such as tokenizers and feature extractors), and
  2. Create inference functions for each Document AI task using the said pre- and post-processing functions and fine-tuned models from the hub in the backend. Each library function takes in a media object and returns the inference result. The inference function performs the following tasks (see the sketch after this list).
  • Takes in an input media object (e.g., a byte array for a JPEG image) and processes it into a tensor of extracted features (e.g., segment token indices, bounding boxes, and the image).
  • Uses WASI-NN to run inference with the input tensor on the fine-tuned PyTorch model.
  • Collects and interprets the result tensor, returning the interpreted results in the form expected by the task (e.g., a struct containing a class name for document image classification, or a vector of structs containing coordinates of detected tokens for a document layout analysis task).
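
As a rough illustration of these three steps, here is a minimal sketch of what one such inference function could look like with the wasi-nn Rust bindings, shaped after the WasmEdge PyTorch examples. The model path, the tensor shape, and the ClassResult type are placeholder assumptions for a document image classification task, not a committed API.

```rust
use std::fs;

/// Hypothetical result type for document image classification.
pub struct ClassResult {
    pub class_index: usize,
    pub score: f32,
}

/// Sketch: run a traced PyTorch classifier on preprocessed image features.
/// `tensor_data` is assumed to already hold the extracted features as
/// raw f32 bytes in NCHW order.
pub fn classify_document_image(tensor_data: &[u8]) -> ClassResult {
    // 1. Load the fine-tuned TorchScript model (path is a placeholder).
    let weights = fs::read("fixture/model.pt").unwrap();
    let graph = unsafe {
        wasi_nn::load(
            &[&weights],
            wasi_nn::GRAPH_ENCODING_PYTORCH,
            wasi_nn::EXECUTION_TARGET_CPU,
        )
        .unwrap()
    };
    let ctx = unsafe { wasi_nn::init_execution_context(graph).unwrap() };

    // 2. Bind the input tensor; the 224x224 RGB shape is an assumption.
    let tensor = wasi_nn::Tensor {
        dimensions: &[1, 3, 224, 224],
        type_: wasi_nn::TENSOR_TYPE_F32,
        data: tensor_data,
    };
    unsafe {
        wasi_nn::set_input(ctx, 0, tensor).unwrap();
        wasi_nn::compute(ctx).unwrap();
    }

    // 3. Collect and interpret the result tensor (16 classes in RVL-CDIP).
    let mut scores = vec![0f32; 16];
    unsafe {
        wasi_nn::get_output(
            ctx,
            0,
            scores.as_mut_ptr() as *mut u8,
            (scores.len() * std::mem::size_of::<f32>()) as u32,
        )
        .unwrap();
    }
    let (class_index, score) = scores
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap();
    ClassResult { class_index, score }
}
```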

Milestones

  • Create a list of models to be used along with expected inputs, outputs and necessary pre- and post-processing functions.
  • Import existing APIs such as the Hugging Face tokenizers library and Tesseract OCR into Rust to satisfy the general prerequisites.
  • Implement the necessary common pre- and post-processing functions in Rust.
  • Set up inference functions for each task using the selected model and pre-trained weights obtained from fine-tuning.
  • Create demo examples and tests for each supported document AI model and task.
  • Document, modularize code to expose relevant functions to users, and publish the library.
@sarrah-basta (Contributor Author)

Document AI Tasks

General Prerequisites and Common Pre-processing Functions

  1. OCR Engine - Google's Tesseract
    The text in the image needs to be extracted and understood (as words plus bounding boxes indicating where those words occur) for all succeeding models, except OCR-free document understanding transformers like Donut. While other options exist, importing libtesseract (its C API) into WasmEdge and calling it from Rust will be the most beneficial, as the downstream models were fine-tuned using output from the same engine.

  2. Tokenization - Hugging Face's tokenizers library
    The words and bounding boxes obtained from the OCR engine need to be passed to a tokenizer to process them further. The aim is to replicate the PreTrainedTokenizerFast class in Hugging Face, which is based on their tokenizers library written in Rust.

  3. General pre- and post-processing functions
    These include all functions required to prepare input features for the vision models and to post-process their outputs, based on the ImageProcessor and FeatureExtractor classes in Hugging Face.

Document AI Tasks and Model Selection

Document AI multimodal models jointly pre-train on text, layout, and image information in a single framework, using large-scale unlabeled scanned or digital-born documents. These models are then used in visually rich downstream document understanding tasks by fine-tuning them on the labeled benchmark dataset for each task. The following table outlines the tasks, datasets, and corresponding models to be supported in this project.

| Document AI Task | Benchmark Dataset | Required Inputs | Model | Expected Outputs | Reference |
|---|---|---|---|---|---|
| Optical Character Recognition | - | Image | Host function to the Tesseract OCR C API | Words, bounding boxes, and optional tokenization | - |
| Document Image Classification | RVL-CDIP | pixel_values (from ImageProcessor and FeatureExtractor) | microsoft/dit-base-finetuned-rvlcdip | Predicted class with maximum score | Notebook 1 and Notebook 2 |
| Document Layout Analysis | PubLayNet | To be tested | nielsr/dit-document-layout-analysis | Set of segmentation masks/bounding boxes, along with class names and scores | Spaces and Python Script |
| Document Parsing | FUNSD | input_ids, token_type_ids, attention_mask, bbox, labels, image (from OCR engine, Tokenizer, and ImageProcessor) | nielsr/layoutlmv2-finetuned-funsd | A set of tokenized sequences and corresponding bounding boxes | Notebook |
| Table Detection and Extraction | PubTables-1M | pixel_values (from ImageProcessor) | microsoft/table-transformer-detection (https://huggingface.co/microsoft/table-transformer-detection) | A struct containing the bounding box and confidence for each detected table | Notebook |
| Document Visual Question Answering | DocVQA | input_ids, token_type_ids, attention_mask, bbox, labels, image, and the question (from OCR engine, Tokenizer, and ImageProcessor) | layoutlmv3-base-mpdocvqa | A single answer to the asked question | Notebook |

Discussion Topics

  1. I have tried selecting the best SOTA models with fine-tuned weights available for each task. LayoutLMv3, which builds on the earlier LayoutLM architectures but is more efficient, is one such model. However, a few places warn about its licensing issues, which I looked up in this thread. Can anyone clarify whether it would be okay to use it for this project?
  2. In some places, the Hugging Face processing functions provide batch support. Since WASI-NN is only to be used for inference and not training, is it okay to skip batch support and just pass one object at a time to the functions?

@sarrah-basta (Contributor Author)

Week 1-2 Progress Update

To achieve the first main goal of integrating Document AI, I compiled a Rust wrapper for Tesseract to WebAssembly, using WasmEdge plugins, to perform OCR on images.

  • rusty-tesseract is a Rust wrapper around the command-line functionality of Tesseract; it uses Rust code to run the Tesseract command-line executable on the host, which can be installed with sudo apt install tesseract-ocr on Linux.
  • Changes made to it so that it compiles to WebAssembly (a sketch of the substitution follows this list):
    1. subprocess : I believe this is for executing multiple processes, but I could not find the crate being used anywhere in the wrapper, so I safely removed it from the dependencies.
    2. polars : This was used as a DataFrame library. Since we do not need this functionality and only need the words, bounding boxes, and confidences in raw form, I removed it as well.
    3. std::process::Command : This was necessary to run the command-line functionality, but I was able to replace it with wasmedge_process_interface (https://github.com/second-state/wasmedge_process_interface) for the same functionality.
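
As a rough sketch of the third change, assuming the Command API exposed by wasmedge_process_interface mirrors std::process::Command (the tesseract CLI arguments here are illustrative):

```rust
// Swap std::process::Command for the host-backed Command provided by the
// wasmedge_process plugin's Rust interface.
use wasmedge_process_interface::Command;

fn run_tesseract_tsv(image_path: &str) -> String {
    // Ask the host's tesseract binary to print TSV output on stdout.
    let out = Command::new("tesseract")
        .arg(image_path)
        .arg("stdout")
        .arg("tsv")
        .output();
    String::from_utf8(out.stdout).expect("tesseract emitted invalid UTF-8")
}
```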

Thus, Tesseract OCR can now be driven from Rust code compiled to Wasm; the rough test code is uploaded at https://github.com/sarrah-basta/wasmedge_ai_testing/blob/main/rusty-tesseract-wasm/README.md#build-instructions-to-build-the-wrapper .

Week 2-3 Plan

  • Understand the other main functions needed to use the words and bounding boxes produced by the OCR above in a complete model, and test a model that depends on Tesseract; this should surface most of the potential problems we will have with other models.

@juntao (Member) commented Apr 1, 2023

Thank you!

@sarrah-basta (Contributor Author)

Week 3 Progress Update

To create the next main pre-processing block, I worked on a tokenizer built on the Rust tokenizers core library, compiled to WebAssembly, to tokenize the text produced by OCR.

  • Reverse-engineered the source code to understand which options, models, and configurations of the core Rust library were being used by the LayoutLMv2Processor.
  • Read through the source code of the ImageProcessor and Tokenizer classes, extracted only the relevant Python code, and got it working; it can be found here:
    Python Tokenizer
  • Next, the primary pseudocode was as follows :
    1. Create a "Fast" tokenizer using the Rust Core library
    2. Encode the words obtained from OCR using the Rust function
    3. Convert the encodings obtained into a proper dictionary and then tensor format to feed to the model.

I was able to create this Rust code, compiled with WasmEdge, to solve the first two parts, i.e., create the correct tokenizer and obtain the encodings (a condensed sketch follows).
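
For reference, a condensed sketch of those two steps with the tokenizers crate; the tokenizer.json file name is a placeholder, and the real code also has to mirror the padding/truncation settings of the Python LayoutLMv2Processor:

```rust
use tokenizers::Tokenizer;

fn encode_ocr_words(words: Vec<String>) -> tokenizers::Encoding {
    // 1. Create the "Fast" tokenizer from a serialized config exported from
    //    the Hugging Face hub (placeholder file name).
    let tokenizer = Tokenizer::from_file("layoutlmv2-tokenizer.json").unwrap();

    // 2. Encode the pre-tokenized OCR words; `true` adds the special
    //    [CLS]/[SEP] tokens the model expects.
    let encoding = tokenizer.encode(words, true).unwrap();

    // The pieces needed later for the model tensors come off the encoding:
    // encoding.get_ids(), encoding.get_attention_mask(),
    // encoding.get_type_ids(), and encoding.get_word_ids() (to map tokens
    // back to their OCR bounding boxes).
    encoding
}
```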

Note: I tested this using words obtained via OCR from the Hugging Face ImageProcessor; this will later be replaced by the Wasm implementation of Tesseract created earlier.
I compared the tokens obtained from my Rust implementation with those from the Hugging Face Python classes here, and they mostly look correct.

Week 4 Plan

  • The next steps will be to create the end-to-end pipeline for the Sequence Labelling / Document Parsing task, which is what I have been referring to so far. This will include
    a. Integrating the rusty-tesseract-wasm OCR
    b. Converting the obtained encodings to the correct tensor formats
    c. Inferencing the tensors with the PyTorch model using the Wasi-NN plugin

  • Once this proof of concept is complete, I will be able to
    a. Clean the code and add input checks, OS checks (for running CLI tesseract commands), etc., and
    b. Divide the code into modular functions.

@juntao (Member) commented Apr 7, 2023

Thank you so much for the update! I just want to clarify that you have created no additional host functions / plugins. You got the entire OCR program working inside WasmEdge (Rust compiled into Wasm). Is that correct? Thanks!

@sarrah-basta (Contributor Author) commented Apr 8, 2023

> I just want to clarify that you have created no additional host functions / plugins

Yes @juntao, that's correct. I originally thought that would be needed, by leveraging the C API of Tesseract; instead, since Tesseract has command-line functionality that can be used by simply installing the pre-built binaries, I decided to leverage that.

> entire OCR program working inside WasmEdge (Rust compiled into Wasm)

Hence, yes, the entire program now works inside WasmEdge. I did, however, have to use a plugin, wasmedge_process_interface, to access the command-line functionality of the native operating system (on which WasmEdge is running) while the user's Wasm is being executed in WasmEdge.

Hope this clarifies the need and the functioning, thank you!

P.S. pytesseract, the Python wrapper for Tesseract used in most of the AI applications I am referencing, takes an identical approach.

@sarrah-basta (Contributor Author)

Week 4 & 5 Progress Update

To create the modular end-to-end pipeline for LayoutLMv2ModelForTokenClassification (currently using the temporary CLI-based method for Tesseract OCR), I created the following preprocessing functions to get the inputs required by the model into the correct tensor formats (a sketch of the two tensor helpers follows this list).

  • (words, boxes) = apply_tesseract(image_name, image_width, image_height) : applies Tesseract OCR through the wrapper around the CLI functionality of the Tesseract engine and parses the output, returning vectors of the words and bounding boxes obtained.
  • base_encodings = layoutlmv2_tokenizer(words) : creates a "Fast" tokenizer using the Rust core library and encodes the words obtained from OCR, turning the Week 3 work into a modular function.
  • bboxes = encoded_boxes(&base_encodings, boxes) : creates the bboxes in the format needed by the model, using the ids from the encodings created by the tokenizer and the boxes produced by the OCR.
  • resize_image and to_bgr_image : basic image-processing functions that convert the image to the format required by the model.
  • Preprocessed the bboxes vector obtained from encoded_boxes.
  • Extracted the input_ids, attention_mask, and token_type_ids required by the model from the encodings.
  • f32_to_tensor_data : takes in a Vec<f32> and returns a Vec<u8>, converting the f32 data to bytes so it can be turned into a tensor.
  • to_tensor : takes in the tensor_data and converts it into a wasi_nn::Tensor.
  • Converted all the required inputs above into the tensor format for the model.
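
A sketch of the two tensor helpers; the wasi_nn::Tensor field names follow the wasi-nn crate version used in the WasmEdge PyTorch examples and may differ in other versions:

```rust
/// Convert a slice of f32 values into the raw byte buffer that
/// wasi-nn expects as tensor data.
fn f32_to_tensor_data(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Wrap a prepared byte buffer and its shape into a wasi_nn::Tensor.
fn to_tensor<'a>(dimensions: &'a [u32], tensor_data: &'a [u8]) -> wasi_nn::Tensor<'a> {
    wasi_nn::Tensor {
        dimensions,
        type_: wasi_nn::TENSOR_TYPE_F32,
        data: tensor_data,
    }
}
```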

Next, I obtained the required model with the fine-tuned weights and traced it in Python to convert it to TorchScript in the function infer_layout_lmv2 . I communicated with the mentors throughout, and while I was able to solve most of the issues I faced, I am still hitting the following errors, which leave the inference step of the end-to-end pipeline unfinished.

The code for these preprocessing functions is at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_model .

Errors I am currently facing

```
[2023-04-16 20:59:08.292] [error] [WASI-NN] Only F32 inputs and outputs are supported for now.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NnErrno { code: 1, name: "INVALID_ARGUMENT", message: "" }', src/main.rs:107:9
stack backtrace:
[2023-04-16 20:59:08.293] [error] execution failed: unreachable, Code: 0x89
[2023-04-16 20:59:08.293] [error]     In instruction: unreachable (0x00) , Bytecode offset: 0x001a8568
[2023-04-16 20:59:08.293] [error]     When executing function name: "_start"
```

This error is caused by this check in the plugins/wasi-nn source code, even though all the tensors I created for the inputs are of the correct types.
What I believe is happening instead is that the model only accepts some inputs as PyTorch LongTensors (integer tensors), and hence F32 FloatTensors won't work here. I tried various things while tracing the model, such as converting inputs to FloatTensors before tracing, and noticed that the arguments image and attention_mask work with any datatype, whereas input_ids, bbox, and token_type_ids expect integer values only.

Possible solutions

I am currently a little stuck and would appreciate some guidance on how to approach this further. Is there a reason for supporting only F32 tensors in the WASI-NN plugin for the PyTorch backend, and if so, is there any way to change the expectations of the TorchScript or PyTorch model? Hopefully @juntao can give some insight.

Week 6 Plan

  • Another concern raised by @juntao was that the CLI dependency of the tesseract-ocr Rust wrapper I have been using so far is not a viable long-term solution, because using the command-line plugin breaks the Wasm sandbox in very unpredictable ways: the CLI program could write files or use the network without any constraints.

Hence, I have been exploring the C API to get identical results, and while I wait for guidance on the issue above, I will go ahead with creating a host function with the Rust plugin SDK to call its functions, after registering the Tesseract C API as a WasmEdge plugin (similar to https://github.com/WasmEdge/WasmEdge/blob/master/examples/plugin/get-string/getstring.cpp ).

@juntao (Member) commented Apr 19, 2023

@apepkuss and @q82419 Can you comment on the issue about WASI-NN accepting only f32-typed tensors? Thanks.

@apepkuss (Collaborator)

@q82419 According to the investigation by @sarrah-basta, the wasi-nn plugin has a type check between lines 623-627. Could you please help fix the issue? Thanks a lot!

@sarrah-basta (Contributor Author)

Week 6 & 7 Progress Update

To create an OCR solution using the Tesseract API, avoiding the CLI dependency of the command-line plugin that breaks the Wasm sandbox in very unpredictable ways, I created:

  1. A host function in the WasmEdge C++ SDK that uses the Tesseract C++ API, registered as a WasmEdge plugin - Wasi-OCR -> https://github.com/sarrah-basta/WasmEdge/tree/wasi_ocr/plugins/wasi_ocr
  2. A Rust library crate, wasi-ocr, that utilizes the plugin's functions: it takes in an image and returns a struct of Data containing all the useful information extracted by the OCR engine -> https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/wasi-ocr

The basic flow of the created code is as follows:

The Rust library contains

  • a public function image_to_data(image_path : &str) -> Vec<Data> --> This function uses two private functions in the library crate to convert the image path into a CString (so that it can be read by the C++ plugin) and calls the plugin function wasi_ephemeral_ocr::num_of_extractions(image_path: *const c_char, image_len: u32) -> u32, which returns the length of the buffer required to store the TSV text output produced by Tesseract.

This length is then passed to another plugin function, wasi_ephemeral_ocr::get_output(output_buf: *mut c_char, output_buf_max_size: u32) -> u32, which stores the output obtained via the char *TessBaseAPI::GetTSVText API function in the output buffer. The pointer to the buffer is decoded as a &CStr object and then converted to the appropriate String format in Rust, taking care to transfer ownership to Rust so as not to run into encoding problems. A condensed sketch of this flow follows.
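
To make the flow concrete, here is a condensed sketch of how these imports might be declared and called from the crate; the buffer handling is simplified, the image_to_tsv name is hypothetical, and the parsing of the TSV fields into Data (shown next) is elided:

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// The two host functions exposed by the Wasi-OCR plugin.
#[link(wasm_import_module = "wasi_ephemeral_ocr")]
extern "C" {
    fn num_of_extractions(image_path: *const c_char, image_len: u32) -> u32;
    fn get_output(output_buf: *mut c_char, output_buf_max_size: u32) -> u32;
}

/// Fetch the raw TSV text Tesseract produced for an image.
fn image_to_tsv(image_path: &str) -> String {
    let c_path = CString::new(image_path).expect("path contains a NUL byte");
    unsafe {
        // First call: how large a buffer does the TSV output need?
        let buf_size = num_of_extractions(c_path.as_ptr(), image_path.len() as u32);

        // Second call: fill a buffer of that size (plus a NUL terminator).
        let mut buf = vec![0u8; buf_size as usize + 1];
        get_output(buf.as_mut_ptr() as *mut c_char, buf_size);

        // Decode the C string and copy it into an owned Rust String.
        CStr::from_ptr(buf.as_ptr() as *const c_char)
            .to_string_lossy()
            .into_owned()
    }
}
```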

This String is then parsed, and each detection made is fed into a Data struct containing the following fields:

```rust
pub struct Data {
    pub level: i32,
    pub page_num: i32,
    pub block_num: i32,
    pub par_num: i32,
    pub line_num: i32,
    pub word_num: i32,
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
    pub conf: f32,
    pub text: String,
}
```

A vector of such structs is returned by the public image_to_data function of the library, and can be used in any downstream task.

I will be using it in the layoutlmv2 model created earlier.

The WasmEdge plugin Wasi-OCR contains the two plugin functions described above, plus the functions necessary to register it as a module, in the following file structure:

  • wasi_ocr
    • CMakeLists.txt
    • wasiocrenv.h
    • wasiocrenv.cpp
    • wasiocrfunc.h
    • wasiocrfunc.cpp
    • wasiocrmodule.h
    • wasiocrmodule.cpp

The Tesseract API object is created when the environment is created and is destroyed at the end of each image_to_data call.

Dependencies and Install Instructions for the plugin

The Tesseract API has two dependencies, which can be installed as follows:

sudo apt install tesseract-ocr
sudo apt-get install libleptonica-dev

More detailed instructions can be found at https://tesseract-ocr.github.io/tessdoc/Installation.html but only the above two libraries are necessary.

They are then linked via the CMakeLists.

Building WasmEdge with the plugin
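
(This section was left empty in the original comment. A typical out-of-tree build would look roughly like the sketch below; the WASMEDGE_PLUGIN_WASI_OCR option name is an assumption based on how other WasmEdge plugins are gated, not a confirmed flag.)

```
# Assumed build flow for the fork that carries the plugin (branch from the
# link above); the plugin cmake option name is an assumption.
git clone -b wasi_ocr https://github.com/sarrah-basta/WasmEdge.git
cd WasmEdge
cmake -Bbuild -GNinja -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_PLUGIN_WASI_OCR=ON .
cmake --build build
```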

Week 8 Plan

Since the concern about the CLI dependency is now resolved, this week I can focus on

  • Looking further into the problem discussed in the previous report, where LongTensors are not compatible with the Wasi-NN plugin, encountered while running inference on the LayoutLMv2 model. If @apepkuss and @q82419 could help, I would be very grateful.
  • Creating a similar proof of concept, with preprocessing functions and inference, for the DiT model.

These two models and the preprocessing functions already created will be used for 4 different Document AI tasks outlined in the first comment in this issue. Once the inferencing is (hopefully) successfully done, the last 4 weeks should be spent creating the post-processing functions and packaging the code written.

@sarrah-basta (Contributor Author)

Week 8 Progress Update

  1. Created the end-to-end pipeline of the LayoutLMv2ModelForTokenClassification using the Wasi-OCR plugin (built earlier to obtain Tesseract results from within WasmEdge), including the preprocessing functions that get the inputs required by the layoutlmv2 model into the correct tensor formats.
    The code and a detailed description of how it works can be found at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_with_wasi_ocr/README.md
  • Dependencies and install instructions for the pipeline:
    WasmEdge needs to be built with both the Wasi-OCR and Wasi-NN (PyTorch backend) plugins.
  2. Investigated further the error caused by this check in the plugins/wasi-nn source code.

Week 9 Plan

  • Work with the community to solve the issue encountered in the Wasi-NN plugin.
  • Create a similar proof of concept, with preprocessing functions and inference, for the DiT model.

These two models and the preprocessing functions already created will be used for 4 different Document AI tasks outlined in the first comment in this issue. Once the inferencing is (hopefully) successfully done, the last 4 weeks should be spent creating the post-processing functions and packaging the code written.

@sarrah-basta (Contributor Author)

Week 9 Progress Update

  1. Tested more versions and kept the community updated to help solve the issue encountered in the Wasi-NN plugin.
  2. Traced the DiT model for Document Image Classification
  3. Added support in Wasi-NN plugin for Tuple type output to support DiT model.
  4. Tested successful inference of DiT model fine-tuned on RVL-CDIP in Wasi-NN with Dummy Inputs
  5. Created pre-processing functions for resize_image and normalize_image for DiT model.

Week 10 Plan

  • Create an end-to-end pipeline, along with post-processing functions, for the DiT model.
  • Put all pre- and post-processing functions created so far for the Document AI tasks into a Rust library crate, and create a user interface to execute each task.
