LFX Workspace: A Rust library crate for MediaPipe models for WasmEdge NN #2355

Closed · 16 of 17 tasks
yanghaku opened this issue Mar 11, 2023 · 15 comments
Labels: LFX Mentorship (Tasks for LFX Mentorship participants)

Comments
yanghaku (Collaborator) commented Mar 11, 2023

Motivation

MediaPipe is a collection of ML models for streaming data. The official website provides Python, iOS, Android, and TFLite-JS SDKs for using those models. As WasmEdge is increasingly used in data-streaming applications, we would like to build a Rust library crate that enables easy integration of MediaPipe models into WasmEdge applications.

Details

Each MediaPipe model has a description page that describes its input and output tensors. The models are available in TensorFlow Lite format, which is supported by the WasmEdge TensorFlow Lite plugin.

We need at least one set of library functions for each model in MediaPipe. Each library function takes in a media object and returns the inference result. The function performs the following tasks (a sketch follows the list).

  • Process the input media object (e.g., a byte array for a JPEG image) into a tensor for the model. As an example, you could use the Rust imageproc crate to process the image into a vector.
  • Use WasmEdge NN to run inference of the input tensor on the model.
  • Collect and interpret the result tensor.
    • The function should at least return a struct containing the output parameters described on the model description page. For example, a face detection function should return a vector of structs, each containing the coordinates of a detected face.
    • The function should also return a visual representation of the inference results. For example, we should overlay detected face boundaries and landmarks on the original image. As an example, the draw_hollow_rect_mut() in imageproc could be used to draw detected boundaries.
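
A minimal sketch of what one such library function might look like. The function name detect_faces, the FaceDetection struct, and the use of anyhow are illustrative assumptions rather than the crate's actual API, and the wasi-nn inference call is elided:

```rust
use image::{DynamicImage, Rgba};
use imageproc::drawing::draw_hollow_rect_mut;
use imageproc::rect::Rect;

/// Illustrative output struct; the fields follow the model description page.
pub struct FaceDetection {
    pub score: f32,
    pub x: i32, // top-left corner of the bounding box, in pixels
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Hypothetical library function: JPEG bytes in, detections plus an
/// annotated copy of the input image out.
pub fn detect_faces(jpeg: &[u8]) -> anyhow::Result<(Vec<FaceDetection>, DynamicImage)> {
    // 1. Process the input media object into a tensor for the model.
    let img = image::load_from_memory(jpeg)?;
    // ... resize/normalize into the model's input tensor here ...

    // 2. Run inference on the tensor through WasmEdge NN (wasi-nn), elided.
    let detections: Vec<FaceDetection> = Vec::new(); // placeholder result

    // 3. Overlay the detected boundaries on the original image.
    let mut annotated = img.to_rgba8();
    for d in &detections {
        let rect = Rect::at(d.x, d.y).of_size(d.width, d.height);
        draw_hollow_rect_mut(&mut annotated, rect, Rgba([255, 0, 0, 255]));
    }
    Ok((detections, DynamicImage::ImageRgba8(annotated)))
}
```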

Milestones

  • Create a list of models and, for each model, list the pre- and post-processing functions needed.
  • Implement the tasks: image classification (no video support) and object detection (no video support). (1 week)
  • Implement the tasks: text classification and audio classification. (2 weeks)
  • Identify the functions we need in OpenCV, and try to implement video support for vision tasks. (2 weeks)
  • Implement all other vision tasks, such as hand landmarks detection. (2 weeks)
  • Build a new TfLite library that includes the MediaPipe custom operators. (1 week)
  • Try to implement GPU support for MediaPipe models. (1 week)
  • Write the documentation, then publish the library to crates.io. (1 week)

Repository URL: originally https://github.com/yanghaku/mediapipe-rs-dev; it will be transferred to https://github.com/WasmEdge/mediapipe-rs

MediaPipe task progress:

  • Object Detection
  • Image Classification
  • Image Segmentation
  • Gesture Recognition
  • Hand Landmark Detection
  • Image Embedding
  • Face Detection
  • Audio Classification
  • Text Classification

Appendix

feat: A Rust library crate for MediaPipe models for WasmEdge NN

yanghaku (Collaborator, Author) commented Mar 11, 2023

MediaPipe Solutions

1. Overview

  • Vision Tasks
    • Face detection (upgrade in progress)
    • Face mesh (upgrade in progress)
    • Pose landmark detection (upgrade in progress)
    • Holistic (upgrade in progress)
    • Image Segmentation (upgrade in progress)
    • Object detection
    • Image classification
    • Hand landmarks detection
    • Gesture recognition
  • Text Tasks
    • Text classification
  • Audio Tasks
    • Audio classification

2. Vision Tasks

2.1. Pre-processing for vision tasks

Vision tasks have three input media types: image, video, and live stream.

For video and live streams, we must first decode them into images.
For images, apply the following operations:

  • Image Transformation (scale the image)
  • Image To TfLite Tensor (fp32 or uint8, NHWC layout for TfLite; a conversion sketch follows this list)
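
A minimal sketch of these two steps for a single image, using the image crate. The [0, 1] fp32 normalization is an assumption; the actual range comes from the model's metadata:

```rust
use image::imageops::FilterType;

/// Scale the image to the model's input size and flatten it into an fp32
/// NHWC buffer (N = 1). Pixels come out in row-major RGB order, which is
/// exactly the HWC layout TfLite expects.
fn image_to_nhwc_f32(img: &image::DynamicImage, w: u32, h: u32) -> Vec<f32> {
    let resized = img.resize_exact(w, h, FilterType::Triangle).to_rgb8();
    resized
        .pixels()
        .flat_map(|p| p.0.iter().map(|&c| c as f32 / 255.0))
        .collect()
}
```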

2.2. Object detection

Number of models: 6. See the official website for more information about these 6 models.

post-process

  • TfLite Tensors To Detections
  • Sigmoid Scores
  • Generate SSD Anchors
  • Perform Non-Maximum Suppression (NMS) (a sketch follows this list)
  • Detection Projection (projects detections back to the original coordinate system)
  • Deduplicate Detections (for the same bounding box coordinates)
  • Draw annotations (map label ids to label names, draw detections onto the image)
  • Encode To Video Frame (if the output is video)
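
For reference, the NMS step can be implemented greedily once detections are scored. A generic sketch, not the crate's actual code:

```rust
#[derive(Clone)]
struct Detection { score: f32, xmin: f32, ymin: f32, xmax: f32, ymax: f32 }

/// Intersection-over-union of two axis-aligned boxes.
fn iou(a: &Detection, b: &Detection) -> f32 {
    let ix = (a.xmax.min(b.xmax) - a.xmin.max(b.xmin)).max(0.0);
    let iy = (a.ymax.min(b.ymax) - a.ymin.max(b.ymin)).max(0.0);
    let inter = ix * iy;
    let area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin);
    let area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin);
    inter / (area_a + area_b - inter)
}

/// Greedy non-maximum suppression: keep the highest-scoring box, drop
/// everything that overlaps it more than `iou_threshold`, repeat.
fn nms(mut dets: Vec<Detection>, iou_threshold: f32) -> Vec<Detection> {
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        if kept.iter().all(|k| iou(k, &d) <= iou_threshold) {
            kept.push(d);
        }
    }
    kept
}
```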

2.3. Image classification

Number of models: 4. See the official website for more information about these 4 models.

post-process

  • TfLite Tensor To Classifications (if the output tensors are quantized, they must be dequantized first; a sketch follows this list)
  • Label Id To Label Name (filter using the score threshold, category allow list, and category deny list)
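
A minimal sketch of the dequantization and filtering steps. The scale and zero-point values come from the model's quantization parameters; the function names are illustrative:

```rust
/// Dequantize a uint8 output tensor using the model's quantization
/// parameters: real_value = (quantized - zero_point) * scale.
fn dequantize(quantized: &[u8], scale: f32, zero_point: i32) -> Vec<f32> {
    quantized
        .iter()
        .map(|&q| (q as i32 - zero_point) as f32 * scale)
        .collect()
}

/// Map scores to (label, score) pairs, filtered by a score threshold.
fn to_classifications<'a>(
    scores: &[f32],
    labels: &'a [String],
    threshold: f32,
) -> Vec<(&'a str, f32)> {
    scores
        .iter()
        .zip(labels)
        .filter(|(s, _)| **s >= threshold)
        .map(|(s, l)| (l.as_str(), *s))
        .collect()
}
```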

2.4. Hand landmarks detection

models: The hand landmarker model bundle contains a hand detection model and a hand landmarks detection model. The hand detection model locates hands within the input image, and the hand landmarks detection model identifies specific hand landmarks on the cropped hand image produced by the hand detection model.

Phase 1: hand detection post-process

  • TfLite Tensors To Detections
  • Sigmoid Scores
  • Generate SSD Anchors
  • Perform Non-Maximum Suppression (NMS)
  • Detection Projection (projects detections back to the original coordinate system)
  • Detections To Rectangles (generate the hand rectangles)

Phase 2: hand landmarks detection post-process

  • TfLite Tensors To Landmarks (a decoding sketch follows this list)
  • Landmark Projection
  • Hand Landmarks To Rectangles
  • TfLite Tensor To Classification (for handedness)
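
A sketch of the "TfLite Tensors To Landmarks" step. The [x, y, z] layout and the normalization by model input size follow MediaPipe's usual convention, but the exact details are model-specific:

```rust
/// A landmark as (x, y, z) in normalized image coordinates.
struct Landmark { x: f32, y: f32, z: f32 }

/// Decode a flat fp32 output tensor of `n` landmarks laid out as
/// [x0, y0, z0, x1, y1, z1, ...], normalizing pixel coordinates by the
/// model input size (z is conventionally normalized by the width).
fn tensor_to_landmarks(t: &[f32], n: usize, in_w: f32, in_h: f32) -> Vec<Landmark> {
    (0..n)
        .map(|i| Landmark {
            x: t[3 * i] / in_w,
            y: t[3 * i + 1] / in_h,
            z: t[3 * i + 2] / in_w,
        })
        .collect()
}
```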

Phase 3: generate output

  • Filter duplicate hand landmarks (by finding the overlapped hands)

2.5. Gesture recognition

models: The Gesture Recognizer contains two pre-packaged model bundles: a hand landmark model bundle and a gesture classification model bundle. The landmark model detects the presence of hands and hand geometry, and the gesture recognition model recognizes gestures based on hand geometry.

Phase 1: Hand landmarks detection (see section 2.4)

Phase 2: Hand gesture recognition process and post-process

  • Handedness To Matrix
  • Landmarks To Matrix
  • Matrix To TfLite Tensor
  • TfLite Tensor To Classification

3. Text Tasks

3.1. Text classification

Number of models: 2. See the official website for more information about these 2 models.

pre-process

  • Input Text To Tokens (WordPiece tokenization for BERT, regex tokenization for the average word embedding model; a WordPiece sketch follows this list)
  • Tokens To TfLite Tensor
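
A sketch of the greedy longest-match-first WordPiece algorithm used by BERT-style tokenizers. The vocabulary map and [UNK] handling follow the standard algorithm and are not the crate's code:

```rust
use std::collections::HashMap;

/// Greedy longest-match-first WordPiece tokenization of one whitespace-split
/// word, as used by BERT-style models. Non-initial sub-tokens carry the
/// "##" continuation prefix; if no split works, the whole word maps to [UNK].
fn wordpiece(word: &str, vocab: &HashMap<String, i32>, unk_id: i32) -> Vec<i32> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Search for the longest vocabulary entry starting at `start`.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if let Some(&id) = vocab.get(&piece) {
                found = Some((id, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((id, e)) => {
                ids.push(id);
                start = e;
            }
            None => return vec![unk_id], // the whole word maps to [UNK]
        }
    }
    ids
}
```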

post-process

  • TfLite Tensor To Classifications (If output tensors are quantized, they must be dequantized first)
  • Label Id to Label Name (use score threshold, category allow list, category deny list to filter)

4. Audio Tasks

4.1. Audio classification

Number of models: 1. See the official website for more information about the model.

pre-process

  • Sample, pad, and apply an FFT, then convert the resulting matrix to a TfLite tensor (see the sketch below).
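
A minimal sketch of the framing, padding, and FFT step, assuming the rustfft crate. Window functions and mel binning, which real audio models also apply, are omitted for brevity:

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Frame the samples, zero-pad the last frame, and FFT each frame into a
/// magnitude spectrum. The frames can then be packed into the input tensor.
fn spectrogram(samples: &[f32], frame_len: usize, hop: usize) -> Vec<Vec<f32>> {
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(frame_len);
    let mut frames = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + frame_len).min(samples.len());
        let mut buf: Vec<Complex<f32>> = samples[start..end]
            .iter()
            .map(|&s| Complex::new(s, 0.0))
            .collect();
        buf.resize(frame_len, Complex::new(0.0, 0.0)); // zero padding
        fft.process(&mut buf);
        frames.push(buf.iter().map(|c| c.norm()).collect());
        start += hop;
    }
    frames
}
```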

post-process

  • TfLite Tensor To Classifications (If output tensors are quantized, they must be dequantized first)
  • Label Id to Label Name (use score threshold, category allow list, category deny list to filter)

Discussion: do we need to import the MediaPipe C library into WasmEdge as host functions?

In my opinion, most of the data-processing functions can be implemented in Rust, even though some complex parts may take some time, such as WordPiece tokenization for BERT and the FFT for audio pre-processing.

For image decoding and encoding, we can use image or imageproc, but for video there are no libraries available, and it would take too much time to develop our own. I think we could import some video libraries, such as OpenCV, into WasmEdge to solve the problem.

That's all, thanks.

yanghaku (Collaborator, Author) commented

Week 1 Report

In the first week, I designed the project architecture and implemented two tasks: image classification and object detection (including pre-processing, post-processing, documentation, tests, and examples).

1. Multiple model format support.

In MediaPipe, all the essential information is bundled into the models: input tensor shapes, output tensor shapes, classification label files, quantization parameters, and so on. So we have to parse the TfLite model to get this information.

I therefore designed a model resource abstraction layer, which abstracts over model information. Using it, the library can support other model formats simply by implementing the corresponding model parser.

The architecture is as follows:

---------------------------------------------------------------------------------------
|                                      user code                                      |
---------------------------------------------------------------------------------------
     ↓ (media input)                                              ↑ (result output)
=======================================================================================
     ↓                                                            ↑
--------------------           --------------------           --------------------
|  pre-processing  |           |   do inference   |           | post-processing  |
|   (to tensor)    |     ➔     |                  |     ➔     | (extract tensor) |
--------------------           --------------------           --------------------
        ↑ (get info)                                                ↑ (get info)
---------------------------------------------------------------------------------------
|                           model resource abstraction                                |
---------------------------------------------------------------------------------------
        ↑                              ↑                               ↑
------------------------      ------------------------      ------------------------
| TfLite model parser  |      | PyTorch model parser |      | etc.                 |
------------------------      ------------------------      ------------------------

When loading a model, the library detects the model format using the file-head magic and calls the corresponding model parser to collect all the model information we need.

For now, only the TfLite model parser is implemented.
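
For example, TfLite models are FlatBuffers whose file identifier "TFL3" sits at byte offset 4, so format detection can be a simple prefix check. A sketch; the crate's actual detection logic may differ:

```rust
enum ModelFormat {
    TfLite,
    Unknown,
}

/// Detect the model format from the file-head magic. TfLite models are
/// FlatBuffers carrying the file identifier "TFL3" at byte offset 4.
fn detect_format(data: &[u8]) -> ModelFormat {
    if data.len() >= 8 && &data[4..8] == b"TFL3" {
        ModelFormat::TfLite
    } else {
        ModelFormat::Unknown
    }
}
```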

2. Flexible input.

The library defines a Rust trait, ToTensor; any media type (image, audio, or text) that implements this trait can be used as input.

For images, the library implements ToTensor for the image crate's types, so those images can be used as input directly.
Further, the library can implement ToTensor for OpenCV-format images and videos, so they can be used as input directly as well.
If users have another format, such as ndarray, they can implement ToTensor manually and use it as input.
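
A hypothetical sketch of what such a trait and a user-side implementation could look like; the actual trait in the repo may have a different signature:

```rust
/// A hypothetical shape of the input abstraction: anything that can write
/// itself into the model's input tensor buffer can be used as input.
pub trait ToTensor {
    /// Fill `buf` with tensor data matching `shape` (e.g. NHWC for images).
    fn to_tensor(&self, shape: &[usize], buf: &mut [u8]) -> Result<(), String>;
}

// Example: users with their own media type implement the trait once and
// then pass the value straight to the task's detect/classify functions.
struct RawRgb {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl ToTensor for RawRgb {
    fn to_tensor(&self, shape: &[usize], buf: &mut [u8]) -> Result<(), String> {
        if shape != [1, self.height, self.width, 3] {
            return Err("shape mismatch".into());
        }
        buf.copy_from_slice(&self.data);
        Ok(())
    }
}
```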

Next week's plan

  1. TfLite models bundle their label information in zip format, so we must implement zip extraction to read the label information.
  2. Start work on the audio and text tasks.

yanghaku (Collaborator, Author) commented Apr 1, 2023

Weeks 2–3 Report

Progress

  1. Added zip extraction support to extract files from model data.
  2. Implemented the audio classification task (users can use an audio decoding library such as symphonia to read audio as input).
  3. Implemented the text classification task, with support for the regex tokenizer and the WordPiece tokenizer.
  4. Compiled the FFmpeg library for the wasm32-wasi target.

Next week's plan

  1. Try the FFmpeg library for video decoding and encoding.
  2. Try to implement video support for vision tasks.
  3. Identify the functions we need in OpenCV.

alabulei1 added the LFX Mentorship label on Apr 3, 2023
yanghaku (Collaborator, Author) commented Apr 8, 2023

Week 4 Progress Report

Summary of Progress

  1. Successfully integrated FFmpeg into the project: FFmpeg is exposed as a Rust feature and can process video and audio (a feature-gating sketch follows this list).
  2. Improved audio input: the library now has built-in support for Symphonia, FFmpeg, and raw audio data as input.
  3. Added video support for vision tasks by using FFmpeg as the video decoder and encoder and converting frames to tensors.
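
A sketch of how such feature gating typically looks in Rust; the feature name ffmpeg and the module contents are assumptions:

```rust
// Video support compiles in only when the (hypothetical) `ffmpeg` cargo
// feature is enabled, so default builds stay free of the FFmpeg dependency.
#[cfg(feature = "ffmpeg")]
mod ffmpeg_media {
    /// Decode a video file into raw frames (FFmpeg-backed, body elided).
    pub fn decode_frames(_path: &str) -> Vec<Vec<u8>> {
        unimplemented!()
    }
}
```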

Plan for Next Week

  1. Implement the tasks: "Hand landmarks detection" and "Gesture Recognition".

yanghaku (Collaborator, Author) commented Apr 16, 2023

Week 5 Progress Report

Summary of Progress

  1. Added Region of Interest support, which can select a region of interest and rotate the image by any angle.
  2. Implemented the hand landmarks detection task and its subtask, hand detection.
  3. Implemented the gesture recognition task.
  4. Added more options to the draw_detection and draw_landmarks utilities.

Plan for Next Week

  1. Implement the remaining tasks: "Image segmentation" and "Image embedding".
  2. Support more options in the draw utilities, such as text labels.

juntao (Member) commented Apr 16, 2023

Thanks! Do you have instructions for us to try these?

I think we will probably eventually need ffmpeg to run as a plugin (native host functions) as opposed to compiling to Wasm for performance reasons. But that's for later!

yanghaku (Collaborator, Author) commented

@juntao Yes, the repo has a README file that shows how to use the library, and it has examples and tests that can be run directly with cargo.

Regarding FFmpeg: its performance is slower than native. It is a temporary solution for video/audio processing until the FFmpeg/OpenCV plugins are available. For now, I use FFmpeg-wasm behind a Rust feature, which may be removed once the FFmpeg plugin can be used.

The repo also has scripts that show how to set up the environment and run the examples/tests with FFmpeg.

The documentation and examples are not complete at this time, but I will update them once all MediaPipe tasks have been implemented.

yanghaku (Collaborator, Author) commented

Hello @juntao, I have encountered an issue with the MediaPipe image segmentation task: one of the models uses a custom operator called Convolution2DTransposeBias.

I also noticed that the MediaPipe source code includes 10 custom operators that could potentially be used in the future.

To resolve this issue, one possible solution is to build a custom libtensorflowlite_c.so library that includes these operators.
Alternatively, there may be better solutions that address the problem.

Should I add this to the plan?

juntao (Member) commented Apr 18, 2023

I think you could try to build a new TF library that includes those operators (option 1) and see if it works. If it does, we should perhaps make the new TF library build part of our WASI-NN plugin. Thanks.

yanghaku (Collaborator, Author) commented

OK, I have added it to the milestones.

yanghaku (Collaborator, Author) commented

Week 6 Progress Report

Summary of Progress

  1. Implemented the remaining tasks: "Image segmentation" and "Image embedding". All tasks released in MediaPipe so far have now been implemented, and I will update the repo when new models are released in MediaPipe.
  2. Added more examples and updated the documentation.
  3. Identified the functions we need in OpenCV and FFmpeg (listed below).

Plan for Next Week

  1. Try to build a new TfLite library that includes MediaPipe custom operators

Host Functions we need in OpenCV

OpenCV is written in C++, so we also need the member functions of the classes below.

Classes

  • cv::Mat
  • cv::RotatedRect
  • cv::VideoCapture
  • cv::Rect
  • cv::Vec
  • cv::KeyPoint
  • cv::VideoWriter

Functions

  • cv::imread
  • cv::imdecode
  • cv::resize
  • cv::imencode
  • cv::putText
  • cv::circle
  • cv::line

Host Functions we need in FFmpeg

  • av_strerror
  • av_register_all
  • av_strdup
  • avformat_open_input
  • avformat_close_input
  • avformat_find_stream_info
  • av_find_best_stream
  • av_read_frame
  • av_frame_alloc
  • av_frame_free
  • av_init_packet
  • avcodec_parameters_alloc
  • avcodec_parameters_free
  • avcodec_parameters_to_context
  • avcodec_open2
  • avcodec_close
  • avcodec_send_packet
  • avcodec_receive_frame
  • avfilter_register_all
  • avfilter_graph_alloc
  • avfilter_graph_config
  • avfilter_graph_create_filter
  • avfilter_graph_get_filter
  • avfilter_graph_free
  • avfilter_inout_alloc
  • avfilter_inout_free
  • av_buffersink_get_frame
  • av_buffersrc_add_frame
  • av_buffersrc_close
  • sws_getContext
  • sws_scale
  • sws_freeContext

yanghaku (Collaborator, Author) commented

Week 7 Progress Report

Summary of Progress

  1. Added TfLite custom operator support and created a PR with the solution.
  2. Implemented the MediaPipe custom operators; all models can now run successfully in WasmEdge.

Plan for Next Week

  1. Face Detection is now available in MediaPipe, so I can implement that solution next week.
  2. Support all MediaPipe custom operators in https://github.com/yanghaku/mediapipe-custom-ops.
  3. Start exploring TfLite GPU support for the WasmEdge WASI-NN plugin.

yanghaku (Collaborator, Author) commented May 6, 2023

Week 8 Progress Report

Summary of Progress

  1. Implemented the MediaPipe Face Detection task.
  2. Supported all MediaPipe custom operators in the repo https://github.com/yanghaku/mediapipe-custom-ops; the pre-compiled library can be downloaded from https://github.com/yanghaku/mediapipe-custom-ops/releases.
  3. Added TfLite GPU support for the WasmEdge WASI-NN plugin; the source code is in the branch https://github.com/yanghaku/WasmEdge/tree/wasi_nn_tflite_gpu_test, and the pre-compiled TfLite GPU library can be downloaded from https://drive.google.com/file/d/1cyCAPrtWih6tuFMfaIB78_rTpYFQJfLV/view.

Plan for Next Week

  1. Add more options to the draw_detection and draw_landmarks utilities, such as colors and label text.
  2. Write more examples and documents, then publish the library to crates.io.

juntao (Member) commented May 6, 2023

That's great progress. Thank you so much for the update!

dannypsnl (Member) commented

Closing as completed; feel free to reopen.
