LFX Workspace: A Rust library crate for MediaPipe models for WasmEdge NN #2355

Closed · 16 of 17 tasks
yanghaku opened this issue Mar 11, 2023 · 15 comments
Labels: LFX Mentorship (Tasks for LFX Mentorship participants)

Comments
yanghaku (Collaborator) commented Mar 11, 2023

Motivation

MediaPipe is a collection of ML models for streaming data. The official website provides Python, iOS, Android, and TFLite-JS SDKs for using those models. As WasmEdge is increasingly used in data-streaming applications, we would like to build a Rust library crate that enables easy integration of MediaPipe models into WasmEdge applications.

Details

Each MediaPipe model has a description page that describes its input and output tensors. The models are available in TensorFlow Lite format, which is supported by the WasmEdge TensorFlow Lite plugin.

We need at least one set of library functions for each model in MediaPipe. Each library function takes in a media object and returns the inference result. The function performs the following tasks (a sketch follows the list).

  • Process the input media object (e.g., a byte array for a JPEG image) into a tensor for the model. As an example, you could use the Rust imageproc crate to process the image into a vector.
  • Use WasmEdge NN to run inference of the input tensor on the model.
  • Collect and interpret the result tensor.
    • The function should at least return a struct containing the output parameters described on the model description page. For example, a face detection function should return a vector of structs, each containing the coordinates of a detected face.
    • The function should also return a visual representation of the inference results. For example, we should overlay detected face boundaries and landmarks on the original image. As an example, the draw_hollow_rect_mut() in imageproc could be used to draw detected boundaries.
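
A minimal sketch of what one such library function might look like. The function name detect_faces, the FaceDetection struct, and the use of anyhow are illustrative assumptions rather than the crate's actual API, and the wasi-nn inference call is elided:

```rust
use image::{DynamicImage, Rgba};
use imageproc::drawing::draw_hollow_rect_mut;
use imageproc::rect::Rect;

/// Illustrative output struct; the fields follow the model description page.
pub struct FaceDetection {
    pub score: f32,
    pub x: i32, // top-left corner of the bounding box, in pixels
    pub y: i32,
    pub width: u32,
    pub height: u32,
}

/// Hypothetical library function: JPEG bytes in, detections plus an
/// annotated copy of the input image out.
pub fn detect_faces(jpeg: &[u8]) -> anyhow::Result<(Vec<FaceDetection>, DynamicImage)> {
    // 1. Process the input media object into a tensor for the model.
    let img = image::load_from_memory(jpeg)?;
    // ... resize/normalize into the model's input tensor here ...

    // 2. Run inference on the tensor through WasmEdge NN (wasi-nn), elided.
    let detections: Vec<FaceDetection> = Vec::new(); // placeholder result

    // 3. Overlay the detected boundaries on the original image.
    let mut annotated = img.to_rgba8();
    for d in &detections {
        let rect = Rect::at(d.x, d.y).of_size(d.width, d.height);
        draw_hollow_rect_mut(&mut annotated, rect, Rgba([255, 0, 0, 255]));
    }
    Ok((detections, DynamicImage::ImageRgba8(annotated)))
}
```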

Milestones

  • Create a list of models and, for each model, list the pre- and post-processing functions needed.
  • Implement the tasks: image classification (no video support) and object detection (no video support). (1 week)
  • Implement the tasks: text classification and audio classification. (2 weeks)
  • Identify the functions we need in OpenCV, and try to implement video support for vision tasks. (2 weeks)
  • Implement all other vision tasks, such as hand landmarks detection. (2 weeks)
  • Build a new TfLite library that includes the MediaPipe custom operators. (1 week)
  • Try to implement GPU support for MediaPipe models. (1 week)
  • Write the documentation, then publish the library to crates.io. (1 week)

Repository URL: originally https://github.com/yanghaku/mediapipe-rs-dev; it will be transferred to https://github.com/WasmEdge/mediapipe-rs

MediaPipe task progress:

  • Object Detection
  • Image Classification
  • Image Segmentation
  • Gesture Recognition
  • Hand Landmark Detection
  • Image Embedding
  • Face Detection
  • Audio Classification
  • Text Classification

Appendix

feat: A Rust library crate for MediaPipe models for WasmEdge NN

yanghaku (Collaborator, Author) commented Mar 11, 2023

MediaPipe Solutions

1. Overview

  • Vision Tasks
    • Face detection (upgrade in progress)
    • Face mesh (upgrade in progress)
    • Pose landmark detection (upgrade in progress)
    • Holistic (upgrade in progress)
    • Image Segmentation (upgrade in progress)
    • Object detection
    • Image classification
    • Hand landmarks detection
    • Gesture recognition
  • Text Tasks
    • Text classification
  • Audio Tasks
    • Audio classification

2. Vision Tasks

2.1. Pre-processing for vision tasks

Vision tasks have three input media types: image, video, and live stream.

For video and live streams, we must first decode them into images.
For images, apply the following operations:

  • Image Transformation (scale the image)
  • Image To TfLite Tensor (fp32 or uint8, NHWC layout for TfLite; a conversion sketch follows this list)
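
A minimal sketch of these two steps for a single image, using the image crate. The [0, 1] fp32 normalization is an assumption; the actual range comes from the model's metadata:

```rust
use image::imageops::FilterType;

/// Scale the image to the model's input size and flatten it into an fp32
/// NHWC buffer (N = 1). Pixels come out in row-major RGB order, which is
/// exactly the HWC layout TfLite expects.
fn image_to_nhwc_f32(img: &image::DynamicImage, w: u32, h: u32) -> Vec<f32> {
    let resized = img.resize_exact(w, h, FilterType::Triangle).to_rgb8();
    resized
        .pixels()
        .flat_map(|p| p.0.iter().map(|&c| c as f32 / 255.0))
        .collect()
}
```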

2.2. Object detection

Number of models: 6. See the official website for more information about these 6 models.

post-process

  • TfLite Tensors To Detections
  • Sigmoid Scores
  • Generate SSD Anchors
  • Perform Non-Maximum Suppression (NMS) (a sketch follows this list)
  • Detection Projection (projects detections back to the original coordinate system)
  • Deduplicate Detections (for the same bounding box coordinates)
  • Draw annotations (map label ids to label names, draw detections onto the image)
  • Encode To Video Frame (if the output is video)
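
For reference, the NMS step can be implemented greedily once detections are scored. A generic sketch, not the crate's actual code:

```rust
#[derive(Clone)]
struct Detection { score: f32, xmin: f32, ymin: f32, xmax: f32, ymax: f32 }

/// Intersection-over-union of two axis-aligned boxes.
fn iou(a: &Detection, b: &Detection) -> f32 {
    let ix = (a.xmax.min(b.xmax) - a.xmin.max(b.xmin)).max(0.0);
    let iy = (a.ymax.min(b.ymax) - a.ymin.max(b.ymin)).max(0.0);
    let inter = ix * iy;
    let area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin);
    let area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin);
    inter / (area_a + area_b - inter)
}

/// Greedy non-maximum suppression: keep the highest-scoring box, drop
/// everything that overlaps it more than `iou_threshold`, repeat.
fn nms(mut dets: Vec<Detection>, iou_threshold: f32) -> Vec<Detection> {
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        if kept.iter().all(|k| iou(k, &d) <= iou_threshold) {
            kept.push(d);
        }
    }
    kept
}
```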

2.3. Image classification

Number of models: 4. See the official website for more information about these 4 models.

post-process

  • TfLite Tensor To Classifications (if the output tensors are quantized, they must be dequantized first; a sketch follows this list)
  • Label Id To Label Name (filter using the score threshold, category allow list, and category deny list)
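
A minimal sketch of the dequantization and filtering steps. The scale and zero-point values come from the model's quantization parameters; the function names are illustrative:

```rust
/// Dequantize a uint8 output tensor using the model's quantization
/// parameters: real_value = (quantized - zero_point) * scale.
fn dequantize(quantized: &[u8], scale: f32, zero_point: i32) -> Vec<f32> {
    quantized
        .iter()
        .map(|&q| (q as i32 - zero_point) as f32 * scale)
        .collect()
}

/// Map scores to (label, score) pairs, filtered by a score threshold.
fn to_classifications<'a>(
    scores: &[f32],
    labels: &'a [String],
    threshold: f32,
) -> Vec<(&'a str, f32)> {
    scores
        .iter()
        .zip(labels)
        .filter(|(s, _)| **s >= threshold)
        .map(|(s, l)| (l.as_str(), *s))
        .collect()
}
```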

2.4. Hand landmarks detection

models: The hand landmarker model bundle contains a hand detection model and a hand landmarks detection model. The hand detection model locates hands within the input image, and the hand landmarks detection model identifies specific hand landmarks on the cropped hand image produced by the hand detection model.

Phase 1: hand detection post-process

  • TfLite Tensors To Detections
  • Sigmoid Scores
  • Generate SSD Anchors
  • Perform Non-Maximum Suppression (NMS)
  • Detection Projection (projects detections back to the original coordinate system)
  • Detections To Rectangles (generate the hand rectangles)

Phase 2: hand landmarks detection post-process

  • TfLite Tensors To Landmarks (a decoding sketch follows this list)
  • Landmark Projection
  • Hand Landmarks To Rectangles
  • TfLite Tensor To Classification (for handedness)
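
A sketch of the "TfLite Tensors To Landmarks" step. The [x, y, z] layout and the normalization by model input size follow MediaPipe's usual convention, but the exact details are model-specific:

```rust
/// A landmark as (x, y, z) in normalized image coordinates.
struct Landmark { x: f32, y: f32, z: f32 }

/// Decode a flat fp32 output tensor of `n` landmarks laid out as
/// [x0, y0, z0, x1, y1, z1, ...], normalizing pixel coordinates by the
/// model input size (z is conventionally normalized by the width).
fn tensor_to_landmarks(t: &[f32], n: usize, in_w: f32, in_h: f32) -> Vec<Landmark> {
    (0..n)
        .map(|i| Landmark {
            x: t[3 * i] / in_w,
            y: t[3 * i + 1] / in_h,
            z: t[3 * i + 2] / in_w,
        })
        .collect()
}
```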

Phase 3: generate output

  • Filter duplicate hand landmarks (by finding the overlapped hands)

2.5. Gesture recognition

models: The Gesture Recognizer contains two pre-packaged model bundles: a hand landmark model bundle and a gesture classification model bundle. The landmark model detects the presence of hands and hand geometry, and the gesture recognition model recognizes gestures based on hand geometry.

Phase 1: Hand landmarks detection (see section 2.4)

Phase 2: Hand gesture recognition process and post-process

  • Handedness To Matrix
  • Landmarks To Matrix
  • Matrix To TfLite Tensor
  • TfLite Tensor To Classification

3. Text Tasks

3.1. Text classification

Number of models: 2. See the official website for more information about these 2 models.

pre-process

  • Input Text To Tokens (WordPiece tokenization for BERT, regex tokenization for the average word embedding model; a WordPiece sketch follows this list)
  • Tokens To TfLite Tensor
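
A sketch of the greedy longest-match-first WordPiece algorithm used by BERT-style tokenizers. The vocabulary map and [UNK] handling follow the standard algorithm and are not the crate's code:

```rust
use std::collections::HashMap;

/// Greedy longest-match-first WordPiece tokenization of one whitespace-split
/// word, as used by BERT-style models. Non-initial sub-tokens carry the
/// "##" continuation prefix; if no split works, the whole word maps to [UNK].
fn wordpiece(word: &str, vocab: &HashMap<String, i32>, unk_id: i32) -> Vec<i32> {
    let chars: Vec<char> = word.chars().collect();
    let mut ids = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Search for the longest vocabulary entry starting at `start`.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if let Some(&id) = vocab.get(&piece) {
                found = Some((id, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((id, e)) => {
                ids.push(id);
                start = e;
            }
            None => return vec![unk_id], // the whole word maps to [UNK]
        }
    }
    ids
}
```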

post-process

  • TfLite Tensor To Classifications (If output tensors are quantized, they must be dequantized first)
  • Label Id to Label Name (use score threshold, category allow list, category deny list to filter)

4. Audio Tasks

4.1. Audio classification

Number of models: 1. See the official website for more information about the model.

pre-process

  • Sample, pad, and apply an FFT, then convert the resulting matrix to a TfLite tensor (see the sketch below).
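
A minimal sketch of the framing, padding, and FFT step, assuming the rustfft crate. Window functions and mel binning, which real audio models also apply, are omitted for brevity:

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Frame the samples, zero-pad the last frame, and FFT each frame into a
/// magnitude spectrum. The frames can then be packed into the input tensor.
fn spectrogram(samples: &[f32], frame_len: usize, hop: usize) -> Vec<Vec<f32>> {
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(frame_len);
    let mut frames = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + frame_len).min(samples.len());
        let mut buf: Vec<Complex<f32>> = samples[start..end]
            .iter()
            .map(|&s| Complex::new(s, 0.0))
            .collect();
        buf.resize(frame_len, Complex::new(0.0, 0.0)); // zero padding
        fft.process(&mut buf);
        frames.push(buf.iter().map(|c| c.norm()).collect());
        start += hop;
    }
    frames
}
```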

post-process

  • TfLite Tensor To Classifications (If output tensors are quantized, they must be dequantized first)
  • Label Id to Label Name (use score threshold, category allow list, category deny list to filter)

Discussion: do we need to import the MediaPipe C library into WasmEdge as host functions?

In my opinion, most of the data-processing functions can be implemented in Rust, even though some complex parts may take some time, such as WordPiece tokenization for BERT and the FFT for audio pre-processing.

For image decoding and encoding, we can use image or imageproc, but for video there are no libraries available, and it would take too much time to develop our own. I think we could import some video libraries, such as OpenCV, into WasmEdge to solve the problem.

That's all, thanks.

yanghaku (Collaborator, Author) commented

Week 1 Report

In the first week, I designed the project architecture and implemented two tasks: image classification and object detection (including pre-processing, post-processing, documentation, tests, and examples).

1. Multiple model format support.

In MediaPipe, all the essential information is bundled into the models: input tensor shapes, output tensor shapes, classification label files, quantization parameters, and so on. So we have to parse the TfLite model to get this information.

I therefore designed a model resource abstraction layer, which abstracts over model information. Using it, the library can support other model formats simply by implementing the corresponding model parser.

The architecture is as follows:

---------------------------------------------------------------------------------------
|                                      user code                                      |
---------------------------------------------------------------------------------------
     ↓ (media input)                                              ↑ (result output)
=======================================================================================
     ↓                                                            ↑
--------------------           --------------------           --------------------
|  pre-processing  |           |   do inference   |           | post-processing  |
|   (to tensor)    |     ➔     |                  |     ➔     | (extract tensor) |
--------------------           --------------------           --------------------
        ↑ (get info)                                                ↑ (get info)
---------------------------------------------------------------------------------------
|                           model resource abstraction                                |
---------------------------------------------------------------------------------------
        ↑                              ↑                               ↑
------------------------      ------------------------      ------------------------
| TfLite model parser  |      | PyTorch model parser |      | etc.                 |
------------------------      ------------------------      ------------------------

When loading a model, the library detects the model format using the file-head magic and calls the corresponding model parser to collect all the model information we need.

For now, only the TfLite model parser is implemented.
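
For example, TfLite models are FlatBuffers whose file identifier "TFL3" sits at byte offset 4, so format detection can be a simple prefix check. A sketch; the crate's actual detection logic may differ:

```rust
enum ModelFormat {
    TfLite,
    Unknown,
}

/// Detect the model format from the file-head magic. TfLite models are
/// FlatBuffers carrying the file identifier "TFL3" at byte offset 4.
fn detect_format(data: &[u8]) -> ModelFormat {
    if data.len() >= 8 && &data[4..8] == b"TFL3" {
        ModelFormat::TfLite
    } else {
        ModelFormat::Unknown
    }
}
```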

2. Flexible input.

The library defines a Rust trait, ToTensor; any media type (image, audio, or text) that implements this trait can be used as input.

For images, the library implements ToTensor for the image crate's types, so those images can be used as input directly.
Further, the library can implement ToTensor for OpenCV-format images and videos, so they can be used as input directly as well.
If users have another format, such as ndarray, they can implement ToTensor manually and use it as input.
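
A hypothetical sketch of what such a trait and a user-side implementation could look like; the actual trait in the repo may have a different signature:

```rust
/// A hypothetical shape of the input abstraction: anything that can write
/// itself into the model's input tensor buffer can be used as input.
pub trait ToTensor {
    /// Fill `buf` with tensor data matching `shape` (e.g. NHWC for images).
    fn to_tensor(&self, shape: &[usize], buf: &mut [u8]) -> Result<(), String>;
}

// Example: users with their own media type implement the trait once and
// then pass the value straight to the task's detect/classify functions.
struct RawRgb {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl ToTensor for RawRgb {
    fn to_tensor(&self, shape: &[usize], buf: &mut [u8]) -> Result<(), String> {
        if shape != [1, self.height, self.width, 3] {
            return Err("shape mismatch".into());
        }
        buf.copy_from_slice(&self.data);
        Ok(())
    }
}
```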

Next week's plan

  1. TfLite models bundle their label information in zip format, so we must implement zip extraction to read the label information.
  2. Start work on the audio and text tasks.

yanghaku (Collaborator, Author) commented Apr 1, 2023

Weeks 2–3 Report

Progress

  1. Added zip extraction support to extract files from model data.
  2. Implemented the audio classification task (users can use an audio decoding library such as symphonia to read audio as input).
  3. Implemented the text classification task, with support for the regex tokenizer and the WordPiece tokenizer.
  4. Compiled the FFmpeg library for the wasm32-wasi target.

Next week's plan

  1. Try the FFmpeg library for video decoding and encoding.
  2. Try to implement video support for vision tasks.
  3. Identify the functions we need in OpenCV.

alabulei1 added the LFX Mentorship label on Apr 3, 2023
yanghaku (Collaborator, Author) commented Apr 8, 2023

Week 4 Progress Report

Summary of Progress

  1. Successfully integrated FFmpeg into the project: FFmpeg is exposed as a Rust feature and can process video and audio (a feature-gating sketch follows this list).
  2. Improved audio input: the library now has built-in support for Symphonia, FFmpeg, and raw audio data as input.
  3. Added video support for vision tasks by using FFmpeg as the video decoder and encoder and converting frames to tensors.
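
A sketch of how such feature gating typically looks in Rust; the feature name ffmpeg and the module contents are assumptions:

```rust
// Video support compiles in only when the (hypothetical) `ffmpeg` cargo
// feature is enabled, so default builds stay free of the FFmpeg dependency.
#[cfg(feature = "ffmpeg")]
mod ffmpeg_media {
    /// Decode a video file into raw frames (FFmpeg-backed, body elided).
    pub fn decode_frames(_path: &str) -> Vec<Vec<u8>> {
        unimplemented!()
    }
}
```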

Plan for Next Week

  1. Implement the tasks: "Hand landmarks detection" and "Gesture Recognition".

yanghaku (Collaborator, Author) commented Apr 16, 2023

Week 5 Progress Report

Summary of Progress

  1. Added Region of Interest support, which can select a region of interest and rotate the image by any angle.
  2. Implemented the hand landmarks detection task and its subtask, hand detection.
  3. Implemented the gesture recognition task.
  4. Added more options to the draw_detection and draw_landmarks utilities.

Plan for Next Week

  1. Implement the remaining tasks: "Image segmentation" and "Image embedding".
  2. Support more options in the draw utilities, such as text labels.

juntao (Member) commented Apr 16, 2023

Thanks! Do you have instructions for us to try these?

I think we will probably eventually need ffmpeg to run as a plugin (native host functions) as opposed to compiling to Wasm for performance reasons. But that's for later!

yanghaku (Collaborator, Author) commented

@juntao Yes, the repo has a README file that shows how to use the library, and it has examples and tests that can be run directly with cargo.

Regarding FFmpeg: its performance is slower than native. It is a temporary solution for video/audio processing until the FFmpeg/OpenCV plugins are available. For now, I use FFmpeg-wasm behind a Rust feature, which may be removed once the FFmpeg plugin can be used.

The repo also has scripts that show how to set up the environment and run the examples/tests with FFmpeg.

The documentation and examples are not complete at this time, but I will update them once all MediaPipe tasks have been implemented.

yanghaku (Collaborator, Author) commented

Hello @juntao, I have encountered an issue with the MediaPipe image segmentation task: one of the models uses a custom operator called Convolution2DTransposeBias.

I also noticed that the MediaPipe source code includes 10 custom operators that could potentially be used in the future.

To resolve this issue, one possible solution is to build a custom libtensorflowlite_c.so library that includes these operators.
Alternatively, there may be better solutions that address the problem.

Should I add this to the plan?

juntao (Member) commented Apr 18, 2023

I think you could try to build a new TF library that includes those operators (option 1) and see if it works. If it does, we should perhaps make the new TF library build part of our WASI-NN plugin. Thanks.

yanghaku (Collaborator, Author) commented

OK, I have added it to the milestones.

yanghaku (Collaborator, Author) commented

Week 6 Progress Report

Summary of Progress

  1. Implemented the remaining tasks: "Image segmentation" and "Image embedding". All tasks released in MediaPipe so far have now been implemented, and I will update the repo when new models are released in MediaPipe.
  2. Added more examples and updated the documentation.
  3. Identified the functions we need in OpenCV and FFmpeg (listed below).

Plan for Next Week

  1. Try to build a new TfLite library that includes MediaPipe custom operators

Host Functions we need in OpenCV

OpenCV is written in C++, so we also need the member functions of the classes below.

Classes

  • cv::Mat
  • cv::RotatedRect
  • cv::VideoCapture
  • cv::Rect
  • cv::Vec
  • cv::KeyPoint
  • cv::VideoWriter

Functions

  • cv::imread
  • cv::imdecode
  • cv::resize
  • cv::imencode
  • cv::putText
  • cv::circle
  • cv::line

Host Functions we need in FFmpeg

  • av_strerror
  • av_register_all
  • av_strdup
  • avformat_open_input
  • avformat_close_input
  • avformat_find_stream_info
  • av_find_best_stream
  • av_read_frame
  • av_frame_alloc
  • av_frame_free
  • av_init_packet
  • avcodec_parameters_alloc
  • avcodec_parameters_free
  • avcodec_parameters_to_context
  • avcodec_open2
  • avcodec_close
  • avcodec_send_packet
  • avcodec_receive_frame
  • avfilter_register_all
  • avfilter_graph_alloc
  • avfilter_graph_config
  • avfilter_graph_create_filter
  • avfilter_graph_get_filter
  • avfilter_graph_free
  • avfilter_inout_alloc
  • avfilter_inout_free
  • av_buffersink_get_frame
  • av_buffersrc_add_frame
  • av_buffersrc_close
  • sws_getContext
  • sws_scale
  • sws_freeContext

yanghaku (Collaborator, Author) commented

Week 7 Progress Report

Summary of Progress

  1. Added TfLite custom operator support and created a PR with the solution.
  2. Implemented the MediaPipe custom operators; all models can now run successfully in WasmEdge.

Plan for Next Week

  1. Face Detection is now available in MediaPipe, so I can implement that solution next week.
  2. Support all MediaPipe custom operators in https://github.com/yanghaku/mediapipe-custom-ops.
  3. Start exploring TfLite GPU support for the WasmEdge WASI-NN plugin.

yanghaku (Collaborator, Author) commented May 6, 2023

Week 8 Progress Report

Summary of Progress

  1. Implemented the MediaPipe Face Detection task.
  2. Supported all MediaPipe custom operators in the repo https://github.com/yanghaku/mediapipe-custom-ops; the pre-compiled library can be downloaded from https://github.com/yanghaku/mediapipe-custom-ops/releases.
  3. Added TfLite GPU support for the WasmEdge WASI-NN plugin; the source code is in the branch https://github.com/yanghaku/WasmEdge/tree/wasi_nn_tflite_gpu_test, and the pre-compiled TfLite GPU library can be downloaded from https://drive.google.com/file/d/1cyCAPrtWih6tuFMfaIB78_rTpYFQJfLV/view.

Plan for Next Week

  1. Add more options to the draw_detection and draw_landmarks utilities, such as colors and label text.
  2. Write more examples and documents, then publish the library to crates.io.

juntao (Member) commented May 6, 2023

That's great progress. Thank you so much for the update!

dannypsnl (Member) commented

Closing as completed; feel free to reopen.
