LFX Workspace: A Rust crate for the YOLO family of object detection models #2768

Open · 9 of 12 tasks
Charles-Schleich opened this issue Sep 2, 2023 · 11 comments
Labels: LFX Mentorship (Tasks for LFX Mentorship participants)


Charles-Schleich commented Sep 2, 2023

Motivation

YOLO (You Only Look Once) is a family of high-performance models for general object detection in images and videos. Many tutorials exist online for using YOLO models for object detection in Python.
However, using Python as the runtime language for inference has drawbacks in a production setting.
A typical Python setup involves adding packages such as OpenCV, TensorFlow, or PyTorch to the environment, and potentially CUDA drivers if the target execution hardware is an Nvidia GPU.
One might then attempt to Dockerize the production project, additionally requiring the Nvidia Container Toolkit.
For embedded devices and microcontrollers, this setup is infeasible.

With its WASI-NN plugins, WasmEdge is well suited to running AI applications, and can offer a Rust + Wasm alternative to Python + Docker setups.

Details

An application doing inference must pre-process input data (images, audio, video) into TFLite/PyTorch tensor formats and post-process the model outputs for further use.
While many functions exist in opencvmini and media-pipe, some Python-equivalent functions are still missing to fully support the pre- and post-processing required by the YOLO family of models.

In addition to adding these functions, a high-level crate purely for object detection using YOLO would lower the barrier to entry for doing object detection in Rust.
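
For reference, typical YOLO image pre-processing in pure Rust might look like the following sketch, built on the image crate (the 640x640 input size, HWC layout, and [0, 1] scaling are assumptions based on common YOLO setups, not a fixed API):

use image::imageops::FilterType;

// Decode raw image bytes, resize to the model input size, and scale
// pixel values to [0, 1] as an f32 buffer (HWC layout).
fn preprocess(bytes: &[u8]) -> Result<Vec<f32>, image::ImageError> {
    let img = image::load_from_memory(bytes)?
        .resize_exact(640, 640, FilterType::Triangle)
        .to_rgb8();
    Ok(img.into_raw().iter().map(|&p| p as f32 / 255.0).collect())
}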

The aim of this issue is to:

  • Track and discuss the addition of functions to the opencvmini and media-pipe crates for the pre- and post-processing needed by YOLO
  • Design and create a high-level YOLO SDK in Rust for WasmEdge that can run inference on image and video data

Rust SDK Design

The design of the YOLO SDK will be similar to the media-pipe crate.
We may extend the crate's functionality further, depending on the extra time available.

Milestones

  • Evaluate the minimum set of functions needed to support YOLO in a Python application
  • Add Rust-equivalent functions to the respective libraries (opencvmini) (2-3 weeks)
  • Design and implement a Rust YOLO SDK, making use of the existing ecosystem (2-4 weeks)
  • Support processing of video files, potentially making use of FFmpeg (2-4 weeks)
    • Reading in video files, splitting them into frames, and saving encoding information (1-2 weeks)
    • Re-combining processed frames and re-encoding video files back into their original format (1-2 weeks)
  • Write a set of examples for SDK usage (1 week)
  • Write the documentation
  • Polish project / API spec (variable)
  • Performance improvements
  • Transfer repo ownership to WasmEdge
  • Publish the library to crates.io (variable)

Appendix

Python OpenCV example for object detection in YOLO


Charles-Schleich commented Sep 4, 2023

I have forked second-state/WasmEdge-WASINN-examples here.
I will be using the fork to add an example of using YOLO with WASI-NN, as well as a 'live progress' repo while I work through the pre- and post-processing functions needed in opencvmini to support the YOLO crate.


juntao commented Sep 4, 2023

Excellent. Thanks!


Charles-Schleich commented Sep 11, 2023

Week 1 Update.

Added support for 2 new functions in the WasmEdge opencvmini plugin:
resize and normalize.
Related PR: #2795

Added support for the two functions in the second-state Rust opencvmini SDK,
as well as example usage of the exposed functions:
second-state/opencvmini#10

Began creating an example of using YOLO with WASI-NN:
https://github.com/Charles-Schleich/WasmEdge-WASINN-examples

Proposed crate dependencies for the YOLO crate:
opencvmini for pre- and post-processing.
WASI-NN for forward passes through the network.

Design thoughts:
The YOLO SDK should use a builder pattern with a reusable client,
something that follows the pattern below.

let yolo = YoloModel::new()
    .network(net_bytes, model_type) // model weights plus a ModelType enum variant
    .class_names(class_names)       // Vec<String> of class labels
    .build()
    .unwrap(); // with a useful error message related to the failure

let classes = yolo.infer_image(image_bytes);
let classes2 = yolo.infer_image(image2_bytes);
let video_classes = yolo.infer_video(video_bytes);
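
A minimal sketch of how such a builder could be structured (all names here besides YoloModel are placeholders, not a committed API):

pub enum ModelType { /* YOLO variants, e.g. a v8 detection model */ }

pub struct YoloModel { /* loaded network, class names, ... */ }

#[derive(Default)]
pub struct YoloModelBuilder {
    network: Option<(Vec<u8>, ModelType)>,
    class_names: Option<Vec<String>>,
}

impl YoloModel {
    pub fn new() -> YoloModelBuilder {
        YoloModelBuilder::default()
    }
}

impl YoloModelBuilder {
    pub fn network(mut self, bytes: Vec<u8>, ty: ModelType) -> Self {
        self.network = Some((bytes, ty));
        self
    }
    pub fn class_names(mut self, names: Vec<String>) -> Self {
        self.class_names = Some(names);
        self
    }
    // Validates the configuration, returning a reusable client or a
    // descriptive error naming the missing field.
    pub fn build(self) -> Result<YoloModel, String> {
        let _network = self.network.ok_or("network bytes and model type are required")?;
        let _class_names = self.class_names.ok_or("class names are required")?;
        Ok(YoloModel { /* ... */ })
    }
}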

Plan for the following week:

  1. Continue with the YOLO-in-WASI-NN example, doing all processing in Rust.

  2. Add the functions needed for post-processing to the WasmEdge opencvmini plugin:
    cv2.putText
    cv2.dnn.NMSBoxes

  3. Add an enum definition for the possible dtype values used by the normalize function in the opencvmini Rust SDK, to guide correct library usage (a sketch of such an enum follows below).

  4. Create the YOLO-SDK crate, to start bringing the pieces together.
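
For item 3, the enum could mirror OpenCV's depth constants; a hypothetical sketch (names are assumptions, not the SDK's final definition):

// Possible dtype values for normalize, mirroring OpenCV depth constants.
pub enum MatDepth {
    U8,  // CV_8U: 8-bit unsigned
    S8,  // CV_8S: 8-bit signed
    U16, // CV_16U: 16-bit unsigned
    S16, // CV_16S: 16-bit signed
    S32, // CV_32S: 32-bit signed
    F32, // CV_32F: 32-bit float
    F64, // CV_64F: 64-bit float
}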

hydai added the LFX Mentorship label Sep 12, 2023
Charles-Schleich commented

Update: Weeks 2 + 3

  1. Updated the normalize function in the opencvmini plugin, and updated the opencvmini Rust crate to match the definition
  2. Exposed the noArray function in the opencvmini plugin, and updated the opencvmini Rust crate to match the definition
  3. Implemented an example of using a YOLO model in the WASI-NN examples repo.
    Image pre-processing and network output processing were both done in pure Rust.
    It may also make sense to write a version using the WASI-NN plugin for WasmEdge, as opposed to the Rust crate compiled to WASM.
  4. Added GitHub CI for the YOLO example in the WASI-NN examples repo
  5. Began work on the yolo-rs crate for WasmEdge

Plan for the next 3-6 weeks:

  • Focus on YOLOv8 model support.
  • Implement object detection for images in yolo-rs utilizing WasmEdge plugins
    • Implement a pure-Rust feature of the library while waiting for changes to be merged into the WasmEdge + opencvmini crates
    • Implement plugin-based processing for the library, which will be the default:
      • Pre-processing using the opencvmini plugin
      • Inference using the WASI-NN plugin
      • Post-processing using the opencvmini plugin
  • Implement video detection in yolo-rs utilizing WasmEdge plugins
    • Investigate and use FFmpeg to parse out video frames.
  • Implement processing support for YOLOv8-seg and YOLOv8-pose models

Charles-Schleich commented

Week 4 Update.

  • Implemented object detection for images in pure Rust, using no plugins so far.
  • Wrote an initial implementation of the intersection-over-union (IoU) algorithm using ndarray, with matrix multiplication over all objects.
  • Wrote an initial implementation of the non-maximum suppression (NMS) algorithm in Rust (a plain-Rust scalar sketch of both follows below).
  • Fixed bounding box scaling for object detection in the WASI-NN examples: [Example] Create pytorch-yolo-image example second-state/WasmEdge-WASINN-examples#35
  • Implemented the GitHub workflow for the PyTorch WASI-NN example above
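
For reference, a plain-Rust scalar sketch of IoU and greedy NMS (the actual ndarray matrix formulation differs; the box layout and names here are illustrative):

#[derive(Clone, Copy)]
struct BBox { x1: f32, y1: f32, x2: f32, y2: f32, score: f32 }

// Intersection over union of two axis-aligned boxes.
fn iou(a: &BBox, b: &BBox) -> f32 {
    let iw = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let ih = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = iw * ih;
    let union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

// Greedy non-maximum suppression: keep the highest-scoring box and
// drop any remaining box that overlaps a kept box by more than iou_thresh.
fn nms(mut boxes: Vec<BBox>, iou_thresh: f32) -> Vec<BBox> {
    boxes.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<BBox> = Vec::new();
    for b in boxes {
        if kept.iter().all(|k| iou(k, &b) <= iou_thresh) {
            kept.push(b);
        }
    }
    kept
}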

The plan this week is to use FFmpeg to parse out video frames, run detection on them individually, and then reassemble the video.
Looking at using https://github.com/yanghaku/ffmpeg-wasm32-wasi


Charles-Schleich commented Oct 22, 2023

Week 5 + 6 update:

Spent most of this time evaluating video processing options and attempting to build proof-of-concept applications to process video for yolo-rs.
First plan of action:
Get all video processing code to compile to WASM and do all processing inside the WASM executable.

  • Evaluated video-rs: relies on FFmpeg under the hood, and does not compile to the wasm32-wasi target easily.
  • Evaluated meh/rust-ffmpeg: also does not compile to the wasm32-wasi target.
  • Evaluated zmwangx/rust-ffmpeg (aka ffmpeg-next): also does not compile to the wasm32-wasi target.
  • Evaluated xiph/rav1e: does compile to the wasm32-wasi target; it is a codec with support for AV1-encoded videos but not x264, x265, or VP9, among others, so it is not viable for general processing.
  • Evaluated yanghaku/ffmpeg-wasm32-wasi: does compile to the wasm32-wasi target, but has a somewhat more involved build process and requires large amounts of memory to compile.

After discussions with the community and @juntao, we decided that, due to the memory limitations of a wasm32 application, the largest video that could be processed would need to fit into an instance memory of <4 GB. It may therefore be better to develop a plugin that the YOLO crate can call out to in order to process video; this path gives us native performance, as well as more options for enabling hardware-specific optimizations.

Second plan: develop a WasmEdge plugin for the YOLO crate purely for video processing using FFmpeg.

This video processing plugin exists at Charles-Schleich/wasmedge-yolo-rs-video-processing-plugin.
It is very much a work in progress, and will be merged into the main Charles-Schleich/yolo-rs crate once I have a working proof of concept.

I plan on pursuing video processing inside pure WASM after getting a fully working plugin for the yolo-rs crate at Charles-Schleich/yolo-rs.

Charles-Schleich commented

Week 7 Update:

Worked entirely this week on the video processing plugin for YOLO:
Charles-Schleich/wasmedge-yolo-rs-video-processing-plugin
This plugin currently has the following interface:

mod plugin {
    type FramesCount = i32;
    type HostResultType = i32; // 0 corresponds to okay; values > 0 map to variants of an error enum
    #[link(wasm_import_module = "yolo-video-proc")]
    extern "C" {
        pub fn load_video_to_host_memory(
            str_ptr: i32,
            str_len: i32,
            str_capacity: i32,
            width_ptr: *mut u32,
            height_ptr: *mut u32,
        ) -> FramesCount;

        pub fn get_frame(
            frame_index: i32,
            image_buf_ptr: i32,
            image_buf_len: i32,
            image_buf_capacity: i32,
        ) -> i32;

        pub fn write_frame(frame_index: i32, image_buf_ptr: i32, image_buf_len: i32) -> i32;

        pub fn assemble_output_frames_to_video(
            str_ptr: i32,
            str_len: i32,
            str_capacity: i32,
        ) -> FramesCount;
    }
}

An explanation of each function:

load_video_to_host_memory - Loads a video file from disk by filename, splits it into individual frames, stores those frames in an input (pre-inference) frame buffer on the host heap, and returns the number of frames in the video.
get_frame - Accepts pointers to an image buffer in WASM memory; on the host side, converts the requested frame to an image_rs::image type and writes it into WASM memory using the given pointers.
write_frame - Writes a frame from WASM memory back into host memory, storing it in an output frame buffer.
assemble_output_frames_to_video - Takes all of the frames in the output frame buffer and encodes them into a video.
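
A hedged guest-side sketch of driving this interface end to end (buffer sizing, the packed-RGB24 pixel format, and the error handling are assumptions, not the plugin's final contract):

fn process_video(path: &str) -> Result<(), i32> {
    let path = path.to_owned();
    let (mut width, mut height) = (0u32, 0u32);

    // Decode the video on the host; frames stay in host memory.
    let frames = unsafe {
        plugin::load_video_to_host_memory(
            path.as_ptr() as i32,
            path.len() as i32,
            path.capacity() as i32,
            &mut width,
            &mut height,
        )
    };
    if frames < 0 {
        return Err(frames);
    }

    // Assuming packed RGB24: one frame is width * height * 3 bytes.
    let buf = vec![0u8; (width * height * 3) as usize];
    for i in 0..frames {
        unsafe {
            plugin::get_frame(i, buf.as_ptr() as i32, buf.len() as i32, buf.capacity() as i32);
        }
        // ... run YOLO inference on buf and draw bounding boxes here ...
        unsafe {
            plugin::write_frame(i, buf.as_ptr() as i32, buf.len() as i32);
        }
    }

    let out = String::from("out.mp4");
    unsafe {
        plugin::assemble_output_frames_to_video(out.as_ptr() as i32, out.len() as i32, out.capacity() as i32);
    }
    Ok(())
}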

I am currently working with videos that have just a single video stream.
I will add support for re-encoding an audio stream and multiple video streams after I have a working proof of concept.

End of Week 7 progress report.

  • Parsed frames from a video file into the input buffer
  • Added some simple processing to the images: a small red square at the video origin (0,0), top left
  • Saved individual frames to disk to validate

Interesting findings: I spent some time creating really low-resolution videos, i.e. 2px by 2px, 5px by 5px, and 10px by 10px, so I could validate the format of the bytes received from decoding the stream.
H.264 encoding adds padding to the frames when the resolution is below the encoder's block size, which made validating bytes surprisingly frustrating: for a 2 by 2 resolution video one would expect 2 × 2 × 3 = 12 bytes per frame (w × h × colour_channels, assuming 8-bit colour depth), but the output was much larger than that.
However, once using a video with a higher resolution, e.g. 720p or 1080p, the number of bytes received after decoding is as expected (w × h × colour_channels).


Charles-Schleich commented Nov 13, 2023

Week 8 + 9 Update:

These two weeks have been focused entirely on understanding how video encoders work, how FFmpeg handles re-encoding raw frames into a video, and implementing that in Rust in the form of a plugin.
I have been using meh/rust-ffmpeg to handle both decoding and encoding video, as it seems to be in active development.

I have a repository, yolo-video-processing-plugin, in active development, with a simple example involving a 3-step pipeline:
Decode: YUV420 video -> Vec<RGB frame>
Mutate: Vec<RGB frame> -> Vec<mutated RGB frame>
Encode: Vec<mutated RGB frame> -> YUV420 video

I have been following https://github.com/leandromoreira/ffmpeg-libav-tutorial.
I am able to assemble the frames and re-encode them into a video; however, my current issue concerns encoder time-bases and dropped frames. It is something that I want to fix before moving on.

My plan between now and the 22nd involves:

  • Finish re-encoding video (1-2 days)
  • Code cleanup of the main yolo-rs repo and the video processing plugin (1 day)
  • Merge the video processing plugin into the main repo (0.1 days)
  • Write usage examples (0.5 days)
  • Write a Dockerfile for ease of the build process (1 day)
  • Improve performance (the remaining time of the LFX Mentorship, and beyond)


Charles-Schleich commented Nov 22, 2023

Week 10 + 11 Update:

Week 10: mainly developing the video processing plugin and battling the FFmpeg Rust bindings to get re-encoding of raw frames working reliably.
The issue that kept coming up: after handing frames with evenly spaced presentation timestamps off to the encoder, the encoder would return packets with decode and presentation timestamps that could not be encoded into the output.
One way to mitigate this was to set all frames to be I-frames (intra frames), so that the encoder writes each frame directly without any inter-frame prediction; the alternative is dropped frames during encoding, and a video that does not contain the full original information.
Due to time constraints this trade-off was chosen, as integration with the main code base, testing, and code cleanup were also necessary. A sketch of the mitigation follows below.
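
The mitigation, as a hedged sketch assuming an ffmpeg-next-style API (meh/rust-ffmpeg exposes a similar interface; the function and variable names here are illustrative):

use ffmpeg_next as ffmpeg;
use ffmpeg::util::picture;

// Force intra-coding for every frame before handing it to the encoder:
// larger output, but no frames are dropped during encoding.
fn send_intra_frame(
    encoder: &mut ffmpeg::encoder::Video,
    frame: &mut ffmpeg::frame::Video,
    pts: i64,
) -> Result<(), ffmpeg::Error> {
    frame.set_pts(Some(pts));         // evenly spaced presentation timestamps
    frame.set_kind(picture::Type::I); // mark as an intra (I) frame
    encoder.send_frame(frame)
}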

Week 11: added the video processing plugin code to the main repository,
wrote examples of image inference and video inference,
and cleaned up and commented the code.

Link to the final repo:
https://github.com/Charles-Schleich/yolo-rs

Plan for week 12: 20 Nov -> 30 Nov

  • Write a Docker image for quick prototyping
  • Clean up code further (better docs; remove lingering unwraps and expects)
  • Add a visual diagram showcasing the project layout
  • Optimize code in an effort to improve runtime performance


ehxdie commented May 10, 2024

Hey, are you still working on this?

Charles-Schleich commented

Hello @ehxdie,

Yes, I am still working to improve the developer experience and fix a few issues with the project.
Available at https://github.com/Charles-Schleich/yolo-rs
