Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dora record #365

Merged
merged 9 commits into from
Oct 31, 2023
Merged

Dora record #365

merged 9 commits into from
Oct 31, 2023

Conversation

haixuanTao
Copy link
Collaborator

@haixuanTao haixuanTao commented Oct 19, 2023

This is a first experimental dora-record node that uses arrow file format to record data.

Advantage of this format

  • Being zero-copy and memory mapped. See: https://huggingface.co/docs/datasets/about_arrow
  • Can use plenty of tools such as hadoop, pandas, torch, R, matlab and so on, very quickly without preprocessing
  • Does not need to be serialized-deserialized which can be handy when partially reading data.
  • Should benefit any column-based computation of data such as training and so on.

Advantage of dora-record

The benefit of using dora-record is that it is going to save in column format all input received and they will be split across several files.
We can merge those files using the trace_id that is going to be shared when using opentelemetry features.

  • trace_id can also be queried from UI such as jaeger UI, influxDB and so on...
  • trace_id keep tracks of the logical flow of data, compared to timestamp based merging that might not reflect the actual logical flow of data.

We can also replay messages according to the timestamp of the message.

Limitation of the current implementation

  • data spread across multiple files. As the data is in columnar format, putting everything into one file with no prior information might make the data very sparse.
  • As the data is spread accross muiltiple files, a merging step is required that can be costly.
  • Bigger than alternative. Arrow format as it is not serialized, might be bigger than alternative. It might be interesting to also compress the data later on.

Currently not implemented in this version:

  • Variable file name and path ( even remote )
  • Async concurrently writen data, blocked by Async IPC Reader/Writer apache/arrow-rs#1207
  • Handle dora system event for replay. Not sure how to handle replaying dora event.
  • Additional metadata parameters logging.
  • Add batch recording of multiple row to avoid single row overhead
  • Possibly implement compression and also different files format

This should be reviewed and merged after #353

Base automatically changed from rust-typed-input to main October 26, 2023 13:00
Copy link
Collaborator

@phil-opp phil-opp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I haven't reviewed the write_event function in detail yet, but the rest looks good.

examples/python-operator-dataflow/dataflow.yml Outdated Show resolved Hide resolved
libraries/extensions/dora-record/src/main.rs Outdated Show resolved Hide resolved
libraries/extensions/dora-record/src/main.rs Outdated Show resolved Hide resolved
libraries/extensions/dora-record/src/main.rs Outdated Show resolved Hide resolved
@phil-opp
Copy link
Collaborator

The new dora record node never finishes as it keeps on listening for dora timer messages. Thus, a timeout error occurs on CI:

image

@haixuanTao
Copy link
Collaborator Author

The new dora record node never finishes as it keeps on listening for dora timer messages. Thus, a timeout error occurs on CI:

image

I'll remove logging the ticker for now in that case

@haixuanTao haixuanTao marked this pull request as ready for review October 27, 2023 10:32
@haixuanTao haixuanTao assigned phil-opp and unassigned phil-opp Oct 27, 2023
Copy link
Collaborator

@phil-opp phil-opp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

libraries/extensions/dora-record/src/main.rs Outdated Show resolved Hide resolved
@haixuanTao haixuanTao merged commit 95fcf46 into main Oct 31, 2023
16 of 17 checks passed
@haixuanTao haixuanTao deleted the dora-record branch October 31, 2023 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants