Looking forward to full description of fst format #3

Open
xiaodaigh opened this issue Jan 24, 2018 · 11 comments

@xiaodaigh

I know it's going to be a bit of work, but a full description of the fst format will help build connectors to it from Julia, Python, and any other programming language. The potential is huge for such an awesome on-disk data manipulation framework!

I will try to help when I know enough C++. I secretly hope that once the format is well known, there can be independent implementations in Julia and Rust (at the risk of running out of sync with C++), as native implementations would be fun. But calling into C++ is also a good option.

@MarcusKlik
Collaborator

Hi @xiaodaigh, thanks! Yes, I definitely need to spend time on documenting the format and, perhaps more importantly, the fstlib API, so new connectors can be built!

It's not complicated, but the API will grow as computational features are added (which will run in parallel with the file IO). Providing for methods that can only be run on the master thread (such as R methods) will also have to be reflected in the API. Perhaps Rust with its better concurrency could provide a faster connector for fstlib, that would be very interesting!

Just a question: why would you prefer a native implementation in Rust or Julia over calling the fstlib library from a Rust or Julia wrapper? Julia especially will probably take a performance hit if used for the low-level operations that fstlib requires. Or are you referring to a native binding instead of a binding through the R-Julia interface package?

Is there an example package in Julia which could be used to model a native binding, a package using a simple C++ library for example? The binding could be made to a C or C++ API to fstlib using packages like Clang.jl or Cpp.jl, Cxx or CxxWrap perhaps? Starting a toy package early would certainly help a lot to create a uniform API that's suitable for different languages!

@xiaodaigh
Author

Anyway, the first thing I would do is to use Cxx.jl to call into fstlib. But I might experiment with a pure Julia implementation at some point given the fst format is stable.

Julia has some low-level control as well but not good multi-threading at the moment. I think fstlib is good for scripting languages like Julia, R, and Python so it would be nice to actually write it in a scripting language as well. Given the format is stable, a pure Julia implementation will allow Julia programmers to contribute, not just those with C++ knowledge. But it's overall better to have all resources contribute to one library, in this case a C++ one in fstlib; I wish I knew enough C++ to contribute. Learning...

Once the multi-threading story is better in Julia and there is better interop between R-Julia and Python-Julia, then you may be tempted to switch to Julia as well as the syntax is nice and simple, and it can be as fast as C/C++ in many cases.

@MarcusKlik
Collaborator

Hi @xiaodaigh, thanks, I think using Cxx.jl would be a nice solution where you only have a single code-base. It would be hard to maintain different versioning and new features across two distinct libraries in different languages (and it would cost a lot of time, currently the most valuable resource for fstlib development :-))

I would be very interested in trying to set up a fst package in Julia, please let me know if and how I can help with that!

@davidanthoff

In general, is there a chance that fstlib might expose a pure C API, not a C++? That would make integration in other languages a lot easier.

E.g. for Julia, Cxx.jl is great, but at this point installation is so tricky that it is really not an option for a widely used package. On the other hand, if fstlib just exposed a C API, one could integrate it super easily into Julia.

@MarcusKlik
Collaborator

MarcusKlik commented Feb 17, 2018

Hi @davidanthoff, thanks for your question. Basically, the fst package in R also has a C-only interface when looked at from the R side (that's all R understands), so that's similar to your request. In R, the Rcpp package is used for convenience and one of the things it does is generate a C interface that can be used by R. Those C wrappers then call the underlying C++ code from fstlib. Would it be possible to have a setup like that for Julia?
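As a rough illustration of that kind of setup, a thin `extern "C"` layer can hide a C++ class behind an opaque handle. Everything in this sketch is hypothetical: `ToyStore` merely stands in for C++ code inside a library, and the `toy_store_*` function names are invented for the example, not taken from fstlib or Rcpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a C++ class inside the library (hypothetical, not fstlib code).
class ToyStore {
public:
    void append(std::int32_t value) { values_.push_back(value); }
    std::size_t size() const { return values_.size(); }
    std::int64_t sum() const {
        std::int64_t total = 0;
        for (std::int32_t v : values_) total += v;
        return total;
    }
private:
    std::vector<std::int32_t> values_;
};

// C-compatible surface: only plain C types and an opaque pointer cross the
// boundary, so any language with a C FFI can call these functions.
extern "C" {

typedef struct toy_store toy_store;  // opaque handle, never defined for callers

toy_store* toy_store_create() {
    return reinterpret_cast<toy_store*>(new ToyStore());
}

void toy_store_destroy(toy_store* handle) {
    delete reinterpret_cast<ToyStore*>(handle);
}

void toy_store_append(toy_store* handle, std::int32_t value) {
    reinterpret_cast<ToyStore*>(handle)->append(value);
}

std::size_t toy_store_size(const toy_store* handle) {
    return reinterpret_cast<const ToyStore*>(handle)->size();
}

std::int64_t toy_store_sum(const toy_store* handle) {
    return reinterpret_cast<const ToyStore*>(handle)->sum();
}

}  // extern "C"
```

A production wrapper would also need to catch C++ exceptions before they reach the C boundary; that is left out here to keep the sketch short.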

For a full implementation of fstlib in Julia, you would need:

  • DLL with a C API for write_fst, read_fst, threads_fst, compress_fst, decompress_fst and metadata_fst (or similar names) to be called from Julia
  • Implementation of fstlib's column types (defined in ifstcolumn)
  • Implementation of fstlib's IFstTable, which will be a wrapper for a table in Julia.
  • Implementation of fstlib's IColumnFactory and ITypeFactory for generating new column vectors or data types natively in Julia.

These are all abstract classes which would need an implementation based on the Julia API. So you should be able to have access to the Julia API from the DLL.

The reason for that is that fstlib is a zero-copy library. So any data structure (such as columns) needed to hold data should be created directly in Julia and not copied from an existing memory buffer. That reduces memory requirements and increases the speed.
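A minimal sketch of the zero-copy idea, assuming a heavily simplified interface: the host implements a factory that allocates the column storage, and the library writes values straight into that storage. `IIntColumnFactory`, `HostIntColumnFactory` and `read_int_column` are hypothetical names chosen for illustration; the real `IColumnFactory` and column interfaces in fstlib are different and richer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Abstract interface the library codes against (hypothetical, simplified).
class IIntColumnFactory {
public:
    virtual ~IIntColumnFactory() = default;
    // Allocate storage for `length` values in the host and return a pointer to it.
    virtual std::int32_t* allocate(std::size_t length) = 0;
};

// Host-side implementation. Here the "host" is plain C++ owning a std::vector;
// a Julia or R binding would allocate a native array through its own C API.
class HostIntColumnFactory : public IIntColumnFactory {
public:
    std::int32_t* allocate(std::size_t length) override {
        column.resize(length);
        return column.data();
    }
    std::vector<std::int32_t> column;  // memory owned by the host
};

// Library-side reader: moves bytes from a "file" buffer directly into host
// memory. No intermediate column is built and copied into the host afterwards.
void read_int_column(const unsigned char* file_bytes, std::size_t n_values,
                     IIntColumnFactory& factory) {
    std::int32_t* dest = factory.allocate(n_values);
    if (n_values == 0) return;  // nothing to transfer
    std::memcpy(dest, file_bytes, n_values * sizeof(std::int32_t));
}
```

The design point is that the library never owns the result: it only asks the host for a destination and fills it, which is why the host API has to be reachable from inside the compiled library.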

Perhaps when you have a basic setup, I could assist you in implementing the abstract classes for Julia. It would be very interesting to see an implementation of fstlib in other languages than C++ and R!

@xiaodaigh
Author

Basically, for a little bit of context, we have shown via benchmarking that fst has the fastest read/write speed in the Julia/R/Python-verse. Parquet and R's serialization are the only other major ones we haven't tested.

So I would be extremely keen to be able to use fst in Julia.

@davidanthoff

To be fair, you didn’t measure Feather perf with the R or Python packages, those might be faster than the Julia implementation (or not, who knows).

@MarcusKlik
Collaborator

Hi @xiaodaigh and @davidanthoff, that's great to hear. It would be nice to compare the various serialization options with a wide range of parameters. For example, for fstlib, the speed depends on a lot of factors:

  • The column type being serialized. Logicals and integers are very fast but character columns are much slower in general.
  • The compressibility of the (column) data. Highly compressible data can be compressed faster and leaves fewer bytes to serialize (increasing speed).
  • Compiler flags. Using -O2 or -O3 flags (for GCC and Clang) matters a lot for the speeds measured.
  • Number of threads obviously but also the type of CPU used (I have found Xeon CPUs to process data faster than i5s for example, which makes higher compression settings relatively faster).
  • Disk speed and IOPS. Some serializers are optimized for disks with low IOPS and fstlib is optimized for disks with high IOPS.
  • Memory bandwidth. For some operations the memory bandwidth is the limiting factor. The choice of the system used for benchmarking sets the memory bandwidth. Different serializers have different dependencies on that limit.

Testing many systems is very labor intensive, but it would be very interesting to set up a benchmark that uses generated samples with various characteristics:

  • various types
  • various compressibility levels (e.g. factors can have 2 levels or 2000 and a small range of integers is easier to compress than larger ranges)
  • various sizes (fstlib shines more for large datasets, csv writers mostly scale linearly with size).

That way we could really learn about the strong and weak points of different serializers and how they relate to each other. Are your benchmarks published somewhere (or do you have plans for that)?
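As a possible starting point for such generated samples, here is a small sketch. It assumes a fixed seed for reproducibility and uses the number of distinct values as a crude stand-in for compressibility; `make_int_column` and `count_distinct` are hypothetical helper names, not part of any existing benchmark suite.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Generate `n` integers drawn uniformly from [0, n_levels); n_levels must be >= 1.
// Fewer levels means more repetition, which compressors generally handle better.
std::vector<std::int32_t> make_int_column(std::size_t n, std::int32_t n_levels,
                                          std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> dist(0, n_levels - 1);
    std::vector<std::int32_t> column(n);
    for (std::int32_t& v : column) v = dist(rng);
    return column;
}

// Count distinct values: a rough proxy for how compressible the column is.
std::size_t count_distinct(const std::vector<std::int32_t>& column) {
    std::unordered_set<std::int32_t> seen(column.begin(), column.end());
    return seen.size();
}
```

Calling `make_int_column(n, 2, seed)` and `make_int_column(n, 2000, seed)` for a range of `n` would give the "2 levels or 2000" cases at various sizes; each serializer could then be timed on the same columns.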

thanks!

@xiaodaigh
Author

xiaodaigh commented Feb 18, 2018

Obviously that is going to be a lot of work. I think ultimately we can set up a website where people can submit benchmarks from their system by running some Julia and/or R code. For now I am slowly adding benchmarking code to the DataBench.jl repo.

@MarcusKlik
Collaborator

MarcusKlik commented Feb 18, 2018

Hi @davidanthoff, on your question about a Julia implementation. Perhaps it would be possible to create a package using small steps:

  • Milestone 1: a Julia package using a compiled library with a C API that returns a hello world string.

  • Milestone 2: a Julia package using a compiled library that returns meta-data about a table provided to the C API.

After milestone 2, we know that we can call the Julia API from the compiled library, which means we can implement the abstract classes from fstlib.

  • Milestone 3: the table wrapper and a single column type are implemented (for example integer columns). A 1 column table can be serialized from Julia to disk using fstlib (and read from R for example). Initial speed measurements can be taken for comparison.

  • Milestone 4: implement the other types one by one. Think about how to map the special types like Date or nanotime to the Julia world.

Would that be doable? If any special code is necessary to accommodate the Julia API, I can provide that from the fstlib library (for example, some API calls might only be allowed from the master thread like in R).

@MarcusKlik MarcusKlik self-assigned this Feb 18, 2018
@xiaodaigh
Author

Milestone 1 can be easily achieved; see https://github.com/JuliaInterop/CxxWrap.jl

I don't know anything about C++ and that's the issue. I want to help here, and I traced the code to _fst_fstretrieve for reading a fst file, but I can't figure out how to go any further.

What would help is for someone familiar with C++ to do this, but if it's me, I need some specific directions on how to compile fstlib into a .so file and which C++ functions I can call in this manner.

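#include <string>
#include "jlcxx/jlcxx.hpp"

// Plain C++ function to expose to Julia
std::string greet()
{
  return "hello, world";
}

JLCXX_MODULE define_julia_module(jlcxx::Module& mod)
{
  mod.method("greet", &greet);
}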
