Currently planned milestones for fst #48

Closed
MarcusKlik opened this issue Apr 8, 2017 · 1 comment

MarcusKlik commented Apr 8, 2017

A list of currently planned milestones for fst with some key features:

  • format-complete: the fst format allows for:
    a) row binding of data frames
    b) column binding of data frames
    c) persisting (custom) column attributes
    d) persisting and indexing table keys
    e) a range of compression algorithms
    f) storing hashes for each data block (Feature request: data-integrity check by adding hashvalue #49)
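
To make item f) concrete, here is a minimal sketch of a per-block integrity check. It is not fst's implementation: the block size, the choice of SHA-256, and the function names `block_hashes` and `corrupted_blocks` are all assumptions made for illustration.

```python
import hashlib

# Hypothetical block size for illustration; not taken from the fst format.
BLOCK_SIZE = 16 * 1024


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def corrupted_blocks(
    data: bytes, expected: list[str], block_size: int = BLOCK_SIZE
) -> list[int]:
    """Return the indices of blocks whose hash differs from the stored one."""
    actual = block_hashes(data, block_size)
    if len(actual) != len(expected):
        raise ValueError("block count does not match the stored hash list")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

Storing one hash per block rather than one per file means a reader that only touches a few blocks can verify just those blocks, and a failed check points at the damaged region.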

  • stand-alone C++ core library
    a) the core code for fst is available as a separate C++ library

  • interface: fst streaming object can be used like a data.frame:
    a) (simple) on-the-fly sub-setting (requires far less memory)
    b) selection of columns
    c) append columns
    d) append rows
    e) rbind several fst files or rbind fst files with in-memory data sets

  • multi-threading:
    a) multi-threaded compression/decompression and multi-threaded IO using RcppParallel or tinythread++
    b) benchmark suite tracking performance for each column type. It should run after each commit to monitor the performance impact of later changes and enhancements.
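
The idea behind item a) can be sketched as block-wise compression on a thread pool. This is only an illustration: it uses Python's `zlib` and `concurrent.futures` as stand-ins for the compressors and the RcppParallel/tinythread++ threading mentioned above, and the block size, worker count, and function names are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(
    data: bytes, block_size: int = 64 * 1024, workers: int = 4
) -> list[bytes]:
    """Split data into fixed-size blocks and compress them on a thread pool."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the compressed blocks
        # can be written to disk sequentially afterwards.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(blocks: list[bytes], workers: int = 4) -> bytes:
    """Decompress the blocks in parallel and reassemble the original bytes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Because each block is compressed independently, blocks can also be decompressed independently, which is what makes reading only part of a column possible.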

  • added functionality:
    a) lapply-like functionality that creates a fst file from a list of inputs (CSV files, custom methods, etc.)
    b) directly convert CSV to fst without memory overhead

  • interoperability:
    a) import data from Apache Parquet files
    b) types used in fst C++ core library are close to Apache Arrow
    c) Python interface?

  • advanced operations:
    a) on the fly sequential and parallel grouping using custom methods
    b) binary search on table key columns (extremely fast sub-setting of a key range)
    c) adding columns using a merge operation (with a fst file acting as the right-join data set)
    d) fst file can be sorted into a new fst file using merge-sort algorithm
    e) multiple fst-files represent a single data set
    f) operations can be performed on the set of fst files in parallel
    g) set of fst-files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
    h) user-defined map-reduce operations that can be run on the fst file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk, then 2) combines the results from 1) to calculate the mean.
    i) fill a data set range with specific rows from a fst file, overwriting data in-memory (Fill a data.table range with specific rows from read.fst #29).
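
The chunk-wise sum-and-count reduction from item h) can be sketched as follows. Per-chunk sums and counts combine into the mean of a column; a median would need a different partial result, such as a histogram or the sorted values. The function names `map_chunk` and `reduce_partials` are hypothetical, and the in-memory lists stand in for chunks read from one or more fst files.

```python
from typing import Iterable, Sequence


def map_chunk(chunk: Sequence[float]) -> tuple[float, int]:
    """Map step: reduce one chunk to its partial result (sum, count)."""
    return (sum(chunk), len(chunk))


def reduce_partials(partials: Iterable[tuple[float, int]]) -> float:
    """Reduce step: combine per-chunk (sum, count) pairs into the mean."""
    total = 0.0
    count = 0
    for chunk_sum, chunk_count in partials:
        total += chunk_sum
        count += chunk_count
    if count == 0:
        raise ValueError("cannot take the mean of zero rows")
    return total / count


# Each chunk could be processed by a different thread or come from a
# different file; only the small (sum, count) pairs need to be collected.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = reduce_partials(map_chunk(c) for c in chunks)
```

The map step never needs more than one chunk in memory, which is the property that makes this kind of operation attractive for on-disk data sets.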

  • performance enhancements:
    a) encryption
    b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
    c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
    d) test Brotli compression on character columns (Brotli ships with a pre-built dictionary)
    e) high compression mode for slow IO (network) speeds (Very slow writing to network drive when using compression (Windows 7) #23).

This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
