Currently planned milestones for fst #48

MarcusKlik · 2017-04-08T20:36:32Z

A list of currently planned milestones for fst with some key features:

format-complete: the fst format allows for:
a) row binding of data frames
b) column binding of data frames
c) persisting (custom) column attributes
d) persisting and indexing table keys
e) a range of compression algorithms
f) storing hashes for each data block (see feature request "data-integrity check by adding hashvalue", #49)
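
To make item f) concrete, here is a minimal sketch of a per-block integrity check. It is not fst's implementation: the block size, the function names, and the choice of SHA-256 (picked only because it ships with Python's standard library) are all assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes, not fst's actual value


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a byte buffer into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def hash_blocks(blocks: list[bytes]) -> list[str]:
    """Compute one digest per block; these would be stored next to the data."""
    return [hashlib.sha256(block).hexdigest() for block in blocks]


def find_corrupt_blocks(blocks: list[bytes], stored: list[str]) -> list[int]:
    """Return the indices of blocks whose digest no longer matches."""
    return [
        i
        for i, (block, digest) in enumerate(zip(blocks, stored))
        if hashlib.sha256(block).hexdigest() != digest
    ]
```

Hashing per block rather than per file means a reader that only touches a few blocks can verify just those blocks, and a mismatch points at the damaged region instead of the whole file.
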
stand-alone C++ core library:
a) the core code for fst is available as a separate C++ library
interface: fst streaming object can be used like a data.frame:
a) (simple) on-the-fly sub-setting (requires far less memory)
b) selection of columns
c) append columns
d) append rows
e) rbind several fst files or rbind fst files with in-memory data sets
multi-threading:
a) multi-threaded compression/decompression and multi-threaded IO using RcppParallel or tinythread++
b) benchmark suite tracking performance for each column type. Should be run after each commit to monitor performance after future changes and further enhancements.
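
The idea behind item a) can be sketched as follows: if data is stored in independent blocks, each block can be compressed or decompressed on its own thread. This is only an illustration under assumptions: it uses Python's `concurrent.futures` and `zlib` as stand-ins, not RcppParallel, tinythread++, or the compressors fst actually uses, and the function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(blocks, workers=4, level=6):
    """Compress each block independently on a thread pool.

    pool.map returns results in input order, so block i of the output
    corresponds to block i of the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda block: zlib.compress(block, level), blocks))


def decompress_blocks(compressed, workers=4):
    """Decompress each block independently, again preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, compressed))
```

Because no block depends on another, the same layout also allows reading a subset of blocks without decompressing the rest.
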
added functionality:
a) lapply-like functionality for creating a fst file from a list of inputs (CSV files, custom methods, etc.)
b) directly convert csv to fst without memory overhead
interoperability:
a) import data from Apache Parquet files
b) types used in fst C++ core library are close to Apache Arrow
c) Python interface?
advanced operations:
a) on-the-fly sequential and parallel grouping using custom methods
b) binary search on table key columns (extremely fast sub-setting of a key range)
c) adding columns using a merge operation (with a fst file acting as the right-join data set)
d) fst file can be sorted into a new fst file using merge-sort algorithm
e) multiple fst-files represent a single data set
f) operations can be performed on the set of fst files in parallel
g) set of fst-files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
h) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk and 2) combines the results from 1) to calculate the mean.
i) fill a data set range with specific rows from a fst file, overwriting data in-memory (see "Fill a data.table range with specific rows from read.fst", #29).
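
The chunked sum-and-count scheme from item h) can be sketched as below. Per-chunk sums and counts combine into an overall mean, so each chunk can be processed without holding the full column in memory. The function names are hypothetical and the chunks are plain Python lists rather than data read from fst files.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: reduce one chunk to a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) results."""
    return (a[0] + b[0], a[1] + b[1])


def chunked_mean(chunks):
    """Combine the per-chunk partial results into the overall mean."""
    total, count = reduce(combine, map(map_chunk, chunks), (0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of zero rows")
    return total / count


print(chunked_mean([[1, 2, 3], [4, 5], [6]]))  # prints 3.5
```

The map step touches each chunk independently, so it could run in parallel across chunks or files; only the small (sum, count) pairs need to be brought together in the reduce step.
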
performance enhancements:
a) encryption
b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
d) test Brotli compression on character columns (Brotli ships with a pre-built dictionary)
e) high compression mode for slow IO (network) speeds (see "Very slow writing to network drive when using compression (Windows 7)", #23).

This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)

MarcusKlik · 2017-12-22T23:14:02Z

Superseded by #117

MarcusKlik mentioned this issue Apr 8, 2017

Bug Report: In R: fst crash in both saving and reading very large files. (500M+ rows and 50+ columns, 100+GB) #46

Closed

MarcusKlik added the enhancement label Apr 8, 2017

This was referenced Apr 9, 2017

Append column to fst file #39

Open

Benchmark question #50

Open

This was referenced Apr 27, 2017

Make data.table an optional dependency #54

Closed

Quick way to get file properties with read.fst #58

Closed

MarcusKlik mentioned this issue Jul 8, 2017

Include fst package and missing values matthieugomez/benchmark-stata-r#5

Closed

MarcusKlik self-assigned this Dec 10, 2017

MarcusKlik closed this as completed Dec 22, 2017
