A list of currently planned milestones for fst with some key features:
format-complete: the fst format allows for:
a) row binding of data frames
b) column binding of data frames
c) persisting (custom) column attributes
d) persisting and indexing table keys
e) a range of compression algorithms
f) storing hashes for each data block (Feature request: data-integrity check by adding hashvalue #49)
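To make item f) concrete, here is a minimal sketch of a per-block integrity check. It is not fst's implementation: the block size, the choice of SHA-256, and all function names are assumptions made for illustration only.

```python
import hashlib

# Hypothetical block size in bytes; not taken from the fst format.
BLOCK_SIZE = 4096


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def hash_blocks(blocks):
    """Compute one digest per block (SHA-256 is an assumed choice)."""
    return [hashlib.sha256(block).digest() for block in blocks]


def find_corrupt_blocks(blocks, stored_hashes):
    """Return the indices of blocks whose digest no longer matches."""
    return [
        index
        for index, (block, stored) in enumerate(zip(blocks, stored_hashes))
        if hashlib.sha256(block).digest() != stored
    ]


# Usage: hash at write time, verify at read time.
data = bytes(range(256)) * 40          # 10240 bytes, i.e. 3 blocks
blocks = split_blocks(data)
hashes = hash_blocks(blocks)

damaged = list(blocks)
# Flip the bits of the first byte of block 1 to simulate corruption.
damaged[1] = bytes([damaged[1][0] ^ 0xFF]) + damaged[1][1:]
corrupt = find_corrupt_blocks(damaged, hashes)
```

Hashing per block rather than per file means a reader that only touches a few blocks only has to verify those blocks, and a mismatch points at the damaged region instead of the whole file.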
stand-alone C++ core library
a) the core code for fst is available as a separate C++ library
interface: fst streaming object can be used like a data.frame:
a) (simple) on-the-fly sub-setting (requires far less memory)
b) selection of columns
c) append columns
d) append rows
e) rbind several fst files or rbind fst files with in-memory data sets
multi-threading:
a) multi-threaded compression/decompression and multi-threaded IO using RcppParallel or tinythread++
b) benchmark suite tracking performance for each column type. It should be run after each commit to catch performance regressions from future changes and enhancements.
added functionality:
a) lapply-like functionality that creates a fst file from a list of inputs (CSVs, custom methods, etc.)
b) directly convert csv to fst without memory overhead
interoperability:
a) import data from Apache Parquet files
b) types used in fst C++ core library are close to Apache Arrow
c) Python interface?
advanced operations:
a) on-the-fly sequential and parallel grouping using custom methods
b) binary search on table key columns (extremely fast sub-setting of a key range)
c) adding columns using a merge operation (with a fst file acting as the right-join data set)
d) fst file can be sorted into a new fst file using merge-sort algorithm
e) multiple fst-files represent a single data set
f) operations can be performed on the set of fst files in parallel
g) set of fst-files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
h) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk and 2) combines the results from 1) to calculate the mean.
i) fill a data set range with specific rows from a fst file, overwriting data in-memory (Fill a data.table range with specific rows from read.fst #29).
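The map-reduce idea in item h) can be sketched with a chunked mean. This is only an illustration under stated assumptions: the chunks are plain in-memory lists rather than ranges read from a fst file, and `map_chunk`, `reduce_partials`, `chunked` and `chunked_mean` are hypothetical names, not part of any fst API.

```python
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: reduce one chunk to a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def reduce_partials(partials):
    """Reduce step: combine the per-chunk pairs into a single mean."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data set")
    return total / count


def chunked(values, chunk_size):
    """Yield consecutive slices of at most chunk_size elements."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_mean(values, chunk_size=4, workers=2):
    """Run the map step over all chunks in a thread pool, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunked(values, chunk_size)))
    return reduce_partials(partials)


result = chunked_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], chunk_size=3)
```

A mean is used here because it can be rebuilt exactly from per-chunk sums and counts. Statistics such as an exact median need more information per chunk than those two numbers, so they would require a different map step.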
performance enhancements:
a) encryption
b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
d) test using Brotli compression for character columns (Brotli ships with a pre-built dictionary)
e) high compression mode for slow IO (network) speeds (Very slow writing to network drive when using compression (Windows 7) #23).
This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)