Fill a data.table range with specific rows from read.fst #29

Open
MarcusKlik opened this issue Feb 23, 2017 · 0 comments

With this feature you can populate, say, rows 1001:2000 of a 1e6-row data.table with 1000 rows read by read.fst, entirely in memory. This is very useful for combining data from multiple (fst) sources into a single result table without the overhead of copies. For example, when performing a merge sort on a set of data files, you need to:
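To make the requested semantics concrete, here is a minimal, language-agnostic sketch in Python. It is not the fst or data.table API: `fill_rows` is a hypothetical helper, the table is modelled as a dict of preallocated column lists, and row offsets are 0-based (so R rows 1001:2000 would correspond to `first_row=1000` with a 1000-row chunk).

```python
def fill_rows(table, first_row, chunk):
    """Hypothetical helper: overwrite a row range of a preallocated table in place.

    table: dict mapping column name -> preallocated list (all the same length)
    first_row: 0-based index of the first row to overwrite
    chunk: dict mapping column name -> list of new values (all the same length)
    """
    n_rows = len(next(iter(chunk.values())))
    for name, values in chunk.items():
        column = table[name]
        if len(values) != n_rows:
            raise ValueError("all chunk columns must have the same length")
        if first_row < 0 or first_row + n_rows > len(column):
            raise IndexError("chunk does not fit in the preallocated table")
        # Equal-length slice assignment mutates the existing list,
        # so the column object is reused rather than replaced.
        column[first_row:first_row + n_rows] = values


# Preallocate a 10-row table, then fill rows 3 and 4 from a 2-row chunk.
table = {"id": [0] * 10, "x": [0.0] * 10}
fill_rows(table, 3, {"id": [1, 2], "x": [0.5, 0.6]})
```

The point of the sketch is only that the result table is allocated once and each chunk lands directly in its target range.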

  1. read the first x rows from all files
  2. sort the resulting table
  3. write some rows to disk
  4. read the next x rows from the file with the smallest first chunk
  5. sort the resulting table
  6. go to 3

This can be performed efficiently in R by using data.table's fast sorting and populating the result table in memory. With such an algorithm operating on a collection of fst files, we basically have a method of sorting arbitrarily large fst files without running out of memory (and it can be done with multiple threads!).
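The merge steps listed above can be sketched as follows. This is a language-agnostic illustration in Python under several assumptions, not an fst implementation: each input is assumed to be individually sorted, in-memory lists stand in for files, `read_rows` is a hypothetical stand-in for a ranged read, and the output list stands in for writing to disk. Step 4 is interpreted here as refilling from the file whose loaded chunk ends at the smallest value, because every buffered row up to that value is then safe to write out.

```python
import bisect


def read_rows(source, start, count):
    """Hypothetical stand-in for a ranged file read of `count` rows."""
    return source[start:start + count]


def chunked_merge(sources, chunk_size):
    """Merge individually sorted sources while buffering at most
    len(sources) * chunk_size rows at any time."""
    positions = [0] * len(sources)       # next unread row per source
    last_loaded = [None] * len(sources)  # largest value loaded so far per source
    buffer = []                          # rows read but not yet written
    output = []                          # stands in for the file on disk

    def load(i):
        rows = read_rows(sources[i], positions[i], chunk_size)
        positions[i] += len(rows)
        if rows:
            last_loaded[i] = rows[-1]
        buffer.extend(rows)

    for i in range(len(sources)):        # step 1: first chunk of every file
        load(i)

    while True:
        buffer.sort()                    # steps 2 and 5
        pending = [i for i in range(len(sources))
                   if positions[i] < len(sources[i])]
        if not pending:                  # nothing left to read: flush and stop
            output.extend(buffer)
            return output
        # Unread rows of any pending source are >= its last loaded value,
        # so rows up to the smallest such value can be written safely.
        nxt = min(pending, key=lambda i: last_loaded[i])
        cut = bisect.bisect_right(buffer, last_loaded[nxt])
        output.extend(buffer[:cut])      # step 3
        del buffer[:cut]
        load(nxt)                        # step 4, then loop (step 6)


a = [1, 4, 7, 10, 13]
b = [2, 3, 8, 20]
c = [5, 6, 9]
merged = chunked_merge([a, b, c], chunk_size=2)
```

Since a source is only refilled once all of its buffered rows have been written, the buffer never holds more than one chunk per source, which is what keeps memory use bounded regardless of file size.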

@MarcusKlik MarcusKlik added this to the Advanced operations milestone Apr 16, 2017
@MarcusKlik MarcusKlik removed this from the Advanced operations milestone Sep 7, 2020