Reading subsets of fst-stored data.tables #44

statquant · 2017-03-30T11:27:41Z

Something I would find incredibly useful is to be able to run select-like queries when reading from fst. Given that data.tables have keys I was thinking that either this data.table feature could be leverage, or re-using some of its code we might get this feature.

Just to be clear say we have a data.table with date,id,col1,col2,col3 saved as a fst file.
I'd like to be able to do something like read.fst(path=myPath, columns=myColumns, select="date==2017-01-01 & id %like% 'fst*'")

I realize that this almost make fst a database, and I do not know if this is doable, but that's my 2 cents.
You might ask what is this bringing over loading the whole file and sub-selecting, I was thinking that for people like me remotely working and using networks that could make sense.

Regards and thanks

The text was updated successfully, but these errors were encountered:

MarcusKlik · 2017-03-30T12:43:25Z

Hi @statquant , thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted table's, we can implement a binary search to retrieve a range of rows depending of some specified key range. A binary search is very fast, for example with only 30 seek operations on the fst file, you can scan a billion records. For selections which are not related to a stored key, we could use the selection mechanism from data.table, but on chunks of data instead of the whole table. The problem however is that you can't use aggregate statements for selection in that case, for example:

dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]

   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6

This works for a complete table, but it won't work when the data is chunked into multiple subsets (in that case the mean is not calculated correctly). So that is a problem. Possible solutions might be:

A selection requires the specification of a grouping variable. So the selection is done per group. If the groups are small enough, there will be no problems for large data sets.
No aggregate selections are allowed, only simple operators. The advantage of this solution is that we can program these simple operators in C++, increasing performance.
A more elaborate framework where we allow custom methods as operators on the data. These methods should have a map reduce-like character, for example for the above example:

# Two chunks
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))

# Calculate sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]

# Combine results and calculate mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]

[1] 10.35

So we calculated a mean by using sum and counting per chunk. The fst package could provide methods like fst.sum, fst.mean etc. to perform these operations.

For your use-case I think that option 2 is probably enough?

statquant · 2017-03-30T12:49:11Z

@MarcusKlik thanks for the prompt reply, indeed 2) is enough for me.

Honestly I think it would be for most people, as when you want to aggregate in some sense I'd guess you would still want the whole data to check what you've done, to change what you've done etc...

MarcusKlik · 2017-03-30T13:20:03Z

Nice, I will make sure that your feature is on the list for one of the next versions of fst.

MarcusKlik added the feature request label Mar 30, 2017

MarcusKlik self-assigned this Mar 30, 2017

phillc73 mentioned this issue Apr 9, 2017

Append column to fst file #39

Open

MarcusKlik added this to the Interface milestone Apr 16, 2017

MarcusKlik removed this from the Interface milestone Sep 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading subsets of fst-stored data.tables #44

Reading subsets of fst-stored data.tables #44

statquant commented Mar 30, 2017

MarcusKlik commented Mar 30, 2017 •

edited

statquant commented Mar 30, 2017

MarcusKlik commented Mar 30, 2017

Reading subsets of fst-stored data.tables #44

Reading subsets of fst-stored data.tables #44

Comments

statquant commented Mar 30, 2017

MarcusKlik commented Mar 30, 2017 • edited

statquant commented Mar 30, 2017

MarcusKlik commented Mar 30, 2017

MarcusKlik commented Mar 30, 2017 •

edited