Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading subsets of fst-stored data.tables #44

Open
statquant opened this issue Mar 30, 2017 · 3 comments
Open

Reading subsets of fst-stored data.tables #44

statquant opened this issue Mar 30, 2017 · 3 comments
Assignees

Comments

@statquant
Copy link

Something I would find incredibly useful is to be able to run select-like queries when reading from fst. Given that data.tables have keys I was thinking that either this data.table feature could be leverage, or re-using some of its code we might get this feature.

Just to be clear say we have a data.table with date,id,col1,col2,col3 saved as a fst file.
I'd like to be able to do something like read.fst(path=myPath, columns=myColumns, select="date==2017-01-01 & id %like% 'fst*'")

I realize that this almost make fst a database, and I do not know if this is doable, but that's my 2 cents.
You might ask what is this bringing over loading the whole file and sub-selecting, I was thinking that for people like me remotely working and using networks that could make sense.

Regards and thanks

@MarcusKlik
Copy link
Collaborator

MarcusKlik commented Mar 30, 2017

Hi @statquant , thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted table's, we can implement a binary search to retrieve a range of rows depending of some specified key range. A binary search is very fast, for example with only 30 seek operations on the fst file, you can scan a billion records. For selections which are not related to a stored key, we could use the selection mechanism from data.table, but on chunks of data instead of the whole table. The problem however is that you can't use aggregate statements for selection in that case, for example:

dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]

   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6

This works for a complete table, but it won't work when the data is chunked into multiple subsets (in that case the mean is not calculated correctly). So that is a problem. Possible solutions might be:

  • A selection requires the specification of a grouping variable. So the selection is done per group. If the groups are small enough, there will be no problems for large data sets.
  • No aggregate selections are allowed, only simple operators. The advantage of this solution is that we can program these simple operators in C++, increasing performance.
  • A more elaborate framework where we allow custom methods as operators on the data. These methods should have a map reduce-like character, for example for the above example:
# Two chunks
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))

# Calculate sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]

# Combine results and calculate mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]

[1] 10.35

So we calculated a mean by using sum and counting per chunk. The fst package could provide methods like fst.sum, fst.mean etc. to perform these operations.

For your use-case I think that option 2 is probably enough?

@statquant
Copy link
Author

@MarcusKlik thanks for the prompt reply, indeed 2) is enough for me.

Honestly I think it would be for most people, as when you want to aggregate in some sense I'd guess you would still want the whole data to check what you've done, to change what you've done etc...

@MarcusKlik
Copy link
Collaborator

Nice, I will make sure that your feature is on the list for one of the next versions of fst.

@MarcusKlik MarcusKlik added this to the Interface milestone Apr 16, 2017
@MarcusKlik MarcusKlik removed this from the Interface milestone Sep 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants