Reading subsets of fst-stored data.tables #44
Hi @statquant, thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted tables we can implement a binary search to retrieve a range of rows depending on some specified key range. A binary search is very fast, for example with only 30 seek operations on the
library(data.table)
dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]
   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6
This works for a complete table, but it won't work when the data is chunked into multiple subsets (in that case the
# Two chunks
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
# Calculate sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]
# Combine results and calculate mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]
[1] 10.35
So we calculated a
For your use-case I think that option 2 is probably enough?
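To make the binary-search idea above more concrete, here is a minimal sketch in base R of locating a key range in a sorted vector. The names `bound` and `key_range` are hypothetical helpers written for this illustration; they are not part of fst or data.table, and the sketch works on an in-memory vector rather than on seeks into a file. The number of probes grows with log2 of the number of keys, so roughly 30 probes per boundary are enough for on the order of 2^30 keys.

```r
# Binary search for a boundary in a sorted vector `keys`.
# upper = FALSE: index of the first element >= value
# upper = TRUE : index of the first element >  value
# Returns length(keys) + 1 when no such element exists.
bound <- function(keys, value, upper = FALSE) {
  lo <- 1L
  hi <- length(keys) + 1L
  probes <- 0L
  while (lo < hi) {
    mid <- lo + (hi - lo) %/% 2L
    probes <- probes + 1L          # one probe would be one seek on disk
    go_right <- if (upper) keys[mid] <= value else keys[mid] < value
    if (go_right) lo <- mid + 1L else hi <- mid
  }
  list(index = lo, probes = probes)
}

# Row range [first, last] of all keys k with from <= k <= to.
# If no key falls in the range, last is smaller than first.
key_range <- function(keys, from, to) {
  lower <- bound(keys, from)
  upper <- bound(keys, to, upper = TRUE)
  list(first = lower$index,
       last = upper$index - 1L,
       probes = lower$probes + upper$probes)
}

# Hypothetical sorted key column: 10, 20, ..., 1000
keys <- seq(10L, 1000L, by = 10L)
r <- key_range(keys, from = 250, to = 400)
```

With the range in hand, only rows `r$first` to `r$last` would need to be read, which is the point of option 2.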
@MarcusKlik thanks for the prompt reply, indeed 2) is enough for me. Honestly I think it would be for most people, as when you want to aggregate in some sense I'd guess you would still want the whole data to check what you've done, to change what you've done etc...
Nice, I will make sure that your feature is on the list for one of the next versions of
Something I would find incredibly useful is to be able to run select-like queries when reading from fst. Given that data.tables have keys, I was thinking that either this data.table feature could be leveraged or, by re-using some of its code, we might get this feature.
Just to be clear, say we have a data.table with
date,id,col1,col2,col3
saved as an fst file. I'd like to be able to do something like
read.fst(path=myPath, columns=myColumns, select="date==2017-01-01 & id %like% 'fst*'")
I realize that this almost makes fst a database, and I do not know if this is doable, but that's my 2 cents.
You might ask what this brings over loading the whole file and sub-selecting. I was thinking that for people like me, working remotely over a network, it could make sense.
Regards and thanks
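Until something like the `select` argument above exists, a rough approximation is possible in two passes. The sketch below assumes a recent fst release in which `write_fst()` and `read_fst()` are available and `read_fst()` accepts `columns`, `from` and `to`; the example data, the column names and the filter values are made up for illustration. It also assumes the file is sorted by `date`, so the matching rows form one contiguous block.

```r
library(fst)

# Hypothetical example data, sorted by date so that all rows for one
# date sit next to each other in the file.
df <- data.frame(
  date = rep(as.Date("2017-01-01") + 0:2, each = 4),
  id   = rep(c("fst1", "fst2", "abc", "xyz"), times = 3),
  col1 = 1:12,
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".fst")
write_fst(df, path)

# Pass 1: read only the key column, which transfers far less data
# than reading every column.
keys <- read_fst(path, columns = "date")
hits <- which(keys$date == as.Date("2017-01-02"))

# Pass 2: read just the matching row range (this assumes at least one
# hit), then apply the remaining filter on `id` in memory.
rows <- read_fst(path, columns = c("date", "id", "col1"),
                 from = min(hits), to = max(hits))
result <- rows[grepl("^fst", rows$id), ]
```

The first pass still reads the whole key column, so this is not the binary search discussed in the replies, but it does avoid pulling the other columns over the network for rows that are not needed. The `id` filter uses the regular expression `^fst` to match ids starting with "fst".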