data.table loses columns (when more than 2M) #13
Comments
Thanks! I'm very interested in the mechanism that you are building where you have …
Note however that …
Hi Marcus, apologies for my late reaction; I missed your comments. Thanks for coming back to me and confirming the issue I had (it is as designed). I have this number of columns because it is an mts (matrix of time series), which I create from a data.table. I tried to transpose the matrix, but that costs quite some time. Now that I have read your response, I believe I need to investigate this option further, as it could be the best way forward.
For your information: I am rather happy with your package and use it as follows. I just wrote a message to the people behind ProjectTemplate suggesting they build your package into their caching mechanism.

`load_dt <- …`
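The snippet above is truncated in the original comment; only the name `load_dt` survives. A minimal sketch of what such an fst-backed cache loader could look like (the function body, the `build` callback and the `cache_dir` argument are assumptions, not the original code):

```r
library(data.table)
library(fst)

# Hypothetical cache loader: return the cached table if an fst file exists,
# otherwise build the table, write it to the cache, and return it.
load_dt <- function(name, build, cache_dir = "cache") {
  path <- file.path(cache_dir, paste0(name, ".fst"))
  if (file.exists(path)) {
    return(as.data.table(read.fst(path)))
  }
  dt <- build()
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  write.fst(dt, path)
  dt
}

# usage: sales <- load_dt("sales", function() expensive_query())
```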
Nice! I think finding a way to transpose the data would be very useful for your case. Storing all the column names takes a bite out of the performance (at least for short columns). And for every column an in-file seek operation is required, which is cheap on a fast SSD drive but adds up for millions of short columns. The regular … Thanks a lot for filing your issue on the ProjectTemplate repository!
Hi Marcus, I thought my whole machine had collapsed when I tried. I still need to figure out the best way to cope with my data set, performance and memory. I need to give it some thought; potentially I can leverage data.table in a clever way, having the time series in the rows instead of the columns. I will let you know when I succeed.
Hi, good luck and it would be interesting to know how you convert your table. By the way, if you start out with a matrix, you basically already have a one-dimensional vector. Perhaps you can store your matrix in a single data.table column and then just cycle the days in a second column?

```r
library(data.table)

mat <- matrix(1:10950000, nrow = 1095)  # 1095 rows x 10000 columns
dt <- data.table(Val = as.vector(mat))
dt[, Days := 1:365]

print(dt)

#                Val Days
#        1:        1    1
#        2:        2    2
#        3:        3    3
#        4:        4    4
#        5:        5    5
#       ---
# 10949996: 10949996  361
# 10949997: 10949997  362
# 10949998: 10949998  363
# 10949999: 10949999  364
# 10950000: 10950000  365
```

I don't know if this applies to your case, just a thought!
Hi Mark,
Thanks for the idea. Actually I am coming from a data.table with 11M records and just a few columns, and I store this using fst. This data set contains irregular time series which I need to make regular by creating an entry for each day; this makes it explode. The end result is a matrix with 1.2B cells, and it is created in about 70 seconds.
I will check at which stage I can store it as a data.table, preferably as late as possible in the process, as long as loading this data set is quicker than generating it.
Kind regards,
Floris Padt
I also have a case where this occurs: the data frame of interest has 58 rows and 286,521 columns. The result of …
If I reformulate as the transpose, though, it's working fine for me.
Hi @kbroman, thanks for reporting the issue. You are seeing the effects of (incorrectly) down-casting the number of columns to a short int (C++), just like @Fpadt did. In your case 286521 & 0xffff = 24377. Using the development version of fst:

`devtools::install_github("fstpackage/fst", ref = "develop")`

That only works when you re-write your data using the development version as well. The development version stores a maximum of INT_MAX (C++) columns (about 2 billion). From R version 3.0.0 onward, R supports long vectors, so I could accommodate even more columns in the format (but I don't know if there would be a use case for large data sets in wide format).
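The down-cast described above can be reproduced in base R (a hypothetical reconstruction of the effect, not fst's actual C++ code):

```r
# When a column count is stored in a 16-bit field, only the low 16 bits
# survive. This reproduces the column counts both reporters observed.
truncate_16bit <- function(n) bitwAnd(n, 0xffff)

truncate_16bit(286521)   # @kbroman's case:  24377
truncate_16bit(2191021)  # @Fpadt's case:    28333
```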
Thanks @MarcusKlik; I should have thought to check the development version. That does indeed work for me, but wow things are a lot faster when I go with the transposed (lots of rows but not many columns) version of the data. |
Nice! Yes, … So depending on your drive and the number of columns compared to the number of rows, you might notice a significant difference between wide and long format. Thanks @kbroman for your quick response and for testing the development version!
Additionally, while parsing the …
Hi @Fpadt and @kbroman, thanks again for filing your issues with the maximum number of columns. This issue is solved in the develop version of fst. It would be interesting to know whether your data sets were actually (single-type) matrices or (multiple-type) …
Super! The application I had in mind had one column of character strings followed by oodles of numeric columns. |
Hi Marcus, I have a data.table with 12M records and only 4 columns: 2 character, 1 IDate and an integer. This is loaded via fst in under 10 seconds (great). These are actually intermittent sales (article, product, date, quantity). I need to forecast these, so in the end I need regular time series. To make them regular I dcast the data.table to get a matrix of time series: date as row names, article_store as columns, and the sales (integer) in the cells. And indeed this is actually one very long vector of NA and integer values. Still, the row names and column names should be preserved. Making this matrix from the data.table takes 80 seconds, and it would be awesome if I could just save and read it with fst, as the sales will never change; I only have to append rows (and columns for new article_stores). In summary: I am already quite happy with the performance, but I will test whether I can improve even further with your new functionality.
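The reshaping step described above could be sketched as follows (a minimal sketch on a toy table; the column names `article`, `store`, `date`, `qty` and the `article_store` key construction are assumptions based on the description, not the original code):

```r
library(data.table)

# Hypothetical miniature of the sales table: irregular time series per article/store.
sales <- data.table(
  article = c("A1", "A1", "A2"),
  store   = c("S1", "S1", "S2"),
  date    = as.IDate(c("2017-01-01", "2017-01-03", "2017-01-02")),
  qty     = c(5L, 2L, 7L)
)

# dcast to wide format: one row per date, one column per article_store key.
# Combinations without a sale become NA; for truly regular daily series, a
# complete calendar could be cross-joined in first (e.g. with CJ()).
wide <- dcast(
  sales[, .(article_store = paste(article, store, sep = "_"), date, qty)],
  date ~ article_store,
  value.var = "qty"
)
print(wide)
```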
Thanks! So you both have large (in @kbroman's case almost all) parts of your data set that could actually be stored as one contiguous block on disk, and much speed could be gained there. Perhaps I should detect adjacent columns with identical types and serialize them as a matrix internally. That requires some more thought. Thanks @kbroman and @Fpadt for your feedback!
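Detecting such runs of adjacent identical-type columns could look something like this (a hypothetical sketch, not fst's actual implementation; the helper name `type_blocks` is made up):

```r
# Run-length encode the column types of a data frame, so that adjacent
# columns with identical types can be grouped into one contiguous block.
type_blocks <- function(df) {
  types <- vapply(df, function(col) class(col)[1], character(1))
  rle(unname(types))
}

# One character column followed by "oodles" of numeric columns
# collapses into just two blocks:
df <- data.frame(name = "a", x = 1, y = 2, z = 3, stringsAsFactors = FALSE)
type_blocks(df)
# lengths: 1 3, values: "character" "numeric"
```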
Amazing package, excellent for building a cache mechanism. I have a matrix of time series with dates in the rows (700) and entities in the columns (2 million). When I coerce this matrix to a data.table and write/read it with fst, I lose many columns:

write.fst: Classes 'data.table' and 'data.frame': 559 obs. of 2191021 variables:
read.fst: Classes 'data.table' and 'data.frame': 559 obs. of 28333 variables: