
data.table loses columns (when more than 2M) #13

Closed
Fpadt opened this issue Jan 31, 2017 · 17 comments

@Fpadt

Fpadt commented Jan 31, 2017

Amazing package, excellent for building a cache mechanism. I have a matrix of time series with dates in the rows (700) and entities in the columns (2 million). When I coerce this matrix to a data.table and read/write it with fst, I lose many columns:

write.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 2191021 variables:
read.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 28333 variables:

@MarcusKlik
Collaborator

MarcusKlik commented Jan 31, 2017

Thanks! I'm very interested in the mechanism you are building where fst caches your data. I have a few more features in mind for the fst package, so please let me know if you have specific requirements! Currently the number of columns is stored as a short integer (in C++). That means that your 2191021 columns are actually stored as (2191021 AND 0xffff) = 28333 columns. I will make sure that you can store more columns in the next version.
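A quick way to reproduce that truncation in R, using base R's bitwAnd (an illustrative sketch, not part of the fst API):

bitwAnd(2191021L, 0xFFFFL)  # 28333: only the low 16 bits of the column count survive
2191021L %% 65536L          # same result expressed as a modulo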

@MarcusKlik
Collaborator

MarcusKlik commented Jan 31, 2017

Note however that fst uses column-based storage and compression, so having a lot of relatively small columns is more expensive than having a few relatively large columns. Also, the basic size of a compression block is 16 kB at the moment (I'm still experimenting with the optimal size), so columns with a length of 700 will probably have a slightly worse compression factor than longer columns.
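As a rough back-of-the-envelope illustration (assuming 8-byte doubles; not a statement about the exact on-disk layout), a 700-row column does not even fill a single compression block:

700 * 8    # 5600 bytes of raw data in one double column of length 700
16 * 1024  # 16384 bytes in one 16 kB compression block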

@MarcusKlik MarcusKlik added the bug label Feb 22, 2017
@MarcusKlik MarcusKlik added this to the Format complete milestone Apr 16, 2017
@Fpadt
Author

Fpadt commented Apr 17, 2017

Hi Marcus,

Apologies for my late reaction; I missed your comments. Thanks for getting back to me and confirming the issue I ran into (it is as designed).

I have this number of columns because it is an mts (matrix of time series), which I am creating from a data.table.
I am still playing around with the best and most performant way to do this. The rows contain days, and indeed these are generally about length 1000 (3 x 365 days). The columns contain the forecast entities, which can be the combination Article_Store, resulting in millions of columns.

I tried transposing the matrix, but that loses quite some time. Now that I have read your response, though, I believe I need to investigate this option further, as it could be the best way forward.

@Fpadt
Author

Fpadt commented Apr 17, 2017

For your information:

I am rather happy with your package and use it as follows.
This code simply checks whether an fst file exists that is at least as recent as the original file. If it does, it loads that into memory; if it does not, it takes the hit of loading the original but immediately saves an fst copy.

I just wrote a message to the ProjectTemplate maintainers suggesting they build your package into their caching mechanism.

library(fst)  # for read.fst / write.fst

load_dt <- function(pDATA_TABLE, pPATH = PATH_DATA) {

  # derive both file paths from the table name (the naming convention is assumed;
  # the original definition of file_RData and file_fst was not shown)
  file_RData <- file.path(pPATH, paste0(pDATA_TABLE, ".RData"))
  file_fst   <- file.path(pPATH, paste0(pDATA_TABLE, ".fst"))

  file_time_format <- "%Y-%m-%d %H:%M:%S"

  # if an fst version exists and is at least as recent as the .RData file, load it;
  # otherwise load the .RData file and write an fst copy for next time
  if (file.exists(file_fst) &&
      strptime(file.mtime(file_fst),   format = file_time_format) >=
      strptime(file.mtime(file_RData), format = file_time_format)) {
    assign(pDATA_TABLE,
           read.fst(path = file_fst, as.data.table = TRUE),
           envir = .GlobalEnv)
  } else {
    load(file_RData, envir = .GlobalEnv)
    write.fst(get(pDATA_TABLE), file_fst)
  }
}
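A call would then look something like this (the object name DT_SALES is hypothetical; it assumes DT_SALES.RData exists under PATH_DATA and contains an object of the same name):

load_dt("DT_SALES")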

@MarcusKlik
Collaborator

Nice! I think finding a way to transpose the data would be very useful in your case. Storing all the column names takes a bite out of the performance (at least for short columns), and for every column an in-file seek operation is required, which is cheap on a fast SSD but adds up for millions of short columns.

Don't the regular gather or melt methods work effectively with so many columns?
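For reference, the long layout could be produced with data.table's melt along these lines (a minimal sketch; the column names are illustrative, with one wide column per article_store):

library(data.table)

wide <- data.table(date  = as.IDate("2014-01-01") + 0:699,
                   A1_S1 = rnorm(700),
                   A1_S2 = rnorm(700))
long <- melt(wide, id.vars = "date",
             variable.name = "article_store", value.name = "sales")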

Thanks a lot for filing your issue on the ProjectTemplate repository!

@Fpadt
Author

Fpadt commented Apr 18, 2017

Hi Marcus,

I thought my whole machine had collapsed when I tried. I still need to figure out the best way to cope with my data set, performance, and memory. I need to give it some thought; potentially I can leverage data.table in a clever way, keeping the time series in the rows instead of the columns. I will let you know when I succeed.

@MarcusKlik
Collaborator

Hi, good luck, and it would be interesting to know how you convert your table. By the way, if you start out with a matrix, you basically already have a one-dimensional vector. Perhaps you can store your matrix in a single data.table column and then just cycle the days in a second column?

library(data.table)

mat <- matrix(1:10950000, nrow = 1095)  # 1095 rows (days) x 10000 columns (series)
dt <- data.table(Val = as.vector(mat))  # the matrix is just one long vector
dt[, Days := 1:365]                     # day index recycled along the vector

print(dt)

               Val Days
       1:        1    1
       2:        2    2
       3:        3    3
       4:        4    4
       5:        5    5
      ---              
10949996: 10949996  361
10949997: 10949997  362
10949998: 10949998  363
10949999: 10949999  364
10950000: 10950000  365

I don't know if this applies to your case, just a thought!

@Fpadt
Author

Fpadt commented Apr 18, 2017 via email

@kbroman

kbroman commented Apr 20, 2017

I also have a case where write.fst is not writing the full data frame.

The data frame of interest has 58 rows and 286,521 columns. The result of write.fst has just 24,377 columns.

dim(z);write.fst(z, "many_cols.fst")
 [1]    58 286521
dim(read.fst("many_cols.fst"))
 [1]    58 24377

If I reformulate as the transpose, though, it's working fine for me.

@MarcusKlik
Collaborator

MarcusKlik commented Apr 20, 2017

Hi @kbroman, thanks for reporting the issue. You are seeing the effect of the column count being (incorrectly) down-cast to a short int (C++), the same problem @Fpadt ran into. In your case, 286521 & 0xffff = 24377. Using the development version of fst should fix your problem:

devtools::install_github("fstpackage/fst", ref = "develop")

That only works when you re-write your data using the development version as well. The development version stores a maximum of INT_MAX (C++) columns, about 2 billion. From R 3.0.0 onward, R supports long vectors, so I could accommodate even more columns in the format (but I don't know whether there would be a use case for such large data sets in wide format).

@kbroman

kbroman commented Apr 20, 2017

Thanks @MarcusKlik; I should have thought to check the development version. That does indeed work for me, but wow things are a lot faster when I go with the transposed (lots of rows but not many columns) version of the data.

@MarcusKlik
Collaborator

MarcusKlik commented Apr 20, 2017

Nice! Yes, fst has a columnar format and each column has a certain offset in the file. Reading a column requires a seek operation to the correct offset. Usually that will be very fast and hardly noticeable (recent SSD drives can do at least a few hundred thousand (1e5) random seeks per second), but with so many columns, performance will definitely suffer. The format also contains meta-information on each column (type, attributes, version, etc.) and processing that requires (a little) extra CPU time as well.

So depending on your drive and the number of columns compared to the number of rows, you might notice a significant difference between wide and long format.
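A rough way to see the difference yourself (an illustrative sketch, not a rigorous benchmark; the sizes are arbitrary):

library(data.table)
library(fst)

n_days <- 58; n_series <- 20000
m    <- matrix(rnorm(n_days * n_series), nrow = n_days)
wide <- as.data.table(m)                               # many short columns
long <- data.table(series = rep(seq_len(n_series), each  = n_days),
                   day    = rep(seq_len(n_days),   times = n_series),
                   value  = as.vector(m))              # a few long columns

system.time(write.fst(wide, "wide.fst"))  # per-column metadata and offsets
system.time(write.fst(long, "long.fst"))  # mostly sequential, block-wise I/O
system.time(read.fst("wide.fst"))         # a seek (and an R vector allocation) per column
system.time(read.fst("long.fst"))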

Thanks @kbroman for your quick response and for testing the development version!

@MarcusKlik MarcusKlik self-assigned this Apr 20, 2017
@MarcusKlik
Collaborator

Additionally, while parsing the fst file, a single R-vector has to be created by the R framework for each column. These allocations (by R) have some overhead and unfortunately there is no way to avoid that (and they can't be done in parallel). As soon as fst crosses the boundary from unmanaged to managed code, things tend to slow down a bit :-)

@MarcusKlik
Collaborator

Hi @Fpadt and @kbroman, thanks again for filing your issues with the maximum number of columns. This issue is solved in the develop version of fst, and the maximum number of columns that can be stored is now INT_MAX (C++), so about 2 billion (as can be seen here).

It would be interesting to know whether your data sets were actually (single-type) matrices or (multiple-type) data.frames, because I could add a feature to fst to use its serialization mechanism to store matrices as well (which are just vectors underneath, of course), and that would be much faster than writing matrices as if they were data.frames!
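Until such a feature exists, one hypothetical workaround is to flatten the matrix into a single column and store its dimensions alongside (dimnames would need similar treatment); the helper names below are made up for illustration:

library(data.table)
library(fst)

# store a matrix as one long column plus its (recycled) dimensions
save_matrix_fst <- function(m, path) {
  write.fst(data.table(value = as.vector(m),
                       nrow  = nrow(m),
                       ncol  = ncol(m)), path)
}

# rebuild the matrix from the flattened representation
load_matrix_fst <- function(path) {
  d <- read.fst(path, as.data.table = TRUE)
  matrix(d$value, nrow = d$nrow[1], ncol = d$ncol[1])
}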

@kbroman

kbroman commented Jul 13, 2017

Super!

The application I had in mind had one column of character strings followed by oodles of numeric columns.

@Fpadt
Author

Fpadt commented Jul 14, 2017

Hi Marcus,

I have a data.table with 12 M records and only 4 columns: 2 character, 1 IDate, and 1 integer. It loads via fst in <10 seconds (great). These are actually intermittent sales (article, product, date, quantity).

I need to forecast these, so in the end I need regular time series. To make them regular I dcast the data.table into a matrix of time series: dates as rownames, article_store as columns, and the sales (integer) in the cells. Indeed, this is actually one very long vector of NAs and integers; still, the rownames and column names should be preserved. Building this matrix from the data.table takes 80 seconds, and it would be awesome if I could just save and read it with fst, as past sales never change; I only have to append rows (and columns for new article_stores).

In summary:

  1. data.table (long)
  2. Data.table (wide)
  3. Matrix (mts, multiple time series)
  4. Result: forecast

I'm already quite happy with the performance, but I will test whether I can improve it further with your new functionality.
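The long-to-wide step described above might look roughly like this (a minimal sketch; the column names article, store, date, and quantity are assumptions):

library(data.table)

# sales: long data.table with columns article, store, date (IDate), quantity
wide <- dcast(sales, date ~ article + store, value.var = "quantity", sep = "_")

mts <- as.matrix(wide[, !"date"])         # one column per article_store, NA where no sale
rownames(mts) <- as.character(wide$date)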

@MarcusKlik
Collaborator

Thanks! So you both have large parts (for @kbroman almost all) of your data set that could actually be stored as one contiguous block on disk, and much speed could be gained there.

Perhaps I should detect adjacent columns with identical types and serialize them as a matrix internally. That requires some more thought. Thanks @kbroman and @Fpadt for your feedback!
