
data.table loses columns (when more than 2M) #13

Closed
Fpadt opened this issue Jan 31, 2017 · 17 comments

@Fpadt

Fpadt commented Jan 31, 2017

Amazing package, excellent for building a cache mechanism. I have a matrix of time series with dates in the rows (700) and entities in the columns (2 million). When I coerce this matrix to a data.table and read/write it with fst, I lose many columns:

write.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 2191021 variables:
read.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 28333 variables:

@MarcusKlik
Collaborator

MarcusKlik commented Jan 31, 2017

Thanks! I'm very interested in the mechanism you are building where fst caches your data. I have a few more features in mind for the fst package, so please let me know if you have specific requirements! Currently the number of columns is stored as a short integer (in C++). That means that your 2191021 columns are actually stored as (2191021 AND 0xffff) = 28333 columns. I will make sure that you can store more columns in the next version.
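A quick way to reproduce that truncation in R, using base R's bitwAnd (an illustrative sketch, not part of the fst API):

bitwAnd(2191021L, 0xFFFFL)  # 28333: only the low 16 bits of the column count survive
2191021L %% 65536L          # same result expressed as a modulo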

@MarcusKlik
Collaborator

MarcusKlik commented Jan 31, 2017

Note however that fst uses column-based storage and compression, so having a lot of relatively small columns is more expensive than having a few relatively large columns. Also, the basic size of a compression block is 16 kB at the moment (I'm still experimenting with the optimal size), so columns with a length of 700 will probably have a slightly worse compression factor than longer columns.
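As a rough back-of-the-envelope illustration (assuming 8-byte doubles; not a statement about the exact on-disk layout), a 700-row column does not even fill a single compression block:

700 * 8    # 5600 bytes of raw data in one double column of length 700
16 * 1024  # 16384 bytes in one 16 kB compression block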

@MarcusKlik MarcusKlik added the bug label Feb 22, 2017
@MarcusKlik MarcusKlik added this to the Format complete milestone Apr 16, 2017
@Fpadt
Author

Fpadt commented Apr 17, 2017

Hi Marcus,

Apologies for my late reaction; I missed your comments. Thanks for getting back to me and confirming the issue I ran into (it is as designed).

I have this number of columns because it is an mts (matrix of time series), which I am creating from a data.table.
I am still playing around with the best and most performant way to do this. The rows contain days, and indeed these are generally about length 1000 (3 x 365 days). The columns contain the forecast entities, which can be the combination Article_Store, resulting in millions of columns.

I tried transposing the matrix, but that loses quite some time. Now that I have read your response, though, I believe I need to investigate this option further, as it could be the best way forward.

@Fpadt
Author

Fpadt commented Apr 17, 2017

For your information:

I am rather happy with your package and use it as follows.
This code simply checks whether an fst file exists that is at least as recent as the original file. If it does, it loads that into memory; if it does not, it takes the hit of loading the original but immediately saves an fst copy.

I just wrote a message to the ProjectTemplate maintainers suggesting they build your package into their caching mechanism.

library(fst)  # for read.fst / write.fst

load_dt <- function(pDATA_TABLE, pPATH = PATH_DATA) {

  # derive both file paths from the table name (the naming convention is assumed;
  # the original definition of file_RData and file_fst was not shown)
  file_RData <- file.path(pPATH, paste0(pDATA_TABLE, ".RData"))
  file_fst   <- file.path(pPATH, paste0(pDATA_TABLE, ".fst"))

  file_time_format <- "%Y-%m-%d %H:%M:%S"

  # if an fst version exists and is at least as recent as the .RData file, load it;
  # otherwise load the .RData file and write an fst copy for next time
  if (file.exists(file_fst) &&
      strptime(file.mtime(file_fst),   format = file_time_format) >=
      strptime(file.mtime(file_RData), format = file_time_format)) {
    assign(pDATA_TABLE,
           read.fst(path = file_fst, as.data.table = TRUE),
           envir = .GlobalEnv)
  } else {
    load(file_RData, envir = .GlobalEnv)
    write.fst(get(pDATA_TABLE), file_fst)
  }
}
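A call would then look something like this (the object name DT_SALES is hypothetical; it assumes DT_SALES.RData exists under PATH_DATA and contains an object of the same name):

load_dt("DT_SALES")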

@MarcusKlik
Collaborator

Nice! I think finding a way to transpose the data would be very useful in your case. Storing all the column names takes a bite out of the performance (at least for short columns), and for every column an in-file seek operation is required, which is cheap on a fast SSD but adds up for millions of short columns.

Don't the regular gather or melt methods work effectively with so many columns?
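For reference, the long layout could be produced with data.table's melt along these lines (a minimal sketch; the column names are illustrative, with one wide column per article_store):

library(data.table)

wide <- data.table(date  = as.IDate("2014-01-01") + 0:699,
                   A1_S1 = rnorm(700),
                   A1_S2 = rnorm(700))
long <- melt(wide, id.vars = "date",
             variable.name = "article_store", value.name = "sales")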

Thanks a lot for filing your issue on the ProjectTemplate repository!

@Fpadt
Author

Fpadt commented Apr 18, 2017

Hi Marcus,

I thought my whole machine had collapsed when I tried. I still need to figure out the best way to cope with my data set, performance, and memory. I need to give it some thought; potentially I can leverage data.table in a clever way, keeping the time series in the rows instead of the columns. I will let you know when I succeed.

@MarcusKlik
Collaborator

Hi, good luck, and it would be interesting to know how you convert your table. By the way, if you start out with a matrix, you basically already have a one-dimensional vector. Perhaps you can store your matrix in a single data.table column and then just cycle the days in a second column?

library(data.table)

mat <- matrix(1:10950000, nrow = 1095)  # 1095 rows (days) x 10000 columns (series)
dt <- data.table(Val = as.vector(mat))  # the matrix is just one long vector
dt[, Days := 1:365]                     # day index recycled along the vector

print(dt)

               Val Days
       1:        1    1
       2:        2    2
       3:        3    3
       4:        4    4
       5:        5    5
      ---              
10949996: 10949996  361
10949997: 10949997  362
10949998: 10949998  363
10949999: 10949999  364
10950000: 10950000  365

I don't know if this applies to your case, just a thought!

@Fpadt
Author

Fpadt commented Apr 18, 2017 via email

@kbroman

kbroman commented Apr 20, 2017

I also have a case where write.fst is not writing the full data frame.

The data frame of interest has 58 rows and 286,521 columns. The result of write.fst has just 24,377 columns.

dim(z);write.fst(z, "many_cols.fst")
 [1]    58 286521
dim(read.fst("many_cols.fst"))
 [1]    58 24377

If I reformulate as the transpose, though, it's working fine for me.

@MarcusKlik
Collaborator

MarcusKlik commented Apr 20, 2017

Hi @kbroman, thanks for reporting the issue. You are seeing the effect of the column count being (incorrectly) down-cast to a short int (C++), the same problem @Fpadt ran into. In your case, 286521 & 0xffff = 24377. Using the development version of fst should fix your problem:

devtools::install_github("fstpackage/fst", ref = "develop")

That only works when you re-write your data using the development version as well. The development version stores a maximum of INT_MAX (C++) columns, about 2 billion. From R 3.0.0 onward, R supports long vectors, so I could accommodate even more columns in the format (but I don't know whether there would be a use case for such large data sets in wide format).

@kbroman

kbroman commented Apr 20, 2017

Thanks @MarcusKlik; I should have thought to check the development version. That does indeed work for me, but wow things are a lot faster when I go with the transposed (lots of rows but not many columns) version of the data.

@MarcusKlik
Collaborator

MarcusKlik commented Apr 20, 2017

Nice! Yes, fst has a columnar format and each column has a certain offset in the file. Reading a column requires a seek operation to the correct offset. Usually that will be very fast and hardly noticeable (recent SSD drives can do at least a few hundred thousand (1e5) random seeks per second), but with so many columns, performance will definitely suffer. The format also contains meta-information on each column (type, attributes, version, etc.) and processing that requires (a little) extra CPU time as well.

So depending on your drive and the number of columns compared to the number of rows, you might notice a significant difference between wide and long format.
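A rough way to see the difference yourself (an illustrative sketch, not a rigorous benchmark; the sizes are arbitrary):

library(data.table)
library(fst)

n_days <- 58; n_series <- 20000
m    <- matrix(rnorm(n_days * n_series), nrow = n_days)
wide <- as.data.table(m)                               # many short columns
long <- data.table(series = rep(seq_len(n_series), each  = n_days),
                   day    = rep(seq_len(n_days),   times = n_series),
                   value  = as.vector(m))              # a few long columns

system.time(write.fst(wide, "wide.fst"))  # per-column metadata and offsets
system.time(write.fst(long, "long.fst"))  # mostly sequential, block-wise I/O
system.time(read.fst("wide.fst"))         # a seek (and an R vector allocation) per column
system.time(read.fst("long.fst"))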

Thanks @kbroman for your quick response and for testing the development version!

@MarcusKlik MarcusKlik self-assigned this Apr 20, 2017
@MarcusKlik
Collaborator

Additionally, while parsing the fst file, a single R-vector has to be created by the R framework for each column. These allocations (by R) have some overhead and unfortunately there is no way to avoid that (and they can't be done in parallel). As soon as fst crosses the boundary from unmanaged to managed code, things tend to slow down a bit :-)

@MarcusKlik
Collaborator

Hi @Fpadt and @kbroman, thanks again for filing your issues with the maximum number of columns. This issue is solved in the develop version of fst, and the maximum number of columns that can be stored is now INT_MAX (C++), so about 2 billion (as can be seen here).

It would be interesting to know whether your data sets were actually (single-type) matrices or (multiple-type) data.frames, because I could add a feature to fst to use its serialization mechanism to store matrices as well (which are just vectors underneath, of course), and that would be much faster than writing matrices as if they were data.frames!
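Until such a feature exists, one hypothetical workaround is to flatten the matrix into a single column and store its dimensions alongside (dimnames would need similar treatment); the helper names below are made up for illustration:

library(data.table)
library(fst)

# store a matrix as one long column plus its (recycled) dimensions
save_matrix_fst <- function(m, path) {
  write.fst(data.table(value = as.vector(m),
                       nrow  = nrow(m),
                       ncol  = ncol(m)), path)
}

# rebuild the matrix from the flattened representation
load_matrix_fst <- function(path) {
  d <- read.fst(path, as.data.table = TRUE)
  matrix(d$value, nrow = d$nrow[1], ncol = d$ncol[1])
}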

@kbroman

kbroman commented Jul 13, 2017

Super!

The application I had in mind had one column of character strings followed by oodles of numeric columns.

@Fpadt
Author

Fpadt commented Jul 14, 2017

Hi Marcus,

I have a data.table with 12 M records and only 4 columns: 2 character, 1 IDate, and 1 integer. It loads via fst in <10 seconds (great). These are actually intermittent sales (article, product, date, quantity).

I need to forecast these, so in the end I need regular time series. To make them regular I dcast the data.table into a matrix of time series: dates as rownames, article_store as columns, and the sales (integer) in the cells. Indeed, this is actually one very long vector of NAs and integers; still, the rownames and column names should be preserved. Building this matrix from the data.table takes 80 seconds, and it would be awesome if I could just save and read it with fst, as past sales never change; I only have to append rows (and columns for new article_stores).

In summary:

  1. data.table (long)
  2. Data.table (wide)
  3. Matrix (mts, multiple time series)
  4. Result: forecast

I'm already quite happy with the performance, but I will test whether I can improve it further with your new functionality.
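The long-to-wide step described above might look roughly like this (a minimal sketch; the column names article, store, date, and quantity are assumptions):

library(data.table)

# sales: long data.table with columns article, store, date (IDate), quantity
wide <- dcast(sales, date ~ article + store, value.var = "quantity", sep = "_")

mts <- as.matrix(wide[, !"date"])         # one column per article_store, NA where no sale
rownames(mts) <- as.character(wide$date)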

@MarcusKlik
Collaborator

Thanks! So you both have large parts (for @kbroman almost all) of your data set that could actually be stored as one contiguous block on disk, and much speed could be gained there.

Perhaps I should detect adjacent columns with identical types and serialize them as a matrix internally. That requires some more thought. Thanks @kbroman and @Fpadt for your feedback!
