
Improve tool handling of very large input files #7

Open
bateman opened this issue Dec 4, 2018 · 2 comments

bateman commented Dec 4, 2018

We need to re-code our script to work around the fact that R, by default, loads an entire file into memory.
The easiest alternative is the ff library, which works with data frames containing heterogeneous data; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that doesn't appear to be our case.
The most general solutions are to use Hadoop and map-reduce to split the task into smaller, faster subtasks [2], or, alternatively, to leverage a database for storing and then querying the data [3].

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii
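For illustration, a minimal sketch of what the ff approach could look like (the file name, column layout, and batch sizes below are placeholders, not our actual data):

library(ff)

# read.csv.ffdf reads the file in batches and keeps the resulting ffdf on disk,
# so only the batch currently being parsed has to fit in RAM
big <- read.csv.ffdf(file = "input.csv", header = TRUE,
                     first.rows = 10000, next.rows = 50000)

# process the data chunk by chunk instead of materializing all rows at once
for (idx in chunk(big)) {
  block <- big[idx, ]   # one chunk as an ordinary data.frame
  # ... run the sentiment pipeline on 'block' ...
}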

maelick commented Apr 8, 2019

A simpler solution is to chunk the data and give each chunk to Senti4SD. That's what PR #9 did in a simple way. I also wrote my own script, which adds several improvements, including the ability to call Senti4SD as an R function rather than as a bash script.

For example, I can work with a 100k dataset like this:

source("classification_functions.R")
model <- LoadModel("modelLiblinear.Rda")
text <- unique(read.csv2(gzfile("test100k.csv.gz"), header=FALSE)[[1]])[1:100000]
system.time(res <- Senti4SDChunked(text, model, "."))

The code is available here: maelick/Senti4SD@5b0df31, and I can create a PR if there's interest.
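For readers who don't want to dig into the commit, here is a minimal sketch of the chunking idea (not the actual implementation; ClassifyChunk is a hypothetical placeholder for Senti4SD's per-chunk feature extraction and classification):

# Split the input into fixed-size chunks and classify each one, so that only
# one chunk's feature matrix has to be held in memory at a time.
Senti4SDChunked <- function(text, model, workdir, chunk.size = 1000) {
  chunks <- split(text, ceiling(seq_along(text) / chunk.size))
  results <- lapply(chunks, function(chunk) {
    # placeholder: extract features for this chunk and classify it with 'model'
    ClassifyChunk(chunk, model, workdir)
  })
  do.call(rbind, results)
}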

I've been able to run it successfully on my laptop with 8 GB of memory in 3800 s, using a chunk size of 1000. On a supercomputer I tried a larger chunk size (10k), but that only improved the run time to 2800 s. I suspect the reason for such a small improvement is that a significant amount of time is spent reading and writing huge CSV files. Using rJava (as I mentioned in #10) instead of CSV files to communicate between Java and R could significantly improve performance... and the reusability of the tool :-)
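To illustrate the rJava idea: rather than writing a CSV, invoking the jar, and reading a CSV back for every chunk, R would call the Java feature extractor in-process. The jar, class, and method names below are placeholders, not Senti4SD's actual API:

library(rJava)

.jinit()                                  # start a JVM inside the R session
.jaddClassPath("Senti4SD-features.jar")   # placeholder jar name

# placeholder class/method: the real entry point would be the jar's
# feature-extraction class, called once per chunk of texts
extractor <- .jnew("org/example/FeatureExtractor")
features  <- .jcall(extractor, "S", "extract", "This library is great!")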

nnovielli (Contributor) commented

Currently, we don't have the resources to work on this issue. Please open the PR; we will merge it into a separate branch to make it available to others. Thank you.
