
Improve tool handling of very large input files #7

Open
bateman opened this issue Dec 4, 2018 · 2 comments

bateman commented Dec 4, 2018

We need to re-code our script to work around the fact that R, by default, loads an entire file into memory.
The easiest alternative is the ff library, which works with data frames containing heterogeneous data; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that doesn't appear to be our case.
The most general solutions are to use Hadoop and map-reduce to split the task into smaller, faster subtasks [2], or, alternatively, to leverage a database for storing and then querying the data [3].

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii
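For illustration, a minimal sketch of what the ff approach could look like (the file name, column layout, and batch sizes below are placeholders, not our actual data):

library(ff)

# read.csv.ffdf reads the file in batches and keeps the resulting ffdf on disk,
# so only the batch currently being parsed has to fit in RAM
big <- read.csv.ffdf(file = "input.csv", header = TRUE,
                     first.rows = 10000, next.rows = 50000)

# process the data chunk by chunk instead of materializing all rows at once
for (idx in chunk(big)) {
  block <- big[idx, ]   # one chunk as an ordinary data.frame
  # ... run the sentiment pipeline on 'block' ...
}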

maelick commented Apr 8, 2019

A simpler solution is to chunk the data and give each chunk to Senti4SD. That's what PR #9 did in a simple way. I also wrote my own script, which adds several improvements, including the ability to call Senti4SD as an R function rather than as a bash script.

For example, I can work with a 100k dataset like this:

source("classification_functions.R")
model <- LoadModel("modelLiblinear.Rda")
text <- unique(read.csv2(gzfile("test100k.csv.gz"), header=FALSE)[[1]])[1:100000]
system.time(res <- Senti4SDChunked(text, model, "."))

The code is available here: maelick/Senti4SD@5b0df31, and I can create a PR if there's interest.
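For readers who don't want to dig into the commit, here is a minimal sketch of the chunking idea (not the actual implementation; ClassifyChunk is a hypothetical placeholder for Senti4SD's per-chunk feature extraction and classification):

# Split the input into fixed-size chunks and classify each one, so that only
# one chunk's feature matrix has to be held in memory at a time.
Senti4SDChunked <- function(text, model, workdir, chunk.size = 1000) {
  chunks <- split(text, ceiling(seq_along(text) / chunk.size))
  results <- lapply(chunks, function(chunk) {
    # placeholder: extract features for this chunk and classify it with 'model'
    ClassifyChunk(chunk, model, workdir)
  })
  do.call(rbind, results)
}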

I've been able to run it successfully on my laptop with 8 GB of memory in 3800 s, using a chunk size of 1000. On a supercomputer I tried a larger chunk size (10k), but that only improved the run time to 2800 s. I suspect the reason for such a small improvement is that a significant amount of time is spent reading and writing huge CSV files. Using rJava (as I mentioned in #10) instead of CSV files to communicate between Java and R could significantly improve performance... and the reusability of the tool :-)
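To illustrate the rJava idea: rather than writing a CSV, invoking the jar, and reading a CSV back for every chunk, R would call the Java feature extractor in-process. The jar, class, and method names below are placeholders, not Senti4SD's actual API:

library(rJava)

.jinit()                                  # start a JVM inside the R session
.jaddClassPath("Senti4SD-features.jar")   # placeholder jar name

# placeholder class/method: the real entry point would be the jar's
# feature-extraction class, called once per chunk of texts
extractor <- .jnew("org/example/FeatureExtractor")
features  <- .jcall(extractor, "S", "extract", "This library is great!")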

nnovielli (Contributor) commented

Currently, we don't have the resources to work on this issue. Please open the PR; we will merge it into a separate branch to make it available to others. Thank you.
