
multiple_split_libraries_fastq.py -- Argument list too long #2069

Closed
ElDeveloper opened this issue Aug 6, 2015 · 14 comments

@ElDeveloper
Member

If the combined length of the paths in the split_libraries_fastq.py command is too long (as it gets to be when you have a few hundred samples, each in its own FASTQ file), multiple_split_libraries_fastq.py will fail with the following error:

File "/usr/local/bin/multiple_split_libraries_fastq.py", line 219, in <module>

 main()
File "/usr/local/bin/multiple_split_libraries_fastq.py", line 216, in main
  close_logger_on_success=True)
File "/usr/local/lib/python2.7/dist-packages/qiime/workflow/util.py", line 114, in call_commands_serially
  stdout, stderr, return_value = qiime_system_call(e[1])
File "/usr/local/lib/python2.7/dist-packages/qcli/util.py", line 36, in qcli_system_call
 stderr=PIPE)
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
 errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
 raise child_exception
 OSError: [Errno 7] Argument list too long
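
For reference, the limit being hit is the kernel's ARG_MAX (the total number of bytes execve() accepts for arguments plus environment). A minimal sketch to see how close a generated command gets to it; the paths below are made up:

# Rough sketch (hypothetical paths): compare the size of a generated
# split_libraries_fastq.py command against the kernel's ARG_MAX limit.
import os

arg_max = os.sysconf('SC_ARG_MAX')  # bytes allowed for argv + environment

# Fabricate a few hundred per-sample FASTQ paths, purely for illustration.
fastq_paths = ['/home/user/reads/sample_%04d_L001_R1_001.fastq.gz' % i
               for i in range(500)]
cmd = 'split_libraries_fastq.py -i %s -o sl_out/' % ','.join(fastq_paths)

print('ARG_MAX:        %d bytes' % arg_max)
print('command length: %d bytes' % len(cmd))
# The environment is counted against the same limit, so the real headroom
# is smaller than ARG_MAX - len(cmd).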

I don't think there's an easy fix for this; we had previously run into this problem in Qiita but never found a good solution. Does anyone have any ideas?

@walterst @gregcaporaso

@ElDeveloper ElDeveloper added the bug label Aug 6, 2015
@walterst
Contributor

walterst commented Aug 6, 2015

What is the max number of arguments? Or is it the length of the argument string that it's unhappy with?

In either case, it might be possible to check for this and, if it's detected, split the input into multiple commands. We could then check the counts of the output reads from each command to generate the next command, adding the --start_seq_id parameter so the reads get unique numbers, and finally concatenate the separate outputs.
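
A rough sketch of that chaining, assuming the inputs were already split into chunks; the output directory names and chunk count are made up, and the commands are only printed rather than executed:

# Sketch: count the reads written by one split_libraries_fastq.py run and
# use the running total as the --start_seq_id of the next run.
import os

def count_fasta_seqs(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith('>'))

start_seq_id = 0
for chunk_idx in range(3):  # pretend the inputs were split into 3 commands
    out_dir = 'sl_out_%d' % chunk_idx
    cmd = ('split_libraries_fastq.py ... -o %s --start_seq_id %d'
           % (out_dir, start_seq_id))
    print(cmd)  # the real workflow would execute this, not print it
    seqs_fna = os.path.join(out_dir, 'seqs.fna')
    if os.path.exists(seqs_fna):  # i.e. once the command has actually run
        start_seq_id += count_fasta_seqs(seqs_fna)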

@ElDeveloper
Member Author

As far as I understand, this is a limitation imposed by the OS/kernel (not sure about this), so I'm not really sure how to fix it.


@antgonza
Contributor

antgonza commented Aug 6, 2015

Yes, it's a kernel limitation, and it's based on the total length in characters, not on the number of parameters.

The only way I can think of is splitting the command by some number x of parameters. For example, if x is 10, multiple_split_libraries_fastq.py -i file1,file2,file3,...,file100,filex would become split_libraries_fastq.py -i file1,...,file10 ...; split_libraries_fastq.py -i file11,...,file20 ...; and so on.

@wasade
Member

wasade commented Aug 6, 2015

Would need to be separate logical lines; joining by ";" doesn't reduce the character count.


@antgonza
Contributor

antgonza commented Aug 6, 2015

Agree, that was just an example 😄

@walterst
Contributor

walterst commented Aug 6, 2015

How about this potential solution:
add a parameter to multiple_split_libraries_fastq.py (-n, --number_files_per_cmd) that defaults to 0 (which would create a single command, just as it does now). Any other integer would specify the number of comma-separated files/sample IDs to process in a single split_libraries_fastq.py command. The workflow would then run the series split_libraries_fastq.py -> count_seqs.py -> next split_libraries_fastq.py (with its --start_seq_id value based on the count_seqs.py result), and so on until complete, and finally run cat on all of the separate output seqs.fna files.
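
A sketch of how that chunking could look; everything below (file names, sample IDs, the final cat step) is illustrative, and it only builds and prints command strings rather than touching the real workflow code:

# Sketch of the proposed --number_files_per_cmd behaviour.
def chunked(items, n):
    """Yield successive lists of at most n items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

fastq_files = ['s%03d.fastq' % i for i in range(25)]  # fake inputs
sample_ids = ['sample.%03d' % i for i in range(25)]   # fake sample IDs
number_files_per_cmd = 10

commands = []
out_dirs = []
chunk_pairs = zip(chunked(fastq_files, number_files_per_cmd),
                  chunked(sample_ids, number_files_per_cmd))
for idx, (fps, sids) in enumerate(chunk_pairs):
    out_dir = 'sl_out_%d' % idx
    out_dirs.append(out_dir)
    commands.append('split_libraries_fastq.py -i %s --sample_ids %s -o %s'
                    % (','.join(fps), ','.join(sids), out_dir))

# Each command would be followed by a count_seqs.py call to set the next
# --start_seq_id (see the sketch above), then the per-chunk seqs.fna files
# would be concatenated at the end:
commands.append('cat %s > seqs.fna'
                % ' '.join('%s/seqs.fna' % d for d in out_dirs))

for cmd in commands:
    print(cmd)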

@ElDeveloper
Member Author

The more I think about this, the more I realize that most of these solutions are ephemeral. I am not super familiar with the code, but would it be at all possible to deal with this at the Python level? i.e. making direct calls to the Python code so that instead of the file paths being a bash string, they are really just a Python list? I know this may not be straightforward, but I think this may be the only reasonable solution.
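
Roughly the difference between the two approaches; note that demultiplex_fastq below is a hypothetical stand-in for whatever library function split_libraries_fastq.py wraps, not the real QIIME API:

# Hypothetical illustration only: 'demultiplex_fastq' is a made-up stand-in.
fastq_fps = ['s%03d.fastq' % i for i in range(500)]  # fake per-sample files

# Current approach: the paths are flattened into one huge shell string and
# handed to subprocess, which is where the kernel's length limit bites.
shell_cmd = 'split_libraries_fastq.py -i %s -o sl_out/' % ','.join(fastq_fps)
print('shell command is %d characters long' % len(shell_cmd))

# In-process approach: import the library code and pass the paths as a plain
# Python list, so no command line (and no length limit) is involved.
def demultiplex_fastq(input_fps, output_dir):  # hypothetical signature
    print('would demultiplex %d files into %s' % (len(input_fps), output_dir))

demultiplex_fastq(fastq_fps, 'sl_out/')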


@agentfog

agentfog commented Aug 7, 2015

Hi, Yoshiki

I like your idea of bypassing the CLI/system call, but how would you deal with logging? Currently, the "command executor" call_commands_serially() logs the retval, stdout and stderr of the subprocess, in this case, split_libraries_fastq.py.

An alternative would be to add a batch mode to split_libraries_fastq.py so that you could specify the path to a text file which contains all the required file paths and sample IDs.

@ElDeveloper
Member Author

Good point @agentfog! Maybe what needs to happen is that each sample is processed individually and then all the results are collated together? ... not ideal, but I guess it would work.

@ElDeveloper
Member Author

After talking with @rob-knight, he suggested that the options whose arguments can trigger this error (--sample_ids, -i and -b) also accept a text file where each line is a file path or a sample ID. This would get rid of the limitation that the shell imposes. Consequently, multiple_split_libraries_fastq.py would create this file as part of its workflow.
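
A sketch of that idea; the helper names and the one-path-per-line layout are assumptions, not the final implementation:

# Sketch: the wrapper writes one path per line to a file, and
# split_libraries_fastq.py would read that file back instead of splitting a
# comma-separated command-line value.
import os
import tempfile

def write_path_list(paths):
    """Write one path per line to a temporary file and return its path."""
    fd, fp = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as fh:
        fh.write('\n'.join(paths) + '\n')
    return fp

def read_path_list(fp):
    """Read the file back into the list the option would normally hold."""
    with open(fp) as fh:
        return [line.strip() for line in fh if line.strip()]

fastq_fps = ['s%03d.fastq' % i for i in range(500)]
list_fp = write_path_list(fastq_fps)
# The command stays short no matter how many samples there are:
print('the command would reference %s instead of hundreds of paths' % list_fp)
assert read_path_list(list_fp) == fastq_fps
os.remove(list_fp)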

@agentfog

agentfog commented Aug 25, 2015 via email

@ElDeveloper
Member Author

@agentfog Ah, got it - sorry for missing that! Yes, though I think we really want to go with the approach of a single file per input type, i.e. one for sequences, one for barcodes and one for sample identifiers. @walterst, @gregcaporaso and others: if nobody has objections to this approach I'll try to get this in ASAP. Please let me know before Tuesday August 25 ~ 1 PM PT; otherwise I'll begin working on this.

On (Aug-24-15|19:00), agentfog wrote:

That's what I meant by "batch mode". The way I imagined the text file, each line would correspond to a sample and there would be two or three columns depending on the demux mode. The columns would be either (read_file, sample_id) or (read_file, barcode_file, mapping_file).

What do you think?
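
For concreteness, parsing that kind of batch file might look roughly like this; the tab delimiter and field names are assumptions, since the description above doesn't pin them down:

# Sketch of parsing the batch file described above: tab-separated lines with
# either (read_file, sample_id) or (read_file, barcode_file, mapping_file).
def parse_batch_file(lines):
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        if len(fields) == 2:
            entries.append({'read_fp': fields[0], 'sample_id': fields[1]})
        elif len(fields) == 3:
            entries.append({'read_fp': fields[0], 'barcode_fp': fields[1],
                            'mapping_fp': fields[2]})
        else:
            raise ValueError('expected 2 or 3 tab-separated fields, got: %r'
                             % line)
    return entries

example = ['sample_a_R1.fastq\tsample.a',
           'sample_b_R1.fastq\tsample.b']
print(parse_batch_file(example))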



@walterst
Contributor

Sounds good.

@ElDeveloper
Member Author

BTW, I'm working on this right now; I had to delay it due to other things that came up, but I should be able to post a PR sometime soon.

ElDeveloper added a commit to ElDeveloper/qiime that referenced this issue Aug 28, 2015
With this new argument, split_libraries_fastq.py can now take a file that
lists the input files for the `-b`, `-i`, `--sample_ids` and
`--mapping_files` arguments. This allows an indefinite number of input
files, as opposed to the previous approach, where we were limited by the
maximum number of characters that can fit into a single command-line
execution.

Fixes biocore#2069