
multiple_split_libraries_fastq.py -- Argument list too long #2069

Closed
ElDeveloper opened this issue Aug 6, 2015 · 14 comments

@ElDeveloper
Member

If the combined length of the paths in the split_libraries_fastq.py command is too long (as it gets to be when you have a few hundred samples, each in its own FASTQ file), multiple_split_libraries_fastq.py will fail with the following error:

File "/usr/local/bin/multiple_split_libraries_fastq.py", line 219, in <module>

 main()
File "/usr/local/bin/multiple_split_libraries_fastq.py", line 216, in main
  close_logger_on_success=True)
File "/usr/local/lib/python2.7/dist-packages/qiime/workflow/util.py", line 114, in call_commands_serially
  stdout, stderr, return_value = qiime_system_call(e[1])
File "/usr/local/lib/python2.7/dist-packages/qcli/util.py", line 36, in qcli_system_call
 stderr=PIPE)
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
 errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
 raise child_exception
 OSError: [Errno 7] Argument list too long
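
For reference, the limit being hit is the kernel's ARG_MAX (the total number of bytes execve() accepts for arguments plus environment). A minimal sketch to see how close a generated command gets to it; the paths below are made up:

# Rough sketch (hypothetical paths): compare the size of a generated
# split_libraries_fastq.py command against the kernel's ARG_MAX limit.
import os

arg_max = os.sysconf('SC_ARG_MAX')  # bytes allowed for argv + environment

# Fabricate a few hundred per-sample FASTQ paths, purely for illustration.
fastq_paths = ['/home/user/reads/sample_%04d_L001_R1_001.fastq.gz' % i
               for i in range(500)]
cmd = 'split_libraries_fastq.py -i %s -o sl_out/' % ','.join(fastq_paths)

print('ARG_MAX:        %d bytes' % arg_max)
print('command length: %d bytes' % len(cmd))
# The environment is counted against the same limit, so the real headroom
# is smaller than ARG_MAX - len(cmd).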

I don't think there's an easy fix for this; we had previously run into this problem in Qiita but never found a good solution. Does anyone have any ideas?

@walterst @gregcaporaso

@ElDeveloper ElDeveloper added the bug label Aug 6, 2015
@walterst
Contributor

walterst commented Aug 6, 2015

What is the max number of arguments? Or is it the length of the argument string that it's unhappy with?

In either case, it might be possible to check for this and, if it's detected, split the input into multiple commands. We could then check the counts of the output reads from each command to generate the next command, adding the --start_seq_id parameter so the reads get unique numbers, and finally concatenate the separate outputs.
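
A rough sketch of that chaining, assuming the inputs were already split into chunks; the output directory names and chunk count are made up, and the commands are only printed rather than executed:

# Sketch: count the reads written by one split_libraries_fastq.py run and
# use the running total as the --start_seq_id of the next run.
import os

def count_fasta_seqs(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith('>'))

start_seq_id = 0
for chunk_idx in range(3):  # pretend the inputs were split into 3 commands
    out_dir = 'sl_out_%d' % chunk_idx
    cmd = ('split_libraries_fastq.py ... -o %s --start_seq_id %d'
           % (out_dir, start_seq_id))
    print(cmd)  # the real workflow would execute this, not print it
    seqs_fna = os.path.join(out_dir, 'seqs.fna')
    if os.path.exists(seqs_fna):  # i.e. once the command has actually run
        start_seq_id += count_fasta_seqs(seqs_fna)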

@ElDeveloper
Member Author

As far as I understand, this is a limitation imposed by the OS/kernel (not sure about this), so I'm not really sure how to fix it.


@antgonza
Contributor

antgonza commented Aug 6, 2015

Yes, it's a kernel limitation, and it's based on the total length in characters, not on the number of parameters.

The only way I can think of is splitting the command by some number x of parameters. For example, if x is 10, multiple_split_libraries_fastq.py -i file1,file2,file3,...,file100,filex would become split_libraries_fastq.py -i file1,...,file10 ...; split_libraries_fastq.py -i file11,...,file20 ...; and so on.

@wasade
Member

wasade commented Aug 6, 2015

Would need to be separate logical lines; joining by ";" doesn't reduce the character count.


@antgonza
Contributor

antgonza commented Aug 6, 2015

Agree, that was just an example 😄

@walterst
Contributor

walterst commented Aug 6, 2015

How about this potential solution:
add a parameter to multiple_split_libraries_fastq.py (-n, --number_files_per_cmd) that defaults to 0 (which would create a single command, just as it does now). Any other integer would specify the number of comma-separated files/sample IDs to process in a single split_libraries_fastq.py command. The workflow would then run the series split_libraries_fastq.py -> count_seqs.py -> next split_libraries_fastq.py (with its --start_seq_id value based on the count_seqs.py result), and so on until complete, and finally run cat on all of the separate output seqs.fna files.
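
A sketch of how that chunking could look; everything below (file names, sample IDs, the final cat step) is illustrative, and it only builds and prints command strings rather than touching the real workflow code:

# Sketch of the proposed --number_files_per_cmd behaviour.
def chunked(items, n):
    """Yield successive lists of at most n items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

fastq_files = ['s%03d.fastq' % i for i in range(25)]  # fake inputs
sample_ids = ['sample.%03d' % i for i in range(25)]   # fake sample IDs
number_files_per_cmd = 10

commands = []
out_dirs = []
chunk_pairs = zip(chunked(fastq_files, number_files_per_cmd),
                  chunked(sample_ids, number_files_per_cmd))
for idx, (fps, sids) in enumerate(chunk_pairs):
    out_dir = 'sl_out_%d' % idx
    out_dirs.append(out_dir)
    commands.append('split_libraries_fastq.py -i %s --sample_ids %s -o %s'
                    % (','.join(fps), ','.join(sids), out_dir))

# Each command would be followed by a count_seqs.py call to set the next
# --start_seq_id (see the sketch above), then the per-chunk seqs.fna files
# would be concatenated at the end:
commands.append('cat %s > seqs.fna'
                % ' '.join('%s/seqs.fna' % d for d in out_dirs))

for cmd in commands:
    print(cmd)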

@ElDeveloper
Member Author

The more I think about this, the more I realize that most of these solutions are ephemeral. I am not super familiar with the code, but would it be at all possible to deal with this at the Python level? i.e. making direct calls to the Python code so that instead of the file paths being a bash string, they are really just a Python list? I know this may not be straightforward, but I think this may be the only reasonable solution.
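
Roughly the difference between the two approaches; note that demultiplex_fastq below is a hypothetical stand-in for whatever library function split_libraries_fastq.py wraps, not the real QIIME API:

# Hypothetical illustration only: 'demultiplex_fastq' is a made-up stand-in.
fastq_fps = ['s%03d.fastq' % i for i in range(500)]  # fake per-sample files

# Current approach: the paths are flattened into one huge shell string and
# handed to subprocess, which is where the kernel's length limit bites.
shell_cmd = 'split_libraries_fastq.py -i %s -o sl_out/' % ','.join(fastq_fps)
print('shell command is %d characters long' % len(shell_cmd))

# In-process approach: import the library code and pass the paths as a plain
# Python list, so no command line (and no length limit) is involved.
def demultiplex_fastq(input_fps, output_dir):  # hypothetical signature
    print('would demultiplex %d files into %s' % (len(input_fps), output_dir))

demultiplex_fastq(fastq_fps, 'sl_out/')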


@agentfog

agentfog commented Aug 7, 2015

Hi, Yoshiki

I like your idea of bypassing the CLI/system call, but how would you deal with logging? Currently, the "command executor" call_commands_serially() logs the retval, stdout and stderr of the subprocess, in this case, split_libraries_fastq.py.

An alternative would be to add a batch mode to split_libraries_fastq.py so that you could specify the path to a text file which contains all the required file paths and sample IDs.

@ElDeveloper
Member Author

Good point @agentfog! Maybe what needs to happen is that each sample is processed individually and then all the results are collated together? ... not ideal, but I guess it would work.

@ElDeveloper
Member Author

After talking with @rob-knight, he suggested that the options whose arguments can trigger this error (--sample_ids, -i and -b) also accept a text file where each line is a file path or a sample ID. This would get rid of the limitation that the shell imposes. Consequently, multiple_split_libraries_fastq.py would create this file as part of its workflow.
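
A sketch of that idea; the helper names and the one-path-per-line layout are assumptions, not the final implementation:

# Sketch: the wrapper writes one path per line to a file, and
# split_libraries_fastq.py would read that file back instead of splitting a
# comma-separated command-line value.
import os
import tempfile

def write_path_list(paths):
    """Write one path per line to a temporary file and return its path."""
    fd, fp = tempfile.mkstemp(suffix='.txt')
    with os.fdopen(fd, 'w') as fh:
        fh.write('\n'.join(paths) + '\n')
    return fp

def read_path_list(fp):
    """Read the file back into the list the option would normally hold."""
    with open(fp) as fh:
        return [line.strip() for line in fh if line.strip()]

fastq_fps = ['s%03d.fastq' % i for i in range(500)]
list_fp = write_path_list(fastq_fps)
# The command stays short no matter how many samples there are:
print('the command would reference %s instead of hundreds of paths' % list_fp)
assert read_path_list(list_fp) == fastq_fps
os.remove(list_fp)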

@agentfog

agentfog commented Aug 25, 2015 via email

@ElDeveloper
Member Author

@agentfog Ah, got it - sorry for missing that! Yes, though I think we really want to go with the approach of a single file per input type, i.e. one for sequences, one for barcodes and one for sample identifiers. @walterst, @gregcaporaso and others: if nobody has objections to this approach I'll try to get this in ASAP. Please let me know before Tuesday August 25 ~ 1 PM PT; otherwise I'll begin working on this.

On (Aug-24-15|19:00), agentfog wrote:

That's what I meant by "batch mode". The way I imagined the text file, each line would correspond to a sample and there would be two or three columns depending on the demux mode. The columns would be either (read_file, sample_id) or (read_file, barcode_file, mapping_file).

What do you think?
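
For concreteness, parsing that kind of batch file might look roughly like this; the tab delimiter and field names are assumptions, since the description above doesn't pin them down:

# Sketch of parsing the batch file described above: tab-separated lines with
# either (read_file, sample_id) or (read_file, barcode_file, mapping_file).
def parse_batch_file(lines):
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        if len(fields) == 2:
            entries.append({'read_fp': fields[0], 'sample_id': fields[1]})
        elif len(fields) == 3:
            entries.append({'read_fp': fields[0], 'barcode_fp': fields[1],
                            'mapping_fp': fields[2]})
        else:
            raise ValueError('expected 2 or 3 tab-separated fields, got: %r'
                             % line)
    return entries

example = ['sample_a_R1.fastq\tsample.a',
           'sample_b_R1.fastq\tsample.b']
print(parse_batch_file(example))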



@walterst
Contributor

Sounds good.

@ElDeveloper
Member Author

BTW, I'm working on this right now; I had to delay it due to other things that came up, but I should be able to post a PR sometime soon.

ElDeveloper added a commit to ElDeveloper/qiime that referenced this issue Aug 28, 2015
With this new argument, split_libraries_fastq.py can now take a file that
lists the input files for the `-b`, `-i`, `--sample_ids` and
`--mapping_files` arguments. This allows an indefinite number of input
files, as opposed to the previous approach, where we were limited by the
maximum number of characters that can fit into a single command-line
execution.

Fixes biocore#2069