automatic config problem with paired samples numerically named with _samplenum #1919

Closed
stephenturner opened this issue May 1, 2017 · 6 comments

@stephenturner
Contributor

commented May 1, 2017

Not sure how to eloquently describe this problem. Imagine I have PE sequencing data for samples 1-12 named like this, unfortunately with no leading zeros.

sample_1_1.fastq.gz	sample_1_2.fastq.gz
sample_2_1.fastq.gz	sample_2_2.fastq.gz
sample_3_1.fastq.gz	sample_3_2.fastq.gz
sample_4_1.fastq.gz	sample_4_2.fastq.gz
sample_5_1.fastq.gz	sample_5_2.fastq.gz
sample_6_1.fastq.gz	sample_6_2.fastq.gz
sample_7_1.fastq.gz	sample_7_2.fastq.gz
sample_8_1.fastq.gz	sample_8_2.fastq.gz
sample_10_1.fastq.gz	sample_10_2.fastq.gz
sample_11_1.fastq.gz	sample_11_2.fastq.gz
sample_12_1.fastq.gz	sample_12_2.fastq.gz

When I attempt automated configuration from a CSV file using -w template, I get warnings that bcbio is adding minimal metadata for samples _1 and _2, and the files: list in the generated YAML file is wrong.

I imagine it's something to do with how the template generation script is looking for _1.fastq.gz and _2.fastq.gz, but is getting confused by the _1 and _2 in the sample names themselves.

In any case, my workaround was to simply rename the files, or symlink them, without the _ between "sample" and the number. This is probably not that much of an edge case, so it may be worth addressing, or at least making it obvious what's happening; it took me a few minutes to figure out what the issue was.
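For anyone hitting the same problem, the symlink workaround described above could be scripted roughly as follows. This is a minimal sketch, not part of bcbio: the `sample_<number>_<read>.fastq.gz` pattern is taken from the file listing in this issue, and the function names `unambiguous_name` and `link_renamed` are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme from this issue: sample_<number>_<read>.fastq.gz,
# where <read> is 1 or 2. Adjust the pattern for other layouts.
NAME_RE = re.compile(r"^sample_(\d+)_([12])\.fastq\.gz$")


def unambiguous_name(filename):
    """Drop the underscore between "sample" and the sample number.

    Returns None when the file name does not match the assumed pattern.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    sample, read = match.groups()
    return f"sample{sample}_{read}.fastq.gz"


def link_renamed(src_dir, dest_dir):
    """Symlink each matching FASTQ from src_dir into dest_dir under its new name."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for path in sorted(src_dir.iterdir()):
        new_name = unambiguous_name(path.name)
        if new_name is None:
            continue  # leave unrelated files alone
        link = dest_dir / new_name
        # is_symlink() also catches dangling links that exists() would miss
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(path.resolve())
        created.append(link.name)
    return created
```

With the links in place, the CSV would then refer to names such as `sample10` rather than `sample_10`, leaving only one `_1`/`_2` field in each file name.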

@leiendeckerlu

Contributor

commented May 3, 2017

I encountered exactly the same issue. I agree that it seems to be related to how bcbio looks for the PE files and then gets confused.

@roryk

Collaborator

commented May 3, 2017

Sorry about that. Those file names are ambiguous, so we have a hard time writing something to automatically detect the pairing: we don't know whether the _1 and _2 (or _3 and _4) pair suffix is the first or the second numeric field in the name. For these particular samples we could guess, because one field goes higher than _4, but we wouldn't be able to tell if the experiment had fewer than 5 samples. It's hard to think of a good way to handle this case other than renaming.
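The ambiguity can be illustrated with a small sketch. It is not bcbio's detection code; it assumes names of the form `<prefix>_<number>_<number>.fastq.gz`, considers only read indices 1 and 2 (ignoring the _3/_4 case mentioned above), and uses hypothetical helper names `group_pairs` and `is_valid_pairing`.

```python
import re
from collections import defaultdict

# Assumed layout: a prefix followed by two numeric fields and the extension.
FIELDS_RE = re.compile(r"^(.+)_(\d+)_(\d+)\.fastq\.gz$")


def group_pairs(filenames, read_field):
    """Group files into candidate pairs.

    read_field is "first" or "last": which numeric field to treat as the
    read index. The other field is treated as the sample number.
    """
    groups = defaultdict(dict)
    for name in filenames:
        prefix, first, last = FIELDS_RE.match(name).groups()
        read, sample = (first, last) if read_field == "first" else (last, first)
        groups[f"{prefix}_{sample}"][read] = name
    return dict(groups)


def is_valid_pairing(groups):
    """A grouping is plausible if every sample has exactly reads 1 and 2."""
    return all(set(reads) == {"1", "2"} for reads in groups.values())


small = ["sample_1_1.fastq.gz", "sample_1_2.fastq.gz",
         "sample_2_1.fastq.gz", "sample_2_2.fastq.gz"]
# With only two samples, both interpretations look like valid pairings.
print(is_valid_pairing(group_pairs(small, "last")))    # True
print(is_valid_pairing(group_pairs(small, "first")))   # True

larger = small + ["sample_10_1.fastq.gz", "sample_10_2.fastq.gz"]
# A sample number outside the read range rules out the "first" reading.
print(is_valid_pairing(group_pairs(larger, "last")))   # True
print(is_valid_pairing(group_pairs(larger, "first")))  # False
```

In the small case the file names alone cannot say which grouping is right, which is why renaming is the safer fix.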

@stephenturner

Contributor Author

commented May 3, 2017

Makes sense. Might be worth a note at https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#automated-sample-configuration

For FASTQ files, the template functionality will identify pairs using standard conventions (_1 and _2, including Illumina extensions like _R1), so use the base filename without these (/path/to/yourfile_R1.fastq => yourfile)
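As a rough illustration of the convention quoted above, the base name for the CSV could be derived like this. It is a sketch under assumptions, not the template code itself: the helper `base_name` is hypothetical, and the suffix and extension patterns cover only the cases named in the quoted text (`_1`, `_2`, `_R1`, `_R2`, with `.fastq` or `.fq`, optionally gzipped).

```python
import re
from pathlib import Path

# Assumed pair suffixes (_1, _2, _R1, _R2) and FASTQ extensions.
PAIR_SUFFIX_RE = re.compile(r"_R?[12]$")
FASTQ_EXT_RE = re.compile(r"\.(fastq|fq)(\.gz)?$")


def base_name(path):
    """Strip the directory, the FASTQ extension and one trailing pair suffix."""
    name = Path(path).name
    name = FASTQ_EXT_RE.sub("", name)
    return PAIR_SUFFIX_RE.sub("", name)


print(base_name("/path/to/yourfile_R1.fastq"))  # yourfile
print(base_name("sample_1_2.fastq.gz"))         # sample_1
```

Only the last suffix is stripped, so `sample_1_2.fastq.gz` maps to `sample_1`, a name that itself still ends in something that looks like a pair suffix.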

@lpantano

Collaborator

commented May 4, 2017

@stephenturner

Contributor Author

commented May 5, 2017

@lpantano

Collaborator

commented May 6, 2017

Thanks! I'll close this.

@lpantano closed this May 6, 2017
