Channels not found with grep to $all_reads #1

Closed
vsoch opened this issue Sep 22, 2017 · 3 comments

@vsoch
Contributor

vsoch commented Sep 22, 2017

Hey @amojarro! I'm working on some Singularity images (like Docker, but safe for HPC) to go along with a publication on an internal container organization format, and someone in the community (do you know Pim?) recommended your pipeline. It's going well, and I have two versions of the container:

https://github.com/vsoch/carrierseq/tree/singularity

but I am hitting a snag. This call:

grep -Eo '_ch[0-9]+_|ch=[0-9]+' $all_reads > $output_folder/06_poisson_calculation/01_reads_channels.lst

returns nothing. I am using the data that you linked, so I think either the data changed or the grep call needs to be adjusted. When no channels are found, the Python script then fails because it is given 0 as the denominator.
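
For what it's worth, a guard along these lines would surface the problem before the Python step runs. This is only a sketch: the function name `extract_channels` and the error message are hypothetical, and only the grep pattern comes from the call above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): extract channel tags from a
# reads file and fail early if none are found.
extract_channels() {
    reads_file=$1
    channels_list=$2

    # Same pattern as the pipeline call. grep exits 1 when nothing matches,
    # so "|| true" keeps the function going to the explicit check below.
    grep -Eo '_ch[0-9]+_|ch=[0-9]+' "$reads_file" > "$channels_list" || true

    # -s is true only if the file exists and is not empty.
    if [ ! -s "$channels_list" ]; then
        echo "No channel tags found in $reads_file; check the read headers." >&2
        return 1
    fi
}

# Possible use inside the pipeline (variables as in the original call):
# extract_channels "$all_reads" "$output_folder/06_poisson_calculation/01_reads_channels.lst" || exit 1
```

Stopping here means the failure points at the read headers rather than at a division by zero further down.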

Thanks for your help with this!

@vsoch
Contributor Author

vsoch commented Sep 22, 2017

and see here --> https://github.com/vsoch/carrierseq/blob/singularity/docs/singularity.md for the overall idea.

@amojarro
Owner

Hi @vsoch, it looks like the Sequence Read Archive (SRA) has replaced the original read headers.

Normally, the sequence data would contain either the output information from the Albacore basecaller or from a Poretools fastq conversion command (fast5 > fastq).

For example, an Albacore header looks like this [read ID, run ID, read, channel, start_time]:

@cc74d4a9-b62f-4274-86d0-7d95370b6aba runid=55268 read=23015 ch=434 start_time=2017-06-22T17:44:34Z

And a Poretools header looks like this [read ID, path/to/fast5]:

@channel_434_cc74d4a9-b62f-4274-86d0-7d95370b6aba_template /Users/mojarro/Documents/Sequencing/Low_Input_Sequencing/minknow_1_5_18/fast5/pass/127/VENUSAUR_20170511_FNFAE22530_MN17220_sequencing_run_sample_id_55268_ch434_read23015_strand.fast5

However, the header information has now been replaced with an SRA ID and only the read ID:

>gnl|SRA|SRR5935058.1 895b5243-42d4-4cc6-8b5b-c29c813bf663_Basecall_1D template (Biological)
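
The difference can be checked directly by running the pipeline's grep pattern over the three example headers. This is a minimal sketch; the helper name `channel_tags` and the variable names are made up for illustration.

```shell
#!/bin/sh
# Pattern taken from the pipeline's grep call.
pattern='_ch[0-9]+_|ch=[0-9]+'

# The three example headers quoted in this comment.
albacore='@cc74d4a9-b62f-4274-86d0-7d95370b6aba runid=55268 read=23015 ch=434 start_time=2017-06-22T17:44:34Z'
poretools='@channel_434_cc74d4a9-b62f-4274-86d0-7d95370b6aba_template /Users/mojarro/Documents/Sequencing/Low_Input_Sequencing/minknow_1_5_18/fast5/pass/127/VENUSAUR_20170511_FNFAE22530_MN17220_sequencing_run_sample_id_55268_ch434_read23015_strand.fast5'
sra='>gnl|SRA|SRR5935058.1 895b5243-42d4-4cc6-8b5b-c29c813bf663_Basecall_1D template (Biological)'

# Hypothetical helper: print every channel tag found in one header line.
# "|| true" hides grep's exit status 1 when there is no match.
channel_tags() {
    printf '%s\n' "$1" | grep -Eo "$pattern" || true
}

channel_tags "$albacore"    # prints ch=434
channel_tags "$poretools"   # prints _ch434_
channel_tags "$sra"         # prints nothing
```

The Albacore header matches the `ch=[0-9]+` branch, the Poretools path matches the `_ch[0-9]+_` branch, and the SRA header matches neither, which is why the channel list comes out empty.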

Thank you for the comment. I will investigate how to preserve the original metadata on NCBI. In the meantime, I have uploaded the original fastq file to Dropbox.

https://www.dropbox.com/sh/vyor82ulzh7n9ke/AAC4W8rMe4z5hdb7j4QhF_IYa?dl=0

@vsoch
Contributor Author

vsoch commented Sep 24, 2017

Fantastic! Thanks for your quick response and for looking into this. I'll give it another try with the updated file, and will keep a lookout for updates from you here. A similar thing happened to me and a colleague with data URLs, and we ultimately opted to serve the data ourselves.

@vsoch vsoch closed this as completed Sep 26, 2017
amojarro pushed a commit that referenced this issue Feb 16, 2018