microRNAseq analysis using bcbio for non model organisms #2427

Closed
WimSpee opened this issue Jun 29, 2018 · 6 comments

@WimSpee
Contributor

WimSpee commented Jun 29, 2018

Hi,

Would you expect the microRNAseq analysis provided by bcbio to make sense for microRNAseq data from non-model organisms?

I am trying to see if I can process the Capsicum annuum microRNAseq data generated in this project using bcbio:
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA177852

I am new to microRNAseq analysis so I am not really sure how to run this analysis and I am also not sure what kind of output I should expect.

The following is the yaml file that I am using:

upload:
  dir: ../final
details:
  - analysis: smallRNA-seq
    algorithm:
      aligner: star # any other aligner is supported.
      # change adapter according project
      # adapters: ["TGGAATTCTCGGGTGC"]
      expression_caller: [ seqcluster, mirdeep2]
      # expression_caller: [trna, seqcluster, mirdeep2, mirge] Read docs to know how to use
      # miRge tools: https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq
      # species: hsa
    genome_build: my_ref
#resources:
#  atropos:
#    options: ["-u 4", "-u -4"]
#  mirge:
#    options: ["-lib $PATH_TO_LIBS_FOLDER"]

This is the log file produced by the analysis.

[2018-06-27T18:55Z] grid_controller: System YAML configuration: /workspace/my_user/tmp_bcbio_1.1.0_development/data_dir/galaxy/bcbio_system.yaml
[2018-06-27T18:56Z] grid_controller: Timing: organize samples
[2018-06-27T18:56Z] grid_controller: ipython: organize_samples
[2018-06-27T18:56Z] exeuction_node_20: Using input YAML configuration: /leading_dir/config/DA_1164_samples-merged.yaml
[2018-06-27T18:56Z] exeuction_node_20: Checking sample YAML configuration: /leading_dir/config/DA_1164_samples-merged.yaml
[2018-06-27T18:56Z] exeuction_node_20: Testing minimum versions of installed programs
[2018-06-27T18:56Z] grid_controller: ipython: prepare_sample
[2018-06-27T18:56Z] grid_controller: Timing: adapter trimming
[2018-06-27T18:56Z] grid_controller: ipython: trim_srna_sample
[2018-06-27T19:26Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_01/DA_1164_01.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T20:17Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_02/DA_1164_02.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T21:00Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_03/DA_1164_03.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T21:42Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_04/DA_1164_04.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T22:20Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_05/DA_1164_05.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T22:55Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_06/DA_1164_06.clean.fastq.gz with --min_size 16 --min 1
[2018-06-27T23:31Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_07/DA_1164_07.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T00:28Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_08/DA_1164_08.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T01:11Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_09/DA_1164_09.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T01:54Z] exeuction_node_20: Collapsing /leading_dir/work/trimmed/DA_1164_10/DA_1164_10.clean.fastq.gz with --min_size 16 --min 1
[2018-06-28T02:13Z] grid_controller: Timing: prepare
[2018-06-28T02:13Z] grid_controller: ipython: seqcluster_prepare
[2018-06-28T03:05Z] exeuction_node_24: Prepare seqs.fastq with -minl 17 -maxl 40 -minc 2 --min_shared 0.1
[2018-06-28T03:08Z] grid_controller: Timing: alignment
[2018-06-28T03:08Z] grid_controller: ipython: srna_alignment
[2018-06-28T03:08Z] exeuction_node_24: Aligning lane DA_1164_01 with star aligner
[2018-06-28T03:11Z] exeuction_node_24: mirdeep2 Rfam file not instaled. Skipping...
[2018-06-28T03:11Z] grid_controller: Timing: small RNA annotation
[2018-06-28T03:11Z] grid_controller: ipython: srna_annotation
[2018-06-28T03:12Z] grid_controller: Timing: cluster
[2018-06-28T03:12Z] grid_controller: ipython: seqcluster_cluster
[2018-06-28T04:59Z] grid_controller: Timing: quality control
[2018-06-28T04:59Z] grid_controller: ipython: pipeline_summary
[2018-06-28T04:59Z] exeuction_node_20: QC: DA_1164_01 fastqc
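
For readers unfamiliar with the "Collapsing ... with --min_size 16 --min 1" lines in the log above, here is a minimal Python sketch of what read collapsing generally means: identical reads are merged into one entry with a count. It assumes that --min_size is a minimum read length and --min a minimum count. That reading of the flags is an assumption, and the function name collapse_reads is made up for illustration; it is not part of bcbio or seqcluster.

```python
from collections import Counter


def collapse_reads(sequences, min_size=16, min_count=1):
    """Merge identical reads into counts (illustrative only).

    min_size and min_count mirror the assumed meaning of the
    --min_size and --min flags seen in the log.
    """
    # Drop reads shorter than min_size, then count identical sequences.
    counts = Counter(seq for seq in sequences if len(seq) >= min_size)
    # Keep only sequences seen at least min_count times.
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Toy reads, not real data: one 20 nt read seen twice, one 12 nt read.
toy_reads = ["ACGT" * 5, "ACGT" * 5, "ACGT" * 3]
collapsed = collapse_reads(toy_reads)
print(collapsed)
```

With the toy input, the 12 nt read is dropped and the 20 nt read is kept with a count of 2.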

I am not sure how to specify that dnapi should be run for de novo adapter detection, followed by adapter trimming. As far as I can tell, dnapi was not used for adapter trimming. The FastQC section of the MultiQC report shows that the last 25 bp of the 50 bp reads are almost 100% adapter sequence.
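
One quick way to check how much adapter is left in the reads is to count the reads that contain the start of a known adapter. The sketch below is only an illustration and not part of bcbio: the function names and the 10-base seed length are assumptions, and the adapter is the one shown commented out in the YAML above, which may not be the one used for this data set.

```python
import gzip


def fastq_sequences(path, limit=100000):
    """Yield up to `limit` read sequences from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= limit:
                break
            if i % 4 == 1:  # the second line of each 4-line record is the sequence
                yield line.strip()


def adapter_fraction(reads, adapter, seed_length=10):
    """Fraction of reads containing the first seed_length bases of adapter."""
    seed = adapter[:seed_length]
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(1 for read in reads if seed in read) / len(reads)


ADAPTER = "TGGAATTCTCGGGTGC"  # taken from the commented-out YAML line

# Toy reads, not real data: one insert followed by adapter, one without.
toy_reads = ["ACGTACGTACGTACGTACGTAC" + ADAPTER, "ACGT" * 12]
print(adapter_fraction(toy_reads, ADAPTER))
```

On real data the same function could be called as adapter_fraction(fastq_sequences(path), ADAPTER), where path points to one of the input or trimmed FASTQ files.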

As far as I can tell, Capsicum annuum is not in miRBase, so I did not enter a three-letter species code. I am not sure whether it makes sense to enter the species code of a somewhat related species instead, such as sly:
http://www.mirbase.org/cgi-bin/mirna_summary.pl?org=sly
or whether it is better not to provide a species code at all.
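
To see what a given species code would actually match, one option is to filter a miRBase FASTA file by the prefix of the record IDs, since miRBase IDs begin with the species code (for example sly- or hsa-). The following is a small sketch under that assumption. The function names are made up for illustration, and the records are placeholders, not real miRBase entries.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_for_species(text, code):
    """Return the records whose ID starts with '<code>-'."""
    prefix = code + "-"
    return [(h, s) for h, s in parse_fasta(text) if h.startswith(prefix)]


# Placeholder records for illustration only; not real miRBase sequences.
TOY_FASTA = """>sly-toy-1 placeholder record
ACGUACGUACGUACGUACGUA
>hsa-toy-1 placeholder record
UGCAUGCAUGCAUGCAUGCAU
"""

sly_records = records_for_species(TOY_FASTA, "sly")
print(len(sly_records))
```

An empty result for a code means no known miRNAs would be available for annotation under that code in the file that was checked.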

The analysis did not seem to produce many results; see the file list at the bottom of this comment. Then again, I am not sure what to expect.

The lack of output might in part be because the mirdeep2 Rfam file was not installed or not found. Should I have installed it myself?

[2018-06-28T03:08Z] exeuction_node_24: Aligning lane DA_1164_01 with star aligner
[2018-06-28T03:11Z] exeuction_node_24: mirdeep2 Rfam file not instaled. Skipping...

What I more or less expect as output of a microRNAseq analysis is:

  • identification/filtering of known or discovered non-microRNA sequences (either biological, e.g. other RNAs, or adapters)
  • identification of known microRNA sequences from miRBase or a similar database
  • identification of new microRNAs
  • per-sample alignment BAM files of the microRNA sequences (I am not sure whether these should be aligned against the genome, the transcriptome, or both, nor whether the alignments identify target loci/mRNAs, microRNA precursor loci/mRNAs, or both)
  • microRNA target mRNA/gene prediction
  • microRNA quantification

Do you think it is possible to get the above results using bcbio for microRNAseq data of a non-model organism?
How would I do that using bcbio? Is the YAML that I use correct? Should I add tRNA as an expression caller?

Since I am new to microRNAseq, the bcbio microRNAseq documentation is also a bit short for me.
I would also very much appreciate it if you could point me to a recent best-practice method or review paper that describes the method(s) bcbio aims to provide for microRNAseq analysis.

Thank you very much!

final/
final/DA_1164_05
final/DA_1164_05/qc
final/DA_1164_05/qc/fastqc
final/DA_1164_05/qc/fastqc/fastqc_report.html
final/DA_1164_05/qc/fastqc/fastqc_data.txt
final/DA_1164_05/qc/fastqc/DA_1164_05.zip
final/DA_1164_05/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_05/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_05/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_05/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_05/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_05/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_05/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_05/DA_1164_05-ready.trimming_stats
final/DA_1164_04
final/DA_1164_04/qc
final/DA_1164_04/qc/fastqc
final/DA_1164_04/qc/fastqc/fastqc_report.html
final/DA_1164_04/qc/fastqc/fastqc_data.txt
final/DA_1164_04/qc/fastqc/DA_1164_04.zip
final/DA_1164_04/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_04/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_04/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_04/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_04/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_04/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_04/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_04/DA_1164_04-ready.trimming_stats
final/DA_1164_09
final/DA_1164_09/qc
final/DA_1164_09/qc/fastqc
final/DA_1164_09/qc/fastqc/fastqc_report.html
final/DA_1164_09/qc/fastqc/fastqc_data.txt
final/DA_1164_09/qc/fastqc/DA_1164_09.zip
final/DA_1164_09/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_09/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_09/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_09/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_09/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_09/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_09/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_09/DA_1164_09-ready.trimming_stats
final/DA_1164_08
final/DA_1164_08/qc
final/DA_1164_08/qc/fastqc
final/DA_1164_08/qc/fastqc/fastqc_report.html
final/DA_1164_08/qc/fastqc/fastqc_data.txt
final/DA_1164_08/qc/fastqc/DA_1164_08.zip
final/DA_1164_08/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_08/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_08/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_08/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_08/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_08/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_08/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_08/DA_1164_08-ready.trimming_stats
final/DA_1164_07
final/DA_1164_07/qc
final/DA_1164_07/qc/fastqc
final/DA_1164_07/qc/fastqc/fastqc_report.html
final/DA_1164_07/qc/fastqc/fastqc_data.txt
final/DA_1164_07/qc/fastqc/DA_1164_07.zip
final/DA_1164_07/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_07/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_07/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_07/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_07/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_07/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_07/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_07/DA_1164_07-ready.trimming_stats
final/DA_1164_06
final/DA_1164_06/qc
final/DA_1164_06/qc/fastqc
final/DA_1164_06/qc/fastqc/fastqc_report.html
final/DA_1164_06/qc/fastqc/fastqc_data.txt
final/DA_1164_06/qc/fastqc/DA_1164_06.zip
final/DA_1164_06/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_06/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_06/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_06/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_06/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_06/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_06/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_06/DA_1164_06-ready.trimming_stats
final/DA_1164_01
final/DA_1164_01/qc
final/DA_1164_01/qc/fastqc
final/DA_1164_01/qc/fastqc/fastqc_report.html
final/DA_1164_01/qc/fastqc/fastqc_data.txt
final/DA_1164_01/qc/fastqc/DA_1164_01.zip
final/DA_1164_01/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_01/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_01/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_01/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_01/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_01/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_01/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_01/qc/small-rna
final/DA_1164_01/qc/small-rna/DA_1164_01.txt
final/DA_1164_01/DA_1164_01-ready.trimming_stats
final/DA_1164_03
final/DA_1164_03/qc
final/DA_1164_03/qc/fastqc
final/DA_1164_03/qc/fastqc/fastqc_report.html
final/DA_1164_03/qc/fastqc/fastqc_data.txt
final/DA_1164_03/qc/fastqc/DA_1164_03.zip
final/DA_1164_03/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_03/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_03/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_03/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_03/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_03/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_03/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_03/DA_1164_03-ready.trimming_stats
final/DA_1164_02
final/DA_1164_02/qc
final/DA_1164_02/qc/fastqc
final/DA_1164_02/qc/fastqc/fastqc_report.html
final/DA_1164_02/qc/fastqc/fastqc_data.txt
final/DA_1164_02/qc/fastqc/DA_1164_02.zip
final/DA_1164_02/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_02/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_02/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_02/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_02/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_02/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_02/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_02/DA_1164_02-ready.trimming_stats
final/DA_1164_10
final/DA_1164_10/qc
final/DA_1164_10/qc/fastqc
final/DA_1164_10/qc/fastqc/fastqc_report.html
final/DA_1164_10/qc/fastqc/fastqc_data.txt
final/DA_1164_10/qc/fastqc/DA_1164_10.zip
final/DA_1164_10/qc/fastqc/Per_base_sequence_quality.tsv
final/DA_1164_10/qc/fastqc/Per_tile_sequence_quality.tsv
final/DA_1164_10/qc/fastqc/Per_sequence_quality_scores.tsv
final/DA_1164_10/qc/fastqc/Per_base_sequence_content.tsv
final/DA_1164_10/qc/fastqc/Per_sequence_GC_content.tsv
final/DA_1164_10/qc/fastqc/Per_base_N_content.tsv
final/DA_1164_10/qc/fastqc/Sequence_Length_Distribution.tsv
final/DA_1164_10/DA_1164_10-ready.trimming_stats
final/2018-06-28_DA_1164_samples-merged
final/2018-06-28_DA_1164_samples-merged/programs.txt
final/2018-06-28_DA_1164_samples-merged/bcbio-nextgen.log
final/2018-06-28_DA_1164_samples-merged/bcbio-nextgen-commands.log
final/2018-06-28_DA_1164_samples-merged/project-summary.yaml
final/2018-06-28_DA_1164_samples-merged/report
final/2018-06-28_DA_1164_samples-merged/report/srna_report.rmd
final/2018-06-28_DA_1164_samples-merged/report/summary.csv
final/2018-06-28_DA_1164_samples-merged/multiqc
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_report.html
final/2018-06-28_DA_1164_samples-merged/multiqc/report
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_08_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_04_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_07_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_06_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_02_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_05_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_10_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_01_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_09_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/report/metrics/DA_1164_03_bcbio.txt
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_config.yaml
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_data
final/2018-06-28_DA_1164_samples-merged/multiqc/multiqc_data/multiqc_data_final.json
final/2018-06-28_DA_1164_samples-merged/multiqc/list_files_final.txt
final/2018-06-28_DA_1164_samples-merged/seqcluster
final/2018-06-28_DA_1164_samples-merged/seqcluster/log
final/2018-06-28_DA_1164_samples-merged/seqcluster/log/run.log
final/2018-06-28_DA_1164_samples-merged/seqcluster/log/trace.log
final/2018-06-28_DA_1164_samples-merged/seqcluster/seqs_rmlw.bam_cov.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/read_stats.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/cluster.bed
final/2018-06-28_DA_1164_samples-merged/seqcluster/list_obj.pk
final/2018-06-28_DA_1164_samples-merged/seqcluster/list_obj_red.pk
final/2018-06-28_DA_1164_samples-merged/seqcluster/counts.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/size_counts.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/positions.bed
final/2018-06-28_DA_1164_samples-merged/seqcluster/counts_sequence.tsv
final/2018-06-28_DA_1164_samples-merged/seqcluster/seqcluster.json
final/2018-06-28_DA_1164_samples-merged/seqclusterViz
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log/run.log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/log/trace.log
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles/344
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/profiles/5
final/2018-06-28_DA_1164_samples-merged/seqclusterViz/seqcluster.db
@lpantano
Collaborator

lpantano commented Jun 29, 2018 via email

@WimSpee
Contributor Author

WimSpee commented Jul 4, 2018

Hi Lorena Pantano.

Thank you for the information. I did not know that plant-specific tools were needed. Do you mean either of these two tools?
miRPlant: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-275
miRDeep-P: https://academic.oup.com/bioinformatics/article/27/18/2614/181153

The first paper mentions that different tools are needed because miRNA precursors are longer in plants than in animals:

The most challenging problem in identifying novel plant miRNA is to find a suitable genomic region as a miRNA precursor candidate (to test whether it forms hairpins) because the majority of precursor miRNA in plants are between 100-200 bp [4], which is much longer than those in animals.

Do you know if there are other reasons why plant-specific miRNA tools are needed?

I will take a look at the seqcluster results.

I will try with trim_reads: True.

I will also try the analysis with Solanum lycopersicum (miRBase code sly) as the known miRNA data set. That species is fairly closely related (it is also in the nightshade family), and miRNA sequences are conserved in plants according to one of the papers above.
http://www.mirbase.org/cgi-bin/mirna_summary.pl?org=sly

Do I need to do anything to make use of the known sly miRNA sequences?

Do you know how, and by whom, the sequences for a species get added to miRBase?

It would be nice if the microRNAseq functionality of bcbio worked for plants. I had hoped it would, and did not expect to need plant-specific tools. At the same time, I understand that your primary focus is on other species, so it would only make sense if it is not too much work or if we could do part of it.

Thanks again for the information.

@lpantano
Collaborator

lpantano commented Jul 4, 2018 via email

@WimSpee
Contributor Author

WimSpee commented Jul 11, 2018

Hi @lpantano. I tried to run the same analysis with trim_reads: true and species: sly.

This resulted in the following error during adapter removal:

[2018-07-10T16:51Z] execution_machine_x2: 2018-07-10 18:51:49,585 INFO: This is Atropos 1.1.18 with Python 3.6.5
[2018-07-10T16:51Z] execution_machine_x2: 2018-07-10 18:51:49,590 INFO: Trimming 0 adapter with at most 10.0% errors in single-end mode ...
[2018-07-10T16:52Z] execution_machine_x2: =======
[2018-07-10T16:52Z] execution_machine_x2: Atropos
[2018-07-10T16:52Z] execution_machine_x2: =======
[2018-07-10T16:52Z] execution_machine_x2: Atropos version: 1.1.18
[2018-07-10T16:52Z] execution_machine_x2: Python version: 3.6.5
[2018-07-10T16:52Z] execution_machine_x2: Command line parameters: trim --max-reads 500000 -u 22 -se /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_10.fastq.gz -o /data/run/Projects/DA-1164/DA_1164_samples-merged/work/bcbiotx/tmpHUH_oP/DA_1164_10end.fastq.gz
[2018-07-10T16:52Z] execution_machine_x2: Sample ID: DA_1164_10
[2018-07-10T16:52Z] execution_machine_x2: Input format: FASTQ, Read 1, w/ Qualities
[2018-07-10T16:52Z] execution_machine_x2: Input files:
[2018-07-10T16:52Z] execution_machine_x2:   /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_10.fastq.gz
[2018-07-10T16:52Z] execution_machine_x2: Start time: 2018-07-10T18:51:49.589469
[2018-07-10T16:52Z] execution_machine_x2: Wallclock time: 15.54 s (31 us/read; 1.93 M reads/minute)
[2018-07-10T16:52Z] execution_machine_x2: CPU time (main process): 11.20 s
[2018-07-10T16:52Z] execution_machine_x2: --------
[2018-07-10T16:52Z] execution_machine_x2: Trimming
[2018-07-10T16:52Z] execution_machine_x2: --------
[2018-07-10T16:52Z] execution_machine_x2: Reads                                  records   fraction
[2018-07-10T16:52Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T16:52Z] execution_machine_x2: Total reads processed:                 500,000
[2018-07-10T16:52Z] execution_machine_x2: Reads written (passing filters):       500,000     100.0%
[2018-07-10T16:52Z] execution_machine_x2: Base pairs                                  bp   fraction
[2018-07-10T16:52Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T16:52Z] execution_machine_x2: Total bp processed:                 25,500,000
[2018-07-10T16:52Z] execution_machine_x2: Cut unconditionally                 11,000,000      43.1%
[2018-07-10T16:52Z] execution_machine_x2: Total bp written (filtered):        14,500,000      56.9%
[2018-07-10T16:52Z] execution_machine_x2: Unexpected error
Traceback (most recent call last):
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 51, in _setup_logging
    yield config
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 92, in trim_srna_sample
    return ipython.zip_args(apply(srna.trim_srna_sample, *args))
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 61, in trim_srna_sample
    adapters = adapter if adapter else _dnapi_prediction(in_file, out_dir)
  File "/home/my_user/workspace/tmp_bcbio_1.1.0_development/data_dir/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 157, in _dnapi_prediction
    max_score = iterative_result[1][1]
IndexError: list index out of range
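
The failing expression in the traceback, iterative_result[1][1], raises IndexError whenever the list has fewer than two entries. Below is a minimal sketch of how such a lookup could be guarded. It assumes the prediction result is a list of (adapter, score) pairs, which the "Adding adapter to the list: ... with score ..." log lines suggest but which is not confirmed in this thread. The function name select_adapters is hypothetical and this is not bcbio's code.

```python
def select_adapters(candidates):
    """Return up to two adapter sequences from (adapter, score) pairs.

    Assumed input shape: [(sequence, score), ...], best candidate first.
    """
    if not candidates:
        # Nothing was predicted: fail with an actionable message
        # instead of an IndexError deep in the pipeline.
        raise ValueError(
            "No adapter predicted; set 'adapters' explicitly in the config."
        )
    # Slicing never raises, so a single candidate is handled as well.
    return [sequence for sequence, _score in candidates[:2]]


# Scores copied from the DA_1164_01 log lines in this thread.
two = select_adapters([("TGGAATTCTCGGG", 282.8354), ("GGTGCCAAGGAA", 78.588)])
one = select_adapters([("TGGAATTCTCGGG", 282.8354)])
print(two, one)
```

The point of the sketch is only that a short or empty prediction list can be turned into a clear message about setting the adapter by hand.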

This seems to be sample-specific; other samples do not seem to run into this error.

[2018-07-10T12:54Z] execution_machine_x2: 2018-07-10 14:54:13,644 INFO: This is Atropos 1.1.18 with Python 3.6.5
[2018-07-10T12:54Z] execution_machine_x2: 2018-07-10 14:54:13,655 INFO: Trimming 0 adapter with at most 10.0% errors in single-end mode ...
[2018-07-10T12:54Z] execution_machine_x2: =======
[2018-07-10T12:54Z] execution_machine_x2: Atropos
[2018-07-10T12:54Z] execution_machine_x2: =======
[2018-07-10T12:54Z] execution_machine_x2: Atropos version: 1.1.18
[2018-07-10T12:54Z] execution_machine_x2: Python version: 3.6.5
[2018-07-10T12:54Z] execution_machine_x2: Command line parameters: trim --max-reads 500000 -u 22 -se /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_01.fastq.gz -o /data/run/Projects/DA-1164/DA_1164_samples-merged/work/bcbiotx/tmpEtG7q8/DA_1164_01end.fastq.gz
[2018-07-10T12:54Z] execution_machine_x2: Sample ID: DA_1164_01
[2018-07-10T12:54Z] execution_machine_x2: Input format: FASTQ, Read 1, w/ Qualities
[2018-07-10T12:54Z] execution_machine_x2: Input files:
[2018-07-10T12:54Z] execution_machine_x2:   /data/run/Projects/DA-1164/input_fastq/concat/DA_1164_01.fastq.gz
[2018-07-10T12:54Z] execution_machine_x2: Start time: 2018-07-10T14:54:13.654689
[2018-07-10T12:54Z] execution_machine_x2: Wallclock time: 15.33 s (31 us/read; 1.96 M reads/minute)
[2018-07-10T12:54Z] execution_machine_x2: CPU time (main process): 11.04 s
[2018-07-10T12:54Z] execution_machine_x2: --------
[2018-07-10T12:54Z] execution_machine_x2: Trimming
[2018-07-10T12:54Z] execution_machine_x2: --------
[2018-07-10T12:54Z] execution_machine_x2: Reads                                  records   fraction
[2018-07-10T12:54Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T12:54Z] execution_machine_x2: Total reads processed:                 500,000
[2018-07-10T12:54Z] execution_machine_x2: Reads written (passing filters):       500,000     100.0%
[2018-07-10T12:54Z] execution_machine_x2: Base pairs                                  bp   fraction
[2018-07-10T12:54Z] execution_machine_x2: ----------------------------------- ---------- ----------
[2018-07-10T12:54Z] execution_machine_x2: Total bp processed:                 25,500,000
[2018-07-10T12:54Z] execution_machine_x2: Cut unconditionally                 11,000,000      43.1%
[2018-07-10T12:54Z] execution_machine_x2: Total bp written (filtered):        14,500,000      56.9%
[2018-07-10T12:54Z] execution_machine_x2: Adding adapter to the list: TGGAATTCTCGGG with score 282.8354
[2018-07-10T12:54Z] execution_machine_x2: Adding adapter to the list: GGTGCCAAGGAA with score 78.588
[2018-07-10T12:54Z] execution_machine_x2: remove adapter for DA_1164_01
[2018-07-10T14:02Z] execution_machine_x2: Collapsing /data/run/Projects/DA-1164/DA_1164_samples-merged/work/trimmed/DA_1164_01/DA_1164_01.clean.fastq.gz with --min_size 16 --min 1

For this sample I am also not sure why Atropos is run before _dnapi_prediction, or where Atropos gets the TGGAATTCTCGGG and GGTGCCAAGGAA adapters from.
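
The numbers in the Atropos report can be checked with a few lines of arithmetic. The Atropos call in the log has no adapter ("Trimming 0 adapter"), only -u 22 on a subsample of 500,000 reads, and the "Adding adapter to the list" lines appear after it. One plausible reading, not confirmed in this thread, is that this first call only cuts the first 22 bases so that adapter prediction sees the 3' end of the reads, and that the two adapters are predicted rather than preconfigured. The sketch below only verifies the per-read numbers from the log.

```python
# Values copied from the Atropos report in the log above.
reads = 500_000          # --max-reads 500000
total_bp = 25_500_000    # "Total bp processed"
cut_bp = 11_000_000      # "Cut unconditionally"
written_bp = 14_500_000  # "Total bp written (filtered)"

read_length = total_bp // reads      # bases per read before cutting
cut_per_read = cut_bp // reads       # should match the -u 22 option
kept_per_read = written_bp // reads  # 3' end left for adapter prediction

print(read_length, cut_per_read, kept_per_read)
print(round(100 * cut_bp / total_bp, 1))
```

This gives 51, 22 and 29 bases per read, and 43.1% of bases cut, which matches the report.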

@lpantano
Collaborator

Hi,

sorry about this. It seems that the tool we use to predict the adapter is not working in this case. If you know the 3' adapter, I suggest adding it to the adapters: [] option in the config file:

https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/illumina-srnaseq.yaml#L8
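
As a hedged aside on which sequence might go into adapters: the two fragments predicted for DA_1164_01 earlier in this thread (TGGAATTCTCGGG and GGTGCCAAGGAA) overlap, and the merged sequence starts with the adapter shown commented out in the original YAML. The sketch below only demonstrates that string relationship. It does not confirm which kit was used, and the helper name merge_overlapping is made up for illustration.

```python
def merge_overlapping(left, right):
    """Join two fragments on their longest suffix/prefix overlap."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    # No overlap found: plain concatenation.
    return left + right


# Fragments copied from the "Adding adapter to the list" log lines.
merged = merge_overlapping("TGGAATTCTCGGG", "GGTGCCAAGGAA")

# Adapter from the commented-out line in the YAML posted above.
template_adapter = "TGGAATTCTCGGGTGC"

print(merged)
print(merged.startswith(template_adapter))
```

If the sequencing core confirms the same adapter, that sequence would be a natural candidate for the adapters list.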

If you don't know it, you can ask the sequencing core.

In that case, I suggest restarting the analysis from scratch.

Let me know if that helps.

@roryk
Collaborator

roryk commented Aug 10, 2019

Thanks, closing this as it seems like it's been answered.

@roryk roryk closed this as completed Aug 10, 2019