calling fastq-dump install from conda doesn't work #196

njbowen · 2022-03-15T22:25:37Z

When testing the "Run a realistic analysis" section ( https://www.biostarhandbook.com/software-installation.html#run-a-realistic-analysis ) just now, the fastq-dump that the handbook installs via sratoolkit (with conda, I assume) doesn't work on a new install on Windows/Ubuntu 20.04.4 LTS.

The error was:
$ make vcf
mkdir -p reads
fastq-dump -F -X 10000 --split-files -O reads SRR1553425
Failed to call external services.
make: *** [Makefile:110: reads/SRR1553425_1.fastq] Error 64

I had to download the latest Ubuntu build of sratoolkit from NCBI at
http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

I unpacked it into the home/myusername/bin folder created for doctor.py
and added the toolkit's bin directory to the PATH in .bashrc:
export PATH=~/bin/sratoolkit.2.11.2-ubuntu64/bin:$PATH
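
For anyone repeating this workaround, here is a minimal sketch of the PATH step that avoids adding the same directory twice when .bashrc is sourced more than once. The helper name prepend_path is hypothetical, and the directory name is the one from this post, so adjust it to whatever version you unpacked.

```shell
# Prepend a directory to PATH unless it is already listed.
# prepend_path is a hypothetical helper, not part of sratoolkit.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;     # otherwise put it at the front
    esac
}

# Directory name taken from the post above; adjust to the unpacked version.
prepend_path "$HOME/bin/sratoolkit.2.11.2-ubuntu64/bin"
export PATH

# Show which fastq-dump the shell resolves first.
command -v fastq-dump || echo "fastq-dump not found on PATH"
```

Note that activating a conda environment later may put the environment's own bin directory ahead of this one again, so it is worth re-checking the last command after activation.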

That seemed to work better. The output is pasted below, down to a new error:
$ make vcf
mkdir -p reads
fastq-dump -F -X 10000 --split-files -O reads SRR1553425
Read 10000 spots for SRR1553425
Written 10000 spots for SRR1553425

Will generate both adapters.

echo ">illumina" > reads/adapter.fa
echo "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" >> reads/adapter.fa
echo ">nextera" >> reads/adapter.fa
echo "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC" >> reads/adapter.fa

Apply the trimming.

trimmomatic PE -threads 4 -phred33 -basein reads/SRR1553425_1.fastq -baseout reads/SRR1553425.fq
ILLUMINACLIP:reads/adapter.fa:2:30:5 SLIDINGWINDOW:4:15 MINLEN:50
TrimmomaticPE: Started with arguments:
-threads 4 -phred33 -basein reads/SRR1553425_1.fastq -baseout reads/SRR1553425.fq ILLUMINACLIP:reads/adapter.fa:2:30:5 SLIDINGWINDOW:4:15 MINLEN:50
Using templated Input files: reads/SRR1553425_1.fastq reads/SRR1553425_2.fastq
Using templated Output files: reads/SRR1553425_1P.fq reads/SRR1553425_1U.fq reads/SRR1553425_2P.fq reads/SRR1553425_2U.fq
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 10000 Both Surviving: 9834 (98.34%) Forward Only Surviving: 96 (0.96%) Reverse Only Surviving: 30 (0.30%) Dropped: 40 (0.40%)
TrimmomaticPE: Completed successfully
mkdir -p bam

Note how we filter alignment for mapped reads only.

bwa mem -t 4 refs/AF086833.fa reads/SRR1553425_1P.fq reads/SRR1553425_2P.fq | samtools view -b -F 4 | samtools sort -@ 4 > bam/SRR1553425-AF086833.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 19668 sequences (1965522 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1054, 7603, 33, 974)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (99, 177, 270)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 612)
[M::mem_pestat] mean and std.dev: (193.76, 119.01)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 783)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (193, 280, 387)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 775)
[M::mem_pestat] mean and std.dev: (298.31, 140.61)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 969)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (39, 98, 259)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 699)
[M::mem_pestat] mean and std.dev: (146.67, 163.09)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 919)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (105, 177, 279)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 627)
[M::mem_pestat] mean and std.dev: (199.78, 122.85)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 801)
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 19668 reads in 1.062 CPU sec, 0.276 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4[bam_sort_core] merging from 0 files and 4 in-memory blocks...
refs/AF086833.fa reads/SRR1553425_1P.fq reads/SRR1553425_2P.fq
[main] Real time: 0.376 sec; CPU: 1.094 sec
samtools index bam/SRR1553425-AF086833.bam
samtools flagstat bam/SRR1553425-AF086833.bam
20615 + 0 in total (QC-passed reads + QC-failed reads)
19569 + 0 primary
0 + 0 secondary
1046 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
20615 + 0 mapped (100.00% : N/A)
19569 + 0 primary mapped (100.00% : N/A)
19569 + 0 paired in sequencing
9785 + 0 read1
9784 + 0 read2
19478 + 0 properly paired (99.53% : N/A)
19568 + 0 with itself and mate mapped
1 + 0 singletons (0.01% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
mkdir -p vcf
bcftools mpileup -O v -f refs/AF086833.fa bam/SRR1553425-AF086833.bam | bcftools call --ploidy 1 -mv -O z -o vcf/SRR1553425-AF086833.vcf.gz
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
bcftools index vcf/SRR1553425-AF086833.vcf.gz

VCF file: vcf/SRR1553425-AF086833.vcf.gz

Snpeff needs the files in specific folders.

mkdir -p data/AF086833

Download the GenBank file, has to be called genes.gbk.

bio fetch AF086833 > data/AF086833/genes.gbk

Append entry to current genome to the config.

echo "AF086833.genome : AF086833" >> snpEff.config

Build the snpEff database.

snpEff build -v AF086833
00:00:00 SnpEff version SnpEff 5.1 (build 2022-01-21 06:23), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'AF086833'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AF086833'
00:00:00 Reading config file: /home/nbowen/work/snpEff.config
00:00:00 done
00:00:00 Chromosome: 'AF086833.2' length: 18959
00:00:00 Create exons from CDS (if needed):
...........00:00:00 Exons created for 9 transcripts.
00:00:00 Deleting redundant exons (if needed):
00:00:00 Total transcripts with deleted exons: 0
00:00:00 Collapsing zero length introns (if needed):
.
0 00:00:00 Total collapsed transcripts: 1
00:00:00 Adding genomic sequences to genes:
00:00:00 Done (4 sequences added).
00:00:00 Adding genomic sequences to exons:
00:00:00 Done (10 sequences added, 0 ignored).
00:00:00 Finishing up genome
00:00:00 Adjusting transcripts:
00:00:00 Adjusting genes:
WARNING_GENE_COORDINATES: Gene 'Gene_8287_9739' (name:'VP30'), adjusting start coordinate from 8287 to 8508
WARNING_GENE_COORDINATES: Gene 'Gene_8287_9739' (name:'VP30'), adjusting end coordinate from 9739 to 9374
WARNING_GENE_COORDINATES: Gene 'Gene_9884_11517' (name:'VP24'), adjusting start coordinate from 9884 to 10344
WARNING_GENE_COORDINATES: Gene 'Gene_9884_11517' (name:'VP24'), adjusting end coordinate from 11517 to 11099
WARNING_GENE_COORDINATES: Gene 'Gene_11500_18281' (name:'L'), adjusting start coordinate from 11500 to 11580
WARNING_GENE_COORDINATES: Gene 'Gene_11500_18281' (name:'L'), adjusting end coordinate from 18281 to 18218
WARNING_GENE_COORDINATES: Gene 'Gene_3031_4406' (name:'VP35'), adjusting start coordinate from 3031 to 3128
WARNING_GENE_COORDINATES: Gene 'Gene_3031_4406' (name:'VP35'), adjusting end coordinate from 4406 to 4150
WARNING_GENE_COORDINATES: Gene 'Gene_55_3025' (name:'NP'), adjusting start coordinate from 55 to 469
WARNING_GENE_COORDINATES: Gene 'Gene_55_3025' (name:'NP'), adjusting end coordinate from 3025 to 2688
WARNING_GENE_COORDINATES: Gene 'Gene_4389_5893' (name:'VP40'), adjusting start coordinate from 4389 to 4478
WARNING_GENE_COORDINATES: Gene 'Gene_4389_5893' (name:'VP40'), adjusting end coordinate from 5893 to 5458
WARNING_GENE_COORDINATES: Gene 'Gene_5899_8304' (name:'GP'), adjusting start coordinate from 5899 to 6038
WARNING_GENE_COORDINATES: Gene 'Gene_5899_8304' (name:'GP'), adjusting end coordinate from 8304 to 8067
00:00:00 Adjusting chromosomes lengths:
00:00:00 Ranking exons:
00:00:00 Create UTRs from CDS (if needed):
00:00:00 Remove empty chromosomes:
00:00:00 Marking as 'coding' from CDS information:
00:00:00 Done: 0 transcripts marked
00:00:00
00:00:00 Caracterizing exons by splicing (stage 1) :

00:00:00 Caracterizing exons by splicing (stage 2) :
00:00:00 done.
00:00:00 [Optional] Rare amino acid annotations
WARNING_FILE_NOT_FOUND: Rare Amino Acid analysis: Cannot read protein sequence file '/home/nbowen/work/./data/AF086833/protein.fa', nothing done.
ERROR: CDS check file '/home/nbowen/work/./data/AF086833/cds.fa' not found.
00:00:00 Protein check file: '/home/nbowen/work/./data/AF086833/genes.gbk'

00:00:00 Checking database using protein sequences
00:00:00 Comparing Proteins...
Labels:
'+' : OK
'.' : Missing
'*' : Error
++++++++*00:00:00

    Protein check:  AF086833        OK: 8   Not found: 0    Errors: 1       Error percentage: 11.11111111111111%

00:00:00 Protein sequences comparison failed!
ERROR: Database check failed.
00:00:00 Logging
00:00:01 Checking for updates...
00:00:01 Done.
make: *** [Makefile:154: data/AF086833/snpEffectPredictor.bin] Error 255
(bioinfo)

then:
$ ls
Makefile bam data reads refs snpEff.config sratoolkit.tar.gz vcf
(bioinfo)

$ find .
.
./bam
./bam/SRR1553425-AF086833.bam
./bam/SRR1553425-AF086833.bam.bai
./data
./data/AF086833
./data/AF086833/genes.gbk
./Makefile
./reads
./reads/adapter.fa
./reads/SRR1553425_1.fastq
./reads/SRR1553425_1P.fq
./reads/SRR1553425_1U.fq
./reads/SRR1553425_2.fastq
./reads/SRR1553425_2P.fq
./reads/SRR1553425_2U.fq
./refs
./refs/AF086833.fa
./refs/AF086833.fa.amb
./refs/AF086833.fa.ann
./refs/AF086833.fa.bwt
./refs/AF086833.fa.fai
./refs/AF086833.fa.pac
./refs/AF086833.fa.sa
./refs/AF086833.gff
./refs/NC_045512.fa
./refs/NC_045512.fa.amb
./refs/NC_045512.fa.ann
./refs/NC_045512.fa.bwt
./refs/NC_045512.fa.fai
./refs/NC_045512.fa.pac
./refs/NC_045512.fa.sa
./refs/NC_045512.gff
./snpEff.config
./sratoolkit.tar.gz
./vcf
./vcf/SRR1553425-AF086833.vcf.gz
./vcf/SRR1553425-AF086833.vcf.gz.csi
(bioinfo)

ialbert · 2022-03-15T23:52:58Z

ialbert
Mar 15, 2022
Maintainer

There used to be a problem with fastq-dump that I reported, and it was fixed here:

bioconda/bioconda-recipes#31396

Can you run fastq-dump (the one from the environment) and report which version fails?

~/miniconda3/envs/bioinfo/bin/fastq-dump

There is also this issue, which seems to occur with the conda-based fastq-dump as well:

ncbi/sra-tools#467
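
One possible way to collect that information is sketched below: it prints the path the shell resolves for a tool and then asks the tool for its version. The function name report_tool is hypothetical. It assumes the tool accepts --version, which fastq-dump does.

```shell
# Print where a tool is found on PATH and what version it reports.
# report_tool is a hypothetical helper used only for this illustration.
report_tool() {
    tool_path=$(command -v "$1") || { echo "$1: not found on PATH"; return 1; }
    echo "path: $tool_path"
    "$tool_path" --version
}

# Report on whichever fastq-dump comes first on PATH.
report_tool fastq-dump || true
```

To check the conda copy specifically, the full path from the message above can be run directly with --version instead.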


ialbert · 2022-03-16T02:10:40Z

ialbert
Mar 16, 2022
Maintainer

I have also noticed that even when it works, it takes a very long time to run, even for the test command that the official SRA page recommends:

fastq-dump --stdout -X 2 SRR390728

It takes almost a minute to run. I wonder whether this has to do with some flakiness at NIH.

Actually, a few seconds later, the same command that executed fine before now exits with this error:

2022-03-16T02:08:36 fastq-dump.2.8.0 sys: connection failed while opening file within cryptographic module 
- mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )
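
If the failures really are intermittent, a retry wrapper is one way to ride them out. The sketch below is a generic POSIX shell loop, not something sra-tools provides. The function name retry and the choice of 3 attempts with a 30 second pause are assumptions for illustration.

```shell
# Run a command up to a given number of times, pausing between attempts.
# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
# retry is a hypothetical helper; its variables are global (plain POSIX sh).
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s" >&2
        sleep "$delay"
        n=$((n + 1))
    done
}

# Example call (commented out because it needs network access):
# retry 3 30 fastq-dump --stdout -X 2 SRR390728
```

This only helps with transient network errors. A version that fails every time will simply fail after the last attempt.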


ialbert · 2022-03-16T15:20:49Z

ialbert
Mar 16, 2022
Maintainer

I was unable to reproduce the fastq-dump error, but it is something that occasionally occurs. I will add a note to the book about manually installing fastq-dump.

As for the snpEff problem, it seems to be caused by the recently released new version of snpEff, which does not appear to operate correctly.

I have opened an issue with snpEff:

pcingola/SnpEff#388

In the meantime, the solution is to downgrade to snpEff 5.0:

mamba install snpEff==5.0

The version lock will be added to the install instructions.
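
After downgrading, it may be worth confirming which version actually ended up in the environment before rerunning make. The sketch below is only an illustration: is_snpeff_5_0 is a hypothetical helper, and the commented usage assumes the package shows up as lowercase "snpeff" in the conda list output.

```shell
# Succeed only for 5.0-series version strings such as 5.0 or 5.0e.
# is_snpeff_5_0 is a hypothetical helper, not part of snpEff or conda.
is_snpeff_5_0() {
    case "$1" in
        5.0|5.0[!0-9]*) return 0 ;;   # exactly 5.0, or 5.0 plus a non-digit suffix
        *) return 1 ;;                # anything else, including 5.1
    esac
}

# Usage sketch (assumes conda is on PATH and the bioinfo environment is active):
# installed=$(conda list | awk '$1 == "snpeff" {print $2}')
# is_snpeff_5_0 "$installed" || echo "snpEff $installed is installed; expected 5.0" >&2
```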
