In [1]:
PROJECT_DIR=$(git rev-parse --show-toplevel)
cd "${PROJECT_DIR}/metagenomics"
ls

PRJEB13831.txt  process-data.ipynb


# Step 0: Pick data

We want to pick a random sample of 50 datasets from **PRJEB13831.txt**.

In [4]:
# Take column 2, drop the header line, deduplicate, and keep 50 values at random
cut -f 2 PRJEB13831.txt | tail -n +2 | sort -u | shuf -n 50 > /tmp/samples50.txt

In [8]:
(for i in `cat /tmp/samples50.txt`;
do
    grep "${i}" PRJEB13831.txt | head -n 1 | cut -f 3
done) > metagenomes.txt

In [9]:
head metagenomes.txt
wc -l metagenomes.txt

ERR1713335
ERR1713397
ERR1713406
ERR1713375
ERR1713374
ERR1713351
ERR1713352
ERR1713336
ERR1713400
ERR1713376
50 metagenomes.txt


Great. We now have one sequencing run accession for each of 50 randomly chosen samples.
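The lookup of one run per sample can also be done in a single pass over the table. The following is a minimal sketch, assuming (as the `cut` calls above imply) that `PRJEB13831.txt` is tab-separated with the sample accession in column 2 and the run accession in column 3. The function name `first_run_per_sample` is hypothetical.

```shell
# Hypothetical helper: for each distinct value in column 2 (assumed to be the
# sample accession), print column 3 (assumed to be the run accession) from the
# first row where that sample appears. The header line is skipped.
first_run_per_sample() {
    awk -F'\t' 'NR > 1 && !seen[$2]++ { print $3 }' "$1"
}

# Tiny made-up table, used only to demonstrate the helper.
demo=$(mktemp)
printf 'study\tsample\trun\n' > "$demo"
printf 'P1\tS1\tR1\nP1\tS1\tR2\nP1\tS2\tR3\n' >> "$demo"
first_run_per_sample "$demo"   # prints R1 then R3
rm -f "$demo"
```

On the real table this could be used as `first_run_per_sample PRJEB13831.txt | shuf -n 50 > metagenomes.txt`. It compares the sample accession as a whole field, whereas a `grep` on the accession would also match it as a substring of a longer one.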

# Step 1: Download data

As a first step, let's download the first 5 metagenomic datasets listed in `metagenomes.txt`. Because that list follows the shuffled sample order, these 5 are still a random subset.

In [1]:
for name in `head -n 5 metagenomes.txt`;
do
    set -x
    time fasterq-dump -o data/${name} --split-files ${name}
    time pigz data/${name}*
    set +x
done

+ fasterq-dump -o data/ERR1713335 --split-files ERR1713335
spots read      : 25,423,122
reads read      : 50,846,244
reads written   : 50,846,244

real	1m19.200s
user	2m20.918s
sys	0m51.724s
+ pigz data/ERR1713335_1.fastq data/ERR1713335_2.fastq

real	0m51.609s
user	43m26.651s
sys	0m17.653s
+ set +x
+ fasterq-dump -o data/ERR1713397 --split-files ERR1713397
spots read      : 44,621,047
reads read      : 89,242,094
reads written   : 89,242,094

real	3m28.542s
user	4m20.716s
sys	1m32.130s
+ pigz data/ERR1713397_1.fastq data/ERR1713397_2.fastq

real	1m44.627s
user	75m57.375s
sys	0m31.828s
+ set +x
+ fasterq-dump -o data/ERR1713406 --split-files ERR1713406
spots read      : 45,737,232
reads read      : 91,474,464
reads written   : 91,474,464

real	3m31.915s
user	4m32.552s
sys	1m36.749s
+ pigz data/ERR1713406_1.fastq data/ERR1713406_2.fastq

real	1m37.228s
user	78m37.324s
sys	0m32.397s
+ set +x
+ fasterq-dump -o data/ERR1713375 --split-files ERR1713375
spots read      : 21,799,125
reads rea

In [3]:
ls data

[0m[01;31mERR1713335_1.fastq.gz[0m  [01;31mERR1713375_1.fastq.gz[0m  [01;31mERR1713406_1.fastq.gz[0m
[01;31mERR1713335_2.fastq.gz[0m  [01;31mERR1713375_2.fastq.gz[0m  [01;31mERR1713406_2.fastq.gz[0m
[01;31mERR1713374_1.fastq.gz[0m  [01;31mERR1713397_1.fastq.gz[0m
[01;31mERR1713374_2.fastq.gz[0m  [01;31mERR1713397_2.fastq.gz[0m


In [4]:
du -sh data

30G	data


Awesome. We've downloaded and compressed all five metagenomes (ten paired-end FASTQ files, 30 GB in total).
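Before moving on, it can be worth confirming that both mates of every run are present. A minimal sketch, assuming the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming shown by `ls data`. The function name `missing_pairs` and the run names in the demonstration are hypothetical.

```shell
# Hypothetical helper: list every expected paired-end file that is missing or
# empty. $1 is a file of run accessions (one per line), $2 is the data directory.
# The body runs in a subshell so its variables do not leak into the caller.
missing_pairs() (
    while read -r name; do
        for mate in 1 2; do
            f="$2/${name}_${mate}.fastq.gz"
            [ -s "$f" ] || echo "$f"
        done
    done < "$1"
)

# Demonstration on a throwaway directory with made-up run names.
tmp=$(mktemp -d)
printf 'RUN_A\nRUN_B\n' > "$tmp/runs.txt"
echo x > "$tmp/RUN_A_1.fastq.gz"
echo x > "$tmp/RUN_A_2.fastq.gz"
echo x > "$tmp/RUN_B_1.fastq.gz"
missing_pairs "$tmp/runs.txt" "$tmp"   # reports only the RUN_B_2.fastq.gz path
rm -rf "$tmp"
```

For this notebook the call would look like `head -n 5 metagenomes.txt > /tmp/first5.txt; missing_pairs /tmp/first5.txt data`, which prints nothing when every file is in place. This only checks presence; `gzip -t` on each file would additionally test the compressed stream.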

# Step 2: Examining data

Now, let's use the software [dashing](https://github.com/dnbaker/dashing) (version `0.4.0`) to estimate the total number of unique k-mers in our dataset.

This software must be downloaded and placed in your `PATH`.

In [1]:
dashing_s512 --version
dashing_s512 hll -k 31 -p 32 data/*.fastq.gz

Dashing version: v0.4.0-5-g9c13
Dashing version: v0.4.0-5-g9c13
Dashing version: v0.4.0-5-g9c13
[int bns::hll_main(int, char**)] Processing 10 paths with 32 threads
Estimated number of unique exact matches: 12650023887.000000


This result tells us that, across all our metagenomes, there are approximately 1.3 × 10^10 (13 billion) unique 31-mers.
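Since this number is reused later, it can be convenient to pull it out of dashing's output instead of copying it by hand. A minimal sketch, assuming dashing keeps printing the estimate line in exactly the form shown above. The function name `parse_hll_estimate` is hypothetical.

```shell
# Hypothetical helper: read dashing's output on stdin and print the estimate as
# an integer. Assumes the "Estimated number of unique exact matches:" line has
# the format shown in the output above.
parse_hll_estimate() {
    awk -F': ' '/^Estimated number of unique exact matches:/ { printf "%.0f\n", $2 }'
}

# Demonstration using the line from the output above.
echo 'Estimated number of unique exact matches: 12650023887.000000' | parse_hll_estimate
# prints 12650023887
```

In practice the helper would sit at the end of a pipe, for example `dashing_s512 hll -k 31 -p 32 data/*.fastq.gz 2>&1 | parse_hll_estimate`. The `2>&1` is a precaution, since it is an assumption here which stream dashing writes the estimate to.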

Let's normalize this value based on the number of bases in our data to get a value of **unique k-mers/hundred bp**.

First, we have to count all the base pairs in our dataset.

In [3]:
# Code derived from https://bioinformatics.stackexchange.com/a/966
pigz -dc data/*.fastq.gz | awk 'NR%4==2{c++; l+=length($0)}
          END{
                print "Number of reads: "c;
                print "Number of bases in reads: "l
              }'

Number of reads: 329317224
Number of bases in reads: 49726900824
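The `NR%4==2` filter works because each FASTQ record spans four lines (header, sequence, `+` separator, qualities), so the sequence is always the second line of its group. A minimal sketch on a made-up two-read input, assuming unwrapped four-line records. The function name `count_fastq_bases` is hypothetical.

```shell
# Hypothetical helper: read uncompressed FASTQ on stdin and print
# "<reads> <bases>". Assumes each record is exactly four lines.
count_fastq_bases() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "%d %.0f\n", reads, bases }'
}

# Two made-up reads of 4 and 6 bases.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | count_fastq_bases
# prints: 2 10
```

The same function could be fed from `pigz -dc data/*.fastq.gz` to reproduce the totals above.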


Next, we convert this to our normalized value, **unique k-mers/hundred bp**.

In [6]:
KMERS=12650023887
BP=49726900824
echo "${KMERS}/(${BP}/100)" | bc -l

25.43899514625420043249
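The same arithmetic can be wrapped in a small function so it is easy to repeat for other datasets. A minimal sketch: the function name `kmers_per_100bp` is hypothetical, and `awk` is swapped in for `bc` only to control the rounding.

```shell
# Hypothetical helper: unique k-mers per 100 bp, rounded to two decimals.
# $1 = estimated unique k-mers, $2 = total bases.
kmers_per_100bp() {
    awk -v kmers="$1" -v bases="$2" 'BEGIN { printf "%.2f\n", kmers / (bases / 100) }'
}

kmers_per_100bp 12650023887 49726900824   # prints 25.44
```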


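Awesome. This dataset has about 25 unique 31-mers/100 bp.

Next, let's use the Kmer Analysis Toolkit (KAT) to compute the 31-mer frequency histogram for a single run, `ERR1713335`.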

In [2]:
conda run --name kat kat hist -o ERR1713335_kat_hist -t 32 -m 31 data/ERR1713335*.gz

Kmer Analysis Toolkit (KAT) V2.4.2

Running KAT in HIST mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (data/ERR1713335_1.fastq.gz data/ERR1713335_2.fastq.gz) ...
 success!

Awesome. KAT also writes a plot, `ERR1713335_kat_hist.png`. Let's use `convert` from ImageMagick to resize it to 600 pixels wide so we can display it inline.

In [7]:
ls
convert ERR1713335_kat_hist.png -resize 600x images/ERR1713335_kat_hist.png

[0m[01;34mdata[0m                                    [01;34mimages[0m
ERR1713335_kat_hist                     metagenomes.txt
ERR1713335_kat_hist.dist_analysis.json  PRJEB13831.txt
[01;35mERR1713335_kat_hist.png[0m                 process-data.ipynb


The final k-mer distribution is:

![ERR1713335_kat_hist.png](images/ERR1713335_kat_hist.png)

I'm not sure why the plot is empty, since the histogram file does contain data.

In [8]:
head ERR1713335_kat_hist

# Title:31-mer spectra for: ERR1713335_1.fastq.gz ERR1713335_2.fastq.gz
# XLabel:31-mer frequency
# YLabel:# distinct 31-mers
# Kmer value:31
# Input 1:data/ERR1713335_1.fastq.gz data/ERR1713335_2.fastq.gz
###
1 1547349168
2 365350996
3 126094578
4 72768428
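A quick way to confirm that the histogram file holds usable data is to summarize it directly. A minimal sketch, assuming the layout shown by `head` above (header lines starting with `#`, then `frequency count` pairs). The function name `hist_summary` is hypothetical.

```shell
# Hypothetical helper: read a KAT histogram on stdin, skip the "#" header
# lines, and print the total number of distinct k-mers plus the percentage
# that occur exactly once.
hist_summary() {
    awk '!/^#/ && NF == 2 { total += $2; if ($1 == 1) once = $2 }
         END { if (total > 0) printf "%.0f distinct, %.2f%% singletons\n", total, 100 * once / total }'
}

# Made-up histogram with the same layout.
printf '# Title:demo\n###\n1 6\n2 3\n3 1\n' | hist_summary
# prints: 10 distinct, 60.00% singletons
```

For the real file the call would be `hist_summary < ERR1713335_kat_hist`.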
