K-mer-wide association study.
- Python 3.6+ with installed libraries:
  - sklearn
  - matplotlib
  - logomaker
  - setproctitle
- KMC 3
- Bowtie 2
- Diamond
This demonstration uses bash syntax, but the pipeline can run on any platform. Compiled binaries are available in releases.
Create a text file containing the paths to the input FASTQ files, one per line. For this demonstration we will assume it is named files.txt.
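For example, if the reads live under a hypothetical /data/reads directory (adjust the name pattern to your files):

find /data/reads -name '*.fq' | sort > files.txt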
Assuming there are n extraction jobs and this is job i:
count -f files.txt -p $i -np $n -o counts_part_$i.gz
merge -t count -i "counts_part_*.gz" -o counts_all.gz
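If all n jobs run on one machine, they can be launched in parallel from a single shell before merging. A minimal sketch, assuming job indices run from 0 to n-1 (check the tool's convention):

n=4  # example number of extraction jobs
for i in $(seq 0 $((n - 1))); do
  count -f files.txt -p $i -np $n -o counts_part_$i.gz &
done
wait

The same pattern applies to the per-sample presence extraction (has) below.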
If m is the minimal sample count for testing a k-mer:
filter -i counts_all.gz -o counts_filtered.gz -n $m
Assuming there are n extraction jobs and this is job i:
has -f files.txt -i counts_filtered.gz -p $i -np $n -o has_part_$i.gz
merge -t has -i "has_part_*.gz" -o has_all.gz
Using minimizers of length z (the paper uses z=9):
split -i has_all.gz -o "has_part_*.gz" -k $z
mnzfiles -i "has_part_*.gz" -o mnz_files.txt
# For each file f in mnz_files.txt:
while read -r f; do
  mnzgraph -i "$f" -o "$(basename "$f" .gz)_centers.gz" -t $num_threads
done < mnz_files.txt
Using 1/n of the k-mers and 1/s of the samples for population structure:
smpkmers -r $n -i "has_part_*_centers.gz" -o has_popstr_tmp.gz
smpkmers -s $s -i has_popstr_tmp.gz -o has_popstr.gz
hastojson -i has_popstr.gz -o has_popstr.json
Edit the constants in popstr/popstr.py so that has_popstr.json is the input, then run it to create the projection information.
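If the script takes no command-line arguments once its constants are set (an assumption; check the script itself), running it is just:

python popstr/popstr.py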
Assuming the result of the previous stage is components.json:
# For each .fq file listed in files.txt:
while read -r f; do
  projectpopstr -c components.json -i "$f" -o "${f%.fq}.popstr.json"
done < files.txt
Create a pandas .h5 file where the rows are samples and the columns are covariates. Add the population structure projections to this matrix. The rows should match the order of samples in the files.txt generated in 1.1.
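A minimal sketch of building this file, assuming hypothetical covariate names and that pandas' HDF5 support (the tables package) is installed; the HDF5 key that kwas.py expects should be checked against the script:

python - <<'EOF'
import pandas as pd

# Keep the row order identical to files.txt
samples = [line.strip() for line in open("files.txt")]
cov = pd.DataFrame(index=samples)
cov["age"] = 0.0  # hypothetical covariate; fill in real per-sample values
# ...add the population structure projection columns here as well
cov.to_hdf("covariates.h5", key="covariates", mode="w")
EOF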
Run the association test on each file f in has_part_*_centers.gz:
for f in has_part_*_centers.gz; do
  python kwas/kwas.py -i "$f" -o "$f.kwas.csv" -c covariates.h5 -x hastojson
done
Assuming significance threshold p, split each KWAS output file into significant and nonsignificant k-mers:
for f in *.kwas.csv; do
  postkwas -i "$f" -o "$f.significant" -p $p
  postkwas -i "$f" -o "$f.nonsignificant" -p $p -n
done
Extract the k-mer column (dropping the CSV header) from each output of the previous stage, collecting all significant k-mers into one file and all nonsignificant ones into another:
for f in *.kwas.csv; do cut -d, -f1 "$f.significant" | tail -n+2; done > kmers.significant.txt
for f in *.kwas.csv; do cut -d, -f1 "$f.nonsignificant" | tail -n+2; done > kmers.nonsignificant.txt
For each sample, use Bowtie 2 or Diamond to map its reads to a reference, producing a SAM file (file.sam below).
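For example, with Bowtie 2 against a prebuilt index (hypothetical names ref_index and file.fq):

bowtie2 -x ref_index -U file.fq -S file.sam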
smfq -s file.sam -k kmers.significant.txt -o file.significant.json
smfq -s file.sam -k kmers.nonsignificant.txt -o file.nonsignificant.json
smfqmerge -i "file*.significant.json" -o merged.significant.json
smfqmerge -i "file*.nonsignificant.json" -o merged.nonsignificant.json
python smgqhg -s merged.significant.json -n merged.nonsignificant.json -o rnames.json
This creates a JSON object where sig contains a list of reference names that were enriched with significant associations, and found contains all the reference names that were encountered.
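The enriched reference names can then be listed, for example, with:

python -c 'import json; print("\n".join(json.load(open("rnames.json"))["sig"]))'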