K-mer-wide association study.
- Python 3.6+ with installed libraries:
  - sklearn
  - matplotlib
  - logomaker
  - setproctitle
- KMC 3
- Bowtie 2
- Diamond
This demonstration uses bash syntax, but the pipeline can run on any platform. Compiled binaries are available in releases.
Create a text file containing the paths to the input FASTQ files, one per line. For this demonstration we will assume it is named files.txt.
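For example, if the reads live under a hypothetical /data/reads directory (adjust the name pattern to your files):

find /data/reads -name '*.fq' | sort > files.txt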
Assuming there are n extraction jobs and this is job i:
count -f files.txt -p $i -np $n -o counts_part_$i.gz
merge -t count -i "counts_part_*.gz" -o counts_all.gz
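If all n jobs run on one machine, they can be launched in parallel from a single shell before merging. A minimal sketch, assuming job indices run from 0 to n-1 (check the tool's convention):

n=4  # example number of extraction jobs
for i in $(seq 0 $((n - 1))); do
  count -f files.txt -p $i -np $n -o counts_part_$i.gz &
done
wait

The same pattern applies to the per-sample presence extraction (has) below.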
If m is the minimal sample count for testing a k-mer:
filter -i counts_all.gz -o counts_filtered.gz -n $m
Assuming there are n extraction jobs and this is job i:
has -f files.txt -i counts_filtered.gz -p $i -np $n -o has_part_$i.gz
merge -t has -i "has_part_*.gz" -o has_all.gz
Using minimizers of length z (the paper uses z=9):
split -i has_all.gz -o "has_part_*.gz" -k $z
mnzfiles -i "has_part_*.gz" -o mnz_files.txt
# For each file f in mnz_files.txt:
while read -r f; do
  mnzgraph -i "$f" -o "$(basename "$f" .gz)_centers.gz" -t $num_threads
done < mnz_files.txt
Using 1/n of the k-mers and 1/s of the samples for population structure:
smpkmers -r $n -i "has_part_*_centers.gz" -o has_popstr_tmp.gz
smpkmers -s $s -i has_popstr_tmp.gz -o has_popstr.gz
hastojson -i has_popstr.gz -o has_popstr.json
Edit the constants in popstr/popstr.py so that has_popstr.json is the input, then run it to create the projection information.
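If the script takes no command-line arguments once its constants are set (an assumption; check the script itself), running it is just:

python popstr/popstr.py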
Assuming the result of the previous stage is components.json:
# For each .fq file listed in files.txt:
while read -r f; do
  projectpopstr -c components.json -i "$f" -o "${f%.fq}.popstr.json"
done < files.txt
Create a pandas .h5 file where the rows are samples and the columns are covariates. Add the population structure projections to this matrix. The rows should match the order of samples in the files.txt generated in 1.1.
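A minimal sketch of building this file, assuming hypothetical covariate names and that pandas' HDF5 support (the tables package) is installed; the HDF5 key that kwas.py expects should be checked against the script:

python - <<'EOF'
import pandas as pd

# Keep the row order identical to files.txt
samples = [line.strip() for line in open("files.txt")]
cov = pd.DataFrame(index=samples)
cov["age"] = 0.0  # hypothetical covariate; fill in real per-sample values
# ...add the population structure projection columns here as well
cov.to_hdf("covariates.h5", key="covariates", mode="w")
EOF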
Run the association test on each file f in has_part_*_centers.gz:
for f in has_part_*_centers.gz; do
  python kwas/kwas.py -i "$f" -o "$f.kwas.csv" -c covariates.h5 -x hastojson
done
Assuming significance threshold p, split each KWAS output file into significant and nonsignificant k-mers:
for f in *.kwas.csv; do
  postkwas -i "$f" -o "$f.significant" -p $p
  postkwas -i "$f" -o "$f.nonsignificant" -p $p -n
done
Extract the k-mer column (dropping the CSV header) from each output of the previous stage, collecting all significant k-mers into one file and all nonsignificant ones into another:
for f in *.kwas.csv; do cut -d, -f1 "$f.significant" | tail -n+2; done > kmers.significant.txt
for f in *.kwas.csv; do cut -d, -f1 "$f.nonsignificant" | tail -n+2; done > kmers.nonsignificant.txt
For each sample, use Bowtie 2 or Diamond to map its reads to a reference, producing a SAM file (file.sam below).
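For example, with Bowtie 2 against a prebuilt index (hypothetical names ref_index and file.fq):

bowtie2 -x ref_index -U file.fq -S file.sam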
smfq -s file.sam -k kmers.significant.txt -o file.significant.json
smfq -s file.sam -k kmers.nonsignificant.txt -o file.nonsignificant.json
smfqmerge -i "file*.significant.json" -o merged.significant.json
smfqmerge -i "file*.nonsignificant.json" -o merged.nonsignificant.json
python smgqhg -s merged.significant.json -n merged.nonsignificant.json -o rnames.json
This creates a JSON object where sig contains a list of reference names that were enriched with significant associations, and found contains all the reference names that were encountered.
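The enriched reference names can then be listed, for example, with:

python -c 'import json; print("\n".join(json.load(open("rnames.json"))["sig"]))'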