kass

main.nf assembles KIR haplotypes from PacBio HiFi reads.
annotate.nf annotates the structure of assembled contigs.
align.nf aligns and reports on the raw or assembled sequences.

Dependencies

Install Java, Groovy, Nextflow, Docker, and Git. Create accounts on GitHub and Docker Hub. Add 'docker.enabled = true' and 'docker.fixOwnership = true' to your Nextflow configuration (e.g., $HOME/.nextflow/config). Make sure Docker is running and that you are logged in to Docker Hub. To run natively, without any container, use the --nocontainer option.
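As a minimal sketch of the configuration step, the two Docker settings can be appended from the shell. The helper name enable_docker is hypothetical and is not part of kass or Nextflow; it only adds a setting when that exact line is not already in the file.

```shell
# Append the two Docker settings to a Nextflow config file, skipping any
# setting whose exact line is already present.
# enable_docker is a hypothetical helper name used for illustration only.
enable_docker() {
    config_file=$1
    # Create the parent directory (e.g. ~/.nextflow) if it does not exist yet
    mkdir -p "$(dirname "$config_file")"
    for setting in 'docker.enabled = true' 'docker.fixOwnership = true'; do
        # -x matches the whole line, -F treats the setting as a fixed string
        grep -qxF "$setting" "$config_file" 2>/dev/null || echo "$setting" >> "$config_file"
    done
}
```

For the default location named above, the call would be: enable_docker "$HOME/.nextflow/config"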

Structural analysis

KPI can be used to determine the presence/absence of genes and haplotype pairs from the raw data: https://github.com/droeatumn/kpi

Assembly

Input
The input is a directory containing one or more compressed fastq files, each representing one individual. Each file contains PacBio HiFi consensus sequences (preferably 99.9% accuracy).

Output
Each input file has a corresponding output file (*.contigs.fasta) with the assembled contigs. Each contig is annotated with gene content and order in the file with the suffix 'annotation.txt'.

Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the output directory. The defaults are 'raw' and 'output' under the location where kass was pulled. Use --threads to optionally set the maximum number of threads (default 8). To output the off-KIR reads, use --off.

./main.nf --raw inDir --output outDir
e.g., ./main.nf --raw ~/input --output ~/output

The image contains an example: simulated reads from a single cA01~tA01 haplotype (GenBank accession KP420442).
./main.nf --raw ~/git/kass/input/example1 --output outDir
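Because each compressed fastq file in the input directory is treated as one individual, it can help to count the files before starting a run. The following is a sketch only: count_inputs is a hypothetical helper, and the .fastq.gz and .fq.gz extensions are assumptions, since the text above says only that the files are compressed fastq.

```shell
# Count the compressed fastq files directly inside an input directory.
# count_inputs is a hypothetical helper; the file extensions are assumed.
count_inputs() {
    find "$1" -maxdepth 1 -type f \( -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' '   # strip the padding some wc implementations add
}
```

For example, count_inputs ~/input prints the number of individuals that would be assembled from ~/input.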

Annotation

Input
The input is a folder containing fasta files (usually contigs) to be annotated. Each file may contain more than one sequence. The files cannot be gzipped.

Output
For each gene, a feature table (ft.txt) and a feature sequence file (features.fasta) are created to annotate the alleles. The feature table contains the gene and intron/exon locations for each contig in GenBank update format. The features.fasta files contain the alleles extracted from the contigs, grouped by gene.

In addition, an annotation file (annotation.txt) is created to annotate the gene locations in each contig. Ideally the annotation contains only gene names, as opposed to markup. Markup means that part of the assembly did not match the markup of any full gene. The markup.txt files have more information about the locations of the probes in the contigs and the key for converting the probe pairs into motif characters. If the annotation.txt files have motif markup (characters) instead of gene names, then either the assembly is wrong in that area or the regular expressions that map the motifs to the genes need to be altered. These are defined in input/features.txt.

Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the output directory. Use 'refFasta' to indicate the name of the reference fasta file that is located in the input directory. Use 'threads' to optionally set the maximum number of threads (default 8).

annotate.nf --raw inDir --output outDir --threads threadNum
e.g., annotate.nf --raw ~/input --output ~/output --threads 12
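Since the annotation input cannot be gzipped, compressed contig files have to be decompressed first. A minimal sketch, assuming the files end in .fasta.gz or .fa.gz (the extensions and the helper name decompress_fastas are assumptions, not part of kass):

```shell
# Decompress gzipped fasta files in a directory, in place.
# decompress_fastas is a hypothetical helper; the extensions are assumed.
decompress_fastas() {
    for f in "$1"/*.fasta.gz "$1"/*.fa.gz; do
        [ -e "$f" ] || continue   # skip a pattern that matched nothing
        gunzip "$f"               # replaces x.fasta.gz with x.fasta
    done
}
```

For example, decompress_fastas ~/input could be run before annotate.nf --raw ~/input --output ~/output.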

Alignment

Input
The input is a directory containing a reference sequence in a fasta file along with one or more fasta/fastq files to be aligned to that reference.

Output
Index files are output for the reference fasta.
For each non-reference input file, a sorted bam file, its index, and the unaligned reads are output. Qualimap (qualimap.pdf), NanoPlot (NanoPlot-report.html), and Quast (quast/report.html) reports are also generated for the alignment, and a FastQC report (fastqc.html) is generated if the input is a fastq file. For fastq input, also add --bwa="-xpacbio" to the command.
Use caution when interpreting these reports. For example, Quast doesn't just report on the alignment. It reports misassemblies that are really patterns of homology, which leads to erroneously high mismatch rates. Use Qualimap for the error rate.

Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the output directory. Use 'reference' to indicate the name of the reference fasta file (refFasta below) that is located in the input directory. Use 'threads' to optionally set the maximum number of threads (default 8).

align.nf --raw inDir --reference refFasta --output outDir --threads threadNum
e.g., align.nf --raw ~/input --reference KP420442.fasta --output ~/output --threads 12
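The Output notes above say to add --bwa="-xpacbio" when the input is fastq. As a sketch of that rule, the hypothetical helper below prints (but does not run) the align.nf command for a directory, adding the option only when a fastq file is found. The extensions it looks for (.fastq, .fq, and their .gz forms) are assumptions, and paths containing spaces are not quoted in the printed command.

```shell
# Print the align.nf command for an input directory, adding
# --bwa="-xpacbio" when the directory contains fastq files.
# align_cmd is a hypothetical helper; the fastq extensions are assumed.
align_cmd() {
    raw=$1; ref=$2; out=$3
    cmd="align.nf --raw $raw --reference $ref --output $out"
    # Look for at least one fastq file directly inside the input directory
    first_fastq=$(find "$raw" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | head -n 1)
    if [ -n "$first_fastq" ]; then
        cmd="$cmd --bwa=\"-xpacbio\""
    fi
    echo "$cmd"
}
```

For example, align_cmd ~/input KP420442.fasta ~/output prints the command to review before running it.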

Bundled references

Some references and their indexes are bundled in input/references/. KP420439 and KP420442 are cA01~tA01. KP420440 is cB01~tB01. Each has a bed file that documents the locations of the genes.

Citation

Roe D. Efficient Sequencing, Assembly, and Annotation of Human KIR Haplotypes. Frontiers in Immunology (2020) 11:582927 (https://doi.org/10.3389/fimmu.2020.582927)

Miscellaneous

Hardware
Minimum recommended hardware for targeted sequencing is 30 GB of memory and 8 CPU cores. More of each helps, especially with WGS. Run time is 1-2 hours per ID, depending on platform, genotype variation, and parallel execution.

Allele annotation
Full-gene alleles are annotated with respect to the IPD database version 2.10.0 (http://www.ebi.ac.uk/ipd/kir).
Robinson J, Halliwell JA, Hayhurst JH, Flicek P, Parham P, Marsh SGE. The IPD and IPD-IMGT/HLA Database: allele variant databases. Nucleic Acids Research (2015) 43:D423-431
Robinson J, Malik A, Parham P, Bodmer JG, Marsh SGE. IMGT/HLA - a sequence database for the human major histocompatibility complex. Tissue Antigens (2000) 55:280-287