main.nf assembles KIR haplotypes from PacBio HiFi reads.
annotate.nf annotates the structure of assembled contigs.
align.nf aligns and reports on the raw or assembled sequences.
Output
Each input file has a corresponding output file (*.contigs.fasta) with the assembled contigs. Each contig is annotated with gene content and order in the file with the suffix 'annotation.txt'.
Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the directory for the output. The defaults are 'raw' and 'output' under the location where kass was pulled. Optionally use --threads to set the maximum number of threads (default 8). To output the off-KIR reads, use --off.
./main.nf --raw inDir --output outDir
e.g.,
./main.nf --raw ~/input --output ~/output
The image contains an example: simulated reads from a single cA01~tA01 haplotype (GenBank accession KP420442).
./main.nf --raw ~/git/kass/input/example1 --output outDir
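After a run finishes, a quick sanity check is to count the contigs in each assembly. The sketch below is not part of kass; it assumes only that the output files end in .contigs.fasta, as described above, and that each contig has one FASTA header line. The function name count_contigs is hypothetical.

```shell
# Print "<file><TAB><number of contigs>" for every assembly in a directory.
# A contig is counted as one FASTA header line (a line starting with '>').
count_contigs() {
  dir=$1
  for f in "$dir"/*.contigs.fasta; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
  done
}

count_contigs outDir
```

A single haplotype that assembled cleanly would be expected to show a small contig count; many contigs per file may be worth a closer look.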
The input is a folder containing fasta files (usually contigs) to be annotated. Each file may contain more than one sequence. The files cannot be gzipped.
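Because gzipped files are not accepted, compressed contigs need to be unpacked first. A minimal sketch, assuming the compressed files are named *.fasta.gz or *.fa.gz (these extensions and the function name decompress_fastas are assumptions, not something kass defines):

```shell
# Decompress every *.fasta.gz / *.fa.gz in a directory, keeping the originals.
# The uncompressed copy is written next to the source, minus the .gz suffix.
decompress_fastas() {
  dir=$1
  for gz in "$dir"/*.fasta.gz "$dir"/*.fa.gz; do
    [ -e "$gz" ] || continue   # skip when the glob matches nothing
    gzip -dc "$gz" > "${gz%.gz}"
  done
}

# Example: decompress_fastas ~/input
```

The .gz originals stay in the directory after decompression, so it may be safer to move them elsewhere before running annotate.nf.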
Output
For each gene, a feature table (ft.txt) and a feature sequence file (features.fasta) are created to annotate the alleles. The feature table contains the gene and intron/exon locations for each contig in GenBank update format. The features.fasta files extract the alleles from the contigs and group them by gene.
For more information, an annotation file (annotation.txt) is created to annotate the gene locations in each contig. Ideally the annotation contains only gene names, as opposed to markup. Markup means that part of the assembly did not match the markup for any full gene. The markup.txt files have more information about the locations of the probes in the contigs and the key for converting the probe pairs into motif characters. If the annotation.txt files have motif markup (characters) instead of gene names, then either the assembly is wrong in that area or the regular expressions that map the motifs to the genes need to be altered. These are defined in input/features.txt.
Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the directory for the output. Use 'refFasta' to indicate the name of the reference fasta file that is located in the input directory. Use 'threads' to optionally set the maximum number of threads (default 8).
annotate.nf --raw inDir --output outDir --threads threadNum
e.g.,
annotate.nf --raw ~/input --output ~/output --threads 12
The input is a directory containing a reference sequence in a fasta file along with one or more fasta/fastq files to be aligned to that reference.
Output
Index files are output for the reference fasta.
For each non-reference input file, a sorted bam file, its index, and the unaligned reads are output. Also, Qualimap (qualimap.pdf), NanoPlot (NanoPlot-report.html), and Quast (quast/report.html) reports are generated for the alignment and a FastQC report (fastqc.html) is generated if the input is a fastq file. Also for fastq files, add
--bwa="-xpacbio"
to the command.
Use caution when interpreting these reports. For example, Quast doesn't just report on the alignment. It reports misassemblies that are really patterns of homology, which leads to erroneously high rates of mismatches. Use Qualimap for the error rate.
Running
Use the 'raw' parameter to indicate the input directory and 'output' to indicate the directory for the output. Use 'reference' to indicate the name of the reference fasta file that is located in the input directory. Use 'threads' to optionally set the maximum number of threads (default 8).
align.nf --raw inDir --reference refFasta --output outDir --threads threadNum
e.g.,
align.nf --raw ~/input --reference KP420442.fasta --output ~/output --threads 12
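The --bwa="-xpacbio" option noted in the Output section is only for fastq input, which is easy to forget. The helper below is a hypothetical sketch that prints an align.nf command line and appends the option when the input directory holds fastq files. It assumes fastq files end in .fastq or .fq and that the paths contain no spaces; align_cmd is not part of kass.

```shell
# Print an align.nf command line, adding --bwa="-xpacbio" when the input
# directory holds fastq files. The .fastq/.fq extensions are an assumption.
align_cmd() {
  raw=$1 ref=$2 out=$3 threads=${4:-8}
  cmd="align.nf --raw $raw --reference $ref --output $out --threads $threads"
  for f in "$raw"/*.fastq "$raw"/*.fq; do
    if [ -e "$f" ]; then
      cmd="$cmd --bwa=\"-xpacbio\""
      break
    fi
  done
  printf '%s\n' "$cmd"
}

# Example: align_cmd ~/input KP420442.fasta ~/output 12
```

The function only prints the command, so it can be reviewed before being run.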
Hardware
Minimum recommended hardware for targeted sequencing is 30 GB of memory and 8 CPU cores. More of each helps, especially with WGS. Run time is 1-2 hours per ID, depending on platform, genotype variation, and parallel execution.
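A host can be checked against these minimums before starting a run. This is a rough sketch, assuming a Linux machine where nproc and /proc/meminfo are available; the function name meets_minimum is hypothetical, and memory is rounded down to whole GiB, so a machine sold as "30 GB" may report slightly less.

```shell
# Succeed only when both values reach the recommended minimums
# (8 cores, 30 GB of memory).
meets_minimum() {
  n_cores=$1 n_mem_gb=$2
  [ "$n_cores" -ge 8 ] && [ "$n_mem_gb" -ge 30 ]
}

# Linux-only lookups (assumption): nproc for cores, /proc/meminfo for memory.
cores=$(nproc)
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)

if meets_minimum "$cores" "$mem_gb"; then
  echo "OK: $cores cores, ${mem_gb} GB"
else
  echo "Below recommended minimum: $cores cores, ${mem_gb} GB"
fi
```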
Allele annotation
Full-gene alleles are annotated with respect to the IPD database version 2.10.0 (http://www.ebi.ac.uk/ipd/kir).
Robinson J, Halliwell JA, Hayhurst JH, Flicek P, Parham P, Marsh SGE: The IPD and IPD-IMGT/HLA Database: allele variant databases. Nucleic Acids Research (2015), 43:D423-431
Robinson J, Malik A, Parham P, Bodmer JG, Marsh SGE: IMGT/HLA - a sequence database for the human major histocompatibility complex. Tissue Antigens (2000), 55:280-287