# Build custom databases with segmented flu genomes using the kraken_flu utility
A new utility, [kraken_flu](https://gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu), was created to generate Kraken2 databases with segmented flu genomes from any set of input files (either from kraken2 or directly from NCBI).

The tool identifies influenza A H1N1, A H3N2 and B in the genome FASTA file and creates new artificial taxa in the taxonomy files of the database build. Each segment of any of the viruses of interest is assigned the original whole genome as its parent and is listed as a new taxon in the Kraken2 output files.
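
The parent/child relationship described above can be pictured with a few lines of shell. This is only an illustration and not the kraken_flu implementation: the function name `make_segment_taxa`, the starting ID passed to it, and the simplified three-column output are assumptions made for this sketch.

```shell
# Illustrative sketch only, NOT the kraken_flu implementation.
# Input lines:  <parent_taxid><TAB><segment name>
# Output lines: <new_taxid><TAB><parent_taxid><TAB><segment name>
make_segment_taxa() {
    local first_new_id=$1   # hypothetical: first unused taxon ID
    awk -F'\t' -v next_id="$first_new_id" '
        { printf "%d\t%s\t%s\n", (next_id++), $1, $2 }
    '
}

# Two segments of the same genome (parent taxon 211044) each get their own ID
printf '211044\tsegment 7\n211044\tsegment 8\n' | make_segment_taxa 3108293
```

Each segment becomes a child of the whole-genome taxon, so reads can be reported per segment while still rolling up to the original genome.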

# Install the kraken_flu tool
The tool can be installed from the local GitLab instance. First create a venv for it.

In [27]:
python3 -m venv ~/kraken_flu

Collecting kraken_flu@ git+ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git
  Cloning ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git to /tmp/pip-install-1czikc84/kraken-flu_e04ea4d0e306496fadd281fe42d9687d
  Running command git clone --filter=blob:none --quiet 'ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git' /tmp/pip-install-1czikc84/kraken-flu_e04ea4d0e306496fadd281fe42d9687d
  Resolved ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git to commit 277e12048c3b39c50d23bcc55f7573051872efc3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [58]:
~/kraken_flu/bin/pip install kraken_flu@git+ssh://git@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git

Collecting kraken_flu@ git+ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git
  Cloning ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git to /tmp/pip-install-0v5bblyy/kraken-flu_715417f5e65b493b9e990ac9afcfe9ca
  Running command git clone --filter=blob:none --quiet 'ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git' /tmp/pip-install-0v5bblyy/kraken-flu_715417f5e65b493b9e990ac9afcfe9ca
  Resolved ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git to commit dd586116acc4a5c991beee57f9111e83a77c7936
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kraken_flu
  Building wheel for kraken_flu (pyproject.toml) ... done
  Created wheel for kraken_flu: filename=kraken_flu-0.1.dev20+gdd58611-py2.py3-none-any.whl size=12411 

In [59]:
~/kraken_flu/bin/kraken_flu -v

kraken_flu 0.1.dev20+gdd58611


## Download common data
The following data is used by several subsequent database builds. Each database uses copies of the relevant files from these download directories, so the large files do not have to be downloaded again for every build.

### Download NCBI taxonomy data
Use the kraken2 build tool to download taxonomy files from [NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). These files will be needed for all database builds.

In [14]:
module load kraken2/2.1.2

        Module loaded. For more information run 'module help kraken2/2.1.2'.
        - By default, kraken2 will search for named databases in /data/pam/software/kraken2 and the current working directory. The full path to
        any database can also be used.
        - `kraken2-build` commands will sometimes experience an error suggesting that it is not possible to download a library using rsync. While the option --use-ftp
        may fix this, rsync tends to be faster and more reliable. Try re-running the `kraken2-build` command until the download is successful. Note that `kraken2-build`
        will produce checkpoints during the installation process and will restart at the last successful step.
        - If `kraken2-build build` freezes or hangs indefinitely, using the `--fast-build` option may help. An earlier version of the software (if available) may also
        allow the build step to proceed successfully.


In [4]:
export BASE_DIR=/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu/

In [7]:
mkdir -p ${BASE_DIR}

In [5]:
export TAX_PATH=${BASE_DIR}/taxonomy_download/

In [6]:
export LIB_PATH_VIR=${BASE_DIR}/library_download_viral/

In [10]:
mkdir -p ${TAX_PATH}
mkdir -p ${LIB_PATH_VIR}

In [11]:
kraken2-build --download-taxonomy --db ${TAX_PATH}

Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.


In [12]:
kraken2-build \
    --download-library viral \
    --db ${LIB_PATH_VIR}

Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 14972 projects (18639 sequences, 549.88 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
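
The module help text shown earlier suggests re-running `kraken2-build` when a download fails. A small retry wrapper is one way to automate that. This is a generic shell sketch, not part of kraken2; the function name `retry` and the attempt count and delay in the usage line are arbitrary choices.

```shell
# Run a command up to max_attempts times, pausing between failed attempts.
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage (not run here):
# retry 5 60 kraken2-build --download-library viral --db "${LIB_PATH_VIR}"
```

Because `kraken2-build` checkpoints its progress, a repeated call should resume from the last successful step rather than start over.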


___Resulting files:___
These are the files created by the downloads. The file library.fna is the genome sequence file. Because it was downloaded through kraken2 from RefSeq, the FASTA headers already contain taxonomy IDs, which the kraken_flu tool will modify.


In [7]:
tree ${BASE_DIR}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu/
├── db1-allrefseq_wo_human
│   ├── library
│   │   └── library.fna
│   └── taxonomy
│       ├── nodes.dmp
│       └── names.dmp
├── library_download_viral
│   └── library
│       └── viral
│           ├── assembly_summary.txt
│           ├── library.fna
│           ├── library.fna.masked
│           ├── manifest.txt
│           └── prelim_map.txt
└── taxonomy_download
    └── taxonomy
        ├── accmap.dlflag
        ├── citations.dmp
        ├── delnodes.dmp
        ├── division.dmp
        ├── gc.prt
        ├── gencode.dmp
        ├── images.dmp
        ├── merged.dmp
        ├── names.dmp
        ├── nodes.dmp
        ├── nucl_gb.accession2taxid
        ├── nucl_wgs.accession2taxid
        ├── readme.txt
        ├── taxdump.dlflag
        ├── taxdump.tar.gz
        └── taxdump.untarflag

8 directories, 24 files
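
To see which taxonomy IDs are embedded in the FASTA headers, something like the following can be used. It is a sketch that assumes headers of the form `>kraken:taxid|<taxid>|<accession> ...`, as in the kraken2 download; `list_header_taxids` is a hypothetical helper name and the demo sequences are placeholders.

```shell
# Print each distinct taxid found in 'kraken:taxid' FASTA headers,
# preceded by the number of sequences carrying it.
list_header_taxids() {
    grep '^>kraken:taxid|' "$1" | cut -d'|' -f2 | sort -n | uniq -c
}

# Tiny demo file; the sequence lines are placeholders, not real genome data
demo=$(mktemp)
printf '%s\n' \
    '>kraken:taxid|211044|NC_002023.1 demo' 'ACGT' \
    '>kraken:taxid|211044|NC_002021.1 demo' 'ACGT' \
    '>kraken:taxid|641809|NC_026438.1 demo' 'ACGT' > "$demo"
list_header_taxids "$demo"
rm -f "$demo"
```

On the real data the call would be `list_header_taxids ${LIB_PATH_VIR}/library/viral/library.fna`. Before kraken_flu is run, all segments of one genome share the same taxid.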


# Use kraken_flu to create a database with all of viral refseq
The tool will convert the taxonomy and library files and write the modified copies to a new directory. The modified files will have new taxon IDs in the FASTA header for all flu A H1N1, A H3N2 and B viral genome sequences. The pattern is currently hardcoded into the tool but can be easily extended to cover other viruses.
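
The header rewriting can be pictured roughly as follows. This is a sketch and not the tool's code: the regular expression, the function name `retag_flu_headers` and the starting taxon ID are assumptions chosen for illustration, and the second demo record is invented.

```shell
# Give every matching influenza header a fresh taxon ID;
# copy all other lines through unchanged.
retag_flu_headers() {
    local first_new_id=$1   # hypothetical first unused taxon ID
    awk -v next_id="$first_new_id" '
        /^>kraken:taxid\|[0-9]+\|/ && /Influenza (A virus .*\((H1N1|H3N2)\)|B virus)/ {
            sub(/^>kraken:taxid\|[0-9]+\|/, ">kraken:taxid|" (next_id++) "|")
        }
        { print }
    '
}

# The first record is an H1N1 segment; the second is an invented non-flu record
printf '%s\n' \
    '>kraken:taxid|211044|NC_002023.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 1, complete sequence' \
    'ACGT' \
    '>kraken:taxid|12345|XX_000000.1 Some other virus' \
    'ACGT' | retag_flu_headers 3108287
```

Only the first header receives a new ID; the other record keeps its original taxid, so non-flu genomes are classified exactly as before.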

In [60]:
# FORCE DELETE any existing directory
rm -rf ${BASE_DIR}/db1-allrefseq_wo_human

# run the utility
~/kraken_flu/bin/kraken_flu \
    --library_path ${LIB_PATH_VIR}/library/viral/ \
    --taxonomy_path ${TAX_PATH}/taxonomy \
    --out_dir ${BASE_DIR}/db1-allrefseq_wo_human

2023-12-18 16:52:53,797 writing modified FASTA file to /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db1-allrefseq_wo_human/library/library.fna
2023-12-18 16:52:53,812 scanning file /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//library_download_viral//library/viral/library.fna for influenza viruses
2023-12-18 16:52:57,121 done - found 32 segment sequences in FASTA file
2023-12-18 16:53:03,534 finished writing modified FASTA file
2023-12-18 16:53:03,537 writing modified prelim_map.txt file to /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db1-allrefseq_wo_human/library/prelim_map.txt
2023-12-18 16:53:03,594 finished writing modified prelim_map.txt file
2023-12-18 16:53:04,298 writing modified names file to /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db1-allrefseq_wo_human/taxonomy/names.dmp
2023-12-18 16:53:04,495 finished writing modified names file
2023-12-18 16:53:05,086 w

In [61]:
tree ${BASE_DIR}/db1-allrefseq_wo_human

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db1-allrefseq_wo_human
├── library
│   ├── library.fna
│   └── prelim_map.txt
└── taxonomy
    ├── names.dmp
    └── nodes.dmp

2 directories, 4 files


Show the tail of the new names file, which now lists the segments as new names.

In [49]:
tail ${BASE_DIR}/db1-allrefseq_wo_human/taxonomy/names.dmp

3108293	|	NC_002016.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 7, complete sequence	|		|	scientific name	|
3108294	|	NC_002020.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 8, complete sequence	|		|	scientific name	|
3108295	|	NC_026438.1 Influenza A virus (A/California/07/2009(H1N1)) segment 1 polymerase PB2 (PB2) gene, complete cds	|		|	scientific name	|
3108296	|	NC_026435.1 Influenza A virus (A/California/07/2009(H1N1)) segment 2 polymerase PB1 (PB1) gene, complete cds; and nonfunctional PB1-F2 protein (PB1-F2) gene, complete sequence	|		|	scientific name	|
3108297	|	NC_026437.1 Influenza A virus (A/California/07/2009(H1N1)) segment 3 polymerase PA (PA) gene, complete cds	|		|	scientific name	|
3108298	|	NC_026433.1 Influenza A virus (A/California/07/2009(H1N1)) segment 4 hemagglutinin (HA) gene, complete cds	|		|	scientific name	|
3108299	|	NC_026436.1 Influenza A virus (A/California/07/2009(H1N1)) segment 5 nucleocapsid protein (NP) gene, complete cds	|		

The tail of the nodes file shows that the new nodes have been added.

In [50]:
tail ${BASE_DIR}/db1-allrefseq_wo_human/taxonomy/nodes.dmp

3108293	|	211044	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108294	|	211044	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108295	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108296	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108297	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108298	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108299	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108300	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108301	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
3108302	|	641809	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	1	|		|		|
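
A quick way to check the parent assignments is to count how many taxa hang off each parent. The sketch below assumes the `<TAB>|<TAB>` field separator seen in nodes.dmp; `children_per_parent` is a hypothetical helper name, and the demo lines are shortened to their first three columns.

```shell
# Count how many taxa are attached to each parent in nodes.dmp-style input.
# Output: <parent_taxid><TAB><number of children>
children_per_parent() {
    awk -F'\t[|]\t' '{ n[$2]++ } END { for (p in n) print p "\t" n[p] }' | sort -n
}

# Shortened demo lines (child, parent, rank only)
printf '3108293\t|\t211044\t|\tno rank\t|\n3108294\t|\t211044\t|\tno rank\t|\n3108295\t|\t641809\t|\tno rank\t|\n' \
    | children_per_parent

# On the real file (not run here):
# tail -n 32 ${BASE_DIR}/db1-allrefseq_wo_human/taxonomy/nodes.dmp | children_per_parent
```

For a complete influenza A or B genome, one would expect eight children per parent, one for each segment.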


Show the modified FASTA headers for the H1N1 sequences.

In [63]:
grep H1N1 ${BASE_DIR}/db1-allrefseq_wo_human/library/library.fna

>kraken:taxid|3108287|NC_002023.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 1, complete sequence
>kraken:taxid|3108288|NC_002021.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 2, complete sequence
>kraken:taxid|3108289|NC_002022.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 3, complete sequence
>kraken:taxid|3108290|NC_002017.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 4, complete sequence
>kraken:taxid|3108291|NC_002019.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 5, complete sequence
>kraken:taxid|3108292|NC_002018.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 6, complete sequence
>kraken:taxid|3108293|NC_002016.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 7, complete sequence
>kraken:taxid|3108294|NC_002020.1 Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 8, complete sequence
>kraken:taxid|3108295|NC_026438.1 Influenza A virus (A/California/07/2009(H1N1)) segment 1 polymerase PB2 (PB2) gene, co

# Build the new database
Use the kraken2 build tool to build the final database from the modified genome and taxonomy files.


In [64]:
kraken2-build \
    --build \
    --db ${BASE_DIR}/db1-allrefseq_wo_human

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.059s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 648810788 bytes
Capacity estimation complete. [20.288s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Completed processing of 18639 sequences, 549880780 bp
Writing data to disk...  complete.
Database files completed. [2m31.032s]
Database construction complete. [Total: 2m51.403s]
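
After the build, a Kraken2 database directory should contain the files `hash.k2d`, `opts.k2d` and `taxo.k2d`. A minimal check is sketched below; `check_kraken2_db` is a hypothetical helper name, not a kraken2 command.

```shell
# Succeed only if the three Kraken2 database files exist and are non-empty.
check_kraken2_db() {
    local db_dir=$1 f missing=0
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "${db_dir}/${f}" ]; then
            echo "missing or empty: ${db_dir}/${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (not run here):
# check_kraken2_db "${BASE_DIR}/db1-allrefseq_wo_human" && echo "database looks complete"
```

This only confirms that the files are present; it does not validate their contents.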


# Test the database
Use existing simulated read data to test the new database.

In [29]:
export TEST_BASE_DIR=/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu_tests/

In [65]:
rm -rf ${TEST_BASE_DIR}
mkdir -p ${TEST_BASE_DIR}

In [66]:
kraken2 \
    --db ${BASE_DIR}/db1-allrefseq_wo_human \
    --output ${TEST_BASE_DIR}/output.kraken \
    --paired \
    --classified-out ${TEST_BASE_DIR}/class_seqs#.fq \
    --unclassified-out ${TEST_BASE_DIR}/unclass_seqs#.fq \
    --report ${TEST_BASE_DIR}/report.txt \
    /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/sim_reads/art/mix/mix_of_all_01-1.fq \
    /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/sim_reads/art/mix/mix_of_all_01-2.fq


Loading database information... done.
6620 sequences (1.99 Mbp) processed in 0.366s (1085.5 Kseq/m, 325.65 Mbp/m).
  6618 sequences classified (99.97%)
  2 sequences unclassified (0.03%)
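
The per-read file written by `--output` can also be summarised directly. The sketch below assumes the standard Kraken2 output columns (classification flag, read ID, taxid, length, LCA mapping) without `--use-names`; `reads_per_taxid` is a hypothetical helper name, and the demo read names and LCA column are made up.

```shell
# Count classified reads per taxid in a Kraken2 per-read output file.
# Output: <taxid><TAB><read count>, highest count first.
reads_per_taxid() {
    awk -F'\t' '$1 == "C" { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" \
        | sort -k2,2nr -k1,1n
}

# Made-up demo records in the standard five-column layout
demo=$(mktemp)
printf 'C\tread1\t3108271\t150|150\tdemo\nC\tread2\t3108271\t150|150\tdemo\nU\tread3\t0\t150|150\tdemo\nC\tread4\t3108272\t150|150\tdemo\n' > "$demo"
reads_per_taxid "$demo"
rm -f "$demo"
```

On the real run this would be `reads_per_taxid ${TEST_BASE_DIR}/output.kraken`; the new segment taxa should appear as separate taxids.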


In [67]:
cat ${TEST_BASE_DIR}/report.txt

  0.03	2	2	U	0	unclassified
 99.97	6618	0	R	1	root
 99.97	6618	0	D	10239	  Viruses
 99.97	6618	0	D1	2559587	    Riboviria
 99.97	6618	0	K	2732396	      Orthornavirae
 69.91	4628	0	P	2497569	        Negarnaviricota
 39.43	2610	0	P1	2497571	          Polyploviricotina
 39.43	2610	0	C	2497577	            Insthoviricetes
 39.43	2610	0	O	2499411	              Articulavirales
 39.43	2610	0	F	11308	                Orthomyxoviridae
 25.53	1690	0	G	197911	                  Alphainfluenzavirus
 25.53	1690	0	S	2955291	                    Alphainfluenzavirus influenzae
 25.53	1690	0	S1	11320	                      Influenza A virus
 12.84	850	0	S2	119210	                        H3N2 subtype
 12.84	850	0	S3	335341	                          Influenza A virus (A/New York/392/2004(H3N2))
  2.27	150	150	S4	3108271	                            NC_007373.1 Influenza A virus (A/New York/392/2004(H3N2)) segment 1, complete sequence
  2.27	150	150	S4	3108272	                            NC_007372.1 Influenza A