# Build custom databases with segmented flu genomes using the kraken_flu utility
A new utility [kraken_flu](https://gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu) was created to generate Kraken2 databases with segmented flu genomes from any set of input files (either from kraken2 or directly from NCBI).

The tool performs two tasks:
1. Filter influenza genomes to keep only those that have 8 full-length segments
2. Create a reorganised taxonomy with new taxa for segments of the influenza A viruses
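
To illustrate the first task, here is a toy sketch of the "all 8 segments present" idea. It is not how kraken_flu implements the filter (the real tool also checks segment length, which is ignored here). The input format, a two-column tab-separated file of isolate name and segment number, and the function name `complete_isolates` are assumptions made for this example only.

```shell
# Toy illustration only: list isolates for which all 8 segment numbers occur.
# Assumed input: TSV with column 1 = isolate name, column 2 = segment number.
complete_isolates() {
    awk -F'\t' '
        # count each (isolate, segment) pair once, ignoring duplicates
        $2 >= 1 && $2 <= 8 && !seen[$1, $2]++ { n[$1]++ }
        END { for (iso in n) if (n[iso] == 8) print iso }
    ' "$1" | sort
}
```

An isolate that has only 7 distinct segments, or the same segment twice, is not reported.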

This notebook builds a complete database with NCBI RefSeq plus the NCBI Influenza resource.

## Directory paths and names

In [1]:
export DB_NAME=refseq_ncbiFlu_kfv2_102124
export BASE_DIR=/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu/

# taxonomy data is universal and ideally reused for every DB we build, so the directory is not specific to this DB
export TAX_PATH=${BASE_DIR}/downloads/taxonomy_download/

# the contents of this folder were copied from refseq_ncbiFlu_kfv2_091124
export LIB_PATH=${BASE_DIR}/downloads/sequence_download/refseq_ncbiFlu_kfv2/

export DB_PREP_DIR=${BASE_DIR}/db_prep/${DB_NAME}/
export DB_PATH=${BASE_DIR}/databases/${DB_NAME}

In [46]:
mkdir -p ${BASE_DIR}

rm -rf ${DB_PREP_DIR}
mkdir -p ${DB_PREP_DIR}
mkdir -p ${DB_PATH}
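
Because the cell above runs `rm -rf` on a path built from environment variables, an unset variable could make it target the wrong directory. A minimal sketch of a guard, assuming a hypothetical helper named `reset_dir` that is not part of the original notebook:

```shell
# Hypothetical helper: recreate a scratch directory, refusing to act on an
# empty path or on "/" so that a missing variable cannot wipe the wrong place.
reset_dir() {
    case "$1" in
        ""|"/")
            echo "reset_dir: refusing to remove '$1'" >&2
            return 1
            ;;
    esac
    rm -rf -- "$1" && mkdir -p -- "$1"
}
```

With this in place, `reset_dir "${DB_PREP_DIR}"` would stand in for the `rm -rf` / `mkdir -p` pair. Bash's `${DB_PREP_DIR:?}` expansion is a shorter alternative that aborts when the variable is unset or empty.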

In [47]:
echo "BASE DIR: $BASE_DIR"
echo "TAXONOMY PATH: ${TAX_PATH}"
echo "LIBRARY (FASTA) PATH: ${LIB_PATH}"
echo "DB PREP DIR: ${DB_PREP_DIR}"
echo "FINAL DB PATH: ${DB_PATH}"

BASE DIR: /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu/
TAXONOMY PATH: /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//downloads/taxonomy_download/
LIBRARY (FASTA) PATH: /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//downloads/sequence_download/refseq_ncbiFlu_kfv2/
DB PREP DIR: /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db_prep/refseq_ncbiFlu_kfv2_102124/
FINAL DB PATH: /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//databases/refseq_ncbiFlu_kfv2_102124
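
The doubled slashes (`kraken_flu//downloads`) in the output above come from the trailing slash on `BASE_DIR`. They are harmless on POSIX file systems but make the logs harder to read. A small sketch for normalising a path before deriving the others from it; the helper name `strip_trailing_slashes` is an assumption for illustration:

```shell
# Hypothetical helper: remove trailing slashes from a path, leaving "/" intact.
# The body runs in a subshell so the working variable does not leak.
strip_trailing_slashes() (
    p="$1"
    while [ "$p" != "/" ] && [ "${p%/}" != "$p" ]; do
        p="${p%/}"
    done
    printf '%s\n' "$p"
)
```

For example, `export BASE_DIR=$(strip_trailing_slashes "${BASE_DIR}")` could be run before the dependent variables are defined.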


## Install the kraken_flu tool
The tool can be installed from the local GitLab instance. A dedicated venv is created for it.

In [4]:
python3 -m venv ~/kraken_flu

In [48]:
source ~/kraken_flu/bin/activate

In [5]:
pip install --upgrade pip



___This version of kraken-flu is not yet in the main branch, so the latest commit on the dev branch is installed instead___

In [49]:
COMMIT_SHA=90f58e2866c89056a36cce71b0c7563dbdcb3461

In [50]:
pip install kraken_flu@git+ssh://git@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git@$COMMIT_SHA

Collecting kraken_flu@ git+ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git@90f58e2866c89056a36cce71b0c7563dbdcb3461
  Cloning ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git (to revision 90f58e2866c89056a36cce71b0c7563dbdcb3461) to /tmp/pip-install-jnprbf80/kraken-flu_e60e79b9659b49278d087e5daea4bd80
  Running command git clone --filter=blob:none --quiet 'ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git' /tmp/pip-install-jnprbf80/kraken-flu_e60e79b9659b49278d087e5daea4bd80
  Running command git rev-parse -q --verify 'sha^90f58e2866c89056a36cce71b0c7563dbdcb3461'
  Running command git fetch -q 'ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git' 90f58e2866c89056a36cce71b0c7563dbdcb3461
  Running command git checkout -q 90f58e2866c89056a36cce71b0c7563dbdcb3461
  Resolved ssh://****@gitlab.internal.sanger.ac.uk/malariagen1/misc_utils/kraken_flu.git to commit 90f58e2866c89056

In [51]:
kraken_flu -v

kraken_flu 2.0.1.dev28+g90f58e2


## Download data
Get viral taxonomy and sequence data from NCBI

In [2]:
module load kraken2/2.1.2

	Module loaded. For more information run 'module help kraken2/2.1.2'.
	- By default, kraken2 will search for named databases in /data/pam/software/kraken2 and the current working directory. The full path to
	any database can also be used.
	- `kraken2-build` commands will sometimes experience an error suggesting that it is not possible to download a library using rsync. While the option --use-ftp
	may fix this, rsync tends to be faster and more reliable. Try re-running the `kraken2-build` command until the download is successful. Note that `kraken2-build`
	will produce checkpoints during the installation process and will restart at the last successful step.
	- If `kraken2-build build` freezes or hangs indefinitely, using the `--fast-build` option may help. An earlier version of the software (if available) may also
	allow the build step to proceed successfully.


### Taxonomy data
Use the kraken2 build tool to download taxonomy files from [NCBI](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).

___This only needs to be done once, hence commented out here. Uncomment to re-run___

In [11]:
# kraken2-build --download-taxonomy --db ${TAX_PATH}

In [17]:
tree ${TAX_PATH}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//downloads/taxonomy_download/
└── taxonomy
    ├── accmap.dlflag
    ├── citations.dmp
    ├── delnodes.dmp
    ├── division.dmp
    ├── gc.prt
    ├── gencode.dmp
    ├── images.dmp
    ├── merged.dmp
    ├── names.dmp
    ├── nodes.dmp
    ├── nucl_gb.accession2taxid
    ├── nucl_wgs.accession2taxid
    ├── readme.txt
    ├── taxdump.dlflag
    ├── taxdump.tar.gz
    └── taxdump.untarflag

1 directory, 16 files
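
Since the taxonomy download is normally skipped, it can be worth confirming that the files used later in this notebook are actually present. The sketch below checks for `names.dmp`, `nodes.dmp` and `nucl_gb.accession2taxid`; choosing these three is an assumption based on the commands further down, and `check_taxonomy_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the expected taxonomy files exist
# and are non-empty in the directory given as $1.
check_taxonomy_dir() {
    _tax_missing=0
    for _tax_file in names.dmp nodes.dmp nucl_gb.accession2taxid; do
        if [ ! -s "$1/$_tax_file" ]; then
            echo "missing or empty: $1/$_tax_file" >&2
            _tax_missing=1
        fi
    done
    [ "$_tax_missing" -eq 0 ]
}
```

It would be called as `check_taxonomy_dir "${TAX_PATH}/taxonomy"`, and the names of any missing files are printed to stderr.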


### Sequence data
Downloaded directly from the NCBI RefSeq release FTP site (file from 15/01/24).

NCBI viral RefSeq (now taken directly from NCBI, not from the kraken2 pre-built library)

___This step is not required to run every time and it takes a long time, hence commented out___

In [18]:
cd ${LIB_PATH}
#wget https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
#gunzip viral.1.1.genomic.fna.gz

NCBI Influenza FTP  
___NOTE___ that the resource is no longer kept updated (since Oct 2020), so it is used as a starting point; later influenza genomes still need to be added using the new NCBI API for NCBI Virus:

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/

Here is a post showing how to programmatically interact with the new NCBI Virus site to obtain sequences:  
https://www.biostars.org/p/9562294/   

___This step is not required to run every time and it takes a long time, hence commented out___

In [14]:
cd ${LIB_PATH}
# wget https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna

In [20]:
tree ${LIB_PATH}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//downloads/sequence_download/refseq_ncbiFlu_kfv2/
├── influenza.fna
├── sequences_RSV_A.fasta
├── sequences_RSV_B.fasta
└── viral.1.1.genomic.fna

0 directories, 4 files
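
A quick sanity check on the downloaded files is to count the FASTA records in each one. This is a minimal sketch; `count_fasta_records` is a hypothetical helper, not part of kraken_flu or kraken2.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Looping over the library, for example `for f in "${LIB_PATH}"/*.f*a; do echo "$f: $(count_fasta_records "$f")"; done`, gives per-file counts that can be compared with the number of sequences reported later by `kraken2-build --build`.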


___RSV sequences___ were downloaded from https://nextstrain.org/rsv/a/genome/6y?d=tree,entropy&p=full

## Run the kraken-flu tool
The tool creates a new directory of taxonomy and sequence files.  

An exception to the "complete flu genomes" filter is made for the avian flu reference, because we want it in the DB even though it does not have all 8 segments in RefSeq.

___NOTE___ The current implementation of the kraken-flu tool reads sequence and taxonomy files into RAM and therefore needs a significant amount of memory. The process was killed on a 5GB node but worked fine with 10GB.


In [53]:
rm ${DB_PREP_DIR}/kfdb.sqlite
~/kraken_flu/bin/kraken_flu \
    --taxonomy_path  ${TAX_PATH}/taxonomy \
    --fasta_path ${LIB_PATH}/influenza.fna ${LIB_PATH}/viral.1.1.genomic.fna \
    --out_dir ${DB_PREP_DIR} \
    --db_file ${DB_PREP_DIR}/kfdb.sqlite \
    --keep_db_file \
    --filter_flu \
    --filter_except "A/Goose/Guangdong/1/96(H5N1)" \
    --do_full_linkage \
    --rsv_a_sequences ${LIB_PATH}/sequences_RSV_A.fasta \
    --rsv_b_sequences ${LIB_PATH}/sequences_RSV_B.fasta \
    --rsv_size_filter \
    > ${DB_PREP_DIR}/log 2>&1

rm: cannot remove '/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db_prep/refseq_ncbiFlu_kfv2_102124//kfdb.sqlite': No such file or directory
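
The `rm` message above is harmless: it only means there was no SQLite file left over from an earlier run. A minimal sketch of a quieter cleanup, using the standard `-f` option of `rm`; the wrapper name `remove_stale_db` is hypothetical.

```shell
# Hypothetical helper: delete a leftover database file if it exists.
# "rm -f" does not complain, and returns success, when the file is absent.
remove_stale_db() {
    rm -f -- "$1"
}
```

Calling `remove_stale_db "${DB_PREP_DIR}/kfdb.sqlite"` before the kraken_flu command would behave the same on a first run and on a re-run.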


In [3]:
tree ${DB_PREP_DIR}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db_prep/refseq_ncbiFlu_kfv2_102124/
├── kfdb.sqlite
├── library
│   └── library.fna
├── log
└── taxonomy
    ├── names.dmp
    └── nodes.dmp

2 directories, 5 files


## Prepare new kraken2 DB directory
Create the directory and copy the kraken-flu results into it, then use the kraken2-build tool to add the library to the new DB.

In [4]:
rm -rf ${DB_PATH}
mkdir -p ${DB_PATH}

Copy the taxonomy created by kraken-flu into the new directory.

In [5]:
cp -r ${DB_PREP_DIR}/taxonomy ${DB_PATH}

The large NCBI accession-to-taxID file is also needed here for kraken2-build.

In [6]:
ln -s ${TAX_PATH}/taxonomy/nucl_gb.accession2taxid ${DB_PATH}/taxonomy

In [7]:
tree ${DB_PATH}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//databases/refseq_ncbiFlu_kfv2_102124
└── taxonomy
    ├── names.dmp
    ├── nodes.dmp
    └── nucl_gb.accession2taxid -> /lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//downloads/taxonomy_download//taxonomy/nucl_gb.accession2taxid

1 directory, 3 files


Use kraken2-build to add the library file to the new DB.

In [8]:
mkdir ${DB_PATH}/library

Run the new library through the kraken2-build "add to library" process.
This is still required because it is the step that does the masking, which is important for the "build" step of kraken2-build (next).

In [9]:
kraken2-build \
    --add-to-library ${DB_PREP_DIR}/library/library.fna \
    --db ${DB_PATH}

Masking low-complexity regions of new file... done.
Added "/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//db_prep/refseq_ncbiFlu_kfv2_102124//library/library.fna" to library (/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//databases/refseq_ncbiFlu_kfv2_102124)


## Create the kraken2 DB

In [10]:
kraken2-build \
    --build \
    --db ${DB_PATH}

Creating sequence ID to taxonomy ID map (step 1)...
Found 0/4 targets, searched through 327566049 accession IDs, search complete.
lookup_accession_numbers: 4/4 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [25.820s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 669314192 bytes
Capacity estimation complete. [34.361s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 19 bits reserved for taxid.
Completed processing of 440158 sequences, 1255827612 bp
Writing data to disk...  complete.
Database files completed. [4m11.163s]
Database construction complete. [Total: 5m11.452s]
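
The build log above reports that 4 accession numbers could not be mapped and points to `unmapped.txt` in the DB directory. A small sketch for inspecting that file; the helper name `summarise_unmapped` is an assumption, and the only thing taken from kraken2 is the file name mentioned in the log.

```shell
# Hypothetical helper: report how many sequence IDs kraken2-build left
# unmapped in the database directory given as $1, and show the first few.
summarise_unmapped() {
    _unmapped_file="$1/unmapped.txt"
    if [ -s "$_unmapped_file" ]; then
        echo "$(wc -l < "$_unmapped_file" | tr -d ' ') unmapped sequence IDs, first few:"
        head -n 5 "$_unmapped_file"
    else
        echo "no unmapped sequence IDs recorded"
    fi
}
```

Running `summarise_unmapped "${DB_PATH}"` shows which sequences these are, which helps decide whether they matter for the viruses of interest.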


In [11]:
echo ${DB_PATH}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu//databases/refseq_ncbiFlu_kfv2_102124


## Test with simulated data
Run with a set of simulated reads for all the current viruses of interest.


In [12]:
export TEST_BASE_DIR=/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu_test_kfv2_102124

In [13]:
mkdir -p ${TEST_BASE_DIR}

In [14]:
echo ${TEST_BASE_DIR}

/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/krakenDBs/kraken_flu_test_kfv2_102124


In [15]:
kraken2 \
    --db ${DB_PATH} \
    --output ${TEST_BASE_DIR}/output.kraken \
    --paired \
    --classified-out ${TEST_BASE_DIR}/class_seqs#.fq \
    --unclassified-out ${TEST_BASE_DIR}/unclass_seqs#.fq \
    --report ${TEST_BASE_DIR}/report.txt \
/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/sim_reads/art/mix/mix_of_all_01-1.fq \
/lustre/scratch126/gsu/team112/personal/fs5/rvi_dev/sim_reads/art/mix/mix_of_all_01-2.fq

Loading database information... done.
6620 sequences (1.99 Mbp) processed in 0.428s (928.4 Kseq/m, 278.52 Mbp/m).
  4653 sequences classified (70.29%)
  1967 sequences unclassified (29.71%)
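
The file written by `--report` is tab-separated, with the columns: percentage, reads in clade, reads assigned directly, rank code, taxon ID and indented name. A minimal sketch for pulling out one rank level, for example the new influenza A segment taxa; `report_rank` is a hypothetical helper and the rank code to use depends on where the taxa of interest sit in the tree.

```shell
# Hypothetical helper: print clade reads, taxon ID and name for every row of
# a kraken2 report ($1) with rank code $2 and at least $3 clade reads.
report_rank() {
    awk -F'\t' -v rank="$2" -v min="${3:-1}" '
        $4 == rank && $2 + 0 >= min + 0 {
            name = $6
            sub(/^ +/, "", name)   # drop the indentation used for tree depth
            printf "%s\t%s\t%s\n", $2, $5, name
        }
    ' "$1"
}
```

For instance, `report_rank "${TEST_BASE_DIR}/report.txt" S2 10` would list the `S2` rows with at least 10 reads in the clade.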


In [16]:
cat ${TEST_BASE_DIR}/report.txt 

 29.71	1967	1967	U	0	unclassified
 70.29	4653	0	R	1	root
 70.29	4653	0	D	10239	  Viruses
 69.92	4629	0	D1	2559587	    Riboviria
 69.92	4629	0	K	2732396	      Orthornavirae
 39.83	2637	0	P	2497569	        Negarnaviricota
 39.43	2610	0	P1	2497571	          Polyploviricotina
 39.43	2610	0	C	2497577	            Insthoviricetes
 39.43	2610	0	O	2499411	              Articulavirales
 39.43	2610	0	F	11308	                Orthomyxoviridae
 25.53	1690	0	G	197911	                  Alphainfluenzavirus
 25.53	1690	0	S	2955291	                    Alphainfluenzavirus influenzae
 25.53	1690	36	S1	11320	                      Influenza A virus
  4.53	300	268	S2	3121628	                        Influenza A segment 2
  0.02	1	1	S3	3420097	                          A/Pink-eared Duck/Victoria/PD001/2017(H3N1) segment 2
  0.02	1	1	S3	3228002	                          A/India/Nag132467/2013(H1N1) segment 2
  0.02	1	1	S3	3127340	                          A/red knot/NJ/325/1989(H7N7) segment 2
  0.02	1	1	S3	3123