# Create a genomics data index from sequences

First, let's print the commit of the GDI software we are using.

In [1]:
gdi_root = '../../dependencies/genomics-data-index/'
!git -C {gdi_root} rev-parse HEAD

bc2f81ea620e1319fd3a8b8d992d7965a7720d54


## Create project directory

In [2]:
gdi_project_dir = 'data/sars-cov-2-genbank-index'
!gdi init {gdi_project_dir}

Initializing empty project in [data/sars-cov-2-genbank-index]


## Create file listing all genome names

Create a file listing all the genomes to index. This step is optional (you can pass the files directly as `data/fasta/*.fasta`), but if the number of genomes grows beyond what the shell can handle on the command line, passing a file that lists them is the only option.

In [3]:
from pathlib import Path

fasta_dir = Path('data/fasta')
gdi_fofn = Path('data/gdi.fofn')

with open(gdi_fofn, 'w') as fh:
    for file in fasta_dir.iterdir():
        if file.name.endswith('.fasta'):
            fh.write(f'{file}\n')
            
print(f'Wrote list of files to: {gdi_fofn}')

Wrote list of files to: data/gdi.fofn
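
Before indexing, it can be worth checking that the list is what we expect. The below is a minimal sketch, not part of GDI: `read_fofn` and `missing_files` are hypothetical helper names, and the demonstration uses made-up files in a temporary directory so it runs on its own.

```python
from pathlib import Path
import tempfile


def read_fofn(fofn_path):
    """Return the non-empty lines of a file-of-file-names as Path objects."""
    with open(fofn_path) as fh:
        return [Path(line.strip()) for line in fh if line.strip()]


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not p.is_file()]


# Self-contained demonstration with made-up file names (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ['a.fasta', 'b.fasta']:
        (tmp / name).write_text('>seq\nACGT\n')

    # List three files, one of which (c.fasta) was never created
    demo_fofn = tmp / 'demo.fofn'
    demo_fofn.write_text(
        f"{tmp / 'a.fasta'}\n{tmp / 'b.fasta'}\n{tmp / 'c.fasta'}\n"
    )

    demo_paths = read_fofn(demo_fofn)
    demo_missing = missing_files(demo_paths)
    print(f'{len(demo_paths)} listed, {len(demo_missing)} missing')
```

In this notebook the same helpers could be pointed at `gdi_fofn`, e.g. `missing_files(read_fofn(gdi_fofn))`, which should come back empty if every listed genome is on disk.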


## Create an index

The below creates a genomics index by aligning the assembled genomes, identifying mutations, and loading the results into the project directory.

In [4]:
reference_file = 'MN996528.1-ref.fasta'
fasta_dir = 'data/fasta'

!/usr/bin/time -v gdi --ncores 32 --project-dir {gdi_project_dir} analysis assembly \
    --reference-file {reference_file} --assembly-input-file {gdi_fofn}

2021-05-06 14:30:01 INFO: Processing 395 genomes to identify mutations
2021-05-06 14:30:26 INFO: Indexing processed VCF files defined in [/home/CSCScience.ca/apetkau/workspace/gdi-evaluation/data/sars-cov-2-genbank/snakemake-assemblies.1620329401.778609/gdi-input.fofn]
Loading files listed in /home/CSCScience.ca/apetkau/workspace/gdi-evaluation/data/sars-cov-2-genbank/snakemake-assemblies.1620329401.778609/gdi-input.fofn
2021-05-06 14:30:27 INFO: Proccessed 0% (0/395) samples
2021-05-06 14:30:30 INFO: Proccessed 2% (7/395) samples
2021-05-06 14:30:30 INFO: Proccessed 4% (14/395) samples
2021-05-06 14:30:30 INFO: Proccessed 5% (21/395) samples
2021-05-06 14:30:30 INFO: Proccessed 7% (28/395) samples
2021-05-06 14:30:30 INFO: Proccessed 9% (35/395) samples
2021-05-06 14:30:30 INFO: Proccessed 11% (42/395) sampl

## Build phylogenetic tree

Let's build the phylogenetic tree separately.

In [7]:
# I set model here to make things run faster
# Appropriate model was one of the better-scoring ones determined previously on a different set of SARS-CoV-2 genomes
# I probably need to revisit this to make sure the model is appropriate for this data
tree_extra_params = '--fast -m GTR+F+R4'

# For some reason having more than one {variable} in a command-line string prevents variable substitution
# so I have to hard-code all of the paths/variables
!/usr/bin/time -v gdi --ncores 52 --project-dir data/sars-cov-2-genbank-index rebuild tree \
    --align-type full --extra-params '--fast -m GTR+F+R4' MN996528.1-ref

2021-05-06 14:30:50 INFO: Started rebuilding tree for reference genome [MN996528.1-ref]
2021-05-06 14:32:17 INFO: Finished rebuilding tree
	Command being timed: "gdi --ncores 52 --project-dir data/sars-cov-2-genbank-index rebuild tree --align-type full --extra-params --fast -m GTR+F+R4 MN996528.1-ref"
	User time (seconds): 3304.32
	System time (seconds): 341.99
	Percent of CPU this job got: 4079%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:29.39
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 182880
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1663925
	Voluntary context switches: 193585
	Involuntary context switches: 731423
	Swaps: 0
	File system inputs: 8
	File system outputs: 76560
	Socket messages sent: 0
	Socket messages received: 0
	Sign
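
One way to avoid hard-coding the paths in the tree-building cell is to assemble the command as an argument list in Python and run it with `subprocess` instead of `!`. The below is only a sketch under that assumption: `build_tree_command` is a hypothetical helper, and it uses only the `gdi` options already shown in the command above.

```python
import shlex


def build_tree_command(project_dir, reference_name, extra_params, ncores):
    """Assemble the `gdi rebuild tree` invocation as an argument list."""
    return [
        'gdi', '--ncores', str(ncores), '--project-dir', str(project_dir),
        'rebuild', 'tree', '--align-type', 'full',
        # Passed as one argument, just like the quoted string in the shell
        '--extra-params', extra_params,
        reference_name,
    ]


cmd = build_tree_command(
    project_dir='data/sars-cov-2-genbank-index',
    reference_name='MN996528.1-ref',
    extra_params='--fast -m GTR+F+R4',
    ncores=52,
)

# Print the shell-quoted form so it can be compared with the command above
print(shlex.join(cmd))

# To actually run it (requires gdi to be installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because the arguments are passed as a list, no shell quoting or `{variable}` substitution is involved, so values such as `gdi_project_dir` and `tree_extra_params` can be passed in directly. Note that this drops the `/usr/bin/time -v` wrapper; it could be added back by prepending `'/usr/bin/time', '-v'` to the list.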

Awesome. Now we're ready to search through our index and generate some visualizations.