# 1. Prepare SRA metadata

## 1.1. Read SRA data

The `SraRunTable.txt` file was downloaded for all **SRA Experiments** in NCBI for [BioProject PRJNA305824](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA305824).

In [32]:
import pandas as pd

sra_df = pd.read_csv('SraRunTable.txt.gz')
print(f'Table rows: {len(sra_df)}')
sra_df.head(3)

Table rows: 65


Unnamed: 0,Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,Bytes,Center Name,collected_by,...,Phagetype,Platform,ReleaseDate,Sample Name,Serovar,SRA Study,STRAIN,sub_species,Host_disease,Host
0,SRR3028736,WGS,419,621431550,PRJNA305824,SAMN04334627,Pathogen.cl,367138294,MCGILL UNIVERSITY,hospital,...,19,ILLUMINA,2015-12-19T00:00:00Z,SH12-001,Heidelberg,SRP067504,SH12-001,enterica,Salmonella gastroenteritis,Homo sapiens
1,SRR3028737,WGS,426,433789482,PRJNA305824,SAMN04334628,Pathogen.cl,251988185,MCGILL UNIVERSITY,hospital,...,19,ILLUMINA,2015-12-19T00:00:00Z,SH12-002,Heidelberg,SRP067504,SH12-002,enterica,Salmonella gastroenteritis,Homo sapiens
2,SRR3028738,WGS,432,407605751,PRJNA305824,SAMN04334629,Pathogen.cl,238018601,MCGILL UNIVERSITY,hospital,...,19,ILLUMINA,2015-12-19T00:00:00Z,SH12-003,Heidelberg,SRP067504,SH12-003,enterica,Salmonella gastroenteritis,Homo sapiens


## 1.2. Reduce to single SRA run per genomic sample

Some samples have multiple SRA runs. For each of these we keep the run with the most bases, which reduces the table from 65 runs to 59 (one SRA run per sample).

In [3]:
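# Sort so the largest run comes first within each sample, then count the runs
# per sample and keep the first (largest) one.
df = sra_df.sort_values('Bases', ascending=False).groupby('Sample Name').agg(
    {'Sample Name': 'count', 'Run': 'first'})
# Show only the samples that have more than one run.
df[df['Sample Name'] > 1]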

Sample Name,Sample Name,Run
SH12-003,2,SRR3684173
SH12-007,2,SRR3684194
SH13-004,2,SRR3711286
SH13-006,2,SRR3711252
SH14-009,2,SRR3711296
SH14-028,2,SRR3712208


In [4]:
# Keep only the selected run for each sample and index the table by sample name.
sra_df_reduced = sra_df.set_index('Run').loc[df['Run'].tolist()].reset_index().set_index('Sample Name')
print(f'Table rows: {len(sra_df_reduced)}')
sra_df_reduced.head(3)

Table rows: 59


Sample Name,Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,Bytes,Center Name,collected_by,...,PFGE_SecondaryEnzyme_pattern,Phagetype,Platform,ReleaseDate,Serovar,SRA Study,STRAIN,sub_species,Host_disease,Host
SH08-001,SRR3028792,WGS,429,354123684,PRJNA305824,SAMN04334683,Pathogen.cl,197484364,MCGILL UNIVERSITY,hospital,...,SHBNI.0001,19,ILLUMINA,2015-12-19T00:00:00Z,Heidelberg,SRP067504,SH08-001,enterica,Salmonella gastroenteritis,Homo sapiens
SH09-29,SRR3028793,WGS,422,519366460,PRJNA305824,SAMN04334684,Pathogen.cl,288691068,MCGILL UNIVERSITY,hospital,...,SHBNI.0001,26,ILLUMINA,2015-12-19T00:00:00Z,Heidelberg,SRP067504,SH09-29,enterica,Salmonella gastroenteritis,Homo sapiens
SH10-001,SRR3028783,WGS,421,387145160,PRJNA305824,SAMN04334674,Pathogen.cl,233911529,MCGILL UNIVERSITY,hospital,...,SHBNI.0001,19,ILLUMINA,2015-12-19T00:00:00Z,Heidelberg,SRP067504,SH10-001,enterica,Salmonella gastroenteritis,Homo sapiens
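
The same "largest run per sample" reduction can also be expressed with `drop_duplicates`. The sketch below is only an illustration on a made-up toy table (the run names, sample names, and base counts are hypothetical, not taken from `SraRunTable.txt`); it is an alternative to the `groupby` approach above, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for the SRA run table (all values are made up for illustration).
toy_df = pd.DataFrame({
    'Run': ['RUN_A', 'RUN_B', 'RUN_C', 'RUN_D'],
    'Sample Name': ['S1', 'S1', 'S2', 'S3'],
    'Bases': [100, 250, 300, 50],
})

# Sort so the largest run of each sample comes first, then keep only that row.
largest_runs = (
    toy_df.sort_values('Bases', ascending=False)
          .drop_duplicates('Sample Name', keep='first')
          .set_index('Sample Name')
          .sort_index()
)
print(largest_runs)
```

Here sample `S1` has two runs and only `RUN_B`, the one with more bases, is kept. The result is indexed by sample name, like `sra_df_reduced`.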


## 1.3. Save as metadata

Save the table as `metadata.tsv`.

In [5]:
sra_df_reduced.to_csv('metadata.tsv', sep='\t', index=True)

In [33]:
!gzip metadata.tsv
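
As a possible alternative to calling `gzip` separately, pandas can write a compressed file directly, since it infers gzip compression from a `.gz` file extension. The sketch below uses a hypothetical two-row table and a temporary directory to show the round trip. Reading the file back with `index_col='Sample Name'` is an assumption about how the metadata would later be loaded.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature metadata table indexed by sample name.
toy_metadata = pd.DataFrame(
    {'Run': ['RUN_A', 'RUN_B']},
    index=pd.Index(['S1', 'S2'], name='Sample Name'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'metadata.tsv.gz')
    # Compression is inferred from the '.gz' extension on write and on read.
    toy_metadata.to_csv(path, sep='\t', index=True)
    reloaded = pd.read_csv(path, sep='\t', index_col='Sample Name')

print(reloaded)
```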

# 2. Download SRA data

To download the SRA data, we first build a `fasterq-dump` command for each run.

In [24]:
command_df = sra_df_reduced.reset_index()[['Sample Name', 'Run']]
command_df['command'] = command_df.apply(lambda x: f'fasterq-dump --threads 4 -O fastq -o {x["Sample Name"]} --split-files {x["Run"]}', axis='columns')
command_df.head(3)

Unnamed: 0,Sample Name,Run,command
0,SH08-001,SRR3028792,fasterq-dump --threads 4 -O fastq -o SH08-001 ...
1,SH09-29,SRR3028793,fasterq-dump --threads 4 -O fastq -o SH09-29 -...
2,SH10-001,SRR3028783,fasterq-dump --threads 4 -O fastq -o SH10-001 ...
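
The command strings above are built with an f-string, which works here because the sample names contain no spaces or shell metacharacters. A slightly more defensive sketch is shown below: the helper `build_download_command` is hypothetical (it is not part of the notebook or of sra-tools) and uses `shlex.join` to quote each argument. The `fasterq-dump` flags are the ones already used above.

```python
import shlex


def build_download_command(sample_name, run, threads=4, out_dir='fastq'):
    """Return a shell-safe fasterq-dump command line for one SRA run."""
    args = [
        'fasterq-dump', '--threads', str(threads),
        '-O', out_dir, '-o', sample_name,
        '--split-files', run,
    ]
    # shlex.join quotes any argument that the shell would otherwise split.
    return shlex.join(args)


print(build_download_command('SH08-001', 'SRR3028792'))
# fasterq-dump --threads 4 -O fastq -o SH08-001 --split-files SRR3028792
```

For names such as `SH08-001` the output is identical to the f-string version. Quoting only changes the result if a sample name contains characters like spaces.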


Next, write the download commands to a file, one per line.

In [29]:
command_df['command'].to_csv('commands.txt', index=False, header=False)

Now run these commands with GNU `parallel`, which schedules all of them and downloads two runs at a time (`-j 2`).

In [31]:
!conda run --name sra-tools parallel -j 2 -a commands.txt

spots read      : 824,262
reads read      : 1,648,524
reads written   : 1,648,524
spots read      : 1,230,549
reads read      : 2,461,098
reads written   : 2,461,098
spots read      : 919,490
reads read      : 1,838,980
reads written   : 1,838,980
spots read      : 797,406
reads read      : 1,594,812
reads written   : 1,594,812
spots read      : 963,726
reads read      : 1,927,452
reads written   : 1,927,452
spots read      : 1,390,852
reads read      : 2,781,704
reads written   : 2,781,704
spots read      : 1,505,500
reads read      : 3,011,000
reads written   : 3,011,000
spots read      : 799,810
reads read      : 1,599,620
reads written   : 1,599,620
spots read      : 1,482,963
reads read      : 2,965,926
reads written   : 2,965,926
spots read      : 1,016,422
reads read      : 2,032,844
reads written   : 2,032,844
spots read      : 1,300,892
reads read      : 2,601,784
reads written   : 2,601,784
spots read      : 2,115,600
reads read      : 4,231,200
reads written   : 4,231,200
...
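
Where GNU `parallel` is not available, a similar "a few jobs at once" schedule could be approximated from Python. The sketch below is an assumption-laden stand-in, not what this notebook ran: the helper `run_commands` is hypothetical, and the demo uses trivial `exit` commands instead of real downloads.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, jobs=2):
    """Run shell commands with at most `jobs` running at once.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(command):
        return subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, commands))


# Demo with harmless commands; real use would pass the lines of commands.txt.
exit_codes = run_commands(['exit 0', 'exit 3', 'exit 0'], jobs=2)
print(exit_codes)
# [0, 3, 0]
```

Collecting the exit codes makes it easy to spot downloads that failed and need to be retried.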

Now gzip all of the downloaded FASTQ files.

In [37]:
import glob

fastq_files = glob.glob('fastq/*.fastq')
!parallel -j 32 gzip ::: {' '.join(fastq_files)}
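
Before indexing, it may be worth checking that every sample has both of its read files. The sketch below assumes the files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`; that naming is an assumption about the `fasterq-dump` output and should be compared against the actual contents of `fastq/`. The helper `find_missing_reads` is hypothetical, and the demo runs on empty placeholder files in a temporary directory.

```python
import os
import tempfile


def find_missing_reads(sample_names, directory,
                       suffixes=('_1.fastq.gz', '_2.fastq.gz')):
    """Return the expected read-file paths that do not exist in `directory`."""
    missing = []
    for sample in sample_names:
        for suffix in suffixes:
            path = os.path.join(directory, sample + suffix)
            if not os.path.exists(path):
                missing.append(path)
    return missing


# Demo: sample S2 is missing its second read file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['S1_1.fastq.gz', 'S1_2.fastq.gz', 'S2_1.fastq.gz']:
        open(os.path.join(tmp_dir, name), 'w').close()
    missing = [os.path.basename(p)
               for p in find_missing_reads(['S1', 'S2'], tmp_dir)]

print(missing)
# ['S2_2.fastq.gz']
```

In this notebook the sample names would come from `sra_df_reduced.index` and the directory would be `fastq`.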

# 3. Create index

In [38]:
!gdi --version

gdi, version 0.2.0.dev0


In [39]:
!gdi init salmonella-project

Initializing empty project in [salmonella-project]


## 3.1. Create index of variants and kmers

In [40]:
!gdi --project-dir salmonella-project --ncores 48 analysis \
    --reference-file reference/NC_011083.gbk.gz --use-conda \
    --kmer-size 31 --kmer-size 51 --kmer-size 71 --kmer-scaled 100 fastq/*.fastq.gz

2021-08-06 14:08:10 INFO: Automatically structuring 118 input files into assemblies/reads
2021-08-06 14:08:10 INFO: Processing 59 genomes to identify mutations
2021-08-06 14:08:10 INFO: Including snpeff annotations in snakemake results
2021-08-06 14:08:10 INFO: Running Snakemake for rule all
2021-08-06 14:45:10 INFO: Finished running snakemake.
2021-08-06 14:45:10 INFO: Indexing processed VCF files defined in [/home/CSCScience.ca/apetkau/workspace/genomics-data-index-examples/examples/create-index/snakemake-assemblies.1628276890.8617988/gdi-input.fofn]
2021-08-06 14:45:10 INFO: Attempting to load reference genome=[reference/NC_011083.gbk.gz]
2021-08-06 14:45:12 INFO: Sample batch 1/1: Stage 1/2 (Insert): Processed 0% (0/59) samples
2021-08-06 14:45:16 INFO: Sample batch 1/1: Stage 1/2 (Insert): Processed 8% (

## 3.2. Build ML tree from variants

In [41]:
!gdi --project-dir salmonella-project --ncores 48 rebuild tree \
    --align-type full --extra-params '--fast' NC_011083

2021-08-06 14:51:35 INFO: Started rebuilding tree for reference genome [NC_011083]
2021-08-06 14:55:33 INFO: Finished rebuilding tree


# 4. Zip up index

In [42]:
!zip -r salmonella-project.zip salmonella-project

  adding: salmonella-project/ (stored 0%)
  adding: salmonella-project/.gdi-data/ (stored 0%)
  adding: salmonella-project/.gdi-data/mlst/ (stored 0%)
  adding: salmonella-project/.gdi-data/kmer/ (stored 0%)
  adding: salmonella-project/.gdi-data/variation/ (stored 0%)
  adding: salmonella-project/.gdi-data/variation/c914594960253152ad5afe7f37c13f07.bed.gz (stored 0%)
  adding: salmonella-project/.gdi-data/variation/7caf8fe5b79d3fc197e0827b414aede2.vcf.gz (stored 0%)
  adding: salmonella-project/.gdi-data/variation/5ecb10f8301a39fba80d24b1c52fdb1f.bed.gz (stored 0%)
  adding: salmonella-project/.gdi-data/variation/5f092ab14de132a8a4aade3a1ceca559.bed.gz (deflated 0%)
  adding: salmonella-project/.gdi-data/variation/33ab27828bc93bd3bbc3043478489e54.vcf.gz (stored 0%)
  adding: salmonella-project/.gdi-data/variation/85f677a2be4239c5856666f6b9723acf.vcf.gz (stored 0%)
  adding: salmonella-project/.gdi-data/variation/2a12d0b784853134b15f6e605d75f2ab.vcf.gz (stored 0%)
  adding: salmonella-