# Minimal Genome Inference

## Introduction

Use comparative genomics to determine which elements of the genomes are highly conserved.

We will assume that all highly conserved elements are essential.

Subset the genome down to these essential sequences, then reconstruct a path through the genome that minimizes its size by including only the regions inferred to be essential.

## Methods

Pseudo-algorithm:
- pull the list of all genomes from NCBI
- build the pangenome
- filter out DNA sequences with low conservation
- walk the shortest / maximum likelihood path(s)
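The filtering step above can be sketched on toy data. This is a minimal illustration, not the Mycelia implementation: the function names `kmer_genome_counts` and `conserved_kmers` and the `min_fraction` parameter are hypothetical, genomes are plain ASCII strings, and reverse complements are ignored for brevity.

```julia
# Count, for each k-mer, the number of genomes that contain it at least once.
function kmer_genome_counts(genomes::Vector{String}, k::Int)
    counts = Dict{String, Int}()
    for genome in genomes
        seen = Set{String}()
        for i in 1:(length(genome) - k + 1)
            push!(seen, genome[i:i + k - 1])
        end
        # each genome contributes at most 1 to a k-mer's count
        for kmer in seen
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end

# Keep only the k-mers present in at least `min_fraction` of the genomes
# (hypothetical threshold; the notebook does not fix a value).
function conserved_kmers(genomes::Vector{String}, k::Int; min_fraction::Float64 = 0.9)
    counts = kmer_genome_counts(genomes, k)
    threshold = min_fraction * length(genomes)
    return Set(kmer for (kmer, n) in counts if n >= threshold)
end

toy_genomes = ["ACGTACGT", "ACGTTCGT", "ACGAACGT"]
conserved = conserved_kmers(toy_genomes, 3; min_fraction = 1.0)
```

With `min_fraction = 1.0` only k-mers shared by every toy genome survive; lowering the threshold trades strictness for tolerance of sequencing errors and rare variants.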

In [1]:
DATE = "2022-01-15"
TASK = "minimal-genome"
DIR = mkpath("$(homedir())/workspace/$(DATE)-$(TASK)")
cd(DIR)

In [2]:
pkgs = [
"Graphs",
"MetaGraphs",
"BioSequences",
"uCSV",
"DataFrames",
"FASTX",
"HTTP",
"CodecZlib",
"Revise",
"FileIO",
"JLD2",
"StatsPlots",
"ProgressMeter"
]

import Pkg
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $(basename(pkg))"))
end

import Mycelia

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/dev/Mycelia/docs/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/dev/Mycelia/docs/Manifest.toml`


### Virus: SARS-CoV-2

In [3]:
# all nucleotide records
# https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049

In [4]:
# accession list downloaded from
# https://www.ncbi.nlm.nih.gov/sars-cov-2/
covid_accessions = DataFrames.DataFrame(uCSV.read(joinpath(Mycelia.METADATA, "sars-cov-2-accession-list.txt"), header=1)...)[!, "id"]

10000-element Vector{String}:
 "MW592645"
 "MW592651"
 "MZ504055"
 "MZ500902"
 "MZ501473"
 "MZ501471"
 "MZ500800"
 "MW590397"
 "MW592612"
 "MW592636"
 "MZ500802"
 "MZ500903"
 "MZ504037"
 ⋮
 "MZ497191"
 "MZ473712"
 "MZ473745"
 "MZ433955"
 "MZ350108"
 "MZ350111"
 "MZ473755"
 "MZ434008"
 "MZ351856"
 "MZ351853"
 "MW893496"
 "MZ473957"

In [5]:
d = mkpath(joinpath(DIR, "covid-genomes"))
accession_list = covid_accessions
ProgressMeter.@showprogress for id in accession_list
    f = joinpath(d, "$(id).fasta")
    if !isfile(f)
        open(f, "w") do io
            fastx_io = FASTX.FASTA.Writer(io)
            for record in Mycelia.get_sequence(db="nuccore", accession=id)
                write(fastx_io, record)
            end
            close(fastx_io)
        end
    end
end

covid_genomes = joinpath(DIR, "covid-genomes.fasta")
open(covid_genomes, "w") do io
    ProgressMeter.@showprogress for id in accession_list
        f = joinpath(d, "$(id).fasta")
        for line in eachline(f)
            println(io, line)
        end
        println(io, "")
    end
end

Progress: 100%|█████████████████████████████████████████| Time: 0:00:02
Progress: 100%|█████████████████████████████████████████| Time: 0:00:04


In [10]:
md5 = first(split(read(`md5sum $(covid_genomes)`, String)))
# anchor the pattern so only the trailing extension is rewritten
covid_genomes_md5 = replace(covid_genomes, r"\.fasta$" => ".$(md5).fasta")
if !isfile(covid_genomes_md5)
    mv(covid_genomes, covid_genomes_md5)
end

Build pangenome graph

In [11]:
executable = joinpath(pkgdir(Mycelia), "bin", "mycelia.jl")

"/home/jupyter-cjprybol/.julia/dev/Mycelia/bin/mycelia.jl"

In [13]:
fastxs = [covid_genomes_md5]

1-element Vector{String}:
 "/home/jupyter-cjprybol/workspace/2022-01-15-minimal-genome/covid-genomes.1639a53c979d07066ff86ecdca2c5135.fasta"

In [9]:
Mycelia.assess_kmer_saturation(fastxs, outdir="$(covid_genomes_md5)-saturation")

LoadError: UndefVarError: Primes not defined

TODO: move many of the methods out of Mycelia.jl and replace them with the newer algorithms developed in notebooks.
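The call above failed before producing any output. As a rough illustration of the idea behind choosing k by saturation, one can measure what fraction of the 4^k possible DNA k-mers actually occurs in the data. This is not Mycelia's `assess_kmer_saturation`; the function name `kmer_space_fraction` is hypothetical, and the sketch assumes sequences are plain ASCII strings.

```julia
# Fraction of the 4^k possible DNA k-mers that is observed in `sequences`.
# A value near 1 suggests k is too small to discriminate between genomes.
function kmer_space_fraction(sequences::Vector{String}, k::Int)
    observed = Set{String}()
    for seq in sequences
        for i in 1:(length(seq) - k + 1)
            kmer = seq[i:i + k - 1]
            # skip windows containing ambiguous bases such as N
            all(c -> c in "ACGT", kmer) && push!(observed, kmer)
        end
    end
    return length(observed) / 4^k
end

toy = ["ACGTACGTTGCA"]
fractions = [k => kmer_space_fraction(toy, k) for k in 1:4]
```

Scanning odd values of k and picking the first one where the fraction drops well below 1 is one simple heuristic; the threshold is a judgment call.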

In [None]:
# k = 3
# k = 5
# 10 genomes - 92.283553 seconds (2.06 k allocations: 40.953 KiB)
# 100 genomes - 283.205233 seconds (5.89 k allocations: 101.141 KiB)
# 1000 genomes - ?
# 10_000 genomes - ?
k = 7

out = "$(covid_genomes_md5).$(k).jld2"
if !isfile(out)
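    # `fastxs` is the vector defined above; interpolating a vector into a
    # command expands each element as a separate argument
    @time run(`julia $(executable) construct --k $(k) --fastx $(fastxs) --out $(out)`)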
end

In [None]:
graph = FileIO.load(out)["graph"]

In [None]:
StatsPlots.histogram([graph.vprops[v][:weight] for v in Graphs.vertices(graph)])

In [None]:
# reduce and simplify

In [None]:
gfa_file = "$(out).gfa"
if !isfile(gfa_file)
    @time run(`julia $(executable) convert --in $(out) --out $(gfa_file)`)
end

In [None]:
run(`/home/jupyter-cjprybol/software/bin/Bandage image $(gfa_file) $(gfa_file).svg --depwidth 1 --deppower 1`)

Find maximum likelihood path through the genome
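One way to frame this step, assuming a first-order Markov model over the graph: normalise each vertex's outgoing edge counts into transition probabilities, then find the path that maximises their product. Maximising a product of probabilities is equivalent to minimising the sum of `-log(p)`, so Dijkstra's algorithm applies. The sketch below is an assumption about how the step could be done, not the method used by Mycelia; `most_likely_path` and its edge-dictionary input are hypothetical, and edge counts are assumed to be positive.

```julia
# Most likely path from `source` to `target` in a directed graph with `n`
# vertices, where `edges` maps (from, to) to a positive observation count.
function most_likely_path(edges::Dict{Tuple{Int, Int}, Int}, n::Int, source::Int, target::Int)
    # total outgoing count per vertex, used to normalise into probabilities
    out_total = zeros(Int, n)
    for ((u, _), weight) in edges
        out_total[u] += weight
    end
    cost = fill(Inf, n)        # accumulated -log(probability)
    previous = zeros(Int, n)   # back-pointers for path reconstruction
    visited = falses(n)
    cost[source] = 0.0
    for _ in 1:n
        # pick the cheapest unvisited vertex (O(n^2) overall; fine for a sketch)
        u = 0
        for v in 1:n
            if !visited[v] && (u == 0 || cost[v] < cost[u])
                u = v
            end
        end
        (u == 0 || isinf(cost[u])) && break
        visited[u] = true
        # relax every edge leaving u
        for ((a, b), weight) in edges
            a == u || continue
            candidate = cost[u] - log(weight / out_total[u])
            if candidate < cost[b]
                cost[b] = candidate
                previous[b] = u
            end
        end
    end
    if isinf(cost[target])
        return (Int[], 0.0)    # target unreachable
    end
    path = [target]
    while path[end] != source
        push!(path, previous[path[end]])
    end
    return (reverse(path), exp(-cost[target]))
end

toy_edges = Dict((1, 2) => 9, (1, 3) => 1, (2, 4) => 5, (3, 4) => 5)
path, probability = most_likely_path(toy_edges, 4, 1, 4)
```

In the toy graph the branch through vertex 2 is observed nine times out of ten, so it wins over the branch through vertex 3. A real pangenome graph would need a priority queue instead of the linear scan, and a decision about how to choose the source and target vertices.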

### Bacteria: Mycobacterium spp.

Find all Mycobacterium submissions in NCBI RefSeq and GenBank. Start with RefSeq.

In [None]:
# # db = "refseq"
# db = "genbank"
# ncbi_summary_url = "https://ftp.ncbi.nih.gov/genomes/$(db)/assembly_summary_$(db).txt"
# ncbi_summary_outfile = basename(ncbi_summary_url)
# if !isfile(ncbi_summary_outfile)
#     download(ncbi_summary_url, ncbi_summary_outfile)
# end
# ncbi_summary_table = DataFrames.DataFrame(uCSV.read(ncbi_summary_outfile, header=2, delim='\t')...)

Build pangenome graph

Find maximum likelihood path through the genome

### Eukaryote: S. cerevisiae

Find all S. cerevisiae submissions in NCBI RefSeq and GenBank. Start with RefSeq.

Build pangenome graph

Find maximum likelihood path through the genome

## Results

## References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02130-z

https://www.science.org/doi/10.1126/science.aaf4557

http://syntheticyeast.org/software-development/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370935/

https://en.wikipedia.org/wiki/Minimal_genome

https://www.science.org/doi/10.1126/science.aad6253

https://www.pnas.org/content/103/2/425