# Minimal Genome Inference

## Introduction

Use comparative genomics to determine which elements of a set of related genomes are highly conserved.

We will assume that all highly conserved elements are essential.

Subset the genome down to these inferred-essential sequences, then reconstruct a new path through the genome that minimizes its size by including only those regions.

## Methods

Pseudo-algorithm:
- do the following for a virus, mycobacterium, yeast
    - pull list of all genomes on NCBI
    - build pangenome
    - filter out DNA sequences with low conservation
    - walk shortest/maximum likelihood path(s)
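The conservation filter in the steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual method: it assumes conservation is measured as the fraction of genomes containing a given k-mer, it ignores reverse complements, and the names `kmer_conservation`, `conserved_kmers`, and `min_fraction` are hypothetical.

```julia
# Hypothetical helper: for each k-mer, compute the fraction of genomes
# that contain it at least once.
function kmer_conservation(genomes::Vector{String}, k::Int)
    counts = Dict{String, Int}()
    for genome in genomes
        # A Set ensures each genome contributes at most one count per k-mer.
        seen = Set{String}()
        for i in 1:(length(genome) - k + 1)
            push!(seen, genome[i:i+k-1])
        end
        for kmer in seen
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return Dict(kmer => n / length(genomes) for (kmer, n) in counts)
end

# Hypothetical helper: keep only k-mers present in at least
# `min_fraction` of the genomes (1.0 means present in every genome).
function conserved_kmers(genomes::Vector{String}, k::Int; min_fraction::Float64 = 1.0)
    conservation = kmer_conservation(genomes, k)
    return Set(kmer for (kmer, fraction) in conservation if fraction >= min_fraction)
end

# Toy genomes that differ at a single position.
toy_genomes = ["ACGTACGA", "ACGTTCGA", "ACGTGCGA"]
core = conserved_kmers(toy_genomes, 3)
```

Here `core` holds the 3-mers shared by all three toy genomes. Lowering `min_fraction` would relax the definition of "highly conserved".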

In [1]:
DATE = "2022-01-15"
TASK = "minimal-genome"
DIR = mkpath("$(homedir())/workspace/$(DATE)-$(TASK)")
cd(DIR)

In [2]:
pkgs = [
"Graphs",
"MetaGraphs",
"BioSequences",
"uCSV",
"DataFrames",
"FASTX",
"HTTP",
"CodecZlib",
"Revise",
]

import Pkg
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $(basename(pkg))"))
end

import Mycelia

    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
  No Changes to `~/.julia/dev/Mycelia/docs/Project.toml`
  No Changes to `~/.julia/dev/Mycelia/docs/Manifest.toml`
Precompiling project...
  ✓ Mycelia
  1 dependency successfully precompiled in 22 seconds (272 already precompiled)


### Virus: SARS-CoV-2

In [None]:
# all nucleotide records
# https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049

In [23]:
# accession list downloaded from
# https://www.ncbi.nlm.nih.gov/sars-cov-2/
covid_accessions = DataFrames.DataFrame(uCSV.read(joinpath(Mycelia.METADATA, "sars-cov-2-accession-list.txt"), header=1)...)[!, "id"]

10000-element Vector{String}:
 "MW592645"
 "MW592651"
 "MZ504055"
 "MZ500902"
 "MZ501473"
 "MZ501471"
 "MZ500800"
 "MW590397"
 "MW592612"
 "MW592636"
 "MZ500802"
 "MZ500903"
 "MZ504037"
 ⋮
 "MZ497191"
 "MZ473712"
 "MZ473745"
 "MZ433955"
 "MZ350108"
 "MZ350111"
 "MZ473755"
 "MZ434008"
 "MZ351856"
 "MZ351853"
 "MW893496"
 "MZ473957"

In [26]:
d = mkpath(joinpath(DIR, "covid-genomes"))
accession_list = covid_accessions[1:3]
for id in accession_list
    f = joinpath(d, "$(id).fasta")
    if !isfile(f)
        open(f, "w") do io
            fastx_io = FASTX.FASTA.Writer(io)
            # fetch the record for the current accession (covid_accessions is a plain Vector{String})
            for record in Mycelia.get_sequence(db="nuccore", accession=id)
                write(fastx_io, record)
            end
            close(fastx_io)
        end
    end
end

covid_genomes = joinpath(DIR, "covid-genomes.fasta")
open(covid_genomes, "w") do io
    for id in accession_list
        f = joinpath(d, "$(id).fasta")
        for line in eachline(f)
            println(io, line)
        end
        println(io, "")
    end
end

Build pangenome graph
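As a rough illustration of this step, the sketch below builds a k-mer (de Bruijn-style) graph from plain strings. It is an assumption about what "pangenome graph" means here, not the Mycelia implementation: nodes are k-mers, edges join k-mers that are adjacent in some sequence, and both carry observation counts. The function name `build_kmer_graph` is hypothetical.

```julia
# Hypothetical helper: build node and edge count tables for a k-mer graph.
function build_kmer_graph(sequences::Vector{String}, k::Int)
    node_counts = Dict{String, Int}()
    edge_counts = Dict{Tuple{String, String}, Int}()
    for sequence in sequences
        # All overlapping k-mers, in the order they occur in the sequence.
        kmers = [sequence[i:i+k-1] for i in 1:(length(sequence) - k + 1)]
        for kmer in kmers
            node_counts[kmer] = get(node_counts, kmer, 0) + 1
        end
        # Each pair of consecutive k-mers defines a directed edge.
        for (a, b) in zip(kmers, kmers[2:end])
            edge_counts[(a, b)] = get(edge_counts, (a, b), 0) + 1
        end
    end
    return node_counts, edge_counts
end

# Two toy sequences that share a prefix and then diverge.
toy_node_counts, toy_edge_counts = build_kmer_graph(["ACGTA", "ACGAA"], 3)
```

The shared k-mer `ACG` ends up with a count of 2 and two outgoing edges, which is the kind of branch point a later path-finding step has to resolve.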

Find maximum likelihood path through the genome
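One way to read "maximum likelihood path" is as a shortest-path problem: if each edge is given the cost `-log(p)`, where `p` is the probability of taking that edge out of its source node, then the lowest-cost path is the most probable one. The sketch below assumes that interpretation, estimates `p` from edge counts, and uses a simple Dijkstra search with a linear scan. The function name `most_likely_path` and the toy counts are hypothetical.

```julia
# Hypothetical helper: most probable path between two k-mers, treating
# normalised edge counts as transition probabilities.
function most_likely_path(edge_counts::Dict{Tuple{String, String}, Int},
                          source::String, target::String)
    # Total outgoing count per node, used to turn counts into probabilities.
    out_totals = Dict{String, Int}()
    for ((a, _), n) in edge_counts
        out_totals[a] = get(out_totals, a, 0) + n
    end
    # Adjacency list with edge cost = -log(transition probability).
    neighbors = Dict{String, Vector{Tuple{String, Float64}}}()
    for ((a, b), n) in edge_counts
        push!(get!(neighbors, a, Tuple{String, Float64}[]), (b, -log(n / out_totals[a])))
    end
    # Dijkstra's algorithm; a linear scan picks the cheapest unvisited node.
    cost = Dict(source => 0.0)
    previous = Dict{String, String}()
    visited = Set{String}()
    while true
        frontier = [(c, node) for (node, c) in cost if !(node in visited)]
        isempty(frontier) && return String[]  # target is unreachable
        c, node = minimum(frontier)
        node == target && break
        push!(visited, node)
        for (next, w) in get(neighbors, node, Tuple{String, Float64}[])
            if c + w < get(cost, next, Inf)
                cost[next] = c + w
                previous[next] = node
            end
        end
    end
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[end] != source
        push!(path, previous[path[end]])
    end
    return reverse(path)
end

# Toy bubble: the upper branch was observed three times, the lower once.
toy_edges = Dict(
    ("ACG", "CGT") => 3, ("CGT", "GTT") => 3, ("GTT", "TTA") => 3, ("TTA", "TAG") => 3,
    ("ACG", "CGA") => 1, ("CGA", "GAT") => 1, ("GAT", "ATA") => 1, ("ATA", "TAG") => 1,
    ("TAG", "AGC") => 4,
)
best_path = most_likely_path(toy_edges, "ACG", "AGC")
```

On this toy graph the search follows the more frequently observed branch. A real pangenome graph would also need to handle cycles and both strands, which this sketch does not address.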

### Bacteria: Mycobacterium spp.

Find all Mycobacterium submissions in NCBI RefSeq and GenBank. Start with RefSeq.

In [None]:
# # db = "refseq"
# db = "genbank"
# ncbi_summary_url = "https://ftp.ncbi.nih.gov/genomes/$(db)/assembly_summary_$(db).txt"
# ncbi_summary_outfile = basename(ncbi_summary_url)
# if !isfile(ncbi_summary_outfile)
#     download(ncbi_summary_url, ncbi_summary_outfile)
# end
# ncbi_summary_table = DataFrames.DataFrame(uCSV.read(ncbi_summary_outfile, header=2, delim='\t')...)

Build pangenome graph

Find maximum likelihood path through the genome

### Eukaryote: S. cerevisiae

Find all S. cerevisiae submissions in NCBI RefSeq and GenBank. Start with RefSeq.

Build pangenome graph

Find maximum likelihood path through the genome

## Results

## References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02130-z

https://www.science.org/doi/10.1126/science.aaf4557

http://syntheticyeast.org/software-development/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370935/

https://en.wikipedia.org/wiki/Minimal_genome

https://www.science.org/doi/10.1126/science.aad6253

https://www.pnas.org/content/103/2/425