# Minimal Genome Inference

## Introduction

Use comparative genomics to determine which elements of a set of related genomes are highly conserved.

We assume that all highly conserved elements are essential.

Subset each genome down to these inferred-essential sequences, then reconstruct a new path through the genome that minimizes its size by including only those regions.

## Methods

Pseudo-algorithm:
- Do the following for a virus, a mycobacterium, and a yeast:
    - pull the list of all genomes on NCBI
    - build a pangenome
    - filter out DNA sequences with low conservation
    - walk the shortest/maximum likelihood path(s)
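
The conservation filter in the list above can be sketched with exact k-mer matching. This is a minimal illustration rather than the notebook's implementation: the helper names `kmer_set`, `kmer_conservation`, and `conserved_kmers`, the `min_fraction` threshold, and the toy sequences are all assumptions introduced here, and conservation is measured simply as the fraction of genomes that contain each k-mer.

```julia
# Hypothetical helper: the set of overlapping k-mers in a sequence.
# Using a Set means each genome counts a given k-mer at most once.
kmer_set(seq::AbstractString, k::Int) =
    Set(String(seq[i:i+k-1]) for i in 1:(length(seq) - k + 1))

# Fraction of genomes that contain each observed k-mer.
function kmer_conservation(genomes::Vector{String}, k::Int)
    counts = Dict{String, Int}()
    for genome in genomes
        for kmer in kmer_set(genome, k)
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return Dict(kmer => n / length(genomes) for (kmer, n) in counts)
end

# Keep only k-mers present in at least `min_fraction` of the genomes.
function conserved_kmers(genomes::Vector{String}, k::Int; min_fraction::Float64 = 0.9)
    conservation = kmer_conservation(genomes, k)
    return Set(kmer for (kmer, f) in conservation if f >= min_fraction)
end

# Toy input (made up): three short sequences sharing the core ACGTAC.
toy_genomes = ["ACGTACGG", "TTACGTAC", "ACGTACTT"]
core = conserved_kmers(toy_genomes, 4; min_fraction = 1.0)
println(sort(collect(core)))
```

With `min_fraction = 1.0` only k-mers shared by every genome survive; lowering the threshold trades a larger retained set for a weaker claim of conservation.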

```julia
DATE = "2022-01-15"
TASK = "minimal-genome"
# Create a dated working directory for this analysis and switch into it.
DIR = mkpath(joinpath(homedir(), "workspace", "$(DATE)-$(TASK)"))
cd(DIR)
```

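```julia
# Packages used in this notebook.
pkgs = [
    "Graphs",
    "MetaGraphs",
    "BioSequences",
    "uCSV",
    "DataFrames",
    "FASTX",
    "HTTP",
    "CodecZlib",
    "Revise",
]

import Pkg
Pkg.add(pkgs)
# Import each package by name so calls stay fully qualified (e.g. DataFrames.DataFrame).
for pkg in pkgs
    eval(Meta.parse("import $(basename(pkg))"))
end

import Mycelia
```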

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `~/.julia/dev/Mycelia/docs/Project.toml`
 [90m [944b1d66] [39m[92m+ CodecZlib v0.7.0[39m
 [90m [626554b9] [39m[92m+ MetaGraphs v0.7.1[39m
[32m[1m    Updating[22m[39m `~/.julia/dev/Mycelia/docs/Manifest.toml`
 [90m [e7412a2a] [39m[93m↑ Ogg_jll v1.3.5+0 ⇒ v1.3.5+1[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mStatsAPI[39m
[32m  ✓ [39m[90mAdapt[39m
[32m  ✓ [39m[90mRequires[39m
[32m  ✓ [39m[90mCrayons[39m
[32m  ✓ [39m[90mStatic[39m
[32m  ✓ [39m[90mlibfdk_aac_jll[39m
[32m  ✓ [39m[90mLAME_jll[39m
[32m  ✓ [39m[90mWayland_protocols_jll[39m
[32m  ✓ [39m[90mJpegTurbo_jll[39m
[32m  ✓ [39m[90mOpenSSL_jll[39m
[32m  ✓ [39m[90mOgg_jll[39m
[32m  ✓ [39m[90mOpenSpecFun_jll[39m
[3

### Virus: SARS-CoV-2 (COVID-19)

Find all SARS-CoV-2 submissions in NCBI RefSeq and GenBank. Start with RefSeq.

```julia
refseq_summary_url = "https://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt"
refseq_summary_outfile = basename(refseq_summary_url)
# Download the assembly summary once and reuse the local copy afterwards.
if !isfile(refseq_summary_outfile)
    download(refseq_summary_url, refseq_summary_outfile)
end
# The column names are on the second line of the file, hence header=2.
refseq_summary_table = DataFrames.DataFrame(uCSV.read(refseq_summary_outfile, header=2, delim='\t')...)
```

| Row | # assembly_accession | bioproject | biosample | wgs_master | refseq_category |
|---|---|---|---|---|---|
| | String | String | String | String | String |
| 1 | GCF_000001215.4 | PRJNA164 | SAMN02803731 | | reference genome |
| 2 | GCF_000001405.39 | PRJNA168 | | | reference genome |
| 3 | GCF_000001635.27 | PRJNA169 | | | reference genome |
| 4 | GCF_000001735.4 | PRJNA116 | SAMN03081427 | | reference genome |
| 5 | GCF_000001905.1 | PRJNA70973 | SAMN02953622 | AAGU00000000.3 | representative genome |
| 6 | GCF_000001985.1 | PRJNA32665 | SAMN02953685 | ABAR00000000.1 | representative genome |
| 7 | GCF_000002035.6 | PRJNA13922 | SAMN06930106 | | reference genome |
| 8 | GCF_000002075.1 | PRJNA209509 | SAMN02953658 | AASC00000000.3 | representative genome |
| 9 | GCF_000002235.5 | PRJNA13728 | SAMN00829422 | AAGJ00000000.6 | representative genome |
| 10 | GCF_000002285.5 | PRJNA12384 | SAMN02953603 | AAEX00000000.4 | na |
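
Once the summary is loaded, the SARS-CoV-2 entries can be selected by NCBI taxonomy ID (2697049). The sketch below uses only Base Julia and a made-up, four-column stand-in for the file, so that it runs on its own. The `TOY_...` accessions, the `example.org` paths, and the helper `parse_summary` are placeholders, not real records or part of any library. The column names `taxid`, `organism_name`, and `ftp_path` do appear in the real assembly summary.

```julia
# Toy stand-in for assembly_summary_refseq.txt: one comment line, one header
# line starting with "# ", then tab-separated rows. All rows are invented.
toy_summary = """
#   toy comment line
# assembly_accession\ttaxid\torganism_name\tftp_path
TOY_000001.1\t2697049\tSevere acute respiratory syndrome coronavirus 2\thttps://example.org/toy1
TOY_000002.1\t559292\tSaccharomyces cerevisiae S288C\thttps://example.org/toy2
TOY_000003.1\t83332\tMycobacterium tuberculosis H37Rv\thttps://example.org/toy3
"""

# Hypothetical helper: parse the text into one Dict per row, keyed by column name.
function parse_summary(text::AbstractString)
    lines = split(strip(text), '\n')
    # The header is the second line; drop its leading "# " marker.
    header = String.(split(replace(lines[2], r"^#\s*" => ""), '\t'))
    return [Dict(zip(header, String.(split(line, '\t')))) for line in lines[3:end]]
end

rows = parse_summary(toy_summary)

# Keep only rows whose taxid matches SARS-CoV-2.
sars_cov_2_taxid = "2697049"
hits = filter(row -> row["taxid"] == sars_cov_2_taxid, rows)

for row in hits
    println(row["assembly_accession"], "\t", row["ftp_path"])
end
```

The same predicate could be applied to `refseq_summary_table`. Whether its `taxid` column holds integers or strings depends on how the file was parsed, so the comparison value may need adjusting.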


Build pangenome graph
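
One simple way to represent a pangenome graph is as a weighted k-mer graph: every k-mer is a node, consecutive k-mers in a genome are joined by a directed edge, and the counts record how often each was observed. The sketch below assumes this representation and uses plain dictionaries. The function `build_kmer_graph` and the toy sequences are hypothetical, and the notebook itself may rely on Mycelia or Graphs/MetaGraphs instead.

```julia
# Hypothetical helper: count k-mer nodes and k-mer-to-k-mer edges across genomes.
function build_kmer_graph(genomes::Vector{String}, k::Int)
    node_counts = Dict{String, Int}()
    edge_counts = Dict{Tuple{String, String}, Int}()
    for genome in genomes
        # Overlapping k-mers in the order they occur in this genome.
        kmers = [genome[i:i+k-1] for i in 1:(length(genome) - k + 1)]
        for kmer in kmers
            node_counts[kmer] = get(node_counts, kmer, 0) + 1
        end
        # Each pair of adjacent k-mers contributes one directed edge observation.
        for (a, b) in zip(kmers, kmers[2:end])
            edge_counts[(a, b)] = get(edge_counts, (a, b), 0) + 1
        end
    end
    return node_counts, edge_counts
end

# Toy input (made up): two copies of one sequence and a single-base variant.
toy_genomes = ["ACGTT", "ACGTA", "ACGTT"]
nodes, edges = build_kmer_graph(toy_genomes, 3)

for ((a, b), n) in sort(collect(edges))
    println(a, " -> ", b, "  observed ", n, " time(s)")
end
```

Nodes and edges seen in most genomes carry high counts, which is the signal a later conservation filter or path search can use.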

Find maximum likelihood path through the genome
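
A maximum likelihood path can be framed as a shortest-path problem if each edge is assigned a transition probability and the path likelihood is taken as the product of those probabilities. Minimizing the sum of `-log(p)` then maximizes the product. The sketch below assumes a first-order model in which the probability of an edge is its count divided by the total count leaving its source node, and runs a simple Dijkstra search without a priority queue. The function `max_likelihood_path` and the toy edge counts are hypothetical.

```julia
# Hypothetical helper: most likely path from `source` to `target`, where the
# probability of edge (a, b) is count(a, b) / total count leaving a.
function max_likelihood_path(edge_counts::Dict{Tuple{String, String}, Int},
                             source::String, target::String)
    out_total = Dict{String, Int}()
    for ((a, _), n) in edge_counts
        out_total[a] = get(out_total, a, 0) + n
    end
    # Adjacency list with edge cost = -log(transition probability).
    neighbors = Dict{String, Vector{Tuple{String, Float64}}}()
    for ((a, b), n) in edge_counts
        push!(get!(neighbors, a, Tuple{String, Float64}[]), (b, -log(n / out_total[a])))
    end
    dist = Dict(source => 0.0)
    prev = Dict{String, String}()
    visited = Set{String}()
    while true
        # Pick the closest node that has not been finalized yet.
        candidates = [(d, v) for (v, d) in dist if !(v in visited)]
        isempty(candidates) && break
        best_dist, node = minimum(candidates)
        node == target && break
        push!(visited, node)
        for (next_node, cost) in get(neighbors, node, Tuple{String, Float64}[])
            if best_dist + cost < get(dist, next_node, Inf)
                dist[next_node] = best_dist + cost
                prev[next_node] = node
            end
        end
    end
    haskey(dist, target) || return nothing  # target is unreachable
    # Walk the predecessor links back from the target.
    path = [target]
    while path[1] != source
        pushfirst!(path, prev[path[1]])
    end
    return path, exp(-dist[target])
end

# Toy edge counts (made up): CGT is followed by GTT twice and by GTA once.
toy_edges = Dict(("ACG", "CGT") => 3, ("CGT", "GTT") => 2, ("CGT", "GTA") => 1)
path, likelihood = max_likelihood_path(toy_edges, "ACG", "GTT")
println(join(path, " -> "), "  likelihood = ", likelihood)
```

Because every cost is non-negative, Dijkstra's algorithm applies directly. On real graphs a priority queue would be needed for acceptable performance.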

### Bacteria: Mycobacterium spp.

Find all Mycobacterium submissions in NCBI RefSeq and GenBank. Start with RefSeq.

Build pangenome graph

Find maximum likelihood path through the genome

### Eukaryote: S. cerevisiae

Find all S. cerevisiae submissions in NCBI RefSeq and GenBank. Start with RefSeq.

Build pangenome graph

Find maximum likelihood path through the genome

## Results

## References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02130-z

https://www.science.org/doi/10.1126/science.aaf4557

http://syntheticyeast.org/software-development/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370935/

https://en.wikipedia.org/wiki/Minimal_genome

https://www.science.org/doi/10.1126/science.aad6253

https://www.pnas.org/content/103/2/425