# Notebook 1: Reference genomes

In this notebook we download the FASTA sequence files for several closely related reference genomes. In this study they serve either as an anchor to map RAD-seq reads to, or as additional outgroup samples, which we generate by using the `digest_genome` tool to extract RAD-seq-like pseudo-reads from the sequence files.
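To make the idea of "RAD-seq-like pseudo-reads" concrete, here is a minimal sketch of an in-silico digest. It is an illustration only and not the ipyrad implementation: the function name `digest_sequence` is hypothetical, and it ignores details such as where the enzyme cuts within its recognition site, read copies, paired reads, and quality scores. It borrows the parameter names `re1`, `readlen`, `min_size`, and `max_size` from the calls used later in this notebook.

```python
import re


def digest_sequence(seq, re1="CTGCAG", readlen=150, min_size=300, max_size=800):
    """Return pseudo-reads from fragments bounded by two `re1` sites.

    Hypothetical helper for illustration; not part of ipyrad.
    """
    seq = seq.upper()
    # start position of every occurrence of the recognition site
    sites = [match.start() for match in re.finditer(re1, seq)]
    reads = []
    # each pair of consecutive sites bounds one candidate fragment
    for start, end in zip(sites, sites[1:]):
        fragment = seq[start:end + len(re1)]
        # size selection: keep only fragments within the requested window
        if min_size <= len(fragment) <= max_size:
            # the pseudo-read is the first `readlen` bases of the fragment
            reads.append(fragment[:readlen])
    return reads


# toy sequence: one 412 bp fragment (kept) and one 62 bp fragment (dropped)
toy = "CTGCAG" + "A" * 400 + "CTGCAG" + "A" * 50 + "CTGCAG"
reads = digest_sequence(toy)
print(len(reads), len(reads[0]))
```

In this toy example only the first fragment falls inside the 300 to 800 bp window, so a single 150 bp pseudo-read is returned.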


#### Resources: 

Ivalú Cacho, N., A. Millie Burrell, Alan E. Pepper, and Sharon Y. Strauss. 2014. “Novel Nuclear Markers Inform the Systematics and the Evolution of Serpentine Use in Streptanthus and Allies (Thelypodieae, Brassicaceae).” Molecular Phylogenetics and Evolution 72 (March): 71–81. https://doi.org/10.1016/j.ympev.2013.11.018.

Nikolov, Lachezar A., Philip Shushkov, Bruno Nevado, Xiangchao Gan, Ihsan A. Al‐Shehbaz, Dmitry Filatov, C. Donovan Bailey, and Miltos Tsiantis. 2019. “Resolving the Backbone of the Brassicaceae Phylogeny for Investigating Trait Diversity.” New Phytologist 222 (3): 1638–51. https://doi.org/10.1111/nph.15732.

http://brassicadb.org/brad/

In [1]:
import ipyrad.analysis as ipa

### Genome 1: *Sisymbrium irio*
This is the most closely related high-quality genome assembly, so we will use it as the reference for our RAD-seq assembly. It is available from http://brassicadb.org/brad/datasets/pub/BrassicaceaeGenome/. The RAD dataset also includes a *Sisymbrium* sample that will serve as an outgroup. We create a digested copy of the genome in case we decide to include it as a sample in an analysis that uses a different reference genome.

In [2]:
%%bash

# link to the Brassicaceae genome database
LINK="http://brassicadb.org/brad/datasets/pub/BrassicaceaeGenome"

# download the fasta genome file
wget -q \
     -O ../data_ref_genomes/S_irio.fa.gz \
     $LINK/Sisymbrium_irio/S.irio.fa.gz
   
# decompress the genome file
gunzip -f ../data_ref_genomes/S_irio.fa.gz
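
Because `wget -q` runs silently, it can be worth confirming that the decompressed file looks like a genome before digesting it. The sketch below is one way to do that, assuming a plain-text FASTA file; the helper name `summarize_fasta` is hypothetical and simply counts header lines and sequence characters.

```python
import os


def summarize_fasta(path):
    """Return (number of sequences, total number of bases) in a FASTA file.

    Hypothetical helper for a quick sanity check.
    """
    nseqs = 0
    nbases = 0
    with open(path) as infile:
        for line in infile:
            if line.startswith(">"):
                # each header line starts a new sequence record
                nseqs += 1
            else:
                # sequence lines: count characters, ignoring whitespace
                nbases += len(line.strip())
    return nseqs, nbases


# only runs if the download above has completed
FASTA = "../data_ref_genomes/S_irio.fa"
if os.path.exists(FASTA):
    nseqs, nbases = summarize_fasta(FASTA)
    print(f"{nseqs} sequences, {nbases} bases")
```

An empty result (zero sequences) would suggest that the download failed or that the link has changed.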

In [3]:
digest = ipa.digest_genome(
    fasta="../data_ref_genomes/S_irio.fa",
    name="S_irio",
    workdir="../data_fastqs/demux_digested",
    re1="CTGCAG", 
    re2=None,
    paired=False,
    readlen=150,
    ncopies=10,
    min_size=300,
    max_size=800,
)
digest.run()

extracted reads from 26869 positions


### Genome 2: *Arabidopsis thaliana*
This is a very high-quality assembly (TAIR10), but the species is more distantly related. We will digest it and use it as an outgroup sample.

In [4]:
%%bash

# link to the Brassicaceae genome database
LINK="http://brassicadb.org/brad/datasets/pub/BrassicaceaeGenome"

# download the fasta genome file
wget -q \
     -O ../data_ref_genomes/A_thaliana.fa.gz \
     $LINK/Arabidopsis_thaliana/TAIR10_genome.fas.gz

# decompress the genome file
gunzip -f ../data_ref_genomes/A_thaliana.fa.gz

In [5]:
digest = ipa.digest_genome(
    fasta="../data_ref_genomes/A_thaliana.fa",
    name="A_thaliana_TAIR10",
    workdir="../data_fastqs/demux_digested",
    re1="CTGCAG", 
    re2=None,
    paired=False,
    readlen=150,
    ncopies=10,
    min_size=300,
    max_size=800,
)
digest.run()

extracted reads from 19684 positions


### Genome 3: *Caulanthus amplexicaulis var. barbarae*

This is a draft (low-quality) reference genome assembly, but it comes from the closest relative for which a genome is available to us. As far as I know it is not yet published, but it was shared with us; see https://www.ncbi.nlm.nih.gov/bioproject/?term=209542. We will digest it and use it as a closely related outgroup sample.

In [6]:
# the genome file is located at ../data_ref_genomes/Caulanthus_amplexicaulis_var._barbarae.faa

In [7]:
digest = ipa.digest_genome(
    fasta="../data_ref_genomes/Caulanthus_amplexicaulis_var._barbarae.faa",
    name="C_amp_barbarae_REF",
    workdir="../data_fastqs/demux_digested",
    re1="CTGCAG", 
    re2=None,
    paired=False,
    readlen=150,
    ncopies=10,
    min_size=300,
    max_size=800,
)
digest.run()

extracted reads from 44669 positions
