<h2><span style="color:gray">ipyrad-analysis toolkit:</span> window_extracter</h2>

<h5><span style="color:red">(Reference-only method)</span></h5>

Extract all sequence data within a genomic window, concatenate, and write to a phylip file. Useful for inferring the phylogeny near a specific gene/region of interest. 

Key features:

1. Automatically concatenates ref-mapped RAD loci in sliding windows.
2. Filters out sites with too much missing data.
3. Optionally removes samples from alignments.
4. Optionally uses consensus sequences to represent clades of multiple samples.

### Required software

In [1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda

In [2]:
import ipyrad.analysis as ipa
import toytree


### Short Tutorial:

The `window_extracter()` tool takes the `.seqs.hdf5` database file from ipyrad as its input. You select scaffolds by their integer index, which can be found in the `.scaffold_table`. The first step is to load the data file to see which scaffolds are in your data set and how long they are. Scaffolds are listed in the same order in which they appear in your reference genome.

#### Load data file to get scaffold information

In [3]:
# the path to your HDF5 formatted seqs file
data = "/home/deren/Downloads/ref_pop2.seqs.hdf5"


In [4]:
# check scaffold idx (row) against scaffold names
ipa.window_extracter(data).scaffold_table.head()

|   | scaffold_name | scaffold_length |
|---|---|---|
| 0 | Qrob_Chr01 | 55068941 |
| 1 | Qrob_Chr02 | 115639695 |
| 2 | Qrob_Chr03 | 57474983 |
| 3 | Qrob_Chr04 | 44977106 |
| 4 | Qrob_Chr05 | 70629082 |


#### Load tool and select window 
Enter the `data` file, the `workdir` where files will be written, and the `scaffold_idx` of the scaffold that you want to extract sequence data from. Use `start` and `end` to select the window. You can `exclude` samples to reduce missing data, and you can use `mincov` to filter out sites in the alignment that contain too much missing data.

The `.stats` attribute shows the information content of the selected window before and after filtering. When creating alignments, this tool excludes any sites that have no data (e.g., the space between RAD markers, or the space between paired reads). In this case, we selected a 5 Mb window that contained 51,474 bp of RAD sequence data and 1,687 SNPs. After filtering, this was reduced to 42,397 bp and 1,416 SNPs. The number of samples stayed the same because no sample consisted entirely of missing data (such samples would need to be excluded). To write the data to a file, call the `.run()` function.

In [5]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=data,
    workdir="analysis-window_extracter",
    scaffold_idx=0,
    start=0,
    end=5000000,
    exclude=["CUMM5"],
    mincov=20,
)

# show stats of the window
ext.stats

|   | scaffold | start | end | sites | snps | missing | samples |
|---|---|---|---|---|---|---|---|
| prefilter | Qrob_Chr01 | 0 | 5000000 | 51474 | 1687 | 0.19 | 29 |
| postfilter | Qrob_Chr01 | 0 | 5000000 | 42397 | 1416 | 0.11 | 29 |
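
The effect of `mincov` on the site counts above can be illustrated on a toy alignment. This is a minimal sketch, not ipyrad's implementation: it assumes that `mincov` is the minimum number of samples that must have data at a site, and that `N` and `-` mark missing data. The function name `filter_sites_by_coverage` is hypothetical.

```python
def filter_sites_by_coverage(alignment, mincov, missing="N-"):
    """Keep only alignment columns where at least `mincov` samples have data.

    `alignment` maps sample names to equal-length sequence strings.
    Assumption: characters in `missing` count as no data.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    nsites = len(seqs[0])
    if any(len(seq) != nsites for seq in seqs):
        raise ValueError("all sequences must have the same length")

    # indices of columns with enough non-missing samples
    keep = [
        idx for idx in range(nsites)
        if sum(seq[idx] not in missing for seq in seqs) >= mincov
    ]
    return {
        name: "".join(seq[idx] for idx in keep)
        for name, seq in zip(names, seqs)
    }


# toy alignment: the last column has data in only one sample
toy = {
    "A": "ACGTN",
    "B": "ACNTN",
    "C": "NCGTA",
}
filtered = filter_sites_by_coverage(toy, mincov=2)
print(filtered)  # {'A': 'ACGT', 'B': 'ACNT', 'C': 'NCGT'}
```

Raising the threshold removes more columns: with `mincov=3` only the two fully sampled columns of the toy alignment remain.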


#### Write selected window to a file

In [6]:
ext.run(force=True)

Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy
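
Before passing the file to other software it can be worth confirming its dimensions. The sketch below is a hypothetical helper (`read_phylip_shape` is not part of ipyrad) and assumes a relaxed sequential phylip layout: a header line with the number of taxa and sites, followed by one line per sample containing the name, whitespace, and the sequence. It is demonstrated on a small temporary file rather than on the real output.

```python
import os
import tempfile


def read_phylip_shape(path):
    """Return (ntaxa, nsites, names) for a sequential phylip file.

    Assumption: one line per sample, formatted as 'name  SEQUENCE'.
    """
    with open(path) as infile:
        header = infile.readline().split()
        ntaxa, nsites = int(header[0]), int(header[1])
        names = []
        for line in infile:
            if not line.strip():
                continue
            name, seq = line.split(None, 1)
            seq = "".join(seq.split())  # drop spaces and the newline
            if len(seq) != nsites:
                raise ValueError(f"{name}: expected {nsites} sites, found {len(seq)}")
            names.append(name)
    if len(names) != ntaxa:
        raise ValueError(f"expected {ntaxa} taxa, found {len(names)}")
    return ntaxa, nsites, names


# demonstrate on a toy file
with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy.phy")
    with open(toy_path, "w") as out:
        out.write("2 4\nA     ACGT\nB     ACNT\n")
    shape = read_phylip_shape(toy_path)

print(shape)  # (2, 4, ['A', 'B'])
```

For the window written above, the call would be `read_phylip_shape(ext.outfile)`; the site count is expected to match the postfilter row of `ext.stats`.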


<h3><span style="color:red">Advanced:</span> Infer tree from phy output</h3>

You can pass the file path created above to the `.raxml` analysis object in ipyrad, or use it in any other phylogenetic software that accepts phylip format. The stats table above shows that this alignment contains 42,397 sites with 1,416 SNPs and about 11% missing data.

In [7]:
# run raxml on the phylip file 
rax = ipa.raxml(data=ext.outfile, name="test", N=50, T=4)

# show the raxml command
print(rax.command)

raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy -p 54321 -N 50 -x 12345


In [8]:
# run job and wait to finish
rax.run(force=True)

job test finished successfully


In [9]:
# plot the tree for this genome window
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
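
To repeat the extract-and-infer steps for consecutive windows along a scaffold, the window coordinates can be generated up front. The generator below is a hypothetical helper, not an ipyrad function. It assumes windows start at position 0 and that the final window is truncated at the scaffold length; how `window_extracter` treats the `end` position (inclusive or exclusive) is not checked here.

```python
def sliding_windows(scaffold_length, window_size, slide_size=None):
    """Yield (start, end) tuples covering a scaffold.

    `slide_size` defaults to `window_size`, giving non-overlapping windows.
    """
    if slide_size is None:
        slide_size = window_size
    if window_size <= 0 or slide_size <= 0:
        raise ValueError("window_size and slide_size must be positive")
    for start in range(0, scaffold_length, slide_size):
        # truncate the last window at the end of the scaffold
        yield start, min(start + window_size, scaffold_length)


# Qrob_Chr01 is 55,068,941 bp long according to the scaffold table
windows = list(sliding_windows(55068941, 5000000))
print(len(windows))   # 12
print(windows[0])     # (0, 5000000)
print(windows[-1])    # (55000000, 55068941)
```

Each `(start, end)` pair could then be passed as the `start` and `end` arguments of `ipa.window_extracter`, as in the cell that selects the window.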

<h3><span style="color:red">Advanced:</span> Population/species sampling</h3>

When you have multiple samples per species, you can use an `imap` dictionary to group them into clades. A consensus sequence is then created for each clade, so that each clade is represented as a single taxon. This can be useful for reducing both the amount of missing data and the number of tips in the tree.

In [10]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data="/home/deren/Downloads/ref_pop2.seqs.hdf5",
    workdir="analysis-window_extracter",
    scaffold_idx=0,
    start=0,
    end=5000000,
    mincov=4,
    imap={
        "reference": ["reference"],
        "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
        "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
        "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
        "bran": ["BJSL25", "BJSB3", "BJVL19"],
        "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
        "sagr": ["CUVN10", "CUCA4", "CUSV6", "CUMM5"],
        "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017", "CRL0001"],
    },
)

In [11]:
# write the phylip file
ext.run(force=True)

Wrote data to /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-window_extracter/scaf0-0-5000000.phy


In [12]:
# consensus sampling reduced the alignment from 30 samples to 8 taxa
ext.stats

|   | scaffold | start | end | sites | snps | missing | samples |
|---|---|---|---|---|---|---|---|
| prefilter | Qrob_Chr01 | 0 | 5000000 | 51474 | 1690 | 0.21 | 30 |
| postfilter | Qrob_Chr01 | 0 | 5000000 | 50251 | 451 | 0.02 | 8 |
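
The drop in missing data follows from how a consensus pools information across samples: a site is missing in the consensus only if it is missing in every member of the clade. The sketch below illustrates this with a simple majority-rule consensus. It is an assumption for illustration only; ipyrad may resolve variable sites differently (for example with ambiguity codes), and the function names are hypothetical.

```python
from collections import Counter


def consensus(seqs, missing="N-"):
    """Majority-rule consensus of equal-length sequences.

    Missing characters are ignored; a column with no data becomes 'N'.
    Ties go to the base seen first (order of `seqs`).
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(base for base in column if base not in missing)
        out.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(out)


def consensus_by_clade(alignment, imap):
    """Collapse an alignment to one consensus sequence per imap clade."""
    return {
        clade: consensus([alignment[sample] for sample in samples])
        for clade, samples in imap.items()
    }


# toy data: three samples of one population plus the reference
toy_alignment = {
    "ref": "ACGTA",
    "s1": "ACGTN",
    "s2": "ACNTN",
    "s3": "ATGTN",
}
toy_imap = {"reference": ["ref"], "pop": ["s1", "s2", "s3"]}

collapsed = consensus_by_clade(toy_alignment, toy_imap)
print(collapsed)  # {'reference': 'ACGTA', 'pop': 'ACGTN'}
```

In the toy example, the third site is missing in `s2` but present in the `pop` consensus, while the last site stays missing because no member of `pop` has data there.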


In [13]:
# infer a tree from the consensus (imap) alignment
rax = ipa.raxml(data=ext.outfile, name="test2", N=50, T=4)
rax.run(force=True)

job test2 finished successfully


In [14]:
# plot the tree for this genome window
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");