# `panning-extended` demonstration

This notebook demonstrates how to explore the **processed** `panning-extended` dataset using the `nbseq` library. If you have not already done so, **download and extract** the processed dataset by following the instructions in the [README](../../../README.md). You also need to install the `nbseq` library by following the instructions in [the README for that repository](http://github.com/caseygrun/nbseq).

In [None]:
import nbseq
import os

# change working directory to `./panning-extended` for simplicity of access to feature tables, etc
# make sure we don't do this twice, or we'll end up in the wrong place and be very confused
if 'dir_changed' not in globals():
    os.chdir('../../')
    dir_changed = True

## Load data into `nbseq.Experiment`

Load experiment sample metadata, feature tables, and sequences. 

Note that several additional files, such as various transformed feature tables, beta-diversity calculations, and the large `mmseqs2` database (which is needed to search the dataset for VHHs with similar sequences), are omitted from the processed dataset to limit complexity and file size. You will therefore receive some warnings (not errors!) when running the cell below; those files are not needed for this demonstration. If, however, you receive a `FileNotFoundError`, you will not be able to proceed. In that case, please check that you have correctly followed the instructions in the [README](../../../README.md) to download and extract the processed dataset.

Missing files can be regenerated using the included snakemake workflow `workflow/downstream.smk`. The `mmseqs2` database can be regenerated on demand by running `snakemake --use-conda --cores all -s workflow/downstream.smk -- intermediate/cdr3/features_db/` for the CDR3 feature space or `snakemake --use-conda --cores all -s workflow/downstream.smk -- intermediate/aa/features_db/` for the amino acid feature space.
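Before running the loading cell, it can help to confirm that the two input files it reads are present relative to the working directory. The sketch below uses only the standard library. `find_missing` is a hypothetical helper (not part of `nbseq`), and the paths are the ones passed to `nbseq.Experiment.from_files` in this notebook.

```python
from pathlib import Path

# Paths read by the loading cell in this notebook, relative to the
# `panning-extended` working directory.
REQUIRED_FILES = [
    "results/tables/cdr3/asvs.csv",
    "config/metadata_full.csv",
]

def find_missing(paths, root="."):
    """Return the subset of `paths` that do not exist under `root`.

    Hypothetical helper for illustration; not part of nbseq.
    """
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]

missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing files (check the README download instructions):")
    for p in missing:
        print("  " + p)
else:
    print("All required input files found.")
```

If anything is reported as missing here, the loading cell would be expected to fail with a `FileNotFoundError` rather than a warning.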

In [None]:
ex = nbseq.Experiment.from_files(
    fd_cdr3='results/tables/cdr3/asvs.csv',
    metadata='config/metadata_full.csv'
) 

Print a summary of the data we have loaded:

In [None]:
ex

An `Experiment` is a collection of **feature tables** in different **feature spaces** (e.g. amino acid, `aa`; CDR3, `cdr3`; etc.). Each feature table is stored in [AnnData](https://anndata.readthedocs.io) format. The **sample metadata** (`obs`) is shared among the feature spaces, whereas each space has its own **feature metadata** (`var`). You can access the feature tables within `ex.fts`:

In [None]:
ex.fts.cdr3

View a summary of all **selections** in the experiment and the columns of metadata:

In [None]:
ex.summarize_selections()

Samples in the feature table may be separated into several sub-experiments (indicated by the `expt` column in `obs`); this is useful if multiple panning campaigns were conducted with different conditions.

In this case, several samples from the final round of the high-throughput panning experiment (expt `027i`) were re-sequenced, in addition to samples from the main extended panning experiment (expt `027j`). `024f` is an unrelated experiment. `027j.lib` contains the un-panned input library (i.e. the round 1 input phage) and was used to build the "null" enrichment model.

In [None]:
# ex._expt_metadata = nbseq.utils.sample_metadata_to_expt_metadata(ex.obs)
ex.summarize_expts()

Subsets of the feature tables can be extracted using `ex.query` or `nbseq.ft.query`:

In [None]:
ex.query("expt == '027j'", space='cdr3')
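The query strings used in this notebook resemble pandas `DataFrame.query` expressions. Assuming they are evaluated against the sample metadata in that way (an assumption; the `nbseq` documentation is the authority), the following sketch shows how such an expression filters a small metadata table. The sample names and rows are invented for illustration.

```python
import pandas as pd

# Toy sample metadata; names and values are invented for illustration only.
obs = pd.DataFrame(
    {
        "expt": ["027i", "027j", "027j", "027j.lib", "024f"],
        "r": [4, 1, 2, 1, 3],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Keep only samples from one sub-experiment
only_027j = obs.query("expt == '027j'")

# Combine conditions with `|` (or); `&` (and) works the same way
with_library = obs.query("expt == '027j' | expt == '027j.lib'")

print(list(only_027j.index))     # ['s2', 's3']
print(list(with_library.index))  # ['s2', 's3', 's4']
```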

## Ordination

In [None]:
import nbseq.ordination
import nbseq.viz.ord
import nbseq.ft

import matplotlib.pyplot as plt

In [None]:
# read scran-normalized feature table
ft_cdr3_scran = nbseq.ft.query(
    # only consider the extended panning experiment and the input library
    nbseq.ft.read('results/tables/cdr3/transformed/scran/feature_table.biom', metadata=ex.obs),
    "expt == '027j' | expt == '027j.lib'",
    axis='sample'
)

# ordinate by truncated SVD with 100 components
ord_skl, ord_skbio = nbseq.ordination.ordinate(ft_cdr3_scran, method='TSVD', n_components=100)
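Truncated SVD places each sample in a low-dimensional space spanned by the top `k` singular vectors of the feature table. The sketch below illustrates the idea with NumPy on a random matrix. It is not the `nbseq.ordination.ordinate` implementation, and the matrix shape and `k` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature table": 6 samples x 20 features (random, for illustration only)
X = rng.random((6, 20))
k = 3  # number of components to keep

# Thin SVD: X = U @ diag(S) @ Vt, with singular values sorted in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Sample coordinates in the k-dimensional space
coords = U[:, :k] * S[:k]

# Equivalent view: project the samples onto the top-k right singular vectors
projected = X @ Vt[:k].T

print(coords.shape)                    # (6, 3)
print(np.allclose(coords, projected))  # True
```

Each row of `coords` is one sample, so the first three columns could be plotted as a 3D scatter, as in the ordination plot in this notebook.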

In [None]:
# plot ordination
rs = sorted(ex.obs['r'].unique())
nbseq.viz.ord.ordination_mpl(
    ord_skbio, ex.fts.cdr3.obs,
    s=10,
    color='r', color_order=rs,
    cmap=nbseq.viz.ord.discrete_cmap(len(rs), 'viridis'),
    camera=dict(elev=30, azim=-45, roll=0),
    fig_kw=dict(figsize=(8, 6)))
plt.legend(title="round")

## Interactive dashboards

`nbseq` provides interactive "dashboard" visualizations using the [Panel](https://panel.holoviz.org) and [Altair](https://altair-viz.github.io) libraries.

In [None]:
import nbseq.viz
import nbseq.viz.syntax
import nbseq.viz.dash

# load styles to view collapsible accordions and color-coded amino acid and nucleic acid strings in the notebook
nbseq.viz.setup_accordion()
nbseq.viz.syntax.aa_highlighter.setup_notebook()
nbseq.viz.syntax.na_highlighter.setup_notebook()
 
# import libraries for interactive visualization
import altair as alt
alt.data_transformers.enable("default")
alt.data_transformers.disable_max_rows()

import panel as pn
pn.extension('tabulator','vega')

The `selection_group_dashboard` allows one group of selections (e.g. those that are positive for a given antigen) to be compared against another group of selections. 

Here we compare the selections that are FlgEHKL (flagellar hook-basal body) positive against all other selections.

This view is interactive:
- mouse over a feature (i.e. a CDR3) to view details about it;
- click on the feature to focus on it in the other subplots within the visualization;
- double-click to clear the selection and show all features.

Note that when you select a feature, some subplots may disappear if the feature does not appear in that selection at all.

This plot is in the CDR3 feature space. It is possible to view the visualization in amino acid feature space, but >10 GB of server memory is required to run the visualization.

In [None]:
bad_samples_flgEHKL = []

nbseq.viz.dash.selection_group_dashboard(
    ex, starting_phenotype='FlgEHKL', 
    global_query=(
        # consider only samples from sub-experiment '027j'
        "expt == '027j' & io == 'i' & kind == '+'")
)

The `vhh_dashboard` examines the behavior of a single feature (e.g. an rVHH) across all selections. 

- Choose a phenotype (e.g. FlgEHKL) from the drop-down menu to color points on both plots
- Hold the alt (option) key and drag to select points in the right-hand (enrichment-abundance) plot; this will update the graph to the left
- You can select a different CDR3 by entering its hash in the text box (e.g. choose from the table in the dashboard above). 

In [None]:
nbseq.viz.dash.vhh_dashboard(ex, 
                             feature='6d72a8720c935bb6bb7cb02e03b5381f', 
                             global_query="expt == '027j' & kind == '+' & io == 'i'", space='cdr3')
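The feature identifier passed above is a 32-character hexadecimal string, which is the length of an MD5 digest. Assuming that identifiers are derived by hashing the feature's sequence (a guess that this notebook does not confirm), an identifier of this form could be computed with the standard library as sketched below. The example sequence and the helper name `sequence_id` are made up.

```python
import hashlib

def sequence_id(seq: str) -> str:
    """Return the MD5 hex digest of a sequence string.

    Hypothetical helper; whether nbseq computes identifiers this way
    (and how it normalizes the sequence first) is an assumption.
    """
    return hashlib.md5(seq.encode("ascii")).hexdigest()

example_cdr3 = "AAGRWYDY"  # invented sequence, for illustration only
feature_id = sequence_id(example_cdr3)

print(len(feature_id))  # 32 hex characters, the same length as the ID above
```

A content-derived identifier like this is stable across runs, which is why the same hash can be copied from one dashboard and pasted into another.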

## Picking and resynthesizing rVHHs

Say we have identified several promising CDR3s that we would like to reconstitute as full-length recombinant VHHs. We can use the `nbseq.resynth.Cart` object in `ex.cart` to collect these candidates, review their behavior across all samples, and generate the amino acid and nucleic acid sequences needed to order them.

In [None]:
import pandas as pd
from io import StringIO

df = pd.read_csv(
    StringIO("""CDR3ID	CDR3_mn	Antigen	Interest	pick	Notes
6d72a8720c935bb6bb7cb02e03b5381f	sonata nikita tourist	FlgEHKL	****	1	mean 50x enrichment, enriched in 6 FlgEHKL+ samples, basically all samples where enriched are FlgEHKL+, some FliC+ but not all and most enriched samples were not FliC+
49e0bad9177fcee66f22f56d74511b26	candid alert griffin	FlgEHKL	**	1	only other somewhat good looking one for FlgEHKL…  enriched in 2, dominates 1 sample; rarely enriched elsewhere 
0c97620726c0a010e74c44b1148149ba	ocean invest artist	FlgEHKL	**	1	2 FlgEHKL+ samples, hardly anywhere else
989eda0b48b0c47e024b0ccac3f61248	violet janet block	FlgEHKL	**	1	2* FlgEHKL samples, 2 FlgEHKL+ samples
"""), sep="\t"
)
dff = df.join(nbseq.ft.fortify_features(ex.fts.cdr3), on='CDR3ID').drop('abundance', axis='columns')

ex.cart.add_from_dataframe(dff,description_col='Notes', antigen_col='Antigen')

ex.cart.show_queue(sort=True)

Review the behavior of these candidates across all selections. This shows an enrichment-abundance plot for each selected CDR3:

In [None]:
ex.cart.visualize_queue()

Generate an amino acid sequence, reverse-translate it to nucleic acid, and add adapter sequences:

In [None]:
ex.cart.resynthesize()
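Converting an amino acid sequence back to a nucleic acid sequence means choosing a codon for each residue. The sketch below shows the simplest possible version, with one fixed codon per amino acid from the standard genetic code. It is not the algorithm used by `ex.cart.resynthesize()`; the codon chosen for each residue is arbitrary, the peptide is invented, and no adapters are added.

```python
# One codon per amino acid (standard genetic code). The specific codon chosen
# for each residue is an arbitrary illustrative choice, not nbseq's.
CODON_FOR_AA = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}

def reverse_translate(aa_seq: str) -> str:
    """Map each amino acid to its single fixed codon."""
    return "".join(CODON_FOR_AA[aa] for aa in aa_seq.upper())

def translate(na_seq: str) -> str:
    """Translate back using the inverse of the table above."""
    aa_for_codon = {codon: aa for aa, codon in CODON_FOR_AA.items()}
    codons = (na_seq[i:i + 3] for i in range(0, len(na_seq), 3))
    return "".join(aa_for_codon[c] for c in codons)

aa = "AAGRWYDY"  # invented peptide, for illustration only
na = reverse_translate(aa)

print(na)                   # GCGGCGGGCCGTTGGTATGATTAT
print(translate(na) == aa)  # True
```

Because several codons encode most amino acids, a practical tool has many valid nucleic acid sequences to choose from; this sketch simply picks one.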

View a rich report of the resynthesis algorithm. Click on the box for a given CDR3 to display its report:

In [None]:
ex.cart.report_all()

View the generated nucleic acid sequences in a table:

In [None]:
ex.cart.show_rVHHs(highlight_na=['NA'])