# `panning-minimal` demonstration

This notebook is a small example of using the `nbseq` library to interactively explore data processed by the Snakemake workflows in the [`phageseq-paper` repository](https://github.com/caseygrun/phage-seq). It contains some simple plots for exploring the processed `panning-minimal` dataset, which is a small subset of the `panning-extended` dataset.

Before exploring this notebook, make sure you have [run the `panning-minimal` Snakemake workflow as documented in the README](../../../README.md). In particular, you should have two directories `../../results` and `../../intermediate`, populated with the results of this workflow.

Note that conclusions may differ from those in the main manuscript as only a small subset of the samples/reads are included.

In [None]:
import nbseq
import os

# change working directory to `./panning-minimal` for simplicity of access to feature tables, etc
# make sure we don't do this twice, or we'll end up in the wrong place and be very confused
if 'dir_changed' not in globals():
    os.chdir('../../')
    dir_changed = True

## Load data into `nbseq.Experiment`

Load sample metadata, feature tables, and feature sequences. Load only the CDR3 feature table to save time.

Important: if you receive `FileNotFoundError`s in this cell, you need to stop and ensure you have finished [running the `panning-minimal` workflow as documented in the README](../../../README.md). You may see `- Warning: sqlite database 'path_to_sqlite_db' does not exist`; this is fine, the demonstration here does not rely on the SQLite database.

In [None]:
ex = nbseq.Experiment.from_files(
    # skip loading the amino acid feature table
    ft_aa=None,
    metadata='config/metadata_full.csv'
) 

In [None]:
ex

In [None]:
import nbseq.viz.utils
for space in ex.fts:
    # add less verbose descriptions to a sample metadata column called 'desc_short'
    nbseq.viz.utils.shorten_descriptions(ex.fts[space].obs)

# force rebuild the selection metadata
# del ex._selection_metadata

ex.obs.loc[~ex.obs['desc_short'].duplicated(), ['expt','desc_short']].reset_index(drop=True)

This small dataset contains three biological replicates each of two selections: PAK _∆flhA_ vs. PAK _∆fleN_ (flagellar hook-basal body, e.g. FlgEHKL) and PA0397 ∆efflux vs. PA0397 _+mexAB+oprM_ (e.g. the MexAB/OprM multidrug efflux system). Additionally, we sequenced the un-panned input library three times (sub-experiment `027j.lib`):

In [None]:
ex.summarize_expts()

In [None]:
ex.summarize_selections()

Plot an overview of the antigen matrix:

In [None]:
ex.viz.plot_selection_ag_matrix(description_col='desc_short', figsize=(4,4))

## View barplots of un-panned library

We can get an idea of the diversity expected due to technical replicates by looking at repeated sequencing of the un-panned input library.

`ex.viz` is the experiment visualizer and provides access to several methods for visualizing the results of the experiment. `top_feature_barplot` shows a barplot of the `n` most abundant features in some feature space (`cdr3`, by default). The `query` argument uses `nbseq.ft.query` to choose a subset of samples, in this case, those from the un-panned library.

Note the plot is interactive; click on a bar or a feature label from the legend to focus on that feature; double-click to clear the selection.

In [None]:
chart = ex.viz.top_feature_barplot("expt == '027j.lib'", x='name_full:N', select_from_round=None, n=200)
chart
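As a rough illustration of what such a query string selects, the sketch below applies the same expression to a toy sample-metadata table using plain `pandas.DataFrame.query`. It assumes `nbseq.ft.query` evaluates the expression against sample metadata in a similar way, which this notebook does not confirm; the sample names are hypothetical.

```python
import pandas as pd

# Toy sample metadata; the sample names are hypothetical,
# loosely modeled on the naming used in this notebook
obs = pd.DataFrame({
    "name_full": ["027j.lib.1", "027j.lib.2", "027j.1.A2.1.R8i"],
    "expt": ["027j.lib", "027j.lib", "027j"],
})

# The same expression that is passed to top_feature_barplot
library_samples = obs.query("expt == '027j.lib'")
print(library_samples["name_full"].tolist())
```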

## Examine feature abundance within selections

Examine the abundance of various features within the two sets of selections:

In [None]:
def plot_selections(condition):
    return ex.viz.top_feature_barplot(f"expt == '027j' & desc_short == '{condition}'", select_from_round=None, n=100).facet(column='selection').properties(title=condition)

### OprM

In [None]:
plot_selections('PA0397 ∆efflux / PA0397 +mexAB+oprM')

It looks like round 8 of selections 1.A2 and 1.C2 was taken over by CDR3s `2c7c51` and `ad6f8f`. Let's look at a trace plot omitting those samples.

By default, the traceplot shows the `n` CDR3s with the highest geometric mean enrichment across the chosen selections.
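To make the ranking criterion concrete, here is a minimal sketch of ranking features by geometric mean enrichment. The enrichment values are invented, and treating enrichment as a simple per-selection fold change is an assumption; `nbseq` may compute it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical fold enrichment of three CDR3s in three selections
enrichment = pd.DataFrame(
    {
        "sel_1": [8.0, 27.0, 1.0],
        "sel_2": [2.0, 1.0, 1.0],
        "sel_3": [4.0, 1.0, 1.0],
    },
    index=["cdr3_a", "cdr3_b", "cdr3_c"],
)

# Geometric mean across selections: exp(mean(log(x)))
geo_mean = np.exp(np.log(enrichment).mean(axis=1))

# Keep the n = 2 features with the highest geometric mean
top = geo_mean.sort_values(ascending=False).head(2)
print(top)
```

Here `cdr3_b` has the larger arithmetic mean but ranks below `cdr3_a`, because the geometric mean favors features that are enriched consistently across selections over those enriched strongly in only one.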

In [None]:
condition = 'PA0397 ∆efflux / PA0397 +mexAB+oprM'
bad_samples = ['027j.1.A2.1.R8i', '027j.1.C2.1.R8i']
ex.viz.top_feature_traceplot(query = f"expt == '027j' & desc_short == '{condition}' & ~(name_full in {bad_samples})")
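The query above interpolates a Python list directly into the expression string. The sketch below shows what the f-string expands to and how plain `pandas.DataFrame.query` evaluates it on a toy metadata table; that `nbseq` hands the string to a pandas-style query is an assumption, and the sample `027j.1.A2.1.R7i` is hypothetical.

```python
import pandas as pd

bad_samples = ['027j.1.A2.1.R8i', '027j.1.C2.1.R8i']

# The list is rendered with its repr, giving a valid list literal inside the query
query = f"expt == '027j' & ~(name_full in {bad_samples})"
print(query)

# Toy sample metadata; the third sample name is hypothetical
obs = pd.DataFrame({
    "name_full": ["027j.1.A2.1.R8i", "027j.1.C2.1.R8i", "027j.1.A2.1.R7i"],
    "expt": ["027j", "027j", "027j"],
})

# Only samples not listed in bad_samples remain
kept = obs.query(query)
print(kept["name_full"].tolist())
```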

We can interactively explore various features and their degree of enrichment in these samples compared to the other three, using the `nbseq.viz.dash` package. First, import and set up the [Panel](https://panel.holoviz.org) package.

In [None]:
import nbseq.viz.dash
import panel as pn
pn.extension('tabulator','vega')

# by default, the dashboard shows a number of warnings; hide these for simplicity
import warnings
from anndata import ImplicitModificationWarning
warnings.filterwarnings('ignore', category=RuntimeWarning)
warnings.filterwarnings('ignore', category=ImplicitModificationWarning)

In [None]:
nbseq.viz.dash.selection_group_dashboard(
    ex, 
    # global_query allows us to subset the data explored; we will again discard the last-round samples from selections 1.A2 and 1.C2;
    # we will also limit our consideration to samples in sub-experiment '027j'; the un-panned input library samples were used to
    # calculate the enrichment probabilities, but we don't want to specifically observe their abundances here.
    global_query=f"expt == '027j' & ~(name_full in {bad_samples})", 
    neg_query="~({phenotype} == 1)", starting_phenotype='OprM')

Let's focus on what happens with `d3c7bb99b10ff17f8d01a7cda90da94d` in all samples:

In [None]:
ex.viz.plot_selections_for_feature('d3c7bb99b10ff17f8d01a7cda90da94d', phenotype='OprM', global_query=f"~(name_full in {bad_samples})")

In [None]:
import nbseq.viz.utils
nbseq.viz.utils.setup_accordion()

We can see that this CDR3 is much more abundant in rounds 7 and 8 of these OprM+ samples:

In [None]:
ex.viz.summarize_top_samples(['d3c7bb99b10ff17f8d01a7cda90da94d'])

### Flagellar hook-basal body ('FlgEHKL')

We can perform a similar analysis for the hook-basal body selections:

In [None]:
plot_selections('PAK ∆flhA / PAK ∆fleN')

In [None]:
condition = 'PAK ∆flhA / PAK ∆fleN'
bad_samples = []
ex.viz.top_feature_traceplot(query = f"expt == '027j' & desc_short == '{condition}' & ~(name_full in {bad_samples})")

Let's investigate how the CDR3 `97861c` behaves across the entire dataset:

In [None]:
ex.viz.plot_selections_for_feature(ex.find_cdr3('97861c')[0], phenotype='FlgEHKL')
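The short identifier `97861c` is resolved to a full feature identifier before plotting. A minimal sketch of prefix-based lookup follows, assuming `find_cdr3` matches identifiers by their leading characters (its actual behavior is not documented in this notebook). The helper `find_by_prefix` and every identifier other than `d3c7bb99b10ff17f8d01a7cda90da94d` are hypothetical.

```python
def find_by_prefix(prefix, identifiers):
    """Return all identifiers that start with `prefix` (hypothetical helper)."""
    return [ident for ident in identifiers if ident.startswith(prefix)]

feature_ids = [
    "d3c7bb99b10ff17f8d01a7cda90da94d",  # full identifier used earlier in this notebook
    "d3c70000000000000000000000000000",  # hypothetical
    "97861c00000000000000000000000000",  # hypothetical
]

# A sufficiently long prefix resolves to a single feature
matches = find_by_prefix("97861c", feature_ids)

# A short prefix can match more than one feature
ambiguous = find_by_prefix("d3c7", feature_ids)
print(matches, len(ambiguous))
```

Indexing with `[0]` takes the first match, so a longer prefix reduces the chance of picking an unintended feature.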