# Tutorial 1 - Analyzing scRNA-seq data at the Protein Activity Level

<code style="background:lightgreen;color:black">Add description of the goal of this tutorial. Make sure that ways to load/install the package are updated. Also correct the path to the data (or include a dataset on GitHub, Zenodo or others). Also, fix the number of the sections.</code>

### Set up path and import modules
Set up the path variables as follows:

In [16]:
import sys

pyther_path = "/Users/lucazanella7/Desktop/ColumbiaProjects/" # directory that contains the pyther package on this machine
sys.path.insert(1, pyther_path) # add that directory to sys.path so that pyther can be imported

import pyther
import scanpy as sc
import anndata 
import pandas as pd

### Step 1. Load a gene expression "signature" at the single-cell level 
Load the gene expression signature that `pyther` will use as input for protein activity inference. We store the signature in an [AnnData](https://anndata.readthedocs.io/en/latest/) object to enable interoperability with [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/#). The input file has genes on rows and cells on columns, so we transpose it to obtain cells on rows and features (genes) on columns, and then display the matrix dimensions. <code style="background:lightgreen;color:black">Gene expression signatures can be generated using [...] describe approaches</code>.
The gene expression signature used in this tutorial was generated from a population of <span style="color:red">malignant ductal cells</span> using publicly available data from [Peng et al., 2019](https://www.nature.com/articles/s41422-019-0195-y).

In [17]:
gene_expr_path = "/Users/lucazanella7/Desktop/ColumbiaProjects/pyther_test_data/subset_Peng/Ductal_2.csv"
gene_expr_signature = anndata.read_csv(gene_expr_path).T

gene_expr_signature

AnnData object with n_obs × n_vars = 11315 × 5000
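
The signature itself is computed upstream of this tutorial. As a minimal sketch of one common approach (an assumption, not necessarily how `Ductal_2.csv` was produced), the snippet below builds a toy genes × cells count table, transposes it to cells × genes, log-normalizes each cell, and z-scores each gene across cells. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy raw counts with genes on rows and cells on columns (illustrative values only)
counts = pd.DataFrame(
    [[10, 0, 5, 3],
     [2, 8, 1, 9],
     [0, 4, 7, 7]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["cell_1", "cell_2", "cell_3", "cell_4"],
)

# Transpose so that cells are on rows and genes on columns
cells_by_genes = counts.T

# Library-size normalization to 10,000 counts per cell, followed by log1p
per_cell_total = cells_by_genes.sum(axis=1)
log_norm = np.log1p(cells_by_genes.div(per_cell_total, axis=0) * 1e4)

# Z-score each gene across cells to obtain a toy signature
toy_signature = (log_norm - log_norm.mean(axis=0)) / log_norm.std(axis=0, ddof=0)

print(toy_signature.shape)  # rows are cells, columns are genes
```

A table shaped like `toy_signature` (cells on rows, genes on columns) has the same orientation as the transposed matrix loaded above.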

### Step 2. Load a gene regulatory network inferred with ARACNe
Load and inspect the lineage-specific gene regulatory network generated with ARACNe. This network was inferred from <span style="color:red">malignant ductal cells</span> using ARACNe3. See the [ARACNe3](https://www.mdpi.com/1099-4300/25/3/542) and [ARACNe-AP](https://pubmed.ncbi.nlm.nih.gov/27153652/) manuscripts for additional information on current ARACNe implementations.

In [18]:
# path to the tsv-formatted ARACNe network
network_path = "/Users/lucazanella7/Desktop/ColumbiaProjects/pyther_test_data/subset_Peng/pruned_Ductal_2.tsv"

network = pd.read_csv(network_path, delimiter="\t")

Display the first 5 interactions in the regulatory network. `mor` and `likelihood` represent the mode of regulation of the given regulator-target pair and the likelihood of the interaction, respectively.

In [19]:
network.head()

| | regulator | target | mor | likelihood |
|---|---|---|---|---|
| 0 | AATF | CDC42SE2 | 0.005831 | 0.999934 |
| 1 | AATF | EIF4EBP2 | 0.0191 | 0.999868 |
| 2 | AATF | DDB1 | 0.019182 | 0.999802 |
| 3 | AATF | COPB2 | 0.064825 | 0.999736 |
| 4 | AATF | MDM4 | 0.023039 | 0.99967 |


Other algorithms can be used to generate gene regulatory networks, but we recommend post-processing their output into a DataFrame with the columns shown above (`regulator`, `target`, `mor`, `likelihood`).
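
As a rough illustration of such post-processing, the sketch below reshapes a hypothetical edge list into the four columns shown above. The input column names (`TF`, `gene`, `weight`, `corr`) and the choice of using the signed correlation as `mor` and the rescaled weight as `likelihood` are assumptions made for this example, not a prescription from `pyther`.

```python
import pandas as pd

# Hypothetical output of another network-inference tool (column names are assumptions)
edges = pd.DataFrame({
    "TF": ["TF1", "TF1", "TF2"],
    "gene": ["G1", "G2", "G1"],
    "weight": [4.0, 2.0, 1.0],   # interaction strength on an arbitrary scale
    "corr": [0.8, -0.5, 0.3],    # signed regulator-target correlation
})

# Rename to the expected column names and derive the two numeric columns
network_like = pd.DataFrame({
    "regulator": edges["TF"],
    "target": edges["gene"],
    "mor": edges["corr"],                                   # sign gives the direction of regulation
    "likelihood": edges["weight"] / edges["weight"].max(),  # rescaled so the maximum is 1
})

print(network_like)
```

How `mor` and `likelihood` should be derived depends on what the upstream algorithm reports, so the mapping above should be adapted case by case.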

Convert the network DataFrame into an object of class `Interactome` to make it easier to manipulate. Run `help(pyther.Interactome)` to see the available methods.

In [20]:
network_interactome = pyther.Interactome('malignant_ductal_interactome', network)

As an example, display the number of targets of a couple of selected regulators, MYC and SERPINA12.

In [21]:
n_MYC = len(network_interactome.get_reg('MYC')) # number of MYC targets in the network
n_SERPINA12 = len(network_interactome.get_reg('SERPINA12')) # number of SERPINA12 targets in the network

print("The number of targets of MYC and SERPINA12 is " + str(n_MYC) + " and " + str(n_SERPINA12) + ", respectively.")

The number of targets of MYC and SERPINA12 is 517 and 589, respectively.
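
Regulon sizes can also be summarized directly from a DataFrame with the same columns as the network table. The sketch below uses a small toy table (regulator and target names are placeholders) and plain pandas, independently of the `Interactome` class.

```python
import pandas as pd

# Toy network with the same columns as the ARACNe table (placeholder names and values)
toy_network = pd.DataFrame({
    "regulator": ["R1", "R1", "R1", "R2", "R2"],
    "target": ["T1", "T2", "T3", "T1", "T4"],
    "mor": [0.9, 0.7, -0.8, 0.02, 0.02],
    "likelihood": [0.99, 0.95, 0.90, 0.99, 0.98],
})

# Number of distinct targets per regulator, largest regulons first
targets_per_regulator = (
    toy_network.groupby("regulator")["target"].nunique().sort_values(ascending=False)
)

print(targets_per_regulator)
```

Applied to the full `network` DataFrame, the same `groupby` call would give the size of every regulon at once.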


### Step 3. Convert the gene expression signature into a protein activity matrix using pyther
`pyther` transforms a gene expression signature into a protein activity matrix through regulon enrichment analysis. Two enrichment methods are available: [aREA](https://www.nature.com/articles/ng.3593) (the default) and [NaRnEA](https://www.mdpi.com/1099-4300/25/3/542) (selected by setting `enrichment="narnea"`). The following sections show how to use each of them.

#### Method 1 - protein activity inference using aREA
As shown in the previous section, regulators can have different numbers of targets. We therefore prune each regulon to the same number of targets (50 in this case). This step is advisable to prevent regulators with an exceedingly large number of targets from dominating those with fewer.

In [22]:
network_pruned = network_interactome.copy() # copy the unpruned network before pruning it
network_pruned.prune(cutoff=50, eliminate=True) # prune each regulon to exactly 50 targets

Now all the regulators in the network have exactly 50 transcriptional targets.

In [23]:
n_MYC = len(network_pruned.get_reg('MYC')) # number of MYC targets in the network
n_SERPINA12 = len(network_pruned.get_reg('SERPINA12')) # number of SERPINA12 targets in the network

print("Number of MYC targets: " + str(n_MYC) + "\nNumber of SERPINA12 targets: " + str(n_SERPINA12))

Number of MYC targets: 50
Number of SERPINA12 targets: 50
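
To make the pruning step more concrete, here is a pandas sketch of what such an operation could look like. It assumes that targets are ranked by `likelihood` and that regulators with fewer than `cutoff` targets are dropped when `eliminate=True`. The helper `prune_network` is hypothetical and written for illustration only; it is not the implementation of `Interactome.prune`.

```python
import pandas as pd

def prune_network(network_df, cutoff, eliminate=True):
    """Keep the `cutoff` highest-likelihood targets of each regulator.

    Hypothetical helper for illustration. If `eliminate` is True, regulators
    left with fewer than `cutoff` targets are removed entirely.
    """
    ranked = network_df.sort_values("likelihood", ascending=False)
    pruned = ranked.groupby("regulator", sort=False).head(cutoff)
    if eliminate:
        sizes = pruned.groupby("regulator")["target"].transform("count")
        pruned = pruned[sizes >= cutoff]
    return pruned.sort_values(["regulator", "likelihood"], ascending=[True, False])

# Toy network: R1 has 3 targets, R2 has 2, R3 has 1 (placeholder names and values)
toy_network = pd.DataFrame({
    "regulator": ["R1", "R1", "R1", "R2", "R2", "R3"],
    "target": ["T1", "T2", "T3", "T1", "T4", "T5"],
    "mor": [0.5, -0.2, 0.1, 0.3, 0.4, -0.6],
    "likelihood": [0.9, 0.8, 0.7, 0.95, 0.6, 0.5],
})

toy_pruned = prune_network(toy_network, cutoff=2)
print(toy_pruned.groupby("regulator").size())
```

With `cutoff=2`, the lowest-likelihood target of `R1` is discarded and `R3`, which has a single target, is removed, so every remaining regulon has the same size.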


<div class="alert alert-block alert-success">
<b></b> Run `pyther` to compute the protein activity matrix (aREA method).
</div>
 
The mandatory inputs to `pyther` are a gene expression signature and a gene regulatory network. Here we set the output to be an `ndarray` (the default output is an `AnnData` object).

In [43]:
ProtAct_aREA = pyther.pyther(gex_data=gene_expr_signature, # gene expression signature
                             interactome=network_pruned, # pruned gene regulatory network
                             enrichment="area", # enrichment method: aREA
                             output_type="ndarray",
                             njobs=1,
                             verbose=True)

Preparing the association scores
Computing regulons enrichment with aREA
Rank transforming the data
Computing the likelihood matrix
Computing the modes matrix
Preparing the 1-tailed / 2-tailed matrices
Computing enrichment
Integrating enrichment


`ProtAct_aREA` contains the activity of each regulatory protein in each single cell, computed as a Normalized Enrichment Score (NES).

In [25]:
ProtAct_aREA # display the protein activity matrix 

regulator,AAMP,AATF,ABCA1,ABCA12,ABCA3,ABCA7,ABCA8,ABCB1,ABCB11,ABCB4,...,ZSWIM5,ZSWIM6,ZSWIM7,ZWINT,ZXDA,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
T1_AACACGTCAATGACCT,0.280517,-0.504880,-0.028528,0.509438,-0.009629,1.353753,-0.003198,-0.007347,0.043244,0.006640,...,0.015680,-0.186038,-0.523560,-0.604258,-0.004439,-0.016693,-0.568268,-0.660309,-0.245437,-0.241670
T1_AACCATGCACAACTGT,-0.189905,-0.358789,0.291368,-0.037682,-0.008374,-1.065711,0.027710,0.000090,-0.010577,-0.016668,...,0.167288,-0.094879,-0.413764,-0.697323,-0.009053,-0.096277,-0.469211,-0.553629,0.185824,-0.007027
T1_AACCATGTCTGATTCT,-0.220470,0.461793,-0.586378,-0.389297,-0.010845,1.113217,-0.057777,-0.006155,0.018059,-0.019745,...,0.009457,-0.279092,-0.858434,-1.540381,-0.008052,0.013261,0.171754,-0.478961,-0.350549,-0.272049
T1_AACCATGTCTGTCCGT,-0.369943,-0.378325,-0.828241,0.207703,-0.009249,-1.152477,0.008250,0.014786,0.003495,-0.001988,...,0.224178,-0.110395,-0.430240,-0.462088,-0.002266,0.197551,-0.432281,0.785490,-0.043907,-0.007274
T1_AACCGCGGTACTCGCG,-0.265803,-0.509552,-0.175736,-0.019153,0.003066,1.198590,0.001903,-0.004296,-0.038345,0.008701,...,0.532062,-0.200094,-0.703519,-0.732466,-0.006939,0.003096,0.218358,-0.317084,-0.254043,-0.058576
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
T24_TTTCCTCCAGACAAGC,-0.078426,0.118488,-0.459558,0.741950,-0.024126,-0.827274,0.075903,-0.008518,0.009360,-0.658218,...,0.007967,-0.285443,-0.715757,5.680507,-0.011181,-0.007448,-0.282094,0.235866,-0.493437,-0.733182
T24_TTTGCGCCAATCGAAA,0.159533,0.522409,0.492222,0.518799,-0.000549,1.195306,0.027098,-0.000416,-0.015244,-0.015931,...,0.235037,0.357366,-0.708375,-0.696420,0.004600,-0.017936,-0.282999,-0.579619,-0.379366,-0.096958
T24_TTTGCGCTCTTATCTG,0.331929,-0.501772,-0.105514,0.764497,0.036819,-1.158549,0.021390,0.009482,-0.012954,-0.001166,...,0.004401,0.685029,-0.607375,-0.567571,-0.009340,0.004793,0.156958,-0.625656,-0.876895,-0.039213
T24_TTTGGTTAGACACTAA,0.344605,-0.435929,0.233628,0.020642,0.005667,-1.016624,0.018453,0.003626,-0.011440,0.000736,...,0.164652,-0.144340,-0.465114,-0.452943,-0.011583,0.012083,-0.310425,0.527961,-0.021574,0.486693
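
A cells × regulators table like the one above can be explored with ordinary pandas operations. The sketch below uses a toy NES table (placeholder names and values, not taken from the output above) to find the most activated regulator in each cell and to rank regulators by their mean activity.

```python
import pandas as pd

# Toy NES matrix with cells on rows and regulators on columns (placeholder values)
toy_activity = pd.DataFrame(
    [[1.2, -0.5, 3.1],
     [-2.0, 1.4, 0.9]],
    index=["cell_1", "cell_2"],
    columns=["REG_A", "REG_B", "REG_C"],
)

# Most activated regulator (highest NES) in each cell
top_regulator = toy_activity.idxmax(axis=1)

# Regulators ranked by mean activity across all cells
mean_activity = toy_activity.mean(axis=0).sort_values(ascending=False)

print(top_regulator)
print(mean_activity)
```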


#### Method 2 - protein activity inference using (matrix)-NaRnEA

<div class="alert alert-block alert-success">
<b></b> Run `pyther` to compute the protein activity matrix (matrix-NaRnEA method).
</div>
 

Compute the protein activity using `pyther` with NaRnEA as the enrichment method. Unlike aREA, NaRnEA is designed to run with regulons of different sizes, i.e. different numbers of targets per regulator, so we pass the unpruned `network_interactome`. <code style="background:lightgreen;color:black">Comment on modes to compute the qnorm with C++ or approx.</code>

In this example, we will set the output to be an `AnnData` object (default).

In [49]:
ProtAct_NaRnEA = pyther.pyther(gex_data=gene_expr_signature, # gene expression signature
                               interactome=network_interactome, # unpruned gene regulatory network
                               enrichment="narnea", # enrichment method: NaRnEA
                               njobs=1,
                               output_type="anndata",
                               verbose=True)

Preparing the association scores
Computing regulons enrichment with NaRnEa
reordering genes
Calculating DES...
Calculating UES...
Calculating NES...
Calculating PES...


Display the resulting `AnnData` object and its dimensions.

In [47]:
ProtAct_NaRnEA # display the protein activity matrix as AnnData object 

AnnData object with n_obs × n_vars = 11315 × 170
    layers: 'pes'

When using NaRnEA, `pyther` returns an additional layer, `pes`, that stores the Proportional Enrichment Score (PES) of each regulator in each cell. The PES is bounded in the interval $-1 \le PES \le 1$ and can be used as a measure of effect size. For further details, see [Griffin et al., 2022](https://pubmed.ncbi.nlm.nih.gov/36981431/).

Display the NES and PES protein activity matrices.


In [33]:
ProtAct_NaRnEA.to_df() # NES matrix

Unnamed: 0,AAMP,AATF,ABCA1,ABCA12,ABCA3,ABCA7,ABCA8,ABCB1,ABCB11,ABCB4,...,ZSWIM5,ZSWIM6,ZSWIM7,ZWINT,ZXDA,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
T1_AACACGTCAATGACCT,5.977011,-6.193017,-2.551499,-2.294304,-2.546585,7.395359,-2.176996,-1.914222,1.670248,-2.411389,...,1.217220,5.003190,-5.881310,-6.826503,1.770090,-3.676204,-3.395995,8.004831,3.486996,-5.331775
T1_AACCATGCACAACTGT,-7.915289,-7.676798,2.160289,-1.217329,-2.520414,-7.484751,-2.482566,1.310327,-1.106203,-2.133264,...,1.295151,-5.559730,-6.563847,-6.065267,-1.482044,-3.757934,2.575181,-6.759091,-3.831109,-6.660645
T1_AACCATGTCTGATTCT,-4.485068,-5.231963,-3.227796,-1.305770,1.828262,-6.973998,-2.746649,-1.238382,1.101084,2.531254,...,2.117518,6.266718,5.644038,-6.982967,2.391163,-3.767748,2.024379,6.781371,4.512636,-4.901477
T1_AACCATGTCTGTCCGT,6.235696,-6.986539,-2.625829,-1.784883,-2.213311,-9.080558,2.131461,1.080856,1.288077,2.197156,...,1.293085,5.146029,-6.457945,-7.257192,-0.872180,5.676072,-2.429734,7.786766,-3.079883,-5.405772
T1_AACCGCGGTACTCGCG,5.922857,-6.787462,-3.742824,1.850901,2.704998,-6.910479,-1.060913,1.485369,-1.402303,2.394306,...,1.287161,6.238270,7.092947,-6.509938,1.357276,3.338417,1.980990,6.328692,-4.009763,5.215246
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
T24_TTTCCTCCAGACAAGC,-3.197557,3.886974,-3.457554,1.798762,-1.660185,5.347885,2.177825,-0.895205,-2.571363,2.647628,...,-2.172940,-4.960516,-3.950721,11.954428,-1.051903,2.975031,-2.414318,2.905377,4.176318,3.634391
T24_TTTGCGCCAATCGAAA,4.166178,6.325027,-2.637480,-2.283097,1.306971,6.914522,-1.769684,1.505252,-1.247070,-2.160760,...,-3.037016,6.293897,-5.773691,-5.397300,-1.625462,-4.384336,-3.623218,-5.147055,4.686790,-4.802979
T24_TTTGCGCTCTTATCTG,7.861903,-7.117991,-2.524008,2.257352,-1.337185,-7.045333,-2.329967,-1.407283,1.240751,2.831887,...,-2.020369,-6.307931,5.493429,-6.123378,-1.846785,3.562004,-1.815231,6.013303,-5.612674,-5.555880
T24_TTTGGTTAGACACTAA,6.261608,6.415593,-2.708693,1.323513,1.902153,-8.140754,2.677264,2.508443,-1.221474,-2.143654,...,1.990721,6.964138,-6.376920,-5.530391,2.659144,4.925957,-2.646268,6.018507,-4.041078,5.563036


In [34]:
ProtAct_NaRnEA.to_df(layer="pes") # PES matrix

Unnamed: 0,AAMP,AATF,ABCA1,ABCA12,ABCA3,ABCA7,ABCA8,ABCB1,ABCB11,ABCB4,...,ZSWIM5,ZSWIM6,ZSWIM7,ZWINT,ZXDA,ZXDB,ZXDC,ZYX,ZZEF1,ZZZ3
T1_AACACGTCAATGACCT,0.307219,-0.409372,-0.112702,-0.099812,-0.099327,0.266215,-0.086555,-0.075331,0.061075,-0.090746,...,0.045268,0.199119,-0.377190,-0.244339,0.067423,-0.159357,-0.148746,0.351046,0.120815,-0.340070
T1_AACCATGCACAACTGT,-0.584061,-0.526358,0.076188,-0.051764,-0.096060,-0.390655,-0.096316,0.047635,-0.039642,-0.077086,...,0.046322,-0.313343,-0.437892,-0.225230,-0.055698,-0.164223,0.095125,-0.456784,-0.166673,-0.441515
T1_AACCATGTCTGATTCT,-0.305123,-0.329568,-0.142227,-0.057807,0.069921,-0.334363,-0.111310,-0.050020,0.041795,0.099517,...,0.081548,0.277193,0.286260,-0.237979,0.094680,-0.160632,0.084330,0.335918,0.170058,-0.297216
T1_AACCATGTCTGTCCGT,0.299575,-0.472178,-0.115596,-0.076534,-0.085071,-0.467187,0.076791,0.039975,0.045899,0.081970,...,0.046932,0.191602,-0.424193,-0.265616,-0.033195,0.189147,-0.105464,0.316212,-0.133430,-0.352945
T1_AACCGCGGTACTCGCG,0.309012,-0.446778,-0.165573,0.075742,0.099172,-0.346352,-0.042356,0.056899,-0.052328,0.091805,...,0.048158,0.251960,0.326407,-0.232023,0.052038,0.118633,0.078535,0.282227,-0.172694,0.239822
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
T24_TTTCCTCCAGACAAGC,-0.213187,0.215037,-0.152373,0.078573,-0.066520,0.225536,0.087266,-0.036628,-0.099849,0.105653,...,-0.085384,-0.254396,-0.235126,0.355688,-0.042718,0.117771,-0.107094,0.151853,0.163506,0.192828
T24_TTTGCGCCAATCGAAA,0.240663,0.336812,-0.115960,-0.101075,0.050161,0.280467,-0.071728,0.059852,-0.047777,-0.084511,...,-0.117730,0.281027,-0.348788,-0.182597,-0.065153,-0.186216,-0.159793,-0.308985,0.177885,-0.289086
T24_TTTGCGCTCTTATCTG,0.411274,-0.467461,-0.111541,0.092435,-0.052342,-0.352277,-0.092988,-0.055650,0.045670,0.108587,...,-0.076234,-0.342332,0.253530,-0.217733,-0.072182,0.126801,-0.079670,0.269028,-0.241366,-0.351944
T24_TTTGGTTAGACACTAA,0.314401,0.297050,-0.119657,0.053160,0.068439,-0.413558,0.098633,0.094575,-0.045027,-0.080031,...,0.073472,0.270844,-0.413110,-0.199825,0.100436,0.169918,-0.115674,0.256960,-0.174699,0.246002
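
NES and PES carry complementary information (statistical significance and effect size), so they are often inspected together. The sketch below, on toy tables with placeholder values, converts NES values to two-sided p-values under the assumption that the NES is approximately standard normal under the null hypothesis, and then flags regulator-cell pairs that pass both a significance and an effect-size threshold. Both thresholds are arbitrary choices made for this example.

```python
import math
import pandas as pd

# Toy NES and PES tables with cells on rows and regulators on columns (placeholder values)
toy_nes = pd.DataFrame({"REG_A": [5.9, -7.9], "REG_B": [-0.5, 1.3]},
                       index=["cell_1", "cell_2"])
toy_pes = pd.DataFrame({"REG_A": [0.31, -0.58], "REG_B": [-0.02, 0.05]},
                       index=["cell_1", "cell_2"])

# Two-sided p-value of a standard normal statistic: p = erfc(|z| / sqrt(2))
toy_pvals = toy_nes.apply(
    lambda col: col.map(lambda z: math.erfc(abs(z) / math.sqrt(2)))
)

# Flag pairs that are both significant and of sizeable effect
# (0.01 and 0.25 are arbitrary thresholds used only for illustration)
is_hit = (toy_pvals < 0.01) & (toy_pes.abs() > 0.25)

print(is_hit)
```

With real data, the p-values would typically also be corrected for multiple testing before thresholding.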
