# Demo: pciSeq diagnostics
This notebook aims to help you monitor on-the-fly how the fit progresses.
<div class="alert alert-block alert-warning">
<b>Warning:</b> Download and run the notebook locally. The diagnostics dashboard runs in the browser, and the call will fail if the notebook is cloud-based. I also could not get redis to run on Colab.
</div>

## Installation:

In [1]:
!pip install pciSeq==0.0.54



Other libraries needed for this notebook:

In [2]:
import os
import numpy as np
import pandas as pd
from scipy.sparse import load_npz, coo_matrix
import pciSeq

DIPlib -- a quantitative image analysis library
Version 3.3.0 (May 21 2022)
For more information see https://diplib.org


In [3]:
pciSeq.__version__

'0.0.54'

## 1. Diagnostics

Diagnostics is a dashboard that helps users understand, on the fly, whether the algorithm is heading in the wrong direction. If that happens, they can stop the run on the spot instead of waiting until the very end (which could take a while) to examine the results and realise that they do not make sense. They can then reconfigure the hyperparameters and start pciSeq again.

### 1.1. Prerequisites

You must install redis if you are on Linux, or Memurai if you are on Windows. 

For redis do:
* sudo apt-get install redis-server redis-tools 

and then start the service: 

* sudo service redis-server start

You can get memurai from [here](https://www.memurai.com/get-memurai). Once installed, the service should start automatically but you can manually start it by: 

* memurai.exe --service-start 

If for whatever reason you cannot install redis (or memurai), pass *'launch_diagnostics': False* in the options dict when you call *pciSeq.fit()*. Redis serves as a layer that passes data from the main python loop to the dashboard. Setting *'launch_diagnostics': False* should prevent this communication, so redis should not come into the picture at all.
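As a rough illustration of this fallback, the sketch below checks whether anything is listening on the local redis port and sets *'launch_diagnostics'* accordingly. The helper name `redis_is_reachable` is hypothetical and not part of pciSeq, and the host and port (localhost and 6379, the redis default) are assumptions about your setup. An open port does not prove that the listener is actually redis.

```python
import socket


def redis_is_reachable(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, not part of pciSeq. 6379 is the default redis port;
    adjust it if your redis (or Memurai) service listens elsewhere.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Only switch the dashboard on when something answers on the redis port
opts = {'launch_diagnostics': redis_is_reachable()}
print(opts)
```

The resulting `opts` dict can then be passed to *pciSeq.fit()* in the same way as in the cells further down.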

### 1.2. The dashboard

We skip the preliminaries, which have been covered in the other notebooks, and go straight to the *fit()* method, having first loaded some demo data.

In [4]:
coo_file = pciSeq.utils.load_from_url('https://github.com/acycliq/pciSeq/blob/dev/pciSeq/data/mouse/ca1/segmentation/label_image.coo.npz?raw=true')
coo = load_npz(coo_file)

In [5]:
spots_file = pciSeq.utils.load_from_url('https://github.com/acycliq/pciSeq/blob/dev/pciSeq/data/mouse/ca1/iss/spots.csv?raw=true')
iss_spots = pd.read_csv(spots_file)

In [6]:
sc_file = pciSeq.utils.load_from_url('https://github.com/acycliq/pciSeq/blob/dev/pciSeq/data/mouse/ca1/scRNA/scRNAseq.csv.gz?raw=true')
scRNAseq = pd.read_csv(sc_file, header=None, index_col=0, compression='gzip', dtype=object)
scRNAseq = scRNAseq.rename(columns=scRNAseq.iloc[0], copy=False).iloc[1:]
scRNAseq = scRNAseq.astype(float).astype(np.uint32)

In [7]:
pciSeq.attach_to_log()

In [8]:
# Switch on the diagnostics
opts = {'launch_diagnostics': True} 

**Do not miss the useful comments at the bottom of this notebook, in section 2.1. You can jump straight to them by clicking [here](#2.1-Reading-the-dashboard).**

In [9]:
cellData, geneData = pciSeq.fit(spots=iss_spots, coo=coo, scRNAseq=scRNAseq, opts=opts)

[2023-07-03 01:38:34] INFO    (app.py:226)  launch_diagnostics is set to True
[2023-07-03 01:38:34] INFO    (app.py:251) check_redis_server
[2023-07-03 01:38:36] INFO    (utils.py: 67)  enabling keyspace events... OK
[2023-07-03 01:38:36] INFO    (app.py:122) Launching the diagnostics dashboard
[2023-07-03 01:38:36] INFO    (app.py:126)  Preprocessing data
[2023-07-03 01:38:37] INFO    (spot_labels.py: 69)  Number of spots passed-in: 72336
[2023-07-03 01:38:37] INFO    (spot_labels.py: 70)  Number of segmented cells: 3481
[2023-07-03 01:38:37] INFO    (spot_labels.py: 71)  Segmentation array implies that image has width: 7602px and height: 5471px
[2023-07-03 01:38:51] INFO    (utils.py: 67)  enabling keyspace events... OK
[2023-07-03 01:38:51] INFO    (datatypes.py:460)  Single cell data passed-in have 27998 genes and 5712 cells
[2023-07-03 01:38:51] INFO    (datatypes.py:462)  Single cell data: Keeping counts for the gene panel of 92 only
[2023-07-03 01:38:51] INFO    (datatypes.py:47

#### 2.1 Reading the dashboard

The dashboard should now be running in another tab of your browser. It should show two plots similar to the ones below. They are refreshed on every loop as the algorithm tries to converge.

Gene efficiency can be thought of as the ratio between the observed and the theoretical gene counts. The default value, for example, is 0.2, which means that on average, for each gene, we expect to observe one fifth of the counts expressed in the single cell data. If I had included a plot of the gene efficiency before the start of the iterations, we would have had a bar chart where all bars have the same length, equal to 0.2. This value, however, changes as our prior views are combined with the evidence from the data, and it settles when the algorithm converges. This gradual shift from the data-agnostic state (where only the prior 0.2 is available) to the optimal one is shown in the left plot of the dashboard. It is easy to see, for example, that Neurod6 stands out with a gene efficiency close to 1.35. Technically, the plot shows the mean of the approximated (i.e. variational) posterior distribution.
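To make the arithmetic concrete, here is a minimal sketch of how an efficiency value scales the single cell counts. The gene names and mean counts are made-up numbers for illustration only, and the simple multiplication is the back-of-the-envelope reading given above, not the full pciSeq model.

```python
# Prior gene efficiency, before any data has been seen
prior_efficiency = 0.2

# Made-up mean single cell counts per gene (illustration only)
sc_mean_counts = {'GeneA': 10.0, 'GeneB': 2.5}

# Counts we would expect to observe in situ under the prior
expected_counts = {gene: prior_efficiency * count
                   for gene, count in sc_mean_counts.items()}
print(expected_counts)  # {'GeneA': 2.0, 'GeneB': 0.5}

# A gene whose efficiency has moved to 1.35 (as Neurod6 does in the plot)
# would be observed at 1.35 times its single cell level
shifted = round(1.35 * sc_mean_counts['GeneA'], 2)
print(shifted)  # 13.5
```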

In my view, the most useful plot for checking whether something has gone horribly wrong is the plot on the right. In each loop, it calculates how many cells have been assigned to each cell class and then plots those cell counts. The class that stands out is the Zero class: about 1030 of the 3481 cells (around 30%) have been classed as Zero class. This is not uncommon; quite often the Zero class is the most populated class (it could be more than 30% on some occasions). If something has gone wrong, however, the vast majority (around 95%) of the cells will be Zero class. If that happens, one reason could be that the gene efficiency needs to be adjusted. The user should reflect on the reference data (single cell data) and the experiment, and set the inefficiency value to something that reflects the case.
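The check described here can be sketched in a few lines. This is only an illustration: the function name `zero_class_fraction`, the label string `'Zero'`, and the placeholder class `'Other'` are hypothetical, and the 95% cut-off is the rule of thumb from the paragraph above rather than anything pciSeq enforces.

```python
from collections import Counter


def zero_class_fraction(class_labels, zero_label='Zero'):
    """Return the share of cells whose assigned class is the Zero class."""
    counts = Counter(class_labels)
    return counts[zero_label] / len(class_labels)


# Reproduce the numbers quoted above: 1030 Zero class cells out of 3481.
# 'Other' is a placeholder standing in for all the remaining classes.
labels = ['Zero'] * 1030 + ['Other'] * (3481 - 1030)
fraction = zero_class_fraction(labels)
print(f"{fraction:.1%}")  # 29.6%

# Rule of thumb from the text: around 95% Zero class means trouble
if fraction >= 0.95:
    print("Almost all cells are Zero class; revisit the Inefficiency setting")
```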

To familiarise yourself with the dashboard, you are encouraged to purposely kick off *pciSeq.fit()* with a mis-specified value for inefficiency; set, for example, *'Inefficiency': 100* or even *'Inefficiency': 1000*. You will notice that gene efficiency hardly moves as the algorithm loops, and all cells end up under the Zero class. With the demo data the fit finishes after a few loops; your data, however, may be more demanding. Hopefully these diagnostic plots will help you detect problems at the very early stages, so you will not have to wait until the end to realise that something has been mis-specified.

If you are confident that you have set a reasonable value for Inefficiency and you still get far too many Zero class cells, then a contributing factor could be the *MisreadDensity*. A higher value (for example, going from 1e-5 to 1e-4) will tilt the likelihood towards a view that considers the spots as misreads. That has a knock-on effect on the cell-to-cellClass assignments, which shows up as an abundance of Zero class cells in the results.

Most often, however, it is the *Inefficiency* that needs tweaking and not the *MisreadDensity*.
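For reference, the two settings discussed in this section could be laid out as follows. The key names come from the text; the values are illustrative only, and passing *MisreadDensity* as a plain float is an assumption about the pciSeq version used in this notebook.

```python
# Deliberately mis-specified run, to see what a failing fit looks like
bad_opts = {'launch_diagnostics': True, 'Inefficiency': 100}

# A more considered configuration; both values are illustrative
opts = {
    'launch_diagnostics': True,
    'Inefficiency': 0.2,      # the default quoted earlier
    'MisreadDensity': 1e-5,   # raising this to 1e-4 treats more spots as misreads
}

# With pciSeq installed and the demo data loaded as above, either dict
# can be passed to the fit:
# cellData, geneData = pciSeq.fit(spots=iss_spots, coo=coo, scRNAseq=scRNAseq, opts=opts)
```

Changing one setting at a time and watching the right-hand plot makes it easier to tell which of the two is responsible for an excess of Zero class cells.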

Notes: 
 - You can close the diagnostics tab while the algorithm converges. Nothing will happen to the main loop; the two are handled by different processes.
 - I am using the terms *Inefficiency* and *gene efficiency* interchangeably; they mean exactly the same thing (which, TBH, does not look like a very efficient way of writing... Strictly speaking they are complete opposites!)

<img src="diagnostics.gif" />