# pyUCell - some important parameters

This document describes some important parameters of the pyUCell algorithm and how they can be adapted to your dataset.

For a simple pyUCell tutorial refer to: [pyUCell basics](https://pyucell.readthedocs.io/en/latest/notebooks/basic.html#)

## Load example dataset

In [26]:
import scanpy as sc
import matplotlib.pyplot as plt
import pyucell as uc

In [32]:
adata = sc.datasets.pbmc3k()

**Note:** because UCell scores are based on relative gene ranks, pyUCell can be applied to either raw counts or normalized data. As long as the normalization preserves the relative ranks between genes within each cell, the results will be equivalent.
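To make the rank argument concrete, here is a minimal sketch in plain Python, using made-up counts for a single cell. The helper `ranks_desc` is hypothetical and breaks ties by position, which is a simplification and not necessarily how pyUCell handles ties. It shows that library-size normalization followed by `log1p` leaves the gene ranking unchanged.

```python
import math

def ranks_desc(values):
    """Return the rank of each value (1 = highest); ties broken by original order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, gene_index in enumerate(order, start=1):
        ranks[gene_index] = position
    return ranks

# Toy raw counts for six genes in one cell (illustrative numbers only).
raw_counts = [0, 3, 12, 1, 0, 7]

# Library-size normalization followed by log1p: both are monotonic within a cell.
total = sum(raw_counts)
normalized = [math.log1p(c / total * 1e4) for c in raw_counts]

print(ranks_desc(raw_counts))   # [5, 3, 1, 4, 6, 2]
print(ranks_desc(normalized))   # [5, 3, 1, 4, 6, 2]
```

A transformation that reorders genes within a cell, such as per-gene scaling across cells, would not have this property.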

## 1. Positive and negative gene sets in signatures

pyUCell supports **positive and negative gene sets** within a signature. Append a `+` or `-` sign to a gene name to include it in the positive or negative set, respectively. For example:

In [47]:
signatures = {
    "CD8T": ["CD8A+", "CD8B+","CD4-"],
    "CD4T": ["CD4+", "CD40LG+", "CD8A-","CD8B-"]
}

pyUCell evaluates the positive and negative gene sets separately, then subtracts the negative score from the positive score. The parameter `w_neg` controls the weight of the negative gene set relative to the positive set (`w_neg=1.0` means equal weight). Note that negative combined scores are clipped to zero, so that UCell scores stay in the [0, 1] range.
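The combination rule described above could be sketched roughly as follows. Both helpers, `split_signature` and `combine_scores`, are hypothetical names written for illustration, not part of the pyUCell API. The treatment of unsigned genes as positive is an assumption.

```python
def split_signature(genes):
    """Split a signed gene list into (positive, negative) gene names."""
    positive, negative = [], []
    for gene in genes:
        if gene.endswith("-"):
            negative.append(gene[:-1])
        elif gene.endswith("+"):
            positive.append(gene[:-1])
        else:
            # Assumption: a gene without a sign is treated as positive.
            positive.append(gene)
    return positive, negative

def combine_scores(pos_score, neg_score, w_neg=1.0):
    """Subtract the weighted negative score and clip the result at zero."""
    return max(0.0, pos_score - w_neg * neg_score)

pos, neg = split_signature(["CD8A+", "CD8B+", "CD4-"])
print(pos, neg)                              # ['CD8A', 'CD8B'] ['CD4']
print(combine_scores(0.2, 0.5))              # 0.0 (clipped)
print(combine_scores(0.2, 0.5, w_neg=0.0))   # 0.2 (negative set ignored)
```

With a smaller `w_neg`, cells that express some negative-set genes are penalized less.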

In [48]:
uc.compute_ucell_scores(adata, signatures=signatures, w_neg=1)

In [49]:
adata.obs

| index | CD8T_UCell | CD4T_UCell |
|---|---|---|
| AAACATACAACCAC-1 | 0.770771 | 0.000000 |
| AAACATTGAGCTAC-1 | 0.000000 | 0.000000 |
| AAACATTGATCAGC-1 | 0.000000 | 0.252586 |
| AAACCGTGCTTCCG-1 | 0.000000 | 0.000000 |
| AAACCGTGTATGCG-1 | 0.000000 | 0.000000 |
| ... | ... | ... |
| TTTCGAACTCTCAT-1 | 0.000000 | 0.000000 |
| TTTCTACTGAGGCA-1 | 0.000000 | 0.000000 |
| TTTCTACTTCCTCG-1 | 0.000000 | 0.000000 |
| TTTGCATGAGAGGC-1 | 0.000000 | 0.000000 |


## 2. The max_rank parameter

## 3. Handling missing genes

If a subset of the genes in your signature are absent from the count matrix, how should they be handled?

pyUCell offers two alternative ways of handling missing genes:
- `missing_genes="impute"` (default): assumes that a gene absent from the count matrix is not expressed, and imputes all of its values to zero.
- `missing_genes="skip"`: excludes missing genes from the signatures, so they do not contribute to the scores.
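To see why the two options can give different scores, the following sketch applies the rank-based formula from the UCell publication to toy ranks. The function `ucell_score` is a hypothetical helper and may differ from pyUCell's internals. The value `max_rank=1500` is assumed here (it is the default in the original R package). The sketch also assumes that an imputed gene, having zero expression, receives the capped rank `max_rank + 1`.

```python
def ucell_score(ranks, max_rank=1500):
    """Rank-based score in [0, 1] for one cell, given the ranks of the signature genes."""
    n = len(ranks)
    capped = [min(r, max_rank + 1) for r in ranks]   # ranks beyond max_rank are capped
    u = sum(capped) - n * (n + 1) / 2                # Mann-Whitney U statistic
    return 1 - u / (n * max_rank)

present_ranks = [10, 20]   # toy ranks of two signature genes found in the matrix

# "skip": the missing gene is dropped from the signature.
skip_score = ucell_score(present_ranks)

# "impute": the missing gene is kept with zero expression, i.e. the worst rank.
impute_score = ucell_score(present_ranks + [1500 + 1])

print(round(skip_score, 3))    # 0.991
print(round(impute_score, 3))  # 0.661
```

Under these assumptions, imputing a missing gene in the positive set lowers the score, while skipping it leaves the score determined by the remaining genes.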

In [36]:
signatures = {
    "CD8T": ["CD8A+", "CD8B+","CD4-","notagene"]
}
uc.compute_ucell_scores(adata, signatures=signatures, missing_genes="impute")
adata.obs

| index | CD8T_UCell | CD4T_UCell |
|---|---|---|
| AAACATACAACCAC-1 | 0.514019 | 0.000000 |
| AAACATTGAGCTAC-1 | 0.000000 | 0.000000 |
| AAACATTGATCAGC-1 | 0.000000 | 0.252586 |
| AAACCGTGCTTCCG-1 | 0.000000 | 0.000000 |
| AAACCGTGTATGCG-1 | 0.000000 | 0.000000 |
| ... | ... | ... |
| TTTCGAACTCTCAT-1 | 0.000000 | 0.000000 |
| TTTCTACTGAGGCA-1 | 0.000000 | 0.000000 |
| TTTCTACTTCCTCG-1 | 0.000000 | 0.000000 |
| TTTGCATGAGAGGC-1 | 0.000000 | 0.000000 |


In [38]:
uc.compute_ucell_scores(adata, signatures=signatures, missing_genes="skip")
adata.obs

| index | CD8T_UCell | CD4T_UCell |
|---|---|---|
| AAACATACAACCAC-1 | 0.770771 | 0.000000 |
| AAACATTGAGCTAC-1 | 0.000000 | 0.000000 |
| AAACATTGATCAGC-1 | 0.000000 | 0.252586 |
| AAACCGTGCTTCCG-1 | 0.000000 | 0.000000 |
| AAACCGTGTATGCG-1 | 0.000000 | 0.000000 |
| ... | ... | ... |
| TTTCGAACTCTCAT-1 | 0.000000 | 0.000000 |
| TTTCTACTGAGGCA-1 | 0.000000 | 0.000000 |
| TTTCTACTTCCTCG-1 | 0.000000 | 0.000000 |
| TTTGCATGAGAGGC-1 | 0.000000 | 0.000000 |


## 4. Parallelization

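Parallelization is handled internally by `joblib`, through its `Parallel` class. You can control the number of jobs with the `n_jobs` parameter. By default, all available cores are used (`n_jobs=-1`). On small datasets, the overhead of starting workers can outweigh the speed-up, as in the timings below, where `n_jobs=4` takes longer in wall time than `n_jobs=1`.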

In [43]:
%time uc.compute_ucell_scores(adata, signatures=signatures, n_jobs=1)

CPU times: user 1.3 s, sys: 81.7 ms, total: 1.38 s
Wall time: 1.38 s


In [46]:
%time uc.compute_ucell_scores(adata, signatures=signatures, n_jobs=4)

CPU times: user 96.7 ms, sys: 97.3 ms, total: 194 ms
Wall time: 3.87 s
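For readers unfamiliar with `joblib`, the general pattern of splitting work into chunks and dispatching them with `Parallel` and `delayed` looks roughly like this. It is a generic sketch, not pyUCell's actual implementation: `score_chunk` and `score_in_chunks` are hypothetical names, and the per-row mean is only a placeholder for real per-cell work.

```python
from joblib import Parallel, delayed

def score_chunk(chunk):
    """Placeholder for per-cell work: here, the mean of each row."""
    return [sum(row) / len(row) for row in chunk]

def score_in_chunks(rows, chunk_size=2, n_jobs=1):
    """Split rows into chunks, process them with joblib, and flatten the results."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    # Parallel returns results in the same order as the submitted tasks.
    results = Parallel(n_jobs=n_jobs)(delayed(score_chunk)(c) for c in chunks)
    return [score for chunk_result in results for score in chunk_result]

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
serial = score_in_chunks(rows, n_jobs=1)
parallel = score_in_chunks(rows, n_jobs=2)
print(serial == parallel)  # True
```

The result does not depend on `n_jobs`; only the run time does, so it is worth timing both settings on your own data before choosing one.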
