# Select Markers

* This tutorial shows how to select markers with the **scmags** package.


The examples below use the `baron_h1` dataset that ships with the package.

First, import the package.

In [1]:
import scmags as mg

Then load the dataset.

In [2]:
baron_h1 = mg.datasets.baron_h1()

## Filter Genes
First, redundant genes need to be filtered out for computational efficiency.

In [3]:
baron_h1.filter_genes()

-> Eliminating low expression genes
-> Log-Normalization is in Progress
-> Selecting cluster-specific genes


You can view the genes that remain after filtering as follows.\
The function returns a dictionary in which each key corresponds to a cluster.

In [4]:
rem_genes = baron_h1.get_filter_genes()
dict(list(rem_genes.items())[0:2])

{'acinar': ['ALB',
  'ALDOB',
  'CEL',
  'CTRL',
  'CUZD1',
  'GP2',
  'PDIA2',
  'PNLIPRP1',
  'PNLIPRP2',
  'REG1P'],
 'activated_stellate': ['ADAMTS12',
  'COL6A3',
  'CRLF1',
  'FBN1',
  'FMOD',
  'LAMC3',
  'PRRX1',
  'SFRP2',
  'THBS2',
  'VCAN']}
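
Since the result is an ordinary dictionary, a preview can also be taken without building the full list of items first. The following is a minimal sketch on a hand-made dictionary: `rem_genes_demo` and `preview` are hypothetical names, and the gene names are a few entries copied from outputs in this tutorial, not the full return value.

```python
from itertools import islice

# Hypothetical stand-in for the dictionary returned by get_filter_genes();
# only a few gene names per cluster are listed for brevity.
rem_genes_demo = {
    "acinar": ["ALB", "ALDOB", "CEL"],
    "activated_stellate": ["ADAMTS12", "COL6A3", "CRLF1"],
    "alpha": ["IRX2", "GC", "NPNT"],
}


def preview(mapping, n=2):
    """Return the first n key/value pairs of a dict, in insertion order."""
    return dict(islice(mapping.items(), n))


first_two = preview(rem_genes_demo)
print(list(first_two))  # ['acinar', 'activated_stellate']
```

`islice` stops after `n` items, which avoids copying every entry when the dictionary holds many clusters.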

You can also return the corresponding indices in the data matrix instead of the gene names.

In [5]:
baron_h1.get_filter_genes(ind_return=True)

{'acinar': array([  550,   572,  3055,  4021,  4062,  6872, 12577, 13082, 13083,
        14265]),
 'activated_stellate': array([  276,  3637,  3836,  5878,  6179,  9231, 13610, 15293, 17233,
        18923]),
 'alpha': array([ 2390,  3571,  6534,  7966,  8442,  8713, 10734, 11417, 12416,
        14361]),
 'beta': array([   68,   322,  1974,  2085,  5140,  6383,  7638,  7959, 11265,
        12967]),
 'delta': array([ 1449,  4516,  6407,  8364,  8453,  9345, 10318, 12504, 16021,
        16794]),
 'ductal': array([  556,   808,  3383,  6996,  9215, 10397, 13242, 13639, 14876,
        15787]),
 'endothelial': array([  357,  3402,  4968,  5360,  6159,  8756, 12396, 14586, 16167,
        19048]),
 'epsilon': array([ 6027,  6299,  6358,  6637,  9723, 11320, 12939, 15274, 16388,
        19024]),
 'gamma': array([  932,  1130,  5388,  9325,  9929, 11153, 13790, 16601, 18319,
        18672]),
 'macrophage': array([ 1971,  1973,  2765,  4127,  7541,  8490, 10454, 12935, 15487,
        17180]),
 'm
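
Index arrays like the ones above can be translated back into names with NumPy integer indexing. The sketch below uses toy data only: `gene_names` and `idx_by_cluster` are hypothetical stand-ins, and it assumes the indices refer to positions along the gene axis of the data matrix.

```python
import numpy as np

# Hypothetical gene name array; position i holds the name of gene i.
gene_names = np.array(["G0", "G1", "G2", "G3", "G4", "G5"])

# Hypothetical per-cluster index arrays, shaped like the output above.
idx_by_cluster = {
    "c1": np.array([0, 3]),
    "c2": np.array([2, 5]),
}

# Integer-array indexing picks the names at the given positions.
names_by_cluster = {
    cluster: gene_names[idx].tolist()
    for cluster, idx in idx_by_cluster.items()
}
print(names_by_cluster)  # {'c1': ['G0', 'G3'], 'c2': ['G2', 'G5']}
```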

If you have not set an intra-cluster expression rate threshold, you can also view the automatically determined thresholds.

In [6]:
baron_h1.get_filt_cluster_thresholds

{'acinar': 0.6075000033527612,
 'activated_stellate': 0.6502941231429576,
 'alpha': 0.5977754232287407,
 'beta': 0.5841227070987225,
 'delta': 0.5828971944004298,
 'ductal': 0.61875,
 'endothelial': 0.5787692331522704,
 'epsilon': 0.647307697981596,
 'gamma': 0.5940000014752149,
 'macrophage': 0.6364285777509212,
 'mast': 0.61875,
 'quiescent_stellate': 0.5810869581252337,
 'schwann': 0.69300000295043,
 't_cell': 0.7425}
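
To compare the automatically determined thresholds across clusters, the dictionary can be sorted by value. This is a small sketch on values copied from the output above and rounded to four decimals; `thresholds_demo` is a hypothetical name, not an object provided by the package.

```python
# Thresholds copied from the output above, rounded to four decimals.
thresholds_demo = {
    "acinar": 0.6075, "activated_stellate": 0.6503, "alpha": 0.5978,
    "beta": 0.5841, "delta": 0.5829, "ductal": 0.6188,
    "endothelial": 0.5788, "epsilon": 0.6473, "gamma": 0.5940,
    "macrophage": 0.6364, "mast": 0.6188, "quiescent_stellate": 0.5811,
    "schwann": 0.6930, "t_cell": 0.7425,
}

# Sort clusters from the lowest to the highest threshold.
ranked = sorted(thresholds_demo.items(), key=lambda item: item[1])
lowest, highest = ranked[0], ranked[-1]
print(lowest)   # ('endothelial', 0.5788)
print(highest)  # ('t_cell', 0.7425)
```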

If you do not want the threshold to be determined automatically, you can set it yourself.\
The value should be between 0 and 1.

In [7]:
baron_h1.filter_genes(in_cls_thres=0.7)

-> Eliminating low expression genes
-> Log-Normalization is in Progress
-> Selecting cluster-specific genes


With this setting, the same threshold is applied to every cluster instead of a per-cluster value.\
Genes expressed in less than 70% of the cells in a cluster are filtered out.

In [8]:
baron_h1.get_filt_cluster_thresholds

{'acinar': 0.693,
 'activated_stellate': 0.693,
 'alpha': 0.693,
 'beta': 0.693,
 'delta': 0.693,
 'ductal': 0.693,
 'endothelial': 0.693,
 'epsilon': 0.693,
 'gamma': 0.693,
 'macrophage': 0.693,
 'mast': 0.693,
 'quiescent_stellate': 0.693,
 'schwann': 0.693,
 't_cell': 0.693}
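
The idea of an intra-cluster expression rate can be illustrated on a toy matrix. The sketch below is not the package's implementation: it assumes that a gene counts as "expressed" in a cell when its value is non-zero, and `expr`, `in_cluster`, and `expression_rate` are hypothetical names.

```python
import numpy as np

# Toy expression matrix: 5 cells (rows) x 4 genes (columns).
expr = np.array([
    [3, 0, 1, 0],
    [2, 1, 4, 1],
    [5, 1, 2, 3],
    [1, 0, 3, 2],
    [0, 0, 1, 0],
])

# The first four cells belong to the cluster of interest.
in_cluster = np.array([True, True, True, True, False])


def expression_rate(matrix, mask):
    """Fraction of the masked cells in which each gene is non-zero."""
    return (matrix[mask] > 0).mean(axis=0)


rates = expression_rate(expr, in_cluster)
keep = np.flatnonzero(rates >= 0.7)  # genes that pass a 0.7 threshold
print(rates.tolist())  # [1.0, 0.5, 1.0, 0.75]
print(keep.tolist())   # [0, 2, 3]
```

Gene 1 is expressed in only half of the cluster's cells, so it falls below the threshold and is dropped.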

You can also set the number of genes that remain after filtering.\
This may be necessary when selecting more marker genes, because marker selection is carried out on the genes that remain after filtering.\
As the example shows, when the parameter is set to 20, 20 genes remain for each cluster after filtering.

In [9]:
baron_h1.filter_genes(nof_sel=20)
baron_h1.get_filter_genes(ind_return=True)

-> Eliminating low expression genes
-> Log-Normalization is in Progress
-> Selecting cluster-specific genes


{'acinar': array([  550,   572,  1495,  3055,  3057,  4020,  4021,  4062,  6872,
         7103,  7104,  8968, 10655, 10732, 12577, 13082, 13083, 14265,
        16742, 17007]),
 'activated_stellate': array([  276,  3632,  3637,  3836,  5460,  5878,  6179,  6624,  9231,
         9504,  9691, 12501, 13610, 15293, 16018, 16430, 17233, 17850,
        18923, 19025]),
 'alpha': array([ 2390,  3571,  6534,  6730,  6799,  7966,  8442,  8713, 10734,
        11417, 12291, 12339, 12416, 12508, 14361, 15275, 15835, 15991,
        16150, 17366]),
 'beta': array([   68,   322,  1974,  2085,  4675,  5140,  5218,  5342,  6383,
         7240,  7638,  7959, 11265, 12967, 14999, 15686, 15842, 16640,
        16797, 17860]),
 'delta': array([  959,  1449,  1557,  2353,  4516,  6407,  6543,  8364,  8453,
         9345, 10318, 12285, 12358, 12504, 12506, 13452, 14210, 16021,
        16729, 16794]),
 'ductal': array([  458,   556,   808,  2507,  3383,  4203,  4835,  6996,  9048,
         9077,  9215, 10397, 10

## Select Markers

After the filtering process, you can select the markers.

In [10]:
baron_h1.sel_clust_marker()

-> Selecting  markers for each cluster


You can view the selected markers as follows.

In [11]:
baron_h1.get_markers()

| | Marker_1 | Marker_2 | Marker_3 | Marker_4 | Marker_5 |
|---|---|---|---|---|---|
| C_acinar | PNLIPRP1 | CEL | KLK1 | SYCN | CTRC |
| C_activated_stellate | CRLF1 | SFRP2 | COL6A3 | VCAN | THBS2 |
| C_alpha | IRX2 | GC | NPNT | MUC13 | CAMK2G |
| C_beta | ADCYAP1 | HADH | IGF2 | G6PC2 | ERO1B |
| C_delta | LEPR | MIR7.3HG | PCP4 | BAIAP3 | RBP4 |
| C_ductal | S100A14 | MMP7 | ANXA3 | ALDH1A3 | KRT19 |
| C_endothelial | PLVAP | CD93 | PCAT19 | FLT1 | KDR |
| C_epsilon | GHRL | FRZB | NNMT | SPTSSB | VSTM2L |
| C_gamma | ARX | AQP3 | STMN2 | PTP4A3 | ETV1 |
| C_macrophage | PLA2G7 | C1QC | ITGB2 | SDS | C1QB |
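
A quick sanity check on a marker table is to confirm that no gene was selected for more than one cluster. The following sketch works on a plain dictionary filled with the first three rows of the table above; `markers_demo` and `invert` are hypothetical names, and the real return value may need converting to this shape first.

```python
# Marker names copied from the first three rows of the table above.
markers_demo = {
    "acinar": ["PNLIPRP1", "CEL", "KLK1", "SYCN", "CTRC"],
    "activated_stellate": ["CRLF1", "SFRP2", "COL6A3", "VCAN", "THBS2"],
    "alpha": ["IRX2", "GC", "NPNT", "MUC13", "CAMK2G"],
}


def invert(markers):
    """Map each marker gene to the list of clusters that selected it."""
    gene_to_clusters = {}
    for cluster, genes in markers.items():
        for gene in genes:
            gene_to_clusters.setdefault(gene, []).append(cluster)
    return gene_to_clusters


gene_to_clusters = invert(markers_demo)

# Genes that appear as a marker for more than one cluster.
shared = {g: c for g, c in gene_to_clusters.items() if len(c) > 1}
print(gene_to_clusters["CEL"])  # ['acinar']
print(shared)                   # {}
```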


Alternatively, you can return the corresponding indices in the data matrix.

In [12]:
baron_h1.get_markers(ind_return=True)

| | Marker_1 | Marker_2 | Marker_3 | Marker_4 | Marker_5 |
|---|---|---|---|---|---|
| C_acinar | 13082 | 3055 | 8968 | 16742 | 4020 |
| C_activated_stellate | 3836 | 15293 | 3637 | 18923 | 17233 |
| C_alpha | 8442 | 6534 | 11417 | 10734 | 2390 |
| C_beta | 322 | 7240 | 7959 | 6383 | 5342 |
| C_delta | 9345 | 10318 | 12504 | 1449 | 14210 |
| C_ductal | 14876 | 10397 | 808 | 556 | 9048 |
| C_endothelial | 13041 | 2879 | 12396 | 6159 | 8756 |
| C_epsilon | 6637 | 6299 | 11320 | 16388 | 19024 |
| C_gamma | 1130 | 932 | 16601 | 13790 | 5388 |
| C_macrophage | 12935 | 1973 | 8490 | 15095 | 1971 |


You can also extract the expression values of the selected markers from the data matrix.

In [13]:
mark_data = baron_h1.get_marker_data()
mark_data

{'acinar': array([[4.9068906 , 3.5849625 , 4.5849625 , 6.857981  , 5.9068906 ],
        [2.80735492, 3.169925  , 4.857981  , 6.62935662, 5.42626475],
        [4.39231742, 4.70043972, 4.169925  , 5.4918531 , 5.28540222],
        ...,
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ]]),
 'activated_stellate': array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        ...,
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 'alpha': array([[0.       , 0.       , 0.       , 0.       , 1.5849625],
        [0.       , 0.       , 0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0.       ],
        ...,
        [0.       , 0.       , 0.       , 0.       , 0.       ],
        [0.       , 0.       , 0.       , 0.       , 0. 

By default, five markers are selected for each cluster. You can access the marker data of a cluster through its dictionary key.

In [14]:
schwann = mark_data['schwann']
schwann

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [15]:
schwann.shape

(1937, 5)
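
The shape above is consistent with one row per cell and one column per selected marker. A per-cluster dictionary of this form can be sketched with NumPy column indexing; the matrix, the index lists, and the names `data`, `marker_idx`, and `marker_data_demo` below are all hypothetical toy values, not what the package computes internally.

```python
import numpy as np

# Toy data matrix: 8 cells (rows) x 12 genes (columns), random counts.
rng = np.random.default_rng(0)
data = rng.poisson(1.0, size=(8, 12)).astype(float)

# Hypothetical marker indices per cluster, five markers each.
marker_idx = {
    "c1": [0, 3, 7, 9, 11],
    "c2": [1, 2, 4, 5, 6],
}

# Selecting columns keeps every cell and only the marker genes.
marker_data_demo = {
    cluster: data[:, idx] for cluster, idx in marker_idx.items()
}
print(marker_data_demo["c1"].shape)  # (8, 5)
```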

You can also perform marker selection with dynamic programming.

In [16]:
baron_h1.sel_clust_marker(dyn_prog=True)

-> Selecting  markers for each cluster
-> |⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛| 100% Number of Clusters With Selected Markers : 14  


If you want, you can increase the number of markers to be selected.

<div class="alert alert-info">

**Note**
    
If you increase the number of markers, make sure that the number of genes remaining after filtering is greater than the number of markers to be selected.

</div>  
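
The condition in the note can be checked before running the selection. This is a minimal sketch on a hand-made dictionary shaped like the filtering output; `filtered_demo` and `clusters_too_small` are hypothetical names and not part of the package.

```python
def clusters_too_small(filtered, nof_markers):
    """Return clusters whose filtered gene count does not exceed nof_markers."""
    return sorted(
        cluster
        for cluster, genes in filtered.items()
        if len(genes) <= nof_markers
    )


# Hypothetical filtering result: cluster name -> remaining gene indices.
filtered_demo = {
    "a": list(range(20)),
    "b": list(range(10)),
    "c": list(range(8)),
}

print(clusters_too_small(filtered_demo, nof_markers=10))  # ['b', 'c']
print(clusters_too_small(filtered_demo, nof_markers=5))   # []
```

An empty list means every cluster has more filtered genes than the requested number of markers.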

In [17]:
baron_h1.filter_genes(nof_sel=20)
baron_h1.sel_clust_marker(nof_markers=10)

-> Eliminating low expression genes
-> Log-Normalization is in Progress
-> Selecting cluster-specific genes
-> Selecting  markers for each cluster


In [18]:
baron_h1.get_markers(ind_return=True)

| | Marker_1 | Marker_2 | Marker_3 | Marker_4 | Marker_5 | Marker_6 | Marker_7 | Marker_8 | Marker_9 | Marker_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| C_acinar | 13082 | 3055 | 8968 | 16742 | 4020 | 3057 | 12577 | 13083 | 4062 | 7103 |
| C_activated_stellate | 3836 | 15293 | 3637 | 18923 | 17233 | 6179 | 5878 | 3632 | 9231 | 13610 |
| C_alpha | 8442 | 6534 | 11417 | 10734 | 2390 | 8713 | 14361 | 17366 | 15991 | 12291 |
| C_beta | 322 | 7240 | 7959 | 6383 | 5342 | 5140 | 11265 | 7638 | 68 | 1974 |
| C_delta | 9345 | 10318 | 12504 | 1449 | 14210 | 6407 | 4516 | 6543 | 8453 | 959 |
| C_ductal | 14876 | 10397 | 808 | 556 | 9048 | 16828 | 3383 | 13242 | 9077 | 9215 |
| C_endothelial | 13041 | 2879 | 12396 | 6159 | 8756 | 16167 | 13115 | 17277 | 4968 | 19048 |
| C_epsilon | 6637 | 6299 | 11320 | 16388 | 19024 | 9723 | 6358 | 12939 | 6027 | 190 |
| C_gamma | 1130 | 932 | 16601 | 13790 | 5388 | 7869 | 10138 | 9929 | 18672 | 12375 |
| C_macrophage | 12935 | 1973 | 8490 | 15095 | 1971 | 3888 | 7541 | 2765 | 10454 | 4127 |
