### Overview
This notebook imports the **marker genes** (a curated list of marker genes for each cell type), extracts the data, and converts them to CSV for the next annotation step (SCINA).

The marker genes were extracted from PanglaoDB [2] and the Allen Brain Map [3] and placed in a CSV file in which each column lists the marker genes for one cell type.

**This notebook is written in Python.**

In [1]:
#Libraries
import scanpy as sc
import pandas as pd
%matplotlib inline 

### Steps performed

1. Import the marker genes CSV and the unannotated scRNA-seq

2. Convert the gene names to Pascal case (or Upper Camel case). E.g. GENE1 to Gene1

3. Keep only the marker genes that are also present in the unannotated scRNA-seq dataset

4. Export filtered marker gene list
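The outcome of step 2 depends on how the case conversion treats digits inside gene symbols. The snippet below is a minimal sketch that compares two built-in string methods on symbols taken from the marker list. The helper name `to_gene_case` is hypothetical and is not part of this notebook.

```python
# Compare two built-in ways of changing the case of a gene symbol.
symbols = ["SLC17A7", "GRIN2B", "GAD1"]

for symbol in symbols:
    # str.title() starts a new "word" after every non-letter,
    # so a letter that follows a digit is uppercased again.
    titled = symbol.title()
    # str.capitalize() uppercases only the first character
    # and lowercases everything else.
    capitalized = symbol.capitalize()
    print(symbol, titled, capitalized)


def to_gene_case(symbol):
    """Hypothetical helper: first letter uppercase, all others lowercase."""
    return symbol.capitalize()
```

Which form is appropriate depends on the naming convention used in the dataset's `var_names`; mouse gene symbols are conventionally written with only the first letter capitalized.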

#### Functions to handle case conversion and getting matched marker genes

In [2]:
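#Function to convert gene names to Pascal case (or Upper Camel case). E.g. GENE1 to Gene1
def convert_genes_to_Pascalcase(df_markers):
    #DataFrame.items() replaces DataFrame.iteritems(), which was removed in pandas 2.0
    for (colName, colData) in df_markers.items():
        temp_colData = []

        for s_gene in colData:
            #Convert gene symbols only; "NA" and non-string values (e.g. NaN) are kept unchanged
            #so that the column keeps its original length
            if isinstance(s_gene, str) and s_gene != "NA":
                #str.title() uppercases the first letter of every run of letters and lowercases the rest,
                #so a letter that follows a digit stays uppercase (SLC17A7 becomes Slc17A7)
                s_gene = s_gene.title()
            temp_colData.append(s_gene)
        df_markers[colName] = temp_colData
    return df_markers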

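#Function to remove unmatched genes between the curated marker genes and the scRNA-seq to be annotated
def get_matching_curated_markers(adata, df_markers):

    df_markers = convert_genes_to_Pascalcase(df_markers)
    print(df_markers)
    df_markers_filtered = df_markers
    d_removed_markers = df_markers

    #Count the marker genes before filtering (empty cells are padding and are not counted)
    int_gene_count = sum(df_markers[df_markers != ""].count())
    print("Before removing undetected genes: " + str(int_gene_count) + " genes")

    #DataFrame.items() replaces DataFrame.iteritems(), which was removed in pandas 2.0
    for (colName, colData) in df_markers.items():
        print(colName)
        for s_gene in colData:
            print(s_gene)
            #str.contains() is a case-sensitive substring (regex) match against the dataset's gene names;
            #an empty padding cell matches every name and therefore counts as detected
            if adata.var_names.str.contains(s_gene).sum()==0:
                #Gene not detected: drop every row holding it, including that row's genes in the other columns
                df_markers_filtered = df_markers_filtered[df_markers_filtered[colName] != s_gene]
            else:
                #Gene detected: drop its row, so only rows without any detected gene remain
                d_removed_markers = d_removed_markers[d_removed_markers[colName] != s_gene]

    #Count the marker genes that remain after filtering
    int_gene_count = sum(df_markers_filtered[df_markers_filtered != ""].count())
    print("After removing undetected genes: " + str(int_gene_count) + " genes left")
    return df_markers_filtered, d_removed_markers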
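The function above filters whole rows and matches by substring. As a point of comparison, the following is a minimal sketch of a different approach that filters each cell type's list independently and uses exact matching. It is not the method used in this notebook. The function name `split_markers_by_detection` and the toy data are hypothetical; in the notebook, `detected_genes` would correspond to `adata.var_names`.

```python
def split_markers_by_detection(markers, detected_genes):
    """Split each cell type's marker list into detected and undetected genes.

    markers: dict mapping cell type -> list of gene symbols ("" marks padding)
    detected_genes: iterable of gene symbols present in the dataset
    """
    detected = set(detected_genes)
    kept, removed = {}, {}
    for cell_type, genes in markers.items():
        genes = [g for g in genes if g]  # ignore empty padding cells
        # Exact, case-sensitive membership test, evaluated per cell type
        kept[cell_type] = [g for g in genes if g in detected]
        removed[cell_type] = [g for g in genes if g not in detected]
    return kept, removed


# Toy data (illustrative only)
markers = {
    "Glutamatergic": ["Grin1", "Gls", ""],
    "Gabaergic": ["Gad1", "Pax2", "Dlx1"],
}
detected_genes = ["Grin1", "Gad1", "Dlx1", "Gad2"]

kept, removed = split_markers_by_detection(markers, detected_genes)
print(kept)
print(removed)
```

With this layout, an undetected gene in one cell type does not affect the genes listed for the other cell types, at the cost of producing lists of unequal length.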

#### Import the marker genes CSV and the unannotated scRNA-seq

In [None]:
#Import curated markergenes and unannotated scRNA-seq
input_path_markergenes = "../../data/demo_public/input/markergenes/PanglaoAllen_markers_gabagluta_nonneu.csv"
input_path_unannotatedscRNAseq ="../../data/demo_public/input/AllenBrain_unannotated.h5ad"

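#Read curated markergenes
#na_filter=False disables missing-value detection, so empty cells are read as empty strings rather than NaN
#(keep_default_na and na_values are ignored when na_filter=False)
df_markergenes = pd.read_csv(input_path_markergenes, sep=",", index_col=False, na_filter=False, keep_default_na=False, na_values=['_'])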
print("Read curated markergenes")
print(df_markergenes)

#Read unannotated scRNA-seq
que_annotation = sc.read(input_path_unannotatedscRNAseq, cache=True)
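To make the layout of the marker gene file concrete, here is a minimal sketch that parses a small CSV with one column per cell type, using only the standard library. The CSV content is a toy excerpt for illustration, and treating empty cells as padding is an assumption about the file layout.

```python
import csv
import io

# Toy CSV in the one-column-per-cell-type layout (illustrative content only).
# Shorter columns are padded with empty cells.
toy_csv = """Glutamatergic,Gabaergic,Non-Neuronal
SLC17A7,SLC6A1,Qk
SLC17A6,GABBR2,Zbtb20
,,Glul
"""

reader = csv.DictReader(io.StringIO(toy_csv))
markers = {cell_type: [] for cell_type in reader.fieldnames}
for row in reader:
    for cell_type, gene in row.items():
        if gene:  # skip empty padding cells
            markers[cell_type].append(gene)

for cell_type, genes in markers.items():
    print(cell_type, len(genes), genes)
```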

#### Convert the gene names to Pascal case and get the matching marker genes to the unannotated scRNA-seq

In [3]:
#Filter out unmatched genes 
df_markergenes_filtered, df_markergenes_removed = get_matching_curated_markers(que_annotation, df_markergenes)

Read curated markergenes
   Glutamatergic Gabaergic Non-Neuronal
0        SLC17A7    SLC6A1           Qk
1        SLC17A6    GABBR2       Zbtb20
2          GRIN1      GAD1         Glul
3         GRIN2B   GADD45B      Tsc22d4
4            GLS      PAX2          Mt1
..           ...       ...          ...
95                                Gng12
96                                 Aspa
97                             Ppp1r14a
98                              Fam107a
99                                Phka1

[100 rows x 3 columns]
   Glutamatergic Gabaergic Non-Neuronal
0        Slc17A7    Slc6A1           Qk
1        Slc17A6    Gabbr2       Zbtb20
2          Grin1      Gad1         Glul
3         Grin2B   Gadd45B      Tsc22D4
4            Gls      Pax2          Mt1
..           ...       ...          ...
95                                Gng12
96                                 Aspa
97                             Ppp1R14A
98                              Fam107A
99                                Phka1

In [4]:
print("Removed markergenes")
print(df_markergenes_removed)

Removed markergenes
  Glutamatergic Gabaergic Non-Neuronal
3        Grin2B   Gadd45B      Tsc22D4


In [5]:
print("Kept curated markergenes")
print(df_markergenes_filtered)

Kept curated markergenes
   Glutamatergic Gabaergic Non-Neuronal
2          Grin1      Gad1         Glul
4            Gls      Pax2          Mt1
12                  Mybpc1         Gng5
13                   Parm1      Plekhg1
14                    Dlx1       Kcnj10
..           ...       ...          ...
92                              Carhsp1
94                                Grb14
95                                Gng12
96                                 Aspa
99                                Phka1

[71 rows x 3 columns]


In [6]:
#Export filtered marker genes
output_path = "../../data/demo_public/output/scina_filtered_markergenes.csv"
df_markergenes_filtered.to_csv(output_path, sep=",", index=False)
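If the filtered marker lists have different lengths per cell type, they need padding before they fit a column-per-cell-type CSV. The following is a minimal sketch of that padding step using only the standard library. The `kept` dictionary is toy data, and the in-memory buffer stands in for an output file; both are assumptions for illustration.

```python
import csv
import io
from itertools import zip_longest

# Toy filtered marker lists of unequal length (illustrative only)
kept = {
    "Glutamatergic": ["Grin1", "Gls"],
    "Gabaergic": ["Gad1", "Pax2", "Dlx1"],
}

buffer = io.StringIO()  # stands in for an output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(kept.keys())  # header row: one column per cell type
# zip_longest pads the shorter columns with empty strings
for row in zip_longest(*kept.values(), fillvalue=""):
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```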

#### References
1. Chia, C. M., Roig Adam, A., & Moro, A. (2022). *In silico* multiple single-subject neural tissue screening using deconvolution on pseudo-bulk RNA-seq - a prototype. Bioinformatics and Systems Biology joint degree program. Vrije Universiteit Amsterdam and University of Amsterdam. 

2. Franzén, O., Gan, L. M., & Björkegren, J. L. M. (2019). PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database: The Journal of Biological Databases and Curation, 2019, baz046. https://doi.org/10.1093/database/baz046

3. Allen Institute for Brain Science (2004). Allen Mouse Brain Atlas, Mouse Whole Cortex and Hippocampus 10x. Available from mouse.brain-map.org. Allen Institute for Brain Science (2011).