# Single-cell RNA Sequencing of human scalp: Preprocessing

Data Source Acknowledgment: The dataset is sourced from [GSE212450](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE212450). This notebook uses a subset of it: single-cell RNA sequencing data from human scalp with alopecia areata (GSM6532922, AA8_scRNA) and from a control sample (GSM6532927, C_SD2_scRNA).

Reference: Ober-Reynolds B, Wang C, Ko JM, Rios EJ et al. Integrated single-cell chromatin and transcriptomic analyses of human scalp identify gene-regulatory programs and critical cell types for hair and skin diseases. Nat Genet 2023 Aug;55(8):1288-1300. PMID: 37500727

This dataset is used in this repository solely for Python practice. The notebook uses it to practice data cleaning and clustering.

In [29]:
#using SCanalysis environment
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scanpy as sc
import scvi
import anndata as ad

In [2]:
import warnings
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", RuntimeWarning)

## 1. Data Loading

In [4]:
#loading control
adata_CON = sc.read_10x_mtx('./SD2')

In [23]:
adata_CON

AnnData object with n_obs × n_vars = 3313 × 33538
    var: 'gene_ids', 'feature_types'

In [32]:
# the h5ad file will be used as input for CellBender
adata_CON.write_h5ad('adata_CON.h5ad')

In [10]:
#loading case
adata_CASE = sc.read_10x_mtx('./AA2', prefix='GSM6532919_AA2_')

In [11]:
adata_CASE

AnnData object with n_obs × n_vars = 7503 × 33538
    var: 'gene_ids', 'feature_types'

In [33]:
adata_CASE.write_h5ad('adata_CASE.h5ad')
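
The `prefix` argument of `sc.read_10x_mtx` is prepended to the standard 10x file names inside the folder. The sketch below lists the file names that call would then look for. It assumes gzipped Cell Ranger v3-style output (`matrix.mtx.gz`, `features.tsv.gz`, `barcodes.tsv.gz`). The helper names `expected_10x_files` and `missing_10x_files` are hypothetical and not part of scanpy.

```python
from pathlib import Path

# Assumption: Cell Ranger v3-style, gzipped file names.
TENX_V3_NAMES = ("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")


def expected_10x_files(folder, prefix=""):
    """Hypothetical helper: paths a prefixed 10x folder is assumed to contain."""
    return [Path(folder) / f"{prefix}{name}" for name in TENX_V3_NAMES]


def missing_10x_files(folder, prefix=""):
    """Return the names of expected files that are not present on disk."""
    return [p.name for p in expected_10x_files(folder, prefix) if not p.exists()]


paths = expected_10x_files("./AA2", prefix="GSM6532919_AA2_")
for p in paths:
    print(p)
```

Running `missing_10x_files` before loading gives a quick check that the downloaded GEO files are named as expected.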

## 2. Ambient removal

In [None]:
# run in the cellbender environment (without a GPU this is very slow, so SoupX may be worth trying instead)
#!cellbender remove-background --input adata_CASE.h5ad --output adata_CASE_cleaned.h5ad

In [None]:
#!cellbender remove-background --input adata_CON.h5ad --output adata_CON_cleaned.h5ad

## 3. Preprocessing

### 3.1 Quality control

In [None]:
def qc(adata):
    #label mitochondrial genes
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    #label ribosomal genes
    adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
    #label hemoglobin genes.
    adata.var["hb"] = adata.var_names.str.contains(("^HB[^(P)]"))

    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo", "hb"], inplace=True, percent_top=[20], log1p=True)

    # remove columns we don't use
    remove_list = ['total_counts_mt', 'log1p_total_counts_mt', 'total_counts_ribo',
                  'log1p_total_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb']
    adata.obs = adata.obs[[x for x in adata.obs.columns if x not in remove_list]]
    return adata
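
To see which gene symbols the three labels in `qc` pick up, the same prefix and regex rules can be tried on a few names with the standard library only. This is a minimal sketch: the gene list is an illustrative assumption, and `flag_gene` is a hypothetical helper. Note that the pattern `^HB[^(P)]` also matches non-hemoglobin symbols that start with `HB`, such as `HBEGF`.

```python
import re

# Same pattern as in qc(): "HB" followed by any character except "(", "P" or ")".
HB_PATTERN = re.compile(r"^HB[^(P)]")


def flag_gene(name):
    """Hypothetical helper mirroring the three labels set in qc()."""
    return {
        "mt": name.startswith("MT-"),
        "ribo": name.startswith(("RPS", "RPL")),
        "hb": bool(HB_PATTERN.search(name)),
    }


# Illustrative gene symbols, not taken from the dataset.
genes = ["MT-CO1", "RPS6", "RPL13", "HBB", "HBA1", "HBP1", "HBEGF", "KRT14"]
flags = {g: flag_gene(g) for g in genes}

for gene, f in flags.items():
    print(gene, f)
```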


In [None]:
#CASE

In [None]:
#CONTROL
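
As a rough guide to the per-cell columns that `sc.pp.calculate_qc_metrics` adds, the sketch below recomputes three of them (`total_counts`, `n_genes_by_counts`, `pct_counts_mt`) with NumPy on a toy count matrix. The matrix and gene names are made up for illustration. This is a conceptual re-implementation, not scanpy's code.

```python
import numpy as np

# Toy data (assumption): 3 cells x 5 genes of raw counts.
genes = np.array(["MT-CO1", "MT-ND1", "RPS6", "KRT14", "HBB"])
counts = np.array([
    [5, 5, 10, 80, 0],    # cell 0: 10% mitochondrial
    [30, 20, 10, 40, 0],  # cell 1: 50% mitochondrial
    [0, 0, 0, 9, 1],      # cell 2: very few counts, no mitochondrial reads
])

# Boolean mask of mitochondrial genes, same rule as in qc().
is_mt = np.char.startswith(genes, "MT-")

# Total counts per cell.
total_counts = counts.sum(axis=1)
# Number of genes with at least one count per cell.
n_genes_by_counts = (counts > 0).sum(axis=1)
# Percentage of each cell's counts that come from mitochondrial genes.
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(total_counts, n_genes_by_counts, pct_counts_mt)
```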

### 3.2 Filtering low-quality cells based on QC metrics

In [2]:
# MAD (median absolute deviations)
from scipy.stats import median_abs_deviation as mad

In [None]:
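def MAD_outlier(adata, metric, nmads):
    # Assumed completion of the truncated cell: flag cells more than nmads MADs from the median
    M = adata.obs[metric]
    return (M < np.median(M) - nmads * mad(M)) | (M > np.median(M) + nmads * mad(M))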
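
The idea behind MAD-based filtering can be sketched on a plain array: a value counts as an outlier when it lies more than `nmads` median absolute deviations from the median. The example below uses NumPy only. The values and the threshold of 5 are illustrative assumptions, and `mad_outlier_mask` is a hypothetical name. The unscaled MAD computed here corresponds to `scipy.stats.median_abs_deviation` with its default `scale=1.0`.

```python
import numpy as np


def mad_outlier_mask(values, nmads):
    """Flag values more than nmads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    # Unscaled MAD: median of absolute deviations from the median.
    mad_value = np.median(np.abs(values - med))
    return np.abs(values - med) > nmads * mad_value


# Illustrative log1p total counts for nine cells (made-up numbers).
log_counts = np.array([7.9, 8.0, 8.1, 8.2, 8.3, 8.0, 8.1, 3.0, 12.5])
mask = mad_outlier_mask(log_counts, nmads=5)

print(mask)               # True marks cells that would be filtered out
print(log_counts[~mask])  # values that are kept
```

Because the median and MAD are robust to extreme values, the two extreme cells barely move the cutoffs, unlike a threshold based on mean and standard deviation.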

### 3.3 Doublet detection