# scRNA-seq Imputation

Data Availability Statement
Tabula Muris data

Smart-seq2: https://doi.org/10.6084/m9.figshare.5715040.v1 (The Tabula Muris Consortium, 2017a).

10X Chromium: https://doi.org/10.6084/m9.figshare.5715040.v1 (The Tabula Muris Consortium, 2017b).

R packages:

MAGIC: Rmagic (v0.1.0) https://github.com/KrishnaswamyLab/MAGIC

DrImpute: DrImpute (v1.0) https://github.com/ikwak2/DrImpute

scImpute: scImpute (v0.0.8) https://github.com/Vivianstats/scImpute

SAVER: SAVER (v1.0.0) https://github.com/mohuangx/SAVER

kNN-smoothing: knn_smooth.R (version 2) https://github.com/yanailab/knn-smoothing

scater: scater (v1.6.3) https://www.bioconductor.org/packages/release/bioc/html/scater.html

splatter: splatter (v1.2.2) https://bioconductor.org/packages/release/bioc/html/splatter.html

permute: permute (v0.9-4) https://cran.r-project.org/web/packages/permute/index.html

Python/Anaconda packages:

DCA: dca (v0.2.2) https://github.com/theislab/dca

Custom scripts: https://github.com/tallulandrews/F1000Imputation

## Denoising scRNA-seq Data with DCA

### 1. Running the dca package on definitive endoderm cells (DECs)

The single-cell expression data for the DECs come from GEO accession GSE75748 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75748). The dataset includes RNA expression counts for 1018 single cells from snapshot progenitors.

The dataset was published with the paper "Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm" (https://pubmed.ncbi.nlm.nih.gov/27534536/).

In [9]:
# !python -m dca.__main__ data/endoderm/endoderm.csv data/endoderm/

In [13]:
import os
import pandas as pd

# Build the path relative to the current working directory.
file_path = os.path.join(os.getcwd(), 'data', 'endoderm', 'endoderm.csv')
endoderm = pd.read_csv(file_path)
endoderm.head()

Unnamed: 0.1,Unnamed: 0,H1_Exp1.001,H1_Exp1.002,H1_Exp1.003,H1_Exp1.004,H1_Exp1.006,H1_Exp1.007,H1_Exp1.008,H1_Exp1.009,H1_Exp1.010,...,TB_Batch2.135,TB_Batch2.136,TB_Batch2.137,TB_Batch2.138,TB_Batch2.139,TB_Batch2.140,TB_Batch2.141,TB_Batch2.142,TB_Batch2.143,TB_Batch2.144
0,MKL2,10,162,3,42,0,2,18,0,182,...,364,1,21,1127,2119,5,500,18,472,350
1,CD109,6,2,166,9,7,53,4,64,29,...,15,38,38,11,48,23,362,22,36,25
2,ABTB1,0,28,0,1,0,9,0,0,0,...,0,0,0,0,0,0,0,3,39,0
3,MAST2,0,133,41,0,0,2,0,0,0,...,175,41,32,3,6,206,43,2,1,99
4,KAT5,0,7,52,20,0,6,0,0,103,...,0,577,0,3,2,0,56,2,0,0
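
The two `Unnamed` columns in the output above suggest that the CSV was saved with an extra row counter, so the gene names sit in an ordinary column rather than in the index. A minimal sketch of tidying such a frame and checking how sparse the counts are, assuming the column names shown above. The small frame is a hypothetical stand-in built from the first three cells of the printed rows, not the real file:

```python
import pandas as pd

# Hypothetical stand-in for endoderm.csv: first three cells of the rows printed above.
raw = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2, 3, 4],
    "Unnamed: 0": ["MKL2", "CD109", "ABTB1", "MAST2", "KAT5"],
    "H1_Exp1.001": [10, 6, 0, 0, 0],
    "H1_Exp1.002": [162, 2, 28, 133, 7],
    "H1_Exp1.003": [3, 166, 0, 41, 52],
})

# Drop the leftover row counter and use the gene names as the index.
counts = raw.drop(columns=["Unnamed: 0.1"]).set_index("Unnamed: 0")
counts.index.name = "gene"

# Fraction of zero entries, a rough indicator of how sparse the matrix is.
zero_fraction = float((counts == 0).to_numpy().mean())
print(counts.shape, round(zero_fraction, 3))
```

On the real file the same two steps would apply to `endoderm` directly; whether the zeros reflect dropout or true absence of expression is exactly what the imputation methods try to resolve.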


## scDMFK

In [18]:
!ls

LICENSE.txt      dca-env          reproducibility  setup.py
README.md        demo.ipynb       requirements.txt tutorial.ipynb
data             docs             scDMFK
dca              pytest.ini       scripts


In [21]:
%pip install jgraph

Collecting jgraph
  Downloading jgraph-0.2.1-py2.py3-none-any.whl (119 kB)
Installing collected packages: jgraph
Successfully installed jgraph-0.2.1
Note: you may need to restart the kernel to use updated packages.


In [35]:
!python scDMFK/run.py --dataname "Young"

Instructions for updating:
non-resource variables are not supported in the long term
this line should show
begin the pretraining
2024-04-06 11:29:51.306863: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-06 11:29:51.337957: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
  super()._check_params_vs_input(X, default_n_init=10)
begin the funetraining
Traceback (most recent call last):
  File "scDMFK/run.py", line 56, in <module>
    accuracy, ARI, NMI = scClustering.funetrain(X, count_X, Y, size_factor, args.batch_size, args.funetrain_epoch, args.update_epoch, args.error)
  File "/Users/yufeideng/Documents/GitHub/dca/scDMFK/network.p

In [41]:
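import numpy as np
from scipy.optimize import linear_sum_assignment as linear_assignment

def cluster_acc(y_true, y_pred):
    """Clustering accuracy under the best one-to-one matching of predicted to true labels."""
    y_true = y_true.astype(np.int64)
    y_pred = y_pred.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    # w[i, j] counts the cells with predicted label i and true label j.
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    # Hungarian algorithm: minimising (w.max() - w) maximises the matched counts.
    # SciPy's solver stands in for the older sklearn.utils.linear_assignment_ helper.
    row_ind, col_ind = linear_assignment(w.max() - w)
    return w[row_ind, col_ind].sum() / y_pred.size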

y_true = np.array([3, 1, 3, 3, 4, 1, 4, 4, 6])
y_pred = np.array([4, 0, 2, 3, 4, 1, 4, 4, 6])
cluster_acc(y_true, y_pred)

0.6666666666666666
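
The score above depends on which predicted cluster gets matched to which true label, and it can help to see that matching explicitly. A minimal sketch, assuming the same contingency-table approach as `cluster_acc`; the helper name `best_label_mapping` is hypothetical and not part of scDMFK:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_label_mapping(y_true, y_pred):
    """Return ({predicted label: matched true label}, accuracy) via Hungarian matching."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    n_labels = max(y_pred.max(), y_true.max()) + 1
    # Contingency table: rows are predicted labels, columns are true labels.
    w = np.zeros((n_labels, n_labels), dtype=np.int64)
    np.add.at(w, (y_pred, y_true), 1)
    # Minimising (w.max() - w) is the same as maximising the matched counts.
    rows, cols = linear_sum_assignment(w.max() - w)
    # Keep only matches that are backed by at least one cell.
    mapping = {int(r): int(c) for r, c in zip(rows, cols) if w[r, c] > 0}
    accuracy = float(w[rows, cols].sum() / y_pred.size)
    return mapping, accuracy

mapping, acc = best_label_mapping([3, 1, 3, 3, 4, 1, 4, 4, 6],
                                  [4, 0, 2, 3, 4, 1, 4, 4, 6])
print(mapping, acc)
```

For the toy labels used above, six of the nine cells end up in a matched pair, which is where the 0.667 comes from. Several matchings reach that optimum, so the mapping reported for the smaller clusters can differ between solvers.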

In [40]:
import scDMFK.utils as utils
import numpy as np
import h5py
import scipy as sp
import scipy.sparse  # make sure sp.sparse is loaded; older SciPy does not import it implicitly
import pandas as pd

def read_clean(data):
    assert isinstance(data, np.ndarray)
    if data.dtype.type is np.bytes_:
        data = utils.decode(data)
    if data.size == 1:
        data = data.flat[0]
    return data

def dict_from_group(group):
    assert isinstance(group, h5py.Group)
    d = utils.dotdict()
    for key in group:
        if isinstance(group[key], h5py.Group):
            value = dict_from_group(group[key])
        else:
            value = read_clean(group[key][...])
        d[key] = value
    return d

def read_data(filename, sparsify=False, skip_exprs=False):
    with h5py.File(filename, "r") as f:
        obs = pd.DataFrame(dict_from_group(f["obs"]), index=utils.decode(f["obs_names"][...]))
        var = pd.DataFrame(dict_from_group(f["var"]), index=utils.decode(f["var_names"][...]))
        uns = dict_from_group(f["uns"])
        if not skip_exprs:
            exprs_handle = f["exprs"]
            if isinstance(exprs_handle, h5py.Group):
                mat = sp.sparse.csr_matrix((exprs_handle["data"][...], exprs_handle["indices"][...],
                                               exprs_handle["indptr"][...]), shape=exprs_handle["shape"][...])
            else:
                mat = exprs_handle[...].astype(np.float32)
                if sparsify:
                    mat = sp.sparse.csr_matrix(mat)
        else:
            mat = sp.sparse.csr_matrix((obs.shape[0], var.shape[0]))
    return mat, obs, var, uns

def prepro(filename):
    data_path = "data/" + filename + "/data.h5"
    mat, obs, var, uns = read_data(data_path, sparsify=False, skip_exprs=False)
    if isinstance(mat, np.ndarray):
        X = np.array(mat)
    else:
        X = np.array(mat.toarray())
    cell_name = np.array(obs["cell_type1"])
    cell_type, cell_label = np.unique(cell_name, return_inverse=True)
    return X, cell_label

X, Y = prepro("Young")
Y

array([6, 6, 6, ..., 0, 0, 0])
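
Two steps in the loader above are easy to misread: rebuilding a CSR matrix from its three stored arrays, and turning string cell types into the integer labels shown in `Y`. A minimal sketch of both on toy values; the arrays and cell-type names below are made up for illustration and are not taken from the "Young" dataset:

```python
import numpy as np
from scipy import sparse

# CSR components as they would be stored under the "exprs" group (toy values).
data = np.array([5.0, 2.0, 7.0], dtype=np.float32)
indices = np.array([0, 2, 1])      # column index of each stored value
indptr = np.array([0, 2, 2, 3])    # row i owns data[indptr[i]:indptr[i + 1]]
mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
X_toy = mat.toarray()

# String cell types become integer codes that index into the sorted unique names.
cell_name = np.array(["T cell", "B cell", "T cell"])
cell_type, cell_label = np.unique(cell_name, return_inverse=True)

print(X_toy)
print(cell_type, cell_label)
```

Because `np.unique` sorts the names, the integer codes follow alphabetical order of the cell types, not the order in which they appear in the file.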
