# Single-cell RNA-seq imputation using DeepImpute

This tutorial walks through the main functionalities of DeepImpute.

In [1]:
from deepimpute.multinet import MultiNet
import pandas as pd

# Load the dataset with pandas (rows are cells, columns are genes)
data = pd.read_csv('test.csv', index_col=0)
print('Working on {} cells and {} genes'.format(*data.shape))

Working on 500 cells and 3000 genes


## Create a DeepImpute multinet

In [2]:
# Using default parameters
multinet = MultiNet() 

Using all the cores (64)


In [3]:
# Using custom parameters
NN_params = {
    'learning_rate': 1e-4,
    'batch_size': 64,
    'max_epochs': 200,
    'ncores': 5,
    'sub_outputdim': 512,
    'architecture': [
        {"type": "dense", "activation": "relu", "neurons": 200},
        {"type": "dropout", "activation": "dropout", "rate": 0.3},
    ],
}

multinet = MultiNet(**NN_params)
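
The `sub_outputdim` parameter sets how many target genes each sub-network predicts. As a rough sketch of the arithmetic, the gene count appears to be rounded up to a whole number of blocks of size `sub_outputdim`. This is an assumption inferred from the fit log in the next section, which reports 3072 selected genes and six networks of 512 targets each for this 3000-gene dataset; it is not a description of DeepImpute's internals. The helper name `split_targets` is hypothetical.

```python
import math

def split_targets(n_genes, sub_outputdim):
    """Return (number of sub-networks, padded gene count).

    Hypothetical helper: assumes target genes are split into equal blocks
    of `sub_outputdim`, rounding the total up to fill the last block.
    """
    n_nets = math.ceil(n_genes / sub_outputdim)
    return n_nets, n_nets * sub_outputdim

n_nets, n_targets = split_targets(3000, 512)
print(n_nets, n_targets)  # 6 3072
```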

## Fit the networks

In [4]:
# Using all the cells (cell_subset=1)
multinet.fit(data, cell_subset=1, minVMR=0.5)

Input dataset is 500 cells (rows) and 3000 genes (columns)
First 3 rows and columns:
                  ENSG00000177954  ENSG00000197756  ENSG00000231500
AATTGTGACTACGA-1            826.0            674.0            694.0
TGACACGATTCGTT-1            617.0            618.0            594.0
TGTCAGGATTGTCT-1            525.0            550.0            540.0
3072 genes selected for imputation
Net 0: 639 predictors, 512 targets
Net 1: 593 predictors, 512 targets
Net 2: 591 predictors, 512 targets
Net 3: 594 predictors, 512 targets
Net 4: 555 predictors, 512 targets
Net 5: 632 predictors, 512 targets
Normalization
Building network
[{'type': 'dense', 'activation': 'relu', 'neurons': 200}, {'type': 'dropout', 'activation': 'dropout', 'rate': 0.3}]


2024-09-06 06:43:46.775225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22066 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9


ValueError: Argument(s) not recognized: {'lr': 0.0001}
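
The `minVMR` argument passed to `fit` in this cell acts as a gene filter. As a minimal sketch, assuming it is a lower bound on each gene's variance-to-mean ratio (VMR) across cells, the filter could be approximated as follows. The function name `genes_above_vmr` and the toy matrix are illustrative only, and DeepImpute's own gene selection may differ in detail.

```python
import numpy as np

def genes_above_vmr(counts, min_vmr):
    """Boolean mask of genes (columns) whose variance/mean ratio >= min_vmr."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Genes with zero mean are assigned a VMR of 0 to avoid dividing by zero.
    vmr = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)
    return vmr >= min_vmr

# Toy matrix: 3 cells (rows) x 3 genes (columns)
toy = np.array([[0, 5, 10],
                [0, 5, 0],
                [0, 5, 20]])
print(genes_above_vmr(toy, 0.5))  # [False False  True]
```

Only the third gene varies across cells, so it is the only one kept at a threshold of 0.5.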

In [None]:
# Using 50% of the data
multinet.fit(data, cell_subset=0.5)

In [None]:
# Using 200 cells (randomly selected)
multinet.fit(data,cell_subset=200)
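
The `fit` calls in this section pass `cell_subset` as 1, as a fraction, and as an integer count. Below is a small sketch of how such an argument could be resolved into a random set of training cells. It rests on an assumption about the semantics (values below 1 are read as a fraction, larger values as a number of cells), and `resolve_cell_subset` is a hypothetical helper rather than part of DeepImpute.

```python
import numpy as np

def resolve_cell_subset(n_cells, cell_subset, seed=0):
    """Return sorted row indices of the cells kept for training."""
    if cell_subset == 1:
        return np.arange(n_cells)  # use every cell
    if cell_subset < 1:
        n_keep = int(round(n_cells * cell_subset))  # fraction of the cells
    else:
        n_keep = int(cell_subset)  # absolute number of cells
    rng = np.random.default_rng(seed)
    # Sample without replacement so no cell is picked twice.
    return np.sort(rng.choice(n_cells, size=n_keep, replace=False))

for subset in (1, 0.5, 200):
    print(subset, len(resolve_cell_subset(500, subset)))
```

For the 500-cell dataset used here, the three settings would keep 500, 250, and 200 cells respectively.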

In [5]:
# Custom fit: train on a hand-picked slice of 150 cells (rows 100 to 249)
trainingData = data.iloc[100:250, :]
multinet.fit(trainingData)

Input dataset is 150 cells (rows) and 3000 genes (columns)
First 3 rows and columns:
                  ENSG00000177954  ENSG00000197756  ENSG00000231500
AATACCCTGGGACA-1            271.0            262.0            231.0
GGCGCATGCCTAAG-1            173.0            390.0            358.0
CGCACTTGAACCAC-1            367.0            406.0            354.0
3072 genes selected for imputation
Net 0: 1245 predictors, 512 targets
Net 1: 1269 predictors, 512 targets
Net 2: 1243 predictors, 512 targets
Net 3: 1282 predictors, 512 targets
Net 4: 1253 predictors, 512 targets
Net 5: 1250 predictors, 512 targets
Normalization
Building network
[{'type': 'dense', 'activation': 'relu', 'neurons': 200}, {'type': 'dropout', 'activation': 'dropout', 'rate': 0.3}]


ValueError: Argument(s) not recognized: {'lr': 0.0001}
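
The executed fit calls above stop with `ValueError: Argument(s) not recognized: {'lr': 0.0001}`. This message is typical of recent Keras releases, whose optimizers accept `learning_rate` but no longer the older `lr` keyword. A likely cause, stated here as an assumption, is that the installed DeepImpute version still passes `lr` when it builds its optimizer. As an illustration of the rename only, not a patch to DeepImpute itself, a hypothetical helper `modernize_optimizer_kwargs` could translate the legacy keyword before an optimizer is constructed:

```python
def modernize_optimizer_kwargs(kwargs):
    """Return a copy of `kwargs` with the legacy `lr` key renamed.

    Hypothetical helper: recent Keras optimizers expect `learning_rate`.
    """
    updated = dict(kwargs)
    if "lr" in updated:
        lr = updated.pop("lr")
        # Keep an explicit `learning_rate` if the caller already set one.
        updated.setdefault("learning_rate", lr)
    return updated

print(modernize_optimizer_kwargs({"lr": 1e-4}))  # {'learning_rate': 0.0001}
```

Another option would be to install a Keras release that still accepts `lr`.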

## Imputation

Imputation can be run on any dataset, as long as its gene labels match those of the data used for training.

In [None]:
imputedData = multinet.predict(data)
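
The gene-label requirement can be checked before calling `predict`. The sketch below is one way to do it with pandas, assuming genes are the DataFrame columns as in the dataset loaded earlier. `check_gene_labels` is a hypothetical helper. It tolerates extra genes and reorders columns, which is a design choice of this sketch and not something `predict` is known to require.

```python
import pandas as pd

def check_gene_labels(train_df, new_df):
    """Return `new_df` restricted to the training genes, in training order.

    Raises ValueError if any training gene is absent from `new_df`.
    """
    missing = train_df.columns.difference(new_df.columns)
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} gene(s) missing, e.g. {list(missing[:3])}"
        )
    # Select columns in the training order so labels line up one-to-one.
    return new_df[train_df.columns]

# Toy example with made-up gene and cell names
train = pd.DataFrame([[1, 2]], columns=["geneA", "geneB"], index=["cell1"])
new = pd.DataFrame([[4, 3]], columns=["geneB", "geneA"], index=["cell2"])
aligned = check_gene_labels(train, new)
print(list(aligned.columns))  # ['geneA', 'geneB']
```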

## Visualization

In [None]:
import matplotlib.pyplot as plt
import numpy as np

limits = [0,100]

fig,ax = plt.subplots()

jitter = np.random.normal(0,1,data.size) # Add some jittering to better see the point density
ax.scatter(data.values.flatten()+jitter,imputedData.values.flatten(),s=2)
ax.plot(limits,limits,'r-.',linewidth=2)
ax.set_xlim(limits)
ax.set_ylim(limits)

plt.show()


## Scoring
Display the metrics recorded during training (MSE and Pearson's correlation, computed on the test data).

In [None]:
multinet.test_metrics
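
For reference, the two metrics named above can be computed by hand with NumPy. This is a generic sketch of mean squared error and Pearson's correlation on flattened arrays. How DeepImpute computes or stores `test_metrics` (for example, on normalized values) is not shown in this tutorial, and `mse_and_pearson` is a hypothetical helper.

```python
import numpy as np

def mse_and_pearson(truth, predicted):
    """Return (mean squared error, Pearson correlation) of two arrays."""
    truth = np.asarray(truth, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    mse = float(np.mean((truth - predicted) ** 2))
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    pearson = float(np.corrcoef(truth, predicted)[0, 1])
    return mse, pearson

# Predictions offset by exactly 1: perfectly correlated but not error-free
mse, r = mse_and_pearson([1, 2, 3, 4], [2, 3, 4, 5])
print(mse, r)
```

The toy example shows why both numbers are useful: a constant offset leaves the correlation at its maximum while the MSE is nonzero.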