# PBMCs Tutorial
PBMC stands for peripheral blood mononuclear cells, a vital group of immune system cells found in the blood.

## 0. Initial setup
- Install LINGER following the instructions on [GitHub](https://github.com/Durenlab/LINGER)

In [None]:
%%bash
# conda create -n LINGER python==3.10.0
# conda activate LINGER
# pip install LingerGRN==1.105
# conda install -c bioconda bedtools  # Requirement

- Register the kernel

In [None]:
%%bash
# pip install ipykernel
# python -m ipykernel install --user --name LINGER --display-name "Python (LINGER)"

- Verify installation

In [1]:
!pip show LingerGRN

Name: LingerGRN
Version: 1.105
Summary: Gene regulatory network inference
Home-page: https://github.com/Durenlab/LINGER
Author: Kaya Yuan
Author-email: qyyuan33@gmail.com
License: MIT
Location: /home/users/v/a/vangysel/miniconda3/envs/LINGER/lib/python3.10/site-packages
Requires: anndata, joblib, matplotlib, numpy, pandas, pybedtools, rpy2, scanpy, scikit-learn, scipy, seaborn, shap, statsmodels, torch, umap-learn
Required-by: 


In [2]:
!conda info --envs

# conda environments:
#
base                     /home/ucl/inma/vangysel/miniconda3
LINGER                *  /home/ucl/inma/vangysel/miniconda3/envs/LINGER



In [None]:
!conda list -n LINGER                    # lists all packages in LINGER env

In [17]:
!conda list -n LINGER LingerGRN          # look for package LingerGRN in LINGER env

# packages in environment at /home/ucl/inma/vangysel/miniconda3/envs/LINGER:
#
# Name                    Version                   Build  Channel
lingergrn                 1.105                    pypi_0    pypi


- Check resources

In [3]:
!nproc --all

256


In [4]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:          754Gi        73Gi       572Gi       2.9Gi       108Gi       673Gi
Swap:         4.0Gi       1.3Gi       2.7Gi


## 1. Download the general gene regulatory network
This is the neural network (NN) pretrained on bulk multiomics data across tissues: RNA-seq (gene expression) and ATAC-seq (chromatin accessibility). It will then be fine-tuned with our single-cell data. There is one pretrained NN per gene.

### About the bulk GRN

It contains three types of interactions between transcription factors (TF), regulatory elements (RE) and target genes (TG):
- TF &rarr; RE : binding strength (&alpha;)
- RE &rarr; TG : cis-regulatory strength (&beta;)
- TF &rarr; TG : trans-regulatory strength (&gamma;)

We obtain **&alpha;** by extracting the weights from the input layer to the second layer (each TF and RE is connected to the 64 hidden neurons of h1). The embedding of a TF or RE is its vector of weights, so we can measure how similar two embeddings are. If a TF and an RE have similar learned representations, they are likely to interact and will have a high binding strength.<br><br>
We get **&beta;** and **&gamma;** using the average Shapley value (which quantifies the contribution of a feature to the prediction) over all cells.
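
To make the embedding idea concrete, here is a minimal sketch on random toy weights. Cosine similarity is used purely as an illustrative similarity measure (an assumption; the measure LINGER actually uses may differ), and the TF/RE names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 64                          # size of the first hidden layer h1
tf_names = ["TF_a", "TF_b"]            # hypothetical TF names
re_names = ["RE_1", "RE_2", "RE_3"]    # hypothetical RE names

# Toy first-layer weights: one row (embedding) per input feature.
W_tf = rng.normal(size=(len(tf_names), n_hidden))
W_re = rng.normal(size=(len(re_names), n_hidden))

def cosine_similarity(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Toy "binding strength" matrix, shape (n_TFs, n_REs).
alpha_toy = cosine_similarity(W_tf, W_re)
print(alpha_toy.shape)
```

Each entry of `alpha_toy` compares one TF embedding with one RE embedding; values close to 1 mean the two weight vectors point in nearly the same direction.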

In [5]:
!pwd

/home/users/v/a/vangysel/linger


In [6]:
%%bash
# Set directories and download general GRN
Datadir="$GLOBALSCRATCH/LINGER_data"
mkdir -p "$Datadir"

# Download general GRN from Google Drive (saved as data_bulk.tar.gz in the current directory)
wget -nv --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b" -O data_bulk.tar.gz
rm -f /tmp/cookies.txt

2026-02-05 13:11:05 URL:https://drive.usercontent.google.com/download?export=download&confirm=&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b [20812483490/20812483490] -> "data_bulk.tar.gz" [1]


In [None]:
!tar -xzf data_bulk.tar.gz

## 2. Prepare the input data

- Download the h5 file (the matrix contains both RNA and ATAC data combined)

In [32]:
%%bash
mkdir -p data
wget --progress=bar:force -O data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5

--2026-02-04 15:24:27--  https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 104.18.0.173, 104.18.1.173, 2606:4700::6812:1ad, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.0.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 162282142 (155M) [binary/octet-stream]
Saving to: ‘data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5’


2026-02-04 15:24:29 (109 MB/s) - ‘data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5’ saved [162282142/162282142]



- Download cell annotation

In [42]:
%%bash
# Download the labels, then move them into data/, where the next cells read them
wget --progress=bar:force --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_" -O PBMC_label.txt && rm -f /tmp/cookies.txt
mv PBMC_label.txt data/PBMC_label.txt

--2026-02-04 15:31:32--  https://docs.google.com/uc?export=download&confirm=&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_
Resolving docs.google.com (docs.google.com)... 74.125.206.138, 74.125.206.113, 74.125.206.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.206.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_&export=download [following]
--2026-02-04 15:31:33--  https://drive.usercontent.google.com/download?id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 64.233.166.132, 2a00:1450:400c:c09::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|64.233.166.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 362958 (354K) [application/octet-stream]
Saving to: ‘PBMC_label.txt’


2026-02-04 15:31:36 (27.1 MB/s) - ‘PBMC_label.txt’ saved [36

In [6]:
import scanpy as sc
import scipy.sparse as sp
import pandas as pd

adata = sc.read_10x_h5('data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5', gex_only=False)
adata

  utils.warn_names_duplicates("var")


AnnData object with n_obs × n_vars = 11909 × 144978
    var: 'gene_ids', 'feature_types', 'genome'

In [7]:
len(adata.var.index) - len(adata.var.index.unique()) 

10

In [8]:
adata.var_names_make_unique()     # 10 variable names are duplicated; make them unique (possibly unnecessary here)
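
As an illustration of the two cells above, the sketch below counts duplicated names and then makes them unique by appending a numeric suffix to repeats. The helper `make_unique` is hypothetical and simplified; it only mimics the general idea behind `var_names_make_unique()` and does not handle every edge case (for example, a suffixed name that already exists).

```python
import pandas as pd

def make_unique(names, join="-"):
    """Append '-1', '-2', ... to repeated names; the first occurrence is kept unchanged."""
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}{join}{count}")
        seen[name] = count + 1
    return pd.Index(out)

# Toy feature names with two repeated entries (hypothetical names).
var_names = pd.Index(["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_B"])

# Same counting logic as in the cell above: total names minus distinct names.
n_duplicates = len(var_names) - len(var_names.unique())
unique_names = make_unique(var_names)

print(n_duplicates)
print(list(unique_names))
```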

In [9]:
matrix = adata.X.T    # linger expects k_features x n_cells
adata.var['gene_ids'] = adata.var.index

# features are genes and peaks grouped together (col1 for gene/peak name and col2 for category: gene or peak)
features = pd.DataFrame(adata.var['gene_ids'].values.tolist(),columns=[1])
features[2] = adata.var['feature_types'].values

barcodes = pd.DataFrame(adata.obs_names,columns=[0])
label = pd.read_csv('data/PBMC_label.txt',sep='\t',header=0)

In [10]:
from LingerGRN.preprocess import *
adata_RNA, adata_ATAC = get_adata(matrix,features,barcodes,label)     # adata_RNA and adata_ATAC are scRNA and scATAC

  adata_RNA.obs['label']=label.loc[adata_RNA.obs['barcode']]['label'].values
  adata_ATAC.obs['label']=label.loc[adata_ATAC.obs['barcode']]['label'].values


In [11]:
print(f"Features : \n\n{features.head()}")
print(f"{features.tail()}\n\n")
print(f"Barcodes: \n{barcodes.head()}\n\n")
print(f"Labels : \n{label.head()}\n")

Features : 

             1                2
0  MIR1302-2HG  Gene Expression
1      FAM138A  Gene Expression
2        OR4F5  Gene Expression
3   AL627309.1  Gene Expression
4   AL627309.3  Gene Expression
                             1      2
144973  KI270713.1:20444-22615  Peaks
144974  KI270713.1:27118-28927  Peaks
144975  KI270713.1:29485-30706  Peaks
144976  KI270713.1:31511-32072  Peaks
144977  KI270713.1:37129-37638  Peaks


Barcodes: 
                    0
0  AAACAGCCAAGGAATC-1
1  AAACAGCCAATCCCTT-1
2  AAACAGCCAATGCGCT-1
3  AAACAGCCACACTAAT-1
4  AAACAGCCACCAACCG-1


Labels : 
                           barcode_use               label
barcode_use                                               
AAACAGCCAAGGAATC-1  AAACAGCCAAGGAATC-1   naive CD4 T cells
AAACAGCCAATCCCTT-1  AAACAGCCAATCCCTT-1  memory CD4 T cells
AAACAGCCAATGCGCT-1  AAACAGCCAATGCGCT-1   naive CD4 T cells
AAACAGCCAGTAGGTG-1  AAACAGCCAGTAGGTG-1   naive CD4 T cells
AAACAGCCAGTTTACG-1  AAACAGCCAGTTTACG-1  memory CD4 T cel

### 2.1 About the `get_adata()` function
**@inputs :** 
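- matrix: sparse matrix (features × cells) with RNA and ATAC data stacked vertically
- features: gene/peak IDs and their types ('Gene Expression' or 'Peaks')
- barcodes: cell barcodes
- label: cell type labels/annotations 

**@outputs :**
- adata_RNA: AnnData object with the scRNA-seq data (cells × genes)
- adata_ATAC: AnnData object with the scATAC-seq data (cells × peaks)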
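
The core idea of this split can be sketched on a toy example. This is not LINGER's implementation: it only shows how a stacked features × cells matrix can be separated by feature type and transposed to cells × features. All names and values below are made up, and a dense array stands in for the sparse matrix.

```python
import numpy as np
import pandas as pd

# Toy stacked matrix: 5 features (3 genes + 2 peaks) x 4 cells.
matrix = np.arange(20).reshape(5, 4)
features = pd.DataFrame({
    1: ["GENE_A", "GENE_B", "GENE_C", "chr1:100-200", "chr1:300-400"],
    2: ["Gene Expression"] * 3 + ["Peaks"] * 2,
})
barcodes = pd.DataFrame({0: ["CELL1-1", "CELL2-1", "CELL3-1", "CELL4-1"]})

# Boolean masks selecting the rows of each modality.
is_gene = (features[2] == "Gene Expression").to_numpy()
is_peak = (features[2] == "Peaks").to_numpy()

# Transpose so that rows are cells, as in an AnnData object.
rna = pd.DataFrame(matrix[is_gene].T,
                   index=barcodes[0].to_numpy(),
                   columns=features.loc[is_gene, 1].to_numpy())
atac = pd.DataFrame(matrix[is_peak].T,
                    index=barcodes[0].to_numpy(),
                    columns=features.loc[is_peak, 1].to_numpy())

print(rna.shape, atac.shape)
```

In the real data, the objects returned by `get_adata` contain 9543 cells rather than the 11909 barcodes of the raw file, presumably because only cells with a label are kept.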

In [13]:
# gene - cell 
print(adata.var.iloc[0], end="\n\n")
print(adata.obs.iloc[0])

gene_ids             MIR1302-2HG
feature_types    Gene Expression
genome                    GRCh38
Name: MIR1302-2HG, dtype: object

Series([], Name: AAACAGCCAAGGAATC-1, dtype: float64)


In [14]:
adata

AnnData object with n_obs × n_vars = 11909 × 144978
    var: 'gene_ids', 'feature_types', 'genome'

In [10]:
adata.obs.head()

AAACAGCCAAGGAATC-1
AAACAGCCAATCCCTT-1
AAACAGCCAATGCGCT-1
AAACAGCCACACTAAT-1
AAACAGCCACCAACCG-1


In [11]:
adata.var.head()

Unnamed: 0,gene_ids,feature_types,genome
MIR1302-2HG,MIR1302-2HG,Gene Expression,GRCh38
FAM138A,FAM138A,Gene Expression,GRCh38
OR4F5,OR4F5,Gene Expression,GRCh38
AL627309.1,AL627309.1,Gene Expression,GRCh38
AL627309.3,AL627309.3,Gene Expression,GRCh38


## 4. About the AnnData object

In [12]:
adata            # cells x (genes + peaks) 

AnnData object with n_obs × n_vars = 11909 × 144978
    var: 'gene_ids', 'feature_types', 'genome'

In [13]:
adata_RNA        # cells x genes 

View of AnnData object with n_obs × n_vars = 9543 × 36601
    obs: 'barcode', 'sample', 'label', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

In [14]:
adata_ATAC       # cells x peaks

View of AnnData object with n_obs × n_vars = 9543 × 108377
    obs: 'barcode', 'sample', 'label'
    var: 'gene_ids'

In [16]:
#adata_RNA.X[i, j]             # cell i, gene j : Gene expression count of gene j in cell i
#adata_ATAC.X[i, k]            # cell i, peak k : Chromatin accessibility count of peak k in cell i
adata_RNA.X[1, 44]             # cell 1 gene 44

0.0

In [15]:
adata_RNA.X[0, :10].toarray()     # cell 0, expr. of 10 first genes

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

In [16]:
adata_RNA.X[:10, 0].toarray()     # gene 0 expr. in 10 first cells

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]], dtype=float32)

In [19]:
adata_RNA.obs.iloc[:10]            # metadata of 10 first cells

Unnamed: 0,barcode,sample,label,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt
1506,AGCACTAGTTACGCGG-1,1,memory CD4 T cells,1474,2765.0,0.0,0.0
4670,CGGCCATAGGGACTAA-1,1,memory B cells,1980,4381.0,0.0,0.0
6010,GACGCCTAGGGTGAGT-1,1,CD56 (bright) NK cells,1537,2934.0,0.0,0.0
4694,CGGGCTTAGAGGAGTC-1,1,classical monocytes,1475,2783.0,0.0,0.0
6716,GCATCCTTCAATCATG-1,1,CD56 (dim) NK cells,2520,6102.0,0.0,0.0
484,AAGTGCAAGCCAAATC-1,1,naive CD4 T cells,1619,3398.0,0.0,0.0
3087,CAATCCTGTCATAACG-1,1,CD56 (bright) NK cells,3081,7406.0,0.0,0.0
5119,CTAGGACGTGTTCCCA-1,1,plasmacytoid DC,3108,7025.0,0.0,0.0
1100,ACGCCACAGTAGCTTA-1,1,memory B cells,1907,3958.0,0.0,0.0
7808,GGGTGTTGTTCGCTCA-1,1,classical monocytes,1874,3686.0,0.0,0.0


In [20]:
adata_RNA.var.iloc[:10]            # metadata of 10 first genes

Unnamed: 0,gene_ids,mt,n_cells_by_counts,mean_counts,pct_dropout_by_counts,total_counts
MIR1302-2HG,MIR1302-2HG,False,0,0.0,100.0,0.0
FAM138A,FAM138A,False,0,0.0,100.0,0.0
OR4F5,OR4F5,False,0,0.0,100.0,0.0
AL627309.1,AL627309.1,False,61,0.006916,99.360788,66.0
AL627309.3,AL627309.3,False,0,0.0,100.0,0.0
AL627309.2,AL627309.2,False,0,0.0,100.0,0.0
AL627309.5,AL627309.5,False,408,0.046002,95.724615,439.0
AL627309.4,AL627309.4,False,41,0.004401,99.570366,42.0
AP006222.2,AP006222.2,False,1,0.000105,99.989521,1.0
AL732372.1,AL732372.1,False,0,0.0,100.0,0.0


## 5. Preprocess

In [17]:
# Filter low-count cells and genes

# Keep only cells that have ≥ 200 detected genes
sc.pp.filter_cells(adata_RNA, min_genes=200)

# Keep only genes expressed in ≥ 3 cells
sc.pp.filter_genes(adata_RNA, min_cells=3)

sc.pp.filter_cells(adata_ATAC, min_genes=200)
sc.pp.filter_genes(adata_ATAC, min_cells=3)

# Keep only cells present in both RNA and ATAC
selected_barcode = list(set(adata_RNA.obs['barcode'].values) & set(adata_ATAC.obs['barcode'].values))

barcode_idx = pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]

barcode_idx = pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]


  adata.obs['n_genes'] = number
  adata.obs['n_genes'] = number


### 5.1 Effect of preprocessing
Before filtering, we had 9543 cells and 144 978 features:
- 36 601 genes
- 108 377 peaks


In [18]:
print(f"adata_RNA.shape : {adata_RNA.shape}")
print(f"adata_ATAC.shape : {adata_ATAC.shape}")

adata_RNA.shape : (9543, 25485)
adata_ATAC.shape : (9543, 107208)
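
The two filtering thresholds can be illustrated on a toy count matrix. This is a simplified sketch of the logic (cells first, then features), not the scanpy implementation, and `filter_counts` is a hypothetical helper.

```python
import numpy as np

def filter_counts(X, min_genes=200, min_cells=3):
    """Drop cells with fewer than min_genes detected features,
    then drop features detected in fewer than min_cells of the remaining cells."""
    X = np.asarray(X)
    keep_cells = (X > 0).sum(axis=1) >= min_genes
    X = X[keep_cells]
    keep_genes = (X > 0).sum(axis=0) >= min_cells
    return X[:, keep_genes], keep_cells, keep_genes

# Toy counts: 4 cells x 5 genes.
X = np.array([
    [1, 0, 2, 0, 1],
    [0, 0, 3, 0, 1],
    [2, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
])

# Small thresholds so the effect is visible on the toy data.
X_filt, keep_cells, keep_genes = filter_counts(X, min_genes=2, min_cells=2)
print(X_filt.shape)
```

Here the last cell is dropped (one detected gene only), and the two genes that are never detected are removed, in the same way that filtering reduces the RNA matrix from 36 601 to 25 485 genes above.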


### 5.2 Comparison with uncompressed data

- Uncompressed:
    - adata_RNA.shape : (9543, 36601)
    - adata_ATAC.shape : (9543, 143887)
- Compressed (.h5 file only):
    - adata_RNA.shape : (9543, 36601)
    - adata_ATAC.shape : (9543, 108377)

**After preprocessing**

- Uncompressed:
    - adata_RNA.shape : (9543, 25485)
    - adata_ATAC.shape : (9543, 143885)
- Compressed (.h5 file only):
    - adata_RNA.shape : (9543, 25485)
    - adata_ATAC.shape : (9543, 107208)


### 5.3 About pseudo-bulking

Pseudo-bulking means combining many single cells into an artificial bulk sample by summing or averaging their counts.
Since single-cell data are noisy and extremely sparse, it is better to work with signals aggregated across groups of cells (metacells).  
- ``singlepseudobulk = True``: collapse all cells of a sample into ONE pseudobulk profile. This gives the following dimensions:
    - TG_pseudobulk_temp : (n_genes × 1)
    - RE_pseudobulk_temp : (n_peaks × 1) <br><br>
      
- ``singlepseudobulk = False``: first cluster the cells, then build multiple pseudobulks (metacells). This is used when there are few samples, and it creates *K* clusters of cells, i.e. *K* metacells:
    - TG_pseudobulk_temp : (n_genes × k_metacells)
    - RE_pseudobulk_temp : (n_peaks × k_metacells) <br><br>
      
- Why is this needed? GRN inference needs many samples (columns).
    - If you already have many samples, one pseudobulk per sample is enough
    - If you have few samples, create metacells to increase the sample count
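
The two modes can be sketched on a toy table. This is only an illustration of the output shapes: the cluster labels are given by hand and counts are averaged, whereas LINGER's `pseudo_bulk` uses its own procedure to group cells. The helper `toy_pseudo_bulk` and all names are hypothetical.

```python
import numpy as np
import pandas as pd

def toy_pseudo_bulk(counts, clusters, single):
    """counts: cells x features DataFrame; clusters: one cluster label per cell.
    Returns a features x pseudobulks DataFrame of average counts."""
    if single:
        # One profile for the whole sample: (n_features x 1).
        return counts.mean(axis=0).to_frame(name="pseudobulk")
    # One profile per cluster (metacell): (n_features x K).
    return counts.groupby(np.asarray(clusters)).mean().T

counts = pd.DataFrame(
    [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [0.0, 4.0, 2.0], [0.0, 2.0, 0.0]],
    index=["c1", "c2", "c3", "c4"],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)
clusters = ["m0", "m0", "m1", "m1"]

single = toy_pseudo_bulk(counts, clusters, single=True)
multi = toy_pseudo_bulk(counts, clusters, single=False)
print(single.shape, multi.shape)
```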
      


In [None]:
adata_RNA[adata_RNA.obs['sample' ] == tempsample]

In [19]:
# Generate pseudo-bulk/metacell
import os
from LingerGRN.pseudo_bulk import *

samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
tempsample=samplelist[0]

TG_pseudobulk=pd.DataFrame([])
RE_pseudobulk=pd.DataFrame([])

n_samples = adata_RNA.obs['sample'].nunique()
singlepseudobulk = (n_samples > 10)
#singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)

# here samplelist = [1], singlepseudobulk = False (there is only one sample)
for tempsample in samplelist:

    # get cells from only tempsample
    adata_RNAtemp = adata_RNA[adata_RNA.obs['sample' ] == tempsample]
    adata_ATACtemp = adata_ATAC[adata_ATAC.obs['sample'] == tempsample]

    TG_pseudobulk_temp, RE_pseudobulk_temp = pseudo_bulk(adata_RNAtemp, adata_ATACtemp, singlepseudobulk)  
    
    TG_pseudobulk = pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
    RE_pseudobulk = pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
    
    RE_pseudobulk[RE_pseudobulk > 100] = 100


  view_to_actual(adata)
  view_to_actual(adata)
  view_to_actual(adata)
  view_to_actual(adata)
  from .autonotebook import tqdm as notebook_tqdm


### 5.4 About `pseudo_bulk` function

**@inputs :**
- adata_RNAtemp: single-cell RNA expression for one sample
- adata_ATACtemp: single-cell chromatin accessibility for the same cells
- singlepseudobulk (bool): whether to make one pseudo-bulk or multiple metacells

**@outputs :**
- TG_pseudobulk
- RE_pseudobulk

From one sample, it creates either a single pseudobulk or several pseudobulks (metacells). <br>
Here, the 9,543 cells are grouped into 343 metacells (9543/343 ≈ 28 cells per metacell on average).
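
The ratio quoted above is a simple division. Note that it is an average only under the assumption that every cell belongs to exactly one metacell.

```python
n_cells = 9543       # cells kept after preprocessing
n_metacells = 343    # columns of TG_pseudobulk and RE_pseudobulk

cells_per_metacell = n_cells / n_metacells
print(round(cells_per_metacell, 1))   # 27.8
```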

In [20]:
os.makedirs('data', exist_ok=True)   # create the output folder if it does not exist yet
    
adata_ATAC.write('data/adata_ATAC.h5ad')
adata_RNA.write('data/adata_RNA.h5ad')

TG_pseudobulk=TG_pseudobulk.fillna(0)
RE_pseudobulk=RE_pseudobulk.fillna(0)

pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)

TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')

  df[key] = c
  df[key] = c
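
One detail worth knowing when reloading these files: `DataFrame.to_csv` writes comma-separated text by default, so the `.tsv` files above are comma-separated despite their extension. The sketch below shows the round trip with an in-memory buffer, using two rows and two columns taken from the `TG_pseudobulk` table shown in this notebook.

```python
import io
import pandas as pd

TG_toy = pd.DataFrame(
    {"GCACCTAAGGAGCAAC-1": [0.0, 0.137802], "GTTACGTAGCTTTGGG-1": [0.0, 0.0]},
    index=["AL627309.1", "LINC01409"],
)

buffer = io.StringIO()
TG_toy.to_csv(buffer)                 # default sep="," whatever the file name is
buffer.seek(0)
first_line = buffer.readline().strip()

buffer.seek(0)
TG_back = pd.read_csv(buffer, index_col=0)   # default sep="," matches the writer
print(first_line)
```

Reading such a file with `sep="\t"` would put every row into a single column, so the separator has to match the one used when writing.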


In [21]:
TG_pseudobulk        # 25 485 genes x 343 bulks (meta cells)

Unnamed: 0,GCACCTAAGGAGCAAC-1,GTTACGTAGCTTTGGG-1,GGACTAAAGTCAATCA-1,AAACGGATCATGGCTG-1,CGCTAACCATTAGCCA-1,CGCATGATCGCTATGG-1,GGTAATTGTTTGTCTA-1,GTATGTTCAGAGGGAG-1,ACGTTACAGCCATCAG-1,ATGACGAAGTCACCAG-1,...,TGCTTCCAGTAACCCG-1,CGGACCTAGCTAAAGG-1,CCTTAGTGTAGACAAA-1,CGCACAATCACCGGTA-1,CTAATCTTCTCGCCCA-1,CCCAACCGTAGCTGCG-1,AAGCGGGTCTTCAATC-1,CCTACTGGTTCGCGCT-1,GCTAAGAAGCTAAGTC-1,AAATGCCTCGATTTAG-1
AL627309.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.061358,0.000000,0.000000,0.112764,0.000000,0.000000,0.000000
AL627309.5,0.000000,0.000000,0.000000,0.000000,0.060099,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.282538,0.136926,0.185265,0.184987,0.119496,0.052138,0.075814,0.106592,0.146744,0.183203
AL627309.4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.065257,0.062803,0.000000,0.000000,0.000000,0.069169,0.000000,0.000000,0.000000
AL669831.2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
LINC01409,0.137802,0.000000,0.073181,0.000000,0.103896,0.144276,0.000000,0.067568,0.000000,0.000000,...,0.000000,0.164827,0.109030,0.060882,0.058678,0.000000,0.000000,0.000000,0.073396,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AL592183.1,0.146648,0.325133,0.072462,0.000000,0.247954,0.205121,0.147658,0.138384,0.000000,0.133816,...,0.000000,0.143415,0.062079,0.057773,0.151881,0.000000,0.073540,0.056966,0.000000,0.125495
AC240274.1,0.072462,0.171601,0.072462,0.067232,0.000000,0.000000,0.000000,0.067232,0.000000,0.067232,...,0.000000,0.063373,0.000000,0.072813,0.000000,0.000000,0.000000,0.000000,0.067697,0.000000
AC004556.3,0.076179,0.000000,0.076179,0.063236,0.072878,0.105052,0.000000,0.131637,0.082473,0.209778,...,0.000000,0.053963,0.000000,0.000000,0.000000,0.000000,0.000000,0.103969,0.073348,0.000000
AC007325.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [22]:
RE_pseudobulk         # 107 208 peaks x 343 bulks 

Unnamed: 0,GCACCTAAGGAGCAAC-1,GTTACGTAGCTTTGGG-1,GGACTAAAGTCAATCA-1,AAACGGATCATGGCTG-1,CGCTAACCATTAGCCA-1,CGCATGATCGCTATGG-1,GGTAATTGTTTGTCTA-1,GTATGTTCAGAGGGAG-1,ACGTTACAGCCATCAG-1,ATGACGAAGTCACCAG-1,...,TGCTTCCAGTAACCCG-1,CGGACCTAGCTAAAGG-1,CCTTAGTGTAGACAAA-1,CGCACAATCACCGGTA-1,CTAATCTTCTCGCCCA-1,CCCAACCGTAGCTGCG-1,AAGCGGGTCTTCAATC-1,CCTACTGGTTCGCGCT-1,GCTAAGAAGCTAAGTC-1,AAATGCCTCGATTTAG-1
chr1:10109-10357,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000
chr1:180730-181630,0.000000,0.000000,0.000000,0.000000,0.057822,0.036481,0.000000,0.057822,0.000000,0.000000,...,0.000000,0.000000,0.057822,0.036481,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000
chr1:191491-191736,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.036481,0.000000,0.000000,0.036481,0.00000
chr1:267816-268196,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.000000,0.036481,0.000000,...,0.057822,0.084707,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000
chr1:586028-586373,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
KI270713.1:20444-22615,0.445296,0.431637,0.815061,0.458523,0.404752,0.427587,0.209947,0.574166,0.485409,0.373816,...,0.495004,0.707519,0.823594,0.832770,0.246428,0.587825,0.530003,0.618761,0.837884,0.88611
KI270713.1:27118-28927,0.130785,0.000000,0.130785,0.000000,0.084707,0.036481,0.000000,0.057822,0.057822,0.057822,...,0.036481,0.000000,0.000000,0.000000,0.000000,0.057822,0.057822,0.057822,0.115643,0.00000
KI270713.1:29485-30706,0.057822,0.057822,0.057822,0.000000,0.057822,0.057822,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.057822,0.000000,0.000000,0.057822,0.057822,0.000000,0.072963,0.00000
KI270713.1:31511-32072,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.00000


In [23]:
!pwd

/home/users/v/a/vangysel/linger


## 6. Training the model

In [32]:
import os
from LingerGRN.preprocess import *

#Datadir = os.path.join(os.getcwd(), 'LINGER_data/')
Datadir = "/globalscratch/ucl/inma/vangysel/" + "LINGER_data/"
GRNdir = Datadir + 'data_bulk/'
genome = 'hg38'
#outdir = '/LINGER_output/'  # output directory
outdir = "/globalscratch/ucl/inma/vangysel/" + "LINGER_output/"
method = 'baseline'         # or 'LINGER'
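
The configuration above builds its paths by string concatenation, so each directory must end with a separator. A possible alternative, sketched below, derives the same kind of paths from the `GLOBALSCRATCH` environment variable used in the download cell. The `/tmp` fallback is an assumption for illustration only, and the `*_alt` names are hypothetical.

```python
import os

# Base directory: $GLOBALSCRATCH if defined, otherwise a placeholder (assumption).
scratch = os.environ.get("GLOBALSCRATCH", "/tmp")

# Joining with a final "" keeps the trailing separator that string concatenation relies on.
Datadir_alt = os.path.join(scratch, "LINGER_data", "")
GRNdir_alt = os.path.join(Datadir_alt, "data_bulk", "")
outdir_alt = os.path.join(scratch, "LINGER_output", "")

print(GRNdir_alt)
print(outdir_alt)
```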

In [35]:
preprocess(TG_pseudobulk, RE_pseudobulk, GRNdir, genome, method, outdir)

Overlap the regions with bulk data ...


In [36]:
import LingerGRN.LINGER_tr as LINGER_tr

activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
LINGER_tr.training(GRNdir,method,outdir,activef,'Human')

## 7. Cell population gene regulatory network

### 7.1 TF binding potential (TF-RE)
The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.

In [37]:
import LingerGRN.LL_net as LL_net
LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)

Generating cellular population TF binding strength ...


100%|██████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:57<00:00,  2.49s/it]


In [41]:
tf_re = pd.read_csv("/globalscratch/ucl/inma/vangysel/LINGER_output/cell_population_TF_RE_binding.txt", sep="\t", index_col=0)
tf_re.head()

Unnamed: 0,ZBTB6,ARNTL2,ELK3,E2F1,TCF7,TCF4,JUND,ZSCAN22,MEF2D,HLX,...,TBX21,MZF1,LHX4,KLF16,NFIC,FOXJ3,POU6F2,SIX3,ZIC1,CENPB
chr1:100028489-100029404,0.0,0.274914,0.0,0.0,0.471285,0.0,0.577039,0.558025,0.071134,0.0,...,0.855215,0.406899,0.0,0.400902,0.062382,0.0,0.541846,0.341378,0.0,0.269262
chr1:100034436-100035279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
chr1:100035922-100040109,0.174891,0.0,0.504716,0.517765,0.117817,0.194641,0.0,0.0,0.0,0.196663,...,0.0,0.0,0.724422,0.0,0.0,0.077789,0.0,0.0,0.0,0.0
chr1:100041493-100041927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
chr1:100046068-100047735,0.0,0.548376,0.0,0.0,0.623037,0.0,0.375218,0.350451,0.0,0.0,...,0.570573,0.493897,0.0,0.278602,0.0,0.0,0.382428,0.517982,0.0,0.073065
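
A typical way to use this matrix is to rank the TFs for a given region. The sketch below does so on a small excerpt (three regions, three TFs) copied from the table above; the same calls should apply to the full `tf_re` DataFrame.

```python
import pandas as pd

tf_re_toy = pd.DataFrame(
    {
        "ARNTL2": [0.274914, 0.0, 0.0],
        "TCF7": [0.471285, 0.0, 0.117817],
        "JUND": [0.577039, 0.0, 0.0],
    },
    index=[
        "chr1:100028489-100029404",
        "chr1:100034436-100035279",
        "chr1:100035922-100040109",
    ],
)

# Two TFs with the highest binding score for one region.
region = "chr1:100028489-100029404"
top_tfs = tf_re_toy.loc[region].nlargest(2)

# Number of TFs with a non-zero score per region.
n_bound = (tf_re_toy > 0).sum(axis=1)
print(top_tfs)
```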


### 7.2 cis-regulatory network (RE-TG)
The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.

In [42]:
LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)

100%|██████████████████████████████████████████████████████████████████████████████████████████| 23/23 [02:09<00:00,  5.65s/it]


In [43]:
re_tg = pd.read_csv("/globalscratch/ucl/inma/vangysel/LINGER_output/cell_population_cis_regulatory.txt", sep="\t", index_col=0)
re_tg.head()

Unnamed: 0_level_0,SASS6,0.0009586007532124838
chr1:100028489-100029404,Unnamed: 1_level_1,Unnamed: 2_level_1
chr1:100028489-100029404,RTCA,4.597646e-06
chr1:100028489-100029404,SLC30A7,5.4007170000000006e-17
chr1:100028489-100029404,LRRC39,0.000153457
chr1:100028489-100029404,CDC14A,2.005422e-07
chr1:100028489-100029404,FRRS1,1.796895e-06
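
In the output above, the first data row (`SASS6`, `0.0009586...`) appears to have been read as the column header, which suggests that the file has no header line. A minimal sketch of reading such a three-column file with explicit column names follows; the column names `RE`, `TG` and `score` are chosen here for illustration, and the three lines are copied from the output above.

```python
import io
import pandas as pd

# Three records written as a headerless, tab-separated file.
raw = (
    "chr1:100028489-100029404\tSASS6\t0.0009586007532124838\n"
    "chr1:100028489-100029404\tRTCA\t4.597646e-06\n"
    "chr1:100028489-100029404\tLRRC39\t0.000153457\n"
)

re_tg_toy = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,                   # no header line in the file
    names=["RE", "TG", "score"],   # illustrative column names
)
print(re_tg_toy)
```

With `header=None`, no record is lost to the header, and each row holds one region, one target gene and one cis-regulatory score.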


### 7.3 trans-regulatory network (TF-TG)
The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.

In [44]:
LL_net.trans_reg(GRNdir,method,outdir,genome)  

Generate trans-regulatory netowrk ...
Save trans-regulatory netowrk ...


In [45]:
tf_tg = pd.read_csv("/globalscratch/ucl/inma/vangysel/LINGER_output/cell_population_trans_regulatory.txt", sep="\t", index_col=0)
tf_tg.head()

Unnamed: 0,ZBTB6,ARNTL2,ELK3,E2F1,TCF7,TCF4,JUND,ZSCAN22,MEF2D,HLX,...,TBX21,MZF1,LHX4,KLF16,NFIC,FOXJ3,POU6F2,SIX3,ZIC1,CENPB
SASS6,1.594028e-06,6.235565e-05,6.701126e-05,0.0001361343,9.386463e-05,7.419297e-07,1.186786e-05,6.355007e-06,4.090155e-07,1.937167e-07,...,5.155972e-06,6.182847e-05,0.0001224008,4.554059e-05,8.231278e-08,1.481097e-07,5.551017e-05,1.111126e-05,2.872886e-08,2.529586e-06
RTCA,7.793802e-06,7.384969e-05,7.840596e-05,8.340468e-05,0.0002640587,4.4884e-06,7.010572e-05,2.946676e-05,1.26647e-05,1.264958e-06,...,4.269316e-05,0.000192749,0.0001339696,0.0001422249,2.04396e-05,1.231631e-06,8.481464e-05,2.72395e-05,5.167986e-06,1.797284e-05
SLC30A7,1.521859e-08,9.781391e-09,1.06228e-08,3.638231e-09,2.140661e-08,1.626306e-08,1.520017e-07,1.719834e-07,7.796982e-08,1.081337e-08,...,1.019648e-07,3.065578e-08,8.333568e-09,9.603236e-08,1.817609e-09,3.691107e-08,4.675161e-08,8.175803e-09,1.684074e-08,2.52688e-08
LRRC39,5.558691e-07,2.173287e-05,7.911914e-06,1.112355e-05,2.148481e-05,3.513813e-07,3.777721e-06,1.121292e-06,3.102803e-07,1.812656e-07,...,2.669588e-06,2.063336e-05,2.690603e-05,5.479154e-06,2.28893e-07,1.020716e-07,1.052551e-05,8.968327e-07,1.346092e-07,7.69285e-07
CDC14A,3.402137e-05,0.0002014284,0.0001970612,0.0002280473,9.99421e-05,4.465963e-05,0.0002470534,0.000112233,0.0001421436,1.335638e-05,...,0.0001932471,0.0001754173,0.0001120573,0.0001842412,0.0002385806,5.179631e-05,0.0001173447,1.260149e-05,7.508897e-05,0.000407861
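
The trans-regulatory matrix has target genes as rows and TFs as columns. For downstream use it can be convenient to turn it into a ranked edge list. The sketch below does this on a small excerpt (two genes, three TFs) copied from the table above; the column names `TG`, `TF` and `score` are chosen for illustration.

```python
import pandas as pd

tf_tg_toy = pd.DataFrame(
    {
        "ZBTB6": [1.594028e-06, 7.793802e-06],
        "ARNTL2": [6.235565e-05, 7.384969e-05],
        "ELK3": [6.701126e-05, 7.840596e-05],
    },
    index=["SASS6", "RTCA"],
)
tf_tg_toy.index.name = "TG"

# Wide matrix -> long table with one (TG, TF, score) row per pair, strongest first.
edges = (
    tf_tg_toy.reset_index()
    .melt(id_vars="TG", var_name="TF", value_name="score")
    .sort_values("score", ascending=False, ignore_index=True)
)
print(edges)
```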
