# PBMCs Tutorial
PBMC stands for Peripheral Blood Mononuclear Cell. PBMCs are a vital group of immune cells found in the blood.

## 1. Download the general gene regulatory network
This is a neural network (NN) pretrained on bulk multiomics data across tissues: RNA-seq (gene expression) and ATAC-seq (chromatin accessibility). It will then be fine-tuned with our single-cell data. There is one pretrained NN per gene.

### About the bulk GRN

It contains three types of interactions (TF-RE-TG), where TF is a transcription factor, RE a regulatory element and TG a target gene:
- TF &rarr; RE : binding strength (&alpha;)
- RE &rarr; TG : cis-regulatory strength (&beta;)
- TF &rarr; TG : trans-regulatory strength (&gamma;)

We obtain **&alpha;** by extracting the weights from the input layer to the second layer (each TF and RE is connected to the 64 hidden neurons of h1). The embedding of a TF or RE is its vector of weights, so we can measure how similar two embeddings are. If a TF and an RE have similar learned representations, they are likely to interact and will have a high binding strength.<br><br>
We get **&beta;** and **&gamma;** using the average Shapley value over all cells (the Shapley value measures the contribution of a feature to the prediction).
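
The two ideas above can be sketched on toy data. This is a minimal illustration, not LINGER's code: the weights are random, cosine similarity is only one possible similarity measure (an assumption), and the Shapley part uses a linear toy model, for which the Shapley value has a closed form. Averaging the absolute value is also an assumption made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- alpha: similarity between TF and RE embeddings (toy first-layer weights) ---
n_hidden = 64                                  # hidden units in h1, as described above
tf_emb = rng.normal(size=(3, n_hidden))        # hypothetical: 3 TFs x 64 weights
re_emb = rng.normal(size=(5, n_hidden))        # hypothetical: 5 REs x 64 weights

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

alpha = cosine_similarity(tf_emb, re_emb)      # shape (3 TFs, 5 REs)

# --- beta / gamma: Shapley values averaged over cells (linear toy model) ---
# For a linear model f(x) = w . x + b, the Shapley value of feature j in one cell
# is w_j * (x_j - mean_j), with mean_j the mean of feature j over the background cells.
X = rng.normal(size=(100, 4))                  # hypothetical: 100 cells x 4 features
w = np.array([2.0, -1.0, 0.0, 0.5])            # toy model weights
shap_per_cell = w * (X - X.mean(axis=0))       # shape (cells, features)
mean_abs_shap = np.abs(shap_per_cell).mean(axis=0)   # one importance score per feature
```

A feature with zero weight (the third one here) gets a score of zero, and within each cell the Shapley values sum to the difference between that cell's prediction and the average prediction.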

In [5]:
!echo $(pwd)

/home/users/v/a/vangysel


In [None]:
%%bash
# Set directories and download general GRN
Datadir=$PWD/LINGER_data
mkdir -p $Datadir
cd $Datadir
echo $(pwd)

In [None]:
%%bash
# Download general GRN from Google Drive into LINGER_data
# (each %%bash cell starts in the notebook directory, so cd again)
cd "$PWD/LINGER_data"
wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b" -O data_bulk.tar.gz
rm -f /tmp/cookies.txt

In [None]:
!tar -xzf LINGER_data/data_bulk.tar.gz -C LINGER_data
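
The archive can also be unpacked from Python, which makes the destination directory explicit. The sketch below uses the standard `tarfile` module. The helper name `extract_archive` is hypothetical, and the demo runs on a throwaway archive rather than the real `data_bulk.tar.gz`.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level entries."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

# Self-contained demo on a small stand-in archive
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data_bulk")
    os.makedirs(src)
    with open(os.path.join(src, "example.txt"), "w") as fh:
        fh.write("placeholder\n")
    archive = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="data_bulk")
    extracted = extract_archive(archive, os.path.join(tmp, "LINGER_data"))
```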

## 2. Prepare the input data

Note : the matrix contains both RNA and ATAC data combined

In [11]:
%%bash
mkdir -p data
wget -O data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5

--2026-01-15 10:40:48--  https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
Resolving proxy.sipr.ucl.ac.be (proxy.sipr.ucl.ac.be)... 130.104.12.59, 2001:6a8:3081:10c1:0:82ff:fe68:c3b
Connecting to proxy.sipr.ucl.ac.be (proxy.sipr.ucl.ac.be)|130.104.12.59|:889... connected.
Proxy request sent, awaiting response... 200 OK
Length: 162282142 (155M) [binary/octet-stream]
Saving to: ‘data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5’


In [14]:
import scanpy as sc
import scipy.sparse as sp
import pandas as pd
from LingerGRN.preprocess import *

adata = sc.read_10x_h5('data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5', gex_only=False)

matrix = adata.X.T                          # features x cells
adata.var['gene_ids'] = adata.var.index

# features are genes and peaks grouped together (col 1: gene/peak name, col 2: category, 'Gene Expression' or 'Peaks')
features = pd.DataFrame(adata.var['gene_ids'].values.tolist(), columns=[1])
features[2] = adata.var['feature_types'].values

barcodes = pd.DataFrame(adata.obs_names, columns=[0])

# NOTE: `label` (cell type annotations: a DataFrame indexed by barcode with a 'label' column)
# must be loaded before this call; its head is printed in the next cell.
adata_RNA, adata_ATAC = get_adata(matrix, features, barcodes, label)     # adata_RNA and adata_ATAC are scRNA and scATAC



In [136]:
print(f"Features : \n\n{features.head()}")
print(f"{features.tail()}\n\n")
print(f"Barcodes: \n{barcodes.head()}\n\n")
print(f"Labels : \n{label.head()}\n")

Features : 

             1                2
0  MIR1302-2HG  Gene Expression
1      FAM138A  Gene Expression
2        OR4F5  Gene Expression
3   AL627309.1  Gene Expression
4   AL627309.3  Gene Expression
                             1      2
144973  KI270713.1:20444-22615  Peaks
144974  KI270713.1:27118-28927  Peaks
144975  KI270713.1:29485-30706  Peaks
144976  KI270713.1:31511-32072  Peaks
144977  KI270713.1:37129-37638  Peaks


Barcodes: 
                    0
0  AAACAGCCAAGGAATC-1
1  AAACAGCCAATCCCTT-1
2  AAACAGCCAATGCGCT-1
3  AAACAGCCACACTAAT-1
4  AAACAGCCACCAACCG-1


Labels : 
                           barcode_use               label
barcode_use                                               
AAACAGCCAAGGAATC-1  AAACAGCCAAGGAATC-1   naive CD4 T cells
AAACAGCCAATCCCTT-1  AAACAGCCAATCCCTT-1  memory CD4 T cells
AAACAGCCAATGCGCT-1  AAACAGCCAATGCGCT-1   naive CD4 T cells
AAACAGCCAGTAGGTG-1  AAACAGCCAGTAGGTG-1   naive CD4 T cells
AAACAGCCAGTTTACG-1  AAACAGCCAGTTTACG-1  memory CD4 T cel

### 2.1 About the `get_adata()` function
**@inputs :** 
- matrix: sparse matrix with RNA and ATAC data stacked vertically (features × cells)
- features: gene/peak IDs and their types ('Gene Expression' or 'Peaks')
- barcodes: cell barcodes
- label: cell type labels/annotations 

**@outputs :**
- adata_RNA: scRNA-seq AnnData object (cells × genes)
- adata_ATAC: scATAC-seq AnnData object (cells × peaks)
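
To make the inputs concrete, here is a minimal sketch of how a stacked features × cells matrix can be split by feature type. It is not the implementation of `get_adata()`: it uses a small dense NumPy array instead of a sparse matrix, and all gene, peak and barcode names are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real inputs: 3 genes + 2 peaks, 4 cells
features = pd.DataFrame({
    1: ["GENE_A", "GENE_B", "GENE_C", "chr1:100-200", "chr1:500-900"],
    2: ["Gene Expression"] * 3 + ["Peaks"] * 2,
})
barcodes = pd.DataFrame({0: ["cell1-1", "cell2-1", "cell3-1", "cell4-1"]})
matrix = np.arange(20).reshape(5, 4)           # features x cells, like adata.X.T

# Boolean masks selecting the rows of each modality
is_gene = (features[2] == "Gene Expression").to_numpy()
is_peak = (features[2] == "Peaks").to_numpy()

# Transpose back to cells x features, one table per modality
rna = pd.DataFrame(matrix[is_gene].T, index=barcodes[0], columns=features.loc[is_gene, 1])
atac = pd.DataFrame(matrix[is_peak].T, index=barcodes[0], columns=features.loc[is_peak, 1])
```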

In [77]:
# gene - cell 
print(adata.var.iloc[0], end="\n\n")
print(adata.obs.iloc[0])

gene_ids             MIR1302-2HG
feature_types    Gene Expression
genome                    GRCh38
Name: MIR1302-2HG, dtype: object

Series([], Name: AAACAGCCAAGGAATC-1, dtype: float64)


In [80]:
adata

AnnData object with n_obs × n_vars = 11909 × 144978
    var: 'gene_ids', 'feature_types', 'genome'

In [32]:
adata.obs.head()

AAACAGCCAAGGAATC-1
AAACAGCCAATCCCTT-1
AAACAGCCAATGCGCT-1
AAACAGCCACACTAAT-1
AAACAGCCACCAACCG-1


In [33]:
adata.var.head()

Unnamed: 0,gene_ids,feature_types,genome
MIR1302-2HG,MIR1302-2HG,Gene Expression,GRCh38
FAM138A,FAM138A,Gene Expression,GRCh38
OR4F5,OR4F5,Gene Expression,GRCh38
AL627309.1,AL627309.1,Gene Expression,GRCh38
AL627309.3,AL627309.3,Gene Expression,GRCh38


## 3. Install LINGER
Note: run these commands in a terminal, not in the notebook (the Miniconda installer is interactive).

### 3.1 Set up CONDA env

In [10]:
%%bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

--2026-01-13 15:21:20--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving proxy.sipr.ucl.ac.be (proxy.sipr.ucl.ac.be)... 130.104.12.59, 2001:6a8:3081:10c1:0:82ff:fe68:c3b
Connecting to proxy.sipr.ucl.ac.be (proxy.sipr.ucl.ac.be)|130.104.12.59|:889... connected.
Proxy request sent, awaiting response... 200 OK
Length: 156772981 (150M) [application/octet-stream]
Saving to: ‘Miniconda3-latest-Linux-x86_64.sh.1’

CalledProcessError: Command 'b'wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\nsource ~/.bashrc\nconda --version\n'' returned non-zero exit status 1.

In [35]:
%%bash

# make `conda activate` available in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"

# create and activate LINGER env
conda create -n LINGER python=3.10 -y
conda activate LINGER

# install packages in LINGER env
conda install -c conda-forge r-base rpy2 pandas scanpy=1.9 scipy matplotlib=3.7 anndata=0.9 zlib -y
conda install -c bioconda pybedtools=0.10.0 -y

pip install LingerGRN==1.105


### 3.2 Check installation

In [89]:
!pip show LingerGRN


In [85]:
!conda info --envs        # lists conda envs 

/bin/bash: conda: command not found


In [86]:
!conda list -n LINGER     # lists installed packages in LINGER env

/bin/bash: conda: command not found


In [84]:
!free -h                  # RAM available

              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.2Gi       7.1Gi       794Mi       5.1Gi        11Gi
Swap:            0B          0B          0B


In [88]:
!nproc --all              # cores available

4


## 4. About the AnnData object

In [129]:
adata            # cells x (genes + peaks) 

AnnData object with n_obs × n_vars = 11909 × 144978
    var: 'gene_ids', 'feature_types', 'genome'

In [90]:
adata_RNA        # cells x genes

View of AnnData object with n_obs × n_vars = 9543 × 36601
    obs: 'barcode', 'sample', 'label', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

In [91]:
adata_ATAC       # cells x peaks

View of AnnData object with n_obs × n_vars = 9543 × 108377
    obs: 'barcode', 'sample', 'label'
    var: 'gene_ids'

In [112]:
#adata_RNA.X[i, j]             # cell i, gene j : Gene expression count of gene j in cell i
#adata_ATAC.X[i, k]            # cell i, peak k : Chromatin accessibility count of peak k in cell i
adata_RNA.X[1, 44]             # cell 1 gene 44

2.0

In [114]:
adata_RNA.X[0, :10].toarray()     # cell 0, expression of the first 10 genes

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

In [116]:
adata_RNA.X[:10, 0].toarray()     # expression of gene 0 in the first 10 cells

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]], dtype=float32)

In [123]:
adata_RNA.obs.iloc[:10]            # metadata of the first 10 cells

Unnamed: 0,barcode,sample,label,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt
4743,CGGTTATAGTTTGAGC-1,1,classical monocytes,1680,3776.0,0.0,0.0
9903,TCATCCATCCTCACTA-1,1,effector CD8 T cells,2911,6912.0,0.0,0.0
9225,TACATCAAGGTACCGC-1,1,naive CD8 T cells,1515,2593.0,0.0,0.0
3573,CATTCCTCACCTCAGG-1,1,classical monocytes,1951,4355.0,0.0,0.0
4152,CCTGTAACATGTCGCG-1,1,naive CD8 T cells,1396,2685.0,0.0,0.0
10896,TGGTCCTTCAAGCTTA-1,1,non-classical monocytes,3112,7610.0,0.0,0.0
7383,GGAATCTTCACAGGAA-1,1,classical monocytes,2256,5331.0,0.0,0.0
4046,CCTACTGGTGCCGCAA-1,1,classical monocytes,2793,7127.0,0.0,0.0
5871,GAAGGAACAACCGCCA-1,1,plasmacytoid DC,2092,3931.0,0.0,0.0
6776,GCCACAATCCGCATGA-1,1,naive CD8 T cells,1559,2958.0,0.0,0.0


In [124]:
adata_RNA.var.iloc[:10]            # metadata of the first 10 genes

Unnamed: 0,gene_ids,mt,n_cells_by_counts,mean_counts,pct_dropout_by_counts,total_counts
MIR1302-2HG,MIR1302-2HG,False,0,0.0,100.0,0.0
FAM138A,FAM138A,False,0,0.0,100.0,0.0
OR4F5,OR4F5,False,0,0.0,100.0,0.0
AL627309.1,AL627309.1,False,61,0.006916,99.360788,66.0
AL627309.3,AL627309.3,False,0,0.0,100.0,0.0
AL627309.2,AL627309.2,False,0,0.0,100.0,0.0
AL627309.5,AL627309.5,False,408,0.046002,95.724615,439.0
AL627309.4,AL627309.4,False,41,0.004401,99.570366,42.0
AP006222.2,AP006222.2,False,1,0.000105,99.989521,1.0
AL732372.1,AL732372.1,False,0,0.0,100.0,0.0
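
As a mental model, the three aligned parts of an AnnData object (`X`, `obs`, `var`) can be mimicked with a NumPy array and two pandas DataFrames. This is only an analogy with toy names and values, not how `anndata` is implemented.

```python
import numpy as np
import pandas as pd

# X: cells x genes count matrix (2 cells, 3 genes)
X = np.array([[0, 2, 1],
              [3, 0, 0]], dtype=np.float32)

# obs: one row of metadata per cell (rows of X)
obs = pd.DataFrame({"label": ["naive CD4 T cells", "classical monocytes"]},
                   index=["cell1-1", "cell2-1"])

# var: one row of metadata per gene (columns of X)
var = pd.DataFrame({"gene_ids": ["GENE_A", "GENE_B", "GENE_C"]},
                   index=["GENE_A", "GENE_B", "GENE_C"])

# Look up a value by name: row position from obs, column position from var
i = obs.index.get_loc("cell1-1")
j = var.index.get_loc("GENE_B")
count = X[i, j]                    # expression count of GENE_B in cell1-1
```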


## 5. Preprocess

In [125]:
# Filter low-count cells and genes

# Keep only cells that have ≥ 200 detected genes
sc.pp.filter_cells(adata_RNA, min_genes=200)

# Keep only genes expressed in ≥ 3 cells
sc.pp.filter_genes(adata_RNA, min_cells=3)

sc.pp.filter_cells(adata_ATAC, min_genes=200)
sc.pp.filter_genes(adata_ATAC, min_cells=3)

# Keep only cells present in both RNA and ATAC
selected_barcode = list(set(adata_RNA.obs['barcode'].values) & set(adata_ATAC.obs['barcode'].values))

barcode_idx = pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]

barcode_idx = pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]




### 5.1 Effect of preprocessing
Before filtering, we had 9,543 cells and 144,978 features:
- 36,601 genes
- 108,377 peaks


In [137]:
print(f"adata_RNA.shape : {adata_RNA.shape}")
print(f"adata_ATAC.shape : {adata_ATAC.shape}")

adata_RNA.shape : (9543, 25485)
adata_ATAC.shape : (9543, 107208)
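
The drop in feature counts comes from the two filters and the barcode intersection. A minimal sketch of the same logic on a toy dense matrix, with toy thresholds instead of 200 and 3; the real `sc.pp.filter_cells` / `sc.pp.filter_genes` calls operate on AnnData objects:

```python
import numpy as np

# Toy count matrix: 4 cells x 5 genes
counts = np.array([
    [1, 0, 2, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 0, 0, 0],
    [4, 0, 1, 1, 0],
])

min_genes, min_cells = 2, 2   # toy thresholds (the tutorial uses 200 and 3)

# Cells first: keep cells with at least `min_genes` detected (non-zero) genes
keep_cells = (counts > 0).sum(axis=1) >= min_genes
counts = counts[keep_cells]

# Then genes: keep genes detected in at least `min_cells` of the remaining cells
keep_genes = (counts > 0).sum(axis=0) >= min_cells
counts = counts[:, keep_genes]

# Barcode intersection between the two modalities (made-up barcodes)
rna_barcodes = ["c1", "c2", "c4"]
atac_barcodes = ["c2", "c4", "c5"]
shared = sorted(set(rna_barcodes) & set(atac_barcodes))
```

Sorting the intersection is a choice made for this sketch: it makes the cell order reproducible, whereas the order of a plain `list(set(...))` is not guaranteed.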


### 5.2 Comparison with uncompressed data

**Before preprocessing**

- Uncompressed:
    - adata_RNA.shape : (9543, 36601)
    - adata_ATAC.shape : (9543, 143887)
- Compressed (.h5 file only):
    - adata_RNA.shape : (9543, 36601)
    - adata_ATAC.shape : (9543, 108377)

**After preprocessing**

- Uncompressed:
    - adata_RNA.shape : (9543, 25485)
    - adata_ATAC.shape : (9543, 143885)
- Compressed (.h5 file only):
    - adata_RNA.shape : (9543, 25485)
    - adata_ATAC.shape : (9543, 107208)


### 5.3 About pseudo-bulking

Pseudo-bulking means combining many single cells into a "fake bulk sample" by summing or averaging their counts.
Since single-cell data is noisy and extremely sparse, it is better to work with signals aggregated across groups of cells (metacells).  
- ``singlepseudobulk = True``: collapse all cells in this sample into ONE pseudobulk profile. This gives the following dimensions:
    - TG_pseudobulk_temp : (n_genes × 1)
    - RE_pseudobulk_temp : (n_peaks × 1) <br><br>
      
- ``singlepseudobulk = False``: first cluster the cells, then make multiple pseudobulks (metacells). This is used when we don't have many samples, and it creates *K* clusters of cells, i.e. *K* metacells:
    - TG_pseudobulk_temp : (n_genes × k_metacells)
    - RE_pseudobulk_temp : (n_peaks × k_metacells) <br><br>
      
- Why is this needed? GRN inference needs many samples (columns).
    - If you already have many samples → 1 bulk per sample is enough
    - If you have few samples → create metacells to increase the sample count
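
The two modes can be illustrated with plain pandas. This is a sketch of the idea only, not LINGER's `pseudo_bulk` implementation: the counts are random, averaging is used as the aggregation, and the group assignment is hypothetical (in practice the groups would come from clustering similar cells).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_genes = 12, 4
expr = pd.DataFrame(
    rng.poisson(1.0, size=(n_cells, n_genes)),
    index=[f"cell{i}" for i in range(n_cells)],
    columns=[f"gene{j}" for j in range(n_genes)],
)

# Case 1: a single pseudobulk profile -> (n_genes x 1)
single = expr.mean(axis=0).to_frame(name="sample1")

# Case 2: K metacells -> (n_genes x K); toy group labels, one per cell
k = 3
groups = np.arange(n_cells) % k
metacells = expr.groupby(groups).mean().T
metacells.columns = [f"metacell{g}" for g in metacells.columns]
```

With one sample, case 1 would leave a single column for GRN inference, while case 2 yields `k` columns from the same cells.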
      


In [144]:
adata_RNA[adata_RNA.obs['sample'] == tempsample]

View of AnnData object with n_obs × n_vars = 9543 × 25485
    obs: 'barcode', 'sample', 'label', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
    var: 'gene_ids', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'

In [131]:
# Generate pseudo-bulk/metacell
import os
from LingerGRN.pseudo_bulk import *

samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
tempsample=samplelist[0]

TG_pseudobulk=pd.DataFrame([])
RE_pseudobulk=pd.DataFrame([])

n_samples = adata_RNA.obs['sample'].nunique()
singlepseudobulk = (n_samples > 10)
#singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)

# here samplelist = [1], singlepseudobulk = False (there is only one sample)
for tempsample in samplelist:

    # get cells from only tempsample
    adata_RNAtemp = adata_RNA[adata_RNA.obs['sample'] == tempsample]
    adata_ATACtemp = adata_ATAC[adata_ATAC.obs['sample'] == tempsample]

    TG_pseudobulk_temp, RE_pseudobulk_temp = pseudo_bulk(adata_RNAtemp, adata_ATACtemp, singlepseudobulk)

    # append this sample's pseudobulks/metacells as new columns
    TG_pseudobulk = pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
    RE_pseudobulk = pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)

    # cap accessibility values at 100
    RE_pseudobulk[RE_pseudobulk > 100] = 100




### 5.4 About `pseudo_bulk` function

**@inputs :**
- adata_RNAtemp: single-cell RNA expression for one sample
- adata_ATACtemp: single-cell chromatin accessibility for the same cells
- singlepseudobulk (bool): whether to make one pseudo-bulk or multiple metacells

**@outputs :**
- TG_pseudobulk: pseudo-bulk gene expression (genes × pseudobulks)
- RE_pseudobulk: pseudo-bulk chromatin accessibility (peaks × pseudobulks)

From one sample, it creates either a single pseudobulk or many metacells. <br>
Here, we cluster 9,543 cells into 343 metacells (9543/343 ≈ 28 cells per metacell).
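
The bookkeeping around `pseudo_bulk` (column-wise concatenation per sample, the cap at 100 and the later `fillna(0)`) can be illustrated on toy tables. The sample and gene names are made up. The tutorial applies the cap to `RE_pseudobulk` only, and `clip` is used here as an equivalent of `df[df > 100] = 100`.

```python
import numpy as np
import pandas as pd

# Hypothetical pseudobulk tables for two samples with partly different rows
sample_a = pd.DataFrame({"a_m1": [1.0, 250.0], "a_m2": [0.0, 3.0]},
                        index=["gene1", "gene2"])
sample_b = pd.DataFrame({"b_m1": [5.0, 120.0]},
                        index=["gene2", "gene3"])

combined = pd.concat([sample_a, sample_b], axis=1)   # union of rows, NaN where a row is absent
combined = combined.clip(upper=100)                  # cap values at 100 (NaN is left untouched)
combined = combined.fillna(0)                        # absent row in a sample -> 0 signal
```

This shows why `fillna(0)` is needed before saving: with several samples, a feature missing from one sample would otherwise appear as NaN in that sample's columns.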

In [139]:
os.makedirs('data', exist_ok=True)

adata_ATAC.write('data/adata_ATAC.h5ad')
adata_RNA.write('data/adata_RNA.h5ad')

TG_pseudobulk = TG_pseudobulk.fillna(0)
RE_pseudobulk = RE_pseudobulk.fillna(0)

pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt', header=None, index=None)

TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')



In [169]:
TG_pseudobulk        # 25 485 genes x 343 bulks (meta cells)

Unnamed: 0,ATCATGTCAGCTTAGC-1,CACCAACCACTGGCTG-1,ACAGTATGTCACACCC-1,GCTAGCTCAACCTAAT-1,TGGTGCATCAAGCCTG-1,GATCCGTCAGCAACCT-1,CTCATGACACAGCCAT-1,CATGAGGCATAACGGG-1,CTCGCTCCACTTCACT-1,GTTAAACGTACAAAGA-1,...,TCGTTACGTAATCCCT-1,GTCTAATCAATAACCT-1,CTAATGTCACATTGCA-1,TGGCCTGCAGGCTGTT-1,GCTCTGGCATGTTTGG-1,AAGCTCCCATTAAGCT-1,GTCATGCCATCGCTCC-1,CAGGGCTTCGCTTCTA-1,ATTACTGAGGATGATG-1,ATCGAGGCAAACATAG-1
AL627309.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.00000,0.079365,0.000000,0.000000,0.000000,0.000000
AL627309.5,0.000000,0.000000,0.000000,0.066677,0.000000,0.000000,0.081431,0.000000,0.000000,0.000000,...,0.081461,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000
AL627309.4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000
AL669831.2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.076402,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000
LINC01409,0.137885,0.000000,0.067568,0.076179,0.077122,0.155224,0.076179,0.209045,0.000000,0.071082,...,0.080168,0.066342,0.000000,0.000000,0.00000,0.000000,0.169646,0.070672,0.079976,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AL592183.1,0.075672,0.224499,0.278659,0.399043,0.000000,0.315668,0.234358,0.000000,0.224499,0.201752,...,0.240911,0.183525,0.352861,0.257000,0.46273,0.299137,0.319015,0.124378,0.303294,0.088885
AC240274.1,0.000000,0.171601,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.073917,0.000000,...,0.000000,0.146791,0.000000,0.065982,0.00000,0.000000,0.000000,0.000000,0.000000,0.081925
AC004556.3,0.072369,0.065944,0.131637,0.000000,0.087766,0.067568,0.000000,0.072979,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.00000,0.152061,0.000000,0.089612,0.000000,0.000000
AC007325.1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000


In [170]:
RE_pseudobulk         # 107 208 peaks x 343 bulks 

Unnamed: 0,ATCATGTCAGCTTAGC-1,CACCAACCACTGGCTG-1,ACAGTATGTCACACCC-1,GCTAGCTCAACCTAAT-1,TGGTGCATCAAGCCTG-1,GATCCGTCAGCAACCT-1,CTCATGACACAGCCAT-1,CATGAGGCATAACGGG-1,CTCGCTCCACTTCACT-1,GTTAAACGTACAAAGA-1,...,TCGTTACGTAATCCCT-1,GTCTAATCAATAACCT-1,CTAATGTCACATTGCA-1,TGGCCTGCAGGCTGTT-1,GCTCTGGCATGTTTGG-1,AAGCTCCCATTAAGCT-1,GTCATGCCATCGCTCC-1,CAGGGCTTCGCTTCTA-1,ATTACTGAGGATGATG-1,ATCGAGGCAAACATAG-1
chr1:10109-10357,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
chr1:180730-181630,0.057822,0.000000,0.057822,0.000000,0.000000,0.115643,0.000000,0.057822,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
chr1:191491-191736,0.000000,0.000000,0.000000,0.057822,0.000000,0.057822,0.000000,0.000000,0.057822,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
chr1:267816-268196,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.057822,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.057822
chr1:586028-586373,0.000000,0.057822,0.000000,0.000000,0.000000,0.000000,0.115643,0.000000,0.057822,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000,0.000000,0.057822
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
KI270713.1:20444-22615,0.560939,0.404752,0.631988,0.662924,0.876933,0.489459,0.289109,0.631988,0.534054,0.418410,...,0.267768,0.200351,0.676583,0.431637,0.649697,0.605103,0.689810,0.594853,0.761290,0.437183
KI270713.1:27118-28927,0.000000,0.057822,0.057822,0.000000,0.115643,0.057822,0.000000,0.000000,0.000000,0.057822,...,0.000000,0.000000,0.173465,0.000000,0.000000,0.000000,0.000000,0.000000,0.057822,0.000000
KI270713.1:29485-30706,0.000000,0.057822,0.000000,0.036481,0.000000,0.036481,0.000000,0.057822,0.072963,0.094303,...,0.084707,0.057822,0.057822,0.057822,0.000000,0.057822,0.000000,0.142529,0.000000,0.000000
KI270713.1:31511-32072,0.000000,0.000000,0.057822,0.057822,0.000000,0.057822,0.000000,0.000000,0.000000,0.057822,...,0.000000,0.000000,0.000000,0.057822,0.036481,0.000000,0.036481,0.000000,0.094303,0.057822


## 6. Training the model

In [None]:
import os
from LingerGRN.preprocess import *

Datadir = os.path.join(os.getcwd(), 'LINGER_data/')
GRNdir = Datadir + 'data_bulk/'
genome = 'hg38'
outdir = os.path.join(os.getcwd(), 'LINGER_output/')  # output directory, next to the notebook
os.makedirs(outdir, exist_ok=True)
method = 'baseline'         # or 'LINGER'

preprocess(TG_pseudobulk, RE_pseudobulk, GRNdir, genome, method, outdir)
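
Since training takes a long time, a quick sanity check of the two pseudobulk tables beforehand can be useful. The helper below is hypothetical (it is not part of LingerGRN), and the checks reflect what the tables above look like rather than documented requirements of the library.

```python
import pandas as pd

def check_pseudobulk(tg_df, re_df):
    """Return a list of problems found in the two pseudobulk tables (empty if none)."""
    problems = []
    # Both tables are expected to describe the same metacells, in the same order
    if list(tg_df.columns) != list(re_df.columns):
        problems.append("TG and RE pseudobulk columns (metacells) differ")
    # Missing values should have been replaced with fillna(0)
    if tg_df.isna().any().any() or re_df.isna().any().any():
        problems.append("NaN values present")
    # Gene and peak names are used as row labels and should be unique
    if tg_df.index.has_duplicates or re_df.index.has_duplicates:
        problems.append("duplicated gene or peak names")
    return problems

# Toy usage with made-up names
tg_demo = pd.DataFrame([[1.0, 2.0]], index=["gene1"], columns=["m1", "m2"])
re_demo = pd.DataFrame([[0.5, 0.1]], index=["chr1:1-10"], columns=["m1", "m2"])
issues = check_pseudobulk(tg_demo, re_demo)
```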

In [None]:
import LingerGRN.LINGER_tr as LINGER_tr

activef = 'ReLU'  # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
LINGER_tr.training(GRNdir,method,outdir,activef,'Human')

## 7. Cell population gene regulatory network

### 7.1 TF binding potential (TF-RE)
The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.

In [None]:
import LingerGRN.LL_net as LL_net
LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)

### 7.2 cis-regulatory network (RE-TG)
The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.

In [None]:
LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
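
A three-column file like this is easy to explore with pandas. The sketch below reads an in-memory stand-in and keeps the strongest regulatory element per target gene. The file layout (tab-separated, no header row) and the column names are assumptions to check against the real output, and all values are made up.

```python
import io
import pandas as pd

# Toy stand-in for cell_population_cis_regulatory.txt
# (assumed here: tab-separated, no header row)
toy_file = io.StringIO(
    "chr1:100-200\tGENE_A\t0.8\n"
    "chr1:500-900\tGENE_A\t0.1\n"
    "chr2:50-150\tGENE_B\t0.4\n"
)
cis = pd.read_csv(toy_file, sep="\t", header=None,
                  names=["region", "target_gene", "score"])

# Strongest regulatory element per target gene
best_rows = cis.groupby("target_gene")["score"].idxmax()
top_re = cis.loc[best_rows].set_index("target_gene")
```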

### 7.3 trans-regulatory network (TF-TG)
The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.

In [None]:
LL_net.trans_reg(GRNdir,method,outdir,genome)
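
Once loaded as a DataFrame, a score matrix like this can be queried for the top regulators of a gene. The sketch below uses a toy matrix. The orientation (rows = target genes, columns = TFs) is an assumption to verify on the real file, and `top_tfs` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the trans-regulatory matrix
# (assumed orientation: rows = target genes, columns = TFs)
trans = pd.DataFrame(
    [[0.9, 0.1, 0.3],
     [0.2, 0.7, 0.0]],
    index=["GENE_A", "GENE_B"],
    columns=["TF1", "TF2", "TF3"],
)

def top_tfs(matrix, gene, n=2):
    """Return the n TFs with the highest score for one target gene."""
    return matrix.loc[gene].sort_values(ascending=False).head(n)

ranking = top_tfs(trans, "GENE_A")
```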