# Input data preparation
This notebook prepares the data files needed for the cell-type-specific gene regulatory network (GRN) inference pipeline.
## Preparation of individual input files
This section prepares each input file or folder in its own subsection. In each subsection, we describe the expected input file, demonstrate the preparation script (with its usage displayed when available), and briefly illustrate the content and/or format of the prepared input file. All of these input files are placed in the `data` folder of this inference pipeline.

In [14]:
dictys_data_path = '/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data'
multiome_data_path = '/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/filtered_feature_bc_matrix/'

#### The helper script `expression_mtx.py` also accepts multiome Cell Ranger ARC features, which come in two categories ('Gene Expression' and 'Peaks'). It filters out all peak names, as well as any gene names that contain `:` or `.` (peak names have the form chrN:start-end)
#### I moved the expression.tsv file from htc_crc
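The filtering rule described above can be sketched as follows. This is a minimal illustration and not the actual `expression_mtx.py`: the function name `keep_gene_rows`, the column names, and the three-row toy table are hypothetical. It assumes only that each feature has a name and a feature type, as in the Cell Ranger ARC features file.

```python
import pandas as pd


def keep_gene_rows(features: pd.DataFrame) -> pd.Series:
    """Return a boolean mask for gene rows whose names contain neither ':' nor '.'."""
    # Keep only the 'Gene Expression' category; this drops every 'Peaks' row.
    is_gene = features["feature_type"] == "Gene Expression"
    # Flag names containing ':' (peak-style chrN:start-end) or '.'.
    has_bad_char = features["name"].str.contains(r"[:.]", regex=True)
    return is_gene & ~has_bad_char


# Toy feature table (illustrative values, not taken from the dataset).
features = pd.DataFrame(
    {
        "id": ["ENSG00000243485", "ENSG00000238009", "chr1:9790-10675"],
        "name": ["MIR1302-2HG", "AL627309.1", "chr1:9790-10675"],
        "feature_type": ["Gene Expression", "Gene Expression", "Peaks"],
    }
)

mask = keep_gene_rows(features)
kept = features.loc[mask, "name"].tolist()
print(kept)
```

The same mask can be used to select rows of the count matrix before it is written out, so that the feature names and the matrix rows stay aligned.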

In [21]:
# Print the first lines of expression.tsv.gz in dictys_data_path
!zcat $dictys_data_path/expression.tsv.gz | head
# Print the number of lines in expression.tsv.gz
!zcat $dictys_data_path/expression.tsv.gz | wc -l

	AAACAGCCAAACCTTG-1	AAACAGCCAAAGCTAA-1	AAACAGCCAAGCCACT-1	AAACAGCCAAGGTGCA-1	AAACAGCCAAGTTATC-1	AAACAGCCAATAGCCC-1	AAACAGCCAATTATGC-1	AAACAGCCAGTTAGCC-1	AAACAGCCATAATCCG-1	AAACAGCCATTCAGCA-1	AAACATGCAAAGCGCA-1	AAACATGCAAAGCTCC-1	AAACATGCAATAACGA-1	AAACATGCAATATAGG-1	AAACATGCACACAATT-1	AAACATGCACAGAAAC-1	AAACATGCACATAACT-1	AAACATGCAGAGAGCC-1	AAACATGCAGAGGGAG-1	AAACATGCAGATAGAC-1	AAACATGCAGCTCAAC-1	AAACATGCAGTTAAAG-1	AAACATGCATAAGCAA-1	AAACATGCATAAGTTC-1	AAACATGCATAATCCG-1	AAACATGCATGAAATG-1	AAACATGCATTAGCCA-1	AAACATGCATTCCTCG-1	AAACATGCATTGCGAC-1	AAACATGCATTTAAGC-1	AAACCAACAACACCTA-1	AAACCAACAAGCTAAA-1	AAACCAACAAGTGAAC-1	AAACCAACAATGAATG-1	AAACCAACAATTAAGG-1	AAACCAACACAATTAC-1	AAACCAACACAGCCAT-1	AAACCAACACAGGGAC-1	AAACCAACACCTGCCT-1	AAACCAACACGTAAGG-1	AAACCAACAGCCTGCA-1	AAACCAACAGGCTGTT-1	AAACCAACATAGCTGC-1	AAACCAACATAGCTTG-1	AAACCAACATTAAACC-1	AAACCAACATTCCTCG-1	AAACCGAAGACTTATG-1	AAACCGAAGAGAAGGG-1	AAACCGAAGAGCCGGA-1	AAACCGAAGAGGAGGA-1	AAACCGAAGCAAGGAC-1	AAACCGAAGCAGCTAT-1	AAACCGAAGCC

### Sort your BAMs to get aligned reads per cell (36k BAM files)
#### a. Submitted array jobs to sort the BAMs for each time point

#### b. Subset the cell barcodes to the clusters you want GRNs calculated for (cell types from Leiden clustering in the aggregated AnnData)

In [4]:
import scanpy as sc
adata=sc.read_h5ad('/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/adata_aggregated_gene.leiden.h5ad')
adata.obs.leiden.to_csv(f'{dictys_data_path}/clusters.csv',header=['Cluster'],index=True,index_label='Barcode') #12 leiden cluster labels in the 33k cells

In [5]:
import pandas as pd
pd.DataFrame(adata.X)

Unnamed: 0,0
0,"(0, 117)\t2.0468006134033203\n (0, 207)\t2...."
1,"(0, 4)\t0.7918984889984131\n (0, 16)\t0.791..."
2,"(0, 26)\t1.1051886081695557\n (0, 35)\t1.10..."
3,"(0, 10)\t1.6259604692459106\n (0, 38)\t1.62..."
4,"(0, 10)\t1.6710577011108398\n (0, 12)\t1.67..."
...,...
32413,"(0, 38)\t1.7591832876205444\n (0, 100)\t2.3..."
32414,"(0, 60)\t1.9571365118026733\n (0, 100)\t1.9..."
32415,"(0, 3)\t0.6992017030715942\n (0, 12)\t0.699..."
32416,"(0, 5)\t0.8446400761604309\n (0, 22)\t0.844..."
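The table above appears to show one sparse row object per cell, which suggests that `adata.X` is stored as a SciPy sparse matrix. A small dense preview is usually easier to read. The sketch below is illustrative only: the helper name `dense_preview` and the toy matrix are hypothetical, and it converts only a small corner of the matrix to avoid densifying all cells.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def dense_preview(X, obs_names, var_names, n_cells=5, n_genes=5):
    """Return the top-left corner of X as a labelled dense DataFrame."""
    block = X[:n_cells, :n_genes]
    # Only the small slice is densified, never the full matrix.
    if sparse.issparse(block):
        block = block.toarray()
    return pd.DataFrame(
        np.asarray(block),
        index=list(obs_names)[:n_cells],
        columns=list(var_names)[:n_genes],
    )


# Toy stand-in for adata.X (2 cells x 3 genes).
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0], [1.5, 0.0, 0.0]]))
preview = dense_preview(X, ["AAAC-1", "AAAG-1"], ["GeneA", "GeneB", "GeneC"])
print(preview)
```

With the object loaded in this notebook, the call would be `dense_preview(adata.X, adata.obs_names, adata.var_names)`.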


#### Submit a bash script to create the subset folders, which hold the RNA and ATAC barcode names for each cell type. The input is clusters.csv (derived from the Leiden clusters of the aggregated AnnData)

In [20]:
################# Check the subsets output #################
#Cell subset list
!head $dictys_data_path/subsets.txt
# RNA cell barcodes for Subset 1
!head -n 4 $dictys_data_path/subsets/Subset1/names_rna.txt
#ATAC cell barcodes for Subset 1. They are identical because it's a joint profiling dataset.
!head -n 4 $dictys_data_path/subsets/Subset1/names_atac.txt

Subset0
Subset1
Subset10
Subset11
Subset2
Subset3
Subset4
Subset5
Subset6
Subset7
AAACAGCCAAGGTGCA-1
AAACATGCAAAGCTCC-1
AAACATGCACAGAAAC-1
AAACATGCAGAGAGCC-1
AAACAGCCAAGGTGCA-1
AAACATGCAAAGCTCC-1
AAACATGCACAGAAAC-1
AAACATGCAGAGAGCC-1
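The layout checked above (a `subsets.txt` list plus one folder per subset holding `names_rna.txt` and `names_atac.txt`) could be produced from `clusters.csv` roughly as follows. This is a sketch under stated assumptions, not the bash script that was actually submitted: the function name `write_subsets` and the toy barcodes are hypothetical, and it assumes the subset name is `Subset` followed by the cluster label and that RNA and ATAC barcodes are identical, as in a joint profiling dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_subsets(clusters: pd.DataFrame, out_dir: Path) -> list:
    """Write subsets.txt and per-subset barcode lists; return the subset names."""
    names = []
    for cluster, group in clusters.groupby("Cluster"):
        name = f"Subset{cluster}"
        subset_dir = out_dir / "subsets" / name
        subset_dir.mkdir(parents=True, exist_ok=True)
        barcodes = "\n".join(group["Barcode"]) + "\n"
        # Joint profiling: the same barcodes serve both modalities.
        (subset_dir / "names_rna.txt").write_text(barcodes)
        (subset_dir / "names_atac.txt").write_text(barcodes)
        names.append(name)
    names.sort()  # lexicographic order, so Subset10 sorts before Subset2
    (out_dir / "subsets.txt").write_text("\n".join(names) + "\n")
    return names


# Toy cluster table with the same columns as clusters.csv.
clusters = pd.DataFrame(
    {"Barcode": ["AAAC-1", "AAAG-1", "AAAT-1"], "Cluster": ["0", "10", "0"]}
)
out_dir = Path(tempfile.mkdtemp())
subset_names = write_subsets(clusters, out_dir)
rna = (out_dir / "subsets" / "Subset0" / "names_rna.txt").read_text().split()
atac = (out_dir / "subsets" / "Subset0" / "names_atac.txt").read_text().split()
print(subset_names, rna)
```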


#### Use motifs from HOCOMOCO (downloaded with wget via sbatch)

In [17]:
# Inspect the output to check whether the TF names match the gene names in the AnnData object
!head -n 18 $dictys_data_path/motifs.motif

>dKhGCGTGh	AHR_HUMAN.H11MO.0.B	3.3775000000000004
0.262728374765856	0.1227600511842322	0.362725638699551	0.25178593535036087
0.07633328991810645	0.08258130543118362	0.22593295481662123	0.6151524498340887
0.14450570038747923	0.28392173880411337	0.13815442099009081	0.4334181398183167
0.023935814057894068	0.016203821748029118	0.9253278681170539	0.03453249607702277
0.007919544273173793	0.953597675415874	0.017308392078009837	0.021174388232942286
0.02956192959210962	0.012890110758086997	0.9474192747166682	0.010128684933135217
0.007919544273173797	0.029561929592109615	0.012337825593096645	0.9501807005416201
0.007919544273173793	0.007919544273173793	0.9762413671804787	0.007919544273173793
0.27886589130660366	0.4285328543459993	0.10955683916661985	0.18304441518077724
>hnnGGWWnddWWGGdbWh	AIRE_HUMAN.H11MO.0.C	5.64711
0.38551919443239085	0.2604245534178759	0.1353299124033618	0.21872633974637148
0.18745267949274294	0.18745267949274294	0.14575446582123766	0.4793401751932764
0.14575446582123777	0.145
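The check mentioned in the cell comment (whether TF names match gene names) can be sketched as below. This is an illustration, not part of the pipeline: the function name `motif_tf_names` and the `gene_names` set are hypothetical, and it assumes that the TF symbol is the part of the motif name before the first underscore (for example `AHR` in `AHR_HUMAN.H11MO.0.B`).

```python
def motif_tf_names(lines):
    """Collect TF symbols from the header lines of a HOMER-style motif file."""
    tfs = []
    for line in lines:
        if not line.startswith(">"):
            continue  # skip the probability matrix rows
        fields = line.rstrip("\n").split("\t")
        motif_name = fields[1]  # e.g. AHR_HUMAN.H11MO.0.B
        # Assumption: the TF symbol precedes the first underscore.
        tfs.append(motif_name.split("_")[0])
    return tfs


# Header lines copied from the output above; the matrix row is rounded.
example = [
    ">dKhGCGTGh\tAHR_HUMAN.H11MO.0.B\t3.3775000000000004\n",
    "0.26\t0.12\t0.36\t0.25\n",
    ">hnnGGWWnddWWGGdbWh\tAIRE_HUMAN.H11MO.0.C\t5.64711\n",
]
tfs = motif_tf_names(example)

# Hypothetical stand-in for set(adata.var_names).
gene_names = {"AHR", "PAX5"}
missing = sorted(set(tfs) - gene_names)
print(tfs, missing)
```

Any symbol listed in `missing` would be a TF whose motif has no matching gene in the expression data under this naming assumption.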

#### Get the reference genome from the HOMER directory (sbatch)

In [16]:
%%bash
#check the reference genome
ls -h1s /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data/genome | head

total 4.4G
4.0K annotations
 12K chrom.sizes
3.1G genome.fa
3.2M hg38.aug
 42M hg38.basic.annotation
673M hg38.full.annotation
164K hg38.miRNA
505M hg38.repeats
 24M hg38.rna


#### Get the gene GTF from Ensembl (inline), then extract the genes in BED format (sbatch)

In [18]:
!head $dictys_data_path/gene.bed

chr1	11869	14409	DDX11L1	.	+
chr1	14404	29570	WASH7P	.	-
chr1	17369	17436	MIR6859-1	.	-
chr1	29554	31109	MIR1302-2HG	.	+
chr1	30366	30503	MIR1302-2	.	+
chr1	34554	36081	FAM138A	.	-
chr1	52473	53312	OR4G4P	.	+
chr1	57598	64116	OR4G11P	.	+
chr1	65419	71585	OR4F5	.	+
chr1	131025	134836	CICP27	.	+
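The extraction of gene records from a GTF into the six-column BED layout shown above can be sketched as follows. This is not the script that was submitted: the function name `gtf_genes_to_bed`, its two options, and the example GTF lines are illustrative assumptions. GTF coordinates are 1-based, whereas strict BED uses a 0-based start; whether to shift the start is left as an option here.

```python
import re


def gtf_genes_to_bed(gtf_lines, zero_based_start=False, add_chr_prefix=True):
    """Convert GTF 'gene' records to (chrom, start, end, name, score, strand) rows."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene records only
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match is None:
            continue  # genes without a symbol are skipped in this sketch
        # Assumption: chromosome names lack the 'chr' prefix used in the BED file.
        if add_chr_prefix and not chrom.startswith("chr"):
            chrom = "chr" + chrom
        if zero_based_start:
            start -= 1  # strict BED convention
        rows.append((chrom, start, end, match.group(1), ".", strand))
    return rows


# Illustrative GTF lines in Ensembl style (one gene, one transcript).
gtf = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";\n',
]
bed_rows = gtf_genes_to_bed(gtf)
print("\t".join(str(v) for v in bed_rows[0]))
```

Passing `zero_based_start=True` gives the strict BED start instead; which convention the downstream tools expect should be checked against their documentation.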


#### Get the genome blacklist regions from ENCODE, to exclude them from the ATAC analysis

In [10]:
!wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O $dictys_data_path/blacklist.bed.gz

--2024-08-16 17:57:21--  https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2020/05/05/bc5dcc02-eafb-4471-aba0-4ebc7ee8c3e6/ENCFF356LFX.bed.gz?response-content-disposition=attachment%3B%20filename%3DENCFF356LFX.bed.gz&AWSAccessKeyId=ASIATGZNGCNXZD6QL5EC&Signature=%2BC%2B%2Bdz1d%2FFMSras1k46ru33iE9U%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEP7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJGMEQCIBfRqv1kvRIi3OyVke5xBShuVEkO6EQDVQmyzUjBW2vQAiAi7Ra166Ijsiij%2BajjvzBxxXTwQAVgu86d%2ByENEMZUIiq8BQj3%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAAaDDIyMDc0ODcxNDg2MyIM22gFho2NRpSAVCAkKpAFZYQ2sb%2BUm%2BXqMdD6Q6PrIRUkJe6%2F15B%2FlWy%2F4GEOo0XMaXj8xK%2B5fIZMQS%2BaAjqchbByIJn9UBE6r%2FpO3s8vulrXIkLiwuoORYPRAeD0a

In [11]:
!gunzip $dictys_data_path/blacklist.bed.gz

In [19]:
!head $dictys_data_path/blacklist.bed

chr1	628903	635104
chr1	5850087	5850571
chr1	8909610	8910014
chr1	9574580	9574997
chr1	32043823	32044203
chr1	33818964	33819344
chr1	38674335	38674715
chr1	50017081	50017546
chr1	52996949	52997329
chr1	55372488	55372869
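To illustrate what excluding blacklisted regions means, the sketch below drops any peak that overlaps a blacklist interval. It is a generic example and not how the pipeline itself applies the file: the function name `overlaps_blacklist` and the three peaks are hypothetical, while the two blacklist intervals are taken from the output above.

```python
def overlaps_blacklist(chrom, start, end, blacklist):
    """Return True if [start, end) overlaps any blacklisted interval on chrom."""
    return any(
        b_start < end and start < b_end
        for b_start, b_end in blacklist.get(chrom, [])
    )


# First two intervals of blacklist.bed, keyed by chromosome.
blacklist = {"chr1": [(628903, 635104), (5850087, 5850571)]}

# Hypothetical peaks: one inside a blacklisted region, two outside.
peaks = [
    ("chr1", 629000, 629500),
    ("chr1", 700000, 700500),
    ("chr2", 628903, 635104),
]
kept = [peak for peak in peaks if not overlaps_blacklist(*peak, blacklist)]
print(kept)
```

A linear scan per peak is enough for a file of this size; for genome-wide peak sets, a sorted-interval or interval-tree approach would scale better.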
