# Input data preparation
This notebook prepares the data files needed for the cell-type-specific gene regulatory network (GRN) inference pipeline.
## Preparation of individual input files
This section prepares each input file or folder in its own subsection. In each subsection, we describe the expected input file, demonstrate the preparation script (displaying its usage when available), and briefly illustrate the content and/or format of the prepared input file. All of these input files are placed in the `data` folder of the inference pipeline.

In [1]:
dictys_data_path = '/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data'
multiome_data_path = '/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/filtered_feature_bc_matrix/'

#### The helper script `expression_mtx.py` also accepts multiome Cell Ranger ARC features, which fall into two categories ('Gene Expression' and 'Peaks'), and filters out every peak name and gene name that contains `:` or `.` (peak names have the form chrN:start-end)
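A minimal sketch of the name filter described in the heading above, assuming the feature names are available as a plain list of strings. The function name `filter_feature_names` is hypothetical and is not taken from `expression_mtx.py`.

```python
def filter_feature_names(names):
    """Keep only names that contain neither ':' nor '.'.

    Peak names such as 'chr1:9997-10500' are dropped because of the ':',
    and gene names such as 'AL627309.1' are dropped because of the '.'.
    """
    return [name for name in names if ":" not in name and "." not in name]


example_names = ["A1BG", "A1BG-AS1", "AL627309.1", "chr1:9997-10500"]
kept_names = filter_feature_names(example_names)
```

Filtering on the name alone removes the 'Peaks' rows without looking at the feature-type column, because every peak name contains a colon.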

In [2]:
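# Print the first 4 columns of the first 5 lines of expression.tsv.gz; the 'Broken pipe' messages in the output are harmless and appear because head exits before gunzip finishes writing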
!printf '%-10s%20s%20s%20s\n' '' $(cat /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data/expression.tsv.gz | gunzip | head -n 5 | awk -F "\t" '{print $1"\t"$2"\t"$3"\t"$4}')


gzip: stdout: Broken pipe
cat: write error: Broken pipe
            AAACAGCCAAACCTTG-1  AAACAGCCAAAGCTAA-1  AAACAGCCAAGCCACT-3
A1BG                         0                   0                   0
A1BG-AS1                     0                   0                   0
A1CF                         0                   0                   0
A2M                          0                   0                   0


In [3]:
# Print the number of lines in the .tsv.gz file = number of genes + 1 header line
!zcat $dictys_data_path/expression.tsv.gz | wc -l

24027


### Sort your bams to get aligned reads per cell (36k bam files)
#### a. Submitted array jobs to sort time-point wise bams

#### b. Subset your cell barcodes to the clusters you want GRNs to be calculated for (cell types from Leiden clustering in the aggregated AnnData)

#### Submit a bash script to create the subset folders containing the barcode names per cell type for RNA and ATAC data. The input is clusters.csv (from the Leiden clusters of the aggregated AnnData)
Cell subsets are updated after a first pass through the data, to remove or merge clusters.
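The layout checked in the next cell (a `subsets.txt` list plus one folder per subset) can be sketched as follows. This is only an illustration, not the submitted bash script: the function name `write_subsets`, the column names `barcode` and `cluster` in `clusters.csv`, and the `names_atac.txt` file name are assumptions.

```python
import csv
import os
from collections import defaultdict


def write_subsets(clusters_csv, out_dir, barcode_col="barcode", cluster_col="cluster"):
    """Write subsets.txt and subsets/<name>/names_{rna,atac}.txt under out_dir."""
    groups = defaultdict(list)
    with open(clusters_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row[cluster_col]].append(row[barcode_col])

    os.makedirs(out_dir, exist_ok=True)
    names = sorted(groups)
    with open(os.path.join(out_dir, "subsets.txt"), "w") as handle:
        handle.write("\n".join(names) + "\n")

    for name in names:
        subset_dir = os.path.join(out_dir, "subsets", name)
        os.makedirs(subset_dir, exist_ok=True)
        # Joint profiling: the RNA and ATAC barcode lists are identical.
        for file_name in ("names_rna.txt", "names_atac.txt"):
            with open(os.path.join(subset_dir, file_name), "w") as handle:
                handle.write("\n".join(groups[name]) + "\n")
    return names
```

Barcodes keep the order in which they appear in the input file, and subset names are sorted so that `subsets.txt` is reproducible.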

In [4]:
################# Check the subsets output #################
#Cell subset list
!head $dictys_data_path/subsets.txt
# RNA cell barcodes for subset Day_1_Cells
!head -n 4 $dictys_data_path/subsets/Day_1_Cells/names_rna.txt
# RNA cell barcodes for subset Plasma_Blast. ATAC and RNA barcodes are identical because this is a joint profiling dataset.
!head -n 4 $dictys_data_path/subsets/Plasma_Blast/names_rna.txt

Activated_B_Cells
Day_1_Cells
Day_3_Cells
Germinal_Center
Plasma_Blast
Undefined
AAACAGCCAAGTTATC-1
AAACAGCCAATAGCCC-1
AAACAGCCAGTTAGCC-1
AAACATGCAATAACGA-1
AAACCAACAAGCTAAA-3
AAACCAACAATTAAGG-3
AAACCGAAGAGAAGGG-3
AAACCGAAGTATTGTG-2


#### Use motifs from HOCOMOCO (wget-sbatch)
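To check whether motif names match gene names, the TF symbol has to be pulled out of each motif header. The sketch below assumes the header layout visible in this subsection (tab-separated consensus, motif name such as `AHR_HUMAN.H11MO.0.B`, and a score) and takes the text before the first underscore as the TF symbol. Both function names are hypothetical.

```python
def motif_tf_names(lines):
    """Yield the TF symbol from each motif header line (lines starting with '>')."""
    for line in lines:
        if not line.startswith(">"):
            continue  # skip the per-position probability rows
        fields = line.rstrip("\n").split("\t")
        motif_id = fields[1]          # e.g. AHR_HUMAN.H11MO.0.B
        yield motif_id.split("_")[0]  # e.g. AHR


def tfs_missing_from_genes(lines, gene_names):
    """Return the sorted TF symbols that do not appear in gene_names."""
    genes = set(gene_names)
    return sorted({tf for tf in motif_tf_names(lines) if tf not in genes})
```

For example, `tfs_missing_from_genes(open(path), adata.var_names)` would list the TFs whose motifs cannot be linked to a gene in the expression data, where `path` and `adata` stand for the motif file and the AnnData object.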

In [4]:
# see the output to check if gene names match TF names in anndata
!head -n 18 $dictys_data_path/motifs.motif

>dKhGCGTGh	AHR_HUMAN.H11MO.0.B	3.3775000000000004
0.262728374765856	0.1227600511842322	0.362725638699551	0.25178593535036087
0.07633328991810645	0.08258130543118362	0.22593295481662123	0.6151524498340887
0.14450570038747923	0.28392173880411337	0.13815442099009081	0.4334181398183167
0.023935814057894068	0.016203821748029118	0.9253278681170539	0.03453249607702277
0.007919544273173793	0.953597675415874	0.017308392078009837	0.021174388232942286
0.02956192959210962	0.012890110758086997	0.9474192747166682	0.010128684933135217
0.007919544273173797	0.029561929592109615	0.012337825593096645	0.9501807005416201
0.007919544273173793	0.007919544273173793	0.9762413671804787	0.007919544273173793
0.27886589130660366	0.4285328543459993	0.10955683916661985	0.18304441518077724
>hnnGGWWnddWWGGdbWh	AIRE_HUMAN.H11MO.0.C	5.64711
0.38551919443239085	0.2604245534178759	0.1353299124033618	0.21872633974637148
0.18745267949274294	0.18745267949274294	0.14575446582123766	0.4793401751932764
0.14575446582123777	0.145

#### Get the reference genome from homer directory - sbatch 

In [12]:
%%bash
#check the reference genome
ls -h1s /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data/genome | head

total 4.4G
4.0K annotations
 12K chrom.sizes
3.1G genome.fa
3.2M hg38.aug
 42M hg38.basic.annotation
673M hg38.full.annotation
164K hg38.miRNA
505M hg38.repeats
 24M hg38.rna


#### Get gene gtf from ensembl - inline > extract genes in bed format - sbatch
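The extraction step can be sketched as below; it is not the sbatch script used here. It assumes an Ensembl-style GTF whose `gene` rows carry a `gene_name` attribute and whose chromosome names lack the `chr` prefix. BED files conventionally use a 0-based start, but whether this pipeline's `gene.bed` applies that shift is not stated here, so it is exposed as the hypothetical `zero_based_start` flag.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gtf_genes_to_bed(gtf_lines, chrom_prefix="chr", zero_based_start=False):
    """Return 6-column BED rows (chrom, start, end, name, '.', strand) for gene features."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        match = GENE_NAME.search(fields[8])
        if match is None:
            continue  # genes without a symbol are skipped in this sketch
        start = int(fields[3]) - (1 if zero_based_start else 0)
        rows.append((chrom_prefix + fields[0], start, int(fields[4]),
                     match.group(1), ".", fields[6]))
    return rows
```

Each returned tuple can be written as one tab-separated line to obtain a file with the same six columns as the `gene.bed` shown in this subsection.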

In [7]:
!head $dictys_data_path/gene.bed

chr1	11869	14409	DDX11L1	.	+
chr1	14404	29570	WASH7P	.	-
chr1	17369	17436	MIR6859-1	.	-
chr1	29554	31109	MIR1302-2HG	.	+
chr1	30366	30503	MIR1302-2	.	+
chr1	34554	36081	FAM138A	.	-
chr1	52473	53312	OR4G4P	.	+
chr1	57598	64116	OR4G11P	.	+
chr1	65419	71585	OR4F5	.	+
chr1	131025	134836	CICP27	.	+


#### Get the blacklist regions of the genome from ENCODE, to exclude them from the ATAC analysis
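Excluding blacklisted regions amounts to dropping every peak that overlaps a blacklist interval on the same chromosome. A minimal sketch of that test, assuming both inputs are lists of `(chrom, start, end)` tuples in half-open BED coordinates; the function names are hypothetical and the pipeline may apply the blacklist differently.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap test, as used for BED coordinates."""
    return a_start < b_end and b_start < a_end


def drop_blacklisted(peaks, blacklist):
    """Return the peaks that do not overlap any blacklist interval."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in peaks:
        hits = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, b_start, b_end) for b_start, b_end in hits):
            kept.append((chrom, start, end))
    return kept
```

The linear scan per chromosome is enough for a blacklist of a few hundred intervals; a sorted or tree-based lookup would only matter for much larger region sets.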

In [10]:
!wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O $dictys_data_path/blacklist.bed.gz

--2024-08-16 17:57:21--  https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2020/05/05/bc5dcc02-eafb-4471-aba0-4ebc7ee8c3e6/ENCFF356LFX.bed.gz?response-content-disposition=attachment%3B%20filename%3DENCFF356LFX.bed.gz&AWSAccessKeyId=ASIATGZNGCNXZD6QL5EC&Signature=%2BC%2B%2Bdz1d%2FFMSras1k46ru33iE9U%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEP7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJGMEQCIBfRqv1kvRIi3OyVke5xBShuVEkO6EQDVQmyzUjBW2vQAiAi7Ra166Ijsiij%2BajjvzBxxXTwQAVgu86d%2ByENEMZUIiq8BQj3%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAAaDDIyMDc0ODcxNDg2MyIM22gFho2NRpSAVCAkKpAFZYQ2sb%2BUm%2BXqMdD6Q6PrIRUkJe6%2F15B%2FlWy%2F4GEOo0XMaXj8xK%2B5fIZMQS%2BaAjqchbByIJn9UBE6r%2FpO3s8vulrXIkLiwuoORYPRAeD0a

In [11]:
!gunzip $dictys_data_path/blacklist.bed.gz

In [8]:
!head $dictys_data_path/blacklist.bed

chr1	628903	635104
chr1	5850087	5850571
chr1	8909610	8910014
chr1	9574580	9574997
chr1	32043823	32044203
chr1	33818964	33819344
chr1	38674335	38674715
chr1	50017081	50017546
chr1	52996949	52997329
chr1	55372488	55372869


# Preparing data for dynamic run

## Comparison of cell barcodes across TSV files and AnnData objects

In [1]:
import anndata as ad
#import stream as st

#### Barcodes from anndata file (original)

In [2]:
adata_original_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/adata_aggregated_gene.leiden.h5ad"
adata_original = ad.read_h5ad(adata_original_file)
adata_original

AnnData object with n_obs × n_vars = 32418 × 23090
    obs: 'cell_type_major', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'topic_15', 'topic_16', 'topic_17', 'topic_18', 'topic_19', 'topic_20', 'topic_21', 'topic_22', 'topic_23', 'topic_24', 'topic_25', 'topic_26', 'topic_27', 'topic_28', 'topic_29', 'topic_30', 'topic_31', 'leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
    uns: 'cell_type_major_colors', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'topic_dendogram', 'umap'
    obsm: 'X_joint_umap_features', 'X_topic_compositions', 'X_umap', 'X_umap_features'
    varm: 'topic_feature_activations', 'topic_feature_compositions'
    layers: 'counts'
    obsp: 'connectivities',

In [None]:
import pandas as pd  # imported here because this cell precedes the later pandas import
# Save the cell barcodes to a file
original_cells = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/original_anndata_cells.csv"
cell_barcodes1 = adata_original.obs.index
cell_barcodes1_df = pd.DataFrame(cell_barcodes1)
cell_barcodes1_df.to_csv(original_cells, index=False)

#### Barcodes for the STREAM input AnnData (some cell clusters removed, others merged)

In [None]:
import anndata as ad
adata_stream_input = ad.read_h5ad("/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/stream_input_adata.h5ad")
adata_stream_input

In [None]:
# Save the cell barcodes (obs index) of the STREAM input AnnData to a csv file
import pandas as pd
stream_input_cells = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/stream_input_cells.csv"
cell_barcodes = adata_stream_input.obs.index
cell_barcodes_df = pd.DataFrame(cell_barcodes)
cell_barcodes_df.to_csv(stream_input_cells, index=False)

#### Barcodes in expression.tsv (built for Dictys)

In [5]:
import pandas as pd
# load superset dataframe
expression_rna = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/dictys_outs/data/expression.tsv.gz"
# Load data
expression_rna = pd.read_csv(expression_rna, header=0, index_col=0, sep='\t')
# Display the first few rows of each DataFrame to confirm successful loading
print("exp_rna DataFrame:")
display(expression_rna.head())
print("exp_rna DataFrame shape:", expression_rna.shape)

exp_rna DataFrame:


Unnamed: 0,AAACAGCCAAACCTTG-1,AAACAGCCAAAGCTAA-1,AAACAGCCAAGCCACT-3,AAACAGCCAAGGTGCA-1,AAACAGCCAAGTTATC-1,AAACAGCCAATAGCCC-1,AAACAGCCAATTATGC-2,AAACAGCCAGTTAGCC-1,AAACAGCCATAATCCG-1,AAACAGCCATTCAGCA-3,...,TTTGTTGGTGTTGCAA-1,TTTGTTGGTTAAGGTT-3,TTTGTTGGTTAGCGTA-1,TTTGTTGGTTATCCGT-3,TTTGTTGGTTGACTTC-1,TTTGTTGGTTTACGTC-1,TTTGTTGGTTTAGTCC-1,TTTGTTGGTTTATGGG-2,TTTGTTGGTTTCCTCC-3,TTTGTTGGTTTGAGGC-2
A1BG,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
A1BG-AS1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A1CF,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A2M,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A2M-AS1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


exp_rna DataFrame shape: (24026, 36306)


In [6]:
#save the cell barcode to file
expression_rna_cells = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/expression_rna_cells.csv"
cell_barcodes2 = expression_rna.columns
cell_barcodes2_df = pd.DataFrame(cell_barcodes2)
cell_barcodes2_df.to_csv(expression_rna_cells, index=False)

In [7]:
# Count how many cell barcodes in cell_barcodes2_df end with the suffix -1, -2, or -3
print("-1", cell_barcodes2_df[cell_barcodes2_df[0].str.endswith("-1")].shape)
print("-2", cell_barcodes2_df[cell_barcodes2_df[0].str.endswith("-2")].shape)
print("-3", cell_barcodes2_df[cell_barcodes2_df[0].str.endswith("-3")].shape)

-1 (15285, 1)
-2 (11127, 1)
-3 (9894, 1)


#### Barcodes directly from cell-ranger arc aggr output

In [8]:
cell_ranger_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
cell_ranger_barcodes = pd.read_csv(cell_ranger_barcodes, header=None, index_col=None, sep='\t')
print("cell_ranger_barcodes DataFrame:")
display(cell_ranger_barcodes.head())
print("cell_ranger_barcodes DataFrame shape:", cell_ranger_barcodes.shape)

cell_ranger_barcodes DataFrame:


Unnamed: 0,0
0,AAACAGCCAAACCTTG-1
1,AAACAGCCAAAGCTAA-1
2,AAACAGCCAAGCCACT-3
3,AAACAGCCAAGGTGCA-1
4,AAACAGCCAAGTTATC-1


cell_ranger_barcodes DataFrame shape: (36306, 1)


In [9]:
# save to file cell_ranger_barcodes
cell_ranger_barcodes_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/cellranger_aggr_barcodes.csv"
cell_ranger_barcodes.to_csv(cell_ranger_barcodes_file, index=False)

In [10]:
# Count how many of the Cell Ranger aggr barcodes end with the suffix -1, -2, or -3
print("-1", cell_ranger_barcodes[cell_ranger_barcodes[0].str.endswith("-1")].shape)
print("-2", cell_ranger_barcodes[cell_ranger_barcodes[0].str.endswith("-2")].shape)
print("-3", cell_ranger_barcodes[cell_ranger_barcodes[0].str.endswith("-3")].shape)

-1 (15285, 1)
-2 (11127, 1)
-3 (9894, 1)
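
The per-suffix counts can also be collected in a single pass, without listing the suffixes by hand. A small sketch, assuming barcodes are plain strings of the form `SEQUENCE-N`; `count_suffixes` is a hypothetical helper name.

```python
from collections import Counter


def count_suffixes(barcodes):
    """Count barcodes by the text after the last '-' (for example '1', '2', '3')."""
    return Counter(barcode.rsplit("-", 1)[-1] for barcode in barcodes)


example_counts = count_suffixes(
    ["AAACAGCCAAACCTTG-1", "AAACAGCCAAAGCTAA-1", "AAACAGCCAAGCCACT-3"]
)
```

Splitting on the last hyphen also keeps working if more than nine libraries are aggregated, where a substring test for `-1` would also match `-10`.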


#### Check intersections between sets of barcodes

In [None]:
# create sets of cell barcodes
stream_input_cells = set(cell_barcodes)
original_cells = set(cell_barcodes1)
expression_rna_cells = set(cell_barcodes2)
cell_ranger_barcodes = set(cell_ranger_barcodes[0])
print(f"Number of cells in stream_input_cells: {len(stream_input_cells)}")
print(f"Number of cells in original_cells: {len(original_cells)}")
print(f"Number of cells in expression_rna_cells: {len(expression_rna_cells)}")
print(f"Number of cells in cell_ranger_barcodes: {len(cell_ranger_barcodes)}")

In [None]:
# Check if stream_input_cells is a subset of original_cells
is_subset = stream_input_cells.issubset(original_cells)
# Display the result
if is_subset:
    print("All stream input cells are present in the original anndata cells.")
else:
    print("Some stream input cells are missing from the original cells.")
    # Optionally, display the missing cells
    missing_cells = stream_input_cells - original_cells
    print(f"Number of missing cells: {len(missing_cells)}")
    print(f"Example missing cells: {list(missing_cells)[:5]}")

In [None]:
# check if stream_input_cells is a subset of expression_rna_cells
is_subset1 = stream_input_cells.issubset(expression_rna_cells)
# Display the result
if is_subset1:
    print("All stream input cells are present in the expression cells.")
else:
    print("Some stream input cells are missing from the expression cells.")
    # Optionally, display the missing cells
    missing_cells = stream_input_cells - expression_rna_cells
    print(f"Number of missing cells: {len(missing_cells)}")
    print(f"Example missing cells: {list(missing_cells)[:5]}")
    
#All stream input cells are present in the expression cells.

In [None]:
# check if original_cells is a subset of expression_rna_cells
is_subset2 = original_cells.issubset(expression_rna_cells)
# Display the result
if is_subset2:
    print("All original anndata cells are present in the expression cells.")
else:
    print("Some original anndata cells are missing from the expression cells.")
    # Optionally, display the missing cells
    missing_cells = original_cells - expression_rna_cells
    print(f"Number of missing cells: {len(missing_cells)}")
    print(f"Example missing cells: {list(missing_cells)[:5]}")
    
#All original anndata cells are present in the expression cells.

In [None]:
# check if original_cells is a subset of cell_ranger_barcodes
is_subset3 = original_cells.issubset(cell_ranger_barcodes)
# Display the result
if is_subset3:
    print("All original anndata cells are present in cell ranger barcodes.")
else:
    print("Some original anndata cells are missing from the cell ranger cells.")
    # Optionally, display the missing cells
    missing_cells = original_cells - cell_ranger_barcodes
    print(f"Number of missing cells: {len(missing_cells)}")
    print(f"Example missing cells: {list(missing_cells)[:5]}")

# All original anndata cells are present in cell ranger barcodes.

In [None]:
# check if expression_rna_cells is a subset of cell_ranger_barcodes
is_subset4 = expression_rna_cells.issubset(cell_ranger_barcodes)
# Display the result
if is_subset4:
    print("All expression cells are present in cell ranger barcodes.")
else:
    print("Some expression cells are missing from the cell ranger barcodes.")
    # Optionally, display the missing cells
    missing_cells = expression_rna_cells - cell_ranger_barcodes
    print(f"Number of missing cells: {len(missing_cells)}")
    print(f"Example missing cells: {list(missing_cells)[:5]}")

# All expression cells are present in cell ranger barcodes. (for the new expression.tsv file)
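
The four subset checks in this section share the same logic, so they could be folded into one helper. A minimal sketch; `subset_report` is a hypothetical name and the message wording is illustrative.

```python
def subset_report(name_a, set_a, name_b, set_b, n_examples=5):
    """Describe whether every element of set_a is also in set_b."""
    missing = set(set_a) - set(set_b)
    if not missing:
        return f"All {name_a} are present in {name_b}."
    examples = sorted(missing)[:n_examples]
    return f"{len(missing)} {name_a} are missing from {name_b}; examples: {examples}"
```

A call such as `print(subset_report("stream input cells", stream_input_cells, "original cells", original_cells))` would then replace each if/else block, and the label of each comparison is stated once in the call rather than in a separate comment.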

## Inspecting cell ranger output and bam files for number of cells, genes and peaks

In [11]:
# path to multi-omic cell-ranger matrix file 
cell_ranger_arc = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/outs/filtered_feature_bc_matrix"

In [5]:
import gzip
import os

def read_mtx(file_path, n=50):
    with gzip.open(file_path, 'rt') as f:
        line_count = 0
        for line in f:
            if line.startswith('%'):
                continue  # Skip comments
            print(line.strip())  # Process each line
            line_count += 1
            if line_count >= n:
                break

# Call the function with your file path
read_mtx(os.path.join(cell_ranger_arc, "matrix.mtx.gz"), 50)

227856 36306 349463587
45 1 1
60 1 1
63 1 4
74 1 2
87 1 1
98 1 2
147 1 1
171 1 9
191 1 2
209 1 2
217 1 3
220 1 6
225 1 1
234 1 1
242 1 1
244 1 2
247 1 2
251 1 1
253 1 2
262 1 3
265 1 4
266 1 3
269 1 1
272 1 1
298 1 4
299 1 1
339 1 7
360 1 1
371 1 1
386 1 1
387 1 2
389 1 1
399 1 2
408 1 1
417 1 1
432 1 2
433 1 1
434 1 1
435 1 1
439 1 1
440 1 1
475 1 3
476 1 3
478 1 3
486 1 1
493 1 3
509 1 1
525 1 23
529 1 1
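
In a Matrix Market coordinate file, the first non-comment line gives the number of rows, the number of columns, and the number of stored entries. Here that is 227856 features by 36306 barcodes with 349463587 non-zero entries, and each later line is a `row column value` triplet. A small sketch that reads only this size line; `read_mtx_shape` is a hypothetical helper name.

```python
import gzip


def read_mtx_shape(path):
    """Return (n_rows, n_cols, n_entries) from a gzipped Matrix Market file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("%"):
                continue  # banner and comment lines
            n_rows, n_cols, n_entries = (int(value) for value in line.split())
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {path}")
```

Reading the size line is much cheaper than loading the full matrix when only the feature and barcode counts are needed.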


In [6]:
# Inspect the features table (read_mtx simply prints the first n non-comment lines of any gzipped text file)
features_path = os.path.join(cell_ranger_arc, "features.tsv.gz")
read_mtx(features_path, 50)

ENSG00000243485	MIR1302-2HG	Gene Expression	chr1	29553	30267
ENSG00000237613	FAM138A	Gene Expression	chr1	36080	36081
ENSG00000186092	OR4F5	Gene Expression	chr1	65418	69055
ENSG00000238009	AL627309.1	Gene Expression	chr1	120931	133723
ENSG00000239945	AL627309.3	Gene Expression	chr1	91104	91105
ENSG00000239906	AL627309.2	Gene Expression	chr1	140338	140339
ENSG00000241860	AL627309.5	Gene Expression	chr1	149706	173862
ENSG00000241599	AL627309.4	Gene Expression	chr1	160445	160446
ENSG00000286448	AP006222.2	Gene Expression	chr1	266854	266855
ENSG00000236601	AL732372.1	Gene Expression	chr1	360056	360057
ENSG00000284733	OR4F29	Gene Expression	chr1	451696	451697
ENSG00000235146	AC114498.1	Gene Expression	chr1	587628	587629
ENSG00000284662	OR4F16	Gene Expression	chr1	686672	686673
ENSG00000229905	AL669831.2	Gene Expression	chr1	760910	760911
ENSG00000237491	LINC01409	Gene Expression	chr1	778757	803934
ENSG00000177757	FAM87B	Gene Expression	chr1	817370	817371
ENSG00000228794	LINC01128	Gene Expre

In [7]:
# get unique entries in column 3 using pandas
import pandas as pd
features_df = pd.read_csv(features_path, sep='\t', header=None)
unique_entries = features_df[2].unique()
print(unique_entries)
print(features_df.shape)
# Print the number of rows per feature type (column index 2)
print(features_df[2].value_counts())

['Gene Expression' 'Peaks']
(227856, 6)
2
Peaks              191255
Gene Expression     36601
Name: count, dtype: int64


In [8]:
# First, filter rows where column 3 is 'Gene Expression'
gene_expression_df = features_df[features_df[2] == 'Gene Expression']
# Then, count how many rows in column 2 contain a '.'
dot_count = gene_expression_df[1].str.contains(r'\.').sum()
print(dot_count)

12555


#### Check day-specific barcode counts

In [12]:
import pandas as pd
day0_2_cell_ranger_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day0_2/outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
day0_2_cell_ranger_barcodes = pd.read_csv(day0_2_cell_ranger_barcodes, header=None, index_col=None, sep='\t')
print("day0_2_cell_ranger_barcodes DataFrame:")
display(day0_2_cell_ranger_barcodes.head())
print("day0_2_cell_ranger_barcodes DataFrame shape:", day0_2_cell_ranger_barcodes.shape)

day0_2_cell_ranger_barcodes DataFrame:


Unnamed: 0,0
0,AAACAGCCAAACCTTG-1
1,AAACAGCCAAAGCTAA-1
2,AAACAGCCAAGGTGCA-1
3,AAACAGCCAAGTTATC-1
4,AAACAGCCAATAGCCC-1


day0_2_cell_ranger_barcodes DataFrame shape: (15285, 1)


In [13]:
# save to file day0_2_cell_ranger_barcodes
day0_2_cell_ranger_barcodes_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/cellranger_day0_2_barcodes.csv"
day0_2_cell_ranger_barcodes.to_csv(day0_2_cell_ranger_barcodes_file, index=False)

In [14]:
import pandas as pd
day3_4_cell_ranger_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day3_4/outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
day3_4_cell_ranger_barcodes = pd.read_csv(day3_4_cell_ranger_barcodes, header=None, index_col=None, sep='\t')
print("day3_4_cell_ranger_barcodes DataFrame:")
display(day3_4_cell_ranger_barcodes.head())
print("day3_4_cell_ranger_barcodes DataFrame shape:", day3_4_cell_ranger_barcodes.shape)

day3_4_cell_ranger_barcodes DataFrame:


Unnamed: 0,0
0,AAACAGCCAATTATGC-1
1,AAACATGCAATAACGA-1
2,AAACATGCACATAACT-1
3,AAACATGCAGATAGAC-1
4,AAACATGCATAATCCG-1


day3_4_cell_ranger_barcodes DataFrame shape: (11127, 1)


In [15]:
# save to file day3_4_cell_ranger_barcodes
day3_4_cell_ranger_barcodes_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/cellranger_day3_4_barcodes.csv"
day3_4_cell_ranger_barcodes.to_csv(day3_4_cell_ranger_barcodes_file, index=False)

In [16]:
import pandas as pd
day5_6_cell_ranger_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day5_6/outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
day5_6_cell_ranger_barcodes = pd.read_csv(day5_6_cell_ranger_barcodes, header=None, index_col=None, sep='\t')
print("day5_6_cell_ranger_barcodes DataFrame:")
display(day5_6_cell_ranger_barcodes.head())
print("day5_6_cell_ranger_barcodes DataFrame shape:", day5_6_cell_ranger_barcodes.shape)

day5_6_cell_ranger_barcodes DataFrame:


Unnamed: 0,0
0,AAACAGCCAAGCCACT-1
1,AAACAGCCATTCAGCA-1
2,AAACATGCACACAATT-1
3,AAACATGCAGTTAAAG-1
4,AAACATGCATAAGTTC-1


day5_6_cell_ranger_barcodes DataFrame shape: (9894, 1)


In [17]:
# save to file day5_6_cell_ranger_barcodes
day5_6_cell_ranger_barcodes_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/cellranger_day5_6_barcodes.csv"
day5_6_cell_ranger_barcodes.to_csv(day5_6_cell_ranger_barcodes_file, index=False)

In [18]:
# Build per-day barcode sets to check whether the days are mutually exclusive
day0_2_cells = set(day0_2_cell_ranger_barcodes[0])
day3_4_cells = set(day3_4_cell_ranger_barcodes[0])
day5_6_cells = set(day5_6_cell_ranger_barcodes[0])


In [20]:
unique_cells = day0_2_cells.union(day3_4_cells, day5_6_cells)
total_unique_cells = len(unique_cells)
sum_of_individual_sizes = len(day0_2_cells) + len(day3_4_cells) + len(day5_6_cells)

print(f"Total unique cells: {total_unique_cells}")
print(f"Sum of individual sizes: {sum_of_individual_sizes}")

# Save the unique barcodes to a file, one barcode per line
unique_cells_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/total_unique_cells.csv"
with open(unique_cells_file, 'w') as f:
    f.write("\n".join(sorted(unique_cells)) + "\n")



Total unique cells: 35732
Sum of individual sizes: 36306
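
The gap of 36306 - 35732 = 574 means that some barcode strings occur in more than one per-day run. Every per-day run labels its barcodes with the suffix `-1`, whereas the aggregated output uses `-1`, `-2`, and `-3`, with counts (15285, 11127, 9894) that match the per-day sizes. The sketch below counts the pairwise overlaps and relabels each day with its own suffix. The mapping day0_2 to 1, day3_4 to 2, and day5_6 to 3 is inferred from those matching counts rather than stated by Cell Ranger in this notebook, and both function names are hypothetical.

```python
from itertools import combinations


def pairwise_overlaps(named_sets):
    """Return {(name_a, name_b): number of shared barcodes} for every pair of sets."""
    return {
        (a, b): len(named_sets[a] & named_sets[b])
        for a, b in combinations(named_sets, 2)
    }


def relabel(barcodes, suffix):
    """Replace the text after the last '-' in every barcode with the given suffix."""
    return {barcode.rsplit("-", 1)[0] + "-" + str(suffix) for barcode in barcodes}
```

After relabeling, for example `relabel(day3_4_cells, 2)`, the three sets can no longer collide, so their union should have as many elements as the sum of their sizes.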


### Check barcodes of per day atac bams

In [3]:
# path to day specific bam files
day0_2bam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day0_2/outs/atac_possorted_bam.bam"
day3_4bam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day3_4/outs/atac_possorted_bam.bam"
day5_6bam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day5_6/outs/atac_possorted_bam.bam"

In [1]:
############# bam file helper functions ################

import pysam
import csv
from concurrent.futures import ThreadPoolExecutor

# Function to count the total number of reads (rows) in a BAM file
def count_total_reads(bam_file_path):
    # Open the BAM file
    with pysam.AlignmentFile(bam_file_path, "rb") as bam_file:
        # Get the number of mapped and unmapped reads (taken from the BAM index, so the file must be indexed)
        total_mapped = bam_file.mapped
        total_unmapped = bam_file.unmapped
        
        # Total reads is the sum of mapped and unmapped reads
        total_reads = total_mapped + total_unmapped
    
    return total_reads

In [4]:
# get total rows to decide chunk sizes
day5_6_total_reads = count_total_reads(day5_6bam_file)
print("Total reads in BAM file:", day5_6_total_reads)

Total reads in BAM file: 508357352


[W::hts_idx_load3] The index file is older than the data file: /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_day5_6/outs/atac_possorted_bam.bam.bai


#### Convert the BAMs to SAMs with samtools so the files can be read conveniently

In [5]:
# path to day specific sam files
day0_2sam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day0_2/atac_possorted.sam"
day3_4sam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day3_4/atac_possorted.sam"
day5_6sam_file = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day5_6/atac_possorted.sam"

In [15]:
############# Sam file helper functions ################

import pysam
import csv

# Function to view the first few reads (head) of a SAM file
def view_sam_file(sam_file_path, num_reads=10):
    with pysam.AlignmentFile(sam_file_path, "r") as sam_file:
        # Iterate over each read in the SAM file and limit by num_reads
        for i, read in enumerate(sam_file):
            if i >= num_reads:
                break
            print(read)


In [8]:
print(f"\nHead of {day0_2sam_file}:\n")
view_sam_file(day0_2sam_file, num_reads=50)


Head of /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day0_2/atac_possorted.sam:

A00522:234:HMMCHDRX2:2:2263:27642:11381	99	#0	9997	0	50M	#0	10179	231	CCCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 25, 37])	[('NM', 1), ('MD', '2G47'), ('AS', 47), ('XS', 47), ('CR', 'TGAGATTCATATGGTG'), ('CY', 'FFFFFFFFFFFFFFFF'), ('CB', 'CTTGCATGTTTATTCG-1'), ('BC', 'GGCGTTTC'), ('QT', 'FFFFFFFF'), ('RG', 'multiome_1st_donor_UPMC_day0_2:MissingLibrary:1:HMMCHDRX2:2')]
A00522:234:HMMCHDRX2:1:2117:20021:1908	147	#0	9997	0	15S34M	#0	10004	-27	CTAAGGCTAACGATACCGATAACACTAACCCTAACCATAACCCTAACCC	array('B', [11, 25, 37, 11, 11, 11, 37, 37, 11, 37, 37, 11, 11, 37, 11, 11, 11, 11, 37, 37, 11, 11, 11, 11, 25, 37, 37, 37, 11, 11, 11, 25, 37, 11, 11, 25,

In [9]:
print(f"\nHead of {day3_4sam_file}:\n")
view_sam_file(day3_4sam_file, num_reads=50)


Head of /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day3_4/atac_possorted.sam:

A00522:234:HMMCHDRX2:2:2125:7464:13636	83	#0	9997	0	17S33M	#0	10004	-26	CTTATAGATTTGAATAACCGATAACCCTAACCCTAACCCTAACCCTAACC	array('B', [25, 11, 11, 11, 11, 37, 25, 37, 37, 11, 25, 37, 11, 11, 37, 37, 37, 11, 11, 11, 11, 11, 37, 11, 25, 37, 11, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 11, 25, 37, 37])	[('NM', 0), ('MD', '33'), ('AS', 33), ('XS', 31), ('CR', 'AGCCTAGAGGCATGGC'), ('CY', 'FFFFF,FFFFFFFFFF'), ('CB', 'GCACGGTTCTCACAAA-1'), ('BC', 'AGGCTACC'), ('QT', 'FFFFFFFF'), ('RG', 'multiome_1st_donor_UPMC_day3_4:MissingLibrary:1:HMMCHDRX2:2')]
A00522:234:HMMCHDRX2:1:2220:20148:16564	147	#0	9997	2	15S34M	#0	10004	-27	CTAAGCCTTTGCCTACCGATAACCCTAACCCTAACCCTAACCCTAACCC	array('B', [11, 37, 11, 37, 11, 37, 11, 37, 11, 11, 11, 11, 25, 37, 37, 11, 11, 11, 11, 11, 37, 11, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37

In [16]:
print(f"\nHead of {day5_6sam_file}:\n")
view_sam_file(day5_6sam_file, num_reads=50)


Head of /ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_day5_6/atac_possorted.sam:

A00522:234:HMMCHDRX2:2:2261:7401:5071	147	#0	9995	0	3M1I45M	#0	10021	-22	GTCACGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC	array('B', [11, 37, 11, 11, 25, 11, 11, 11, 11, 37, 37, 37, 37, 25, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NM', 1), ('MD', '48'), ('AS', 45), ('XS', 43), ('CR', 'GTATTCGTCAGCCTTA'), ('CY', 'FFFFFFFFFFFFFFFF'), ('CB', 'GTGTCCAAGGTTTGCG-1'), ('BC', 'TTCTACAG'), ('QT', 'FFFFFFFF'), ('RG', 'multiome_1st_donor_UPMC_day5_6:MissingLibrary:1:HMMCHDRX2:2')]
A00522:234:HMMCHDRX2:1:2203:9941:4398	99	#0	9997	0	50M	#0	10114	160	CCCATAACCCTAACCCTAACCCTAACCCTAAACCAAACCCAAACCCTAAA	array('B', [37, 11, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 11, 37, 11, 25, 37, 25, 11, 37, 37, 11,

#### Count unique barcodes in the SAM files using samtools (sbatch)
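The counting itself was done with samtools in an sbatch job. For illustration only, the same idea in Python: each SAM alignment line has 11 mandatory tab-separated fields followed by optional tags, and the cell barcode is stored in the `CB:Z:` tag, as seen in the reads printed earlier. The function names are hypothetical.

```python
def cell_barcode(sam_line):
    """Return the CB tag value of one SAM line, or None if the line has no CB tag."""
    if sam_line.startswith("@"):
        return None  # header line
    optional_fields = sam_line.rstrip("\n").split("\t")[11:]
    for field in optional_fields:
        if field.startswith("CB:Z:"):
            return field[len("CB:Z:"):]
    return None


def unique_barcodes(sam_lines):
    """Collect the distinct cell barcodes found in an iterable of SAM lines."""
    found = (cell_barcode(line) for line in sam_lines)
    return {barcode for barcode in found if barcode is not None}
```

Passing an open file handle, as in `unique_barcodes(open(sam_path))`, streams the file line by line, which matters for SAM files with hundreds of millions of reads.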

In [1]:
# files to load up atac_barcodes
day0_2sam_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_barcodes_day0_2/cell_barcodes.txt"
day3_4sam_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_barcodes_day3_4/cell_barcodes.txt"
day5_6sam_barcodes = "/ocean/projects/cis240075p/asachan/datasets/B_Cell/multiome_1st_donor_UPMC_aggr/sorting_atac_outs/sam_barcodes_day5_6/cell_barcodes.txt"

In [2]:
# load in pandas to inspect
import pandas as pd
day0_2sam_barcodes = pd.read_csv(day0_2sam_barcodes, header=None, index_col=None)
print(day0_2sam_barcodes.head())
print(day0_2sam_barcodes.shape)

                    0
0  AAACAGCCAAACAACA-1
1  AAACAGCCAAACATAG-1
2  AAACAGCCAAACCTAT-1
3  AAACAGCCAAACCTTG-1
4  AAACAGCCAAACGCGA-1
(538262, 1)


In [3]:
day3_4sam_barcodes = pd.read_csv(day3_4sam_barcodes, header=None, index_col=None)
print(day3_4sam_barcodes.head())
print(day3_4sam_barcodes.shape)

                    0
0  AAACAGCCAAACAACA-1
1  AAACAGCCAAACATAG-1
2  AAACAGCCAAACCCTA-1
3  AAACAGCCAAACCTAT-1
4  AAACAGCCAAACCTTG-1
(555633, 1)


In [4]:
day5_6sam_barcodes = pd.read_csv(day5_6sam_barcodes, header=None, index_col=None)
print(day5_6sam_barcodes.head())
print(day5_6sam_barcodes.shape)

                    0
0  AAACAGCCAAACAACA-1
1  AAACAGCCAAACATAG-1
2  AAACAGCCAAACCCTA-1
3  AAACAGCCAAACCTAT-1
4  AAACAGCCAAACCTTG-1
(530700, 1)


In [5]:
# check unique cell barcodes
day0_2_cells = set(day0_2sam_barcodes[0])
day3_4_cells = set(day3_4sam_barcodes[0])
day5_6_cells = set(day5_6sam_barcodes[0])

unique_cells = day0_2_cells.union(day3_4_cells, day5_6_cells)
print(f"Total unique cells: {len(unique_cells)}")

Total unique cells: 676269


### Prepare an expression.tsv for each time point, so that the sorting code can use it as a reference when slicing the BAMs
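One possible way to derive per-time-point tables is to split the aggregated expression columns by barcode suffix. The sketch below works on in-memory rows and is not the script used in this pipeline. The function name `split_by_suffix` and the `relabel_to` option are hypothetical; the option exists because the per-day BAMs tag every barcode with `-1`.

```python
from collections import defaultdict


def split_by_suffix(barcodes, rows, relabel_to=None):
    """Split an expression table by barcode suffix.

    barcodes: column names such as 'AAACAGCCAAACCTTG-1'.
    rows: iterable of (gene, counts), with counts aligned to barcodes.
    relabel_to: if given, rewrite every output barcode suffix to this value.
    Returns {suffix: (header, [(gene, counts), ...])}.
    """
    columns = defaultdict(list)
    for index, barcode in enumerate(barcodes):
        columns[barcode.rsplit("-", 1)[-1]].append(index)

    rows = list(rows)
    tables = {}
    for suffix, indexes in columns.items():
        header = [barcodes[i] for i in indexes]
        if relabel_to is not None:
            header = [bc.rsplit("-", 1)[0] + "-" + relabel_to for bc in header]
        body = [(gene, [counts[i] for i in indexes]) for gene, counts in rows]
        tables[suffix] = (header, body)
    return tables
```

For the full matrix (24026 genes by 36306 cells), selecting columns in pandas by the same suffix rule would be more practical than Python lists; the sketch only shows the grouping logic.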