# Import module

ImageAnalysis3 can be obtained from its repository, [ImageAnalysis3](https://github.com/zhengpuas47/ImageAnalysis3), or from the archived Zhuang lab [source_tools](https://github.com/ZhuangLab/Chromatin_Analysis_2020_cell/tree/master/sequential_tracing/source).

## ImageAnalysis3 and basic modules

In [1]:
%run "C:\Users\shiwei\Documents\ImageAnalysis3\required_files\Startup_py3.py"
sys.path.append(r"C:\Users\shiwei\Documents")

import ImageAnalysis3 as ia
from ImageAnalysis3 import *
from ImageAnalysis3.classes import _allowed_kwds

import h5py
import ast
import pandas as pd

print(os.getpid())

18992


## Chromatin_analysis_tools etc

See **functions** in the repository for [AnalysisTool_Chromatin](../../README.md)

In [2]:
# Chromatin_analysis_tools (ATC)
# Get path for the py containing functions
import os
import sys
import importlib
module_path =r'C:\Users\shiwei\Documents\AnalysisTool_Chromatin'
if module_path not in sys.path:
    sys.path.append(module_path)
    
# import relevant modules
import gene_selection 
importlib.reload(gene_selection)
import gene_to_loci
importlib.reload(gene_to_loci)
import gene_activity
importlib.reload(gene_activity)
import loci_1d_features
importlib.reload(loci_1d_features)  

import atac_to_loci
importlib.reload(atac_to_loci)

<module 'atac_to_loci' from 'C:\\Users\\shiwei\\Documents\\AnalysisTool_Chromatin\\atac_to_loci.py'>
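
The import-then-reload pattern in the cell above can be wrapped in a small helper so that edits to the tool modules are picked up without restarting the kernel. This is a minimal sketch: `import_and_reload` is a hypothetical name, not part of AnalysisTool_Chromatin, and it is demonstrated on standard-library modules so that it runs without the repository on the path.

```python
import importlib
import sys


def import_and_reload(module_names, extra_path=None):
    """Import each named module, then reload it to pick up source edits."""
    # Optionally make a local folder importable (hypothetical convenience argument)
    if extra_path is not None and extra_path not in sys.path:
        sys.path.append(extra_path)
    modules = {}
    for name in module_names:
        module = importlib.import_module(name)
        # reload re-executes the module source, so recent edits take effect
        modules[name] = importlib.reload(module)
    return modules


# Standard-library modules stand in for 'gene_selection', 'gene_to_loci', etc.
loaded = import_and_reload(['json', 'csv'])
print(sorted(loaded))
```

In the notebook, the list would hold the tool module names and `extra_path` would be `module_path`.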

# Define folders

In [3]:
# main folder for postanalysis
postanalysis_folder = r'L:\Shiwei\postanalysis_2024\v0'
# input files for postanalysis
input_folder = os.path.join(postanalysis_folder, 'resources_from_preprocess')

# output folders to be generated
output_main_folder = os.path.join(postanalysis_folder, 'locus_annotation')

output_analysis_folder = os.path.join(output_main_folder, 'analysis')
output_figure_folder = os.path.join(output_main_folder, 'figures')

# make new folder if needed
make_output_folder = True

if make_output_folder and not os.path.exists(output_analysis_folder):
    os.makedirs(output_analysis_folder)
    print(f'Generating analysis folder: {output_analysis_folder}.')
elif os.path.exists(output_analysis_folder):
    print(f'Use existing analysis folder: {output_analysis_folder}.')
    
if make_output_folder and not os.path.exists(output_figure_folder):
    os.makedirs(output_figure_folder)
    print(f'Generating figure folder: {output_figure_folder}.')
elif os.path.exists(output_figure_folder):
    print(f'Use existing figure folder: {output_figure_folder}.')

Use existing analysis folder: L:\Shiwei\postanalysis_2024\v0\locus_annotation\analysis.
Use existing figure folder: L:\Shiwei\postanalysis_2024\v0\locus_annotation\figures.
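
The two near-identical `if`/`elif` blocks above can be folded into one helper. The sketch below assumes folders should always be created when missing, so it drops the `make_output_folder` switch. `ensure_folder` is a hypothetical name, and the demonstration uses a temporary directory instead of the `L:` drive.

```python
import os
import tempfile


def ensure_folder(folder, label='output'):
    """Create `folder` if it is missing and report which case applied."""
    existed = os.path.isdir(folder)
    # exist_ok=True makes the call safe when the folder is already there
    os.makedirs(folder, exist_ok=True)
    if existed:
        print(f'Use existing {label} folder: {folder}.')
    else:
        print(f'Generating {label} folder: {folder}.')
    return folder


# Demonstrated in a temporary directory rather than on the real drive
with tempfile.TemporaryDirectory() as tmp_root:
    main_folder = os.path.join(tmp_root, 'locus_annotation')
    ensure_folder(os.path.join(main_folder, 'analysis'), 'analysis')
    ensure_folder(os.path.join(main_folder, 'figures'), 'figure')
    created = sorted(os.listdir(main_folder))

print(created)
```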


# Plotting parameters

In [4]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['pdf.fonttype'] = 42
import matplotlib.pyplot as plt
plt.rc('font', family='serif')
plt.rc('font', serif='Arial')

from ImageAnalysis3.figure_tools import _double_col_width, _single_col_width, _font_size, _ticklabel_size,_ticklabel_width

import seaborn as sns
sns.set_context("paper", rc={"font.size":_font_size,"axes.titlesize":_font_size+1,"axes.labelsize":_font_size})  

# Load data-relevant information

## load codebook with peak annotation

The annotated codebook can be generated, in the same way as for the ATAC data, with the notebook below:

[external/scripts/pair_tag/2_adjacent_h3k27me3_peak_annotation_for_merfish_loci](../../../external/scripts/pair_tag/2_adjacent_h3k27me3_peak_annotation_for_merfish_loci.ipynb)

Information related to ATAC can be found in this folder of the repository:
[ATACseq_MOp_folder](../../../external/scripts/sn_atac)

In [5]:
# load codebook
codebook_folder = output_analysis_folder
target_mode = 'H3K9me3'
# Load codebook and sort
codebook_fname = os.path.join(codebook_folder,f'MERFISH_loci_adjacent_{target_mode}_center.csv')
codebook_df = pd.read_csv (codebook_fname, index_col=0)
codebook_df = loci_1d_features.sort_loci_df_by_chr_order (codebook_df)

codebook_df.head()

| loci_name | name | id | chr | chr_order | library | adjacent_peaks_2000kb_center |
|---|---|---|---|---|---|---|
| chr1_3742742_3759944 | 1:3742742-3759944 | 1 | 1 | 0 | CTP11 | chr1_3000000_3005000; chr1_3005000_3010000; ch... |
| chr1_6245958_6258969 | 1:6245958-6258969 | 2 | 1 | 1 | CTP11 | chr1_4245000_4250000; chr1_4250000_4255000; ch... |
| chr1_8740008_8759916 | 1:8740008-8759916 | 3 | 1 | 2 | CTP11 | chr1_10000000_10005000; chr1_10005000_10010000... |
| chr1_9627926_9637875 | 1:9627926-9637875 | 1 | 1 | 3 | CTP13 | chr1_10000000_10005000; chr1_10005000_10010000... |
| chr1_9799472_9811359 | 1:9799472-9811359 | 2 | 1 | 4 | CTP13 | chr1_10000000_10005000; chr1_10005000_10010000... |
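
Each `adjacent_peaks_2000kb_center` entry is a single string of peak names joined by `'; '`, and each peak name encodes chromosome, start and end. The sketch below shows one way to unpack such an entry. `parse_region` and `split_adjacent_peaks` are hypothetical helpers, and the example string only mimics the visible beginning of the first row, since the real entries are truncated in the display above.

```python
def parse_region(region_name):
    """Split a name such as 'chr1_3000000_3005000' into (chr, start, end)."""
    # rsplit keeps any underscore inside the chromosome name intact
    chrom, start, end = region_name.rsplit('_', 2)
    return chrom, int(start), int(end)


def split_adjacent_peaks(cell_value, sep='; '):
    """Turn the '; '-joined codebook string into a list of peak names."""
    return [p for p in cell_value.split(sep) if p]


# Toy value; the real codebook entries are much longer
example = 'chr1_3000000_3005000; chr1_3005000_3010000'
peaks = split_adjacent_peaks(example)
parsed = [parse_region(p) for p in peaks]
print(parsed)
```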


## load analyzed Paired-Tag AnnData

The AnnData object can be generated with the notebook below (shown here for H3K27ac):

[1_prepare_pairtag_fc_h3k27ac_adata](1_prepare_pairtag_fc_h3k27ac_adata.ipynb)

For other histone marks, use the corresponding notebook(s) in the same folder.

In [6]:
# Get loaded adata from other notebook
import os
import scanpy as sc
# L drive is Crick Pu_SSD_0
scRNA_folder = r'L:\Shiwei\DNA_MERFISH_analysis\Paired_tag\anndata'
# load from here for saved h5ad
#adata = sc.read(os.path.join(scRNA_folder,r'MOp_ATAC_combined_preprocessed.h5ad'))
adata = sc.read(os.path.join(scRNA_folder,f'FC_pairtag_{target_mode}.h5ad'))

In [7]:
print(np.max(adata.X))

76.0


In [8]:
adata.obs.head()

| Cell_ID | Tissue | Rep | Target | Total_RNA_Reads | Mapped_RNA_Reads | Uniquely_Mapped_RNA_Reads | UMI_RNA | nGene_RNA | Total_DNA_Reads | Mapped_DNA_Reads | Uniquely_Mapped_DNA_Reads | nFragments_DNA | Membership | Annotation | RNA_UMAP_1 | RNA_UMAP_2 | DNA_UMAP_1 | DNA_UMAP_2 | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21:02:36:07 | FC | 1 | H3K9me3 | 34739 | 30315 | 29003 | 18466 | 2640 | 39652 | 30243 | 11976 | 14847 | 4 | FC_ExNeu_PT | 1.185773 | -4.536413 | 1.002368 | 1.100115 | L5 ET |
| 21:02:38:09 | FC | 3 | H3K9me3 | 25685 | 22796 | 21976 | 14307 | 2422 | 11218 | 7159 | 2134 | 3873 | 1 | FC_ExNeu_L23 | -2.796159 | -8.111693 | 2.278006 | -1.729979 | L2/3 IT |
| 21:02:53:07 | FC | 1 | H3K9me3 | 5343 | 4322 | 3908 | 2797 | 683 | 35357 | 32062 | 14372 | 16174 | 14 | BR_InNeu_Pvalb | -5.499069 | 2.459688 | 0.244727 | 2.480219 | Pvalb |
| 21:03:16:09 | FC | 3 | H3K9me3 | 13878 | 11969 | 11428 | 7649 | 1579 | 13872 | 11038 | 3753 | 5550 | 20 | BR_NonNeu_Microglia | 0.676187 | 2.337654 | -4.075477 | 0.700106 | Micro |
| 21:03:19:07 | FC | 1 | H3K9me3 | 5217 | 4412 | 4175 | 2820 | 704 | 15960 | 13151 | 5214 | 6700 | 12 | BR_InNeu_CGE | -3.82065 | 3.251819 | -0.88357 | 1.537417 | CGE |


In [9]:
adata.var.head()

chr10_10000000_10005000
chr10_100000000_100005000
chr10_100005000_100010000
chr10_100010000_100015000
chr10_100015000_100020000


# Extract peak numbers

In [10]:
# rename the index to 'Imaged_loci' before looking up nearby genes
import gene_to_loci as gl

imaged_loci_df = codebook_df.copy(deep=True)

imaged_loci_df.index.name = 'Imaged_loci'
imaged_loci_df = gl.direct_get_genes_near_gene_dataframe (imaged_loci_df,
                                   codebook_df, 
                                   adjacent_gene_col = None)


imaged_loci_df

Get all existing adjacent gene columns.


| Imaged_loci | name | id | chr | chr_order | library | adjacent_peaks_2000kb_center |
|---|---|---|---|---|---|---|
| chr1_3742742_3759944 | 1:3742742-3759944 | 1 | 1 | 0 | CTP11 | chr1_3000000_3005000; chr1_3005000_3010000; ch... |
| chr1_6245958_6258969 | 1:6245958-6258969 | 2 | 1 | 1 | CTP11 | chr1_4245000_4250000; chr1_4250000_4255000; ch... |
| chr1_8740008_8759916 | 1:8740008-8759916 | 3 | 1 | 2 | CTP11 | chr1_10000000_10005000; chr1_10005000_10010000... |
| chr1_9627926_9637875 | 1:9627926-9637875 | 1 | 1 | 3 | CTP13 | chr1_10000000_10005000; chr1_10005000_10010000... |
| chr1_9799472_9811359 | 1:9799472-9811359 | 2 | 1 | 4 | CTP13 | chr1_10000000_10005000; chr1_10005000_10010000... |
| ... | ... | ... | ... | ... | ... | ... |
| chrX_166247682_166259932 | X:166247682-166259932 | 1059 | X | 60 | CTP11 | chrX_164250000_164255000; chrX_164255000_16426... |
| chrX_167157164_167167452 | X:167157164-167167452 | 990 | X | 61 | CTP13 | chrX_165155000_165160000; chrX_165160000_16516... |
| chrX_168746045_168757590 | X:168746045-168757590 | 1060 | X | 62 | CTP11 | chrX_166745000_166750000; chrX_166750000_16675... |
| chrX_169963295_170005197 | X:169963295-170005197 | 991 | X | 63 | CTP13 | chrX_167965000_167970000; chrX_167970000_16797... |
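
The table above runs from chr1 to chrX, with `chr_order` counting loci within each chromosome. The sketch below shows one way to produce such an ordering from the `chr` and `chr_order` columns. It is not the implementation of `loci_1d_features.sort_loci_df_by_chr_order`: `chr_sort_key` and `sort_loci` are hypothetical names, and the rule "numeric chromosomes first, then X and Y" is an assumption. Two rows are invented (and named accordingly) to show that chr2 sorts before chr10.

```python
def chr_sort_key(chrom):
    """Order numeric chromosomes numerically, then X, Y and anything else."""
    name = chrom[3:] if chrom.startswith('chr') else chrom
    if name.isdigit():
        return (0, int(name), '')
    return (1, 0, name)


def sort_loci(loci):
    """Sort (chr, chr_order, loci_name) tuples along the genome."""
    return sorted(loci, key=lambda rec: (chr_sort_key(rec[0]), rec[1]))


# Shuffled rows: three from the table above, two hypothetical
loci = [
    ('X', 60, 'chrX_166247682_166259932'),
    ('10', 0, 'hypothetical_chr10_locus'),
    ('1', 1, 'chr1_6245958_6258969'),
    ('2', 0, 'hypothetical_chr2_locus'),
    ('1', 0, 'chr1_3742742_3759944'),
]
sorted_names = [rec[2] for rec in sort_loci(loci)]
print(sorted_names)
```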


In [11]:
imaged_loci_df.columns

Index(['name', 'id', 'chr', 'chr_order', 'library',
       'adjacent_peaks_2000kb_center'],
      dtype='object')

## process for subclasses

In [12]:
#groupby_adata = 'subclass_label_new'
groupby_adata = 'cluster'
# pass/change variable name
adata_ori = adata

np.unique(list(adata_ori.obs[groupby_adata]))
sel_class_to_process = [c for c in np.unique(list(adata_ori.obs[groupby_adata])) if c!='nan']
sel_class_to_process

['Astro',
 'CGE',
 'Endo',
 'L2/3 IT',
 'L4/5 IT',
 'L5 ET',
 'L5 IT',
 'L5/6 NP',
 'L6 CT',
 'Micro',
 'OPC',
 'Oligo',
 'Pvalb',
 'Sst']
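
Several of these labels contain `/` or spaces, so the processing loop replaces both with underscores before building the per-group CSV name. The sketch below isolates that step, following the file-name pattern used in the loop. `group_to_fname` and `build_output_fname` are hypothetical helper names, and the folder argument is a placeholder.

```python
import os


def group_to_fname(group):
    """Replace characters that are awkward in file names ('/' and ' ') with '_'."""
    return group.replace('/', '_').replace(' ', '_')


def build_output_fname(folder, target_mode, bin_size, group):
    """Assemble the per-group CSV name following the pattern used in the loop."""
    basename = (
        f'MERFISH_loci_{target_mode}_2X_{bin_size}kb_for_'
        f'{group_to_fname(group)}.csv'
    )
    return os.path.join(folder, basename)


# 'subclass' is a placeholder folder name for the demonstration
fname = build_output_fname('subclass', 'H3K9me3', 2000, 'L2/3 IT')
print(os.path.basename(fname))
```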

In [13]:
# output_folder
output_folder = os.path.join(output_analysis_folder, 'Pairtag', 'subclass')
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
    print ('Generate output folder')

In [14]:
%matplotlib inline

import gene_activity
import loci_1d_features
from scipy import stats
import seaborn as sns
from tqdm import tqdm

bin_size = 2000  # window extension in kb, applied in both directions
adjcent_col = f'adjacent_peaks_{bin_size}kb_center'
activity_type = 'sum'  # sum over all genes (here: peaks) associated with a locus, per single cell

expression_res_df_dict = {}
sel_class_to_process = [c for c in np.unique(list(adata_ori.obs[groupby_adata])) if c!='nan']

# simple loop
for _group in sel_class_to_process[:]:
    _group_fname = _group.replace('/','_').replace(' ','_')
    
    print (f'Process pairtag data for {_group}')
    
    expression_res_dict={}
    sorted_group_order = [_group]

    imaged_loci_df_group = loci_1d_features.codebook_chr_order_for_loci_dataframe (imaged_loci_df, 
                                               codebook_df, 
                                               sel_cols =['chr','chr_order','id'], 
                                               sort_df = True,
                                               sort_by_chr=True)

    loci_key_list = loci_1d_features.sorted_loci_keys_for_loci_dataframe(imaged_loci_df_group)
    loci_ori_ind = loci_1d_features.find_chr_loci_iloc_from_loci_keys (codebook_df, loci_key_list)

    # for loci along the chromosome, append the measurements for each single cell
    for _ind, sel_loci_ind in tqdm(enumerate(imaged_loci_df_group.index.tolist()[:])):

        # get adjacent gene expression (which are peaks in this case)
        sel_genes=imaged_loci_df_group.loc[sel_loci_ind][adjcent_col].split('; ')
        sel_adata =  adata_ori[:,adata_ori.var.index.isin(sel_genes)]
        marker_expressions = gene_activity.gene_activity_raw_groups(sel_genes,
                            sel_adata, 
                            sorted_group_order,
                            groupby_adata,
                            ref_norm_list = [],
                            report_type =activity_type)

        expression_res_dict[_ind]=list(marker_expressions[_group])
        
    # convert dict to df as loci by cell
    expression_res_df = pd.DataFrame.from_dict(expression_res_dict, orient='index')
    expression_res_df['loci_name']=codebook_df.index.tolist()
    expression_res_df = expression_res_df.set_index ('loci_name')
    expression_res_df_dict[_group]=expression_res_df
    # save
    output_df_fname = os.path.join(output_folder, f'MERFISH_loci_{target_mode}_2X_{bin_size}kb_for_{_group_fname}.csv')
    expression_res_df.to_csv(output_df_fname)
    print ('=========================================================')


Process pairtag data for Astro
1982it [01:24, 23.44it/s]

Process pairtag data for CGE
1982it [01:18, 25.13it/s]

Process pairtag data for Endo
1982it [01:10, 28.06it/s]

Process pairtag data for L2/3 IT
1982it [02:12, 14.91it/s]

Process pairtag data for L4/5 IT
1982it [01:37, 20.43it/s]

Process pairtag data for L5 ET
1982it [01:27, 22.70it/s]

Process pairtag data for L5 IT
1982it [01:52, 17.56it/s]

Process pairtag data for L5/6 NP
1982it [01:10, 28.19it/s]

Process pairtag data for L6 CT
1982it [01:36, 20.52it/s]

Process pairtag data for Micro
1982it [01:16, 26.06it/s]

Process pairtag data for OPC
1982it [01:10, 28.21it/s]

Process pairtag data for Oligo
1982it [01:10, 28.02it/s]

Process pairtag data for Pvalb
1982it [01:24, 23.32it/s]

Process pairtag data for Sst
1982it [01:19, 24.89it/s]
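
For each locus, the loop above selects the adjacent peaks and, with `activity_type = 'sum'`, stores one summed value per cell of the current group. The sketch below illustrates that reduction on plain Python lists. It is an assumption about what the `'sum'` report type means, not the code of `gene_activity.gene_activity_raw_groups`. `sum_peaks_per_cell` is a hypothetical name, and all counts are toy values.

```python
def sum_peaks_per_cell(counts, peak_names, cell_groups, sel_peaks, group):
    """Sum counts over `sel_peaks` for every cell labelled `group`.

    counts      : list of rows, one row per cell, one value per peak
    peak_names  : peak name for each column of `counts`
    cell_groups : group label for each row of `counts`
    """
    wanted = set(sel_peaks)
    # Column positions of the selected peaks; names absent from the data are skipped
    cols = [j for j, name in enumerate(peak_names) if name in wanted]
    return [
        sum(row[j] for j in cols)
        for row, label in zip(counts, cell_groups)
        if label == group
    ]


# Toy data: 3 cells x 3 peaks (hypothetical values)
peak_names = ['chr1_3000000_3005000', 'chr1_3005000_3010000', 'chr2_3000000_3005000']
counts = [
    [1.0, 2.0, 5.0],  # Astro cell
    [0.0, 4.0, 1.0],  # Sst cell
    [3.0, 0.0, 2.0],  # Astro cell
]
cell_groups = ['Astro', 'Sst', 'Astro']
# The last name is not in the data and is ignored
sel_peaks = ['chr1_3000000_3005000', 'chr1_3005000_3010000', 'chr1_9999999_10000000']

astro_sums = sum_peaks_per_cell(counts, peak_names, cell_groups, sel_peaks, 'Astro')
sst_sums = sum_peaks_per_cell(counts, peak_names, cell_groups, sel_peaks, 'Sst')
print(astro_sums, sst_sums)
```

Collecting one such list per locus gives the loci-by-cell table that is saved for each group.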




In [15]:
expression_res_df.head()

| loci_name | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1_3742742_3759944 | 6.0 | 4.0 | 0.0 | 2.0 | 5.0 | 0.0 | 14.0 | 6.0 | 1.0 | 5.0 | ... | 35.0 | 15.0 | 4.0 | 0.0 | 3.0 | 6.0 | 0.0 | 0.0 | 5.0 | 0.0 |
| chr1_6245958_6258969 | 4.0 | 10.0 | 7.0 | 28.0 | 16.0 | 5.0 | 19.0 | 11.0 | 2.0 | 5.0 | ... | 69.0 | 37.0 | 7.0 | 10.0 | 16.0 | 15.0 | 4.0 | 2.0 | 23.0 | 5.0 |
| chr1_8740008_8759916 | 6.0 | 4.0 | 3.0 | 23.0 | 7.0 | 5.0 | 7.0 | 11.0 | 3.0 | 1.0 | ... | 34.0 | 27.0 | 8.0 | 17.0 | 25.0 | 13.0 | 8.0 | 3.0 | 11.0 | 4.0 |
| chr1_9627926_9637875 | 7.0 | 2.0 | 5.0 | 15.0 | 4.0 | 6.0 | 7.0 | 12.0 | 3.0 | 1.0 | ... | 20.0 | 19.0 | 9.0 | 20.0 | 16.0 | 11.0 | 8.0 | 2.0 | 12.0 | 4.0 |
| chr1_9799472_9811359 | 7.0 | 2.0 | 5.0 | 15.0 | 4.0 | 6.0 | 7.0 | 12.0 | 2.0 | 1.0 | ... | 20.0 | 18.0 | 9.0 | 19.0 | 16.0 | 11.0 | 8.0 | 1.0 | 7.0 | 4.0 |
