# Import module

ImageAnalysis3 is available at [ImageAnalysis3](https://github.com/zhengpuas47/ImageAnalysis3).

An archived copy is also available from the Zhuang lab: [source_tools](https://github.com/ZhuangLab/Chromatin_Analysis_2020_cell/tree/master/sequential_tracing/source)

## ImageAnalysis3 and basic modules

In [1]:
%run "C:\Users\shiwei\Documents\ImageAnalysis3\required_files\Startup_py3.py"
sys.path.append(r"C:\Users\shiwei\Documents")

import ImageAnalysis3 as ia
from ImageAnalysis3 import *
from ImageAnalysis3.classes import _allowed_kwds

import h5py
import ast
import pandas as pd

print(os.getpid())

23784


## Chromatin_analysis_tools etc

See **functions** in the [AnalysisTool_Chromatin](../../README.md) repository.

In [2]:
# Chromatin_analysis_tools (ATC)
# Add the folder that contains the function .py files to sys.path
import os
import sys
import importlib
module_path =r'C:\Users\shiwei\Documents\AnalysisTool_Chromatin'
if module_path not in sys.path:
    sys.path.append(module_path)
    
# import relevant modules
import gene_selection 
importlib.reload(gene_selection)
import gene_to_loci
importlib.reload(gene_to_loci)
import gene_activity
importlib.reload(gene_activity)
import loci_1d_features
importlib.reload(loci_1d_features)  

import atac_to_loci
importlib.reload(atac_to_loci)

<module 'atac_to_loci' from 'C:\\Users\\shiwei\\Documents\\AnalysisTool_Chromatin\\atac_to_loci.py'>
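
The repeated import/reload pairs above can be folded into one helper. The sketch below is only an illustration: `import_fresh` is a hypothetical name that does not exist in AnalysisTool_Chromatin, and the demo uses two standard-library modules so it runs without the repository on `sys.path`.

```python
import importlib
import sys

def import_fresh(module_names):
    """Import each module by name; reload it if it was already imported.

    Hypothetical helper, not part of AnalysisTool_Chromatin.
    """
    modules = {}
    for name in module_names:
        if name in sys.modules:
            # already imported in this session: pick up edits made on disk
            modules[name] = importlib.reload(sys.modules[name])
        else:
            modules[name] = importlib.import_module(name)
    return modules

# demo with standard-library modules standing in for the analysis modules
loaded = import_fresh(['colorsys', 'textwrap'])
print(sorted(loaded))
```

In this notebook the call would presumably take names such as `'gene_selection'` or `'atac_to_loci'`, once `module_path` has been appended to `sys.path`. The modules are then reached through the returned dict instead of bare names.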

# Define folders

In [3]:
# main folder for postanalysis
postanalysis_folder = r'L:\Shiwei\postanalysis_2024\v0'
# input files for postanalysis
input_folder = os.path.join(postanalysis_folder, 'resources_from_preprocess')

# output folders to be generated
output_main_folder = os.path.join(postanalysis_folder, 'locus_annotation')

output_analysis_folder = os.path.join(output_main_folder, 'analysis')
output_figure_folder = os.path.join(output_main_folder, 'figures')

# make new folder if needed
make_output_folder = True

if make_output_folder and not os.path.exists(output_analysis_folder):
    os.makedirs(output_analysis_folder)
    print(f'Generating analysis folder: {output_analysis_folder}.')
elif os.path.exists(output_analysis_folder):
    print(f'Use existing analysis folder: {output_analysis_folder}.')
    
if make_output_folder and not os.path.exists(output_figure_folder):
    os.makedirs(output_figure_folder)
    print(f'Generating figure folder: {output_figure_folder}.')
elif os.path.exists(output_figure_folder):
    print(f'Use existing figure folder: {output_figure_folder}.')

Use existing analysis folder: L:\Shiwei\postanalysis_2024\v0\locus_annotation\analysis.
Use existing figure folder: L:\Shiwei\postanalysis_2024\v0\locus_annotation\figures.
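
The two near-identical create-or-reuse blocks above could share one function. This is a minimal sketch under the assumption that the same messages are wanted for every folder; `ensure_folder` is a hypothetical name, and the demo writes into a temporary directory, not the `L:` drive.

```python
import os
import tempfile

def ensure_folder(path, create=True):
    """Return True if `path` exists after the call; create it when allowed.

    Hypothetical helper, not part of the original notebook.
    """
    if os.path.isdir(path):
        print(f'Use existing folder: {path}.')
        return True
    if create:
        os.makedirs(path, exist_ok=True)
        print(f'Generating folder: {path}.')
        return True
    print(f'Folder is missing and was not created: {path}.')
    return False

# demo in a temporary directory that stands in for postanalysis_folder
demo_root = tempfile.mkdtemp()
demo_analysis = os.path.join(demo_root, 'locus_annotation', 'analysis')
created = ensure_folder(demo_analysis)   # first call creates the folder
reused = ensure_folder(demo_analysis)    # second call finds it
```

Unlike the cell above, this version also reports the case where the folder is missing and creation is switched off.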


# Plotting parameters

In [4]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['pdf.fonttype'] = 42
import matplotlib.pyplot as plt
plt.rc('font', family='serif')
plt.rc('font', serif='Arial')

from ImageAnalysis3.figure_tools import _double_col_width, _single_col_width, _font_size, _ticklabel_size,_ticklabel_width

import seaborn as sns
sns.set_context("paper", rc={"font.size":_font_size,"axes.titlesize":_font_size+1,"axes.labelsize":_font_size})  

# Load data relevant information

## load codebook with peak annotation

The annotated codebook can be generated with the notebook below, in the same way as for the ATAC data:

[external/scripts/pair_tag/2_adjacent_h3k27me3_peak_annotation_for_merfish_loci](../../../external/scripts/pair_tag/2_adjacent_h3k27me3_peak_annotation_for_merfish_loci.ipynb)

ATAC-related information can be found in this folder of the repository:
[ATACseq_MOp_folder](../../../external/scripts/sn_atac)

In [5]:
# load codebook
codebook_folder = output_analysis_folder
target_mode = 'H3K4me3'
# Load codebook and sort
codebook_fname = os.path.join(codebook_folder,f'MERFISH_loci_adjacent_{target_mode}_center.csv')
codebook_df = pd.read_csv (codebook_fname, index_col=0)
codebook_df = loci_1d_features.sort_loci_df_by_chr_order (codebook_df)

codebook_df.head()

loci_name,name,id,chr,chr_order,library,adjacent_peaks_2000kb_center
chr1_3742742_3759944,1:3742742-3759944,1,1,0,CTP11,chr1_3000000_3001000; chr1_3003000_3004000; ch...
chr1_6245958_6258969,1:6245958-6258969,2,1,1,CTP11,chr1_4248000_4249000; chr1_4249000_4250000; ch...
chr1_8740008_8759916,1:8740008-8759916,3,1,2,CTP11,chr1_10001000_10002000; chr1_10002000_10003000...
chr1_9627926_9637875,1:9627926-9637875,1,1,3,CTP13,chr1_10001000_10002000; chr1_10002000_10003000...
chr1_9799472_9811359,1:9799472-9811359,2,1,4,CTP13,chr1_10001000_10002000; chr1_10002000_10003000...
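
Each entry of `adjacent_peaks_2000kb_center` is a `'; '`-separated string of peak names, which the loop further down splits with `.split('; ')`. The sketch below shows one way to turn such a string into a list and to split a `chr_start_end` name into its parts. Both function names are hypothetical, and the example string reuses only the two complete peak names visible in the first row above.

```python
def parse_peak_list(cell):
    """Split a ';'-separated peak string into a list of peak names.

    Non-string input (for example a NaN from an empty CSV cell) gives [].
    """
    if not isinstance(cell, str):
        return []
    return [p.strip() for p in cell.split(';') if p.strip()]

def parse_region_name(name):
    """Split 'chr1_3000000_3001000' into ('chr1', 3000000, 3001000)."""
    chrom, start, end = name.rsplit('_', 2)
    return chrom, int(start), int(end)

example_cell = 'chr1_3000000_3001000; chr1_3003000_3004000'
peaks = parse_peak_list(example_cell)
regions = [parse_region_name(p) for p in peaks]
print(peaks)
print(regions)
```

Splitting from the right with `rsplit` keeps the function usable for chromosome names that themselves contain an underscore.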


## load analyzed paired-Tag AnnData

The AnnData can be generated using the notebook below, for example for H3K27ac:

[1_prepare_pairtag_fc_h3k27ac_adata](1_prepare_pairtag_fc_h3k27ac_adata.ipynb)

For other histone marks, use the corresponding notebook(s) in the same folder.

In [6]:
# Get loaded adata from other notebook
import os
import scanpy as sc
# L drive is Crick Pu_SSD_0
scRNA_folder = r'L:\Shiwei\DNA_MERFISH_analysis\Paired_tag\anndata'
# load from here for saved h5ad
#adata = sc.read(os.path.join(scRNA_folder,r'MOp_ATAC_combined_preprocessed.h5ad'))
adata = sc.read(os.path.join(scRNA_folder,f'FC_pairtag_{target_mode}.h5ad'))

In [7]:
print(np.max(adata.X))

17.0


In [8]:
adata.obs.head()

Cell_ID,Tissue,Rep,Target,Total_RNA_Reads,Mapped_RNA_Reads,Uniquely_Mapped_RNA_Reads,UMI_RNA,nGene_RNA,Total_DNA_Reads,Mapped_DNA_Reads,Uniquely_Mapped_DNA_Reads,nFragments_DNA,Membership,Annotation,RNA_UMAP_1,RNA_UMAP_2,DNA_UMAP_1,DNA_UMAP_2,cluster
41:03:23:10,FC,2,H3K4me3,16876,16109,15331,3936,508,5040,4602,4490,908,12,BR_InNeu_CGE,-3.804527,3.530817,-0.179855,-0.882941,CGE
41:03:57:11,FC,2,H3K4me3,2851,2583,2515,633,127,2519,2407,2278,411,15,BR_NonNeu_OPC,0.084456,5.195771,-0.614379,1.548475,OPC
41:04:22:04,FC,1,H3K4me3,4496,4106,3877,1029,177,10707,10317,9217,1990,2,FC_ExNeu_L4,-3.956812,-4.057654,-0.285618,-1.621552,L4/5 IT
41:04:36:11,FC,2,H3K4me3,2690,2505,2297,644,147,6422,6212,5920,983,19,BR_NonNeu_Astro_Nnat,-2.640411,10.285779,0.313026,3.098745,Astro
41:04:51:05,FC,1,H3K4me3,3275,3037,2886,770,143,8367,7945,7545,1630,13,BR_InNeu_Sst,-4.874347,2.274326,-0.430692,-0.232554,Sst


In [9]:
adata.var.head()

chr10_100002000_100003000
chr10_100004000_100005000
chr10_100007000_100008000
chr10_100008000_100009000
chr10_100009000_100010000
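
The variables are genomic bins named `chr_start_end`, the same naming used in the codebook's adjacent-peak column. How the annotation notebook assigns peaks to a locus is not shown here. The sketch below assumes that a peak counts as adjacent when its center falls inside the locus extended by 2000 kb on each side, which agrees with the rows displayed earlier but is not confirmed by the source. `peaks_near_locus` is a hypothetical name, and the last candidate peak is made up for the demo.

```python
def parse_region_name(name):
    """Split 'chr1_3000000_3001000' into ('chr1', 3000000, 3001000)."""
    chrom, start, end = name.rsplit('_', 2)
    return chrom, int(start), int(end)

def peaks_near_locus(locus_name, peak_names, extend_bp=2_000_000):
    """Return peaks whose center lies within the locus extended by extend_bp.

    ASSUMPTION: this window rule is a guess at what '2000kb_center' means.
    """
    l_chr, l_start, l_end = parse_region_name(locus_name)
    lo, hi = l_start - extend_bp, l_end + extend_bp
    selected = []
    for peak in peak_names:
        p_chr, p_start, p_end = parse_region_name(peak)
        center = (p_start + p_end) / 2
        if p_chr == l_chr and lo <= center <= hi:
            selected.append(peak)
    return selected

locus = 'chrX_168746045_168757590'
candidate_peaks = [
    'chrX_166746000_166747000',   # listed as adjacent to this locus above
    'chr10_100002000_100003000',  # different chromosome
    'chrX_100000000_100001000',   # made-up peak, far from the locus
]
nearby = peaks_near_locus(locus, candidate_peaks)
print(nearby)
```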


# Extract peak numbers

In [10]:
# rename the index to 'Imaged_loci' so that nearby genes can be looked up
import gene_to_loci as gl

imaged_loci_df = codebook_df.copy(deep=True)

imaged_loci_df.index.name = 'Imaged_loci'
imaged_loci_df = gl.direct_get_genes_near_gene_dataframe (imaged_loci_df,
                                   codebook_df, 
                                   adjacent_gene_col = None)


imaged_loci_df

Get all existing adjacent gene columns.


Imaged_loci,name,id,chr,chr_order,library,adjacent_peaks_2000kb_center
chr1_3742742_3759944,1:3742742-3759944,1,1,0,CTP11,chr1_3000000_3001000; chr1_3003000_3004000; ch...
chr1_6245958_6258969,1:6245958-6258969,2,1,1,CTP11,chr1_4248000_4249000; chr1_4249000_4250000; ch...
chr1_8740008_8759916,1:8740008-8759916,3,1,2,CTP11,chr1_10001000_10002000; chr1_10002000_10003000...
chr1_9627926_9637875,1:9627926-9637875,1,1,3,CTP13,chr1_10001000_10002000; chr1_10002000_10003000...
chr1_9799472_9811359,1:9799472-9811359,2,1,4,CTP13,chr1_10001000_10002000; chr1_10002000_10003000...
...,...,...,...,...,...,...
chrX_166247682_166259932,X:166247682-166259932,1059,X,60,CTP11,chrX_164254000_164255000; chrX_164255000_16425...
chrX_167157164_167167452,X:167157164-167167452,990,X,61,CTP13,chrX_165169000_165170000; chrX_165178000_16517...
chrX_168746045_168757590,X:168746045-168757590,1060,X,62,CTP11,chrX_166746000_166747000; chrX_166748000_16674...
chrX_169963295_170005197,X:169963295-170005197,991,X,63,CTP13,chrX_167963000_167964000; chrX_167965000_16796...


In [11]:
imaged_loci_df.columns

Index(['name', 'id', 'chr', 'chr_order', 'library',
       'adjacent_peaks_2000kb_center'],
      dtype='object')

## process for subclasses

In [12]:
#groupby_adata = 'subclass_label_new'
groupby_adata = 'cluster'
# pass/change variable name
adata_ori = adata

np.unique(list(adata_ori.obs[groupby_adata]))
sel_class_to_process = [c for c in np.unique(list(adata_ori.obs[groupby_adata])) if c!='nan']
sel_class_to_process

['Astro',
 'CGE',
 'Endo',
 'L2/3 IT',
 'L4/5 IT',
 'L5 ET',
 'L5 IT',
 'L5/6 NP',
 'L6 CT',
 'Micro',
 'OPC',
 'Oligo',
 'Pvalb',
 'Sst']
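
Several of these labels contain `/` or spaces, so the processing loop further down converts them before using them in file names. The sketch below applies the same two replacements to the labels listed above and checks that no two labels collapse to the same file name. `group_to_fname` is a hypothetical name for that step.

```python
def group_to_fname(group):
    """Make a group label safe for file names: '/' and ' ' become '_'."""
    return group.replace('/', '_').replace(' ', '_')

groups = ['Astro', 'CGE', 'Endo', 'L2/3 IT', 'L4/5 IT', 'L5 ET', 'L5 IT',
          'L5/6 NP', 'L6 CT', 'Micro', 'OPC', 'Oligo', 'Pvalb', 'Sst']

fname_by_group = {g: group_to_fname(g) for g in groups}

# two labels mapping to one file name would silently overwrite an output
has_collision = len(set(fname_by_group.values())) != len(groups)
print(fname_by_group['L2/3 IT'], has_collision)
```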

In [13]:
# output_folder
output_folder = os.path.join(output_analysis_folder, 'Pairtag', 'subclass')
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
    print ('Generate output folder')

In [14]:
%matplotlib inline

import gene_activity
import loci_1d_features
from scipy import stats
import seaborn as sns
from tqdm import tqdm

bin_size = 2000 # window size in kb, extended in both directions
adjcent_col = f'adjacent_peaks_{bin_size}kb_center'
activity_type = 'sum' # per-cell sum over all peaks associated with a locus

expression_res_df_dict = {}
sel_class_to_process = [c for c in np.unique(list(adata_ori.obs[groupby_adata])) if c!='nan']

# simple loop
for _group in sel_class_to_process[:]:
    _group_fname = _group.replace('/','_').replace(' ','_')
    
    print (f'Process pairtag data for {_group}')
    
    expression_res_dict={}
    sorted_group_order = [_group]

    imaged_loci_df_group = loci_1d_features.codebook_chr_order_for_loci_dataframe (imaged_loci_df, 
                                               codebook_df, 
                                               sel_cols =['chr','chr_order','id'], 
                                               sort_df = True,
                                               sort_by_chr=True)

    loci_key_list = loci_1d_features.sorted_loci_keys_for_loci_dataframe(imaged_loci_df_group)
    loci_ori_ind = loci_1d_features.find_chr_loci_iloc_from_loci_keys (codebook_df, loci_key_list)

    # for loci along the chromosome, append the measurements for each single cell
    for _ind, sel_loci_ind in tqdm(enumerate(imaged_loci_df_group.index.tolist()[:])):

        # get adjacent gene expression (which are peaks in this case)
        sel_genes=imaged_loci_df_group.loc[sel_loci_ind][adjcent_col].split('; ')
        sel_adata =  adata_ori[:,adata_ori.var.index.isin(sel_genes)]
        marker_expressions = gene_activity.gene_activity_raw_groups(sel_genes,
                            sel_adata, 
                            sorted_group_order,
                            groupby_adata,
                            ref_norm_list = [],
                            report_type =activity_type)

        expression_res_dict[_ind]=list(marker_expressions[_group])
        
    # convert dict to df as loci by cell
    expression_res_df = pd.DataFrame.from_dict(expression_res_dict, orient='index')
    expression_res_df['loci_name']=codebook_df.index.tolist()
    expression_res_df = expression_res_df.set_index ('loci_name')
    expression_res_df_dict[_group]=expression_res_df
    # save
    output_df_fname = os.path.join(output_folder, f'MERFISH_loci_{target_mode}_2X_{bin_size}kb_for_{_group_fname}.csv')
    expression_res_df.to_csv(output_df_fname)
    print ('=========================================================')


Process pairtag data for Astro


1982it [02:59, 11.04it/s]


Process pairtag data for CGE


1982it [02:41, 12.27it/s]


Process pairtag data for Endo


1982it [02:37, 12.57it/s]


Process pairtag data for L2/3 IT


1982it [02:36, 12.63it/s]


Process pairtag data for L4/5 IT


1982it [02:37, 12.57it/s]


Process pairtag data for L5 ET


1982it [02:37, 12.58it/s]


Process pairtag data for L5 IT


1982it [02:37, 12.62it/s]


Process pairtag data for L5/6 NP


1982it [02:37, 12.59it/s]


Process pairtag data for L6 CT


1982it [02:37, 12.55it/s]


Process pairtag data for Micro


1982it [02:39, 12.43it/s]


Process pairtag data for OPC


1982it [02:40, 12.37it/s]


Process pairtag data for Oligo


1982it [02:46, 11.92it/s]


Process pairtag data for Pvalb


1982it [02:38, 12.51it/s]


Process pairtag data for Sst


1982it [02:39, 12.46it/s]
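
The implementation of `gene_activity.gene_activity_raw_groups` is not shown in this notebook. As a rough illustration of what a `'sum'` report could compute, the sketch below restricts a cell-by-peak count matrix to one group's cells and one locus's peaks, then sums across peaks to get one value per cell. The function name, the toy counts, and the toy group labels are all assumptions. It requires `numpy`.

```python
import numpy as np

def summed_activity(counts, var_names, cell_groups, sel_peaks, group):
    """Per-cell sum over selected peaks, for the cells of one group.

    counts:      2D array, cells x peaks
    var_names:   peak name for each column
    cell_groups: group label for each row
    ASSUMPTION: this mimics, but is not, the notebook's gene_activity call.
    """
    peak_mask = np.isin(np.asarray(var_names), sel_peaks)
    cell_mask = np.asarray(cell_groups) == group
    return counts[cell_mask][:, peak_mask].sum(axis=1)

# toy data: 3 cells x 3 peaks
counts = np.array([[1, 0, 2],
                   [0, 3, 0],
                   [4, 0, 1]], dtype=float)
var_names = ['chr1_3000000_3001000', 'chr1_3003000_3004000',
             'chr10_100002000_100003000']
cell_groups = ['Astro', 'Sst', 'Astro']
sel_peaks = ['chr1_3000000_3001000', 'chr1_3003000_3004000']

astro_activity = summed_activity(counts, var_names, cell_groups,
                                 sel_peaks, 'Astro')
print(astro_activity)
```

Each returned value corresponds to one entry of a row in the loci-by-cell table saved by the loop.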




In [15]:
expression_res_df.head()

loci_name,0,1,2,3,4,5,6,7,8,9,...,49,50,51,52,53,54,55,56,57,58
chr1_3742742_3759944,3.0,0.0,0.0,0.0,2.0,0.0,0.0,3.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,1.0
chr1_6245958_6258969,3.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0,0.0,2.0
chr1_8740008_8759916,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,3.0,3.0,0.0,1.0
chr1_9627926_9637875,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,3.0,1.0,0.0,1.0
chr1_9799472_9811359,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,3.0,1.0,0.0,1.0
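
Each saved table is loci by cell, and the number of columns differs between groups. One possible next step, sketched below, is to read the per-group CSV files back and reduce each to a mean per locus so that groups can be compared side by side. `mean_signal_per_locus`, the toy values, and the toy file names are assumptions; the demo writes to a temporary folder and needs `pandas`.

```python
import os
import tempfile
import pandas as pd

def mean_signal_per_locus(csv_by_group):
    """Read loci-by-cell CSVs and return a loci-by-group table of means.

    Hypothetical helper, not part of the original notebook.
    """
    means = {}
    for group, fname in csv_by_group.items():
        df = pd.read_csv(fname, index_col=0)
        means[group] = df.mean(axis=1)  # average over the cells (columns)
    return pd.DataFrame(means)

# toy loci-by-cell tables with different numbers of cells per group
loci = ['chr1_3742742_3759944', 'chr1_6245958_6258969']
toy_tables = {
    'Sst': pd.DataFrame([[3.0, 0.0], [3.0, 1.0]], index=loci),
    'Astro': pd.DataFrame([[0.0, 0.0, 3.0], [1.0, 1.0, 1.0]], index=loci),
}

demo_folder = tempfile.mkdtemp()
csv_by_group = {}
for group, table in toy_tables.items():
    table.index.name = 'loci_name'
    fname = os.path.join(demo_folder, f'toy_{group}.csv')
    table.to_csv(fname)
    csv_by_group[group] = fname

mean_df = mean_signal_per_locus(csv_by_group)
print(mean_df)
```

A plain mean ignores differences in sequencing depth between cells, so any normalization would have to be applied before this step.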
