# Task 8: Predict the Relevance of cCREs in Biological Mechanisms Shared by Genes with Common Promoter-Enhancer Interactions
## To-do:
- Add arrows to indicate gene directionality;  
- Apply **cosine similarity** and/or other similarity metrics to identify genes that share commonalities with ACE2:  
    - Tissue-expression;  
    - XCI-status;  
    - Sex-differential expression;  
    
### Tissue expression similarity:
- Construct a score based on two factors:  

1) Similarity with ACE2 at the tissue level: e.g. if ACE2 is highly expressed in tissues **x**, **y**, and **z**, every other gene receives a count score out of *3* based on its expression in tissues **x**, **y**, and **z**.  
    - This score may or may not consider expression levels in TPM.  
    
2) Overall expression pattern: e.g. if a gene is widely expressed, it will receive a low score. On the other hand, if it has a narrow, more tissue-specific expression pattern, it will receive a higher score.  
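
A minimal Python sketch of the two factors, assuming expression is available as a tissue-to-TPM mapping per gene. The function names, the `top_n=3` default, and the 0.5 TPM expression cutoff are illustrative assumptions, not settled choices:

```python
def tissue_overlap_score(gene_tpm, ace2_tpm, top_n=3, cutoff=0.5):
    """Factor 1: count how many of ACE2's top_n tissues also express the gene."""
    # ACE2's most highly expressed tissues define the reference set (x, y, z).
    top_tissues = sorted(ace2_tpm, key=ace2_tpm.get, reverse=True)[:top_n]
    # Presence/absence only; TPM magnitude is deliberately ignored here.
    return sum(1 for tissue in top_tissues if gene_tpm.get(tissue, 0.0) >= cutoff)


def specificity_score(gene_tpm, cutoff=0.5):
    """Factor 2: fraction of tissues where the gene is NOT expressed.

    Widely expressed genes score near 0, narrowly expressed genes near 1.
    """
    expressed = sum(1 for tpm in gene_tpm.values() if tpm >= cutoff)
    return 1.0 - expressed / len(gene_tpm)


# Invented TPM values, for illustration only (not GTEx measurements).
ace2_example = {"tissue_x": 50.0, "tissue_y": 30.0, "tissue_z": 20.0, "tissue_w": 1.0}
gene_example = {"tissue_x": 10.0, "tissue_y": 0.1, "tissue_z": 5.0, "tissue_w": 0.0}
overlap = tissue_overlap_score(gene_example, ace2_example)
specificity = specificity_score(gene_example)
```

Whether the two factors are multiplied, summed, or reported separately is still open.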

### Adding arrows using ggplot's geom_segment:
**arrow** - (default: NULL) the arrow to draw at the **end** point of the line segment  
1) Obtain genes in chrX:15,200,000-15,800,000 using UCSC Table Browser (**hg38 GENCODE v32 knownGene**). chromStart (txStart) is always the lower genomic coordinate, so it marks the 5'-end only for **+** strand genes; for **-** strand genes it marks the 3'-end.  

strand|chromStart|chromEnd
:--:|:--:|:--:
+|Start/5'-end|End/3'-end
-|End/3'-end|Start/5'-end

**In R:** 

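    # Orient each segment 5' -> 3' so the arrow head lands on the 3'-end.
    # Works per gene (scalar strand); use ifelse() for whole columns.
    if (strand == "+") {
      start <- chromStart
      end <- chromEnd
    } else {
      start <- chromEnd
      end <- chromStart
    }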

![Rplot01-arrow+type+genes+score.png](attachment:Rplot01-arrow+type+genes+score.png)

# July 9th Meeting: ACE2 Sex-Bias

### Hypothesis I: Sex hormones androgens and estrogens determine sex differences in ACE2 mRNA expression between males and females.   
[Sex differences in renal angiotensin converting enzyme 2 (ACE2) activity are 17β-oestradiol-dependent and sex chromosome-independent](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3010099/): Sex differences in renal ACE2 activity in intact mice are due, at least in part, to the presence of 17β-estradiol (E2) in the ovarian hormone milieu and not to the testicular milieu or to differences in sex chromosome dosage (2X versus 1X; 0Y versus 1Y).  

[Landscape of X chromosome inactivation across human tissues](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5685192/#SD1): In the analysis of sex bias across GTEx tissues, ACE2 is an established escape gene that showed uncharacteristically heterogeneous patterns of male-female expression differences. The significant male-biased expression in several tissues can arise due to either subtler or more tissue-specific escape from XCI, yet we also find a few potential alternative explanations. Hormone-dependent gene regulation can hamper the detection of the expected female bias in escape gene expression. For instance, the predominant male-biased expression of ACE2 is in line with the demonstrated higher ACE2 activity in males partially driven by sex steroids. We also note that a cluster of escape genes with less consistent sex bias profiles resides in the chromosomal region telomeric from the X-inactivation center, in the evolutionarily older region of the chromosome.

### Hypothesis II: Localization of ACE2 on the X chromosome determines sex differences in ACE2 mRNA expression between males and females.
[COVID-19 and Individual Genetic Susceptibility/Receptivity: Role of ACE1/ACE2 Genes, Immunity, Inflammation and Coagulation. Might the Double X-Chromosome in Females Be Protective against SARS-CoV-2 Compared to the Single X-Chromosome in Males?](https://www.mdpi.com/1422-0067/21/10/3474/htm): X-linked heterozygous alleles could activate in females a mosaic advantage and a greater sexual dimorphism that might counteract viral infection, local inflammation due to cytokine storms and severe outcomes.  
> ADAM17 by promoting the detaching of ACE2 cell receptor might contribute by downregulating the ACE2/Ang1-7/Mas axis, and in a sex-oriented perspective, SRY (Y-chromosome) and SOX3 (X-chromosome) both by upregulating AGT, and downregulating ACE2, AT2, and MAS. Conversely, SRY upregulates, whilst SOX3 downregulates, the REN promoter, thus being a potentially detrimental step in limiting the global rate of the RAS system that is particularly frail in males.

[Landscape of X chromosome inactivation across human tissues](https://www.nature.com/articles/nature24265?sf122377809=1): A systematic survey of XCI that integrated transcriptomes with genomic data [127] identified ACE2 as a tissue-specific escape gene that showed moderate male-biased expression in lungs, higher male-biased expression in the small intestine, and weak male-biased expression in Epstein–Barr virus (EBV)-transformed lymphocytes.  

### Hypothesis III: Protective, immune-related X-linked genes appear to be more activated in female immune cells.
[COVID-19 and Individual Genetic Susceptibility/Receptivity: Role of ACE1/ACE2 Genes, Immunity, Inflammation and Coagulation. Might the Double X-Chromosome in Females Be Protective against SARS-CoV-2 Compared to the Single X-Chromosome in Males?](https://www.mdpi.com/1422-0067/21/10/3474/htm)

### Hypothesis IV: There is no sex-bias in ACE2 expression.
[Sex difference and smoking predisposition in patients with COVID-19](https://pubmed.ncbi.nlm.nih.gov/32171067/)

# Similarity Methods: [FIVE MOST POPULAR SIMILARITY MEASURES IMPLEMENTATION IN PYTHON](https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)
1) Euclidean distance: Distance between two points measured as the length of the straight line connecting them, computed with the Pythagorean theorem.  
2) Manhattan distance: Distance between two points measured along axes at right angles.  
3) Minkowski distance: Generalized metric form of Euclidean and Manhattan distances.  
4) Cosine similarity: Normalized dot product of two vectors. Thus, two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.  
*Cosine similarity is particularly used in positive space (e.g. non-negative expression values), where the outcome is neatly bounded in [0,1].*  
5) Jaccard similarity: Similarity between finite sample sets defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets.  
> ***Sets:*** A set is an unordered collection of objects, written as elements separated by commas inside curly brackets, e.g. {a,b,c}. Because sets are unordered, {a,b} = {b,a}.  
> ***Cardinality:*** The cardinality of A, denoted |A|, counts how many elements are in A.  
> ***Intersection:*** The intersection between two sets A and B is denoted A ∩ B and reveals all items which are in both sets A,B.  
> ***Union:*** Union between two sets A and B is denoted A ∪ B and reveals all items which are in either set.  
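
The five measures above can be written in a few lines of plain Python. This is a sketch for reference rather than the linked article's code; the function names are my own, and vectors are assumed to be equal-length sequences of numbers (e.g. one expression value per tissue):

```python
import math


def euclidean(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Distance measured along axes at right angles."""
    return sum(abs(a - b) for a, b in zip(u, v))


def minkowski(u, v, p):
    """Generalized form: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths.

    Undefined (division by zero) if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; undefined if both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

For example, `cosine_similarity([1, 2], [2, 4])` is 1 (up to floating-point rounding) even though the second vector is twice as long, which is the "pattern similarity" behaviour, whereas `euclidean` on the same pair is non-zero and reflects the absolute difference.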

### Useful resources:
1. [On the selection of appropriate distances for gene expression data clustering](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S2-S2)  
2. [What methods exist to calculate RNA expression profile similarity](https://bioinformatics.stackexchange.com/questions/441/what-methods-exist-to-calculate-rna-expression-profile-similarity): **Pattern similarity** vs. **absolute differences**  
3. [WGCNA: an R package for weighted correlation network analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2631488/):  
> Tutorials for the WGCNA package: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html  

4. [Determinants of expression variability](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3973347/)


# Gene Similarity Ranking Based on Tissue Expression:
### Genes:
(ASB9|ASB11|PIGA|VEGFD|PIR|BMX|ACE2|AP1S2|PIR-FIGF|CLTRN|CA5BP1|AC112497.1|CA5B|ZRSR2|INE2)  

## Method 1: Scoring based on absolute differences (i.e. pairwise tissue-level/sample-level comparison to ACE2) and patternicity (i.e. gene-specific expression variability)

#### STEP 1: Compile expression data:
- Using [EMBL-EBI](https://www.ebi.ac.uk/gxa/FAQ.html), GTEx data was downloaded and filtered for the genes of interest.  
> Grey box: expression level is below cutoff (0.5 FPKM or 0.5 TPM)  
> Light blue box: expression level is low (between 0.5 to 10 FPKM or 0.5 to 10 TPM)  
> Medium blue box: expression level is medium (between 11 to 1000 FPKM or 11 to 1000 TPM)  
> Dark blue box: expression level is high (more than 1000 FPKM or more than 1000 TPM)  
> White box: there is no data available

#### STEP 2: Set up two scoring criteria:
**Absolute Differences (AD):**
> - For each gene, calculate the # of samples with **BELOW CUTOFF**, **LOW**, **MEDIUM**, and **HIGH** expression levels based on the values above.  
> - For each gene, obtain the ratio of each category in relation to ACE2 (e.g. #LOW_gene/#LOW_ACE2, #MED_gene/#MED_ACE2, etc.).  
>> - Use **fold-change**, replacing a sample count of 0 with 0.1 to avoid division by zero while still reflecting the fold change. For each level, **1** represents the closest similarity in tissue-wise expression relative to ACE2, with larger values indicating greater divergence.  
>> - Sum values for all levels to obtain total fold-change in comparison to ACE2.  
>> - In R, this can be normalized to [0.1,1], with **a sum of 4 mapped to 1 (closest ACE2 similarity)** and the maximum value mapped to 0.1 (not 0, so that the result remains visible in the plot).  
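
A Python sketch of the raw AD sum, under two assumptions that the notes do not spell out: values between 10 and 11 TPM are treated as medium, and each fold change is taken symmetrically (larger count over smaller), which is what makes 4 the smallest possible sum. Function names are illustrative:

```python
LEVELS = ("below_cutoff", "low", "medium", "high")


def expression_level(tpm):
    """Bin one TPM value using the Expression Atlas thresholds listed in STEP 1."""
    if tpm < 0.5:
        return "below_cutoff"
    if tpm <= 10:
        return "low"
    if tpm <= 1000:  # assumption: the 10-11 TPM gap counts as medium
        return "medium"
    return "high"


def level_counts(tpm_values):
    """Number of samples falling into each expression level."""
    counts = dict.fromkeys(LEVELS, 0)
    for tpm in tpm_values:
        counts[expression_level(tpm)] += 1
    return counts


def absolute_difference(gene_tpm, ace2_tpm, pseudo=0.1):
    """Sum of per-level fold changes relative to ACE2 (4.0 = identical counts)."""
    gene_counts = level_counts(gene_tpm)
    ace2_counts = level_counts(ace2_tpm)
    total = 0.0
    for level in LEVELS:
        # Replace a count of 0 with 0.1 so the ratio is always defined.
        g = max(gene_counts[level], pseudo)
        a = max(ace2_counts[level], pseudo)
        # Symmetric fold change (assumption): always >= 1.
        total += max(g / a, a / g)
    return total
```

With invented inputs, `absolute_difference([0.1, 5, 50], [0.1, 5, 50])` gives 4.0, while a gene with no below-cutoff samples where ACE2 has one is penalized by a factor of 10 for that level.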

**Patternicity (P):**
> - For each gene, calculate the *mean*, *variance*, and *coefficient of variation*.  
>> - In R, this can be normalized to [0.1,1], with **1 = minimum value = most reliable representation/narrowly expressed gene**.  

*The final score is AD x P.*    
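
A Python sketch of the normalization and the final product, assuming population standard deviation for the coefficient of variation and a plain min-max rescale to [0.1, 1]. Which patternicity statistic is fed in as `p_raw` is left to the caller; mapping the smallest value to 1 follows the "1 = minimum value" convention above. All names are illustrative:

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (undefined if the mean is 0)."""
    return statistics.pstdev(values) / statistics.mean(values)


def rescale(values, low=0.1, high=1.0, smallest_is_best=True):
    """Min-max rescale to [low, high]; by default the smallest input maps to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [high for _ in values]
    scaled = []
    for v in values:
        frac = (v - vmin) / (vmax - vmin)  # 0 at the minimum, 1 at the maximum
        if smallest_is_best:
            frac = 1.0 - frac
        scaled.append(low + frac * (high - low))
    return scaled


def final_scores(ad_raw, p_raw):
    """AD x P per gene, after rescaling both criteria to [0.1, 1]."""
    ad = rescale(ad_raw)  # smallest raw sum (4 when ACE2 itself is included) -> 1
    p = rescale(p_raw)
    return [a * b for a, b in zip(ad, p)]
```

Because neither factor can reach 0, the product stays within [0.01, 1], so every gene remains visible in a plot.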

### Results of Method 1:
Based on visual inspection from EMBL-EBI GTEx data, it is expected that **ASB9, ASB11, and CLTRN** have a similar pattern of expression to ACE2. This is confirmed by the R plot:  

![EBI-GTEx-query.PNG](attachment:EBI-GTEx-query.PNG)
![Rplot-ACE2-similarity.png](attachment:Rplot-ACE2-similarity.png)

# Gene Similarity Ranking Based on Sex-Biased Expression:
### Genes:
(ASB9|ASB11|PIGA|VEGFD|PIR|BMX|ACE2|AP1S2|PIR-FIGF|CLTRN|CA5BP1|AC112497.1|CA5B|ZRSR2|INE2)  

- [The landscape of sex-differential transcriptome and its consequent selection in human adults](https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0352-z): Sex-differential gene expression (SDE) scores are 0 for non-differentially expressed genes, positive for women-biased genes, and negative for men-biased genes.  

Gene|Tissue|SDE|Sex-Bias
:-:|:-:|:-:|:-:
ACE2|Breast-Mammary Tissue|-0.09138|Male
TMEM27 (CLTRN)|Breast-Mammary Tissue|0.006495|Female
FIGF (VEGFD)|Breast-Mammary Tissue|0.288175|Female
ZRSR2|Adipose-Subcutaneous, Artery-Aorta, Artery-Tibial, Breast-Mammary Tissue, Cells-EBV.transformed_lymphocytes, Cells-Transformed, Muscle-Skeletal, Nerve-Tibial, Skin-Not_Sun_Exposed, Skin-Sun_Exposed, Thyroid|>0|Female

- [Sex Differences in Gene Expression and Regulatory Networks across 29 Human Tissues](https://www.cell.com/cell-reports/fulltext/S2211-1247(20)30776-2): TF sex-biased targeting of genes. Differentially expressed genes across 29 tissues (voom analysis, absolute fold change ≥ 1.5, and FDR < 0.05). Positive values indicate male-biased gene expression, and negative values indicate female-biased gene expression.

Gene|Tissue|Sex-Bias
:-:|:-:|:-:
ACE2|Breast|Male
TMEM27 (CLTRN)|Breast|Female
ASB11|Breast|Male
ZRSR2|Intestine terminal ileum|Female
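
Note that the two sources use opposite sign conventions (positive = women-biased in the first, positive = male-biased in the second). A small Python helper, sketched here with an illustrative name and argument, makes the convention explicit before scores from both are compared:

```python
def sex_bias(score, positive_means):
    """Translate a signed score into a sex-bias label.

    positive_means: "female" or "male", i.e. which sex a positive score
    indicates in the source the score was taken from.
    """
    if positive_means not in ("female", "male"):
        raise ValueError("positive_means must be 'female' or 'male'")
    if score == 0:
        return "None"
    other = "male" if positive_means == "female" else "female"
    label = positive_means if score > 0 else other
    return label.capitalize()


# SDE scores from the first table (positive = women-biased).
ace2_label = sex_bias(-0.09138, positive_means="female")
vegfd_label = sex_bias(0.288175, positive_means="female")
```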

## Gathering Common cCRE Data:
- Once pairwise-similarity with ACE2 is established based on different criteria, the cCREs or general regions that are shared between ACE2 and a gene were gathered:

All|Common
:-:|:-:
![Rplot01-gene+cCRE+direction.png](attachment:Rplot01-gene+cCRE+direction.png)|![Rplot-reg-common-ACE2.png](attachment:Rplot-reg-common-ACE2.png)
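
Gathering the shared elements amounts to a set intersection per gene. A Python sketch, assuming each gene has already been linked to a set of cCRE identifiers; the identifiers and gene links below are invented placeholders, not results:

```python
def shared_with_reference(links, reference="ACE2"):
    """Return, for each other gene, the cCREs it shares with the reference gene.

    links: dict mapping gene name -> set of cCRE identifiers.
    Genes sharing nothing with the reference are left out.
    """
    ref_ccres = links[reference]
    shared = {}
    for gene, ccres in links.items():
        if gene == reference:
            continue
        common = ccres & ref_ccres
        if common:
            shared[gene] = common
    return shared


# Placeholder data for illustration only.
example_links = {
    "ACE2": {"cCRE_1", "cCRE_2", "cCRE_3"},
    "geneA": {"cCRE_2", "cCRE_3", "cCRE_9"},
    "geneB": {"cCRE_7"},
}
example_shared = shared_with_reference(example_links)
```

Dividing the size of each intersection by the size of the union would give a Jaccard similarity, which could serve as another ranking criterion.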

# Relating Regions to Function:
1. [Understanding Tissue-Specific Gene Regulation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5828531/):  
> -  Network edges (transcription factor to target gene connections) have higher tissue specificity than network nodes (genes) and that regulating nodes (transcription factors) are less likely to be expressed in a tissue-specific manner as compared to their targets (genes).  
> - Regulation of tissue-specific function is largely independent of transcription factor expression, and rather driven by context-dependent regulatory paths, providing transcriptional control of tissue-specific processes.  
>> *We characterized tissue-specific gene regulation starting with **GTEx gene expression data** and then use **PANDA** to integrate this information with **protein-protein interaction (PPI) and transcription factor (TF) target information**, producing 38 inferred gene regulatory networks, one for each tissue. We identified tissue-specific genes, transcription factors, and regulatory network edges, and we analyzed their properties within and across these networks.*  

2. [An integrative view of the regulatory and transcriptional landscapes in mouse hematopoiesis](https://www.biorxiv.org/content/10.1101/731729v1.full.pdf):  
> - The transitions in epigenetic states of cCREs across cell types provide insights into mechanisms of regulation, including decreases in numbers of active cCREs during differentiation, transitions from poised to active or inactive states, and shifts in nuclease accessibility of CTCF-bound elements.  
>> *The **regression modeling** of epigenetic states at cCREs and gene expression produced a versatile resource to improve selection of cCREs potentially regulating target genes. These resources are available from our VISION website (usevision.org) to aid research in genomics and hematopoiesis.*  
> -  An effective integration of major consortia (e.g. ENCODE, Roadmap) will produce simplified representations of the data that facilitate discoveries and lead to testable hypotheses about functions of genomic elements and mechanisms of regulatory processes. **Here, we report on our initial systematic integrative modeling of mouse hematopoiesis.**  
>>  In the intensively studied process of hematopoiesis, comprehensive datasets encompass virtually all the recognized regulatory and transcriptional changes that occur during differentiation. However, *elucidating from these comprehensive datasets the regulatory events most critical to producing the transcriptional patterns needed for distinctive cell types is still a major challenge.*  
>> - We used the **Integrative and Discriminative Epigenome Annotation System (IDEAS)** to learn and assign epigenetic states, which are common combinations of features such as nuclease accessibility, histone modifications, and CTCF occupancy, jointly along chromosomes and across 20 hematopoietic cell types.  
>> - Furthermore, we combined the integrated features in the form of epigenetic states with peaks of nuclease accessibility to produce an initial compendium of over 200,000 candidate Cis-Regulatory Elements (cCREs) active in one or more hematopoietic lineages in mouse. Investigation of state transitions in the cCREs across differentiation revealed insights into epigenetic dynamics, including progressions from poised to active or inactive enhancers and loss of nuclease accessibility at some CTCF-bound sites.  
>> - **Exploration of the correlations of cCRE states and gene expression produced a flexible, user-tunable resource for assigning cCREs to candidate target genes in the investigated cell types, which in turn can help explain the impacts of genetic variation in noncoding regions.**  
>>
> **STEP 1:** We collated the raw sequence data for 150 determinations of relevant epigenetic features (104 experiments after merging replicates), including histone modifications and CTCF by ChIP-seq, nuclease accessibility of DNA in chromatin by ATAC-seq and DNase-seq, and 20 experiments on transcriptomes by RNA-seq.  
> **STEP 2:** We collected key signatures of epigenomic and transcriptomic data from different laboratories and consortia; the data were processed to account for wide differences in sequencing depth, fraction of reads on target, signal-to-noise ratio, presence of replicates, and other properties.  
> **STEP 3:** We use the IDEAS model to demonstrate the prevalence of each epigenetic feature as a heatmap, organized by similarity among the states and providing an informative landscape that distinguishes multiple state signatures. Each state represents a distinct class of regulatory elements (including enhancers, promoters and boundary elements). *For example, six states showed a promoter-like signature, with high frequency of H3K4me3, but distinguished by the presence or absence of other features with functional implications.*  
>> - The frequent co-occurrence of some histone modifications has led to discrete models for epigenetic structures of candidate cis-regulatory elements, or cCREs, and can be used to assign each segment of DNA in each cell type to an epigenetic state. Computational tools such as **chromHMM**, **Segway**, and **Spectacle** provide informative segmentations primarily in one dimension, usually along chromosomes. The Integrative and Discriminative Epigenome Annotation System, or IDEAS, expands the capability of segmentation tools by integrating the data simultaneously in two dimensions, along chromosomes and across cell types, thus improving the precision of state assignments.  
> - *Estimating regulatory output and assigning target genes to cCREs:* We investigated the effectiveness of the cCREs in explaining levels of gene expression and predicted the target genes for each cCRE (in **Materials and Methods (*Mapping cCREs to Genes*)**).

### Goal: To develop a tool similar to IDEAS, which instead compares the regions surrounding genes known to behave similarly (e.g. sex-biased, XCI escapees) and incorporates TF-binding data as well as epigenetic marks.
#### [X Inactivation and Escape: Epigenetic and Structural Features](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6779695/):  
> - A number of genes escape XCI in an individual-, tissue-, and cell type-specific manner, which can cause sex differences in gene expression. They are expressed from the Xi and lack epigenetic signatures characteristic of inactivated genes. They also appear to be located away from repressive genomic elements.  
> - Primary determinants of escape from XCI include **distance from Xist**, **density of LINE elements**, **clustering in domains**, **lack of repressive histone marks such as H3K27me3**, and **enrichment in active histone marks such as acetylation, and in transcription elongation marks including RNA PolII S2P and H3K36me3**, **DNA hypomethylation of CpG islands**. They tend to reside toward the **outside of the compacted inactivated interior of the Xi** and are influenced by **repeat E of Xist**, **lncRNAs often found near escape genes**, and **co-localization with clusters of CTCF binding and with TADs**.  
> - Escape genes in brain and liver adopt specific DNA methylation signatures like **enrichment in non-CG hypermethylation (mCH) throughout their gene body**.

## [IDEAS](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5772166/): Integrative and Discriminative Epigenome Annotation System
- Installation: https://github.com/guanjue/IDEAS_2018
- Alternative: https://github.com/yuzhang123/IDEAS


3. [Chromatin states responsible for the regulation of differentially expressed genes under 60Co~γ ray radiation in rice](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5639768/)  
> - Either an individual mark or a certain chromatin state was found to be highly correlated with the expression of up-regulated genes. In contrast, only the chromatin states, as opposed to any individual marks tested, are related to the expression of the down-regulated genes.  
> - In general, active marks (H4K12ac, H3K27ac, H3K4ac, H3K36me3 and H3K4me3) are positively correlated with gene expression, whereas repressive marks (H3K27me3 and H3K9me2) are negatively correlated with gene expression in eukaryotes.  
> - Distribution of chromatin statuses (CSs) across differentially expressed genes. The genome-wide chromatin state (CS) was divided into fifteen subgroups according to the combination of 7 marks as indicated in each group. (a) Distribution of chromatin states across control genes and the corresponding up-regulated genes with 5 kb up and down stream of the TSS. (b) Distribution of chromatin state across control genes and the corresponding down-regulated genes with 5 kb up and down-stream of the TSS. The x-axis represents the position relative to TSS; The y-axis represents fold enrichment, indicating the enrichment of the corresponding CS. 

# Bioinformatics Methods for Functional Genomics
1. Machine learning algorithms:

    - Unsupervised (i.e. class detection): Data clustering and principal component analysis;  
    - Supervised (i.e. class prediction/classification): Artificial neural networks and support vector machines;  

2. Functional enrichment analysis: Used to determine the extent of over- or under-expression of functional categories relative to a background set.  
    - Gene ontology-based enrichment analysis (DAVID and GSEA);  
    - Pathway-based analysis (Ingenuity and Pathway studio);  
    - Protein complex-based analysis (COMPLEAT);  

3.  Gapped k-mer SVM model: Used to infer the k-mers that are enriched within cis-regulatory sequences with high activity compared to sequences with lower activity.  
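
To make the enrichment idea in item 2 concrete, here is a Python sketch of a plain one-sided hypergeometric over-representation test. It illustrates the general principle only; it is not the exact statistic reported by DAVID or GSEA, and the function and parameter names are my own:

```python
from math import comb


def enrichment_p_value(background, annotated, selected, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    background: number of genes in the background set
    annotated:  background genes carrying the functional category
    selected:   size of the gene list being tested
    overlap:    selected genes that carry the category
    """
    total = comb(background, selected)
    upper = min(annotated, selected)
    # Sum the probability of observing `overlap` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For example, with a background of 10 genes of which 3 carry a category, drawing 2 genes and seeing at least 1 annotated has probability 24/45 ≈ 0.53, so such an overlap would not count as enrichment.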

Sources:
> - [Functional Genomics](https://en.wikipedia.org/wiki/Functional_genomics)

# Attempt Using ChromHMM Track (UCSC Genome Browser hg19):
### Use Roadmap Epigenomics ChromHMM 15 State Model & Cell and Tissue Gene Expression Profiles:
> The current release 9 of the Human Epigenome Atlas is a product of the NIH Roadmap Epigenomics Consortium. Release 9 contains a total of 2,804 genome-wide datasets, including 1,821 histone modification datasets, 360 DNase datasets, 277 DNA methylation datasets, and 166 RNA-Seq datasets, encompassing a total of 150.21 billion mapped sequencing reads corresponding to 3,174-fold coverage of the human genome. Release 9 includes a subset of 1,936 datasets grouped into 111 reference epigenomes where each reference epigenome contains a complete set of five core histone marks (H3K4me3, H3K4me1, H3K27me3, H3K9me3, and H3K36me3).  
> - Description of epigenomes: https://www.nature.com/articles/nature14248

From https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state:  

**ARCHIVE of all mnemonics.bed files**: A ChromHMM model applicable to all 127 epigenomes was learned by virtually concatenating consolidated data corresponding to the core set of 5 chromatin marks assayed in all epigenomes (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3). The regions were labeled using the state with the maximum posterior probability out of all 15.  
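
The labelling rule (pick the state with the maximum posterior probability) is a simple argmax. A Python sketch using the 15 core-model state mnemonics; the posterior values in the example are invented, and the function name is illustrative:

```python
STATES = [
    "1_TssA", "2_TssAFlnk", "3_TxFlnk", "4_Tx", "5_TxWk",
    "6_EnhG", "7_Enh", "8_ZNF/Rpts", "9_Het", "10_TssBiv",
    "11_BivFlnk", "12_EnhBiv", "13_ReprPC", "14_ReprPCWk", "15_Quies",
]


def label_region(posteriors):
    """Return the mnemonic of the state with the highest posterior probability."""
    if len(posteriors) != len(STATES):
        raise ValueError("expected one posterior probability per state")
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return STATES[best]


# Invented posteriors: mostly enhancer (state 7), some quiescent (state 15).
example_posteriors = [0.0] * 15
example_posteriors[6] = 0.7
example_posteriors[14] = 0.3
example_label = label_region(example_posteriors)
```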

From https://amp.pharm.mssm.edu/Harmonizome/dataset/Roadmap+Epigenomics+Cell+and+Tissue+Gene+Expression+Profiles:


For this preliminary analysis, choose E109 (small_intestine):  
***Excel Workflow:***

Sheet|Executables
:--:|:---:
Up Regulated Genes (i.e. all epigenomes, all chromosomes)|Filter for the desired epigenome (**E109**), add a value column under each gene with "COUNTIF(all_genes,gene)" to filter for genes in the desired chromosome (**chrX**, 0's are excluded)
Down Regulated Genes (i.e. all epigenomes, all chromosomes)|Filter for the desired epigenome (**E109**), add a value column under each gene with "COUNTIF(all_genes,gene)" to filter for genes in the desired chromosome (**chrX**, 0's are excluded)
ChromHMM Labelling (i.e. desired epigenome [**E109**], all chromosomes)|Filter to eliminate "FALSE" and transpose results from gene binning (*bins*)
Gene List (i.e. all chromosomes)| Filter for the desired chromosome (**chrX**)

*Determine if gene falls within a ChromHMM bin and extract gene name:* The four AND clauses test, in order, whether the gene overlaps the bin start, spans the bin, lies within the bin, or overlaps the bin end. =IF(OR(OR(AND(genes_filtered!A2 <=ChromHMM_filtered!$B$1317,AND(genes_filtered!B2 <=ChromHMM_filtered!$C$1317,genes_filtered!B2 >=ChromHMM_filtered!$B$1317)),AND(genes_filtered!A2 <=ChromHMM_filtered!$B$1317,genes_filtered!B2 >= ChromHMM_filtered!$C$1317)),OR(AND(genes_filtered!A2 >=ChromHMM_filtered!$B$1317,genes_filtered!B2 <=ChromHMM_filtered!$C$1317),AND(AND(genes_filtered!A2 >=ChromHMM_filtered!$B$1317,genes_filtered!A2 <= ChromHMM_filtered!$C$1317),genes_filtered!B2 >= ChromHMM_filtered!$C$1317))),D2)

*Replace direct cell references for index:*

Position|Before|After
:---:|:---:|:---:
ChromHMM chromstart|ChromHMM_filtered!$B$1317|INDEX(ChromHMM_filtered!$A$1:$E$4178,A$1,2), where A1 = row_num and 2 = chromStart
ChromHMM chromEnd|ChromHMM_filtered!$C$1317|INDEX(ChromHMM_filtered!$A$1:$E$4178,A$1,3), where A1 = row_num and 3 = chromEnd
Gene chromstart|genes_filtered!A2|INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1), where ROW()-1 = row_num and 1 = chromStart
Gene chromEnd|genes_filtered!B2|INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2), where ROW()-1 = row_num and 2 = chromEnd
Gene name|genes_filtered!D2|INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,4), where ROW()-1 = row_num and 4 = geneID

**Result:** =IF(OR(OR(AND(INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1) <=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,2),AND(INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2) <=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,3),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2) >=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,2))),AND(INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1) <=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,2),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2) >= INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,3))),OR(AND(INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1) >=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,2),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2) <=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,3)),AND(AND(INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1) >=INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,2),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,1) <= INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,3)),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,2) >= INDEX(ChromHMM_filtered!$A$1:$E$4178,AXQ$1,3)))),INDEX(genes_filtered!$A$2:$D$2426,ROW()-1,4))
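
The partial, spanning, and contained overlap cases that the formula tests one by one reduce to a single two-sided comparison. A Python sketch, assuming inclusive coordinates (mirroring the `<=`/`>=` comparisons) and genes given as `(chromStart, chromEnd, name)` tuples on one chromosome; names and coordinates are illustrative:

```python
def overlaps(gene_start, gene_end, bin_start, bin_end):
    """True if the gene and the ChromHMM bin share at least one position."""
    # Two intervals overlap exactly when each one starts before the other ends.
    return gene_start <= bin_end and gene_end >= bin_start


def genes_in_bin(genes, bin_start, bin_end):
    """Names of all genes overlapping the bin; genes is a list of (start, end, name)."""
    return [name for start, end, name in genes if overlaps(start, end, bin_start, bin_end)]


# Invented coordinates for illustration.
example_genes = [(100, 200, "geneA"), (180, 400, "geneB"), (500, 600, "geneC")]
example_hits = genes_in_bin(example_genes, 150, 250)
```

This also avoids building one spreadsheet column per bin: looping over bins and calling `genes_in_bin` gives the same gene-to-bin assignment.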

**ALTERNATIVE:** Normalized counts per state:  

=COUNTIFS(ChromHMM_filtered!$B$1:$B$4178,">"&$C2,ChromHMM_filtered!$C$1:$C$4178,"<"&$D2,ChromHMM_filtered!$D$1:$D$4178,F$1)+COUNTIFS(ChromHMM_filtered!$B$1:$B$4178,">"&$C2,ChromHMM_filtered!$C$1:$C$4178,">"&$D2,ChromHMM_filtered!$B$1:$B$4178,"<"&$D2,ChromHMM_filtered!$D$1:$D$4178,F$1)+COUNTIFS(ChromHMM_filtered!$B$1:$B$4178,"<"&$C2,ChromHMM_filtered!$C$1:$C$4178,"<"&$D2,ChromHMM_filtered!$C$1:$C$4178,">"&$C2,ChromHMM_filtered!$D$1:$D$4178,F$1)+COUNTIFS(ChromHMM_filtered!$B$1:$B$4178,"<"&$C2,ChromHMM_filtered!$C$1:$C$4178,">"&$D2,ChromHMM_filtered!$D$1:$D$4178,F$1)
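
The same per-state tally can be sketched in Python. Two assumptions differ from the formula: bins sharing an exact boundary coordinate with the gene are also counted, and the counts are divided by the gene's total number of overlapping bins (the formula itself returns raw counts, so the normalization step is my reading of "normalized"). Bins are assumed to be `(chromStart, chromEnd, state)` tuples on one chromosome:

```python
from collections import Counter


def state_counts(gene_start, gene_end, bins):
    """Count, per ChromHMM state, the bins that overlap the gene."""
    return Counter(
        state
        for bin_start, bin_end, state in bins
        if bin_start < gene_end and bin_end > gene_start
    )


def normalized_state_counts(gene_start, gene_end, bins):
    """Fraction of the gene's overlapping bins assigned to each state."""
    counts = state_counts(gene_start, gene_end, bins)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}


# Invented bins for illustration.
example_bins = [
    (0, 100, "1_TssA"),
    (100, 300, "4_Tx"),
    (300, 400, "4_Tx"),
    (400, 500, "15_Quies"),
]
example_counts = state_counts(50, 350, example_bins)
example_fractions = normalized_state_counts(50, 350, example_bins)
```

Counting bins treats every bin equally regardless of its length; weighting by overlapping base pairs would be an alternative normalization.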