### 20190905: Analysis of differential contact between healthy and _Salmonella enterica_-infected mice
**cmdoret**

Here I compare contacts between healthy and infected _A. castellanii_ at high (2 kb) resolution.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


In [56]:
# Load files and packages
import numpy as np
import pandas as pd
import os
import warnings
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
import hicstuff.hicstuff as hcs
import hicstuff.view as hcv
import cooler
import pybedtools
import serpentine as serp


out = 'data/output/'
indir = 'data/input/'
healthy_cool = cooler.Cooler(out + 'cool/uninfected_merged_subsampled.cool')
infected_cool = cooler.Cooler(out + 'cool/infected_merged_subsampled.cool')
#healthy_bedgraph = pd.read_csv(out + 'all_signals_AT419.bedgraph', sep='\t')
#infected_bedgraph = pd.read_csv(out + 'all_signals_AT418.bedgraph', sep='\t')
diff_loops = pd.read_csv(out + 'pareidolia/loops_small_change_infection_time.tsv', sep='\t')
genes = pd.read_csv(indir + 'annotations/c3_annotations/Acanthamoeba_castellanii.gff3', sep='\t')
diff_genes = pd.read_csv(out + 'pareidolia/loops_small_diff_genes.bed', sep='\t', header=None, names=['chrom', 'start', 'end', 'name', 'diff_score', 'strand'])


In [57]:
os.getcwd()
os.chdir("/home/cmatthey/Repos/Acastellanii_legionella_infection/")

Only regions with p-values below 10e-3 are selected as potential candidates. Once replicates are available, I will use FDR instead of raw p-values.
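
To make the planned switch from raw p-values to FDR concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The function name `benjamini_hochberg` and the example p-values are hypothetical and are not produced by this analysis; in practice a library routine such as the `fdr_bh` method used later for the GO test would likely be preferable.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted in ascending order
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for five candidate regions (illustration only)
example_pvalues = [0.0002, 0.03, 0.0009, 0.5, 0.004]
example_qvalues = benjamini_hochberg(example_pvalues)
# Keep the regions whose adjusted p-value passes a 5% FDR
selected = [i for i, q in enumerate(example_qvalues) if q < 0.05]
print(selected)
```

Filtering on the adjusted values controls the expected fraction of false positives among the selected regions, whereas a fixed raw p-value cut-off does not account for the number of regions tested.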

In [58]:
diff_genes.head()

Unnamed: 0,chrom,start,end,name,diff_score,strand
0,scaffold_1,11709,30538,FUN_000005,-0.83513,-
1,scaffold_1,84594,93266,FUN_000024,6.445427,-
2,scaffold_1,78160,84132,FUN_000023,6.445427,+
3,scaffold_1,122032,124302,FUN_000033,2.122255,+
4,scaffold_1,140191,142477,FUN_000040,1.016146,+


In [59]:
diff_loops.head()

Unnamed: 0,chrom1,start1,end1,chrom2,start2,end2,bin1,bin2,diff_score
0,scaffold_1,16000,18000,scaffold_1,48000,50000,8.0,24.0,-0.83513
1,scaffold_1,84000,86000,scaffold_1,106000,108000,42.0,53.0,6.445427
2,scaffold_1,122000,124000,scaffold_1,134000,136000,61.0,67.0,2.122255
3,scaffold_1,142000,144000,scaffold_1,158000,160000,71.0,79.0,1.016146
4,scaffold_1,250000,252000,scaffold_1,272000,274000,125.0,136.0,1.847027


In [60]:
%matplotlib notebook

region = 'scaffold_1'

inf_zoom = infected_cool.matrix(balance=False).fetch(region)
uni_zoom = healthy_cool.matrix(balance=False).fetch(region)
uni_norm = healthy_cool.matrix(balance=True).fetch(region)
inf_norm = infected_cool.matrix(balance=True).fetch(region)

serp_A, serp_B, serp_ratio = serp.serpentin_binning(
    inf_zoom,
    uni_zoom,
    parallel=1,
    triangular=True,
    iterations=10
)


2020-10-26 17:42:41.998064 Starting 10 binning processes...
0	 Total serpentines: 727821 (100.0 %)
1	 Total serpentines: 483826 (66.47596043532681 %)
2	 Total serpentines: 165229 (22.70187312539759 %)
3	 Total serpentines: 58229 (8.000456156115309 %)
4	 Total serpentines: 22903 (3.1467902135277765 %)
5	 Total serpentines: 11704 (1.6080877028829892 %)
6	 Total serpentines: 8181 (1.1240401142588632 %)
7	 Total serpentines: 7275 (0.9995589574909215 %)
8	 Total serpentines: 7117 (0.9778503230876823 %)
9	 Total serpentines: 7116 (0.9777129266674086 %)
9	 Over: 2020-10-26 17:43:00.154903
0	 Total serpentines: 727821 (100.0 %)
1	 Total serpentines: 483706 (66.45947286489398 %)
2	 Total serpentines: 165363 (22.72028424571426 %)
3	 Total serpentines: 58379 (8.021065619156358 %)
4	 Total serpentines: 22965 (3.155308791584744 %)
5	 Total serpentines: 11660 (1.6020422603909479 %)
6	 Total serpentines: 8212 (1.1282994032873468 %)
7	 Total serpentines: 7344 (1.0090393104898046 %)
8	 Total serpentine

In [61]:
%matplotlib notebook
idx, psi = hcs.distance_law_from_mat(inf_norm)
_, psu = hcs.distance_law_from_mat(uni_norm)
plt.loglog(idx, psu)
plt.loglog(idx, psi)

[<matplotlib.lines.Line2D at 0x7f1529cd1210>]

In [62]:
%matplotlib notebook

fig, axes = plt.subplots(1, 3, sharex=True, sharey=True)
axes[0].imshow(inf_norm, cmap='afmhot_r', vmax = np.percentile(inf_norm, 99.5))
axes[1].imshow(uni_norm, cmap='afmhot_r', vmax = np.percentile(uni_norm, 99.5))
axes[2].imshow(serp_ratio - np.median(serp_ratio), cmap='coolwarm', vmin=-2, vmax=2)
    
sub_loops = diff_loops.loc[diff_loops.chrom1 == region, :]
for i in range(3):
    # 'none' (not an empty string) is the valid way to draw hollow markers
    axes[i].scatter(sub_loops.bin1, sub_loops.bin2, facecolor='none', edgecolor='blue')
axes[0].set_title("Infected")
axes[1].set_title("Uninfected")
axes[2].set_title("Ratio (I/U)")
plt.suptitle(f"A. castellanii, {region}, 2kb")

Text(0.5, 0.98, 'A. castellanii, scaffold_1, 2kb')

In [63]:
%matplotlib notebook
# Plot the uninfected matrix so that the data matches the title and output filename
plt.imshow(uni_norm, cmap='afmhot_r', vmax = np.percentile(uni_norm, 99.5), rasterized=True)
plt.scatter(sub_loops.bin1, sub_loops.bin2, facecolor='none', edgecolor='blue')
plt.suptitle(f"Uninfected A. castellanii, {region}, 2kb")
plt.savefig("uninfected_acastellanii_2kb_scf1.svg")

In [71]:
%matplotlib notebook 
plt.hist(diff_loops.start2 - diff_loops.start1, 150)
plt.xlabel("Loop size")
plt.ylabel("Number of loops")
plt.xlim(0, 100000)

(0.0, 100000.0)

All genes that overlap regions of differential contacts are then selected. This may not be optimal: a gene can be separated from its enhancer without itself overlapping a changed region, for example when a domain boundary moves into the sequence between the gene and its enhancer.
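
As an illustration of the overlap-based selection, here is a minimal pure-Python sketch that keeps genes overlapping either anchor of a differential loop. The helper names `overlaps` and `genes_at_anchors` are hypothetical, intervals are assumed to be half-open, and the coordinates are copied from the first rows of `diff_genes` and `diff_loops` shown above. The notebook itself uses a `pybedtools` intersect for this step, which scales much better than this quadratic loop.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if half-open intervals [start_a, end_a) and [start_b, end_b) overlap."""
    return start_a < end_b and start_b < end_a


def genes_at_anchors(genes, loops):
    """Return names of genes overlapping at least one anchor of any loop."""
    hits = []
    for chrom, start, end, name in genes:
        for chrom1, s1, e1, chrom2, s2, e2 in loops:
            at_anchor1 = chrom == chrom1 and overlaps(start, end, s1, e1)
            at_anchor2 = chrom == chrom2 and overlaps(start, end, s2, e2)
            if at_anchor1 or at_anchor2:
                hits.append(name)
                break  # one overlapping loop is enough to select the gene
    return hits


# Gene coordinates taken from diff_genes.head()
toy_genes = [
    ("scaffold_1", 11709, 30538, "FUN_000005"),
    ("scaffold_1", 78160, 84132, "FUN_000023"),
    ("scaffold_1", 84594, 93266, "FUN_000024"),
    ("scaffold_1", 122032, 124302, "FUN_000033"),
]
# Loop anchors taken from the first two rows of diff_loops.head()
toy_loops = [
    ("scaffold_1", 16000, 18000, "scaffold_1", 48000, 50000),
    ("scaffold_1", 84000, 86000, "scaffold_1", 106000, 108000),
]
anchor_genes = genes_at_anchors(toy_genes, toy_loops)
print(anchor_genes)
```

A gene whose enhancer contact is lost but which lies between anchors, rather than on one, would not be reported by this kind of test, which is the limitation described above.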

The enrichment of GO terms (annotations) within regions with altered chromatin conformation is then tested to check for specific pathways or functions.
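
To show what such an enrichment test computes for a single GO term, here is a minimal sketch of a one-sided hypergeometric test in plain Python. This is a simplification: the goatools run below uses Fisher's exact test followed by `fdr_bh` correction across all terms. The function name `hypergeom_enrichment_pvalue` and the counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(pop_size, pop_hits, study_size, study_hits):
    """P(X >= study_hits) when study_size genes are drawn without replacement
    from pop_size genes, of which pop_hits are annotated with the GO term."""
    total = comb(pop_size, study_size)
    max_hits = min(pop_hits, study_size)
    # Sum the probability mass of every outcome at least as extreme as observed
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 5 of 20 genes carry the term, 3 of the 4 selected genes do
pval = hypergeom_enrichment_pvalue(pop_size=20, pop_hits=5, study_size=4, study_hits=3)
print(pval)
```

One such p-value is computed per GO term, so the resulting list has to be corrected for multiple testing before terms are called enriched.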

In [41]:
# Overlap with annotations
genes_bed_path = indir + 'tracks/mm10_annotations.bed'
(genes
 .loc[:, ["chromosome_name", "start_position", "end_position", "ensembl_gene_id"]]
 .to_csv(genes_bed_path, header=None, index=False, sep='\t')
)
genes_bed = pybedtools.BedTool(genes_bed_path)
diff_bed = pybedtools.BedTool(out + 'sig_diff.bed')
genes_inter_diff = genes_bed.intersect(diff_bed)
genes_inter_diff.saveas(out+'genes_diff.bed')


KeyError: "None of [Index(['chromosome_name', 'start_position', 'end_position', 'ensembl_gene_id'], dtype='object')] are in the [columns]"

In [46]:
# Get corresponding GO terms
genes_diff = pd.read_csv(out+'genes_diff.bed', sep='\t', names=["chrom", "start", "end", "ensembl_gene_id"])
genes_diff = genes_diff.merge(genes, on="ensembl_gene_id", how="inner")



FileNotFoundError: [Errno 2] No such file or directory: 'data/output/genes_diff.bed'

In [7]:
# Download GO association data required for enrichment test
from goatools.base import download_go_basic_obo
from goatools.base import download_ncbi_associations
from goatools.obo_parser import GODag
from goatools.anno.genetogo_reader import Gene2GoReader
from goatools.test_data.genes_NCBI_10090_ProteinCoding import GENEID2NT as GeneID2nt_mus
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
from goatools.godag_plot import plot_gos, plot_results, plot_goid2goobj
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
obo_fname = download_go_basic_obo()
fin_gene2go = download_ncbi_associations()
obodag = GODag("go-basic.obo")
# Read NCBI's gene2go. Store annotations in a list of namedtuples
# taxid 10090 is Mus musculus
objanno = Gene2GoReader(fin_gene2go, taxids=[10090])
# Get namespace2association where:
#    namespace is:
#        BP: biological_process               
#        MF: molecular_function
#        CC: cellular_component
#    association is a dict:
#        key: NCBI GeneID
#        value: A set of GO IDs associated with that gene
ns2assoc = objanno.get_ns2assc()

for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated mouse genes".format(NS=nspc, N=len(id2gos)))

  EXISTS: go-basic.obo
  EXISTS: gene2go
go-basic.obo: fmt(1.2) rel(2019-07-01) 47,413 GO Terms
HMS:0:00:02.688893 366,685 annotations READ: gene2go 
1 taxids stored: 10090
MF 16,801 annotated mouse genes
CC 18,979 annotated mouse genes
BP 17,808 annotated mouse genes


In [8]:
exp_all_genes = {x: GeneID2nt_mus[x] for x in genes.entrezgene_id if x in GeneID2nt_mus.keys()}
exp_diff_genes = {x: GeneID2nt_mus[x] for x in genes_diff.entrezgene_id if x in GeneID2nt_mus.keys()}

In [9]:
# Build GO enrichment analysis object

goeaobj = GOEnrichmentStudyNS(
        exp_all_genes.keys(), # List of mouse protein-coding genes
        ns2assoc, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # default multiple-testing correction method


Load BP Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 80% 16,049 of 19,974 population items found in association

Load CC Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 87% 17,448 of 19,974 population items found in association

Load MF Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 79% 15,692 of 19,974 population items found in association


In [10]:
# Run gene ontology enrichment analysis
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
geneids_study = exp_diff_genes.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]


Run BP Gene Ontology Analysis: current study set of 28 IDs ...
 89%     25 of     28 study items found in association
100%     28 of     28 study items found in population(19974)
Calculating 12,146 uncorrected p-values using fisher_scipy_stats
  12,146 GO terms are associated with 16,049 of 19,974 population items
     210 GO terms are associated with     25 of     28 study items
  METHOD fdr_bh:
       0 GO terms found significant (< 0.05=alpha) (  0 enriched +   0 purified): statsmodels fdr_bh
       0 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

Run CC Gene Ontology Analysis: current study set of 28 IDs ...
 93%     26 of     28 study items found in association
100%     28 of     28 study items found in population(19974)
Calculating 1,722 uncorrected p-values using fisher_scipy_stats
   1,722 GO terms are associated with 17,448 of 19,974 population items
      65 GO terms are associated with     26 of 

When running the analysis with a stringent threshold of p-value < 10^-4, I get no enriched biological process or molecular function, only the cellular component "Nucleosome", with genes Tnp2, Prm1, Prm2 and Prm3.

When running the analysis with a less stringent filter of p-value < 10^-3, I get enrichment for the biological processes keratinization, keratinocyte differentiation and peptide cross-linking. This is due to a group of several Sprr genes in the region at chr3:92Mb-94Mb.

In [11]:
%matplotlib notebook
# Save and visualize results
gene2sym = { k: v.Symbol for k, v in exp_diff_genes.items()}
goeaobj.wr_txt(out+"go_enrichment.txt", goea_results_sig)
plot_results(out+"mouse_salmonella_GO_enriched_{NS}.png",
             goea_results_sig,
             id2symbol=gene2sym,
             study_items=6)


      1 GOEA results for     4 study items. WROTE: ../data/output/go_enrichment.txt
    1 usr  10 GOs  WROTE: ../data/output/mouse_salmonella_GO_enriched_CC.png


In [12]:
img=mpimg.imread(out+"mouse_salmonella_GO_enriched_BP.png")
plt.figure()
plt.imshow(img)
plt.show()

In [13]:
img=mpimg.imread(out+"mouse_salmonella_GO_enriched_CC.png")
plt.figure()
plt.imshow(img)
plt.show()
