### 20200422: Downstream analyses of 3D genomic changes in _L. pneumophila_ infected _A. castellanii_
**cmdoret**

In this notebook, I investigate the biological role of regions detected as changing in 3D structure during infection of amoeba by _Legionella pneumophila_. To summarise the context, this is a follow-up on 2 analyses:

1. Genome assembly and annotation of _A. castellanii_ using ONT long reads, Hi-C, Illumina shotgun and RNAseq
2. Pattern detection in Hi-C maps from _A. castellanii_ and quantification of their change during infection.

I use the annotations generated in (1) to investigate the potential roles of the affected regions identified in (2).

In [1]:
# Load files and packages
from typing import List, Union, Iterable, Optional
import numpy as np
import pandas as pd
import os
import warnings
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
import hicstuff.hicstuff as hcs
import hicstuff.view as hcv
import cooler
import pareidolia.io as pai
import pareidolia.hic_utils as pah
from cooltools.sample import sample_cooler


out = '../../data/output/'
indir = '../../data/input/'
res = 2000
samples = pd.read_csv('../../samples.tsv', sep='\t', comment='#')
infected_bedgraph = pd.read_csv(out + 'all_signals_AT420.bedgraph', sep='\t')
healthy_bedgraph = pd.read_csv(out + 'all_signals_AT421.bedgraph', sep='\t')
diff_borders = pd.read_csv(out + 'pareidolia/borders_change_infection_time.tsv', sep='\t')
diff_loops = pd.read_csv(out + 'pareidolia/loops_change_infection_time.tsv', sep='\t')
genes = pd.read_csv(indir + 'annotations/c3_annotations/Acanthamoeba_castellanii.annotations.txt', sep='\t')
genes = genes.rename(columns={"Contig": "Chromosome", "Stop": "End"})
# Remove genes which do not have an entrez id
#genes = genes.loc[~np.isnan(genes.entrezgene_id), :]
#genes.entrezgene_id = genes.entrezgene_id.astype(int)


In [20]:
genes.head()

Unnamed: 0,GeneID,Feature,Chromosome,Start,End,Strand,Name,Product,BUSCO,PFAM,InterPro,EggNog,COG,GO Terms,Secreted,Membrane,Protease,CAZyme,Notes,Translation
0,FUN_000001,CDS,scaffold_1,503,5457,-,,hypothetical protein,,PF00071,IPR001806;IPR020849;IPR027417,,,GO:0016020;GO:0005525;GO:0003924;GO:0007165,,1 (o833-852i),,,,MATKDFKIAIGGPGGVGKSAIVLQFVTGNYVSEYGAYRKQFALGDN...
1,FUN_000002,CDS,scaffold_1,7855,9815,+,,hypothetical protein,,,,,,,,,,,,MQLMSTQHHKKRTPEDHRPDLPSSARNLLASSPPARSPRSPDGSPH...
2,FUN_000003,CDS,scaffold_1,10018,10602,-,,hypothetical protein,,PF07258,IPR017920,,,,,,,,,MWRRPQSVAVGYTGAPKRASLAVGGSGPAVSGRKGDVRFPTLADLR...
3,FUN_000004,CDS,scaffold_1,10748,11427,+,,hypothetical protein,,,,,,,,1 (o90-111i),,,,MWKQSSRVAVVARSRAVVPSSSGLRRSAGALLPRRAGAETEPEWKL...
4,FUN_000005,CDS,scaffold_1,11709,30538,-,,hypothetical protein,,PF00076;PF00622;PF00632,IPR000504;IPR000569;IPR001870;IPR003877;IPR012...,,,GO:0004842;GO:0003676;GO:0005515,,2 (o2988-3009i3316-3335o),,,,MGAEEAVHLEGHHSSRSNRPATPQYSAASASGAAPPAPPLRGEAAS...


In [2]:
# Manage cools, merge samples by condition to increase coverage in visualisations
def gather_and_merge_cools(condition: str, merged_out: str):
    """Merge the cool files of all libraries from one condition into merged_out."""
    cools = (
        samples.loc[samples.condition == condition, 'library']
        .apply(lambda p: f"{out}/cool/{p}.mcool::/resolutions/{res}")
        .tolist()
    )
    cooler.merge_coolers(merged_out, cools, mergebuf=10e8)
    
gather_and_merge_cools('uninfected', "healthy.cool")
gather_and_merge_cools('infected', "infected.cool")


INFO :: Merging:
../../data/output//cool/AT408.mcool::/resolutions/2000
../../data/output//cool/AT419.mcool::/resolutions/2000
../../data/output//cool/AT421.mcool::/resolutions/2000
INFO :: NumExpr defaulting to 8 threads.
INFO :: Creating cooler at "healthy.cool::/"
INFO :: Writing chroms
INFO :: Writing bins
INFO :: Writing pixels
INFO :: nnzs: [4627584, 552433, 1126529]
INFO :: current: [4627584, 552433, 1126529]
INFO :: Writing indexes
INFO :: Writing info
INFO :: Done
INFO :: Merging:
../../data/output//cool/PM106.mcool::/resolutions/2000
../../data/output//cool/AT407.mcool::/resolutions/2000
../../data/output//cool/AT418.mcool::/resolutions/2000
../../data/output//cool/AT420.mcool::/resolutions/2000
INFO :: Creating cooler at "infected.cool::/"
INFO :: Writing chroms
INFO :: Writing bins
INFO :: Writing pixels
INFO :: nnzs: [5264991, 4340619, 2188053, 1290478]
INFO :: current: [5264991, 4340619, 2188053, 1290478]
INFO :: Writing indexes
INFO :: Writing info
INFO :: Done


In [10]:

# Subsample cools to identical coverage between conditions
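def subsample_cools(cools: List[str], cpus: int=8, balance: bool=True) -> List['cooler.Cooler']:
    """
    Subsample each cool file to just below the contact count of the smallest one,
    write the results to sub_<name> and optionally balance them.
    """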
    if cpus > 1:
        import multiprocessing as mp
        pool = mp.Pool(cpus)
        my_map = pool.imap_unordered
    else:
        my_map = map
    coolers = pai.get_coolers(cools)
    target = pah.get_min_contacts(coolers)
    sub_cools = ["sub_" + cl for cl in cools]
    for i in range(len(cools)):
        sample_cooler(
            coolers[i],
            sub_cools[i],
            count=target-1,
            exact=False,
            map_func=my_map
        )
    sub_coolers = pai.get_coolers(sub_cools)
    # Balance the subsampled versions
    if balance:
        for clr in sub_coolers:
            cooler.balance_cooler(
                clr,
                store=True,
                map=my_map,
                mad_max=5,
                chunksize=10000000,
                min_nnz=10,
                max_iters=200,
                ignore_diags=2
            )
    if cpus > 1:
        pool.close()
    return sub_coolers

healthy_cool, infected_cool = subsample_cools(['healthy.cool', 'infected.cool'])

INFO :: Creating cooler at "sub_healthy.cool::/"
INFO :: Writing chroms
INFO :: Writing bins
INFO :: Writing pixels
INFO :: Writing indexes
INFO :: Writing info
INFO :: Done
INFO :: Creating cooler at "sub_infected.cool::/"
INFO :: Writing chroms
INFO :: Writing bins
INFO :: Writing pixels
INFO :: Writing indexes
INFO :: Writing info
INFO :: Done
INFO :: variance is 207516.4035251037
INFO :: variance is 178502.21226294566
INFO :: variance is 6544.088298908443
INFO :: variance is 22429.3587515201
INFO :: variance is 1312.8879459398481
INFO :: variance is 4614.821651772319
INFO :: variance is 551.4385040972979
INFO :: variance is 1275.4581852006963
INFO :: variance is 272.55604743634456
INFO :: variance is 416.4566990524887
INFO :: variance is 131.60184953660084
INFO :: variance is 149.6257068751345
INFO :: variance is 60.93831076465593
INFO :: variance is 56.90446190648478
INFO :: variance is 27.298625543543995
INFO :: variance is 22.41011815482649
INFO :: variance is 11.949339108949578
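
The subsampling step above can be illustrated on a plain array of contact counts. This is only a minimal sketch of binomial thinning, which is, as far as I understand, what `sample_cooler` does when `exact=False`; the function name `thin_counts` and the toy counts are hypothetical and not part of the notebook.

```python
import numpy as np


def thin_counts(counts, target, seed=0):
    """Binomially thin integer contact counts so that their sum is close to target.

    Hypothetical helper for illustration only: each contact is kept
    independently with probability target / total.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    # Never upsample: cap the keep probability at 1
    frac = min(1.0, target / counts.sum())
    return rng.binomial(counts, frac)


# Toy pixel counts with 100 contacts in total, thinned to roughly 50
toy_counts = np.array([10, 0, 5, 85])
toy_sub = thin_counts(toy_counts, target=50)
```

Because every contact is kept independently, the resulting total only approximates the target, which is the trade-off of not using exact sampling.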

Only regions with p-values below 10e-3 are selected as potential candidates. Once replicates are available, I will use FDR instead of raw p-values.
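
For reference, an FDR correction of the kind mentioned above could look like the following Benjamini-Hochberg sketch. It is not part of the current analysis; the function name and the toy p-values are assumptions made for illustration.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


# Toy p-values, not taken from the pareidolia output
toy_qvals = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Candidate regions would then be selected on the adjusted values instead of the raw p-values.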

In [6]:
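def filter_patterns(df):
    """Drop patterns without change and sort the rest by absolute differential score."""
    filt = df.loc[df.diff_score != 0.0, :]
    # sort_values expects column names, so sort on the absolute score separately
    order = filt.diff_score.abs().sort_values().index
    return filt.loc[order].reset_index(drop=True)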
diff_borders = diff_borders.loc[diff_borders.diff_score != 0.0, :].reset_index(drop=True)
diff_loops = diff_loops.loc[diff_loops.diff_score != 0.0, :].reset_index(drop=True)
diff_borders.head()

Unnamed: 0,chrom1,start1,end1,chrom2,start2,end2,bin1,bin2,diff_score
0,scaffold_4,1496000,1498000,scaffold_4,1496000,1498000,4056.0,4056.0,0.172005
1,scaffold_1,402000,404000,scaffold_1,402000,404000,201.0,201.0,-0.117396
2,scaffold_1,426000,428000,scaffold_1,426000,428000,213.0,213.0,0.157139
3,scaffold_5,914000,916000,scaffold_5,914000,916000,4592.0,4592.0,-0.133855
4,scaffold_8,1298000,1300000,scaffold_8,1298000,1300000,7139.0,7139.0,-0.158142


In [7]:
print(f"Keeping {diff_borders.shape[0]} borders and {diff_loops.shape[0]} loops with detectable change..")

Keeping 44 borders and 76 loops with detectable change..


In [8]:
def compute_di(mat, max_pix=10):
    """Computes a directionality index adapted from the definition of Dixon et al., 2012."""
    
    di = np.zeros(mat.shape[0])
    for i in range(mat.shape[0]):
        before = np.nanmean(mat[i - max_pix: i, i -  max_pix: i])
        after = np.nanmean(mat[i: i + max_pix, i: i +  max_pix])
        expected = (before + after ) / 2
        sign = np.sign(before - after)  # 0 instead of NaN when both sides are equal
        di[i] = sign * (((after - expected)**2) / expected + ((before - expected)**2) / expected)
    di[:max_pix] = 0
    di[-max_pix:] = 0
    high = np.nanmax(np.abs(di))
    di /= high # Scale to 1
    return di
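
Written out, the score computed by `compute_di` is the following. This is transcribed from the code above, not from the paper; the symbols $M$, $w$ (for `max_pix`), $B_i$, $A_i$ and $E_i$ are notation introduced here.

```latex
B_i = \operatorname{mean}\left(M_{[i-w,\,i),\,[i-w,\,i)}\right), \qquad
A_i = \operatorname{mean}\left(M_{[i,\,i+w),\,[i,\,i+w)}\right), \qquad
E_i = \frac{A_i + B_i}{2}

\mathrm{DI}_i = \operatorname{sgn}(B_i - A_i)
\left( \frac{(A_i - E_i)^2}{E_i} + \frac{(B_i - E_i)^2}{E_i} \right)
```

The first and last $w$ bins are set to 0 and the profile is divided by $\max_i |\mathrm{DI}_i|$. Note that $B_i$ and $A_i$ are means over the square blocks before and after bin $i$. As far as I recall, the original definition uses summed upstream and downstream contacts of the bin and takes the sign of downstream minus upstream, so this should be read as a DI-like score with its own sign convention.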

In [11]:
from IPython.display import set_matplotlib_formats
%matplotlib inline
set_matplotlib_formats('svg')
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (9.0, 6.0)
# Show example regions
def plot_regions(healthy_cool, infected_cool, df, region_id, region_size=100000, blur=0):
    
    def region_lines(ax, s, e):
        """Add lines to mark region of differential contacts"""
        style = {"lw": 0.5, "alpha": 0.6, "c": "g"}
        for i in range(3):
            ax[0, i].axvline(x=s, **style)
            ax[0, i].axvline(x=e, **style)
            ax[0, i].axhline(y=s, **style)
            ax[0, i].axhline(y=e, **style)
            ax[1, i].axvline(x=s, **style)
            ax[1, i].axvline(x=e, **style)
            
    
    def nan_gaussian(U, sigma):
        """Gaussian filter which does not include NAs"""
        import scipy.ndimage as ndi
        V=U.copy()
        V[np.isnan(U)]=0
        VV=ndi.gaussian_filter(V, sigma=sigma)
        W=0*U.copy()+1
        W[np.isnan(U)]=0
        WW=ndi.gaussian_filter(W, sigma=sigma)
        Z=VV/WW   
        return Z
    
    # Extract region of interest
    region = df.iloc[region_id]
    chrom, start, end = region.chrom1, region.start1, region.end1
    pos = (start + end) / 2
    ucsc_query = f'{chrom}:{int(max(0, pos-region_size))}-{int(pos+region_size)}'
    
    # Subset matrix
    healthy_zoom = healthy_cool.matrix(balance=True).fetch(ucsc_query)
    infected_zoom = infected_cool.matrix(balance=True).fetch(ucsc_query)
    infected_zoom[np.isnan(infected_zoom)] = 0.0
    healthy_zoom[np.isnan(healthy_zoom)] = 0.0
    
    # Blur ratio to improve readability
    if blur > 0:
        infected_blur = nan_gaussian(infected_zoom, sigma=blur)
        healthy_blur = nan_gaussian(healthy_zoom, sigma=blur)
    else:
        infected_blur = infected_zoom
        healthy_blur = healthy_zoom

    log_ratio = np.log2(infected_blur / healthy_blur)
    #log_ratio[np.isnan(infected_zoom)] = 0.0
    #log_ratio[np.isnan(healthy_zoom)] = 0.0
    
    # Initialize figure
    fig, ax = plt.subplots(2, 3, sharex=True, sharey=False, gridspec_kw={'height_ratios': [5, 1]})
    
    # Draw lines
    region_size = (end - start) // healthy_cool.binsize
    mid = healthy_zoom.shape[0] // 2
    start_bin, end_bin = mid - region_size, mid + region_size
    region_lines(ax, start_bin, end_bin)
    
    # Make heatmap
    plt.suptitle(f"{chrom}:{start}-{end}")
    ax[0, 0].imshow(np.log2(infected_zoom), cmap="Reds")
    ax[0, 0].set_title("Infected")
    ax[0, 1].imshow(np.log2(healthy_zoom), cmap="Reds")
    ax[0, 1].set_title("Uninfected")
    ax[0, 2].imshow(log_ratio, cmap='bwr', vmin=-2, vmax=2)
    ax[0, 2].set_title("I / U")
    
    # Plot DI
    n = infected_zoom.shape[0]
    infected_di = compute_di(infected_zoom, max_pix=10)
    healthy_di = compute_di(healthy_zoom, max_pix=10)
    
    for i in range(3): ax[1, i].set_ylim(-1, 1, emit=False)
    for i in range(3): ax[1, i].axhline(0, lw=0.5, c='black')
    ax[1, 0].plot(range(n), infected_di)
    ax[1, 1].plot(range(n), healthy_di)
    ax[1, 2].plot(range(n), infected_di - healthy_di)



from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
#show_inter_chr(ot_mat, chrA='chr03', chrB='chr08', var='repeats_prop')
region_size_slider = widgets.IntSlider(min=10000, max=500000, step=2000, value=100000)
blur_slider = widgets.FloatSlider(min=0, max=4, step=0.1, value=0)
pl = interactive(plot_regions, healthy_cool=fixed(healthy_cool),
                 infected_cool=fixed(infected_cool),
                 df=fixed(diff_borders),
                 region_id=range(diff_borders.shape[0]),
                 region_size=region_size_slider,
                 blur=blur_slider
                )
#display(pl)
display(pl)
plt.savefig('test.svg')

interactive(children=(Dropdown(description='region_id', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…

<Figure size 648x432 with 0 Axes>

In [None]:
%matplotlib inline
set_matplotlib_formats('svg')
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (9.0, 6.0)
# Show example regions


#show_inter_chr(ot_mat, chrA='chr03', chrB='chr08', var='repeats_prop')
pl = interactive(plot_regions, healthy_cool=fixed(healthy_cool),
                 infected_cool=fixed(infected_cool),
                 df=fixed(diff_loops),
                 region_id=range(diff_loops.shape[0]),
                 region_size=region_size_slider,
                 blur=blur_slider
                )
#display(pl)
display(pl)
plt.savefig('test.svg')

In [None]:
diff_loops.diff_score.hist()
plt.title("Differential loop scores (infected - healthy)")
plt.ylabel("N loops")
plt.xlabel("Differential score")

In [None]:
diff_borders.diff_score.hist()
plt.title("Differential border scores (infected - healthy)")
plt.ylabel("N borders")
plt.xlabel("Differential score")

All genes that overlap regions of differential contacts are then selected. This may not be optimal because genes may be separated from their enhancer element without being affected directly. This can happen if a domain boundary moves into the sequence between gene and enhancer.
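
The selection by overlap can be sketched with plain pandas as below. The notebook itself relies on pyranges for this step; the helper `overlapping_genes` and its `flank` parameter are hypothetical and only show how regions could be extended to also catch genes next to an affected region.

```python
import pandas as pd


def overlapping_genes(genes, regions, flank=0):
    """Return genes overlapping any region, with regions extended by flank bp.

    Both tables need Chromosome, Start and End columns; genes also needs GeneID.
    """
    # Pair every gene with every region on the same chromosome
    merged = genes.merge(regions, on="Chromosome", suffixes=("", "_region"))
    hit = (merged.Start < merged.End_region + flank) & (
        merged.End > merged.Start_region - flank
    )
    return genes.loc[genes.GeneID.isin(merged.loc[hit, "GeneID"])]


# Toy coordinates, not taken from the annotation
toy_genes = pd.DataFrame({
    "GeneID": ["g1", "g2", "g3"],
    "Chromosome": ["scaffold_1", "scaffold_1", "scaffold_2"],
    "Start": [100, 5000, 100],
    "End": [900, 6000, 900],
})
toy_regions = pd.DataFrame(
    {"Chromosome": ["scaffold_1"], "Start": [800], "End": [2800]}
)
direct = overlapping_genes(toy_genes, toy_regions)
nearby = overlapping_genes(toy_genes, toy_regions, flank=2500)
```

With `flank=0` only genes touching the region itself are kept; a positive flank is one simple way to include neighbouring genes whose regulation might be affected indirectly.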

The enrichment of GO terms (annotations) within regions with altered chromatin conformation is then tested to check for specific pathways or functions.
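
The idea behind such an enrichment test can be sketched with a one-sided hypergeometric test per GO term, using only the standard library. This is a simplified stand-in for a dedicated GO enrichment package, without multiple testing correction; the function names, gene names and term labels below are made up for illustration.

```python
from math import comb
from typing import Dict, Set


def hypergeom_upper_tail(k: int, n_total: int, n_term: int, n_study: int) -> float:
    """P(X >= k) when n_study genes are drawn from n_total, n_term of which carry the term."""
    upper = min(n_term, n_study)
    favourable = sum(
        comb(n_term, i) * comb(n_total - n_term, n_study - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_study)


def go_enrichment(annot: Dict[str, Set[str]], study: Set[str]) -> Dict[str, float]:
    """Return one enrichment p-value per GO term found in annot."""
    population = set(annot)
    study = study & population
    terms = set().union(*annot.values())
    pvals = {}
    for term in terms:
        with_term = {gene for gene, gene_terms in annot.items() if term in gene_terms}
        k = len(with_term & study)
        pvals[term] = hypergeom_upper_tail(k, len(population), len(with_term), len(study))
    return pvals


# Toy annotation: 10 genes, 3 of which carry the hypothetical term "GO:A"
toy_annot = {f"g{i}": set() for i in range(10)}
for gene in ("g0", "g1", "g2"):
    toy_annot[gene] = {"GO:A"}
toy_pvals = go_enrichment(toy_annot, study={"g0", "g1", "g5", "g6"})
```

In this toy case, 2 of the 4 study genes carry the term against 3 of 10 in the whole set, which gives a p-value of 1/3.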

In [12]:
genes.head()


Unnamed: 0,GeneID,Feature,Contig,Start,Stop,Strand,Name,Product,BUSCO,PFAM,InterPro,EggNog,COG,GO Terms,Secreted,Membrane,Protease,CAZyme,Notes,Translation
0,FUN_000001,CDS,scaffold_1,503,5457,-,,hypothetical protein,,PF00071,IPR001806;IPR020849;IPR027417,,,GO:0016020;GO:0005525;GO:0003924;GO:0007165,,1 (o833-852i),,,,MATKDFKIAIGGPGGVGKSAIVLQFVTGNYVSEYGAYRKQFALGDN...
1,FUN_000002,CDS,scaffold_1,7855,9815,+,,hypothetical protein,,,,,,,,,,,,MQLMSTQHHKKRTPEDHRPDLPSSARNLLASSPPARSPRSPDGSPH...
2,FUN_000003,CDS,scaffold_1,10018,10602,-,,hypothetical protein,,PF07258,IPR017920,,,,,,,,,MWRRPQSVAVGYTGAPKRASLAVGGSGPAVSGRKGDVRFPTLADLR...
3,FUN_000004,CDS,scaffold_1,10748,11427,+,,hypothetical protein,,,,,,,,1 (o90-111i),,,,MWKQSSRVAVVARSRAVVPSSSGLRRSAGALLPRRAGAETEPEWKL...
4,FUN_000005,CDS,scaffold_1,11709,30538,-,,hypothetical protein,,PF00076;PF00622;PF00632,IPR000504;IPR000569;IPR001870;IPR003877;IPR012...,,,GO:0004842;GO:0003676;GO:0005515,,2 (o2988-3009i3316-3335o),,,,MGAEEAVHLEGHHSSRSNRPATPQYSAASASGAAPPAPPLRGEAAS...


In [16]:
import pyranges as pr
genes_pr = pr.PyRanges(genes)

In [50]:
diff_borders_pr = pr.PyRanges(
    diff_borders
    .loc[:, ["chrom1", "start1", "end1", "diff_score"]]
    .rename(columns={"chrom1": "Chromosome", "start1": "Start", "end1": "End", "diff_score": "Score"})
)
diff_loops_pr = pr.PyRanges(
    pd.concat([
        (diff_loops
            .loc[:, ["chrom1", "start1", "end1", "diff_score"]]
            .rename(columns={"chrom1": "Chromosome", "start1": "Start", "end1": "End"})),
        (diff_loops
        .loc[:, ["chrom2", "start2", "end2", "diff_score"]]
        .rename(columns={"chrom2": "Chromosome", "start2": "Start", "end2": "End"}))
    ])
)

In [127]:
# Overlap with annotations
diff_borders_genes = genes_pr.overlap(diff_borders_pr).df
diff_loops_genes = genes_pr.overlap(diff_loops_pr).df

Extract known proteins, excluding tRNA genes, for visual inspection.

In [134]:
def gene_stats(df, name):
    """Print the number of genes, of known (non-hypothetical) products and of tRNAs in df."""
    # Boolean masks: summing them counts the True entries
    is_known = df.Product != 'hypothetical protein'
    known = is_known.sum()
    is_trna = df.Product.str.contains('tRNA-')
    trna = is_trna.sum()
    print(
        f"There are {df.shape[0]} {name}, of which {known} have known functions, "
        f"including {trna} tRNAs."
    )

In [135]:
known_loops_genes = diff_loops_genes.loc[(~diff_loops_genes.Product.str.contains('tRNA-')) & (diff_loops_genes.Product != 'hypothetical protein'), :]

In [136]:
known_borders_genes = diff_borders_genes.loc[(~diff_borders_genes.Product.str.contains('tRNA-')) & (diff_borders_genes.Product != 'hypothetical protein'), :]

In [137]:

gene_stats(genes, "genes in the genome")


There are 14354 genes in the genome, of which 735 have known functions, including 387 tRNAs.


In [138]:
gene_stats(diff_loops_genes, 'infection dependent loop genes')

There are 1704 infection dependent loop genes, of which 55 have known functions, including 25 tRNAs.


In [139]:
gene_stats(diff_borders_genes, 'infection dependent border genes')

There are 3063 infection dependent border genes, of which 125 have known functions, including 63 tRNAs.


In [148]:
print(known_borders_genes.loc[:, ["GeneID", "Feature", "Chromosome", "Start", "End", "Strand", "Name", "Product",]].to_markdown())

|      | GeneID     | Feature   | Chromosome   |   Start |     End | Strand   | Name        | Product                                                                           |
|-----:|:-----------|:----------|:-------------|--------:|--------:|:---------|:------------|:----------------------------------------------------------------------------------|
|   15 | FUN_000207 | CDS       | scaffold_1   |  616228 |  618206 | +        | GSK3B       | Glycogen synthase kinase-3 beta                                                   |
|  126 | FUN_000336 | CDS       | scaffold_1   | 1007079 | 1008861 | -        | FTSJ1       | Putative tRNA (cytidine(32)/guanosine(34)-2'-O)-methyltransferase                 |
|  136 | FUN_000422 | CDS       | scaffold_1   | 1323317 | 1324222 | -        | AP1S2       | AP-1 complex subunit sigma-2                                                      |
|  151 | FUN_000568 | CDS       | scaffold_1   | 1720411 | 1721486 | -        | RAB2A_1     | Ras- protein Rab

In [None]:
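# Compute GO enrichment
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
from goatools.godag_plot import plot_results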


In [None]:
exp_all_genes = {x: GeneID2nt_mus[x] for x in genes.entrezgene_id if x in GeneID2nt_mus.keys()}
exp_diff_genes = {x: GeneID2nt_mus[x] for x in genes_diff.entrezgene_id if x in GeneID2nt_mus.keys()}

In [None]:
exp_diff_genes

In [None]:
# Build GO enrichment analysis object

goeaobj = GOEnrichmentStudyNS(
        exp_all_genes.keys(), # List of mouse protein-coding genes
        ns2assoc, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # default multipletest correction method

In [None]:
# Run gene ontology enrichment analysis
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
geneids_study = exp_diff_genes.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]

When running the analysis with a stringent threshold of p-value < 10^-4, I get no enriched biological process or molecular function, only the cellular component "Nucleosome" with genes Tnp2, Prm1, Prm2 and Prm3.

When running the analysis with a less stringent filter of p-value < 10^-3, I get enrichment for the biological processes keratinization, keratinocyte differentiation and peptide cross-linking. This is due to a group of several Sprr genes in the region at chr3:92Mb-94Mb.

In [None]:
%matplotlib notebook
# Save and visualize results
gene2sym = { k: v.Symbol for k, v in exp_diff_genes.items()}
goeaobj.wr_txt(out+"go_enrichment.txt", goea_results_sig)
plot_results(out+"mouse_salmonella_GO_enriched_{NS}.png",
             goea_results_sig,
             id2symbol=gene2sym,
             study_items=6)


In [None]:
img=mpimg.imread(out+"mouse_salmonella_GO_enriched_BP.png")
plt.figure()
plt.imshow(img)
plt.show()

In [None]:
img=mpimg.imread(out+"mouse_salmonella_GO_enriched_CC.png")
plt.figure()
plt.imshow(img)
plt.show()

In [None]:
img=mpimg.imread(out+"mouse_salmonella_GO_enriched_MF.png")
plt.figure()
plt.imshow(img)
plt.show()