# Analysis of context effects on synthetic gene expression

In this example we study the effects of compositional and cellular context on gene expression using triple reporter plasmids. See paper (REF) for details of plasmid composition. In summary, each plasmid contains three transcription units (TUs) producing RFP, YFP and CFP. The CFP TU is the same in all plasmids, while the promoters of the RFP and YFP TUs vary, generating 14 different combinations, or contexts, with a common reference gene.

First let's import the packages that we need, including the Flapjack API, and set some parameters for plotting with matplotlib:

In [None]:
import flapjack
from flapjack import Flapjack
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as io
import json
import pandas as pd
import seaborn as sns
import getpass
%matplotlib inline

SMALL_SIZE = 6
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=SMALL_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=SMALL_SIZE)  # fontsize of the figure title

io.orca.shutdown_server()

## Plotting the data
TODO: Change this to explain how to make the plot with the UI.

Using the Flapjack webapp, filter the data to select the study "Phase space" (** update name), and then choose the DNA (plasmid) named "pAAA". To compare measurements, group the data by DNA (tabs), DNA (subplots), and "Name" (lines). Because the measurements differ widely in magnitude, normalize them, here by the min/max of each sample, by selecting this option from the dropdown menu. Now click the "View" button to see the plots. These plots look nice in the web interface, but for publication or reports we can format them better using Plotly. Here we format the figure to be half the width of a 1-column figure (1.65 inches wide) with a 6pt font. To do this click "Figure (JSON)" to download the figure as a JSON file, and load it in the cell below by choosing the correct file name:
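As a minimal sketch of the reformatting step, the downloaded JSON can be treated as a plain dictionary with `data` and `layout` keys and resized before rendering. The helper name `format_for_print`, the 96 pixels-per-inch conversion, and the 1.25 inch height are assumptions made for illustration; only the 1.65 inch width and 6pt font come from the text above. The in-memory `example` dictionary stands in for the downloaded file.

```python
import json

PX_PER_INCH = 96  # assumed conversion from inches to Plotly's pixel units


def format_for_print(fig_dict, width_in=1.65, height_in=1.25, font_size=6):
    """Return a resized copy of a Plotly figure dictionary (hypothetical helper)."""
    fig = json.loads(json.dumps(fig_dict))  # deep copy via a JSON round trip
    layout = fig.setdefault('layout', {})
    layout['width'] = round(width_in * PX_PER_INCH)
    layout['height'] = round(height_in * PX_PER_INCH)
    layout.setdefault('font', {})['size'] = font_size
    return fig


# Stand-in for the downloaded figure; in practice load it with
# json.load(open(<your file name>)) instead.
example = {
    'data': [{'type': 'scatter', 'x': [0, 1, 2], 'y': [0.0, 0.5, 1.0]}],
    'layout': {'title': {'text': 'pAAA'}},
}

formatted = format_for_print(example)
print(formatted['layout']['width'], formatted['layout']['height'])
```

The resulting dictionary can be passed to `plotly.graph_objects.Figure(formatted)` to display or export it.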

## Plotting the expression rate of each TU
TO DO: Change this to explain how to make the plot with the UI.

To analyze the behaviour of the TUs in more detail we can compute the expression rate (or synthesis rate) of the reporters. To do this, go to the analysis form. Choose "Expression rate (direct)". This implements the linear inversion method of NAME et al. (REF). Choose the degradation rate of the protein as 0 (stable protein) and the density name as "OD" (this is the biomass for normalization). Then click the analyze button to produce the plot. Again we can save the plot as JSON and reformat here in the notebook:
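To build intuition for what an expression rate measures, here is a simplified sketch. It is not the linear inversion method used by Flapjack; it swaps in a plain finite-difference estimate. It assumes a stable protein (degradation rate 0), so that the fluorescence `F` obeys `dF/dt = k(t) * OD(t)` and the rate per unit biomass is `k = (dF/dt) / OD`. The function name and the synthetic growth curve are illustrative assumptions.

```python
import numpy as np


def expression_rate(time, fluorescence, biomass):
    """Finite-difference estimate of synthesis rate per unit biomass.

    Assumes a stable protein, so dF/dt = rate * biomass.
    """
    dfdt = np.gradient(fluorescence, time)
    return dfdt / biomass


# Synthetic data: exponential growth with a constant true rate
t = np.linspace(0, 10, 1001)                 # time in hours
od = 0.01 * np.exp(0.5 * t)                  # biomass (OD)
true_rate = 3.0
# Fluorescence is the integral of true_rate * OD from 0 to t
fluo = true_rate * 0.01 * (np.exp(0.5 * t) - 1) / 0.5

rate = expression_rate(t, fluo, od)
print(np.median(rate))
```

On noisy plate-reader data a raw derivative like this amplifies noise, which is one reason to smooth the data or use an inversion method.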

## Summarizing dynamics with mean expression rates
A first approach to summarizing the overall dynamics of a genetic circuit is to take the mean level of expression, as approximated by the signal detected in the assay. This allows us to compare the average rates of gene expression.

In [None]:
user = input()
passwd = getpass.getpass()
fj = Flapjack('3.128.232.8:8000')
#fj = Flapjack('localhost:8000')
fj.log_in(username=user, password=passwd)

In [None]:
od = fj.get('signal', name='OD')
cfp = fj.get('signal', name='CFP')
study = fj.get('study', name=['Context effects'])
exp = fj.analysis(study=study.id,
                    type='Mean Expression',
                    biomass_signal=od.id[0],
                    #ref_signal=cfp.id[0],
                    #bounds = [[0,0,0,0], [1,2,2,24]]
                    #bg_correction=2,
                    #min_biomass=0.05,
                    #remove_data=False
                      )

In [None]:
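# Normalize YFP and RFP expression by the CFP reference in each sample
normalized = []
for samp, data in exp.groupby('Sample'):
    data = data.copy()
    yfp_exp = data[data.Signal=='YFP']['Expression'].values
    rfp_exp = data[data.Signal=='RFP']['Expression'].values
    cfp_exp = data[data.Signal=='CFP']['Expression'].values
    data.loc[data.Signal=='YFP', ['Expression']] = yfp_exp/cfp_exp
    data.loc[data.Signal=='RFP', ['Expression']] = rfp_exp/cfp_exp
    normalized.append(data)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
nexp = pd.concat(normalized)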

In [None]:
nexp

Create a heatmap of gene expression in each condition by pivoting the dataframe:

In [None]:
fig,ax = plt.subplots(2,1, figsize=(3.5,2.25), sharex=True)
for i,name in enumerate(['RFP', 'YFP']):
    df_x = nexp[nexp['Signal']==name].copy()
    df_heatmap = df_x.pivot_table(values='Expression',
                                index=['Strain', 'Media'],
                                columns='Vector', aggfunc=np.mean)
    # Normalize to mean of columns
    df_heatmap = df_heatmap / df_heatmap.mean()
    # Normalize rows to mean
    #df_heatmap = df_heatmap.div( df_heatmap.mean(axis=1), axis=0 )
    # Take log of normalized values
    df_heatmap = df_heatmap.apply(np.log2)
    
    # Plot heatmap
    sns.heatmap(df_heatmap, annot=False, ax=ax[i], 
                square=True, 
                cmap='bwr', 
                center=0,
                #clim=[-1,1],
                vmin=-1, vmax=1, 
                facecolor='gray',
                linewidths=0.5, linecolor='black')
    # Format plot
    bottom, top = ax[i].get_ylim()
    ax[i].set_ylim(bottom + 0.5, top - 0.5)
    ax[i].set_title(name)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('')
    #plt.tight_layout()
    plt.subplots_adjust(hspace=0.2)
    plt.xticks(rotation=90)
    plt.title(name)
plt.tight_layout()
plt.savefig('heatmap_rpus.png', dpi=300, bbox_inches='tight')

## Using SynbioHub to compare compositional contexts

First get the DNAs in the study:

Next we query SynbioHub to get the part composition and add this to our dataframe. We are interested in the identity of the RFP and YFP TUs, which are encoded as "engineered regions":

In [None]:
from sbol import *

# Some nicer names for display purposes
TU_names = {
    'TU1_1': 'A',
    'TU1_2': 'B',
    'TU1_5': 'E',
    'TU1_8': 'G',
    
    'TU2_1': 'A',
    'TU2_3': 'C',
    'TU2_5': 'E',
    'TU2_6': 'D',
    'TU2_7': 'F',
}

# The URI of "Engineered region" used to encode the TUs
TU_role = 'http://identifiers.org/so/SO:0000804'

df = nexp.copy()
vectors = df.Vector.unique()

synbiouc = PartShop('http://3.128.232.8:7777')

result = pd.DataFrame()
rows_to_add = []
for vector in vectors:
    vec = fj.get('vector', name=[vector])
    dna_id = vec.dnas[0]
    dna = fj.get('dna', id=[dna_id])
    sboluri = dna.sboluri[0]
    data = df[df.Vector==vec.name[0]]

    if sboluri!='':
        # Create a new SBOL document
        doc = Document()
        synbiouc.pull(sboluri, doc)
        plasmid = doc.componentDefinitions[sboluri]
        composition = plasmid.getPrimaryStructure()
        TUs = [component.displayId for component in composition \
                       if TU_role in component.roles]
        # The first TU is the RFP TU
        data = data.assign(rfp_tu=TU_names[TUs[0]])
        # The second TU is the YFP TU
        data = data.assign(yfp_tu=TU_names[TUs[1]])
        rows_to_add.append(data)
    else:
        print('No SBOL URI')

# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows
df = pd.concat(rows_to_add)

# Save data to JSON for later analysis
df.to_json('context_effects.json')



In [None]:
np.any(np.isnan(df.Expression.values))

Now we can make heatmaps to compare the mean expression rate ratio of each TU in its different compositional contexts. To do this we pivot the table to have the YFP TU name along the x-axis and the RFP TU name along the y-axis. In order to see the effect of context irrespective of the overall magnitude of expression, we normalize by the mean of the rows of the heatmap for RFP and the columns for YFP. We then take the log base 2 to see the fold change over the mean. In this way we expect the heatmap to be uniformly zero if there are no context effects.
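The claim that the normalized heatmap is zero without context effects can be checked on a toy table. This is a sketch with made-up TU strengths (the values and the helper name `log2_fold_over_row_mean` are assumptions, not data from the study): RFP expression depends only on the RFP TU, so dividing each row by its mean and taking log2 gives zeros, and a single context-dependent entry stands out.

```python
import numpy as np
import pandas as pd

rfp_tus = ['B', 'E', 'G']
yfp_tus = ['A', 'C', 'D']
# Hypothetical intrinsic strength of each RFP TU
rfp_strength = np.array([1.0, 4.0, 0.5])

# Without context effects RFP expression is constant along each row
rfp_table = pd.DataFrame(
    np.outer(rfp_strength, np.ones(len(yfp_tus))),
    index=pd.Index(rfp_tus, name='rfp_tu'),
    columns=pd.Index(yfp_tus, name='yfp_tu'),
)


def log2_fold_over_row_mean(table):
    """Normalize each row to its mean, then take log2."""
    return np.log2(table.div(table.mean(axis=1), axis=0))


no_context = log2_fold_over_row_mean(rfp_table)

# Add one context effect: RFP TU 'E' doubles when placed next to YFP TU 'C'
with_context = rfp_table.copy()
with_context.loc['E', 'C'] *= 2
context = log2_fold_over_row_mean(with_context)

print(no_context.round(2))
print(context.round(2))
```

Note that the affected row shifts as a whole, because the row mean also changes; the other rows stay at zero.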

In [None]:
df.Media.unique()

In [None]:
grouped_media = df.groupby('Media')
for media,media_data in grouped_media:
    grouped_strain = media_data.groupby('Strain')
    # Use a separate name for the per-strain subset so that `df` is not overwritten
    for strain,strain_data in grouped_strain:
        fig,ax = plt.subplots(1,2, figsize=(3.3,1.4), sharex=False, sharey=False)
        cbar_ax = fig.add_axes([0.91, .1, .03, .75])
        df_c = strain_data[strain_data['Signal']=='CFP']
        for i,name in enumerate(['RFP', 'YFP']):
            df_x = strain_data[strain_data['Signal']==name].copy(deep=True)
            df_heatmap = df_x.pivot_table(values='Expression',
                                            index='rfp_tu',
                                            columns='yfp_tu', aggfunc=np.mean)
            if name=='YFP':
                # Normalize columns to mean
                df_heatmap = df_heatmap / df_heatmap.mean()
            else:
                # Normalize rows to mean
                df_heatmap = df_heatmap.div( df_heatmap.mean(axis=1), axis=0 )
            
            # Take log of normalized values
            df_heatmap = df_heatmap.apply(np.log2)
                
            g = sns.heatmap(df_heatmap, annot=True, ax=ax[i], 
                        square=True, fmt='0.1f', cmap='bwr', 
                        center=0, vmin=-3., vmax=3., linewidths=1, linecolor='black',
                        cbar=(i==1), cbar_ax=cbar_ax)
            g.set_facecolor('gray')
            bottom, top = ax[i].get_ylim()
            ax[i].set_title(name)
            ax[i].set_ylim(bottom + 0.5, top - 0.5)
            ax[i].set_xlabel('YFP TU')
            if i==0:
                ax[i].set_ylabel('RFP TU')
            else:
                ax[i].set_ylabel('')
        #plt.tight_layout()
        plt.subplots_adjust(bottom=0.3)
        plt.suptitle(strain + ' in ' + media)
        plt.savefig('heatmap_rpu_tu_'+media+'_'+strain+'.png', dpi=300)

## Statistical analysis of context effects on gene expression

In [None]:
import statsmodels as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm


label_map = {
    'C(Media)': 'Media',
    'C(Strain)': 'Strain',
    'C(rfp_tu)': 'RFP TU',
    'C(yfp_tu)': 'YFP TU',
    'C(rfp_tu):C(yfp_tu)': 'RFP/YFP TU',
    'C(Vector)': 'Vector',
    'Residual': 'Other'
}

df = nexp #pd.read_json('context_effects.json').dropna()

#fig,ax = plt.subplots(1,2, figsize=(6.7,2.5))
fig = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]])
i = 0
for name in ['RFP', 'YFP']:
    data = df[df['Signal']==name]
    #print(data.head())
    results = ols('Expression ~ C(Media) \
                  + C(Strain) \
                  + C(Vector)', data=data).fit()
    results.summary()

    aov_table = anova_lm(results, typ=2)

    ss_tot = aov_table['sum_sq'].sum()
    #print(ss_tot)

    aov_table['eta'] = aov_table['sum_sq'] / ss_tot *100

    print(aov_table)

    labels=[label_map[ind] for ind in aov_table.index]
    #ax[i].pie(aov_table['eta']) #, labels=labels)
    #plt.legend(labels)
    #ax[i].set_title(name)
    pie = go.Pie(labels=labels, values=aov_table['eta'])
    fig.add_trace(pie, row=1, col=i+1)
    i += 1
    
fig.show()
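
The `eta` column above expresses each term's sum of squares as a percentage of the summed sums of squares in the ANOVA table. As a rough illustration of the quantity, here is a sketch for the simplest case, a single factor, computed by hand with NumPy. The function name and the toy numbers are assumptions; the multi-factor type 2 table produced by `anova_lm` is not reproduced here.

```python
import numpy as np


def eta_squared_percent(groups):
    """Percentage of total variation explained by group membership.

    groups: list of 1-D arrays, one per level of a single factor.
    """
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean (residual)
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    return 100 * ss_between / (ss_between + ss_within)


# Toy data: two levels of a factor, three replicates each
groups = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
eta = eta_squared_percent(groups)
print(round(eta, 1))
```

In this toy case most of the variation is between groups, so the factor would dominate the corresponding pie chart.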


In [None]:
results.summary()

## Effect of context on gene expression time dynamics

In [None]:
yfp_vectors = [
    ['pBFA', 'pEFA', 'pGFA'],
    ['pBDA', 'pEDA', 'pGDA'],
    ['pBCA', 'pECA', 'pGCA'],
    ['pAAA', 'pBAA', 'pEAA', 'pGAA']
]
yfp_vector_ids = [[fj.get('vector', name=name).id[0] for name in vecs] for vecs in yfp_vectors]
yfp_id = fj.get('signal', name='YFP').id

medias = ['M9-glucosa', 'M9-glicerol']
strains = ['MG1655z1', 'Top10']

# YFP figures
for media in medias:
    for strain in strains:
        print(media, strain)
        for vi,vector_id in enumerate(yfp_vector_ids):
            media_id = fj.get('media', name=media).id
            strain_id = fj.get('strain', name=strain).id
            fig = fj.plot(study=study.id, 
                           vector=vector_id,
                           media=media_id,
                           strain=strain_id,
                           signal=yfp_id,
                           type='Expression Rate (indirect)',
                           biomass_signal=od.id[0],
                           pre_smoothing=11,
                           post_smoothing=11,
                            normalize='Temporal Mean', 
                            subplots='Signal', 
                            markers='Vector', 
                            plot='Mean')
            fig = flapjack.layout_print(fig, width=1.65, height=1.25)
            fname = '-'.join([media, strain, yfp_vectors[vi][0][2], 'YFP.png'])
            io.write_image(fig, fname)

In [None]:
rfp_vectors = [
    ['pBAA', 'pBCA', 'pBDA', 'pBFA'],
    ['pEAA', 'pECA', 'pEDA', 'pEFA'],
    ['pGAA', 'pGCA', 'pGDA', 'pGEA', 'pGFA']
]

rfp_vector_ids = [[fj.get('vector', name=name).id[0] for name in vecs] for vecs in rfp_vectors]
rfp_id = fj.get('signal', name='RFP').id

medias = ['M9-glucosa', 'M9-glicerol']
strains = ['MG1655z1', 'Top10']

# RFP figures
for media in medias:
    for strain in strains:
        print(media, strain)
        for vi,vector_id in enumerate(rfp_vector_ids):
            media_id = fj.get('media', name=media).id
            strain_id = fj.get('strain', name=strain).id
            fig = fj.plot(study=study.id, 
                           vector=vector_id,
                           media=media_id,
                           strain=strain_id,
                           signal=rfp_id,
                           type='Expression Rate (indirect)',
                           biomass_signal=od.id[0],
                           pre_smoothing=11,
                           post_smoothing=11,
                            normalize='Temporal Mean', 
                            subplots='Signal', 
                            markers='Vector', 
                            plot='Mean')
            fig = flapjack.layout_print(fig, width=1.65, height=1.25)
            fname = '-'.join([media, strain, rfp_vectors[vi][0][1], 'RFP.png'])
            io.write_image(fig, fname)

In [None]:
cfp_id = fj.get('signal', name='CFP').id
fig = fj.plot(study=study.id, 
               signal=cfp_id,
               type='Expression Rate (indirect)',
               biomass_signal=od.id[0],
               pre_smoothing=11,
               post_smoothing=11,
                normalize='Temporal Mean', 
                subplots='Signal', 
                markers='Vector', 
                plot='Mean')
fig = flapjack.layout_print(fig, width=1.65, height=1.25)
fig.update_traces(showlegend=False)
fname = 'CFP.png'
io.write_image(fig, fname)

In [None]:
# CFP figures, one per media/strain combination.
# This follows the same fj.plot pattern as the YFP and RFP cells above.
for media in medias:
    for strain in strains:
        print(media, strain)
        media_id = fj.get('media', name=media).id
        strain_id = fj.get('strain', name=strain).id
        cfp_fig = fj.plot(study=study.id,
                          media=media_id,
                          strain=strain_id,
                          signal=cfp_id,
                          type='Expression Rate (indirect)',
                          biomass_signal=od.id[0],
                          pre_smoothing=11,
                          post_smoothing=11,
                          normalize='Temporal Mean',
                          subplots='Signal',
                          markers='Vector',
                          plot='Mean')
        # Layout the figure better for our purposes
        cfp_fig = flapjack.layout_print(cfp_fig, width=1.65, height=1.25)
        cfp_fig.update_yaxes(title='Expression rate', rangemode='tozero')
        cfp_fig.show()
        # Use the full names so that files for different media do not collide
        fname = '-'.join([media, strain, 'CFP.png'])
        io.write_image(cfp_fig, fname)

### Analyze time series data using tslearn

In [None]:
df_all = fj.analysis(study=study.id,
                            type='Expression Rate (indirect)',
                            biomass_signal=od.id[0],
                            pre_smoothing=11,
                            post_smoothing=11
                            #bg_correction=2,
                            #min_biomass=0.05,
                            #remove_data=False
                              )

In [None]:
import tslearn
from tslearn.clustering import TimeSeriesKMeans, silhouette_score
from tslearn.preprocessing import TimeSeriesScalerMinMax, \
                                    TimeSeriesScalerMeanVariance

In [None]:
fig,axs = plt.subplots(2, 2, figsize=(9,3.5))
axs = axs.ravel()
i = 0
for signal,data in df_all.groupby('Signal'):
    df_heatmap = data.sort_values('Time').pivot_table(values='Rate',
                                                index=[ 'Media', 'Strain', 'Vector'],
                                                columns='Time', aggfunc=np.mean)
    #df_heatmap = df_heatmap / df_heatmap.mean()
    df_heatmap = df_heatmap.div( df_heatmap.mean(axis=1), axis=0 )
    
    values = df_heatmap.values
    X_train = TimeSeriesScalerMeanVariance().fit_transform(values)
    
    # Choose the number of clusters with the highest silhouette score
    max_silscore = -np.inf
    for ncomps in range(2,14):
        _km = TimeSeriesKMeans(n_clusters=ncomps, random_state=0, n_init=10)
        # Fit the candidate model and get its cluster assignments
        _y_pred = _km.fit_predict(X_train)

        silscore = silhouette_score(X_train, _y_pred, metric="euclidean")
        if silscore > max_silscore:
            max_silscore = silscore
            y_pred = _y_pred
            km = _km
            comps = ncomps
    print(comps)
    idx = np.argsort(y_pred)
    #axs[i].imshow(X_train[idx,:,0], cmap='jet', vmin=-3, vmax=3, aspect='auto')
    #g = sns.heatmap(df_heatmap[idx,:], cmap='jet', ax=axs[i])
    axs[i].plot(km.cluster_centers_[:,:,0].transpose())
    
    if i != 0:
        axs[i].set_xticks([])
        axs[i].set_yticks([])
    axs[i].set_title(signal)
    i += 1
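
Before clustering, each time series is rescaled with `TimeSeriesScalerMeanVariance`, which by default gives every series zero mean and unit variance, so that clusters reflect the shape of the dynamics and not their magnitude. The following NumPy sketch mimics that scaling for single-channel data as an illustration; the function name `z_normalize` is an assumption and this is not tslearn's implementation. It returns a 3-D array, which is why the cells here index `X_train[:, :, 0]`.

```python
import numpy as np


def z_normalize(series):
    """Scale each row (one time series) to zero mean and unit variance.

    series: 2-D array of shape (n_series, n_timepoints).
    Returns an array of shape (n_series, n_timepoints, 1).
    """
    x = np.asarray(series, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant series at zero instead of dividing by 0
    return ((x - mean) / std)[:, :, np.newaxis]


# Two series with the same shape but a 10-fold difference in magnitude
raw = np.array([[0.0, 1.0, 2.0, 3.0],
                [10.0, 20.0, 30.0, 40.0]])
scaled = z_normalize(raw)
print(scaled.shape)
print(scaled[:, :, 0].round(3))
```

After scaling, the two series are identical, so a clustering algorithm would place them in the same cluster.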

In [None]:
fig = fj.plot(study=study.id, 
                           type='Expression Rate (indirect)',
                           biomass_signal=od.id[0],
                           pre_smoothing=11,
                           post_smoothing=11,
                            normalize='Mean/Std', 
                            subplots='Signal', 
                            markers='Media', 
                            plot='Mean')
fig = flapjack.layout_print(fig, width=9, height=3.5)
#fname = '-'.join([media, strain, yfp_vectors[vi][0][2], 'YFP.png'])
#io.write_image(fig, fname)

In [None]:
# Silhouette score for each candidate number of clusters
values = df_heatmap.values
X_train = TimeSeriesScalerMeanVariance().fit_transform(values)
for ncomps in range(2,14):
    km = TimeSeriesKMeans(n_clusters=ncomps, random_state=0, n_init=10)
    y_pred = km.fit_predict(X_train)
    
    silscore = silhouette_score(X_train, y_pred, metric="euclidean")
    print(ncomps, 'Silhouette score = ', silscore)

In [None]:
values = df_heatmap.values
X_train = TimeSeriesScalerMeanVariance().fit_transform(values)
print(X_train.shape)    

km = TimeSeriesKMeans(n_clusters=2, random_state=0, n_init=10)
y_pred = km.fit_predict(X_train)


In [None]:
X_train.shape

In [None]:
colors = ['r', 'g']
for c in range(2):
    y = km.cluster_centers_[c].ravel()
    #plt.plot(y)
    idx = np.where(y_pred==c)[0]
    #print(idx)
    plt.plot(X_train[idx,:,0].transpose(), colors[c])