# Proteomics integration with an enzyme-constrained model
In this notebook we integrate proteomics data from [Schmidt et al. 2016](https://doi.org/10.1038/nbt.3418) into an enzyme-constrained model of E. coli (eciML1515).

In [1]:
import pandas as pd
import cobra
from cobra.flux_analysis import flux_variability_analysis

# model from https://github.com/ginkgobioworks/geckopy/tree/master/tests/data
model = cobra.io.read_sbml_model('../models/eciML1515.xml.gz')

model.solver.configuration.presolve = True

In [2]:
def limit_proteins(model, measurements, error=0):
    """Apply proteomics measurements to `model`.

    Adapted from https://github.com/DD-DeCaF/simulations/blob/devel/src/simulations/modeling/driven.py

    Parameters
    ----------
    model : cobra.Model
        The enzyme-constrained model.
    measurements : pd.Series
        Protein abundances in mmol / gDW, indexed by UniProt accession.
    error : float
        Relative tolerance added on top of each measurement; the upper bound
        becomes `measure + measure * error`.

    """
    for protein_id, measure in measurements.items():
        try:
            rxn = model.reactions.get_by_id(f"prot_{protein_id}_exchange")
        except KeyError:
            # the measured protein is not part of the model: skip it
            continue
        # update only upper_bound (as enzymes can be unsaturated):
        rxn.bounds = (0, measure + measure * error)


In [3]:
def top_shadow_prices(solution, met_ids, top=1):
    """
    Retrieves shadow prices for a list of metabolites from the solution and ranks
    them from most to least sensitive in the model.

    Parameters
    ----------
    solution : cobra.Solution
        The usual Solution object returned by model.optimize().
    met_ids : iterable of strings
        Subset of metabolite IDs from the model.
    top : int
        The number of metabolites to be returned.

    Returns
    -------
    shadow_pr: pd.Series
        Top shadow prices, ranked.
    """
    shadow_pr = solution.shadow_prices
    shadow_pr = shadow_pr.loc[shadow_pr.index.isin(met_ids)]
    return shadow_pr.sort_values()[:top]



def protein_to_metabolite(protein_id, model):
    met_id = model.metabolites.query(lambda m: protein_id in m.id)
    return met_id[0].id if met_id else ""



def flexibilize_proteomics(model, biomass_reaction, minimal_growth, proteomics):
    """
    Replace proteomics measurements with a set that enables the model to grow. Proteins
    are removed from the set iteratively based on sensitivity analysis (shadow prices).
    
    Adapted from https://github.com/DD-DeCaF/simulations/blob/devel/src/simulations/modeling/driven.py

    Parameters
    ----------
    model : cobra.Model
        The enzyme-constrained model.
    biomass_reaction : str
        ID of the biomass reaction.
    minimal_growth : float
        Minimal growth rate to enforce.
    proteomics : pd.Series
        Protein abundances in mmol / gDW, indexed by UniProt accession.

    Returns
    -------
    new_growth_rate : float
        Growth rate reached after relaxation (below `minimal_growth` if the
        target could not be reached).
    prots_to_remove : list of str
        IDs of the proteins whose constraints were relaxed.

    """

    # reset growth rate in model:
    model.reactions.get_by_id(biomass_reaction).bounds = (0, 1000)

    # build a table with protein ids, met ids in model and values to constrain with:
    prot_df = pd.DataFrame(proteomics)
    prot_df.index = prot_df.index.astype("str")
    prot_df["met_id"] = [protein_to_metabolite(prot, model) for prot in prot_df.index]
    prot_df = prot_df[prot_df.met_id != ""]

    # constrain the model with all proteins and optimize:
    limit_proteins(model, proteomics)
    solution = model.optimize()
    new_growth_rate = solution.objective_value if solution.objective_value else 0

    # while the model cannot grow to the desired level, remove the protein with
    # the highest shadow price:
    prots_to_remove = []
    no_improvement = 0
    while new_growth_rate < minimal_growth and not prot_df.empty:
        # get most influential protein in model:
        top_protein = top_shadow_prices(solution, list(prot_df["met_id"]))
        top_protein = top_protein.index[0]
        top_protein = prot_df.index[prot_df["met_id"] == top_protein][0]

        # update data: append protein to list, remove from current dataframe and
        # increase the corresponding upper bound to +1000:
        prots_to_remove.append(top_protein)
        prot_df = prot_df.drop(labels=top_protein)
        rxn = model.reactions.get_by_id(f"prot_{top_protein}_exchange")
        rxn.bounds = (0, 1000)

        # re-compute solution:
        solution = model.optimize()

        if solution:
            if (solution.objective_value - new_growth_rate) < 1e-5:  # the algorithm is stuck
                no_improvement += 1

        if no_improvement > 30:
            break

        new_growth_rate = solution.objective_value if solution else 0
        print((top_protein, new_growth_rate))

    # update growth rate if optimization was not successful:
    if new_growth_rate < minimal_growth:
        print(
            f"Minimal growth was not reached! "
            f"Final growth of the model: {new_growth_rate}"
        )

    return new_growth_rate, prots_to_remove

# How are proteins represented in an ecModel?

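Each enzyme-catalyzed reaction consumes its protein as a substrate. The small stoichiometric coefficient (8.6697e-08 for `prot_P21599` in the reaction below) is the amount of enzyme required per unit of flux.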

In [4]:
model.reactions.PYK2No1

Reaction identifier,PYK2No1
Name,Pyruvate kinase(2) (No1)
Memory address,0x7fc416daae10
Stoichiometry,h_c + pep_c + 8.6697e-08 prot_P21599 + udp_c --> pyr_c + utp_c
Stoichiometry (names),H+ [cytosol] + Phosphoenolpyruvate [cytosol] + 8.6697e-08 prot_P21599 [cytosol] + UDP C9H11N2O12P2 [cytosol] --> Pyruvate [cytosol] + UTP [cytosol]
GPR,b1854
Lower bound,0.0
Upper bound,1000.0


Each protein has its own exchange reaction, which supplies the protein to the model. The upper bound of this reaction can be constrained with proteomics data.

In [5]:
model.reactions.prot_P21599_exchange

Reaction identifier,prot_P21599_exchange
Name,prot_P21599_exchange
Memory address,0x7fc41683bf90
Stoichiometry,--> prot_P21599
Stoichiometry (names),--> prot_P21599 [cytosol]
GPR,b1854
Lower bound,0.0
Upper bound,1000.0
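
To make the link between the two reactions concrete: since PYK2No1 consumes 8.6697e-08 units of `prot_P21599` per unit of flux, an upper bound on `prot_P21599_exchange` caps the flux through PYK2No1. The following is a minimal arithmetic sketch. The abundance value is hypothetical (it is not taken from the dataset), and the sketch assumes that PYK2No1 is the only reaction drawing on this protein.

```python
# Stoichiometric coefficient of prot_P21599 in PYK2No1 (from the reaction above).
enzyme_per_flux = 8.6697e-08

# Hypothetical measured abundance in mmol/gDW (illustrative only).
abundance = 1.0e-06

# At steady state the enzyme consumed equals the enzyme supplied by the
# exchange reaction, so enzyme_per_flux * flux <= abundance.
max_flux = abundance / enzyme_per_flux

print(f"Maximum PYK2No1 flux allowed by the bound: {max_flux:.2f}")
```

If several reactions share one protein, the bound applies to their combined enzyme demand rather than to each flux separately.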


Let's check the growth rate of the model before integrating proteomics.

In [6]:
model.optimize()
summary = model.summary()
summary

Metabolite,Reaction,Flux,C-Number,C-Flux
ca2_e,EX_ca2_e_REV,0.004565,0,0.00%
cl_e,EX_cl_e_REV,0.004565,0,0.00%
cobalt2_e,EX_cobalt2_e_REV,2.192e-05,0,0.00%
cu2_e,EX_cu2_e_REV,0.0006218,0,0.00%
fe2_e,EX_fe2_e_REV,0.01409,0,0.00%
glc__D_e,EX_glc__D_e_REV,10.0,6,100.00%
k_e,EX_k_e_REV,0.1712,0,0.00%
mg2_e,EX_mg2_e_REV,0.007608,0,0.00%
mn2_e,EX_mn2_e_REV,0.000606,0,0.00%
mobd_e,EX_mobd_e_REV,6.139e-06,0,0.00%

Metabolite,Reaction,Flux,C-Number,C-Flux
4crsol_c,DM_4crsol_c,-0.0001956,7,0.01%
5drib_c,DM_5drib_c,-0.0001973,5,0.00%
amob_c,DM_amob_c,-1.754e-06,15,0.00%
co2_e,EX_co2_e,-24.0,1,99.99%
h2o_e,EX_h2o_e,-47.16,0,0.00%
h_e,EX_h_e,-8.058,0,0.00%
meoh_e,EX_meoh_e,-1.754e-06,1,0.00%


The full summary also lists the protein exchange reactions among the uptakes, so you can see which proteins the model uses and in what amounts.
However, this model has no constraint on the total protein pool, so these predictions might not be realistic.

## Proteomics integration
### Reading and processing the data
We will use the dataset from Schmidt et al. 2016, who measured the E. coli proteome under various growth conditions.

In [7]:
# proteomics data
df = pd.read_csv(
    "../data/ecoli_proteomics_schmidt2016_S5_ren.tsv", # BW25113 22 media
    sep="\t", skiprows=2  # skip titles and subtitles (XLSX)
)

# metadata
exp_details = pd.read_csv(
    "../data/ecoli_details_schmidt2016_S23.tsv", 
    sep="\t",
    skiprows=2  # skip titles and subtitles (XLSX)
)

In [8]:
df.head(3)

Unnamed: 0,Uniprot Accession,Description,Gene,Peptides used for quantitation,Confidence score,Molecular weight (Da),Glucose_copies,LB_copies,Glycerol + AA_copies,Acetate_copies,...,Stationary phase 1 day_cv,Stationary phase 3 days_cv,Osmotic-stress glucose_cv,42°C glucose_cv,pH6 glucose_cv,Xylose_cv,Mannose_cv,Galactose _cv,Succinate_cv,Fructose_cv
0,P0A8T7,DNA-directed RNA polymerase subunit beta' OS=E...,rpoC,91,6045.53,155045.008,2779,7164,4503,2180,...,6.76,12.08,20.91,13.86,6.14,6.27,16.15,22.87,16.45,9.29
1,P0A8V2,DNA-directed RNA polymerase subunit beta OS=Es...,rpoB,89,5061.29,150520.2758,3957,8888,5199,2661,...,14.42,11.18,17.48,10.51,5.93,4.27,13.51,19.75,13.6,7.77
2,P36683,Aconitate hydratase 2 OS=Escherichia coli (str...,acnB,67,4505.67,93420.9457,7596,16600,17548,22844,...,3.93,10.04,16.6,9.57,4.47,21.78,3.16,2.07,4.21,3.25


In [9]:
exp_details.head(10)

Unnamed: 0,Growth condition,Strain,Growth rate (h-1),Stdev,Single cell volume [fl]1,Doubling time (h-1),Time exp before harvest (h),# of doublings at exponential growth before harvesting,OD @ harvesting. replicates,Unnamed: 9,Unnamed: 10,Number of Proteins Identified (FDR 1%)2
0,LB,BW25113,1.9,0.03,4.29,0.4,5.0,13.7,1.8,1.37,1.34,1752.0
1,LB,MG1665,1.78,0.05,4.23,0.4,5.4,13.9,0.53,1.09,1.32,1733.0
2,LB,NCM3722,2.3,0.05,4.36,0.3,3.2,10.6,0.91,0.96,1.0,1703.0
3,Glycerol + AA,BW25113,1.27,0.01,3.83,0.5,7.5,13.8,0.5,0.51,0.5,1675.0
4,Acetate,BW25113,0.3,0.04,2.3,2.3,29.3,12.7,0.53,0.22,0.48,1683.0
5,Fumarate,BW25113,0.42,0.02,2.54,1.7,25.4,15.2,0.22,0.2,0.14,1696.0
6,Galactose,BW25113,0.26,0.003,2.21,2.7,52.1,19.2,2.02,2.03,2.02,1650.0
7,Glucose,BW25113,0.58,0.01,2.84,1.2,22.7,19.1,0.4,0.43,0.39,1702.0
8,Glucose,MG1665,0.67,0.07,3.0,1.0,22.0,21.4,0.62,0.38,0.63,1706.0
9,Glucose,NCM3722,1.03,0.06,3.54,0.7,19.9,29.6,1.06,1.03,1.41,1761.0


In [10]:
# select the columns with protein copy numbers, coefficients of variation and UniProt accession
df = df.loc[
    :,
    df.columns.str.contains("_copies|_cv", regex=True)  # copies/cell and uncertainty
    | df.columns.isin(["Uniprot Accession"])  # protein identifier
]
df = df.drop(2047, axis=0)
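
As a plain-Python illustration of the column filter above, the sketch below applies the same regular expression to a handful of column names copied from the `df.head(3)` output. It uses `re.search`, which is what a regex `str.contains` does per element; the list of names is only a sample, not the full header.

```python
import re

# A few column names taken from the df.head(3) output above.
columns = [
    "Uniprot Accession",
    "Gene",
    "Molecular weight (Da)",
    "Glucose_copies",
    "LB_copies",
    "Xylose_cv",
    "Galactose _cv",
]

pattern = re.compile(r"_copies|_cv")

# Keep copy numbers, coefficients of variation and the accession column.
selected = [c for c in columns if pattern.search(c) or c == "Uniprot Accession"]
print(selected)
```

Note that the header printed above contains `Galactose _cv` with a space, so building the column name as `f"{condition}_cv"` would not match it for `condition = "Galactose"`.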


In [11]:
# select one condition
condition = "Glucose"
df_ac = df.loc[:, ["Uniprot Accession", f"{condition}_copies", f"{condition}_cv"]]
# rename resulting columns
df_ac.columns = ["uniprot", "copies_per_cell", "CV"]

In [12]:
df_ac.describe()

Unnamed: 0,copies_per_cell,CV
count,2057.0,2057.0
mean,2451.522606,23.619208
std,9434.381168,25.353809
min,0.0,0.09
25%,39.0,8.94
50%,229.0,15.37
75%,1255.0,27.73
max,252452.0,172.56


In [13]:
# apply uncertainty (extend upper bound by 3x coefficient of variation)
df_ac["copies_upper"] = df_ac["copies_per_cell"] + 3*df_ac["CV"]/100 * df_ac["copies_per_cell"]

In [14]:
# convert units from copies to mmoles
df_ac["mmol_per_cell"] = df_ac["copies_upper"] * 1e3/6.022e23
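
The two steps above (widening the measurement by three coefficients of variation, then converting copies to mmol) can be checked by hand for a single protein. The copy number and CV below are hypothetical round values chosen for easy arithmetic, not entries from the dataset.

```python
AVOGADRO = 6.022e23  # copies per mol, same value as used above

copies_per_cell = 1000.0  # hypothetical copy number
cv_percent = 10.0         # hypothetical coefficient of variation, in percent

# Upper estimate: measurement plus 3 x CV (the CV is given in percent).
copies_upper = copies_per_cell + 3 * cv_percent / 100 * copies_per_cell

# copies -> mol (divide by Avogadro's number) -> mmol (multiply by 1e3).
mmol_per_cell = copies_upper * 1e3 / AVOGADRO

print(copies_upper, mmol_per_cell)
```

With these numbers the upper estimate is 30% above the measurement, i.e. about 1300 copies per cell.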

In [15]:
# extract cell volume (fl), averaged over all strains measured in this condition
cell_volume = exp_details.loc[
    exp_details["Growth condition"] == condition,
    "Single cell volume [fl]1"
].mean()


In [16]:
cell_volume

np.float64(3.1266666666666665)

In [17]:
# calculate concentration (mmol per g cell dry weight)
cell_density = 0.34  # https://bionumbers.hms.harvard.edu/bionumber.aspx?id=109049
df_ac["conc"] = df_ac["mmol_per_cell"]/cell_volume * 1e12 / (cell_density  * cell_density)

In [18]:
# extract experimental growth rate, averaged over all strains measured in this condition
growth_experimental = exp_details.loc[
    exp_details["Growth condition"] == condition,
    "Growth rate (h-1)"
].mean()
growth_experimental

np.float64(0.7600000000000001)
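
Both averages can be reproduced from the three Glucose rows of the `exp_details.head(10)` table (strains BW25113, MG1665 and NCM3722). The sketch below uses only those printed values and also computes 90% of the mean growth rate, which is the target passed to `flexibilize_proteomics` later in the notebook.

```python
from statistics import mean

# Glucose rows of the exp_details table: BW25113, MG1665, NCM3722.
volumes_fl = [2.84, 3.0, 3.54]
growth_rates = [0.58, 0.67, 1.03]

cell_volume = mean(volumes_fl)
growth_experimental = mean(growth_rates)

# Target used for the relaxation step: 90% of the experimental growth rate.
target_growth = 0.9 * growth_experimental

print(cell_volume, growth_experimental, target_growth)
```

The proteomics file itself is for BW25113 (see the comment in the loading cell), for which the table lists 2.84 fl and 0.58 h-1, so the strain-averaged values are somewhat higher than the strain-specific ones.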

In [19]:
proteomics = df_ac.set_index("uniprot")["conc"]

In [20]:
# keep a copy of the model before applying the proteomics constraints
plain_model = model.copy()

### Integrate proteomics

In [21]:
limit_proteins(model, proteomics)
model.optimize()
model.summary()

Metabolite,Reaction,Flux,C-Number,C-Flux
ca2_e,EX_ca2_e_REV,2.89e-05,0,0.00%
cl_e,EX_cl_e_REV,2.89e-05,0,0.00%
cobalt2_e,EX_cobalt2_e_REV,1.388e-07,0,0.00%
cu2_e,EX_cu2_e_REV,3.936e-06,0,0.00%
fe2_e,EX_fe2_e_REV,8.916e-05,0,0.00%
glc__D_e,EX_glc__D_e_REV,2.12,6,100.00%
k_e,EX_k_e_REV,0.001084,0,0.00%
mg2_e,EX_mg2_e_REV,4.816e-05,0,0.00%
mn2_e,EX_mn2_e_REV,3.836e-06,0,0.00%
nh4_e,EX_nh4_e_REV,0.05996,0,0.00%

Metabolite,Reaction,Flux,C-Number,C-Flux
4crsol_c,DM_4crsol_c,-1.238e-06,7,0.00%
5drib_c,DM_5drib_c,-1.238e-06,5,0.00%
ac_e,EX_ac_e,-3.129,2,50.11%
co2_e,EX_co2_e,-2.351,1,18.82%
for_e,EX_for_e,-0.7823,1,6.26%
glyclt_e,EX_glyclt_e,-3.714e-06,2,0.00%
h2o_e,EX_h2o_e,-2.497,0,0.00%
h_e,EX_h_e,-4.996,0,0.00%
lac__D_e,EX_lac__D_e,-1.033,3,24.81%


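We observe that the model now takes up far less substrate (glucose uptake drops from 10 to 2.12) and that the growth rate is very low. This is because we have just added an upper bound for every measured protein that is present in the model (the dataset has more than 2000 entries), and a single underestimated abundance of an essential enzyme can be enough to limit growth.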

### We need to relax some constraints
To do that, we relax the proteins that influence the growth rate the most (those with the highest shadow prices) one by one until the target growth rate is reached.
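
The greedy loop can be illustrated on a toy problem that needs no solver. In this sketch, growth is simply capped by the scarcest enzyme, so the most limiting enzyme plays the role of the protein with the highest shadow price. The function names, enzyme IDs and numbers are all hypothetical; this is a simplified stand-in for the idea, not the LP-based procedure implemented in `flexibilize_proteomics`.

```python
def toy_growth(bounds, costs, max_growth=1.0):
    """Toy stand-in for model.optimize(): growth is capped by the scarcest enzyme."""
    return min([max_growth] + [bounds[p] / costs[p] for p in bounds])


def relax_until_growth(bounds, costs, target, relaxed_bound=1000.0):
    """Greedily lift the most limiting bound until `target` growth is reached."""
    bounds = dict(bounds)  # work on a copy, leave the input untouched
    relaxed = []
    growth = toy_growth(bounds, costs)
    while growth < target and len(relaxed) < len(bounds):
        # The enzyme with the smallest bound/cost ratio is the one limiting growth.
        candidates = [p for p in bounds if p not in relaxed]
        limiting = min(candidates, key=lambda p: bounds[p] / costs[p])
        bounds[limiting] = relaxed_bound
        relaxed.append(limiting)
        growth = toy_growth(bounds, costs)
    return growth, relaxed


# Hypothetical enzyme costs (enzyme needed per unit growth) and measured bounds.
costs = {"A": 1.0, "B": 2.0, "C": 0.5}
bounds = {"A": 0.9, "B": 0.4, "C": 0.05}

growth, relaxed = relax_until_growth(bounds, costs, target=0.5)
print(growth, relaxed)
```

Here enzyme C is relaxed first and B second, after which the remaining bound on A already allows the target growth, so A keeps its measured value.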

In [22]:
biomass_reaction = "BIOMASS_Ec_iML1515_core_75p37M"
new_growth_rate, prots_removed = flexibilize_proteomics(model, biomass_reaction, growth_experimental*0.9, proteomics)

('P0A7E3', 0.005808325146965149)
('P15254', 0.011907048951857917)
('P60782', 0.011907048951857912)
('P31119', 0.01509917737298079)
('P0A6X1', 0.0204109472282912)
('P21151', 0.02555142237937021)
('P0A6C5', 0.03269752605581791)
('P0AD65', 0.039394725584286146)
('P27300', 0.05813822043507452)
('P0AD68', 0.06087682328900411)
('P17854', 0.07205325219895482)
('Q47146', 0.07576189028021905)
('P0AC16', 0.09157133851432961)
('P0ABG1', 0.09188613924492049)
('P0A6W3', 0.10725365219107008)
('P14900', 0.11561096729013522)
('P09029', 0.15857449347848757)
('P0A6I9', 0.2002092542377616)
('P00547', 0.2005618449103512)
('P11446', 0.23134030843294037)
('P56580', 0.23134030846493406)
('P76014', 0.24022901716875014)
('P37349', 0.2629676912903388)
('P76015', 0.26302788000838995)
('P17952', 0.2764074025481077)
('P76085', 0.2764074025480886)
('P76078', 0.2764074025480886)
('P76081', 0.27640740254808244)
('P0AC44', 0.2767481449750867)
('P07001', 0.27786631661302036)
('P0A953', 0.2785151193424046)
('P68699', 0.

We see that only a small fraction of the proteins needed to be relaxed.

In [23]:
percentage_removed = round(len(prots_removed)/proteomics.shape[0]*100, 1)
print(f"Proteins in dataset: {proteomics.shape[0]}\nProteins removed: {len(prots_removed)} ({percentage_removed}%)")

Proteins in dataset: 2057
Proteins removed: 74 (3.6%)


In [24]:
fluxes = model.optimize()
fluxes.fluxes.to_csv(f"../results/ecModel_{condition}_flux_results.csv")
model.summary()

Metabolite,Reaction,Flux,C-Number,C-Flux
ca2_e,EX_ca2_e_REV,0.003674,0,0.00%
cl_e,EX_cl_e_REV,0.003674,0,0.00%
cobalt2_e,EX_cobalt2_e_REV,1.764e-05,0,0.00%
cu2_e,EX_cu2_e_REV,0.0005004,0,0.00%
fe2_e,EX_fe2_e_REV,0.01134,0,0.00%
glc__D_e,EX_glc__D_e_REV,10.0,6,100.00%
k_e,EX_k_e_REV,0.1378,0,0.00%
mg2_e,EX_mg2_e_REV,0.006123,0,0.00%
mn2_e,EX_mn2_e_REV,0.0004877,0,0.00%
mobd_e,EX_mobd_e_REV,4.94e-06,0,0.00%

Metabolite,Reaction,Flux,C-Number,C-Flux
4crsol_c,DM_4crsol_c,-0.0001574,7,0.00%
5drib_c,DM_5drib_c,-0.0001588,5,0.00%
amob_c,DM_amob_c,-1.412e-06,15,0.00%
co2_e,EX_co2_e,-31.03,1,99.99%
glyclt_e,EX_glyclt_e,-0.0004722,2,0.00%
h2o_e,EX_h2o_e,-49.67,0,0.00%
h_e,EX_h_e,-6.485,0,0.00%
meoh_e,EX_meoh_e,-1.412e-06,1,0.00%


### Next steps
As a next step, we could, for example:
* visualize the fluxes in Escher or other software
* perform flux variability analysis or sampling to characterize the solution space
* compare the results with the model without proteomics constraints (`plain_model`): are the flux distributions different? Are different products secreted?
* compare different conditions, e.g. how metabolism rewires in rich vs. poor medium, at high temperature, or under other stressors
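
For the comparison of flux distributions, a small helper that lists the reactions whose flux differs between two solutions may be a useful starting point. The sketch below is plain Python on dictionaries; the helper name `changed_fluxes` is hypothetical, and the numbers are secretion magnitudes copied from the two summaries above (right after integrating the proteomics data, and after relaxation). Treating a reaction that is missing from one dictionary as zero flux is an assumption of the sketch.

```python
def changed_fluxes(fluxes_a, fluxes_b, tol=1e-3):
    """Return {reaction: (flux_a, flux_b)} for fluxes differing by more than tol."""
    reactions = sorted(set(fluxes_a) | set(fluxes_b))
    return {
        rxn: (fluxes_a.get(rxn, 0.0), fluxes_b.get(rxn, 0.0))
        for rxn in reactions
        if abs(fluxes_a.get(rxn, 0.0) - fluxes_b.get(rxn, 0.0)) > tol
    }


# Secretion magnitudes from the summary right after proteomics integration.
constrained = {
    "EX_ac_e": 3.129,
    "EX_co2_e": 2.351,
    "EX_for_e": 0.7823,
    "EX_lac__D_e": 1.033,
    "EX_glyclt_e": 3.714e-06,
}
# Secretion magnitudes from the summary after relaxing the constraints.
relaxed = {"EX_co2_e": 31.03, "EX_glyclt_e": 0.0004722}

diff = changed_fluxes(constrained, relaxed)
for rxn, (before, after) in diff.items():
    print(f"{rxn}: {before} -> {after}")
```

With the values above, acetate, formate and lactate secretion show up as differences because they appear only in the tightly constrained solution, while the glycolate change stays below the tolerance.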