# Published data - Cupriavidus necator
## License information
The data and model used in this notebook come from 

Alagesan, S., Minton, N.P. & Malys, N. 13C-assisted metabolic flux analysis to investigate heterotrophic and mixotrophic metabolism in Cupriavidus necator H16. Metabolomics 14, 9 (2018). https://doi.org/10.1007/s11306-017-1302-z 

and is licensed under a Creative Commons Attribution 4.0 International License.

You should have received a copy of the license along with this
work. If not, see <http://creativecommons.org/licenses/by/4.0/>.

## Description of data
Data collection
- Batch culture samples from two time points during exponential phase

In [37]:
import pandas as pd
import numpy as np
import time
import ast
import matlab.engine
import sys
#import escher
import dotenv
import pandera as pa
import BFAIR.mfa.INCA.dataschemas as dataschemas
import BFAIR.mfa.INCA.INCA_script_writing as INCA_script_writing
from BFAIR.mfa.INCA.run_inca import run_inca

In [38]:
# import environment variables
INCA_base_directory = dotenv.get_key(dotenv.find_dotenv(), "INCA_base_directory")

#### Import using parsers
To keep this tutorial as realistic as possible, we include some data preprocessing that showcases how to get the data into the correct format. We took the data from the supplementary materials of the article cited above. The supplementary material is a Word document, so we manually extracted the data into a series of CSV and Excel files. We start by looking at the reactions and atom map from the article.

In [39]:
reacts = pd.read_excel("./Literature data/Cupriavidus necator  Alagesan 2017/reactions.xlsx")
reacts

Unnamed: 0,Reaction ID,Equations (Carbon atom transition)
0,ex_1,FRU.ext (abcdef) -> F6P (abcdef)
1,ex_2,GLY.ext (abc) -> GLY (abc)
2,R1,GLY (abc) -> DHAP (abc)
3,R2,F6P (abcdef) -> G6P (abcdef)
4,R3,G6P (abcdef) -> F6P (abcdef)
...,...,...
69,R68,ANTHR (abcdefg) + R5P (hijkl) -> CPADR5P (abcd...
70,R69,CPADR5P (abcdefghijkl) -> INDG (abcdfghijkl) +...
71,R70,INDG (abcdfghijkl) -> TRP (abcdfghijkl)
72,R71,R5P (abcde) + MTHF (f)-> HIS (edcbaf)


We see that the model contains 73 reactions (rows 0 to 72). The INCA parser uses the Pandera package to validate the input data. We can check whether the reaction data is correctly formatted according to the `ReactionsSchema`. We wrap the validation in a try-except clause to get output that is easier to interpret.

In [40]:
try:
    dataschemas.ReactionsSchema.validate(reacts)
except pa.errors.SchemaError as e:
    print(type(e))
    print(e)

<class 'pandera.errors.SchemaError'>
column 'rxn_eqn' not in dataframe
  Reaction ID Equations (Carbon atom transition)
0        ex_1   FRU.ext (abcdef) -> F6P (abcdef)
1        ex_2         GLY.ext (abc) -> GLY (abc)
2          R1            GLY (abc) -> DHAP (abc)
3          R2       F6P (abcdef) -> G6P (abcdef)
4          R3       G6P (abcdef) -> F6P (abcdef)


We see that the data does not pass the validation: a `SchemaError` is raised, telling us that the schema expects a column named `rxn_eqn`, which is not found. Let's have a look at what `ReactionsSchema` actually requires.

In [41]:
dataschemas.ReactionsSchema.columns

{'rxn_eqn': <Schema Column(name=rxn_eqn, type=DataType(str))>,
 'rxn_id': <Schema Column(name=rxn_id, type=DataType(str))>}

The reactions input requires two columns, `rxn_eqn` and `rxn_id`. Let's rename the columns of the reactions data and rerun the validation.

In [42]:
reacts_processed = (reacts
    .copy()
    .rename(columns={"Reaction ID": "rxn_id", "Equations (Carbon atom transition)":"rxn_eqn"})
)
dataschemas.ReactionsSchema.validate(reacts_processed)
reacts_processed.head()

Unnamed: 0,rxn_id,rxn_eqn
0,ex_1,FRU.ext (abcdef) -> F6P (abcdef)
1,ex_2,GLY.ext (abc) -> GLY (abc)
2,R1,GLY (abc) -> DHAP (abc)
3,R2,F6P (abcdef) -> G6P (abcdef)
4,R3,G6P (abcdef) -> F6P (abcdef)


The reactions data passed the validation, so we are ready to move on to the tracer information, which is taken from the materials and methods section of the article. The authors used substrates with 99 atom% purity for both D-[1-13C]fructose and [1,2-13C]glycerol. The mixotrophic growth experiment used two substrates, [1,2-13C]glycerol and CO2. We assume that the CO2 is labelled at natural abundance and therefore leave it out of the tracer specification.

In [43]:
tracer_info = pd.DataFrame.from_dict({
    'experiment_id': [
        'fructose', 'glycerol', 'mixotroph',
    ],
    'met_id': ['FRU.ext', 'GLY.ext', 'GLY.ext'],
    'tracer_id': [
        'D-[1-13C]fructose', '[1,2-13C]glycerol', '[1,2-13C]glycerol',
    ],
    'atom_ids': [
        [1], [1,2], [1,2],
    ],
    'atom_mdv': [
        [0.01,0.99], [0.01,0.99],[0.01,0.99],
    ],
    'enrichment': [
        1, 1, 1,
    ]    
}, orient='columns')
tracer_info.head()

Unnamed: 0,experiment_id,met_id,tracer_id,atom_ids,atom_mdv,enrichment
0,fructose,FRU.ext,D-[1-13C]fructose,[1],"[0.01, 0.99]",1
1,glycerol,GLY.ext,"[1,2-13C]glycerol","[1, 2]","[0.01, 0.99]",1
2,mixotroph,GLY.ext,"[1,2-13C]glycerol","[1, 2]","[0.01, 0.99]",1
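
The `atom_mdv` values `[0.01, 0.99]` encode the 99 atom% purity at each labelled position. As a minimal sketch, assuming the labelled positions are enriched independently of each other (our assumption, not something stated in the article), the mass isotopomer fractions over the labelled positions follow a binomial distribution. The helper name `labelled_positions_mdv` is hypothetical and not part of BFAIR or INCA.

```python
from math import comb
from typing import List


def labelled_positions_mdv(purity: float, n_positions: int) -> List[float]:
    """Fractions of M+0 ... M+n over the labelled positions only.

    Assumes each labelled position carries 13C with probability `purity`,
    independently of the other positions.
    """
    return [
        comb(n_positions, k) * purity**k * (1 - purity) ** (n_positions - k)
        for k in range(n_positions + 1)
    ]


# D-[1-13C]fructose: one labelled position
fructose = labelled_positions_mdv(0.99, 1)
# [1,2-13C]glycerol: two labelled positions
glycerol = labelled_positions_mdv(0.99, 2)
print(fructose)
print(glycerol)
```

Under this assumption, the doubly labelled glycerol species makes up 0.99 squared, i.e. about 98%, of the tracer.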


In [44]:
try:
    dataschemas.tracer_schema.validate(tracer_info)
except Exception as e:
    print(e)

Now let's prepare the isotopomer distribution vectors (idvs), also called mass distribution vectors (mdvs). We will first inspect the data schema to get a sense of the required input data.

In [45]:
dataschemas.MSMeasurementsSchema.columns

{'experiment_id': <Schema Column(name=experiment_id, type=DataType(str))>,
 'met_id': <Schema Column(name=met_id, type=DataType(str))>,
 'ms_id': <Schema Column(name=ms_id, type=DataType(str))>,
 'measurement_replicate': <Schema Column(name=measurement_replicate, type=DataType(int64))>,
 'labelled_atom_ids': <Schema Column(name=labelled_atom_ids, type=DataType(object))>,
 'unlabelled_atoms': <Schema Column(name=unlabelled_atoms, type=DataType(str))>,
 'mass_isotope': <Schema Column(name=mass_isotope, type=DataType(int64))>,
 'intensity': <Schema Column(name=intensity, type=DataType(float64))>,
 'intensity_std_error': <Schema Column(name=intensity_std_error, type=DataType(float64))>,
 'time': <Schema Column(name=time, type=DataType(float64))>}

This provides a short overview of the required columns. For a more elaborate description of the input data see XX. Next, we load the MS data set from the article. We copied the tables from the Word document into an Excel workbook with three sheets, one for each experiment (Fructose, Glycerol, GlycerolAndCO2). Let's have a look at the content of one of the sheets:

In [46]:
pd.read_excel("Literature data/Cupriavidus necator  Alagesan 2017/MDV_raw.xlsx", sheet_name='fructose').head()

Unnamed: 0,Amino Acid,Unnamed: 1,m/z,M,M+1,M+2,M+3,M+4,M+5,M+6,M+7,M+8,M+9
0,Alanine,[M-85],232,0.9759 ± 0.0027,0.0230 ± 0.0035,0.0011 ± 0.0009,0,0,0,0,0,0,0
1,,[M-57],260,0.3509 ± 0.0135,0.6407 ± 0.0138,0.0078 ± 0.0003,0.0006 ± 0.0006,0,0,0,0,0,0
2,Glycine,[M-85],218,0.9950 ± 0.0021,0.0049 ± 0.0021,0,0,0,0,0,0,0,0
3,,[M-57],246,0.9357 ± 0.0068,0.0638 ± 0.0072,0.0005 ± 0,,0,0,0,0,0,0
4,Valine,[M-85],260,0.9513 ± 0.0058,0.0436 ± 0.0057,0.0049 ± 0.0005,0,0.00003 ± 0,0,0,0,0,0


This data is not in the correct format for the `MSMeasurementsSchema`. As a first step, we will process the data into long (tidy) format, because from tidy data we can easily produce a valid input format. We have written a small function that parses a single sheet into long format. This function is specific to this particular data set.

In [47]:
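def parse_mdv_raw_to_long(df: pd.DataFrame, experiment_id: str) -> pd.DataFrame:
    """Parse one sheet of the MDV workbook into long (tidy) format."""
    df = df.copy()  # do not modify the caller's dataframe
    # amino acid names are only given on the first row of each block
    df['Amino Acid'] = df['Amino Acid'].ffill()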
    long = df.melt(id_vars=['Amino Acid', 'Unnamed: 1', 'm/z'], var_name='mass_isotope').drop(columns=['Unnamed: 1'])
    long[['intensity', 'intensity_std_error']] = long['value'].str.split(r'±|\+', regex=True, expand=True)
    long.drop(columns=['value'], inplace=True)

    # convert strings to floats
    long['intensity'] = long['intensity'].str.strip().astype(float)
    long['intensity_std_error'] = long['intensity_std_error'].str.strip().astype(float)

    # some amino acids have trailing spaces
    long['Amino Acid'] = long['Amino Acid'].str.strip()

    # make ids
    long['fragment_id'] = long['Amino Acid'].str.replace(" ", '') + long['m/z'].astype(str)
    long['experiment_id'] = experiment_id
    return long.dropna()
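
The core of the parser above is splitting cells such as `0.9759 ± 0.0027` into a value and an error. The sketch below shows that step in isolation with the standard library only; the helper name `split_value_and_error` is hypothetical. Note that cells holding a bare number such as `0` are not strings, so the pandas `.str` accessor most likely yields missing values for them in the parser above, and `dropna()` then removes those rows. The sketch instead returns the value with no error term.

```python
import re
from typing import Optional, Tuple


def split_value_and_error(cell) -> Tuple[float, Optional[float]]:
    """Split a cell such as '0.9759 ± 0.0027' into (value, error).

    Bare numbers (e.g. the integer 0) carry no error term, so the
    error is returned as None.
    """
    if isinstance(cell, (int, float)):
        return float(cell), None
    # same separator pattern as in the parser above
    parts = re.split(r"±|\+", cell)
    value = float(parts[0].strip())
    error = float(parts[1].strip()) if len(parts) > 1 else None
    return value, error


print(split_value_and_error("0.9759 ± 0.0027"))
print(split_value_and_error("0.0005 ± 0"))
print(split_value_and_error(0))
```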

Now we simply loop over the sheets of the Excel workbook, parse each sheet with the parser above, and stack the dataframes to obtain one long-format dataframe with data from all three experiments.

In [48]:
xl_file = pd.ExcelFile("Literature data/Cupriavidus necator  Alagesan 2017/MDV_raw.xlsx")
mdvs_long = pd.DataFrame()
for sheet in xl_file.sheet_names:
    df = xl_file.parse(sheet)
    df = parse_mdv_raw_to_long(df, experiment_id=sheet)
    mdvs_long = pd.concat([mdvs_long, df])
mdvs_long.reset_index(drop=True, inplace=True)

mdvs_long.head()

Unnamed: 0,Amino Acid,m/z,mass_isotope,intensity,intensity_std_error,fragment_id,experiment_id
0,Alanine,232,M,0.9759,0.0027,Alanine232,fructose
1,Alanine,260,M,0.3509,0.0135,Alanine260,fructose
2,Glycine,218,M,0.995,0.0021,Glycine218,fructose
3,Glycine,246,M,0.9357,0.0068,Glycine246,fructose
4,Valine,260,M,0.9513,0.0058,Valine260,fructose


In [49]:
mdvs_long['intensity_std_error'].describe()

count    333.000000
mean       0.009974
std        0.010750
min        0.000000
25%        0.001500
50%        0.005200
75%        0.015800
max        0.054400
Name: intensity_std_error, dtype: float64

INCA has trouble if the measurement errors are too small or 0. Therefore we will apply a minimum error of 1e-4.

In [50]:
# set the minimum std_error to 1e-4
minimum_std_error = 1e-4
mdvs_long.loc[mdvs_long['intensity_std_error'] < minimum_std_error, 'intensity_std_error'] = minimum_std_error
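
An equivalent way to apply such a floor is `Series.clip`. The sketch below uses made-up error values purely to illustrate the behaviour; it is not the data from the article.

```python
import pandas as pd

# toy standard errors, including an exact zero and a very small value
std_error = pd.Series([0.0, 0.00005, 0.0027, 0.0135])
minimum_std_error = 1e-4

# clip() raises every value below the floor to the floor and leaves the rest
floored = std_error.clip(lower=minimum_std_error)
print(floored.tolist())
```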

We see that the amino acid names do not match the metabolite IDs used in the reactions of the model. To accommodate this, we manually created a map between the metabolite IDs in the model and the amino acid names used in the data.

In [51]:
met_abbriviations = pd.DataFrame(
    [
        ('ALA', 'Alanine'), ('ASP', 'Aspartic acid'), ('GL', 'Glycine'), ('GLU', 'Glutamic acid'), 
        ('HIS', 'Histidine'), ('ILE', 'Isoleucine'), ('LEU', 'Leucine'), ('MET', 'Methionine'), 
        ('PHE', 'Phenylalanine'), ('SER', 'Serine'), ('THR', 'Threonine'), ('VAL', 'Valine')
    ], columns=['met_id','Amino Acid']
)
met_abbriviations.head()

Unnamed: 0,met_id,Amino Acid
0,ALA,Alanine
1,ASP,Aspartic acid
2,GL,Glycine
3,GLU,Glutamic acid
4,HIS,Histidine


We can merge this map into the dataframe.

In [52]:
mdvs_long_met_ids = mdvs_long.merge(met_abbriviations, on='Amino Acid', how='left')
mdvs_long_met_ids.head()

Unnamed: 0,Amino Acid,m/z,mass_isotope,intensity,intensity_std_error,fragment_id,experiment_id,met_id
0,Alanine,232,M,0.9759,0.0027,Alanine232,fructose,ALA
1,Alanine,260,M,0.3509,0.0135,Alanine260,fructose,ALA
2,Glycine,218,M,0.995,0.0021,Glycine218,fructose,GL
3,Glycine,246,M,0.9357,0.0068,Glycine246,fructose,GL
4,Valine,260,M,0.9513,0.0058,Valine260,fructose,VAL
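
A left merge silently leaves `met_id` empty for any amino acid that is missing from the map. A minimal sketch of how to list such names, using toy data in which `Tyrosine` is deliberately left out of the map, relies on the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

# toy data; 'Tyrosine' is deliberately absent from the map
measurements = pd.DataFrame({"Amino Acid": ["Alanine", "Glycine", "Tyrosine"]})
name_map = pd.DataFrame(
    [("ALA", "Alanine"), ("GL", "Glycine")], columns=["met_id", "Amino Acid"]
)

# indicator=True adds a '_merge' column telling where each row came from
merged = measurements.merge(name_map, on="Amino Acid", how="left", indicator=True)
unmapped = merged.loc[merged["_merge"] == "left_only", "Amino Acid"].unique().tolist()
print(unmapped)
```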


In [53]:
mdvs_long_met_ids['mass_isotope'].unique()

array(['M', 'M+1', 'M+2', 'M+3', 'M+4', 'M+5', 'M+6', 'M+7', 'M+8', 'M+9'],
      dtype=object)

In [54]:
def mass_isotope_to_int(mass_isotope: str)-> int:
    if isinstance(mass_isotope, int): # avoids error when rerunning the cell
        return mass_isotope
    elif mass_isotope == "M":
        return 0
    else:
        return int(mass_isotope.replace("M+", ""))
mass_isotope_to_int("M+1")

1

In [55]:
mdvs_long_met_ids['mass_isotope'] = mdvs_long_met_ids['mass_isotope'].apply(mass_isotope_to_int)
mdvs_long_met_ids.head()

Unnamed: 0,Amino Acid,m/z,mass_isotope,intensity,intensity_std_error,fragment_id,experiment_id,met_id
0,Alanine,232,0,0.9759,0.0027,Alanine232,fructose,ALA
1,Alanine,260,0,0.3509,0.0135,Alanine260,fructose,ALA
2,Glycine,218,0,0.995,0.0021,Glycine218,fructose,GL
3,Glycine,246,0,0.9357,0.0068,Glycine246,fructose,GL
4,Valine,260,0,0.9513,0.0058,Valine260,fructose,VAL


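The cell below, which aggregates the MS values and the std. error into python lists and renames the columns, is commented out: the `MSMeasurementsSchema` columns listed above hold one `mass_isotope` and one `intensity` per row, i.e. the data stays in long format.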

In [56]:
# aggregated_idvs = (mdvs_long_met_ids
#     .groupby(['experiment_id', 'met_id', 'fragment_id','Amino Acid', 'm/z'])
#     .aggregate(lambda x: x.tolist())
#     .reset_index()
#     .rename(columns={'value':'idv', 'std_error':'idv_std_error'})
# )
# aggregated_idvs.head()

So far so good, but we are still missing information about which carbon atoms from the metabolite are found in each fragment (`labelled_atom_ids`). If we want INCA to do the natural abundance correction, we should also supply the chemical formula of all the unlabelled atoms of the fragment. We have stored this information in a separate csv file. The labelled atoms column contains a list stored as a string, so we use the trick from the section "Note about formatting when reading csv and excel files" in the input data description.

In [57]:
# Load a table for looking up the formula of each fragment in the model
fragments = pd.read_csv('Literature data/Cupriavidus necator  Alagesan 2017/fragments.csv', sep='\t', converters={'labelled_atoms': ast.literal_eval})
fragments.head()

Unnamed: 0,selected,fragment_id,metabolite_id,labelled_atoms,unlabelled_atoms,active,formula
0,True,Tyrosine302,tyr__L,"[1, 2]",C12H32O2NSi2,True,C14H32O2NSi2
1,False,Lysine431,lys__L,"[1, 2, 3, 4, 5, 6]",C14H47O2N2Si3,True,C20H47O2N2Si3
2,False,Lysine329,lys__L,"[2, 3, 4, 5, 6]",C12H41N2Si2,True,C17H41N2Si2
3,False,Histidine338,his__L,"[2, 3, 4, 5, 6]",C12H36N3Si2,False,C17H36N3Si2
4,False,Histidine440,his__L,"[1, 2, 3, 4, 5, 6]",C14H42O2N3Si3,False,C20H42O2N3Si3
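
To illustrate what the `converters` argument does here, the sketch below parses a small tab-separated toy table with the standard library only. The two rows mimic the `fragment_id` and `labelled_atoms` columns of `fragments.csv`; the content is illustrative.

```python
import ast
import csv
import io

# toy tab-separated content mimicking two columns of fragments.csv
raw = "fragment_id\tlabelled_atoms\nAlanine232\t[2, 3]\nAlanine260\t[1, 2, 3]\n"

rows = []
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    # without conversion the value is the string "[2, 3]", not a list
    row["labelled_atoms"] = ast.literal_eval(row["labelled_atoms"])
    rows.append(row)

print(rows[0]["labelled_atoms"], type(rows[0]["labelled_atoms"]))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of conversion.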


We need to merge this information into the measurement data. For clarity we drop all the columns that are not required, even though extra columns are allowed.

In [58]:
ms_data = (mdvs_long_met_ids
    .merge(
        fragments[['fragment_id', 'labelled_atoms', 'unlabelled_atoms']], 
        on='fragment_id', how='left'
    )
    .rename(columns={ # rename columns to match the schema
        'fragment_id': 'ms_id',
        'labelled_atoms': 'labelled_atom_ids',
    })
    .drop(columns=['Amino Acid', 'm/z'])
)
ms_data.head()

Unnamed: 0,mass_isotope,intensity,intensity_std_error,ms_id,experiment_id,met_id,labelled_atom_ids,unlabelled_atoms
0,0,0.9759,0.0027,Alanine232,fructose,ALA,"[2, 3]",C8H26ONSi2
1,0,0.3509,0.0135,Alanine260,fructose,ALA,"[1, 2, 3]",C8H26O2NSi2
2,0,0.995,0.0021,Glycine218,fructose,GL,[2],C8H24ONSi2
3,0,0.9357,0.0068,Glycine246,fructose,GL,"[1, 2]",C8H24O2NSi2
4,0,0.9513,0.0058,Valine260,fructose,VAL,"[2, 3, 4, 5]",C8H30ONSi2


One final thing is missing for the MS measurements: the time column. This information is only used for isotopically non-stationary MFA, but because of the way the INCA parser works it is also a required input for isotopically stationary MFA. In that case, just fill the time column with zeros. We also set the required `measurement_replicate` column to 1.

In [59]:
ms_data['time'] = 0
ms_data['measurement_replicate'] = 1

The data contains measurements from a fragment called `Methionine292`. We do not know the labelled atom ids for this fragment, which results in NaNs in the `labelled_atom_ids` column. We will simply remove these measurements from the data.

In [60]:
ms_data = ms_data.query('ms_id != "Methionine292"')


In [61]:
ms_data_filled = INCA_script_writing.fill_all_mass_isotope_gaps(ms_data)
ms_data_filled[ms_data_filled.intensity.isna()]

Unnamed: 0,experiment_id,ms_id,measurement_replicate,mass_isotope,intensity,intensity_std_error,met_id,labelled_atom_ids,unlabelled_atoms,time
85,fructose,Valine260,1,3,,,VAL,"[2, 3, 4, 5]",C8H30ONSi2,0.0
172,glycerol,Phenylalanine336,1,1,,,PHE,"[1, 2, 3, 4, 5, 6, 7, 8, 9]",C8H30O2NSi2,0.0


In [63]:
import warnings
warnings.filterwarnings('error')
try:
    dataschemas.MSMeasurementsSchema.validate(ms_data)
except UserWarning as e:
    print(e)
warnings.resetwarnings()

We see that validating against the `MSMeasurementsSchema` raises a warning. The warning tells us that some of the idvs do not contain a measurement for every possible mass isotope. 

In [27]:
# mask = ms_data['idv'].apply(len) < ms_data['labelled_atom_ids'].apply(len) + 1
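
A fragment with n labelled atoms can show mass isotopes M+0 to M+n, which is the idea behind the commented-out mask above. As a minimal sketch in plain Python, with the hypothetical helper name `missing_mass_isotopes`, the gaps of a single fragment can be listed like this:

```python
from typing import Iterable, List


def missing_mass_isotopes(labelled_atom_ids: Iterable[int],
                          measured: Iterable[int]) -> List[int]:
    """Return the mass isotopes (0 ... n) that have no measurement,
    where n is the number of labelled atoms in the fragment."""
    n_atoms = len(list(labelled_atom_ids))
    expected = range(n_atoms + 1)
    seen = set(measured)
    return [m for m in expected if m not in seen]


# Valine260 covers atoms 2-5, so M+0 ... M+4 are expected
print(missing_mass_isotopes([2, 3, 4, 5], [0, 1, 2, 4]))
```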

We see that the MS measurements pass the validation, so we can now create the INCA scripts. We will use the function `create_inca_script_from_data()`. This function automatically determines the measurement/data types in each experiment and builds an INCA script for all experiments or only the desired ones. All experiments in the same INCA script are fitted together, so they should be performed under the exact same biological conditions and vary only in the tracer, e.g. parallel labelling experiments, one with [1-13C]glucose and another with [1,2-13C]glucose. The three experiments here (fructose, glycerol and mixotroph) use different substrates. They cannot be considered parallel labelling experiments, so we need to fit the data from each experiment individually. This means that we need to create an INCA script for each of them.

We will start with just one of the three experiments, fructose.

In [28]:
script_fructose = INCA_script_writing.create_inca_script_from_data(
    reactions_data=reacts_processed, 
    tracer_data=tracer_info, 
    ms_measurements=ms_data,
    experiment_ids=['fructose']
)

The INCA parser has now created most of the INCA script, but it has not specified any options other than the defaults, nor any instructions on which algorithms INCA should execute on the data. The data was already corrected for natural abundance, and we will assume that this also includes the unlabelled atoms. Therefore, we will specify two INCA settings: `sim_na` and `sim_more`. The first determines whether to simulate the natural abundance of the labelled atoms, the second whether to simulate that of the unlabelled atoms.

In [29]:
script_fructose.add_to_block("options", INCA_script_writing.define_options(sim_na=False, sim_more=False))

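Next, we will specify which algorithms to run. The available steps are estimate, continuate and simulate (simulate is required for the results to be opened in the INCA GUI). Each step is toggled with the corresponding `run_...` flag; note that all three are set to `False` in the cell below.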

In [30]:
script_fructose.add_to_block("runner", INCA_script_writing.define_runner("Literature data/Cupriavidus necator  Alagesan 2017/c_necator_fructose.mat", run_estimate=False, run_simulation=False, run_continuation=False))

In [31]:
script_fructose.save_script("Literature data/Cupriavidus necator  Alagesan 2017/c_necator_fructose.m")

In [32]:
# it is the methionine idvs that are wrong: there are more measurements than labelled atoms in the model
# there is also something wrong with the passing of the data: some fragments are missing
run_inca(script_fructose, INCA_base_directory)

INCA script saved to /var/folders/z6/mxpxh4k56tv0h0ff41vmx7gdwtlpvp/T/tmpssl7cic4/inca_script.m.
Starting MATLAB engine...
 
ms_fructose = 1x23 msdata object
 
fields: atoms  id  [idvs]  more  on  state  
 
Alanine232 Alanine260 Asparticacid302 Asparticacid390 Asparticacid418 Glutamicacid330 Glutamicacid432 Glycine218 Glycine246 Histidine338 Histidine440 Isoleucine274 Leucine274 Methionine320 Phenylalanine302 Phenylalanine308 Phenylalanine336 Serine362 Serine390 Threonine376 Threonine404 Valine260 Valine288
 
 

In [29]:
experimentalMS_data_I.experiment_id.unique()

array(['[1,2-13C]glycerolandCO2'], dtype=object)

In [30]:
from BFAIR.mfa.INCA.INCA_results import INCA_results

In [31]:
res = INCA_results("C_necator.mat")

In [37]:
res.fitdata.fitted_parameters.query("type == 'Net flux'").query("eqn.str.contains('3PG')")

Unnamed: 0,type,id,eqn,val,std,lb,ub,unit,free,alf,chi2s,cont,cor,cov,vals,base
3,Net flux,R10,3PG -> G3P,15.66367,764.959548,[],[],[],1,0.05,[],1,"[-0.8775680723982162, 0.00023667780352492045, ...","[-1522.620734422586, 0.001810722045440154, 0.0...",[],{'id': []}
4,Net flux,R11,3PG -> PEP,1.142262,239.247937,[],[],[],0,0.05,[],1,"[0.9977056078555188, 3.568390500707045e-06, 3....","[541.4060078812232, 8.538397432866773e-06, 8.5...",[],{'id': []}
5,Net flux,R12,PEP -> 3PG,0.6909481,28.692899,[],[],[],1,0.05,[],1,"[-0.9435229471547154, 0.00027963130521239374, ...","[-61.404379620693064, 8.0244634750537e-05, 8.0...",[],{'id': []}
42,Net flux,R46,RUBP + CO2 -> 3PG + 3PG,0.761857,28.305497,[],[],[],1,0.05,[],1,"[-0.9964076021805945, 0.0003130007631041342, 0...","[-63.970578676181816, 8.860780476573071e-05, 8...",[],{'id': []}
49,Net flux,R52,3PG -> SER,9.999962e-08,9488.461664,[],[],[],0,0.05,[],1,"[-0.02973593667895147, 0.00010209758630875627,...","[-639.9556619278513, 0.009688734862258226, 0.0...",[],{'id': []}
50,Net flux,R53,3PG -> CYS,1e-07,9484.245773,[],[],[],1,0.05,[],1,"[-0.00015403116004471162, -0.00010081785400171...","[-3.3134760904294303, -0.009563041436877917, -...",[],{'id': []}
51,Net flux,R54,3PG -> GL + MTHF,2.000009e-07,0.040067,[],[],[],0,0.05,[],1,"[-0.06029186368099066, 2.2992937604176356e-05,...","[-0.005479213367024284, 9.213768553673063e-09,...",[],{'id': []}
73,Net flux,R9,G3P -> 3PG,14.59127,728.51408,[],[],[],1,0.05,[],1,"[-0.868530384236555, 0.0002316065156711028, 0....","[-1435.1438067999543, 0.0016875028377398849, 0...",[],{'id': []}


Now that we have loaded the INCA results, we can inspect them. The first step is to look at the diagnostics. Here we want to check a few things:
1. Did the fit pass the goodness-of-fit test?
2. Are the residuals normally distributed?
3. Are there any measurements that appear to be outliers?

 

In [34]:
res.fitdata.get_goodness_of_fit()

Fit accepted: False
Confidence level: 0.05
Chi-square value (SSR): 478.4614348817875
Expected chi-square range: [33.96812643 73.8098634 ]
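
In 13C-MFA the goodness-of-fit test commonly checks whether the sum of squared residuals (SSR) lies inside a two-sided chi-square interval. The sketch below illustrates that check with the standard library only, using the Wilson-Hilferty approximation for the chi-square quantiles. It is not how `get_goodness_of_fit()` is implemented; the function names are hypothetical, and the 52 degrees of freedom are an assumed value for illustration. Exact quantiles are available from `scipy.stats.chi2.ppf`.

```python
from statistics import NormalDist


def chi2_quantile_wh(p: float, dof: int) -> float:
    """Approximate chi-square quantile (Wilson-Hilferty transformation)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def ssr_accepted(ssr: float, dof: int, alpha: float = 0.05):
    """Two-sided check: accept if the SSR lies inside the expected range."""
    lower = chi2_quantile_wh(alpha / 2, dof)
    upper = chi2_quantile_wh(1 - alpha / 2, dof)
    return lower <= ssr <= upper, (lower, upper)


# dof=52 is an assumed value for illustration only
accepted, (lower, upper) = ssr_accepted(478.46, dof=52)
print(accepted, round(lower, 2), round(upper, 2))
```

An SSR far above the upper bound, as in the output above, indicates that the model does not explain the measurements within their stated errors.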


In [35]:
res.fitdata.test_normality_of_residuals()

Residuals are normally distributed: False on a 0.05 significance level


In [36]:
import BFAIR.mfa.visualization.diagnositics as diagnostics

diagnostics.plot_norm_probplot(res, interactive=True)

  for col_name, dtype in df.dtypes.iteritems():


In [27]:
res.measurements_and_fit_detailed['weighted residual'].transform(lambda x: x**2).sum()

AttributeError: 'INCA_results' object has no attribute 'measurements_and_fit_detailed'

In [None]:
res.fitdata.fitted_parameters['eqn']

0     FRU.ext -> F6P
1     GLY.ext -> GLY
2        GLY -> DHAP
3         3PG -> G3P
4         3PG -> PEP
           ...      
92                []
93                []
94                []
95                []
96                []
Name: eqn, Length: 97, dtype: object


The issue is that the parser for the parameter info relies on very specific naming of the fragments, i.e. it splits the id and uses hardcoded indexing to select different features. 

It happens because `len(fragment_list)` is < 5. In general, though, this data does not contain the compound equation for the fragments. The best solution would probably be to rework the data model so that the information currently stored in the `rxn_id` is handled more properly.

The `rxn_id` is parsed directly from the MATLAB object in `get_fitted_parameters()`. Thus, it is MATLAB/INCA that creates these strange ids.

The fragment data does not contain the fragment formula; I think this is the reason why it fails.

The `msdata()` object has the `.more` attribute, which appears to contain the formula of all non-labelled atoms. This could for example be all non-carbon atoms in the molecule, or the formula of the derivatized compound minus the carbon atoms that originate from the amino acid.


## Refactoring sort_... functions
These functions are hard to understand because they do many things at the same time. Their main purpose seems to be converting a dictionary into a `pd.DataFrame`. This can be done much more simply using `pd.DataFrame.from_dict()`. Then it only remains to add a bit of extra information, such as `simulation_id` and `simulation_dateAndTime`.

We see that only `simulation_id` and `simulation_dateAndTime` are missing.

Expected output columns of `fittedMeasuredFragmentsResiduals` (found in `MFA_INCA_data_reimport.ipynb`):

- `simulation_id`: MISSING
- `simulation_dateAndTime`: MISSING
- `experiment_id`
- `sample_name_abbreviation`: MISSING
- `time_point`
- `fragment_id`: MISSING
- `fragment_mass`: MISSING
- `res_data`
- `res_fit`
- `res_peak`
- `res_stdev`
- `res_val`
- `res_msens`: MISSING, but hardcoded to None
- `res_esens`: MISSING, but hardcoded to None
- `used_`: MISSING, but hardcoded to True
- `comment_`: MISSING, but hardcoded to None
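
The refactoring proposed in this section could look roughly like the sketch below. The dictionary content, the simulation id and the timestamp are made-up placeholders; only the column names are taken from the list of expected output columns.

```python
from datetime import datetime

import pandas as pd

# hypothetical result dictionary, keyed by column name (made-up numbers)
residuals = {
    "experiment_id": ["fructose", "fructose"],
    "res_data": [0.9759, 0.0230],
    "res_fit": [0.9701, 0.0262],
}

df = pd.DataFrame.from_dict(residuals, orient="columns")
# add the metadata that is not part of the dictionary
df = df.assign(
    simulation_id="simulation_1",
    simulation_dateAndTime=datetime(2023, 1, 1, 12, 0),
)
print(df.columns.tolist())
```

Keeping the conversion and the metadata step separate makes each of them easy to test on its own.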

The reaction ID contains the fragment id; this is due to the specific id schema for the fragments. Thus it should be safe to take the fragment ID from the reaction ID.