# Quantifying with GCQuant for GCMS & GC-TIC



To quantify GC data with GCQuant you will need three things:

1. A dataframe containing library_ids with corresponding retention times
2. A dataframe containing area with corresponding retention times
3. A dataframe with calibration curve information.

After reading in GCMS data, concentrations can be determined using GCQuant. GCQuant has the following hierarchy:

- no one
- no two
- no three

## LibraryIDs and Areas
For this example we will use the `Agilent` class to create the first two items, as shown in the code below. For a full description of the class, see the [example notebook](https://github.com/blakeboswell/chemtbd/blob/master/example.ipynb).

In [1]:
from chemtbd.io import Agilent
agi = Agilent.from_root('data/test3')

lib = agi.results_lib
area = agi.results_tic

In [2]:
lib.head()

| key | header= | pk | rt | pct_area | library_id | ref | cas | qual |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA03.D | 1= | 1.0 | 5.7877 | 2.0335 | Methyl octanoate | 17.0 | 000000-00-0 | 96.0 |
| FA03.D | 2= | 2.0 | 7.3441 | 3.4015 | Methyl decanoate | 1.0 | 000000-00-0 | 98.0 |
| FA03.D | 3= | 3.0 | 8.0364 | 1.7448 | Methyl undecanoate | 2.0 | 000000-00-0 | 98.0 |
| FA03.D | 4= | 4.0 | 8.6715 | 3.9674 | Methyl dodecanoate | 3.0 | 000000-00-0 | 98.0 |
| FA03.D | 5= | 5.0 | 9.2781 | 1.9607 | Methyl tridecanoate | 4.0 | 000000-00-0 | 99.0 |


In [3]:
area.head()

| key | header= | peak | rt | first | max | last | pk_ty | height | area | pct_max | pct_total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA01.D | 1= | 1.0 | 12.288 | 1600.0 | 1609.0 | 1647.0 | rBV3 | 71023.0 | 478771.0 | 39.71 | 6.909 |
| FA01.D | 2= | 2.0 | 13.598 | 1830.0 | 1838.0 | 1864.0 | rBV2 | 247725.0 | 825285.0 | 68.46 | 11.91 |
| FA01.D | 3= | 3.0 | 14.428 | 1977.0 | 1983.0 | 2004.0 | rBV | 481706.0 | 1098175.0 | 91.09 | 15.848 |
| FA01.D | 4= | 4.0 | 15.08 | 2091.0 | 2097.0 | 2109.0 | rBV | 806692.0 | 1205528.0 | 100.0 | 17.397 |
| FA01.D | 5= | 5.0 | 15.692 | 2198.0 | 2204.0 | 2215.0 | rBV | 731146.0 | 1085862.0 | 90.07 | 15.67 |


## Create Calibration DataFrame

Great! Now that the species have a corresponding area, the next step is to create calibration curves. To create them, the following will be needed, and this information must be captured in a dataframe:

- identify files within the subfolder that contain calibration curve data
- provide the known concentrations of each species in each file

We recommend either putting the data into a csv file and importing it using pandas or creating a pandas dataframe. Below is how this can be performed. 

#### IMPORTANT: 
1. The dataframe headers must have a library_id column and the remaining columns should be the file names for each standard. 
2. Be sure the file names are exactly the same as the files in the subfolder, including any extension in the name (e.g. ".D").
3. UNITS UNITS UNITS. The concentrations should be entered in molar (mol/L).
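
The first two requirements can be checked before handing the table to GCQuant. The sketch below is only an illustration: `check_standards` is a hypothetical helper (it is not part of chemtbd), and the set of file keys is typed in by hand rather than read from the data folder.

```python
import pandas as pd


def check_standards(standards, file_keys):
    """Return a list of problems found in a standards table (hypothetical helper)."""
    problems = []
    if "library_id" not in standards.columns:
        problems.append("missing 'library_id' column")
    # Every remaining column is expected to be the exact name of a data file.
    for name in standards.columns:
        if name != "library_id" and name not in file_keys:
            problems.append(f"no data file named {name!r}")
    return problems


# 'FA04' is missing its '.D' extension on purpose.
standards = pd.DataFrame({
    "library_id": ["Methyl palmitate"],
    "FA03.D": [0.25],
    "FA04": [0.5],
})
problems = check_standards(standards, {"FA03.D", "FA04.D", "FA05.D"})
print(problems)
```

An empty list means the column names line up with the available files. The units still have to be checked by hand.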

In [4]:
import pandas as pd
standards = pd.read_csv('data/standards.csv')
standards.head()

|  | library_id | FA03.D | FA04.D | FA05.D |
| --- | --- | --- | --- | --- |
| 0 | Methyl palmitate | 0.25 | 0.5 | 1 |
| 1 | Methyl heptadecanoate | 0.25 | 0.5 | 1 |
| 2 | Methyl docosanoate | 0.25 | 0.5 | 1 |
| 3 | Methyl undecanoate | 0.25 | 0.5 | 1 |
| 4 | Methyl cis-8,11,14-eicosatrienoate | 0.25 | 0.5 | 1 |
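
If a CSV file is not convenient, the same table can be built directly as a pandas dataframe. This is a minimal sketch that reuses the first three species and the concentrations shown above. The variable `standard_files` is only introduced here for illustration.

```python
import pandas as pd

# One row per species, one column per standard file.
# Concentrations are in mol/L, as required above.
standards = pd.DataFrame({
    "library_id": [
        "Methyl palmitate",
        "Methyl heptadecanoate",
        "Methyl docosanoate",
    ],
    "FA03.D": [0.25, 0.25, 0.25],
    "FA04.D": [0.5, 0.5, 0.5],
    "FA05.D": [1.0, 1.0, 1.0],
})

# Every column except library_id names a standard file.
standard_files = [col for col in standards.columns if col != "library_id"]
print(standard_files)
```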


## Analysis with GCQuant

Now that we have everything we need, let's load the data into GCQuant and see what analysis is available.

In [5]:
from chemtbd.io import GCQuant
gcq = GCQuant(lib,area,standards)

### Getting compiled data
GCQuant compiles the `lib` and `area` dataframes that you provided into a single dataframe. Rows (i.e. peaks) are matched on retention time, using the squared difference between the retention times in the two dataframes. The compiled dataframe is available as `gcq.compiled`, as shown below.

In [6]:
gcq.compiled.head()

| key | pk | rt | library_id | cas | qual | area |
| --- | --- | --- | --- | --- | --- | --- |
| FA03.D | 1.0 | 5.7877 | Methyl octanoate | 000000-00-0 | 96.0 | 1489466.0 |
| FA03.D | 2.0 | 7.3441 | Methyl decanoate | 000000-00-0 | 98.0 | 2491449.0 |
| FA03.D | 3.0 | 8.0364 | Methyl undecanoate | 000000-00-0 | 98.0 | 1277982.0 |
| FA03.D | 4.0 | 8.6715 | Methyl dodecanoate | 000000-00-0 | 98.0 | 2905961.0 |
| FA03.D | 5.0 | 9.2781 | Methyl tridecanoate | 000000-00-0 | 99.0 | 1436154.0 |
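
To make the matching step concrete, here is a rough sketch of pairing peaks by the smallest squared retention-time difference within each file. It is not GCQuant's implementation: `match_by_rt` is a hypothetical helper, and the retention times in the `area` table are invented for illustration (the areas are copied from the compiled table above).

```python
import pandas as pd

lib = pd.DataFrame({
    "key": ["FA03.D", "FA03.D", "FA03.D"],
    "rt": [5.7877, 7.3441, 8.0364],
    "library_id": ["Methyl octanoate", "Methyl decanoate", "Methyl undecanoate"],
})
# Hypothetical TIC retention times, deliberately out of order.
area = pd.DataFrame({
    "key": ["FA03.D", "FA03.D", "FA03.D"],
    "rt": [8.04, 5.79, 7.34],
    "area": [1277982.0, 1489466.0, 2491449.0],
})


def match_by_rt(lib, area):
    """Attach to each library peak the area of the closest TIC peak in the same file."""
    rows = []
    for _, peak in lib.iterrows():
        candidates = area[area["key"] == peak["key"]]
        sq_diff = (candidates["rt"] - peak["rt"]) ** 2
        best = candidates.loc[sq_diff.idxmin()]
        rows.append({**peak.to_dict(), "area": best["area"]})
    return pd.DataFrame(rows)


compiled = match_by_rt(lib, area)
print(compiled)
```

For larger tables, `pandas.merge_asof` with `by="key"` and `direction="nearest"` performs a similar nearest-value join on sorted data.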


### Calibration Curve Data
The calibration curve data is now available as well. The class provides fit statistics, as well as easily viewable plots, so you can make sure your curves are linear.

In [7]:
gcq.stdcurves.head()

|  | library_id | responsefactor | intercept | rvalue | pvalue | stderr | max | min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | All cis-4,7,10,13,16,19-docosahexaenoate methy... | 1.367304e-07 | 0.103774 | 0.999932 | 0.007421 | 1.593827e-09 | 6566581.0 | 1094384.0 |
| 1 | Methyl arachidate | 7.146232e-08 | -0.063537 | 0.999076 | 0.027367 | 3.073928e-09 | 14785000.0 | 4222622.0 |
| 2 | Methyl arachidonate | 1.344684e-07 | 0.067813 | 0.999319 | 0.000681 | 3.51112e-09 | 7056294.0 | 1390417.0 |
| 3 | Methyl cis-11,14,17-eicosatrienoate | 1.306034e-07 | 0.102132 | 1.0 | 0.0 | 0.0 | 6874772.0 | 3046387.0 |
| 4 | Methyl cis-11-eicosenoate | 6.336257e-08 | 0.008515 | 1.0 | 0.00011 | 1.092154e-11 | 15647413.0 | 3810380.0 |


It is good to make sure all of the correlations have a high r<sup>2</sup> for the linear fit. We can use `.sort_values('rvalue')` to order the dataframe by the `rvalue` column from lowest to highest. (Older pandas versions also offered `.sort('rvalue')`, which has since been removed.)

In [8]:
gcq.stdcurves.sort_values('rvalue').head()

|  | library_id | responsefactor | intercept | rvalue | pvalue | stderr | max | min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | Methyl linolelaidate |  |  | 0.0 |  |  | 6795725.0 | 6795725.0 |
| 15 | Methyl linoleate | 3.726853e-08 | -0.21515 | 0.981679 | 0.122048 | 7.233712e-09 | 31474938.0 | 11356919.0 |
| 17 | Methyl myristate | 7.870497e-08 | 0.035087 | 0.98484 | 0.01516 | 9.802298e-09 | 11874993.0 | 1821171.0 |
| 6 | Methyl cis-8,11,14-eicosatrienoate | 1.185876e-07 | 0.06983 | 0.995401 | 0.000374 | 6.588781e-09 | 7855741.0 | 1456720.0 |
| 5 | Methyl cis-5,8,11,14,17-Eicosopentaenoate | 1.304701e-07 | 0.086535 | 0.998149 | 0.038744 | 7.950033e-09 | 6922358.0 | 1128019.0 |


### Quantifying Concentrations

Now that we know our calibration curves have a good linear fit, we can easily calculate the unknown concentrations of each species using GCQuant's `.concentrations` property.

In [9]:
gcq.concentrations.head()

| key | pk | rt | library_id | cas | qual | area | conc | conc% | area% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA03.D | 1.0 | 5.7877 | Methyl octanoate | 000000-00-0 | 96.0 | 1489466.0 | 0.241746 | 0.03515 | 0.020335 |
| FA03.D | 2.0 | 7.3441 | Methyl decanoate | 000000-00-0 | 98.0 | 2491449.0 | 0.252273 | 0.03668 | 0.034015 |
| FA03.D | 3.0 | 8.0364 | Methyl undecanoate | 000000-00-0 | 98.0 | 1277982.0 | 0.252682 | 0.03674 | 0.017448 |
| FA03.D | 4.0 | 8.6715 | Methyl dodecanoate | 000000-00-0 | 98.0 | 2905961.0 | 0.243649 | 0.035426 | 0.039674 |
| FA03.D | 5.0 | 9.2781 | Methyl tridecanoate | 000000-00-0 | 99.0 | 1436154.0 | 0.249314 | 0.03625 | 0.019607 |


Nice! We now have the calculated concentrations in the `conc` column. Additionally, GCQuant provides the normalized concentration, `conc%`, and the normalized area, `area%`.
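
The reported numbers are consistent with a simple linear model, `conc = responsefactor * area + intercept`. The check below is a sketch under that assumption, not GCQuant's own code. It uses the coefficients and the `min` and `max` areas from the first row of the `stdcurves` table (the docosahexaenoate methyl ester).

```python
# Coefficients and areas copied from the first row of the stdcurves table.
responsefactor = 1.367304e-07
intercept = 0.103774
areas = {"min": 1094384.0, "max": 6566581.0}

# Apply the assumed linear calibration model to each area.
conc = {name: responsefactor * area + intercept for name, area in areas.items()}

for name, value in conc.items():
    print(f"{name} area -> {value:.6f} mol/L")
```

Within the rounding of the displayed coefficients, the two results agree with the concentrations reported for this species in FA03.D and FA05.D (0.25341 and 1.001626).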

But the dataframe isn't always the best way to view the results. GCQuant has a few built-in reporting methods to easily "see" your results.

## Report Results
### Pivot Table 
Pivot tables are generally a nice way of organizing data for printing. A pivot is relatively simple to build on your own, so below is an example of creating a pivot table with pandas' `pivot_table` function, which can be adapted to other cases.

*Note: it is important to reset the index of the gcq.concentrations dataframe so the `key` is then a column and not the index.*

In [10]:
pd.pivot_table(gcq.concentrations.reset_index(),
               index='library_id',
               columns='key',
               values='conc').head()

| library_id | FA03.D | FA04.D | FA05.D | FA08.D | FA09.D | FA11.D | FA12.D | FA13.D | FA14.D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All cis-4,7,10,13,16,19-docosahexaenoate methyl ester | 0.25341 | 0.494965 | 1.001626 |  |  |  |  |  |  |
| Methyl arachidate | 0.238221 | 0.518745 | 0.993033 |  |  |  |  |  |  |
| Methyl arachidonate | 0.25478 | 0.493979 | 1.000621 |  |  |  |  |  |  |
| Methyl cis-11,14,17-eicosatrienoate |  | 0.5 | 1.0 |  |  |  |  |  |  |
| Methyl cis-11-eicosenoate | 0.24995 | 0.500075 | 0.999975 |  |  |  |  |  |  |


We can also look at just our experimental data, without the standard curve data, by calling `gcq.concentrations_exp`. Similarly, we can get just the standard curve data with `gcq.concentrations_std`. Either can be placed in a pivot table.

In [11]:
pd.pivot_table(gcq.concentrations_exp.reset_index(),
               index='library_id',
               columns='key',
               values='conc').head()

| library_id | FA08.D | FA09.D | FA11.D | FA12.D | FA13.D | FA14.D |
| --- | --- | --- | --- | --- | --- | --- |
| Methyl linolelaidate |  |  |  |  |  |  |
| Methyl myristate |  |  |  |  | 0.054904 |  |
| Methyl palmitate |  |  | 0.100125 | 0.266785 | 0.270947 | 0.250985 |
| Methyl palmitoleate |  |  | 0.093528 | 0.174033 | 0.305703 | 0.160023 |
| Methyl stearate |  |  |  |  |  |  |


In [12]:
pd.pivot_table(gcq.concentrations_std.reset_index(),
               index='library_id',
               columns='key',
               values='conc').head()

| library_id | FA03.D | FA04.D | FA05.D |
| --- | --- | --- | --- |
| All cis-4,7,10,13,16,19-docosahexaenoate methyl ester | 0.25341 | 0.494965 | 1.001626 |
| Methyl arachidate | 0.238221 | 0.518745 | 0.993033 |
| Methyl arachidonate | 0.25478 | 0.493979 | 1.000621 |
| Methyl cis-11,14,17-eicosatrienoate |  | 0.5 | 1.0 |
| Methyl cis-11-eicosenoate | 0.24995 | 0.500075 | 0.999975 |


### Bar Plots

Your data can easily be plotted with GCQuant. There are many plotting packages available in python, but here Bokeh is used to prepare a few simple examples which you can use as a quick check or to build upon.

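Bokeh is well documented, so refer to its documentation for a detailed explanation of the code. Note that the `bokeh.charts` module used here is only available in older Bokeh releases; it was later moved to the separate `bkcharts` package.

First, import the required classes: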

In [13]:
from bokeh.charts import Bar, show, output_notebook
from bokeh.layouts import row

To see the charts displayed in the notebook use the line below.

In [14]:
output_notebook()

Now a bar chart of the concentration percentages can be created:

In [15]:
bar_conc_per = Bar(gcq.concentrations_exp.reset_index(),
                   label='key',
                   values='conc%',
                   agg='sum',
                   stack='library_id',
                   title='Concentration Percentage')
show(bar_conc_per)

To plot absolute values, we can create a similar plot.

In [16]:
bar_conc_abs = Bar(gcq.concentrations_exp.reset_index(),
                   label='key',
                   values='conc',
                   agg='sum',
                   stack='library_id',
                   title='Concentration')
show(bar_conc_abs)
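
As a cross-check on the stacked bars, the same aggregation can be sketched in plain pandas. This assumes that `agg='sum'` with `label='key'` and `stack='library_id'` sums `conc` per file and species. The values are copied from the FA11.D and FA12.D columns of the experimental pivot table above.

```python
import pandas as pd

conc = pd.DataFrame({
    "key": ["FA11.D", "FA11.D", "FA12.D", "FA12.D"],
    "library_id": ["Methyl palmitate", "Methyl palmitoleate"] * 2,
    "conc": [0.100125, 0.093528, 0.266785, 0.174033],
})

# One row per file, one column per species: the segments of each stacked bar.
stacked = conc.pivot_table(index="key", columns="library_id",
                           values="conc", aggfunc="sum")

# Total bar height per file.
totals = stacked.sum(axis=1)
print(stacked)
print(totals)
```

If matplotlib is installed, `stacked.plot.bar(stacked=True)` draws a comparable chart without Bokeh.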

The same data, grouped instead of stacked:

In [17]:
bar_conc_abs_nostack = Bar(gcq.concentrations_exp.reset_index(),
                           label='key',
                           values='conc',
                           group='library_id',
                           title='Concentration')
show(bar_conc_abs_nostack)

Plots can also be combined side by side in a single row:

In [18]:
show(row(bar_conc_per, bar_conc_abs_nostack))