# Reading Agilent GCMS Files with `chemtbd`

> __NOTE__: We need a name.  See [issue 3](https://github.com/blakeboswell/chemtbd/issues/3).

Currently there is a hierarchy of objects for reading GCMS data:

- `GcmsDir` object will read `RESULTS.csv` and `DATA.MS` from a single Agilent `.D` directory
- `GcmsData` object will read a `DATA.MS` file
- `GcmsResults` object will read a `RESULTS.csv` file

These objects can be imported and used directly.  However, the main interface for reading files is the `chemtbd.io.Agilent` object, which wraps the objects above.

To use `chemtbd.io.Agilent`, import it as follows.  The directory that contains the `chemtbd` folder must be the working directory.

In [1]:
from chemtbd.io import Agilent

`Agilent` provides three main read functions:

- `from_dir` expects a path to a single Agilent `.D` directory as input
- `from_root` expects a path to a parent directory containing only `.D` directories as children
- `from_list` expects a list of paths to Agilent `.D` directories
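The relationship between `from_root` and `from_list` can be illustrated with a small helper that collects `.D` directories from a parent folder. This is only a sketch: `find_agilent_dirs` is a hypothetical name, not part of `chemtbd`, and whether `from_list` accepts plain strings (as returned here) is an assumption.

```python
from pathlib import Path


def find_agilent_dirs(root):
    """Return the sorted paths of child directories whose names end in .D.

    Hypothetical helper, not part of chemtbd. The suffix check is
    case-insensitive, so both 'FA01.D' and 'FA01.d' are picked up.
    """
    root = Path(root)
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_dir() and p.suffix.lower() == '.d'
    )


# Possible usage (assumes from_list accepts a list of string paths):
# agi = Agilent.from_list(find_agilent_dirs('data/test3'))
```

A helper like this is useful when the parent directory also holds other files or folders, since `from_root` expects only `.D` directories as children.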

For example, let's load all `.D` folders from the directory `data/test3`:

In [7]:
agi = Agilent.from_root('data/test3')

TypeError: 'NoneType' object is not iterable

Let's look at which `.D` folders were loaded from the above directory:

In [None]:
agi.keys()

## Accessing All Files

We can access the __RESULTS.CSV__ `lib`, `fid`, and `tic` tables from all Agilent directories as a single pandas DataFrame using the commands below.

In [None]:
agi.results_lib.head()

In [None]:
agi.results_fid.head()

We can access the __DATA.MS__ `tme` tables from all Agilent directories as a single pandas DataFrame using the command below.  The same command will work for the `tic` table.

In [None]:
agi.datams.head()

> __NOTE__:  `tme` and `tic` from `DATA.MS` should probably be in the same table.  See [issue 2](https://github.com/blakeboswell/chemtbd/issues/2) for discussion.

## Accessing a Single Agilent Directory

By default, the `key` (the directory name) is the index of the Agilent DataFrames. We can therefore access the `RESULTS.CSV` and `DATA.MS` data for each `.D` directory individually through standard pandas index selection:

In [None]:
agi.results_tic.loc['FA01.D'].head()

In [None]:
agi.results_tic.loc['FA05.D'].head()
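To see the selection pattern without any Agilent files on disk, here is a minimal sketch on a toy DataFrame. The frame only stands in for `agi.results_tic`: the values are made up, and the assumption is a single-level index holding the directory name.

```python
import pandas as pd

# Toy stand-in for agi.results_tic; the values are invented for illustration.
toy_tic = pd.DataFrame(
    {'height': [10.0, 30.0, 20.0, 40.0],
     'area': [100.0, 300.0, 200.0, 400.0]},
    index=pd.Index(['FA01.D', 'FA01.D', 'FA05.D', 'FA05.D'], name='key'),
)

# A single label returns every row recorded for that directory.
fa01 = toy_tic.loc['FA01.D']

# A list of labels returns the rows for several directories at once.
both = toy_tic.loc[['FA01.D', 'FA05.D']]
```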

Calculating aggregate metrics across folders can also be done efficiently using standard pandas methods:

In [None]:
# A list of function names; pandas 1.0 and later reject nested dicts in agg.
metrics = ['min', 'max', 'mean']
agi.results_tic.groupby(level=0).agg({'height': metrics, 'area': metrics})

In [None]:
%matplotlib inline

agi.results_tic.groupby(level=0).agg({'height': metrics, 'area': metrics}).plot()
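If flat column names are easier to work with than two-level columns, pandas named aggregation is one alternative. The sketch below runs on a toy frame that stands in for `agi.results_tic`; the values and the output column names (`height_min` and so on) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for agi.results_tic; the values are invented for illustration.
toy_tic = pd.DataFrame(
    {'height': [10.0, 30.0, 20.0, 40.0],
     'area': [100.0, 300.0, 200.0, 400.0]},
    index=['FA01.D', 'FA01.D', 'FA05.D', 'FA05.D'],
)

# Named aggregation: output_column=(input_column, function).
summary = toy_tic.groupby(level=0).agg(
    height_min=('height', 'min'),
    height_max=('height', 'max'),
    area_mean=('area', 'mean'),
)
```

Each row of `summary` is one directory, so a single metric can be read with, for example, `summary.loc['FA01.D', 'height_max']`.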

## Chromatogram?

Below is a temporary interface for accessing data from `DATA.MS` files. It is not yet clear what to do with this data.

In [None]:
from chemtbd.io import GcmsData

Read directly from a single file (no stacking yet, because it is not clear that the data is stackable).

In [None]:
gcms_data = GcmsData('data/test3/FA01.d/DATA.MS')

In [None]:
chrom = gcms_data.chromatogram

The resulting data frame has an `index` equal to time and `columns` equal to ions.

In [None]:
chrom.head()

The data in the `chrom` data frame is a time series. People will generally be interested in two things.

(1) The sum of the rows. The area under each peak, plotted below, is proportional to concentration. This is the same as the `tic` vs. `tme` data shown previously.

The plot below is what someone would want to see to verify their data. This file is a standard curve, meaning they put in a known concentration of 9 species, which is reflected in the appearance of 9 distinct peaks.

In [None]:
%matplotlib inline
chrom.sum(axis=1).plot()
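Because peak area is what relates to concentration, it can help to see how an area could be computed from the summed signal. The sketch below uses a synthetic chromatogram (one Gaussian peak, two made-up ion columns) rather than real `DATA.MS` output, and integrates a hand-picked window with the trapezoid rule. The window bounds are assumptions; real data would need proper peak detection and baseline handling.

```python
import numpy as np
import pandas as pd

# Synthetic chromatogram: rows are times (min), columns are ion m/z values.
times = np.linspace(0.0, 2.0, 201)
peak = np.exp(-((times - 1.0) ** 2) / (2 * 0.05 ** 2))  # Gaussian at 1.0 min
toy_chrom = pd.DataFrame({43: 3.0 * peak, 57: 1.0 * peak}, index=times)

# Total ion current: sum across all ion columns at each time point.
tic = toy_chrom.sum(axis=1)

# Integrate the TIC over a window around the peak with the trapezoid rule.
window = tic.loc[0.8:1.2]
t = window.index.to_numpy()
y = window.to_numpy()
peak_area = float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))
```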

(2) For each time point there are X columns. Each column represents the strength of an ion at that time point. The ions are generated when a molecule hits the detector. Since (theoretically) only one compound is hitting the detector during the time range from the beginning to the end of a peak, the pattern of the ion strengths is a signature for that specific molecule. It is effectively a molecular fingerprint. Sometimes people like to look at this data directly, but more importantly, people cross-reference it against a library and have the library tell them the molecule.

Below is the molecular "fingerprint" for the peak around 15.1 min.

In [None]:
import pandas as pd
%matplotlib inline
example_ion_df = pd.DataFrame(chrom.iloc[2101,:]).sort_index()
example_ion_df.plot()
example_ion_df.head()
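One simple way to cross-reference a fingerprint against a library is to score it against each reference spectrum with cosine similarity and keep the best match. This is a sketch of that idea, not a description of how any particular library search works: the compound names, m/z values, and intensities below are all invented.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors (0.0 if either is all zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Made-up reference spectra on a shared set of m/z values (one column per compound).
mz = [41, 43, 57, 71]
library = pd.DataFrame(
    {'compound_a': [100.0, 20.0, 5.0, 0.0],
     'compound_b': [10.0, 80.0, 60.0, 5.0]},
    index=mz,
)

# Made-up observed spectrum, for example one row of the chromatogram.
observed = pd.Series([50.0, 12.0, 2.0, 0.0], index=mz)

# Score the observation against every library column and pick the best.
scores = library.apply(lambda ref: cosine_similarity(observed, ref))
best_match = scores.idxmax()
```

Cosine similarity ignores overall intensity, so the same compound at a different concentration still scores close to 1.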

A common feature of software that manipulates this data is that someone can click on a time point in the first plot and get a display of the second plot, or have some other way to specify a time point for which to see the ion profile.
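Selecting the ion profile by time rather than by row position could look like the sketch below. `ion_profile_at` is a hypothetical helper, not part of `chemtbd`, and the toy frame only assumes the layout described earlier (time as the index, ions as the columns).

```python
import numpy as np
import pandas as pd


def ion_profile_at(chrom, time):
    """Return the row of `chrom` whose index value is closest to `time`.

    Hypothetical helper; assumes a numeric time index.
    """
    pos = int(np.abs(chrom.index.to_numpy() - time).argmin())
    return chrom.iloc[pos]


# Toy chromatogram with made-up values: index is time (min), columns are ions.
toy_chrom = pd.DataFrame(
    {43: [0.0, 5.0, 1.0], 57: [0.0, 2.0, 8.0]},
    index=[15.0, 15.1, 15.2],
)

# The requested time need not match a recorded time point exactly.
profile = ion_profile_at(toy_chrom, 15.12)
```

With the real data, `ion_profile_at(chrom, 15.1).plot()` would replace the hard-coded row position in the cell above, assuming the index is in minutes.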