# Reading Agilent GCMS Files with `chemtbd`

> __NOTE__: We need a name.  See [issue 3](https://github.com/blakeboswell/chemtbd/issues/3).

Currently, there is a hierarchy of objects for reading GCMS data:

- `GcmsDir` object will read `RESULTS.csv` and `DATA.MS` from a single Agilent `.D` directory
- `GcmsData` object will read a `DATA.MS` file
- `GcmsResults` object will read a `RESULTS.csv` file

These objects are available for import and direct use.  However, the main interface for file reading is the `chemtbd.io.Agilent` object, which wraps the objects above.

To use `chemtbd.io.Agilent`, import it as follows.  The working directory must be the directory that contains the `chemtbd` folder.

In [1]:
from chemtbd.io import Agilent

`Agilent` provides three main read functions:

- `from_dir` expects a path to a single Agilent `.D` directory as input
- `from_root` expects a path to a parent directory containing only `.D` directories as children
- `from_list` expects a list of paths to Agilent `.D` directories
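
The three input shapes listed above can be illustrated with plain `pathlib`. This is only a sketch of what each function expects to receive; `find_d_dirs` is a hypothetical helper written for this example and is not part of `chemtbd`.

```python
from pathlib import Path
import tempfile


def find_d_dirs(root):
    """Return the `.D` subdirectories directly under `root`, sorted by name.

    Hypothetical helper for illustration; not part of chemtbd.
    """
    root = Path(root)
    return sorted(
        p for p in root.iterdir() if p.is_dir() and p.suffix.upper() == ".D"
    )


# Build a throwaway directory tree to show the three input shapes.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("FA01.D", "FA02.D"):
        (Path(tmp) / name).mkdir()

    single = Path(tmp) / "FA01.D"   # the kind of path from_dir expects
    root = Path(tmp)                # the kind of path from_root expects
    listing = find_d_dirs(root)     # the kind of list from_list expects
    names = [p.name for p in listing]
```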

For example, let's load all `.D` folders from the directory `data/test3`:

In [2]:
agi = Agilent.from_root('data/test3')

The `Agilent` object loads data lazily.  After initialization, `agi` is a dictionary with the folder names as keys and `GcmsDir` objects as values.  When we ask it for data, it reads the data from disk, structures it as a pandas DataFrame, stores it in a cache, and finally returns it.  The next time we ask for the same data, the DataFrame is loaded from the cache.

Let's see which `.D` folders are in `agi`:
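
The read-once-then-cache behavior described here follows a common pattern, sketched below with the standard library only. `LazyCsv` is a hypothetical stand-in written for this example; it is not how `chemtbd` is implemented, and it returns a list of dicts rather than a DataFrame to keep the sketch dependency-free.

```python
import csv
import tempfile
from pathlib import Path


class LazyCsv:
    """Hypothetical lazily loaded table; not part of chemtbd."""

    def __init__(self, path):
        self.path = Path(path)
        self._cache = None  # nothing is read at construction time
        self.reads = 0      # counts disk reads, for demonstration only

    def load(self):
        if self._cache is None:  # first request: read from disk
            with self.path.open(newline="") as fh:
                self._cache = list(csv.DictReader(fh))
            self.reads += 1
        return self._cache       # later requests: reuse the cached rows


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "RESULTS.csv"
    path.write_text("peak,rt\n1,12.288\n2,13.598\n")

    table = LazyCsv(path)
    reads_before = table.reads  # still 0: construction does not touch disk
    first = table.load()        # triggers the only disk read
    second = table.load()       # served from the cache
    reads_after = table.reads
```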

In [3]:
agi.keys()

dict_keys(['FA01.D', 'FA02.D', 'FA03.D', 'FA04.D', 'FA05.D', 'FA06.D', 'FA07.D', 'FA08.D', 'FA09.D', 'FA10.D', 'FA11.D', 'FA12.D', 'FA13.D', 'FA14.D', 'FA15.D'])

## Accessing All Files

We can access the __RESULTS.CSV__ `tic` tables from all Agilent directories as a single pandas DataFrame using the command below.  The same command also works for the `lib` and `fdi` tables.

In [5]:
agi.results('tic').head()

|   | header= | peak | rt | first | max | last | pk_ty | height | area | pct_max | pct_total | key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1= | 1.0 | 12.288 | 1600.0 | 1609.0 | 1647.0 | rBV3 | 71023.0 | 478771.0 | 39.71 | 6.909 | FA01.D |
| 1 | 2= | 2.0 | 13.598 | 1830.0 | 1838.0 | 1864.0 | rBV2 | 247725.0 | 825285.0 | 68.46 | 11.91 | FA01.D |
| 2 | 3= | 3.0 | 14.428 | 1977.0 | 1983.0 | 2004.0 | rBV | 481706.0 | 1098175.0 | 91.09 | 15.848 | FA01.D |
| 3 | 4= | 4.0 | 15.08 | 2091.0 | 2097.0 | 2109.0 | rBV | 806692.0 | 1205528.0 | 100.0 | 17.397 | FA01.D |
| 4 | 5= | 5.0 | 15.692 | 2198.0 | 2204.0 | 2215.0 | rBV | 731146.0 | 1085862.0 | 90.07 | 15.67 | FA01.D |
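
The `key` column in the output above records which `.D` directory each row came from. One common way to build such a combined table in pandas is to tag each per-directory DataFrame with its folder name and concatenate the results. The sketch below shows that pattern under the assumption that the per-directory tables are held in a dict; the values are made up for illustration, and this is not necessarily how `chemtbd` does it internally.

```python
import pandas as pd

# Hypothetical per-directory tables keyed by `.D` folder name (made-up values).
tables = {
    "FA01.D": pd.DataFrame({"peak": [1, 2], "rt": [12.288, 13.598]}),
    "FA02.D": pd.DataFrame({"peak": [1], "rt": [12.301]}),
}

# Tag each table with its source folder, then stack them into one frame.
combined = pd.concat(
    [df.assign(key=name) for name, df in tables.items()],
    ignore_index=True,
)

# Rows from a single directory can be recovered by filtering on `key`.
fa01 = combined[combined["key"] == "FA01.D"]
```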


We can access the __DATA.MS__ `tme` tables from all Agilent directories as a single pandas DataFrame using the command below.  The same command also works for the `tic` table.

In [6]:
agi.data('tme').head()

|   | tme | key |
|---|---|---|
| 0 | 3.086817 | FA01.D |
| 1 | 3.092533 | FA01.D |
| 2 | 3.09825 | FA01.D |
| 3 | 3.103983 | FA01.D |
| 4 | 3.1097 | FA01.D |


> __NOTE__:  `tme` and `tic` from `DATA.MS` should probably be in the same table.  See [issue 2](https://github.com/blakeboswell/chemtbd/issues/2) for discussion.

## Accessing a Single Agilent Directory

We can access the `RESULTS.CSV` and `DATA.MS` data for each `.D` directory individually as follows:

In [8]:
agi['FA01.D'].results['tic'].head()

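|   | header= | peak | rt | first | max | last | pk_ty | height | area | pct_max | pct_total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1= | 1 | 12.288 | 1600 | 1609 | 1647 | rBV3 | 71023 | 478771 | 39.71 | 6.909 |
| 1 | 2= | 2 | 13.598 | 1830 | 1838 | 1864 | rBV2 | 247725 | 825285 | 68.46 | 11.91 |
| 2 | 3= | 3 | 14.428 | 1977 | 1983 | 2004 | rBV | 481706 | 1098175 | 91.09 | 15.848 |
| 3 | 4= | 4 | 15.08 | 2091 | 2097 | 2109 | rBV | 806692 | 1205528 | 100.0 | 17.397 |
| 4 | 5= | 5 | 15.692 | 2198 | 2204 | 2215 | rBV | 731146 | 1085862 | 90.07 | 15.67 |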


In [9]:
agi['FA01.D'].data['tme'].head()

|   | tme |
|---|---|
| 0 | 3.086817 |
| 1 | 3.092533 |
| 2 | 3.09825 |
| 3 | 3.103983 |
| 4 | 3.1097 |


> __NOTE__: The interfaces are not consistent yet.  Eventually we will have the option to use either brackets or dot notation to request attributes from the `results` and `data` objects (as opposed to sometimes using parentheses).
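
One way the bracket-or-dot access mentioned in the note could be supported is with `__getitem__` and `__getattr__` on a small container class. The sketch below is only an illustration of that idea; `Tables` is a hypothetical name and does not reflect the current or planned `chemtbd` code.

```python
class Tables:
    """Hypothetical container allowing both obj['name'] and obj.name access."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def __getitem__(self, name):
        # Bracket access: results['tic']
        return self._tables[name]

    def __getattr__(self, name):
        # Dot access: results.tic
        # Only called when normal attribute lookup fails. Going through
        # __dict__ avoids infinite recursion if _tables is not yet set.
        tables = self.__dict__.get("_tables", {})
        if name in tables:
            return tables[name]
        raise AttributeError(name)


# Placeholder values stand in for DataFrames here.
results = Tables({"tic": [1, 2, 3], "lib": []})
same_object = results["tic"] is results.tic
```

A missing name raises `KeyError` with brackets and `AttributeError` with dot notation, which matches what Python users generally expect from each syntax.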