# Using ChromProcess

This document runs through one possible way of using the code contained in ChromProcess, thereby introducing its core functionality.

# What is ChromProcess?

ChromProcess is a collection of Python functions and objects which provide a framework for manipulating sets of chromatographic data. The code is currently flexible enough to handle the data we work with in the [Huck Lab](https://www.hucklab.com); adapting it to other formats or workflows will require further development.

There are several types of information required to reproducibly assign and/or quantify chromatographic peaks. ChromProcess is one way of interfacing these various information sources.

# Source Files

## Experiment-specific files

Experimental data and 'metadata' (primary data) go in at the front of the analysis pipeline. Files containing information about the experiment should not be altered during analysis. Primary data files consist of several chromatographic data files (at the time of writing, ChromProcess can load `.cdf`, `.txt` or `.csv` formats) and a conditions file (comma-separated values, `.csv`). The chromatographic data files should be named in such a way that they can be sorted programmatically. For example, the sequence `chrom_001.csv`, `chrom_002.csv`, `chrom_003.csv` would be appropriate. The structure and contents of an example conditions file are shown below.

In [13]:
import csv

with open('../Templates/example_conditions.csv', "r") as f:
    csv_reader = csv.reader(f, delimiter=',')
    for row in csv_reader:
        line = row[0] + ': '
        line += ','.join([x for x in row[1:] if x != ''])
        print(line)

Dataset: example_dataset_name
start_experiment_information: 
series_values: 10,20,30
series_unit: time/ s
end_experiment_information: 
start_conditions: 
dihydroxyacetone/ M: 2
formaldehyde/ M: 2
NaOH/ M: 0.12
CaCl2/ M: 0.06
water/ M: 0
end_conditions: 


- Dataset: Reference code for the experiment
- start_experiment_information: # tag for experiment information field
- series_values: ordered values associated with the chromatograms (e.g. timepoints)
- series_unit: the units for series_values; format: {name}/ {unit}
- end_experiment_information: # tag for experiment information field
- start_conditions: # tag for experiment conditions field
    - Information between these two tags forms key: list pairs, e.g.
        - formaldehyde_concentration/ M, 0.2
        - water_flow_rate/ L/h, 0.1, 0.005, 0.0001 
- end_conditions: # tag for experiment conditions field
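To make the layout of the conditions file concrete, here is a minimal sketch of reading it into a plain dictionary using only the standard library. It is not ChromProcess's own loader: the helper name `parse_conditions`, the dictionary keys and the assumption that every condition value is numeric are illustrative choices, and the inline text simply mirrors the fields listed above.

```python
import csv
import io

# Inline stand-in for a conditions file (values mirror the example above).
EXAMPLE = """\
Dataset,example_dataset_name
start_experiment_information,
series_values,10,20,30
series_unit,time/ s
end_experiment_information,
start_conditions,
dihydroxyacetone/ M,2
formaldehyde/ M,2
end_conditions,
"""


def parse_conditions(handle):
    """Parse a conditions file into a dict (hypothetical helper, not the library API)."""
    parsed = {"conditions": {}}
    in_conditions = False
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        key = row[0].strip()
        values = [x.strip() for x in row[1:] if x.strip() != ""]
        if key == "start_conditions":
            in_conditions = True
        elif key == "end_conditions":
            in_conditions = False
        elif in_conditions:
            # Each line between the tags is a key followed by a list of values.
            # Assumption: all condition values are numeric.
            parsed["conditions"][key] = [float(v) for v in values]
        elif key == "Dataset":
            parsed["dataset"] = values[0]
        elif key == "series_values":
            parsed["series_values"] = [float(v) for v in values]
        elif key == "series_unit":
            parsed["series_unit"] = values[0]
    return parsed


parsed_conditions = parse_conditions(io.StringIO(EXAMPLE))
print(parsed_conditions["series_values"], parsed_conditions["series_unit"])
```

Keeping the conditions as lists, even when only one value is present, matches the "key: list" description above and avoids special-casing single values.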

Keeping track of the operations performed on the primary data is an important component of a reproducible pipeline. To record and input details of analysis operations, an analysis details file (`.csv`) is also included alongside the primary data. The structure and contents of an example analysis file are shown below.

In [6]:
with open('../Templates/example_analysis_details.csv', "r") as f:
    csv_reader = csv.reader(f, delimiter=',')
    for row in csv_reader:
        line = row[0] + ': '
        line += ','.join([x for x in row[1:] if x != ''])
        print(line)

Dataset: example_dataset_name
Method: GCMS
regions: lower (float),upper (float),lower (float),upper (float)
internal_reference_region: lower (float),upper (float)
extract_mass_spectra: TRUE
mass_spectra_filter: 500
peak_pick_threshold: 0.1
dilution_factor: 5.714285714
dilution_factor_error: 0.0447
internal_ref_concentration: 0.0008
internal_ref_concentration_error: 9.89E-06


The fields are as follows:

- Dataset: Experiment code
- Method: Analysis method used to collect the data (e.g. GCMS or HPLC).
- regions: pairs of lower, upper retention times which outline regions of the chromatogram in which the program will search for peaks.
- internal_reference_region: a pair of lower, upper bounds between which the internal reference (internal standard) should lie.
- extract_mass_spectra: whether to extract mass spectra information from the files during the analysis (TRUE) or not (FALSE).
- peak_pick_threshold: Threshold (as a fraction of the highest signal in a region) above which peaks will be detected.
- dilution_factor: Dilution factor applied during sample preparation (multiplying the concentrations derived from peak integrals by this value will convert them into those present in the unprepared sample)
- dilution_factor_error: standard error for the dilution factor.
- internal_ref_concentration: Concentration of the internal reference (internal standard)
- internal_ref_concentration_error: Standard error of the concentration of the internal reference (internal standard).
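Two of these fields lend themselves to a short illustration: `regions` is a flat list that is read as (lower, upper) pairs, and `dilution_factor` scales a measured concentration back to the unprepared sample. The sketch below is not taken from ChromProcess. The helper names are hypothetical, the retention times and measured concentration are made up, and the error propagation assumes independent errors combined in quadrature.

```python
import math


def pair_regions(flat_bounds):
    """Group a flat [lower, upper, lower, upper, ...] list into (lower, upper) pairs."""
    if len(flat_bounds) % 2 != 0:
        raise ValueError("regions must contain an even number of bounds")
    pairs = list(zip(flat_bounds[0::2], flat_bounds[1::2]))
    for lower, upper in pairs:
        if lower >= upper:
            raise ValueError(f"lower bound {lower} is not below upper bound {upper}")
    return pairs


def undo_dilution(conc, conc_err, factor, factor_err):
    """Multiply a concentration by the dilution factor and propagate the error.

    Assumes independent errors, so relative errors add in quadrature.
    The concentration must be non-zero for the relative error to be defined.
    """
    corrected = conc * factor
    rel_err = math.sqrt((conc_err / conc) ** 2 + (factor_err / factor) ** 2)
    return corrected, abs(corrected) * rel_err


# Made-up retention time bounds, in the same flat order as the `regions` field.
regions = pair_regions([6.0, 8.5, 10.2, 14.0])
print(regions)

# Dilution values from the example file; the measured concentration is made up.
corrected, corrected_err = undo_dilution(0.001, 0.00005, 5.714285714, 0.0447)
print(corrected, corrected_err)
```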

Additionally, a local assignments file (initially an empty `.csv` file) can be added to the project. These files can be arranged in folders however you wish, but whichever organisation scheme is used must be systematic. An example structure is shown below:
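As a minimal sketch, one hypothetical layout can be created and listed with `pathlib`. The folder and file names here are assumptions chosen for illustration, not necessarily the arrangement the authors use. The sketch builds the tree in a temporary directory, so nothing is written to the project.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one folder per experiment, raw data kept apart from outputs.
LAYOUT = [
    "example_dataset_name/example_conditions.csv",
    "example_dataset_name/example_analysis_details.csv",
    "example_dataset_name/example_local_assignments.csv",
    "example_dataset_name/Chromatograms/chrom_001.csv",
    "example_dataset_name/Chromatograms/chrom_002.csv",
    "example_dataset_name/PeakTables/",
]


def build_layout(root, entries):
    """Create empty placeholder files and folders under root."""
    for entry in entries:
        path = Path(root) / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()


with tempfile.TemporaryDirectory() as root:
    build_layout(root, LAYOUT)
    # Collect every created path relative to the temporary root.
    created = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))

print("\n".join(created))
```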

## Files applicable to several experiments

Calibration information for an instrument may be applicable to one or more sets of data. This information can therefore be stored separately from the primary data. Alternatively, a copy of a calibration file can be included within the directory of each experiment. This method has the benefit of associating the data with the calibration less ambiguously. On the other hand, if changes must be made to a calibration file (e.g. adding a new calibration for a compound), multiple files must be updated, which may be more labour-intensive and error-prone. Either way, creating a file which assigns each experiment to a calibration file (including paths to each file) is beneficial, as is creating a workflow for updating analyses in response to changes to the source files.
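A file assigning experiments to calibration files could be as simple as a two-column `.csv`. The sketch below assumes a hypothetical format (the column names `experiment` and `calibration_file`, and the paths, are invented for illustration) and shows how such a file can also answer the question "which analyses need updating after this calibration file changed?".

```python
import csv
import io

# Hypothetical mapping file; column names and paths are assumptions.
MAPPING_CSV = """\
experiment,calibration_file
example_dataset_name,Calibrations/calibration_A.csv
another_dataset_name,Calibrations/calibration_A.csv
third_dataset_name,Calibrations/calibration_B.csv
"""


def load_calibration_map(handle):
    """Read the experiment -> calibration file mapping, rejecting duplicates."""
    mapping = {}
    for row in csv.DictReader(handle):
        experiment = row["experiment"].strip()
        if experiment in mapping:
            raise ValueError(f"duplicate entry for {experiment}")
        mapping[experiment] = row["calibration_file"].strip()
    return mapping


def experiments_using(mapping, calibration_file):
    """List experiments to re-analyse after a calibration file changes."""
    return sorted(e for e, c in mapping.items() if c == calibration_file)


calibration_map = load_calibration_map(io.StringIO(MAPPING_CSV))
print(calibration_map["example_dataset_name"])
print(experiments_using(calibration_map, "Calibrations/calibration_A.csv"))
```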

# Overview of an Example Analysis Pipeline

The first step is to create peak table files containing peak positions, boundaries and integrals for each chromatogram. Each chromatogram is loaded in as a `Chromatogram` object. The analysis and conditions files are also loaded as `Analysis_Information` and `Experiment_Conditions` objects, respectively. Information in the analysis file is used to find peaks in each chromatogram before each peak is integrated. The `Chromatogram` can then be used to create a peak table with associated condition information, if required.

Before beginning the analysis, the chromatograms should be inspected and information should be input into the analysis file as appropriate (regions, concentrations, etc.). First, the source files are directly converted into objects:

In [10]:
import os

from ChromProcess import Classes

conditions_file = '../Templates/example_conditions.csv'
analysis_file = '../Templates/example_analysis_details.csv'
chromatogram_directory = '../Templates/ExampleChromatograms'
conditions = Classes.Experiment_Conditions(information_file = conditions_file)
analysis = Classes.Analysis_Information(information_file = analysis_file)
chromatogram_files = os.listdir(chromatogram_directory)
chromatogram_files.sort()

chroms = []
for f in chromatogram_files:
    chroms.append(Classes.Chromatogram(f'{chromatogram_directory}/{f}'))

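The call to `chromatogram_files.sort()` orders file names lexicographically, which is why zero-padded names are recommended. The standalone sketch below uses made-up file names to show the difference. The `natural_key` helper is a hypothetical fallback for unpadded names and is not part of ChromProcess.

```python
import re

unpadded = ["chrom_10.csv", "chrom_2.csv", "chrom_1.csv"]
padded = ["chrom_010.csv", "chrom_002.csv", "chrom_001.csv"]


def natural_key(name):
    """Split a name into text and integer chunks so numbers compare numerically."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


# Lexicographic sorting puts chrom_10 before chrom_2.
print(sorted(unpadded))
# Zero-padded names sort in acquisition order without extra work.
print(sorted(padded))
# A natural sort key recovers the intended order for unpadded names.
print(sorted(unpadded, key=natural_key))
```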

Next, a peak for the internal standard is picked using information in `analysis`. The function modifies the chromatogram object passed to it by inserting the internal standard information. Currently, only one internal standard is supported:

In [9]:
from ChromProcess import chromatogram_operations as chrom_ops

for c in chroms:
    chrom_ops.internalRefIntegral(c, analysis.internal_ref_region)


Information from `analysis` is again used to pick peaks in defined regions of each chromatogram. The functions add peak information into chromatograms.

In [None]:
for c in chroms:
    for r in analysis.regions:
        chrom_ops.regionPeakPick(c, r, threshold = analysis.peak_pick_threshold)
        chrom_ops.integratePeaks(c)
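To illustrate what a threshold "as a fraction of the highest signal in a region" means, here is a small self-contained sketch on made-up data. It is not the algorithm used by `regionPeakPick` or `integratePeaks`, whose internals may differ. The helper names are hypothetical, the time axis is assumed to be sorted in ascending order, and peaks are taken to be simple local maxima.

```python
def pick_peaks(time, signal, region, threshold):
    """Return indices of local maxima inside region that exceed threshold * region max."""
    lower, upper = region
    # Indices of points inside the region (contiguous because time is ascending).
    idx = [i for i, t in enumerate(time) if lower <= t <= upper]
    if len(idx) < 3:
        return []
    cutoff = threshold * max(signal[i] for i in idx)
    peaks = []
    for i in idx[1:-1]:
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if signal[i] > cutoff and is_local_max:
            peaks.append(i)
    return peaks


def integrate(time, signal, start, end):
    """Trapezoidal area under the signal between two indices (inclusive)."""
    return sum(
        0.5 * (signal[i] + signal[i + 1]) * (time[i + 1] - time[i])
        for i in range(start, end)
    )


# Made-up chromatogram: a large peak at t = 2, a small bump at t = 5, a peak at t = 7.
time = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
signal = [0.0, 1.0, 10.0, 1.0, 0.0, 0.5, 0.0, 4.0, 0.0]

peaks = pick_peaks(time, signal, region=(0.0, 8.0), threshold=0.1)
print(peaks)  # the bump at t = 5 falls below 10% of the region maximum

area = integrate(time, signal, start=1, end=3)
print(area)
```

Raising the threshold discards more small peaks; with `threshold=0.5` only the peak at index 2 would remain in this example.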

Peak tables can then be written directly from each `Chromatogram`, inserting information from the conditions file if required. Note that the ordering of the chromatograms must be the same as the order of the series values in the conditions file.

In [None]:
# Output peak table
peak_table_directory = '../Templates/ExamplePeakTables'

os.makedirs(peak_table_directory, exist_ok = True)
for c,v in zip(chroms, conditions.series_values):
    c.write_peak_table(filename = f'{peak_table_directory}/{c.filename}',
                        value = v,
                        series_unit = conditions.series_unit)
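Because `zip` stops silently at the shorter of its inputs, a mismatch between the number of chromatograms and the number of series values would go unnoticed in the loop above. The sketch below shows one way to guard against this using plain file names and values. The helper name `pair_with_series` is hypothetical and the inputs are made up.

```python
def pair_with_series(filenames, series_values):
    """Pair sorted chromatogram names with series values, refusing mismatched lengths."""
    ordered = sorted(filenames)
    if len(ordered) != len(series_values):
        raise ValueError(
            f"{len(ordered)} chromatograms but {len(series_values)} series values"
        )
    return list(zip(ordered, series_values))


pairs = pair_with_series(
    ["chrom_002.csv", "chrom_001.csv", "chrom_003.csv"], [10, 20, 30]
)
for name, value in pairs:
    print(name, value)
```

On Python 3.10 or later, `zip(..., strict=True)` provides the same length check without a helper.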