# PPSIM Toolbox Basics

PPSIM provides a general interface for specifying the structure of the data of interest, defining a dynamical model to fit to the data, and building data-processing pipelines.

It requires the presence of the following `*.yml`-files:

- `specifications.yml`, the main file for setting up the data processing, including the location of the data and the regular expressions used for data wrangling

- `model.yml`, a file specifying the mathematical model, model states and data bindings

- `stream.yml`, a placeholder file for specifying the pipeline (not yet working)

Look at each file in detail to see what it contains.
PPSIM generates an object that gives easy access to the attributes and settings used in the construction of pipelines.
The specifications-file allows external resources, such as custom code, to be included in the processing.
At the moment, the pipeline-feature is deprecated; instead, `script.jl` describes the pipeline in pure Julia code.
To process the data, modify the `specifications.yml`-file to point to the right data-source, specify an appropriate path for the folder that will hold the processed data, and run `script.jl`.

### The Specification-File
`specifications.yml` is the main file for setting up the data processing.
Here, we specify:
- where our data sits (`source`, `sink`)
- what we expect headers of the MaxQuant-table to look like (`headerpattern`)
- how different sample types are distinguished (`sampletypes`)
- which conditions we look for (`conditions`)
- how crossmixes between different conditions are named (`crossmixes`)
- how the different experiments are identified (`experiment_ids`)
- how timepoints are specified in the headers (`timepoints -> hdr`)
- ...and what their numerical value is (`timepoints -> values`)

Moreover, we define a set of functions to extract the timecourse-data and the errors calculated by MaxQuant from the table (`timecourses`, `errors`), define the colours associated with the conditions in our plots (`colors`), and list a set of files that serve as customised extensions to the general codebase of the toolbox (`include`).
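As a rough illustration of how the `headerpattern` and `sampletypes` entries are meant to be used, the sketch below applies the regular expressions from the specifications-file to a few made-up column headers. The header strings and the helper `sampletype_of` are hypothetical and not part of PPSIM; only the patterns are taken from `specifications.yml`. The sketch is written for current Julia (1.x) rather than the older version used in the rest of this notebook.

```julia
# Patterns copied from specifications.yml
headerpattern = r"_MS\d\d\d\d_\d\d"
sampletypes = Dict(
    "Supernatant" => r"S_MS\d\d\d\d_\d\d",
    "Pellet"      => r"P_MS\d\d\d\d_\d\d",
)

# Hypothetical column headers; real MaxQuant headers may look different
headers = ["Control_T0_S_MS0006_01", "B55_T10_P_MS0009_02", "Gene names"]

# Hypothetical helper: return the name of the first sample type whose
# pattern occurs in the header, or `nothing` if none matches
function sampletype_of(header, sampletypes)
    for (name, pattern) in sampletypes
        occursin(pattern, header) && return name
    end
    return nothing
end

for h in headers
    is_experiment = occursin(headerpattern, h)
    stype = something(sampletype_of(h, sampletypes), "none")
    println(rpad(h, 24), " experiment column: ", is_experiment,
            ", sample type: ", stype)
end
```

Columns that do not carry an experiment identifier, such as annotation columns, match neither pattern and can be left alone.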

Let's take a look at our specifications:

```yaml
source:         /home/nbuser/thesis-notebooks/Chapter6/data/bigtable/table.txt
sink:           /home/nbuser/thesis-notebooks/Chapter6/data/
headerpattern:  _MS\d\d\d\d_\d\d
sampletypes:
  Supernatant: S_MS\d\d\d\d_\d\d
  Pellet:      P_MS\d\d\d\d_\d\d
conditions:
  - Control
  - B55
  - GWL
crossmixes:
  - CM_Con_H_B55_L_
  - CM_Con_L_B55_H_
  - CM_Con_H_GWL_L_
  - CM_Con_L_GWL_H_
  - CM_B55_H_GWL_L_
  - CM_B55_L_GWL_H_
experiment_ids:
   - P_MS0009_01
   - P_MS0009_02
   - P_MS0016_01
   - P_MS0019_01
   - P_MS0021_01
   - S_MS0006_01
   - S_MS0006_02
   - S_MS0006_03
   - S_MS0008_01
   - S_MS0008_02
   - S_MS0009_01
   - S_MS0016_01
   - S_MS0016_02
   - S_MS0016_03
   - S_MS0021_01
   - S_MS0021_02
   - S_MS0021_03
timepoints:
  hdr:
    - T0
    - T2_5
    - T5
    - T7_5
    - T10
    - T20
    - T30
    - T45
  values:
    - 0
    - 2.5
    - 5
    - 7.5
    - 10
    - 20
    - 30
    - 45

timecourses:
  Control: (findfirst(names(S), :Control_0)):(findfirst(names(S), :Control_45))
  B55: (findfirst(names(S), :B55_0)):(findfirst(names(S), :B55_45))
  GWL: (findfirst(names(S), :GWL_0)):(findfirst(names(S), :GWL_45))
errors:
  Control: (findfirst(names(S), :error_Control_0)):(findfirst(names(S), :error_Control_45))
  B55: (findfirst(names(S), :error_B55_0)):(findfirst(names(S), :error_B55_45))
  GWL: (findfirst(names(S), :error_GWL_0)):(findfirst(names(S), :error_GWL_45))
colors:
  Control: "#5fbdff"
  B55: "#ffc900"
  GWL: "#5fbd00"


include:
  - custom_code/custom.jl
  - custom_code/preprocess.jl
  - custom_code/extract.jl
  - custom_code/cleaning.jl
  - custom_code/modelling.jl
  - custom_code/plot.jl
  - custom_code/filter_substrates.jl
```
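The two lists under `timepoints` are paired by position: the first entry of `hdr` belongs to the first entry of `values`, and so on. A minimal sketch of that pairing, written for current Julia (1.x), is shown here. The helper `decode` is hypothetical and assumes that the underscore in a label such as `T2_5` stands for a decimal point; it is only used to cross-check the two lists and is not part of PPSIM.

```julia
# Values copied from the `timepoints` section of specifications.yml
hdr  = ["T0", "T2_5", "T5", "T7_5", "T10", "T20", "T30", "T45"]
vals = [0, 2.5, 5, 7.5, 10, 20, 30, 45]

# Lookup from header label to numerical time value
timepoint = Dict(zip(hdr, vals))

# Hypothetical helper: decode a label such as "T2_5" into 2.5,
# assuming the underscore stands for a decimal point
decode(label) = parse(Float64, replace(label[2:end], "_" => "."))

# Collect labels whose decoded value disagrees with the listed one
mismatches = [h for h in hdr if decode(h) != timepoint[h]]
println(isempty(mismatches) ? "hdr and values agree" : "check: $mismatches")
```

A check of this kind can catch a label and a value drifting out of step when either list is edited by hand.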

We shall take a detailed look at the model, and the relevant custom extensions in due course.
For now let us load the toolbox and preprocess the data.

```julia
# Load the analysis toolbox and initialise it
using PPSIM
specs, streams, model = PPSIM.initialise(pwd());
```




The step above loads all the relevant files into memory and creates an object called `specs` that contains all the information from our specifications-file. We will use this later to access our settings.

# Preprocessing

We'll now run the preprocess routine laid out in `custom_code/preprocess.jl`:

```julia
using PPSIM
using DataFrames

"""
**Preprocessing Routine**
Takes path to sourcefile, and path to sinkfile as arguments.
Queries column names of table and re-structures them so that numbers at the end
of column headers are  attached to the first part of the header preceding the
experiment id, as specified in specs.headerpattern
"""
function preprocess(sourcefile, sinkfile)
  data = readtable(sourcefile, separator = '\t')
  exp_indices = PPSIM.queryColumns(specs.headerpattern, data)
  # println(exp_indices)
  # Find the ones that have a trailing 1-3 preceded by an underscore
  indices = PPSIM.queryColumns(r"_[1-3]\b", exp_indices)
  for index in indices
    index = string(index)
    #get the number
    n = index[end]
    #remove the tail
    tail_removed = split(index, r"_[1-3]\b")[1]
    #Get part preceding specs.headerpattern
    firstpart = split(index, specs.headerpattern)[1]
    experiment = split(tail_removed, Regex(firstpart))[2]
    newname = firstpart * "_$n\_" * experiment
    rename!(data, symbol(index), symbol(newname))
  end
  writetable(sinkfile, data)
end
```

This code loops through all the columns of the MaxQuant-table and renames any column whose header ends in a trailing number, which would otherwise break the naming convention.
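To make the renaming step concrete, here is a simplified, standalone sketch of the idea applied to a single header string. It is not the toolbox routine: the function `move_trailing_number`, the example headers, and the exact format of the new name are assumptions made for illustration, and it anchors the trailing number at the end of the string with `$` rather than using `\b`. It is written for current Julia (1.x).

```julia
# Hypothetical helper: move a trailing "_1", "_2" or "_3" in front of the
# experiment id matched by `headerpattern`. Headers without a trailing
# number, or without an experiment id, are returned unchanged.
function move_trailing_number(name::AbstractString, headerpattern::Regex)
    tail = match(r"_([1-3])$", name)
    tail === nothing && return String(name)
    n = tail.captures[1]

    # Header with the trailing "_n" removed
    stem = name[1:prevind(name, tail.offset)]

    # Locate the experiment id inside the remaining header
    experiment = match(headerpattern, stem)
    experiment === nothing && return String(name)
    firstpart = stem[1:prevind(stem, experiment.offset)]

    return firstpart * "_" * n * experiment.match
end

# Pattern copied from specifications.yml; the header is made up
headerpattern = r"_MS\d\d\d\d_\d\d"
println(move_trailing_number("Control_T0_MS0006_01_2", headerpattern))
# prints Control_T0_2_MS0006_01
```

Working on plain strings keeps the sketch independent of the table library; in the routine above, the new name is then passed to `rename!`.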

The function `preprocess` takes two arguments, a sourcefile (our main data table), and a sinkfile (the location of our output-file).

```julia
# Define location for the processed output file
processed = joinpath(specs.sink, "table_processed.txt")
# Run the preprocessing routine (takes a while)
preprocess(specs.source, processed)
```



Running this function creates a new file at `/home/nbuser/thesis-notebooks/Chapter6/data/table_processed.txt`, which now contains clean headers.

Next we can proceed to extracting the data from the processed table.
Go to [Notebook02:Extracting](http://localhost:8888/notebooks/Chapter6/02_Extracting.ipynb)