In [1]:
using PPSIM
specs, streams, model = PPSIM.initialise(pwd());

# Cleaning and Summarisation

Following the extraction of the data detailed in the [previous notebook](http://localhost:8888/notebooks/Chapter6/02_Extracting.ipynb), we will now clean the data table, summarise it, and add annotations.

Let's take a look at the custom code in [custom_code/cleaning.jl](http://localhost:8888/edit/Chapter6/custom_code/cleaning.jl).

It contains a single main function:

```julia
function clean(sourcefile, sinkfile, seqfile, ontology)
#...
end
```
and a helper function:

```julia
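function trimSequence(seq, pos::Int64, leftflank; rightflank = Union{})
    # Default to a symmetric window when no right flank is given
    if rightflank == Union{}
        rightflank = leftflank
    end
    # Clamp the window to the bounds of the sequence
    l = maximum([1, pos - leftflank])
    u = minimum([length(seq), pos + rightflank])
    # Number of padding characters needed where the window overhangs either end
    n_headingx = -minimum([0, pos - leftflank - 1])
    n_tailingx = -minimum([0, length(seq) - (pos + rightflank)])
    trimmedseq = ("_" ^ n_headingx) * seq[l:u] * ("_" ^ n_tailingx)
    # Sanity check on the window length; note that the bare string below is
    # evaluated and discarded, so a mismatch is not actually reported
    if length(trimmedseq) != (rightflank + leftflank + 1)
        "Error: $seq"
    end
    return(trimmedseq)
end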
```

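The helper function above takes an amino-acid sequence, a position, and a flank size (optionally a separate right flank) and returns the window of `leftflank + rightflank + 1` residues around that position. Where the window extends beyond either end of the sequence, it is padded with `_`.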
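
To illustrate the padding behaviour, here is a minimal sketch in current Julia syntax. `trim_window` is a hypothetical, condensed re-statement of the helper written for this example, not the code in `cleaning.jl`; the peptide string is made up, and the sketch assumes an ASCII sequence.

```julia
# Hypothetical condensed re-statement of the helper (assumes an ASCII sequence string).
function trim_window(seq::AbstractString, pos::Integer, leftflank::Integer;
                     rightflank::Integer = leftflank)
    l = max(1, pos - leftflank)                      # first residue inside the sequence
    u = min(length(seq), pos + rightflank)           # last residue inside the sequence
    n_head = max(0, leftflank - pos + 1)             # padding before residue 1
    n_tail = max(0, pos + rightflank - length(seq))  # padding past the last residue
    return "_"^n_head * seq[l:u] * "_"^n_tail
end

peptide = "ACDEFGHIK"                                # made-up example sequence
centre  = trim_window(peptide, 5, 2)                 # window fully inside the sequence
start   = trim_window(peptide, 1, 2)                 # window overhangs the start
asym    = trim_window(peptide, 9, 2; rightflank = 1) # asymmetric window at the end
```

Here `centre` is `"DEFGH"`, `start` is `"__ACD"`, and `asym` is `"HIK_"`: every result has exactly `leftflank + rightflank + 1` characters, so windows taken from different sites line up.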

The main function `clean()` takes a source file and a sink file, plus two further arguments that specify additional input files: a list of protein sequences (`seqfile`) and an ontology table (`ontology`) downloaded from the [Panther database](http://www.pantherdb.org/).

Moreover, it accepts two optional arguments: the filename of the cleaning summary and a cutoff defining the minimum acceptable number of data points per condition.

Fundamentally, the cleaning function filters the dataset and logs the size of the dataset after each filtering step.
Then it uses protein-identifiers to add sequence and ontology data to the filtered dataset.
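
The filter-then-log pattern can be sketched as follows. This is an illustration only: `filter_and_log!`, the row layout, the step labels, and the use of `missing` for absent values are all assumptions of the sketch, not taken from `cleaning.jl`.

```julia
# Hypothetical helper: keep the rows satisfying `keep` and record how many remain.
function filter_and_log!(steplog, label, rows, keep)
    kept = filter(keep, rows)
    push!(steplog, label => length(kept))
    return kept
end

# Made-up rows; `missing` marks an absent measurement (an assumption of this sketch).
rows = [
    (peptide = "p1", Control_0 = 1.0,     Gene_names = "GENE1"),
    (peptide = "p2", Control_0 = missing, Gene_names = "GENE2"),
    (peptide = "p3", Control_0 = 0.8,     Gene_names = missing),
]

steplog = Pair{String,Int}[]   # a vector of pairs preserves the order of the steps
push!(steplog, "total number of entries" => length(rows))
rows = filter_and_log!(steplog, "entries left after removing entries missing T0",
                       rows, r -> !ismissing(r.Control_0))
rows = filter_and_log!(steplog, "entries left after removing entries without gene name",
                       rows, r -> !ismissing(r.Gene_names))

for (label, n) in steplog
    println(label, "\t", n)
end
```

Recording the count immediately after each filter keeps the summary in step with the order in which the filters were applied.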

The filtering criteria are:

**Exclude entries that lack the normalised H/L ratio at T0 in any of the three conditions**
```julia
  D = D[D[:Control_0] .>= 0, :]
  D = D[D[:B55_0] .>= 0, :]
  D = D[D[:GWL_0] .>= 0, :]
```

**Exclude entries that lack a gene-name annotation**
```julia
  D = D[!isna(D[:Gene_names]), :]
```

**Exclude entries that do not meet the criterion for the minimum number of required data points**
```julia
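  # Collect the indices of rows with enough data points in every condition
  sufficient_data = Int64[]
  for i in 1:size(D)[1]
    n_data = Bool[]
    for c in specs.conditions
      # Column range of this condition, from its `_0` column to its `_45` column
      l = findfirst(names(D), symbol(c * "_0"))
      u = findfirst(names(D), symbol(c * "_45"))
      # Does this condition have at least `cutoff_n_data` values >= 0?
      n_data = [n_data; sum(convert(Array, D[i, l:u]) .>= 0) >= cutoff_n_data]
    end
    if all(n_data)
      sufficient_data = [sufficient_data; i]
    end
  end
  D = D[sufficient_data, :]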
```
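
The same per-condition check can be written more compactly. The following is a sketch under stated assumptions rather than the author's implementation: the entry layout, the condition names used as fields, the representation of absent values as `missing`, and the cutoff value are all made up for the example.

```julia
# Count usable values (present and >= 0) in one condition's time course.
n_usable(values) = count(v -> !ismissing(v) && v >= 0, values)

# Hypothetical layout: one named tuple per entry, one vector of ratios per condition.
entries = [
    (id = "e1", Control = [1.0, 1.1, 0.9, 1.2],         B55 = [1.0, 0.7, 0.5, 0.4]),
    (id = "e2", Control = [1.0, missing, 0.9, missing], B55 = [1.0, 0.8, 0.6, 0.5]),
]
conditions = (:Control, :B55)
cutoff_n_data = 3              # arbitrary value chosen for this example

# An entry is kept only if every condition reaches the cutoff.
has_sufficient_data(e) =
    all(n_usable(getproperty(e, c)) >= cutoff_n_data for c in conditions)

kept = filter(has_sufficient_data, entries)
```

Entry `e2` is dropped because its `Control` time course has only two usable values, even though its `B55` time course is complete.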

**Exclude entries that lack crossmixing data**
```julia
  D = D[D[:CM_Con_L_B55_H_] .>= 0, :]
  D = D[D[:CM_Con_H_B55_L_] .>= 0, :]
  D = D[D[:CM_Con_L_GWL_H_] .>= 0, :]
  D = D[D[:CM_Con_H_GWL_L_] .>= 0, :]
```

**Exclude entries with normalised ratios > 10**
```julia
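  data_ok = Bool[]
  for i in 1:size(D)[1]
    # All ratio columns of this row, from :Control_0 through :GWL_45
    d = vec(convert(Array, D[i, findfirst(names(D), :Control_0):findfirst(names(D), :GWL_45)]))
    # Keep the row only if no value exceeds 10
    data_ok = [data_ok; !any(d .> 10)]
  end
  D = D[data_ok,:]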
```
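
As an aside, this row-wise threshold test can be expressed as a single mask over a matrix. This sketch uses a made-up matrix of ratios in place of the data table and current Julia syntax; the variable names are illustrative.

```julia
# Made-up matrix of normalised ratios: one row per entry, one column per time point.
ratios = [1.0  0.8  0.6;
          1.2 14.5  0.9;
          0.9  1.1 10.0]

threshold = 10
# true for rows in which no value exceeds the threshold
data_ok = vec(.!any(ratios .> threshold, dims = 2))
filtered = ratios[data_ok, :]
```

The second row is removed because it contains 14.5. The third row is kept, since the comparison is strict and a value of exactly 10 does not exceed the threshold.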

The function then adds sequence data and creates a list of proteins to be used as background set in the sequence analysis.
Entries for keratins are excluded, and ontological information is added.

In [2]:
#Define file locations
processed = joinpath(specs.sink, "table_processed.txt");
extracted = joinpath(specs.sink, "extracted.csv");
sequences = joinpath(specs.sink, "sequences.json");
ontology = joinpath(specs.sink, "pantherGeneList.tsv");
cleaned = joinpath(specs.sink, "cleaned.csv");

"/home/nbuser/thesis-notebooks/Chapter6/data/cleaned.csv"

In [3]:
#Perform cleaning
log = clean(extracted, cleaned, sequences, ontology);

## Summary of cleaning/summarisation

In [6]:
for (k,v) in log
    println(k, "\t", v)
end

EXTRACTED: total number of entries	46802
EXTRACTED: total number of unique peptides	23401
EXTRACTED: entries missing T0 in Control	30916
EXTRACTED: entries missing T0 in B55	33347
EXTRACTED: entries missing T0 in GWL	33940
CLEANING: total number of entries left after removing entries missing T0	9658
CLEANING: total number of unique peptides left after removing entries missing T0	8236
CLEANING: total number of entries left after removing entries with less than 4	9069
CLEANING: total number of unique peptides left after removing entries with less than 4	7743
CLEANING: total number of entries left after removing entries missing CMs involving Control	7446
CLEANING: total number of unique peptides left after removing entries missing CMs involving Control	6450
CLEANING: total number of entries left after removing entries with timepoints > 10	7424
CLEANING: total number of unique peptides left after removing entries with timepoints > 10	6432
CLEANING: total number of entries left after removi