
# Template notebook for running site-centric PTM signature enrichment analysis (PTM-SEA)

## Configure environment and prepare files

### Configure cloud environment
<div class="alert alert-block alert-info">

To start this notebook, click on "Cloud Environment" in the top-right corner. For **Application configuration** select "Custom Environment" and for **Container image** type in **gcr.io/broadcptac/ptm-sea:0.5.2**. This is a Terra-based Docker environment that has the required libraries and scripts to run PTM-SEA. In the same tab, select a number of CPUs and an amount of memory suited to the size of your dataset.
 
To run a code block, click on it, and either choose **Cell -> Run Cells** or hit **Shift-ENTER**. Running the entire notebook is not recommended since many code blocks require user input. Carefully read each section, and then run the associated code block.
</div>

### Configure working directories

Load the helper functions for running PTM-SEA. The `init_project_dir()` function creates a directory for input and output files. By default, the directory is named after the time of the run in the YYYYMMDD-HHMMSS format. To use a different name, specify the `name` argument: `init_project_dir(name = "my_project")`.

In [None]:
source("/ptm-sea/src/terra-functions.R")
init_project_dir()
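
The naming scheme described above can be sketched in base R. This is only an illustration of a YYYYMMDD-HHMMSS directory name, not the actual implementation of `init_project_dir()`. The names `default_name` and `project_dir_sketch` and the `input`/`output` subdirectory layout are assumptions made for this example.

```r
# Illustrative sketch only; init_project_dir() may work differently.
# Build a timestamp such as "20240131-142530" (YYYYMMDD-HHMMSS).
default_name <- format(Sys.time(), "%Y%m%d-%H%M%S")

# Hypothetical layout: one directory with input and output subdirectories,
# created under a temporary location so nothing in the workspace is touched.
project_dir_sketch <- file.path(tempdir(), default_name)
dir.create(file.path(project_dir_sketch, "input"), recursive = TRUE)
dir.create(file.path(project_dir_sketch, "output"), recursive = TRUE)

print(default_name)
```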

### Upload files

1. Upload the input file to the bucket and locate it

Open your workspace in a new tab or window. Upload files into your workspace by navigating to DATA tab -> Files tab, and then using the + button at the bottom right of the page. A single file is required: a single-site PTM proteome in GCT format (v1.3 or later). Next, list all files in the bucket and find the one you uploaded:

In [None]:
list_files_in_bucket(only_gct = TRUE)

2. Select the name of the file to copy over to the environment

In [None]:
### EDIT THIS CELL (1/2)
input_file <- ...

In [None]:
copy_from_bucket_to_project_dir(input_file) 

## Single-site centric PTM-SEA

### Set parameters

1. Basic parameters for pre-processing PTM GCT:
- `id_type_out` - type of site annotation that the input GCT file is converted to
    - by default, `seqwin` is used: the flanking sequence around the site (7 amino acids before and after the site)
    - if the column with flanking sequences is missing, select the format that matches the row IDs
- `seqwin_col` - name of the column containing the flanking sequences
    - only relevant if `id_type_out` is `seqwin`
- `organism` - organism from which the dataset is derived
- `mode` - determines how multiple sites per gene are combined
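
To make the `seqwin` annotation concrete, the sketch below extracts a window of 7 residues on each side of a modified site from a protein sequence. It is a hedged illustration, not code from PTM-SEA: the helper `flank_window()`, the toy sequence, and the `_` padding character for sites near a terminus are all assumptions.

```r
# Hypothetical helper (not part of PTM-SEA): return the residue at `site_pos`
# together with `flank` residues on each side. Positions that fall outside
# the sequence are filled with `pad` (an assumed convention).
flank_window <- function(protein_seq, site_pos, flank = 7, pad = "_") {
  padded <- paste0(strrep(pad, flank), protein_seq, strrep(pad, flank))
  # After padding, the site sits at index site_pos + flank, so the window
  # runs from site_pos to site_pos + 2 * flank in the padded string.
  substr(padded, site_pos, site_pos + 2 * flank)
}

toy_seq <- "MKTAYIAKQRQISFVKSHFSRQ"   # made-up sequence; position 13 is a serine

window_mid   <- flank_window(toy_seq, 13)  # site in the middle of the sequence
window_start <- flank_window(toy_seq, 2)   # site close to the N-terminus

print(window_mid)
print(window_start)
```

Each window is 15 characters long, with the modified residue at position 8.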

In [None]:
### EDIT THIS CELL (2/2)
id_type_out       <- "seqwin"        # options: "uniprot", "refseq", "seqwin", "psp"
organism          <- "human"         # options: "human", "mouse", "rat"
mode              <- "median"        # options:
                                         # "sd"      - most variable (standard deviation) across sample columns
                                         # "SGT"     - subgroup top: first subgroup in protein group (Spectrum Mill)
                                         # "abs.max" - for log-transformed, signed p-values
                                         # "median"  - default option
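
The effect of `mode` can be illustrated on a toy matrix in which two rows map to the same ID. The sketch below is a simplified assumption about what the "median" and "sd" options do, not the code that `preprocess_gct()` runs. The functions `combine_median()` and `combine_sd()` and all data are hypothetical.

```r
# Toy data: rows are site-level measurements, columns are samples.
toy <- matrix(c(1, 2, 3,
                3, 2, 7,
                5, 5, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("r1", "r2", "r3"), c("s1", "s2", "s3")))
ids <- c("SITE_A", "SITE_A", "SITE_B")   # hypothetical IDs; SITE_A is duplicated

# "median"-like behaviour (assumed): per-sample median over rows sharing an ID.
combine_median <- function(mat, ids) {
  groups <- split(seq_len(nrow(mat)), ids)
  t(sapply(groups, function(idx) apply(mat[idx, , drop = FALSE], 2, median)))
}

# "sd"-like behaviour (assumed): keep the row with the largest standard
# deviation across samples for each ID.
combine_sd <- function(mat, ids) {
  groups <- split(seq_len(nrow(mat)), ids)
  keep <- sapply(groups, function(idx) {
    idx[which.max(apply(mat[idx, , drop = FALSE], 1, sd))]
  })
  out <- mat[keep, , drop = FALSE]
  rownames(out) <- names(groups)
  out
}

by_median <- combine_median(toy, ids)
by_sd     <- combine_sd(toy, ids)
print(by_median)
print(by_sd)
```

In this toy case, "median" averages the two SITE_A rows per sample, whereas "sd" keeps the more variable row (`r2`) unchanged.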

2. Advanced parameters for pre-processing PTM GCT

In [None]:
seqwin_col        <- "VMsiteFlanks"  # only relevant if the annotation is "seqwin", default: "VMsiteFlanks"
gene_symbol_col   <- "geneSymbol"    # default: "geneSymbol"
humanize_gene     <- FALSE      # if TRUE, gene symbols will be capitalized (e.g. for mouse or rat).

id_type           <- "sm"       # Notation of site-ids: 'sm' - Spectrum Mill; 'wg' - WebGestalt; 'ph' - Philosopher
acc_type_in       <- "refseq"   # Type of accession number in 'rid' object in GCT file (uniprot, refseq, symbol).
residue           <- "S|T|Y"    # Modified residues, e.g. "S|T|Y" or "K".
ptm               <- "p"        # Type of modification, e.g. "p", "ac", "ub", "gl"
localized         <- TRUE       # If TRUE, only fully localized sites will be considered. CAUTION: setting this flag to FALSE is NOT RECOMMENDED.
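
A value such as `"S|T|Y"` reads naturally as a regular-expression alternation. The sketch below shows how such a pattern could select sites by modified residue. This is an assumption for illustration only: the site-ID format and the pattern built here are hypothetical and may differ from how PTM-SEA parses site IDs internally.

```r
residue <- "S|T|Y"

# Hypothetical, simplified site labels: residue letter followed by position.
toy_site_ids <- c("S15", "T221", "K48", "Y701")

# Wrap the alternation in a group so the anchors apply to the whole pattern.
pattern  <- paste0("^(", residue, ")[0-9]+$")
is_match <- grepl(pattern, toy_site_ids)

matched_sites <- toy_site_ids[is_match]
print(matched_sites)
```

With `residue <- "K"`, the same pattern would keep only `"K48"`.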

3. Advanced parameters for running PTM-SEA

In [None]:
list_files_in_bucket(only_gmt = TRUE)   # to use custom signatures, upload them in the DATA tab; they will show up here

In [None]:
ptm_sig_db_path   <- NULL               # copy the path to a .gmt file from above (`NULL` means the default signature database, v2.0.0, is used)

output_prefix     <- "ptm-sea-results"  # Label for output files from PTM-SEA
sample_norm_type  <- "rank"             # options: "rank", "log", "log.rank", "none"
weight            <- 0.75               # When weight=0, all genes have the same weight; if weight>0 actual values matter and can change the resulting score (default: 0.75).
correl_type       <- "z.score"          # options: "rank", "z.score", "symm.rank"
statistic         <- "area.under.RES"   # options: "area.under.RES", "Kolmogorov-Smirnov"
output_score_type <- "NES"              # Score type: "ES" - enrichment score,  "NES" - normalized ES
nperm             <- 1000               # Number of permutations
min_overlap       <- 5                  # Minimal overlap between signature and data set.
extended_output   <- TRUE               # If TRUE additional stats on signature coverage etc. will be included as row annotations in the GCT results files.
export_signal_gct <- TRUE               # For each signature export expression GCT files.
global_fdr        <- FALSE              # If TRUE global FDR across all data columns is calculated.
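
Two of the parameters above can be illustrated with toy numbers. The sketch assumes the common GSEA-style convention in which each score contributes in proportion to its absolute value raised to the power `weight`, and it checks `min_overlap` with a simple set intersection. It is a simplified illustration, not the scoring code that `run_ptm_sea()` executes, and all values and site names are made up.

```r
weight     <- 0.75
toy_scores <- c(-2, -0.5, 1, 4)          # made-up site-level scores

# Assumed weighting scheme: |score|^weight.
w_zero   <- abs(toy_scores)^0            # weight = 0: every site counts equally
w_scaled <- abs(toy_scores)^weight       # weight > 0: larger scores count more

# min_overlap: a signature is only scored if enough of its sites are measured.
min_overlap    <- 5
toy_signature  <- c("site1", "site2", "site3", "site4", "site5", "site6")
toy_data_sites <- c("site2", "site3", "site4", "site9")

n_overlap <- length(intersect(toy_signature, toy_data_sites))
passes    <- n_overlap >= min_overlap

print(w_zero)
print(w_scaled)
print(n_overlap)
print(passes)
```

Here only three signature sites are present in the data, so this toy signature would be skipped at `min_overlap <- 5`.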

### Run PTM-SEA

In [None]:
input_ds <- file.path(project_input, basename(input_file))
ptm_sig_db <- get_ptm_sig_db(id_type_out, organism)
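
The `input_ds` path above is built with base R's `basename()` and `file.path()`. The sketch below shows what that combination returns for example values. The bucket path and the directory name are hypothetical stand-ins for `input_file` and `project_input`.

```r
# Hypothetical stand-ins for input_file and project_input.
input_file_example    <- "gs://my-bucket/uploads/phosphoproteome.gct"
project_input_example <- file.path("my_project", "input")

# basename() drops everything up to the last "/", leaving the file name.
file_name_example <- basename(input_file_example)

# file.path() joins the local input directory and the file name with "/".
input_ds_example <- file.path(project_input_example, file_name_example)

print(file_name_example)
print(input_ds_example)
```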

1. Pre-process the GCT file into the format required by PTM-SEA.

In [None]:
preprocess_gct()

2. Run PTM-SEA. This automatically saves the results to the bucket and prints the name of the zip file containing the outputs. To change the name of the output zip, specify the `name` argument.

In [None]:
run_ptm_sea()

3. If the run succeeds, your outputs are saved in the bucket under a default name of the form `<workspace>_<runtime>.zip`. Open your workspace in a new tab or window, navigate to DATA tab -> Files tab, and select the zip file to download the PTM-SEA outputs.