Loading the necessary packages

In [None]:
from __future__ import print_function

from depmapomics.config import *

from depmapomics import tracker, loading, fusions, expressions
from depmapomics import terra as myterra
from genepy import terra
from genepy.utils import helper as h
import subprocess

import dalmatian as dm

from bokeh.plotting import output_notebook

%load_ext autoreload
%autoreload 2

output_notebook()

# Generate sample set from new samples

We retrieve all the samples we can find from the GP workspaces.

__CCLE specific__

In [None]:
if isCCLE:
    print("loading new RNAseq data")
    rnasamples = loading.loadRNA(SAMPLESETNAME)

In [None]:
if isCCLE:
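    ref = tracker.getTracker()
    # RNA samples already in the tracker that are not blacklisted
    tracked_rna = set(ref[(ref.datatype=='rna') & (ref.blacklist==0)].arxspan_id)
    # lines to release with neither a new sample nor a usable tracked one
    print('samples without rnaseq:')
    print(set(LINES_TO_RELEASE) - (set(rnasamples.arxspan_id) | tracked_rna))
    %store rnasamples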
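
The check above is plain set arithmetic. As a minimal sketch, assuming toy records in place of `LINES_TO_RELEASE`, `rnasamples` and the tracker (every ID and the helper name `lines_missing_rnaseq` are hypothetical), it can be reproduced without pandas:

```python
# Toy stand-ins for LINES_TO_RELEASE, rnasamples and the tracker (all values hypothetical).
lines_to_release = ["ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004"]
new_samples = [{"arxspan_id": "ACH-000001"}]
tracker_rows = [
    {"arxspan_id": "ACH-000002", "datatype": "rna", "blacklist": 0},
    {"arxspan_id": "ACH-000003", "datatype": "rna", "blacklist": 1},  # blacklisted
    {"arxspan_id": "ACH-000004", "datatype": "wes", "blacklist": 0},  # not RNA
]


def lines_missing_rnaseq(to_release, new_rows, tracked_rows):
    """Return lines with neither a new RNA sample nor a usable tracked one."""
    new_ids = {row["arxspan_id"] for row in new_rows}
    tracked_ids = {
        row["arxspan_id"]
        for row in tracked_rows
        if row["datatype"] == "rna" and row["blacklist"] == 0
    }
    return set(to_release) - (new_ids | tracked_ids)


missing = lines_missing_rnaseq(lines_to_release, new_samples, tracker_rows)
print(sorted(missing))
```

A blacklisted RNA sample or a sample of another datatype does not count as coverage, so both of those lines are reported as missing.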

In [None]:
%store -r rnasamples
rnasamples

In [None]:
if isCCLE:
    %store -r rnasamples
    print("uploading samples to the tracker and Terra: "+SAMPLESETNAME)
    loading.update(rnasamples, samplesetname=SAMPLESETNAME, stype='rna', bucket=RNA_GCS_PATH, refworkspace=RNAWORKSPACE)
    print('updating the workspaces with latest news from the sample tracker')
    ref = tracker.getTracker()
    myterra.copyToWorkspace(RNAWORKSPACE, ref[ref.datatype=="rna"], deleteUnmatched=True, addMissing=True,)

# Run the pipeline

We use Dalmatian to send requests to Terra, where a set of 6 tasks generates the expression and fusion datasets.

We use the GTEx pipeline ([https://github.com/broadinstitute/gtex-pipeline/blob/v9/TOPMed_RNAseq_pipeline.md](https://github.com/broadinstitute/gtex-pipeline/blob/v9/TOPMed_RNAseq_pipeline.md)).

To generate the expression dataset, run the following tasks on all samples that you need, in this order:



*   samtofastq_v1-0_BETA_cfg (broadinstitute_gtex/samtofastq_v1-0_BETA Snapshot ID: 5)
*   star_v1-0_BETA_cfg (broadinstitute_gtex/star_v1-0_BETA Snapshot ID: 7)
*   rsem_v1-0_BETA_cfg (broadinstitute_gtex/rsem_v1-0_BETA Snapshot ID: 4)
*   rsem_aggregate_results_v1-0_BETA_cfg (broadinstitute_gtex/rsem_aggregate_results_v1-0_BETA Snapshot ID: 3)
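
Because the order matters, each task should only be submitted once the previous one has finished. The sketch below illustrates that sequencing under stated assumptions: only the configuration names come from this notebook, while `run_in_order`, `submit` and `wait` are hypothetical names, and the callables used here are stand-ins that merely log what would happen.

```python
# Task order as listed above; only the configuration names come from this notebook.
EXPRESSION_TASKS = [
    "samtofastq_v1-0_BETA_cfg",
    "star_v1-0_BETA_cfg",
    "rsem_v1-0_BETA_cfg",
    "rsem_aggregate_results_v1-0_BETA_cfg",
]


def run_in_order(tasks, submit, wait):
    """Submit each task only after the previous one has finished.

    `submit` and `wait` are hypothetical callables supplied by the caller.
    """
    finished = []
    for task in tasks:
        submission_id = submit(task)
        wait(submission_id)
        finished.append(task)
    return finished


# Dry run with stand-in callables that only record the calls.
log = []
done = run_in_order(
    EXPRESSION_TASKS,
    submit=lambda task: log.append("submit " + task) or task,
    wait=lambda submission_id: log.append("wait " + submission_id),
)
print("\n".join(log))
```

In a real run, `submit` would wrap a Dalmatian submission and `wait` would block until Terra reports completion, as the cells in the "On Terra" section do.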

The outputs to be downloaded will be saved under the sample set that you ran. The outputs we use for the release are:



*   rsem_genes_expected_count
*   rsem_genes_tpm
*   rsem_transcripts_tpm

**Make sure that you delete the intermediate files. These files are quite large, so they cost a lot to store. To delete them, you can either write a task that deletes them or use `gsutil rm`.**
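
As a rough sketch of the `gsutil rm` route, the helper below turns a list of paths into batched commands that can be reviewed before anything is deleted. The helper name `build_rm_commands`, the batch size and the example bucket paths are all assumptions made for illustration.

```python
import shlex


def build_rm_commands(paths, batch_size=100):
    """Group gs:// paths into `gsutil -m rm` commands, skipping anything else."""
    gs_paths = [p for p in paths if p.startswith("gs://")]
    commands = []
    for start in range(0, len(gs_paths), batch_size):
        batch = gs_paths[start:start + batch_size]
        commands.append("gsutil -m rm " + " ".join(shlex.quote(p) for p in batch))
    return commands


# Hypothetical intermediate files; replace with the real paths from your workspace.
example_paths = [
    "gs://example-bucket/sample1/sample1_1.fastq.gz",
    "gs://example-bucket/sample1/sample1_2.fastq.gz",
    "/local/not/a/bucket/path.txt",  # ignored: not a gs:// path
]
commands = build_rm_commands(example_paths, batch_size=2)
for command in commands:
    print(command)  # review before running, e.g. with subprocess.run(command, shell=True)
```

Printing the commands first is a deliberate choice: deletion from a bucket cannot be undone, so it is worth checking the list before executing it.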


##### Fusions {#fusions}

We use STAR-Fusion ([https://github.com/STAR-Fusion/STAR-Fusion/wiki](https://github.com/STAR-Fusion/STAR-Fusion/wiki)). The fusions are generated by running the following tasks:



*   hg38_STAR_fusion (gkugener/STAR_fusion Snapshot ID: 14)
*   Aggregate_Fusion_Calls (gkugener/Aggregate_files_set Snapshot ID: 2)

The outputs to be downloaded will be saved under the sample set you ran. The outputs we use for the release are: 



*   fusions_star

This task uses the same samtofastq_v1-0_BETA_cfg task as the expression pipeline, so in the current implementation samtofastq is run twice. It might be worth combining the expression and fusion calling into a single workflow. This task also contains a flag that lets you specify whether to delete the intermediates (fastqs).

There are several other tasks in this workspace. In brief:



*   Tasks prefixed with **EXPENSIVE** or **CHEAP** are identical to their non-prefixed version, except that they specify different memory, disk space, etc. parameters. These versions can be used when samples fail the normal version of the task due to memory errors.
*   The following tasks are part of the GTEx pipeline but we do not use them (we use RSEM exclusively): markduplicates_v1-0_BETA_cfg (broadinstitute_gtex/markduplicates_v1-0_BETA Snapshot ID: 2), rnaseqc2_v1-0_BETA_cfg (broadinstitute_gtex/rnaseqc2_v1-0_BETA Snapshot ID: 2)
*   **ExonUsage_hg38_fixed** (gkugener/ExonUsage_fixed Snapshot ID: 1): this task calculates exon usage ratios. The non-fixed version has a bug: its script cannot handle chromosome values prefixed with ‘chr’. The ‘fixed’ version resolves this issue.
*   **AggregateExonUsageRObj_hg38** (ccle_mg/AggregateExonUsageRObj Snapshot ID: 2): combines the exon usage ratios into matrices that are saved in an R object.

## Cleaning workspaces

In [None]:
if doCleanup:
    print("cleaning workspaces")
    torm = await terra.deleteHeavyFiles(RNAWORKSPACE)
    h.parrun(['gsutil rm '+i for i in torm], cores=8)
    terra.removeFromFailedWorkflows(RNAWORKSPACE, dryrun=False, everythingFor=[])

## On Terra

For non-internal users, your Terra workspace needs to be set up correctly.

Please follow the instructions in the README and make sure that you have created your sample set.

In [None]:
# TODO: update with latest workspace parameters from our repo

In [None]:
print("running Terra pipeline")
refwm = dm.WorkspaceManager(RNAWORKSPACE)
submission_id = refwm.create_submission("RNA_pipeline", SAMPLESETNAME,'sample_set',expression='this.samples')
await terra.waitForSubmission(RNAWORKSPACE, submission_id)

In [None]:
submission_id = refwm.create_submission("RNA_aggregate", 'all')
await terra.waitForSubmission(RNAWORKSPACE, submission_id)

### Save the workflow configurations used

In [None]:
terra.saveWorkspace(RNAWORKSPACE,'data/'+SAMPLESETNAME+'/RNAconfig/')

## On Local

### Expression post processing

In [None]:
await expressions._CCLEPostProcessing(samplesetname=SAMPLESETNAME, recompute_ssgsea=False, compute_enrichment=False)

### Fusion post processing

In [None]:
await fusions._CCLEPostProcessing(samplesetname=SAMPLESETNAME)