# Copy Number Pipeline

In [None]:
from __future__ import print_function
import os.path
import numpy as np
import pandas as pd
from IPython.display import Image,display
import dalmatian as dm
from IPython.core.display import HTML 
import sys
#loading package in another path
sys.path.insert(0, '..')
from src.CCLE_postp_function import *
from JKBio import TerraFunction as terra
# we will run some R functions directly in the notebook (through rpy2)
import rpy2
%load_ext autoreload
%autoreload 2
%load_ext rpy2.ipython
from taigapy import TaigaClient
tc = TaigaClient()
# we will use Google Sheets: you need the client secret and storage files for a Google account with access to the required sheets (see sheeturl below)
from gsheets import Sheets
sheets = Sheets.from_files('~/.client_secret.json', '~/.storage.json')
replace = {'T': 'Tumor', 'N': 'Normal', 'm': 'Unknown', 'L': 'Unknown'}

## boot up

- you first need to go to [taiga](https://cds.team/taiga/dataset) and create some new datasets for the virtual release
- the easiest way to create a new dataset is to upload an empty file (since at least one file is required). This empty file can be deleted when you update the dataset with a new version

We instantiate all the parameters needed for this pipeline to run.

In [None]:
# edit this at each new release
samplesetname = "20Q2"
release = samplesetname
prevname = '20Q1'
prevprevname ='19Q4'

# I think the prevversion and prevprevversion come from here (but need to check with Jeremie): segmented-cn-wes-prioritzed-7fe1
prevversion='40'
prevprevversion='38'
# taiga folder to upload to
virtual_public='public-20q2-075d'
virtual_dmc='dmc-20q2-2db6'
virtual_internal='internal-20q2-7f46'

# These are the workspaces which should contain the latest bams  (you can add or remove them)
workspace1="broad-genomics-delivery/Getz_IBM_CellLines_Exomes"
workspace2="broad-firecloud-ccle/CCLE_DepMap_WES"
workspace3="broad-genomics-delivery/CCLE_DepMap_WES"
workspace4="terra-broad-cancer-prod/CCLE_DepMap_WES"  # This is the newest workspace
workspace5="terra-broad-cancer-prod/"

refworkspace="broad-firecloud-ccle/DepMap_WES_CN_hg38"
# the source each of these workspaces corresponds to
source1="ibm"
source2="ccle"
source3="ccle"
source4="ccle"
sheeturl = "https://docs.google.com/spreadsheets/d/115TUgA1t_mD32SnWAGpW9OKmJ2W5WYAOs3SuSdedpX4"

In [None]:
%%R
# same thing for R
release <- '20Q2'
prevname <- '20Q1'
version<-37
genome_version <- 'hg38'

In [None]:
# we initialize the workspaces manager from dalmatian
wm1 = dm.WorkspaceManager(workspace1)
wm2 = dm.WorkspaceManager(workspace2)
wm3 = dm.WorkspaceManager(workspace3)

refwm = dm.WorkspaceManager(refworkspace)

In [None]:
# sometimes the bam files live under a different column in the workspace data
extract_to_change = {'ref_bams':'WES_bam'}

## Adding new data

We are looking for new samples in a range of workspaces.

They are quite messy: they may contain duplicates, broken file paths, and so on.

- We are thus looking at the bam files one by one and comparing them with our own bams. 
- We remove broken files and duplicates, and add a new version of a cell line's bam when we find one.
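
The filtering described above can be sketched as follows. This is only an illustration of the idea, not the implementation of `createDatasetWithNewCellLines`: the function name, the record keys and the `exists` callable are assumptions made for the example.

```python
def select_new_bams(candidates, known_paths, exists):
    """Return candidate BAM records that are neither broken nor already known.

    candidates:  list of dicts with hypothetical keys 'sample_id' and 'bam'.
    known_paths: set of BAM paths already present in the reference workspace.
    exists:      callable returning True when a path is readable (for example
                 a wrapper around `gsutil ls`); injected so the sketch runs offline.
    """
    selected = []
    seen = set(known_paths)
    for record in candidates:
        path = record["bam"]
        if path in seen:
            continue  # duplicate of a known or already selected file
        if not exists(path):
            continue  # broken file path
        seen.add(path)
        selected.append(record)
    return selected


# Made-up records, for illustration only.
candidates = [
    {"sample_id": "A", "bam": "gs://bucket/a.bam"},
    {"sample_id": "A_dup", "bam": "gs://bucket/a.bam"},
    {"sample_id": "B", "bam": "gs://bucket/missing.bam"},
    {"sample_id": "C", "bam": "gs://bucket/c.bam"},
]
readable = {"gs://bucket/a.bam", "gs://bucket/c.bam", "gs://bucket/old.bam"}
new_bams = select_new_bams(candidates, {"gs://bucket/old.bam"}, readable.__contains__)
print([r["sample_id"] for r in new_bams])
```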

In [None]:
sample_ids, refsamples, ccle_name = createDatasetWithNewCellLines(refworkspace, samplesetname, wmfroms = [workspace1, workspace2, workspace3], sources=[source1, source2, source3], gsfolderto='gs://ccle_bams/wes/', match='ACH', participantslicepos=10, accept_unknowntypes=True, extract = extract_to_change, dry_run = True)

In [None]:
# look at the reference samples that were just added
refsamples[refsamples.index.isin(sample_ids)]

In [None]:
#temporary renaming
rename = {'ACH-002446_2':'ACH-003000_1'}

In [None]:
sample_ids, refsamples, ccle_name = createDatasetWithNewCellLines(refworkspace, samplesetname, wmfroms = [workspace1, workspace2, workspace3], sources=[source1, source2, source3], gsfolderto='gs://ccle_bams/wes/', match='ACH', participantslicepos=10, accept_unknowntypes=True, extract = extract_to_change, dry_run = False, rename=rename)

In [None]:
refsamples[refsamples.index.isin(sample_ids)].WES_bam

## Check that we have all the cell lines we expect for this release

This involves comparing to the list in the Google sheet "Cell Line Profiling Status."

_As the list cannot be parsed, we are not comparing it for now_

In [None]:
# this function may not work - it hasn't been tested
url = 'https://docs.google.com/spreadsheets/d/1qus-9TKzqzwUMNWp8S1QP4s4-3SsMo2vuQRZrNXf7ag/edit?ts=5db85e27#gid=0&fvid=1627883727'

compareToCuratedGS(url, sample = sample_ids, samplesetname = samplesetname, colname = 'CN New to internal')

# run the pipeline

We use Dalmatian to send requests to Terra. We run a set of 5 workflows to generate the copy number dataset:

*   **BamToUnmappedRGBams_MC** vdauwera/BamToUnmappedRGBams (Snapshot ID: 3)
*   **Generate_uBAM_File_List** gkugener/ArrayOfFilesToTxt (Snapshot ID: 1)
*   **Realign_WES_GATK4** gatk/PreProcessingForVariantDiscovery_GATK4 (Snapshot ID: 7)
*   **CNV_sample_XX** gatk/CNV_Somatic_Pair_Workflow (Snapshot ID: 9)
*   **Aggregate_CN_seg_files** gkugener/Aggregate_CN_seg_files (Snapshot ID: 2)

The resulting file is saved on the sample set, under the `combined_seg_file` attribute, from where it can be downloaded.

There are several other tasks in this workspace. In brief:

*   **CNV_Somatic_Panel_Workflow_Agilent_XX** gatk/CNV_Somatic_Panel_Workflow (Snapshot ID: 11). This task was used in this workspace to generate the Sanger PON. The Sanger dataset contains a set of 40 normal cell line samples (cell lines derived from matched normal tissue). We can use these to generate a PON to normalize against, rather than the Agilent PON we use for the other CCLE cell lines, which leads to less noisy results. However, results using the PON from this workflow should not use the X chromosome, as the Sanger normals are not exclusively female or male (they are likely a mix).
*   **SANGER_PON_CNV_sample_XX** gatk/CNV_Somatic_Pair_Workflow (Snapshot ID: 9). Same as the CNV_sample_XX_gatk, except that it uses the Sanger-based PON. Should be used only for the Sanger cell lines.
*   **Sanger_PON_Aggregate_CN_seg_files** gkugener/Aggregate_CN_seg_files (Snapshot ID: 2). Aggregates the segment files for the samples that were run using the Sanger PON based CNV workflow.

## On Terra

In [None]:
# a list of Terra workflows that are in the workspace and that we will call sequentially
bamtoubam= "BamToUnmappedRGBams_MC"
ubamtofilelist = "Generate_uBAM_File_List"
realign="Realign_WES_GATK4"

In [None]:
# see dalmatian
subid = refwm.create_submission(bamtoubam,samplesetname,"sample_set","this.samples")
terra.waitForSubmission(refworkspace, subid)

In [None]:
subid = refwm.create_submission(ubamtofilelist,samplesetname,"sample_set","this.samples")
terra.waitForSubmission(refworkspace, subid)

In [None]:
subid = refwm.create_submission(realign,samplesetname,"sample_set","this.samples")
terra.waitForSubmission(refworkspace, subid)
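
The three cells above repeat the same submit-then-wait pattern. A minimal sketch of that pattern as a loop is given here; `run_workflows_in_order` and the two stand-in callables are hypothetical names introduced for the example, so that it runs without Terra access.

```python
def run_workflows_in_order(workflows, submit, wait):
    """Submit each workflow and block until it finishes before starting the next."""
    submission_ids = []
    for workflow in workflows:
        submission_id = submit(workflow)
        wait(submission_id)
        submission_ids.append(submission_id)
    return submission_ids


# Stand-ins that only record the calls (assumption: no network access here).
log = []

def fake_submit(workflow):
    log.append(("submit", workflow))
    return "sub-" + workflow

def fake_wait(submission_id):
    log.append(("wait", submission_id))

ids = run_workflows_in_order(
    ["BamToUnmappedRGBams_MC", "Generate_uBAM_File_List", "Realign_WES_GATK4"],
    fake_submit,
    fake_wait,
)
print(ids)
```

In the notebook, `submit` could wrap `refwm.create_submission(workflow, samplesetname, "sample_set", "this.samples")` and `wait` could wrap `terra.waitForSubmission(refworkspace, submission_id)`, which are the calls used in the cells above.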

In [None]:
# Testing out the XY PoN for CN characterization. Will test by producing an output in a different column from usual so it's easy to delete the column attribute later
# Also, need to make a split between Agilent and ICE samples..
samplesetname = '19Q4'
submission_id= refwm.create_submission("CNV_sample_XX",etype='sample_set',entity=samplesetname,expression='this.samples')
terra.waitForSubmission(refworkspace,submission_id)

In [None]:
submission_id= refwm.create_submission("CNV_sample_XX",etype='sample_set',entity=samplesetname,expression='this.samples')
terra.waitForSubmission(refworkspace,submission_id)

In [None]:
submission_id = refwm.create_submission("Aggregate_CN_seg_files",entity=samplesetname)
terra.waitForSubmission(refworkspace,submission_id)

__we are getting the results file path__

In [None]:
# aggregated = refwm.get_entities('sample_set').loc[prevname]["combined_seg_file"]
aggregated = refwm.get_entities('sample_set').loc[samplesetname]["combined_seg_file"]

aggregated

## On local

__We then save the workflow configurations used__

In [None]:
terra.saveConfigs(refworkspace,'data/'+samplesetname+'/CNVconfig')

__delete unmapped bams generated during the process__

In [None]:
sam = refwm.get_samples()
samplesinset = sam[sam.index.isin(refwm.get_sample_sets().loc[samplesetname].samples)]

In [None]:
for val in samplesinset.readgroup_ubams:
    # each entry is the list of unmapped readgroup bam paths of one sample
    os.system('gsutil -m rm ' + ' '.join(val))
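
Since this step deletes files, it can help to build the command first and inspect it before running it. The sketch below is one possible way to do that; `build_rm_command` is a hypothetical helper and the bucket paths are made up.

```python
import shlex


def build_rm_command(paths):
    """Build a `gsutil -m rm` argument list from readgroup uBAM paths.

    Entries that are not gs:// strings (for example missing values) are skipped.
    Returns None when nothing is left to delete.
    """
    targets = [p for p in paths if isinstance(p, str) and p.startswith("gs://")]
    if not targets:
        return None
    return ["gsutil", "-m", "rm"] + targets


cmd = build_rm_command(
    ["gs://bucket/rg1.unmapped.bam", None, "gs://bucket/rg2.unmapped.bam"]
)
# Print the command for review; nothing is deleted by this sketch.
print(" ".join(shlex.quote(part) for part in cmd))
```

Once reviewed, the list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.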

__and move the hg38 aligned bams to our own datastorage bucket__

Note that we may encounter some WGS files, which need to go to a different folder from the WES bam files.

In [None]:
onlycol = ['hg38_analysis_ready_bam',
'hg38_analysis_ready_bam_index',
'hg38_analysis_ready_bam_md5']
cols_to_grab = onlycol + ['datatype']
wes_newgs = 'gs://cclebams/hg38_wes/'
wgs_newgs = 'gs://cclebams/hg38_wgs/'
wes_data = ''
wgs_data = ''

wes_samples = samplesinset[samplesinset.datatype == 'wes'].index.tolist()
wgs_samples = samplesinset[samplesinset.datatype == 'wgs'].index.tolist()

for k, val in samplesinset[cols_to_grab].iterrows():
    datatype = val['datatype']
    for v in val[onlycol]:
        # skip missing values, which are not strings
        if not isinstance(v, str) or 'gs://' not in v:
            continue
        if datatype == "wes":
            wes_data+=' '+v
        elif datatype == "wgs":
            wgs_data+=' '+v
            
! gsutil -m mv $wes_data $wes_newgs
! gsutil -m mv $wgs_data $wgs_newgs

wes_res = terra.changeGSlocation(refworkspace,prevgslist=['gs://fc-secure-d2a2d895-a7af-4117-bdc7-652d7d268324/'], newgs=wes_newgs, onlycol=onlycol, entity='samples', entity_subset = wes_samples, keeppath=False, dry_run = True)

wgs_res = terra.changeGSlocation(refworkspace,prevgslist=['gs://fc-secure-d2a2d895-a7af-4117-bdc7-652d7d268324/'], newgs=wgs_newgs, onlycol=onlycol, entity='samples', entity_subset = wgs_samples, keeppath=False, dry_run = True)
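
The split between WES and WGS files can also be expressed as a grouping step. This is a sketch under assumptions: `group_paths_by_datatype` is a hypothetical helper, the rows are plain dicts rather than a DataFrame, and it tests `startswith("gs://")` where the cell above tests `'gs://' in v`.

```python
def group_paths_by_datatype(rows, columns):
    """Collect gs:// paths per datatype from sample rows.

    rows:    list of dicts, one per sample, each with a 'datatype' key.
    columns: names of the columns that hold file paths.
    """
    grouped = {}
    for row in rows:
        for column in columns:
            value = row.get(column)
            # missing values are not strings and are ignored
            if isinstance(value, str) and value.startswith("gs://"):
                grouped.setdefault(row["datatype"], []).append(value)
    return grouped


# Made-up rows, for illustration only.
rows = [
    {"datatype": "wes",
     "hg38_analysis_ready_bam": "gs://fc/a.bam",
     "hg38_analysis_ready_bam_index": "gs://fc/a.bai"},
    {"datatype": "wgs",
     "hg38_analysis_ready_bam": "gs://fc/b.bam",
     "hg38_analysis_ready_bam_index": None},
]
grouped = group_paths_by_datatype(
    rows, ["hg38_analysis_ready_bam", "hg38_analysis_ready_bam_index"]
)
print(grouped)
```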

In [None]:
! gsutil cp $aggregated "temp/cnv_ccle.called.seg"

__We download and reprocess removing the appended version and keeping only the newest versions__

In [None]:
a = pd.read_csv("temp/cnv_ccle.called.seg", sep='\t')
a

In [None]:
len(set(a.Sample))

### TODO: (1) add columns to the seg file with arxspan ID and version; (2) only keep the newest version for any given arxspan ID.
The process to keep the newest version of any given line is a little different from 20Q2 onwards, because we don't have any dataset that uses the CDS-IDs for the data from 20Q1 or earlier.

We have to download the Taiga datasets from the previous quarter, see if we have any arxspan IDs with new data, and then replace with that data. We use the function called "removeOlderVersions" to do this.
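
To make the idea concrete, here is a minimal sketch of "keep only the newest version per arxspan ID". It is not the implementation of `removeOlderVersions`; the function name and the sample IDs are made up, and the returned mapping (sample id to arxspan ID) is an assumption based on how `renaming` is used in the cells below.

```python
def keep_newest_versions(samples):
    """Map each retained sample id to its arxspan id, keeping the highest version.

    samples: dict of sample_id -> (arxspan_id, version).
    """
    newest = {}
    for sample_id, (arxspan_id, version) in samples.items():
        current = newest.get(arxspan_id)
        if current is None or version > current[1]:
            newest[arxspan_id] = (sample_id, version)
    return {sample_id: arxspan_id for arxspan_id, (sample_id, _) in newest.items()}


# Made-up identifiers, for illustration only.
samples = {
    "CDS-aaaaaa": ("ACH-000001", 1),
    "CDS-bbbbbb": ("ACH-000001", 2),
    "CDS-cccccc": ("ACH-000002", 1),
}
renaming_sketch = keep_newest_versions(samples)
print(renaming_sketch)
```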

In [None]:
a_indexed = a.set_index('Sample')
a_indexed.index.names = ['sample_id']
a_indexed

In [None]:
renaming = removeOlderVersions(names=a_indexed.index.tolist(), refsamples=refwm.get_samples(), arxspan_id="arxspan_id", version="version")

In [None]:
a_indexed[a_indexed.index.isin(renaming.keys())].rename(renaming).to_csv("temp/cnv_ccle.called.seg", sep='\t')

__If you want to reprocess something__

In [None]:
#%%R
#segments_unfiltered <- readr::read_csv('temp/wes.19Q3.segmented.cn')

## Post-processing

The post-processing happens in R using Guillaume's functions, in brief:

- processSegments
- filterForCCLE
- interpolateGapsInSegmented
- extendEndsOfSegments
- reprioritizeData

In [None]:
%%R
source('../gkugener/RScripts/load_libraries_and_annotations.R')
source("src/CCLE_postp_function.R")
library('celllinemapr')
library('magrittr')
library('taigr')
library('cdsomics')
library('readr')

In [None]:
%%R
# Previous release copy number profiles. This line will need to be updated as well
wes.priority.cn.seg.profiles <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.file=paste0("wes.",prevname,".segmented"),data.version='40') %>%
  dplyr::select(DepMap_ID, Chromosome, Start, End, Num_Probes, Segment_Mean, Source)

In [None]:
%%R
print(mean(wes.priority.cn.seg.profiles$Segment_Mean))
print(max(wes.priority.cn.seg.profiles$Segment_Mean))
print(min(wes.priority.cn.seg.profiles$Segment_Mean))

### debugging interpolateGapsInSegmented(segments)$segs
**TODO:** we get an error when calling interpolateGapsInSegmented on the 20Q2 data. In particular, the mutate call at line 275 of the CCLE_postp_function.R file yields the error: "Error: must be a double vector, not an integer vector". The solution was to change `TRUE ~ as.integer(1))` to `TRUE ~ as.numeric(1))` in the interpolateGapsInSegmented function. I've pushed this to the Git repo.

In [None]:
%%R
segments <- processSegments("temp/cnv_ccle.called.seg")
segments <- filterForCCLE(segments)
segments <- interpolateGapsInSegmented(segments)$segs
segments <- extendEndsOfSegments(segments,'../JKBio/data/hg38_cytoband.gz')
print(segments)
print(head(wes.priority.cn.seg.profiles))
# reprioritizeData also undoes the log transform
segments_unfiltered <- reprioritizeData(segments, wes.priority.cn.seg.profiles)
# Fill in the gaps on the entire dataset
# Extend start sites to 1, end sites to the end of the chromosome?

In [None]:
%%R
print(mean(segments_unfiltered$Segment_Mean))
print(max(segments_unfiltered$Segment_Mean))
print(min(segments_unfiltered$Segment_Mean))

In [None]:
%%R
write.table(segments_unfiltered, file = paste0("temp/wes.",release,".segmented.cn"), sep = ',', quote = F, row.names = F) 

### creating gene copy number

- generateEntrezGenes
- generateGeneLevelMatrixFromSegments

In [None]:
%%R
entrezgenes <- generateEntrezGenes()

In [None]:
%%R
head(corner(entrezgenes))

In [None]:
%%R
res <- generateGeneLevelMatrixFromSegments(entrezgenes, segments_unfiltered)

In [None]:
%%R
genematrix_unfiltered <- res$gene_level_data_hg38
corner(genematrix_unfiltered)

In [None]:
%%R
print(mean(genematrix_unfiltered))
print(max(genematrix_unfiltered))

In [None]:
%%R
# we save this table to be validated in python (quicker way to do it by passing it in cells)
write.table(genematrix_unfiltered, file = paste0('temp/wes.',release,'.gene.cn'), 
sep = ',', quote = F, row.names = T)

## Validation step

Once the files are saved, we load them back in python and do some validations, in brief:

- summary statistics: mean, max, variance...
- comparison to the previous version: the summary statistics should be similar
- checkAmountOfSegments: flag any samples with a very high number of segments
- checkGeneChangeAccrossAll: flag any genes which stay at a similar value across all samples
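
As an illustration of the segment-count check, the sketch below counts segments per sample and flags those above a threshold. It is not the code of `checkAmountOfSegments`; the function name, the strict `>` comparison and the toy data are assumptions.

```python
from collections import Counter


def flag_high_segment_counts(sample_ids, thresh=750):
    """Return {sample: count} for samples with more than `thresh` segments.

    sample_ids: one entry per segment row, e.g. the DepMap_ID column.
    """
    counts = Counter(sample_ids)
    return {sample: n for sample, n in counts.items() if n > thresh}


# Toy data with a low threshold, for illustration only.
segment_rows = ["ACH-A"] * 3 + ["ACH-B"] * 1
flagged = flag_high_segment_counts(segment_rows, thresh=2)
print(flagged)
```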

In [None]:
genecn = pd.read_csv('temp/wes.'+release+'.gene.cn', sep = ',')

In [None]:
genecn.shape

In [None]:
segmentcn = pd.read_csv('temp/wes.'+release+'.segmented.cn', sep = ',')

In [None]:
segmentcn.shape

In [None]:
# getting the previous versions to check that we have everything we should
prev = set(tc.get(name='segmented-cn-wes-prioritzed-7fe1', file='wes.'+prevname+'.gene', version="40").index.tolist())
prevprev= set(tc.get(name='segmented-cn-wes-prioritzed-7fe1', file='wes.'+prevprevname+'.gene', version='38').index.tolist())

In [None]:
new1 = set(genecn.index.values.tolist())
new2 = set(segmentcn['DepMap_ID'].values.tolist())
print(len(prev), len(prev & new1), len(new1), len(new1 & new2))
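
The set sizes printed above can be turned into a small report that names the lines which disappeared or appeared. This is a sketch only; `compare_releases` is a hypothetical helper and the IDs are made up.

```python
def compare_releases(previous_ids, new_ids):
    """Summarise which cell lines were dropped, added or kept between releases."""
    previous_ids, new_ids = set(previous_ids), set(new_ids)
    return {
        "dropped": sorted(previous_ids - new_ids),
        "added": sorted(new_ids - previous_ids),
        "kept": len(previous_ids & new_ids),
    }


report = compare_releases({"ACH-1", "ACH-2", "ACH-3"}, {"ACH-2", "ACH-3", "ACH-4"})
print(report)
```

A non-empty `dropped` list would be the first thing to investigate, since lines from the previous release are expected to remain.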

In [None]:
checkAmountOfSegments(segmentcn,thresh = 750)

In [None]:
checkGeneChangeAccrossAll(genecn, thresh=1.5)

In [None]:
segmentcn.Start = segmentcn.Start.astype(int)
segmentcn.End = segmentcn.End.astype(int)

In [None]:
genecn.values.min(), genecn.values.mean(), genecn.values.max()

In [None]:
if(genecn.values.max() > 100):
    print("\n\n\nTOO HIGH, not LOG2 transformed!")
if(len(genecn.index.tolist()) > len(set(genecn.index))):
    print("Duplicate CL, not reprioritized well!")

In [None]:
for k, val in samplesinset.iterrows():
    plot = val["modeled_segments_plot_tumor"]
    ! gsutil cp $plot temp/
    print(k)
    print(val['arxspan_id'])
    display(Image('temp/'+plot.split('/')[-1]))

These look bad in 20Q1: 
ACH-002511 (M140325), ACH-001370 (OCIP5X)

These CN plots subjectively appear to have too many segments in new 20Q2 samples: 
ACH-002399 (CDS-sukIAT, 21NT_1), ACH-002401 (CDS-tVy3GF, 21MT2_1), ACH-002400 (CDS-VUHMHG, 21MT1_1)

In [None]:
gsheets = sheets.get(sheeturl).sheets[6].to_frame()

# dropna() is more robust than comparing each value with `is not np.nan`
wes_embargo = gsheets['WES_embargo'].dropna().tolist()
wes_dmc_embargo = gsheets['WES_DMC_embargo'].dropna().tolist()

blacklist = gsheets['blacklist'].dropna().tolist()

# Upload to taiga

- we load the blacklisted/embargoed sample ids
- we log2 transform and create a file for each release (and one containing everything)
- we upload the files using taigapy in a corresponding taiga dataset with the corresponding description and also upload it to its virtual dataset
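
For reference, the log2 transform with a pseudocount of 1 that is applied to the gene-level matrix works as in this minimal sketch. It uses plain lists instead of a DataFrame, and `log2_pseudocount` is a hypothetical name.

```python
import math


def log2_pseudocount(matrix):
    """Apply log2(1 + x) to every value of a row-major matrix (list of lists)."""
    return [[math.log2(1 + value) for value in row] for row in matrix]


# Toy relative copy number values, for illustration only.
relative_cn = [[0.0, 1.0], [3.0, 7.0]]
transformed = log2_pseudocount(relative_cn)
print(transformed)
```

A relative copy number of 1 maps to 1 after the transform, which is why very large values in the gene matrix suggest the transform was skipped.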

## we push full dataset version in depmap taiga CN

In [None]:
print(segmentcn.Segment_Mean.max(), segmentcn.Segment_Mean.mean(), segmentcn.Segment_Mean.min())
print(genecn.values.max(), genecn.values.mean(),genecn.values.min())

In [None]:
# log2(1+x) transform the gene-level matrix (segments stay as relative copy number), then save both files
genecn = genecn.apply(lambda x: np.log2(1+x))
genecn.to_csv('temp/wes.'+release+'.gene.cn', index=True)
segmentcn.to_csv('temp/wes.'+release+'.segmented.cn', index=False)

In [None]:
if(genecn.values.max() > 100):
    print("\n\n\nTOO HIGH, not LOG2 transformed!")
if(len(genecn.index.tolist()) > len(set(genecn.index))):
    print("Duplicate CL, not reprioritized well!")

In [None]:
print("Old genecn shape:", genecn.shape)
print("Old segmentcn shape:", segmentcn.shape)

In [None]:
def checkLineInDsetVars(line):
    print("Line", line,"in the following dsets?")
    print(line in genecn.index.tolist(), "genecn")
    print(line in segmentcn.DepMap_ID.tolist(), "segmentcn")
    print()

In [None]:
checkLineInDsetVars('ACH-002874')
checkLineInDsetVars('ACH-000219')

In [None]:
tc.update_dataset(dataset_permaname="segmented-cn-wes-prioritzed-7fe1", 
                  upload_file_path_dict={
                    'temp/wes.'+release+'.gene.cn': 'NumericMatrixCSV',
                    'temp/wes.'+release+'.segmented.cn': 'TableCSV'},
                  changes_description=
"""
Removing duplication of ACH-000219 from segmentcn file to prevent issues in future releases, which use this Taiga dataset in the process of determining which new lines to release. Now both the genecn file and the segmentcn file have 1767 unique DepMap IDs.
""",
                  dataset_description=
"""
# Copy Number

Combined segment and gene-level CN calls from Broad WES, Sanger WES, and Broad SNP. Relative CN, log2(x+1) transformed.

PORTAL TEAM SHOULD NOT USE THIS: There are lines here that should not make it even to internal. Must use subsetted dataset instead. These data will not make it on the portal starting 19Q1. With the DMC portal, there is new cell line release prioritization as to which lines can be included, so a new taiga dataset will be created containing CN for the portal.

These data are generated for Achilles to pull from to run CERES.

Versions to use:

v45 for 20Q2 (For Achilles QC, use v44 of segmentcn)
v40 for 20Q1
v38 for 19Q4
v33 for 19Q4
v25 for 19Q2 (hg38 aligned, Broad WES and Sanger WES based calls were generated from bam realignment. SNP based calls are still from liftover). The gene mapping script was updated to improve the gene level matrix (to remove NAs). The segmented level matrix is untransformed relative CN. Gene level matrix is log2(CN + 1).
v20 for 19Q1 (version 21 is hg19)
v18: for 18Q4
v15: for 18Q3
v11: for 18Q2
Gene-level matrix in versions below 10 were using hg38 and not hg19. Version 11 is corrected and should be used instead

Calls on X, Y chromosome for profiles should not be used.

Prioritization is as follows:

Broad WES kept over everything
Sanger WES kept if:
This cell line did not fail fingerprinting
This cell line has no other CN data
This cell lines does not have CRISPR LFC data from the Achilles screen
This CN profile correlates better with Achilles CRISPR LFC data than Broad SNP CN OR the % gene-level difference between this cell lines CN profile from Sanger WES and Broad SNP < 2.5%
Broad SNP used for remaining lines with no Broad WES or with Sanger WES that does not pass the criteria above
The 'Source' column indicates which CN profile was used for that cell line.

version 6: renamed Sample column to CCLE_name for consistency for the Achilles pipeline

version 7: missing chordoma lines

version 8: fixed to names of two chordoma lines (changed suffix from CHORDOMA -> BONE) and removed renamed 
cell lines that were duplicated (with different names). Reran comparison using 18q2 LFC results. Gene level matrix will be generated for version 9

version 11: corrected error in gene-level matrix calculation (previously had been aligned to hg38 however alignment should be hg19). Segment level calls are unaffected.

versions 12-14: Sanger WES were multiplied by 2 so should not be used

version 15: internal segments and gene level matrices for 18q3 release including public version (removed black list lines and Broad WES < 6 months old). Gene level matrices are indexed using Broad IDs.

version 16: internal segments and gene level matrices for 18q4 release including public version (uses all SNP and only WES if those lines are present in the 18Q4 public Achilles dataset)

version 17: same as version 16 but with two additional line in the internal version

version 18: same as version 17 but switched one line in public to use SNP instead of WES because not in public Avana

version 20: two major changes occurred (1) we are using a FireCloud based pipeline for CN calling now for Broad WES data (2) we have moved to use hg38. This is accomplished by lifting over coordinates from hg19 to hg38 after processed by the CN pipeline.
v21 same as version 20, but we are using the original hg19 coordinates, not hg38

version 25: Broad WES and Sanger WES were realigned to hg38. SNP still uses liftover from hg19 to hg38

version 35: Seeing what went wrong with the upload.

version 36: problem with not log2 transforming the data

version 37: resolving the problem with log2 transforming the segment data

version 38: resolving the problem with log2 transforming the segment data

version 39: 20Q1. Samples ACH-002511 (M140325) and ACH-001370 (OCIP5X) appear to have too many segments looking at the CN profile.

version 40: unlog2 transforming segmentcn

version 41: 20Q2 (segmentcn is just relative copy number, whereas the genecn is log2(x+1) transformed). Added 7 new samples.
These CN plots subjectively appear to have too many segments in new 20Q2 samples: ACH-002399 (CDS-sukIAT, 21NT\_1), ACH-002401 (CDS-tVy3GF, 21MT2\_1), ACH-002400 (CDS-VUHMHG, 21MT1\_1)

version 42: **note: version 42 is missing some of the cell lines. Do not use** 

version 43: Resolving issue of no DepMap ID index in the genecn file. Duplicating the CN data in genecn and segmentcn for ACH-000219 so we have CN data for ACH-002874, the same cell line grown in different media. This step is required for Achilles / CERES.

version 44: Removing duplication of ACH-000219 from genecn file. The Achilles QC only needs the duplication in the segmentcn file. This change results in 1767 unique DepMap IDs in the genecn file, and 1768 unique DepMap IDs in the segmentcn file.

version 45: Removing duplication of ACH-000219 from segmentcn file to prevent issues in future releases, which use this Taiga dataset in the process of determining which lines should be released to Public. Now both the genecn file and the segmentcn file have 1767 unique DepMap IDs.

Gene level CN data:

__Rows__: DepMap cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

Segmented data:

__Columns__: DepMap\_ID, Chromosome, Start, End, Num\_Probes, Segment\_Mean""")

## We push internal dataset with blacklisted removed

We also add it to the eternal dataset and to the virtual dataset.

In [None]:
# remove blacklisted lines from both datasets to create the internal files
print(len(segmentcn))
segmentcn = segmentcn[~segmentcn.DepMap_ID.isin(blacklist)]
print(len(segmentcn))
segmentcn.to_csv('temp/internal_'+release+'_segs_cn', index=False)
print(len(genecn))
genecn = genecn[~genecn.index.isin(blacklist)]
print(len(genecn))
genecn.to_csv('temp/internal_'+release+'_gene_cn', index=True)

In [None]:
tc.update_dataset(dataset_permaname="depmap-wes-cn-data-81a7", 
                  upload_file_path_dict={
                    'temp/internal_'+release+'_gene_cn': 'NumericMatrixCSV',
                    'temp/internal_'+release+'_segs_cn': 'TableCSV'},
                  dataset_description=
"""
# Copy Number


## ** Version 1 Internal 18Q1****

Generated with the following script:

```
wes_pri <- taigr::load.from.taiga(data.name='gene-level-cn-87aa', 
                                  data.version=5, 
                                  data.file='gene_CN_WES_priority')
source_info <- data.frame(ccle_name=gsub("snp_|sangerWES_|ccleWES_|achillesWES_", 
                                         "", row.names(wes_pri)), 
                          source=gsub("_.*", "", row.names(wes_pri)))
wes_pri %<>% magrittr::set_rownames(source_info$ccle_name)

```

## ** Version 2 Internal 18Q2****

Generated with the following script:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=9, data.file='wes_priority_cn_gene_matrix') %>% log2()


```

## ** Version 3 Internal 18Q2****

Generated with the following script:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=11, data.file='wes_priority_cn_gene_matrix') %>% log2()


```
## ** Version 4-6 Internal 18Q3****

__Description__: log2 gene level copy number data

Generated with the following script:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=15, data.file='wes_priority_cn_gene_matrix') %>% log2()


```
__Rows__: Broad (arxspan) cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

Version 5 has updated cell line name mapping

Version 4 and 5 the segmented CN for Sanger's data is off by a factor of 2, version 6 corrects this

**** Version 7 Internal 18Q4****

__Description__: log2 gene level copy number data

Generated with the following script:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=17, data.file='wes_priority_cn_gene_matrix') %>% log2()


```
__Rows__: DepMap (arxspan) cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

## ** Version 8-9 Internal 19Q1****

version 9 has the correct data for 19Q1

## ** Version 10-11 Internal 19Q2****

__version 11 added an additional 13 cell lines and adds the segmented copy number data__

## ** Version 12 Internal 19Q3****

__Description__: log2(X + 1) gene level copy number data (data is now log2 transformed with a __pseudocount of 1__ added). CN data is generated using __hg38__. 


## ** Version 15 Internal 19Q4****

Adding 35 new cell lines

## ** Version 16 Internal 19Q4****
resolving problem with not having log2 transform 

## ** Version 17 Internal 19Q4****
resolving problem with having log2 transform on segments

## ** Version 18 Internal 20Q1****
adding 8 new cell lines

## ** Version 19 Internal 20Q1****
unlog2 transforming segmentcn


Some cells lines have been flagged as:

 - having bad looking copy ratio plots = ACH-002511 (M140325) and ACH-001370 (OCIP5X)
 - having too many segments (format: sample seg_count) = ACH-001079 2586, ACH-000044 1202, ACH-000258 872, ACH-001230 947, ACH-000068 812, ACH-000454 1051, ACH-000216 925, ACH-001150 782, ACH-001214 889, ACH-002335 1312, ACH-000836 1001, ACH-001957 1426, ACH-000960 913, ACH-000458 762, ACH-000578 869, ACH-000327 819, ACH-000090 1024, ACH-000488 954, ACH-000848 1171, ACH-000923 1469, ACH-000904 868, ACH-000452 816, ACH-000600 939, ACH-001656 902, ACH-000854 899, ACH-000774 953, ACH-001000 980, ACH-000941 813, ACH-000887 1408, ACH-001017 1223, ACH-001171 792, ACH-001071 1175, ACH-000593 764, ACH-001239 851, ACH-000071 1287, ACH-001956 1368, ACH-000509 873, ACH-002204 1318, ACH-000550 974, ACH-000738 1064, ACH-000870 1557, ACH-001036 858, ACH-001043 825, ACH-000028 868, ACH-001955 1296, ACH-000419 826, ACH-001234 819, ACH-001094 1036, ACH-001225 792, ACH-000118 794, ACH-000300 1431, ACH-001113 1072, ACH-001045 822, ACH-000444 974, ACH-000901 816, ACH-000865 1358, ACH-000961 763, ACH-001249 1756, ACH-000167 838, ACH-001101 1005, ACH-000842 929, ACH-000837 1015, ACH-000710 968, ACH-000195 2029, ACH-000064 1203, ACH-000690 771, ACH-000635 1368, ACH-000356 1294, ACH-000659 1129, ACH-000868 1422, ACH-000128 767, ACH-000658 927, ACH-001088 1337
 - Genes having a similar CN value accross all: []
 
## ** Version 20 Internal 20Q2****
Added 7 samples.

Some cells lines have been flagged as:

 - having bad looking copy ratio plots (appear to have too many segments): ACH-002399 (CDS-sukIAT, 21NT\_1), ACH-002401 (CDS-tVy3GF, 21MT2\_1), ACH-002400 (CDS-VUHMHG, 21MT1\_1)
 - having too many segments (format: sample seg_count): same as for 20Q1
 - Genes having a similar CN value accross all samples: []
 
 
## ** Version 21 Internal 20Q2****
 
Duplicating the CN data in genecn and segmentcn for ACH-000219 so we have CN data for ACH-002874, the same cell line grown in different media. This step is required for Achilles / CERES.


Gene level CN data:

__Rows__: DepMap cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

Segmented data:

__Columns__: DepMap\_ID, Chromosome, Start, End, Num\_Probes, Segment\_Mean
""")

In [None]:
# To add to a virtual dataset
AddToVirtual(virtual_internal, 'depmap-wes-cn-data-81a7', [('CCLE_gene_cn', 'internal_'+release+'_gene_cn'),('CCLE_segmented_cn', 'internal_'+release+'_segs_cn')])
# To add to the eternal dataset
AddToVirtual('depmap-a0ab', 'depmap-wes-cn-data-81a7', [('CCLE_gene_cn', 'internal_'+release+'_gene_cn'),('CCLE_segmented_cn', 'internal_'+release+'_segs_cn')])

## We add to dmc as in internal*

* **NOTE: change as of 20Q2 onwards**. We need to remove lines in WES_DMC_embargo from the Internal version of the CN datasets before we upload the `genecn` and `segmentcn` files to DMC.

In [None]:
# remove blacklisted, then DMC-embargoed lines from both datasets to create the DMC files
print(len(segmentcn))
segmentcn = segmentcn[~segmentcn.DepMap_ID.isin(blacklist)]
segmentcn = segmentcn[~segmentcn.DepMap_ID.isin(wes_dmc_embargo)]

print(len(segmentcn))
segmentcn.to_csv('temp/dmc_'+release+'_segs_cn', index=False)
print(len(genecn))
genecn = genecn[~genecn.index.isin(blacklist)]
genecn = genecn[~genecn.index.isin(wes_dmc_embargo)]

print(len(genecn))
genecn.to_csv('temp/dmc_'+release+'_gene_cn', index=True)

In [None]:
tc.update_dataset(dataset_permaname="depmap-cn-data-9b9d",
                upload_file_path_dict={
                    'temp/dmc_'+release+'_gene_cn':'NumericMatrixCSV',
                    'temp/dmc_'+release+'_segs_cn': 'TableCSV',
                   },
                  changes_description=
"""
Adding 20Q2 samples.
""",
                
                  dataset_description="""
**** Version 1-2 DMC 19Q1****

version 2 contains the correct data for 19Q1

**** Version 3-4 DMC 19Q2****

__version 4 added an additional 13 cell lines and adds the segmented copy number data__

**** Version 5 DMC 19Q3***

**** Version 7 DMC 19Q4***
adding 35 new cell lines

**** Version 8 DMC 19Q4****
resolving problem with not having log2 transform 

**** Version 9 DMC 19Q4****
resolving problem with having log2 transformed the segments

**** Version 10 DMC 20Q1****
adding new samples

**** Version 11 DMC 20Q1****
unlog2 transforming segmentcn

**** Version 12 DMC 20Q2****
Adding samples to be included in 20Q2

__Description__: log2(X + 1) gene level copy number data (data is now log2 transformed with a __pseudocount of 1__ added). CN data is generated using __hg38__.  The segmented copy number data includes the mean segmented copy number segments.

Gene level CN data:

__Rows__: DepMap cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

Segmented data:

__Columns__: DepMap\_ID, Chromosome, Start, End, Num\_Probes, Segment\_Mean
""")
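
The dataset description above states that the gene-level matrix is stored as log2(X + 1). As a rough illustration, assuming X is the untransformed gene-level copy number value, the transform and its inverse can be sketched as follows. The function names are hypothetical and are not part of the pipeline.

```python
import numpy as np

def to_log2_pseudocount(cn):
    """Apply log2(x + 1), i.e. log2 with a pseudocount of 1."""
    return np.log2(np.asarray(cn, dtype=float) + 1.0)

def from_log2_pseudocount(log_cn):
    """Invert log2(x + 1) to recover the untransformed values."""
    return np.power(2.0, np.asarray(log_cn, dtype=float)) - 1.0

# Made-up values: 0 maps to 0, 1 maps to 1, 3 maps to 2.
raw_cn = np.array([0.0, 0.5, 1.0, 3.0])
logged = to_log2_pseudocount(raw_cn)
restored = from_log2_pseudocount(logged)
print(logged)
print(restored)
```

The pseudocount keeps a value of 0 finite after the transform, which a plain log2 would not.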

In [None]:
AddToVirtual(virtual_dmc, "depmap-cn-data-9b9d", files=[('CCLE_gene_cn', 'dmc_'+release+'_gene_cn'),('CCLE_segmented_cn', 'dmc_'+release+'_segs_cn')])

## Public release: internal minus embargoed lines, keeping only cell lines from the release before last (6 months ago)

In [None]:
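# Public release: keep only lines already present two releases ago (prevprev),
# then drop lines under the WES embargo. Counts are printed before and after.
print(len(segmentcn))
segmentcn = segmentcn[segmentcn.DepMap_ID.isin(prevprev)]
segmentcn = segmentcn[~segmentcn.DepMap_ID.isin(set(wes_embargo))]
print(len(segmentcn))
segmentcn.to_csv('temp/public_' + release + '_segs_cn', index=False)

# Same filtering for the gene-level matrix, whose index holds the DepMap IDs.
print(len(genecn))
genecn = genecn[genecn.index.isin(prevprev)]
genecn = genecn[~genecn.index.isin(set(wes_embargo))]
print(len(genecn))
genecn.to_csv('temp/public_'+release+'_gene_cn', index=True)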
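
The public selection above amounts to a set operation: lines present in the release before last, minus embargoed lines. A small sketch of that rule, with a check that the public set never contains a line missing from the DMC set, is given below. `public_ids` and the `toy_*` values are hypothetical; in the notebook the real inputs would be `prevprev` and `wes_embargo`.

```python
import pandas as pd

def public_ids(candidate_ids, released_before, embargoed):
    """IDs that were already in the older release and are not embargoed."""
    return (set(candidate_ids) & set(released_before)) - set(embargoed)

# Made-up IDs standing in for the real lists.
toy_dmc_ids = ["ACH-1", "ACH-2", "ACH-3", "ACH-4"]
toy_prevprev = ["ACH-1", "ACH-2", "ACH-9"]
toy_embargo = ["ACH-2"]

keep = public_ids(toy_dmc_ids, toy_prevprev, toy_embargo)

toy_genes = pd.DataFrame({"GENE_A (1)": [1.0, 1.2, 0.9, 1.1]},
                         index=toy_dmc_ids)
public_genes = toy_genes[toy_genes.index.isin(keep)]

# The public release should be a subset of the DMC release.
assert set(public_genes.index) <= set(toy_dmc_ids)
print(sorted(keep))
```

Running a subset check like this before the upload is a cheap way to catch a cell executed out of order.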

In [None]:
tc.update_dataset(dataset_permaname='depmap-wes-cn-data-97cc',
                    upload_file_path_dict={
                    'temp/public_'+release+'_gene_cn':'NumericMatrixCSV',
                    'temp/public_'+release+'_segs_cn': 'TableCSV',
                   },
                  changes_description=
"""
Adding 20Q2 samples.
""",
                  dataset_description="""
**** Versions 1-5 Public 18Q1****

Gene-level WES copy-number data for publicly accessible CCLE data. 

```

internal_lines <- readr::read_csv("~/Downloads/avana-broad-18q1_v2-sample-info.csv")$cell_line
public_lines <- readr::read_csv("~/Downloads/avana-public-tentative-18q1_v5-sample-info.csv")$cell_line
non_public_lines <- setdiff(internal_lines, public_lines)

full_cn_set <- taigr::load.from.taiga(data.name='gene-level-cn-87aa', data.version=5, data.file='full_gene_CN')
source_info <- data.frame(source=gsub("_.*", "", row.names(full_cn_set)),
                          ccle_name=gsub("snp_|achillesWES_|ccleWES_|sangerWES_", "",
                                         row.names(full_cn_set)),
                          row_idx=1:nrow(full_cn_set))
to_remove <- source_info %>%
  dplyr::filter(ccle_name %in% non_public_lines,
                source %in% c("ccleWES", "achillesWES"))
also_to_remove <- source_info %>%
                    dplyr::filter(source == "sangerWES")
indices_to_remove <- c(to_remove$row_idx, also_to_remove$row_idx) %>% unique()
indices_to_keep <- source_info %>%
  dplyr::filter(!(row_idx %in% indices_to_remove)) %>%
  dplyr::group_by(ccle_name) %>%
  dplyr::mutate(priority=ifelse(source == "snp", 4,
                                ifelse(source == "sangerWES", 3,
                                       ifelse(source == "ccleWES", 2, 1)))) %>%
  dplyr::filter(priority == min(priority)) %>%
  dplyr::ungroup()

public_cn <- full_cn_set[indices_to_keep$row_idx,]
source_info <- data.frame(source=gsub("_.*", "", row.names(public_cn)),
                          ccle_name=gsub("snp_|achillesWES_|ccleWES_|sangerWES_", "",
                                         row.names(public_cn)))
public_cn %<>% magrittr::set_rownames(source_info$ccle_name)
```

CN data are on a log2 scale.

`WES_source_info` tracks the source data for each cell line. Sources are `snp`, `achillesWES`, `ccleWES`, and `sangerWES`

NOTE: Version 1 contained WES data from cell lines not available in the 18Q1 Public release. Versions 2-4 contained Sanger's WES CN data

**** Version 6 Public 18Q2****

Generated by running:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=10, data.file='public_wes_priority_cn_gene_matrix') %>% log2()
```

**** Version 7 Public 18Q2****

Generated by running:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=11, data.file='public_wes_priority_cn_gene_matrix') %>% log2()
```


**** Version 8-9 Public 18Q3****

Generated by running:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=15, data.file='public_wes_priority_cn_gene_matrix') %>% log2()
```
In version 8, the segmented CN for Sanger's data is off by a factor of 2; version 9 corrects this.

includes cell lines that should not be public

**** Version 10 Public 18Q1, 18Q2, 18Q3****

__use version 10 for 18Q1, 18Q2 and 18Q3 datasets__ 

Version 10 is the most up-to-date version of "public\_18Q3\_gene\_cn.csv". The three datasets have been updated to remove cell lines that should not have been made public. They are named in the portal and google bucket for portal downloads as v2, e.g. public\_18Q3\_gene\_cn\_v2.csv.

__Rows__: Broad (arxspan) cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

**** Version 11 Public 18Q4****

Generated by running:

```
wes_pri <- taigr::load.from.taiga(data.name='segmented-cn-wes-prioritzed-7fe1', data.version=17, data.file='public_wes_priority_cn_gene_matrix') %>% log2()
```

__Rows__: DepMap (arxspan) cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

**** Version 12-14 Public 19Q1****

version 14 contains the correct data for 19Q1

version 13 is the same as v12 except that it uses the original hg19 coordinates not hg38. The Achilles public data set uses the hg19 coordinates. 

**** Version 15-16 Public 19Q2****

__version 16 also adds the segmented copy number data__

**** Version 17-18 Public 19Q3****


**** Version 23 Public 19Q4****

adding new cell lines

**** Version 24 Internal 19Q4****
resolving problem with not having log2 transform 

**** Version 25 Internal 20Q1****
adding new samples

**** Version 26 Internal 20Q1****
unlog2 transforming segmentcn

**** Version 26 Internal 20Q1****
Adding new samples


## Gene level CN data:

__data is hg38 liftover__

__Description__: log2(X + 1) gene level copy number data (data is log2 transformed with a __pseudocount of 1__ added). It uses hg19 coordinates. The dataset also includes the segmented copy number data.

__Rows__: DepMap cell line IDs

__Columns__: gene names in the format HGNC\_symbol (Entrez\_ID)

Segmented data:

__Columns__: DepMap\_ID, Chromosome, Start, End, Num\_Probes, Segment\_Mean, CCLE\_name
""")

In [None]:
# To add to a virtual dataset
AddToVirtual(virtual_public, "depmap-wes-cn-data-97cc", files=[('CCLE_gene_cn', 'public_'+release+'_gene_cn'),('CCLE_segmented_cn', 'public_'+release+'_segs_cn')])