# Produce CCLF report with all information for each specified cell line
The goal of this notebook is to be able to create a unified HTML report for either:
1. All CN and SNV data for a single participant (e.g. PEDS172) across the targeted probe data and WES data
    + Different culture conditions, passage number, tumor tissue vs cell line, etc.
2. All CN and SNV data for a single patient ID across the targeted probe data and WES data

Either report should make it easier for collaborators and Moony Tseng to analyse the existing data and decide what the next steps should be.

## Acquire / produce all the data for mutations and copy number
Pull from CCLF_WES and the most up-to-date TSCA workspace. We are currently transitioning to CCLF_targeted.

In [None]:
from __future__ import print_function
import os.path
# import os
import dalmatian as dm
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '../../JKBio/')
import TerraFunction as terra
import CCLF_processing
%load_ext autoreload
%autoreload 2
%load_ext rpy2.ipython
from IPython.display import Image, display, HTML
import ipdb

In [None]:
## widgets
# !pip install -U -q ipywidgets
# !jupyter nbextension enable --py widgetsnbextension

## qgrid for interactive plots
# !pip install qgrid
# !jupyter nbextension enable --py --sys-prefix qgrid

In [None]:
import qgrid # interactive tables
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import gcsfs # to be able to read in files from GCS in Python

# # Extra options
# pd.options.display.max_rows = 30
# pd.options.display.max_columns = 25
qgrid.set_grid_option('maxVisibleRows', 10)

# # Show all code cells outputs
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = 'all'

In [None]:
cwd = os.getcwd()
print(cwd)

In [None]:
specificSamples_both = ["CCLF_PEDS1012",
                        "PEDS172",
                        "PEDS182",
                        "PEDS196",
                        "PEDS204"]
specificSamples_onlyWES = ["PEDS012",
                           "PEDS018",
                           "PEDS110",
                           "PEDS117"]
specificSamples = specificSamples_both + specificSamples_onlyWES

In [None]:
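# path to the CSV with sample / disease information, relative to this notebook
df = '../../ccle_processing/ccle_tasks/data/kim_sept/kim_sample_disease_info.csv'

# machine-specific absolute path to the same file; this assignment overrides the one above
df = "/Users/gmiller/Documents/Work/GitHub/ccle_processing/ccle_tasks/data/kim_sept/kim_sample_disease_info.csv"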

In [None]:
# gather all the existing files
CCLF_processing.getReport(datadir = "gs://cclf_results/targeted/test/", specificlist = ["PEDS172"], specificlist_disease=df)
# CCLF_processing.getReport(datadir = "gs://cclf_results/targeted/kim_sept_6/", specificlist = specificSamples, specificlist_disease=df)

We want to create heat-map-style copy number plots for each participant, with all the culture conditions, primary tissue, and matched normal samples that exist shown side by side.

We might have to make separate CN heat maps for TSCA and WES samples: they live in separate workspaces, so a single sample set cannot contain both (or at least I think this is a problem). There may be a workaround.

* step 1: create sample set for each participant (add each sample_id to a sample set list?)
   
* step 2: create submission for each participant to generate the CN heat map
    + Terra.waitForSubmission needed before step 3
    + try/except style?
* step 3: copy the image from the workspace into the output location
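
Step 1 can be sketched in plain Python before any Terra calls are made. The sketch below assumes that each sample ID starts with its participant ID (as in `PEDS172T_PF_AR5_p7`); the function `group_samples_by_participant` and the example IDs are hypothetical and are not part of `CCLF_processing` or `TerraFunction`.

```python
from collections import defaultdict


def group_samples_by_participant(sample_ids, participants):
    """Group sample IDs into one candidate sample set per participant.

    Assumption (hypothetical): a sample ID begins with its participant ID.
    Longer participant IDs are tried first so that "PEDS1012" is not
    captured by a shorter ID such as "PEDS101".
    """
    ordered = sorted(participants, key=len, reverse=True)
    sample_sets = defaultdict(list)
    unmatched = []
    for sample_id in sample_ids:
        for participant in ordered:
            if sample_id.startswith(participant):
                sample_sets[participant].append(sample_id)
                break
        else:
            # no participant prefix matched this sample ID
            unmatched.append(sample_id)
    return dict(sample_sets), unmatched


# Made-up sample IDs, for illustration only.
sample_sets, unmatched = group_samples_by_participant(
    ["PEDS172T_PF_AR5_p7", "PEDS172N", "PEDS182T_p3", "XYZ001T"],
    ["PEDS172", "PEDS182"],
)
```

Each value in `sample_sets` could then be used as the sample list for that participant's heat map submission, and `unmatched` flags IDs that need manual review.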

In [None]:
# create heat map style copy number plots for each participant
# want to have all the culture conditions, primary tissue, matched normal that exist side by side

# step 1: create sample set for each participant (add each sample_id to a sample set list?)
# step 2: create submission for each participant to generate the CN heat map
# - Terra.waitForSubmission needed before step 3
# - try/except style?
# step 3: copy the image from the workspace into the output location

In [None]:
# ! gsutil -m rm -r 'gs://cclf_results/targeted/test/'


***
***

# Pretty report generation
After grabbing and making all of the files we want for a given participant (e.g. PEDS182), we want to make a pretty, interactive report. This will be similar to a README except that we will directly embed tables and images. This involves using Jupyter widgets to create dropdown menus and the like. Here are the main functionalities I'd like:

1. kable-like tables that are interactive: sorting, filtering, typing in text or numbers to search, (ability to download sorted/filtered table as a CSV?)
2. ability to quickly go to any image in the directory. I want this so that the user can quickly look through the copy number maps (horizontal plots). Ideally, I'd like to be able to select which one(s) I'd like to view. This could be useful if they want to see two or more at once (i.e. to compare two treatment conditions).

## Automate generation of separate Jupyter notebook for each participant
To do this, we will use Papermill, which parameterizes a template notebook, generates a new notebook from it, and executes the generated notebook. We may also want to convert the generated notebook to HTML; *nbconvert* can do this (see https://github.com/jupyter/nbconvert).
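
A minimal sketch of that workflow follows. The template name `report_template.ipynb`, the `reports` output directory, and the `participant` parameter name are assumptions made for illustration. Papermill injects parameters after a cell tagged `parameters` in the template, so the template would need such a cell.

```python
def plan_reports(participants, template="report_template.ipynb", outdir="reports"):
    """Return one (template, output_notebook, parameters) tuple per participant."""
    plan = []
    for participant in participants:
        output = "{}/{}_report.ipynb".format(outdir, participant)
        plan.append((template, output, {"participant": participant}))
    return plan


def run_reports(plan):
    """Execute each planned notebook with Papermill, then convert it to HTML."""
    import subprocess
    import papermill as pm  # imported lazily so that planning works without it

    for template, output, parameters in plan:
        pm.execute_notebook(template, output, parameters=parameters)
        # equivalent to running: jupyter nbconvert --to html <output>
        subprocess.run(["jupyter", "nbconvert", "--to", "html", output], check=True)


plan = plan_reports(["PEDS172", "PEDS182"])
# run_reports(plan)  # uncomment once the template notebook exists
```

Keeping the planning step separate from execution makes it possible to inspect the output paths and parameters before any notebook is run.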

In [None]:
# path would be the participant-specific path
path = "gs://cclf_results/targeted/kim_sept_6/Alveolar_Rhabdomyosarcoma/PEDS172/" 
# a list of file paths for the selected participant
filepaths = ! gsutil ls -r {path}**

# get all the tables in the bucket
# note: in gsutil, `*` only matches within one directory level; `**` is needed to match across subdirectories
table_filepaths = ! gsutil ls -r {path}*.txt
to_add = ! gsutil ls -r {path}*.tsv
table_filepaths += to_add
# get all the pngs in the bucket
img_filepaths = ! gsutil ls -r {path}*.png

# copy all the pngs from the Google bucket to a local temp folder
tempdir='./temp/cclfreport/images/'
# make sure the destination folder exists before copying into it
! mkdir -p {tempdir}
! gsutil cp -r {path}*.png {tempdir}
local_img_filepaths = ! ls {tempdir}*.png
# bare file names (without the directory) for all pngs in tempdir
local_img_file_names = [os.path.basename(p) for p in local_img_filepaths]

In [None]:
print(local_img_filepaths)
print(local_img_file_names)

In [None]:
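def make_interactive_table(filepath):
    """Show the first file in `filepath` (a list of paths, e.g. from `gsutil ls`) as an interactive qgrid table."""
    # only the first path is used, so the list is assumed to hold a single file
    print("Table: "+filepath[0])
    data = pd.read_table(filepath[0])
    qgrid_widget = qgrid.show_grid(data, show_toolbar=True,
                                   grid_options={'forceFitColumns': False,
                                                 'defaultColumnWidth': 150})
    display(qgrid_widget)
    print("\n")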

# Sample information and identifiers
This section details the external IDs for all the samples we discovered when searching the existing targeted probe data and WES data.

In [None]:
all_external_ids = ! gsutil ls -r {path}*all_external_ids.tsv
all_failed_external_ids = ! gsutil ls -r {path}*all_failed_external_ids.tsv
# check: should I make them interactive??

## Table: all external IDs

In [None]:
# show the table itself rather than the list of matching paths
pd.read_table(all_external_ids[0])

## Table: failed QC external IDs 
This table has the external IDs of all the samples that failed the depth of coverage QC in the targeted probe pipeline.

In [None]:
# show the table itself rather than the list of matching paths
pd.read_table(all_failed_external_ids[0])

# Copy number data

## Copy number heat maps
There are two plots in this section, one for CN data from the targeted probe data and a second for CN data from WES data. To look at any one sample in more detail, look either at the corresponding horizontal CN plot or at the CN table.

These tables are searchable and filterable, so just search for the sample of interest.

### Targeted CN heat map

### WES CN heat map

## Copy number horizontal plots

Select the copy number plot you would like to display from the dropdown menu. The menu includes CN plots from both the targeted probe data (TSCA and TWIST) and the WES data. The source of the data is shown in the title of each image. You can also refer to the table of all external IDs, which maps each external ID to the source of the data.

**check:** can I add a linked reference to this table so that readers can quickly jump there? It might be best to make it its own section so that it shows up in the TOC.

<!-- Note that to get nice dropdown menu names, I'm changing directories for now. There's probably a better way to do this. -->
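
One possible alternative to changing directories is to give the dropdown a list of `(label, value)` pairs, which `interact` accepts for dropdown menus. The sketch below only builds that list; `dropdown_options` and the example file names are hypothetical.

```python
import os.path


def dropdown_options(filepaths):
    """Return sorted (label, value) pairs: the bare file name is the label,
    the full path is the value."""
    return sorted((os.path.basename(p), p) for p in filepaths)


# Made-up file names, for illustration only.
options = dropdown_options([
    "./temp/cclfreport/images/sample_b.png",
    "./temp/cclfreport/images/sample_a.png",
])
```

Using `options` as the default value of the `file` argument in an `@interact` function should show the short names in the menu while passing the full path to `Image`, which would make the `os.chdir` calls unnecessary.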

In [None]:
os.chdir(tempdir)

In [None]:
# select image to display from dropdown menu    
@interact
def show_images(file=local_img_file_names): # Image cannot read gcsfs/GCS file paths, so use the local copies
    print(file)
    display(Image(file))

In [None]:
## must change back to the main directory
os.chdir(cwd)

In [None]:
# fdir = '/Users/gmiller/Documents/Pictures/'

# @interact
# def show_images(file=os.listdir(fdir)):
#     print(fdir+file)
#     display(Image(fdir+file))

In [None]:
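# get the CN tables from the Google storage bucket
tsca_cn = ! gsutil ls -r {path}*copy_number.tsv
wes_cn = ! gsutil ls -r {path}*wes_copy_number.tsv
# `*copy_number.tsv` also matches the WES file, so drop it from the targeted list
tsca_cn = [p for p in tsca_cn if not p.endswith('wes_copy_number.tsv')]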

## Targeted CN table

In [None]:
make_interactive_table(tsca_cn)

## WES CN table

In [None]:
make_interactive_table(wes_cn)

In [None]:
# for i in [tsca_cn, wes_cn]:
#     make_interactive_table(i)

# Mutation data

Below are interactive tables containing mutation information from the targeted probe data and the WES data. If there were multiple external IDs in either dataset, they have been combined into one table. The external_id column can be used to filter the data so that only the mutations for a single external ID are displayed.

Note that this report only includes samples from the targeted data that pass the depth of coverage QC. Samples that did not pass this QC are not included in this report, and their data is not included in the Google bucket. A list of the samples that failed this QC is included earlier in this document.

Also note that the tables below only include rows where keep equals True, i.e. only the variants that passed the filtering steps in the pipeline. The raw mutation TSVs in the Google bucket contain all variants regardless of whether keep is True or False, if you are interested in that information.

**check:** would be ideal to start with it automatically filtered, but allow for the filter to be removed if desired. Also, I should probably be smart about the ordering of the columns...
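
As a rough sketch of a filter that is on by default but can be switched off, the function below reads a mutation TSV with the standard library. It assumes the `keep` column holds the literal strings `True` and `False`; the `gene` column and the example rows are made up for illustration, and `load_mutations` is a hypothetical helper.

```python
import csv
import io


def load_mutations(handle, keep_only=True, external_id=None):
    """Read a mutation TSV and optionally filter on keep and external_id."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if keep_only:
        # default view: only variants that passed the pipeline filters
        rows = [r for r in rows if r["keep"] == "True"]
    if external_id is not None:
        rows = [r for r in rows if r["external_id"] == external_id]
    return rows


# Made-up rows, for illustration only.
example = io.StringIO(
    "external_id\tgene\tkeep\n"
    "sample_A\tGENE1\tTrue\n"
    "sample_A\tGENE2\tFalse\n"
    "sample_B\tGENE1\tTrue\n"
)
filtered = load_mutations(example)
example.seek(0)
unfiltered = load_mutations(example, keep_only=False)
```

Calling the function with `keep_only=False` returns every variant, which mirrors what the raw TSVs in the bucket contain.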

In [None]:
tsca_mut = ! gsutil ls -r {path}*mutation.tsv
wes_mut = ! gsutil ls -r {path}*wes_mutations.tsv

## Targeted mutation table

In [None]:
make_interactive_table(tsca_mut)

## WES mutation table

In [None]:
make_interactive_table(wes_mut)

In [None]:
# for i in [tsca_mut, wes_mut]:
#     make_interactive_table(i)

In [None]:
# data = pd.read_table(table_filepaths[4])
# qgrid_widget = qgrid.show_grid(data, show_toolbar=True)
# qgrid_widget

## Dropdown, non-interactive tables
I'm not convinced that this should be included unless I can get it to be interactive or sortable or something...

In [None]:
# select table to display from dropdown menu
## the interactive (qgrid) tables don't work here, unfortunately, so show a static table instead
@interact
def show_tables(file=table_filepaths):
    print(file)
    data = pd.read_table(file)
    display(data)

In [None]:
## reading in image from GCS
# method: https://pypi.org/project/fs-gcsfs/

# from fs_gcsfs import GCSFS
# gcsfs = GCSFS(bucket_name="cclf_results")

# gcsfs.fix_storage() # see https://fs-gcsfs.readthedocs.io/en/latest/#limitations
# gcsfs.tree()

# with open("/targeted/kim_sept_6/Alveolar_Rhabdomyosarcoma/PEDS172/PEDS172T_PF_AR5_p7_sample_statistics.txt") as f:
#     df = pd.read_csv(f)
    
# method: https://gcsfs.readthedocs.io/en/latest/
# fs = gcsfs.GCSFileSystem(project='my-google-project')
# fs.ls('my-bucket')
# with fs.open('my-bucket/my-file.txt', 'rb') as f:
#     df = pd.read_csv(f)
#         display(f)

# @interact
# def show_images(file=filepaths): # can Image work with gcsfs/GCS file paths? It doesn't look like it.
#     print(file)
#     display(Image(file))