# ISB-CGC Community Notebooks

Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Purpose: Painless intro to working in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:   
```
***

# Quick Start Guide to ISB-CGC
[ISB-CGC](https://isb-cgc.appspot.com/)

This Quick Start Guide is intended to give an overview of the data available, walk you through the steps of setting up your accounts, and get you started with a basic example in Python. If you have read the R version, you can skip to the Example section.

## Access Requirements
* Google Account to access ISB-CGC
* [Google Cloud Account](https://console.cloud.google.com)
* Some knowledge of SQL

## Access Suggestions
* Favored Programming Language (R or Python)
* Favored IDE (RStudio or Jupyter)

## Outline for this Notebook
* Quick Overview of ISB-CGC
* About the Data on ISB-CGC
* Overview of How to Access Data
* Account Set-up
* ISB-CGC Web Interface
* Google Cloud Platform (GCP) and BigQuery Overview
* Example of Accessing Data with Python
* Where to go next

## Overview of ISB-CGC
The ISB-CGC provides both interactive and programmatic access to data hosted by institutes such as the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) of the [National Cancer Institute (NCI)](https://www.cancer.gov/) and the [Wellcome Trust Sanger Institute](https://www.sanger.ac.uk/) while leveraging many aspects of the Google Cloud Platform. You can also import your own data to analyze it side by side with the datasets and share your data when you see fit.

In [0]:
#@title Introduction to ISB-CGC Video
#@markdown This 12 minute video goes over an introduction to ISB-CGC
from IPython.display import YouTubeVideo
YouTubeVideo('RQsLKDTciWk', width=600, height=400)
#@markdown For more videos check out: [ISB-CGC Video Tutorial Series](https://isb-cgc.appspot.com/videotutorials/)

## About the Data in the Cloud
The main data hosted on the cloud comes from [The Cancer Genome Atlas (TCGA)](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga), a large-scale, multi-disciplinary collaboration started by the [National Cancer Institute (NCI)](https://www.cancer.gov/) and the [National Human Genome Research Institute (NHGRI)](https://www.genome.gov/). Hosted data types and files include RNA-Seq FASTQ files, DNA-Seq and RNA-Seq BAM files, Genome-Wide SNP6 array CEL files, and variant calls in VCF files, along with a number of other datasets, including data from the [Therapeutically Applicable Research to Generate Effective Treatments (TARGET)](https://ocg.cancer.gov/programs/target) and [Cancer Cell Line Encyclopedia (CCLE)](https://depmap.org/portal/ccle/) programs. ISB-CGC hosts several tables in BigQuery with data from TCGA, TARGET, and CCLE, along with reference tables and [Catalogue Of Somatic Mutations In Cancer (COSMIC)](https://cancer.sanger.ac.uk/cosmic) data sets from the [Wellcome Trust Sanger Institute](https://www.sanger.ac.uk/). ISB-CGC is adding more data sets all the time, so if you have suggestions for a dataset to be added, please email: [feedback@isb-cgc.org](mailto:feedback@isb-cgc.org)

For more information, please visit: [Programs and Data Sets](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html) and [Data in BigQuery](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery/data_in_BQ.html)

## Overview of How to Access Data
There are several ways to access the Data that is hosted by ISB-CGC. 

* [ISB-CGC WebApp](https://isb-cgc.appspot.com/)
  * Provides a graphical interface to metadata
  * Does not require knowledge of programming languages
* [ISB-CGC BigQuery Table Search](https://isb-cgc.appspot.com/bq_meta_search/)
  * Provides a table search for available ISB-CGC BigQuery Tables
  * Does NOT require a login for Google or ISB-CGC to access
* [ISB-CGC APIs](https://api-dot-isb-cgc.appspot.com/v4/swagger/)
  * Provides programmatic access to metadata
* [Google Cloud Platform](https://cloud.google.com/)
  * Allows you to use GCP tools and services such as BigQuery, Cloud Datalab, and Colaboratory
  * Allows you to host your own data on the Cloud
* [BigQuery](https://cloud.google.com/bigquery/)
  * A GCP service that allows you to use SQL to access some data
* Supported Programming Languages
  * SQL
    * Can be used directly in BigQuery
  * [Python](https://www.python.org/)
    * [gsutil tool](https://cloud.google.com/storage/docs/gsutil) is a Python tool to access data via the command line
    * [Jupyter Notebooks](https://jupyter.org/)
    * [Google Colaboratory](https://colab.research.google.com/)
    * [Cloud Datalab](https://cloud.google.com/datalab/)
  * [R](https://www.r-project.org/)
    * [RStudio](https://rstudio.com/)
    * [RStudio.Cloud](https://rstudio.cloud/)
* Command Line Interfaces
  * Cloud Shell via Project Console
  * [Cloud SDK](https://cloud.google.com/sdk/)

## Account Set-up
*If not completed prior to reading this guide*
1. Log in to or [create](https://accounts.google.com/signup/v2/webcreateaccount?dsh=308321458437252901&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&flowName=GlifWebSignIn&flowEntry=SignUp#FirstName=&LastName=) a Gmail account
   * You can use your institutional email if it is a Google Identity
2. Create a GCP Project using a Gmail account
   * Required to use all of the data, tools and the Google Cloud
   * New accounts receive a one-time allotment of [$300 in Google Credit](https://cloud.google.com/free/)
     * Google also offers a [Free Tier](https://cloud.google.com/free/) which grants 1 TB of queries a month
     * Additionally, ISB-CGC offers [$300 in free Cloud Credits](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowtoRequestCloudCredits.html)
3. Authorize your account for dbGaP in the ISB-CGC WebApp (required for viewing controlled access data)
   * To access controlled data, users must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, user dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.
   * Please view [Accessing Controlled-Access Data](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Gaining-Access-To-Controlled-Access-Data.html) if you need help with this step.
4. Register your GCP project in the ISB-CGC WebApp
   * Please view [Registering your Google Cloud Project Service Account](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/webapp/Gaining-Access-To-Contolled-Access-Data.html#requirements-for-registering-a-google-cloud-project-service-account) if you need help with this step.
5. Enable the following required Google Cloud APIs:
   * Google Compute Engine
   * Google Genomics
   * Google BigQuery
   * Google Cloud Logging
   * Google Cloud Pub/Sub
   * [Google Tutorial on Enabling/Disabling GC APIs](https://cloud.google.com/apis/docs/enable-disable-apis)
6. Install optional software such as:
   * [Cloud SDK](https://cloud.google.com/sdk/)
   * [Anaconda Python](https://www.anaconda.com/distribution/)
   * [Jupyter Notebook](https://jupyter.org/)
   * [R](https://cran.r-project.org/)
   * [RStudio](https://www.rstudio.com/)
   * [Chrome](https://www.google.com/chrome/)
   * [Docker](https://www.docker.com/)
 

## ISB-CGC Web Interface
The ISB-CGC Web Interface is an [interactive web-based application](https://isb-cgc.appspot.com/) to access and explore the rich TCGA, TARGET, and CCLE datasets with more datasets being added regularly. Through the WebApp you can create Cohorts, lists of Favorite Genes, miRNA, and Variables. The Cohorts and Variables can be used in Workbooks to allow you to quickly analyze and export datasets by mixing and matching the selections. The ISB-CGC Web Interface also allows you to view and analyze available pathology and radiology images associated with selected cohort data.

## Google Cloud Platform and BigQuery Overview
The [Google Cloud Platform Console](https://console.cloud.google.com/) is the web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, access the [Cloud Shell command line](https://cloud.google.com/shell/docs/), and more. Click [here](https://raw.githubusercontent.com/isb-cgc/readthedocs/master/docs/include/intro_to_Console.pdf) to download a quick tour of the GCP Console from ISB-CGC. Any costs that you incur are charged to your *current* project, so make sure you are working in the correct one if you belong to multiple projects. [Here](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/DIYWorkshop.html#google-cloud-platform-console) is how to check which project is your *current* project.

"BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in." [*Source*](https://cloud.google.com/bigquery/) ISB-CGC has uploaded multiple cancer genomic datasets into open-access BigQuery tables, such as TCGA and TARGET Clinical, Biospecimen, and Molecular Data, along with dataset metadata. This data can be accessed from the Google Cloud Platform Console web UI, programmatically with R, and programmatically with Python through Cloud Datalab or Colab. Check out our [Community Notebook Repository](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowTos.html) for example notebooks.
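
Because query costs are charged to your current project, it can help to check how much data a query would scan before running it. The sketch below is one way to do that with a BigQuery dry run. The function names `estimate_query_bytes` and `share_of_free_tier` are hypothetical helpers written for this guide, and the 1 TB monthly allowance is assumed here to mean 1024^4 bytes.

```python
def estimate_query_bytes(sql, project_id):
    """Return the number of bytes a query would scan, using a dry run."""
    # Imported lazily so the helper below also works without the client library
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    # dry_run=True validates the query and reports bytes scanned without running it
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # API request
    return job.total_bytes_processed


def share_of_free_tier(num_bytes, free_tier_bytes=1024 ** 4):
    """Fraction of the monthly query allowance (assumed 1024**4 bytes) used."""
    return num_bytes / free_tier_bytes


# Example with a made-up scan size of 5 GiB (no BigQuery call is made here)
print("{:.2%} of the monthly free tier".format(share_of_free_tier(5 * 1024 ** 3)))
```

To use it on a real query, you would pass the SQL text and your own project ID to `estimate_query_bytes` after authenticating, then feed the result to `share_of_free_tier`.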

## Example of Accessing Data with Python


### Log into Google Cloud Storage and Authenticate ourselves
1. Authenticate yourself with your Google Cloud Login
2. A second tab will open; if it does not, follow the link provided
3. Follow the prompts to authorize your account to use the Google Cloud SDK
4. Copy the code provided and paste it into the box under the command
5. Press Enter

Alternatives for Authentication can be found [here](https://googleapis.github.io/google-cloud-python/latest/core/auth.html)
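
As one possible alternative to the `gcloud` command used in this section, Colab notebooks can authenticate with `google.colab.auth.authenticate_user()`. The sketch below picks between the two routes; `running_in_colab`, `choose_auth_method`, and `authenticate` are hypothetical helper names, and the `"colab"` / `"gcloud"` labels are only illustrative.

```python
import importlib.util


def running_in_colab():
    """Return True if the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent 'google' package is not installed at all
        return False


def choose_auth_method(in_colab):
    """Pick an authentication route based on the environment."""
    return "colab" if in_colab else "gcloud"


def authenticate():
    """Authenticate with whichever route fits the current environment."""
    method = choose_auth_method(running_in_colab())
    if method == "colab":
        from google.colab import auth
        auth.authenticate_user()  # opens the Colab sign-in prompt
    else:
        print("Run in a terminal: gcloud auth application-default login")
    return method
```

Calling `authenticate()` once at the top of a notebook would then cover both Colab and local Jupyter sessions, under the assumptions above.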

In [0]:
# Run a command line command with the bang (!) and gcloud
!gcloud auth application-default login 

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline


Enter verification code: 4/XQGk8wtHV404M8mfwbkdcZjmj-DpxkeKCnUvD3hh4y8XCWa00jfNoww

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


To take a quick anonymous survey, run:
  $ gcloud alpha survey



### View Datasets and Tables in BigQuery
Let us look at the datasets available through ISB-CGC that are in BigQuery. You will need to load the BigQuery API and set the client [(click here for more information)](https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/client.html).

In [0]:

# Load BigQuery API
from google.cloud import bigquery

# Create a client to access the data within BigQuery
client = bigquery.Client('isb-cgc')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then print their names,
# else print that there are no data sets available
if datasets:
    print("Datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print("{} project does not contain any datasets.".format(project))

Datasets in project isb-cgc:
	CCLE_bioclin_v0
	GDC_metadata
	GTEx_v7
	QotM
	TARGET_bioclin_v0
	TARGET_hg38_data_v0
	TCGA_bioclin_v0
	TCGA_hg19_data_v0
	TCGA_hg38_data_v0
	Toil_recompute
	ccle_201602_alpha
	genome_reference
	hg19_data_previews
	hg38_data_previews
	metadata
	platform_reference
	tcga_201607_beta
	tcga_cohorts
	tcga_seq_metadata


Let us see which tables are under the TCGA_bioclin_v0 dataset.

In [0]:
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('isb-cgc.TCGA_bioclin_v0'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")

Tables:
	Annotations
	Biospecimen
	Clinical
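
Before querying a table, it is often useful to see which columns it contains. The sketch below fetches a table's schema with the client library's `get_table` call and formats each field. `describe_fields` and `list_table_columns` are hypothetical helper names, and `project_id` stands for your own GCP project ID.

```python
def describe_fields(fields):
    """Format objects that have .name and .field_type as 'name (TYPE)' strings."""
    return ["{} ({})".format(f.name, f.field_type) for f in fields]


def list_table_columns(project_id, table_id="isb-cgc.TCGA_bioclin_v0.Clinical"):
    """Fetch a table's schema from BigQuery and return formatted column names."""
    # Imported lazily; requires the google-cloud-bigquery package
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    table = client.get_table(table_id)  # API request
    return describe_fields(table.schema)
```

For example, `list_table_columns("your-project-id")` would return one string per column of the Clinical table once you are authenticated.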


### Access BigQuery to call a table


First, call BigQuery with a magic command, then write your query in Standard SQL. Click [here](https://googleapis.github.io/google-cloud-python/latest/bigquery/magics.html) for more on IPython Magic Commands for BigQuery. The result will be a [pandas DataFrame](https://pandas.pydata.org/).

In [0]:
%%bigquery --project PROJECT_ID
# The %%bigquery cell magic must be the first line of the cell.
# Replace PROJECT_ID with your own GCP project ID.
SELECT # Select a few columns to view
  program_name,
  case_barcode,
  project_short_name
FROM # From the TCGA Clinical table
  `isb-cgc.TCGA_bioclin_v0.Clinical`
LIMIT # Limit to 5 rows as the table is very large and we only want to see a few results
  5

# General syntax for the above query:
# SELECT column_names
# FROM `project_name.dataset_name.table_name`
# LIMIT number_of_rows

| | program_name | case_barcode | project_short_name |
|---|---|---|---|
| 0 | TCGA | TCGA-01-0628 | TCGA-OV |
| 1 | TCGA | TCGA-01-0630 | TCGA-OV |
| 2 | TCGA | TCGA-01-0631 | TCGA-OV |
| 3 | TCGA | TCGA-01-0633 | TCGA-OV |
| 4 | TCGA | TCGA-01-0636 | TCGA-OV |


Now that wasn't so difficult! Have fun exploring and analyzing the ISB-CGC Data!
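
The same kind of query can also be run from regular Python code instead of a cell magic, which is handy in scripts. The sketch below builds the SQL text and runs it with the client library. `build_preview_query` and `preview_table` are hypothetical helper names, and `project_id` stands for your own GCP project ID.

```python
def build_preview_query(table_id, columns, limit=5):
    """Build a Standard SQL query that previews a few columns of a table."""
    return "SELECT {} FROM `{}` LIMIT {}".format(
        ", ".join(columns), table_id, int(limit)
    )


def preview_table(project_id, table_id, columns, limit=5):
    """Run the preview query and return the result as a pandas DataFrame."""
    # Imported lazily; requires google-cloud-bigquery and pandas
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)  # queries are billed to this project
    sql = build_preview_query(table_id, columns, limit)
    return client.query(sql).to_dataframe()  # API request


# Build (but do not run) the query used in the example above
sql = build_preview_query(
    "isb-cgc.TCGA_bioclin_v0.Clinical",
    ["program_name", "case_barcode", "project_short_name"],
)
print(sql)
```

Keeping the query builder separate from the BigQuery call makes it easy to inspect the SQL before spending any of your query allowance.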

## Where to Go Next

Explore, discover, and analyze the data provided by ISB-CGC side by side with your own! :)

ISB-CGC Links:

* [ISB-CGC Landing Page](https://isb-cgc.appspot.com/)
* [ISB-CGC Documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/)
* [How to Get Started on ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html)
* [How to access Google BigQuery](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/progapi/bigqueryGUI/HowToAccessBigQueryFromTheGoogleCloudPlatform.html)
* [Community Notebook Repository](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowTos.html)
* [Query of the Month](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/QueryOfTheMonthClub.html)
* [Quick Links](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/QuicklinksOneTable.html)

Google Tutorials:

* [Google's What is BigQuery?](https://cloud.google.com/bigquery/what-is-bigquery)
* [Google Cloud Client Library for Python](https://googleapis.github.io/google-cloud-python/latest/index.html)