# Documentation notebook

## Setup

### Package installation

The cmapBQ package is available on PyPI and can be installed with `pip` using the command below. Documentation is available on [Read the Docs](https://cmapbq.readthedocs.io/en/latest/).

In [3]:
!pip -q install cmapBQ


### Standard Imports

In [4]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import requests

import matplotlib.pyplot as plt

### Credentials Setup and Package imports

The cell below downloads demo credentials from S3. Accessing BigQuery requires a service account JSON credentials file. Running `cmapBQ.config.setup_credentials(credentials_path)` points the toolkit to the credentials connected to your Google account.

More information about service accounts is available here: [Getting started with authentication](https://cloud.google.com/docs/authentication/getting-started)

In [5]:
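import requests

# URL of the demo service account credentials
url = 'https://s3.amazonaws.com/data.clue.io/api/bq_creds/BQ-demo-credentials.json'

response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Local path for the credentials file (/content is the Colab working directory)
credentials_filepath = '/content/BQ-demo-credentials.json'

with open(credentials_filepath, 'w') as f:
    f.write(response.text)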
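
Before pointing cmapBQ at the downloaded file, it can help to confirm that the response really is a service account key rather than, say, an error page. The sketch below is one way to do that. `save_service_account_json` is a hypothetical helper, not part of cmapBQ, and it checks only for the standard fields of a Google service account key file (`type`, `project_id`, `private_key`, `client_email`). The demonstration values are made-up placeholders.

```python
import json
import os
import tempfile


def save_service_account_json(text, path):
    """Validate `text` as a service account key and write it to `path`.

    Hypothetical helper (not part of cmapBQ). Raises ValueError if the
    text is not JSON or lacks the standard service account fields.
    """
    try:
        info = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"credentials are not valid JSON: {err}") from err

    if not isinstance(info, dict):
        raise ValueError("credentials JSON must be an object")

    required = ("type", "project_id", "private_key", "client_email")
    missing = [key for key in required if key not in info]
    if missing:
        raise ValueError(f"credentials are missing fields: {missing}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']!r}")

    with open(path, "w") as f:
        f.write(text)
    return path


# Demonstration with placeholder values (not real credentials)
fake_key = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "placeholder",
    "client_email": "demo@example-project.iam.gserviceaccount.com",
})
demo_path = os.path.join(tempfile.mkdtemp(), "demo-credentials.json")
saved_path = save_service_account_json(fake_key, demo_path)
```

In this notebook the call would be `save_service_account_json(response.text, credentials_filepath)`, used in place of the plain `open`/`write` step.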

Point cmapBQ to the credentials file:

In [6]:
import cmapBQ.query as cmap_query
import cmapBQ.config as cmap_config

#credentials_filepath='/content/cmapbq-external-test-580801ef1790.json'
# Set up credentials
cmap_config.setup_credentials(credentials_filepath)
bq_client = cmap_config.get_bq_client()

## Functions

### cmap_compounds

#### Query the compoundinfo table for various fields by providing lists of compounds, MoAs, targets, etc. Multiple conditions are combined with the ‘AND’ operator.


    cmapBQ.query.cmap_compounds(client, pert_id=None, cmap_name=None, moa=None, target=None, compound_aliases=None, limit=None, verbose=False)

    Parameters
            client – BigQuery Client
            pert_id – List of pert_ids
            cmap_name – List of cmap_names
            target – List of targets
            moa – List of MoAs
            compound_aliases – List of compound aliases
            limit – Maximum number of rows to return
            verbose – Print query and table address.
    Returns
        Pandas Dataframe matching queries
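
Because the filters are documented as lists, a single value such as one MoA is best wrapped in a list before it is passed in. The sketch below shows one way to do that. `as_list` and `compounds_by_moa` are hypothetical helpers, not part of cmapBQ; the only library call used is `cmap_compounds` with the `moa` and `limit` parameters listed above.

```python
def as_list(value):
    """Return `value` as a list, leaving None untouched (None means no filter)."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    return list(value)


def compounds_by_moa(client, moa, limit=None):
    """Query compoundinfo for one or more MoAs (hypothetical wrapper)."""
    # Imported lazily so that as_list can be used without cmapBQ installed
    import cmapBQ.query as cmap_query

    return cmap_query.cmap_compounds(client, moa=as_list(moa), limit=limit)
```

With the client created earlier, `compounds_by_moa(bq_client, 'EGFR inhibitor', limit=10)` would request at most ten matching rows.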

In [10]:
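# Example filter values (not applied in the call below)
target = 'EGFR'
moa = 'EGFR inhibitor'

# With every filter left as None, the query is not restricted
compound_table = cmap_query.cmap_compounds(
    bq_client,
    pert_id=None,
    cmap_name=None,
    moa=None,
    target=None,
    compound_aliases=None,
    limit=None,
    verbose=False
)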

## Do we need to be able to query by canonical smiles or inchi_keys? 

In [11]:
compound_table

| | pert_id | cmap_name | target | moa | canonical_smiles | inchi_key | compound_aliases |
|---|---|---|---|---|---|---|---|
| 0 | BRD-A08715367 | L-theanine | | | `CCNC(=O)CCC(N)C(O)=O` | DATAGRPVKZEWHA-UHFFFAOYSA-N | l-theanine |
| 1 | BRD-A12237696 | L-citrulline | | | `NC(CCCNC(N)=O)C(O)=O` | RHGKLRLOHDJJDR-UHFFFAOYSA-N | l-citrulline |
| 2 | BRD-A18795974 | BRD-A18795974 | | | `CCCN(CCC)C1CCc2ccc(O)cc2C1` | BLYMJBIZMIGWFK-UHFFFAOYSA-N | 7-hydroxy-DPAT |
| 3 | BRD-A27924917 | BRD-A27924917 | | | `NCC(O)(CS(O)(=O)=O)c1ccc(Cl)cc1` | WBSMZVIMANOCNX-UHFFFAOYSA-N | 2-hydroxysaclofen |
| 4 | BRD-A35931254 | BRD-A35931254 | | | `CN1CCc2cccc-3c2C1Cc1ccc(O)c(O)c-31` | VMWNQDUVQKEIOC-UHFFFAOYSA-N | r(-)-apomorphine |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39316 | BRD-K62685538 | triptorelin | GNRHR | Gonadotropin releasing factor hormone receptor... | `CC(C)C[C@H](NC(=O)[C@@H](Cc1c[nH]c2ccccc12)NC(...` | VXKHXGOKWPXYNA-PGBVPBMZSA-N | |
| 39317 | BRD-K62221994 | T-98475 | GNRHR | Gonadotropin releasing factor hormone receptor... | `CC(C)OC(=O)c1cn(Cc2c(F)cccc2F)c3sc(c(CN(C)Cc4c...` | RANJJVIMTOIWIN-UHFFFAOYSA-N | |
| 39318 | BRD-K53397409 | benzoic-acid | RAB9A | Precursor for food preservatives, plasticizers... | `OC(=O)c1ccccc1` | WPYMKLBDIGXBTP-UHFFFAOYSA-N | |
| 39319 | BRD-A62182663 | YK-4279 | DHX9 | Binding of RNA helicase A to the transcription... | `COc1ccc(cc1)C(=O)CC1(O)C(=O)Nc2c1c(Cl)ccc2Cl` | HLXSCTYHLQHQDJ-UHFFFAOYSA-N | |
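
The result is a regular pandas DataFrame, so it can be filtered locally after the query. The sketch below rebuilds four rows from the output above (abbreviated to three columns) and keeps only the compounds that have an annotated target. The same `notna()` mask should apply to the real `compound_table`, assuming missing targets come back as null values rather than empty strings.

```python
import pandas as pd

# Four rows copied from the output above; missing targets are written as None
example_table = pd.DataFrame({
    "pert_id": ["BRD-A08715367", "BRD-A12237696", "BRD-K62685538", "BRD-K53397409"],
    "cmap_name": ["L-theanine", "L-citrulline", "triptorelin", "benzoic-acid"],
    "target": [None, None, "GNRHR", "RAB9A"],
})

# Keep only compounds with an annotated target
with_target = example_table[example_table["target"].notna()]

# Count how many compounds are annotated to each target
target_counts = with_target["target"].value_counts()
print(with_target)
```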


### cmap_cell

#### Query cellinfo table

    cmapBQ.query.cmap_cell(client, cell_iname=None, cell_alias=None, ccle_name=None, primary_disease=None, cell_lineage=None, cell_type=None, table=None, verbose=False)

    Parameters
            client – BigQuery Client
            cell_iname – List of cell_inames
            cell_alias – List of cell aliases
            ccle_name – List of ccle_names
            primary_disease – List of primary_diseases
            cell_lineage – List of cell_lineages
            cell_type – List of cell_types
            table – Table to query. By default this points to the cellinfo table and normally should not be changed.
            verbose – Print query and table address.
    Returns
        Pandas DataFrame



In [None]:
cell_lineage = 'lung'

cell_table = cmap_query.cmap_cell(
    bq_client,
    cell_iname=None,
    cell_alias=None,
    ccle_name=None,
    primary_disease=None,
    cell_lineage=cell_lineage,
    cell_type=None,
    table=None,
    verbose=False
)
 

### cmap_genes

#### Query geneinfo table. Geneinfo contains information about genes including ids, symbols, types, ensembl_ids, etc.

    cmapBQ.query.cmap_genes(client, gene_id=None, gene_symbol=None, ensembl_id=None, gene_title=None, gene_type=None, src=None, table=None, verbose=False)
   
    Parameters
            client – BigQuery Client
            gene_id – list of gene_ids
            gene_symbol – list of gene_symbols
            ensembl_id – list of ensembl_ids
            gene_title – list of gene_titles
            gene_type – list of gene_types
            src – list of gene sources
            table – Table to query. By default this points to the geneinfo table and normally should not be changed.
            verbose – Print query and table address.
    Returns
        Pandas DataFrame


### cmap_genetic_perts

#### Query genetic_pertinfo table


    cmapBQ.query.cmap_genetic_perts(client, pert_id=None, cmap_name=None, gene_id=None, gene_title=None, ensemble_id=None, table=None, verbose=False)

    Parameters
            client – BigQuery Client
            pert_id – List of pert_ids
            cmap_name – List of cmap_names
            gene_id – List of integer gene_ids
            gene_title – List of gene_titles
            ensemble_id – List of Ensembl IDs
            table – Table to query. By default this points to the genetic_pertinfo table and normally should not be changed.
            verbose – Print query and table address.
    Returns
        Pandas Dataframe

### cmap_profiles

#### Query per-sample metadata, which corresponds to level 3 and level 4 data. Multiple conditions are combined with the AND operator.

    cmapBQ.query.cmap_profiles(client, sample_id=None, pert_id=None, cmap_name=None, cell_iname=None, build_name=None, return_fields='priority', limit=None, table=None, verbose=False)
    
    Parameters
            client – BigQuery Client
            sample_id – list of sample_ids
            pert_id – list of pert_ids
            cmap_name – list of cmap_names
            cell_iname – list of cell names
            build_name – list of builds
            return_fields – [‘priority’, ‘all’]
            limit – Maximum number of rows to return
            table – Table to query. By default this points to the per-sample metadata (instinfo) table and normally should not be changed.
            verbose – Print query and table address.
    Returns
        Pandas Dataframe
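
To make the AND behaviour concrete, the sketch below builds a WHERE clause in which each supplied filter becomes one `IN UNNEST(...)` condition and the conditions are joined with AND. It only illustrates the semantics and is not cmapBQ's actual query builder. `build_where` is a hypothetical function, and the example values are taken from the compound table shown earlier in this notebook.

```python
def build_where(**filters):
    """Join one condition per non-empty filter with AND (illustrative only).

    Values are interpolated directly into the string for readability;
    real code should prefer BigQuery query parameters.
    """
    conditions = []
    for field, values in filters.items():
        if not values:  # None or an empty list adds no condition
            continue
        conditions.append(f"{field} IN UNNEST({list(values)!r})")
    if not conditions:
        return ""
    return "WHERE " + " AND ".join(conditions)


clause = build_where(
    pert_id=["BRD-K62685538"],
    cmap_name=["triptorelin"],
    build_name=None,  # left out of the clause
)
print(clause)
```

A row has to satisfy every listed condition to be returned, so adding filters can only narrow the result.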



### cmap_sig

#### Query level 5 metadata table

    cmapBQ.query.cmap_sig(client, sig_id=None, pert_id=None, cmap_name=None, cell_iname=None, build_name=None, return_fields='priority', limit=None, table=None, verbose=False)

    Parameters
            client – BigQuery Client
            sig_id – list of sig_ids
            pert_id – list of pert_ids
            cmap_name – list of cmap_name, formerly pert_iname
            cell_iname – list of cell names
            build_name – list of builds
            return_fields – [‘priority’, ‘all’]
            limit – Maximum number of rows to return
            table – table to query. This by default points to the level 5 siginfo table and normally should not be changed.
            verbose – Print query and table address.
    Returns
        Pandas Dataframe



In [None]:
# Placeholder call: fill in sig_id and/or cmap_name with the values to query
sig = cmap_query.cmap_sig(
    bq_client,
    sig_id='',
    cmap_name=''
)

### cmap_matrix

#### Query numerical data at the signature-gene level.

    cmapBQ.query.cmap_matrix(client, data_level='level5', feature_space='landmark', rid=None, cid=None, verbose=False, chunk_size=1000, table=None, limit=1000)

    Parameters
            client – BigQuery Client
            data_level – Data level requested. IDs from the siginfo file correspond to ‘level5’. IDs from instinfo are available in ‘level3’ and ‘level4’. Choices are [‘level5’, ‘level4’, ‘level3’]
            feature_space – Feature space to query. Default ‘landmark’
            rid – Row ids
            cid – Column ids
            verbose – Print query and table address.
            chunk_size – Runs queries in stages to avoid the query character limit. Default 1,000
            table – Table address to query. Overrides ‘data_level’ parameter. Generally should not be used.
            limit – Limit applied to the query. Default 1,000
    Returns
        GCToo object
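
The `chunk_size` parameter exists because a very long list of IDs can push a single query past the character limit. The sketch below shows the general idea of splitting an ID list into batches. It is not cmapBQ's implementation; `chunked` is a hypothetical helper and the IDs are made up purely to show the batch sizes.

```python
def chunked(ids, chunk_size=1000):
    """Yield consecutive slices of `ids` with at most `chunk_size` items each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


# Made-up column IDs, used only to show how the list is split
example_ids = [f"sig_{i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(example_ids)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch would then be queried on its own and the partial results combined, which keeps every individual query string short.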


