<a href="https://colab.research.google.com/github/cmap/lincs-workshop-2020/blob/main/notebooks/data_access/BQ_toolkit_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## cmapBQ Tutorial

This notebook is meant to show a few examples of exploring, selecting and retrieving data available within LINCS-CMap datasets from Google BigQuery.

`cmapBQ` allows for targeted retrieval of relevant gene expression data from the resources provided by the Broad Institute and the LINCS Project.

### Package installation

The cmapBQ package is available from `pip` and can be installed using the command below. Documentation is available on [Read The Docs](https://cmapbq.readthedocs.io/en/latest/).

In [None]:
!pip -q install cmapBQ

### Standard Imports

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import requests

import matplotlib.pyplot as plt

### Credentials Setup and Package imports

To access BigQuery, a service account JSON credentials file must be obtained. The cell below shows how demo credentials can be downloaded from S3. Running `cmapBQ.config.setup_credentials(credentials_path)` points the toolkit to the credentials connected to your Google account.

More information about service accounts is available here: [Getting started with authentication](https://cloud.google.com/docs/authentication/getting-started)

In [None]:
""" Delete line if without Google Cloud credentials

import requests

# URL with credentials
url = ('https://s3.amazonaws.com/data.clue.io/api/bq_creds/BQ-demo-credentials.json')

response = requests.get(url)
credentials_filepath='/content/BQ-demo-credentials.json'

with open(credentials_filepath, 'w') as f:
  f.write(response.text)

"""

Pointing cmapBQ to credentials file

In [None]:
import cmapBQ.query as cmap_query
import cmapBQ.config as cmap_config

credentials_filepath='/content/YOUR_JSON_KEY.json'
# Set up credentials
cmap_config.setup_credentials(credentials_filepath)
bq_client = cmap_config.get_bq_client()
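
Before pointing the toolkit at a key file, it can help to confirm that the file is readable and looks like a service account key. The following is a minimal sketch and not part of cmapBQ: the helper name `check_service_account_file` is hypothetical, the field list reflects the entries normally found in a Google service account JSON file, and the demo values are placeholders rather than real credentials.

```python
import json
import os
import tempfile

# Fields normally present in a Google service account key file
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")

def check_service_account_file(path):
    """Hypothetical helper: return a list of problems found in the key file."""
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]
    try:
        with open(path) as f:
            info = json.load(f)
    except ValueError:
        return ["file is not valid JSON"]
    if not isinstance(info, dict):
        return ["top-level JSON value is not an object"]
    problems = ["missing field: {}".format(k) for k in REQUIRED_FIELDS if k not in info]
    if "type" in info and info["type"] != "service_account":
        problems.append("type is {!r}, expected 'service_account'".format(info["type"]))
    return problems

# Demonstration on a throwaway file with placeholder values
demo = {"type": "service_account", "project_id": "demo-project",
        "private_key": "PLACEHOLDER", "client_email": "demo@example.com"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(demo, tmp)
    demo_path = tmp.name

demo_problems = check_service_account_file(demo_path)
print(demo_problems)
os.remove(demo_path)
```

An empty list means no problems were detected by these basic checks; it does not guarantee that the key grants access to the CMap tables.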

<div style="font-size: 10pt;line-height:30px">
    
Alternative method of authentication:

In [None]:
#from google.cloud import bigquery
#os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials_filepath
#bq_client = bigquery.Client()

# BigQuery Table Information

### The data hosted on BigQuery is organized in the following tables

<div style="font-size: 10pt;line-height:18px;font-weight:normal">
    
**compoundinfo:** <br> Metadata for all unique compounds included in the data release. Each row contains information about a unique compound such as MoA, target, etc. 
    
**instinfo:**  <br> Sample-level metadata. Includes information for each replicate, including experimental parameters such as timepoint and dose.

**siginfo:**  <br> Signature (replicate collapsed) level 5 metadata. Includes experimental parameters such as timepoint and dose as well as metrics for bioactivity such as `tas` for [Transcriptional Activity Score](https://clue.io/connectopedia/signature_quality_metrics) and `cc_q75` for Replicate Correlation.

**L1000 Level3:**  <br> Gene expression values (GEX, Level 2) are normalized to invariant gene set curves and quantile normalized across each plate. Here, the data from each perturbagen treatment is referred to as a profile, experiment, or instance. Values for 11,350 additional genes not directly measured in the L1000 assay are inferred from the normalized values of the 978 landmark genes.

    
**L1000 Level4:**  <br> Z-scores for each gene based on Level 3 with respect to the entire plate population. This comparison of profiles to their appropriate population control generates a list of differentially expressed genes.

**L1000 Level5:** <br> Replicate-collapsed z-score vectors based on Level 4. Replicate collapse generates one differential expression vector, which we term a signature. Connectivity analyses are performed on signatures.
    
**geneinfo:** <br> Metadata for gene_ids included in the data release. Each row contains mappings between gene_symbol, ensemble_id, gene_id as well as information about gene_type

**cellinfo:** <br> Metadata for cell lines included in the data release. Each row contains information such as cell_iname, ccle_name or cell_lineage

**genetic_pertinfo**: <br> Contains information related to genetic perturbagens such as type ['oe', 'sh', 'xpr'] and relevant gene_id, ensemble_id 
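
The relationship between the Level 3, Level 4 and Level 5 tables described above can be illustrated on toy data. This is a deliberately simplified sketch: it uses a median/MAD robust z-score across a made-up plate and an unweighted mean across replicates, and it does not reproduce the normalization or replicate-weighting details of the production CMap pipeline. All values, gene names and well names are invented.

```python
import pandas as pd

# Toy "Level 3" matrix: 3 genes x 8 wells on one plate (3 treated, 5 other wells)
wells = ["trt1", "trt2", "trt3", "c1", "c2", "c3", "c4", "c5"]
level3 = pd.DataFrame(
    [[12.0, 12.2, 11.8, 8.0, 8.1, 7.9, 8.2, 7.8],   # g0 is induced in treated wells
     [8.0, 8.1, 7.9, 8.0, 8.1, 7.9, 8.2, 7.8],      # g1 is unchanged
     [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.1, 4.9]],     # g2 is unchanged
    index=["g0", "g1", "g2"], columns=wells)

# Toy "Level 4": z-score each gene against all wells on the plate,
# using the median and the scaled median absolute deviation (MAD)
median = level3.median(axis=1)
mad = level3.sub(median, axis=0).abs().median(axis=1)
level4 = level3.sub(median, axis=0).div(mad * 1.4826, axis=0)

# Toy "Level 5": collapse the three replicates into one signature (plain mean here)
level5 = level4[["trt1", "trt2", "trt3"]].mean(axis=1).rename("signature")
print(level5.round(2))
```

In this toy example the induced gene `g0` ends up with a large positive value in the collapsed signature, while the unchanged genes stay close to zero.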


# Raw SQL Queries

The `cmapBQ.query.list_tables()` function displays the addresses of the default tables for use in SQL queries.

In [None]:
import cmapBQ.query as cmap_query

cmap_query.list_tables()

Raw SQL queries can be run on the public datasets as shown below. Syntax follows that of Google BigQuery, available here: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax

### Example SQL Query 

In [None]:
## This query may take up to a minute
query = "SELECT COUNT(DISTINCT(cid)) as num_level5_sigs FROM `cmap-big-table.cmap_lincs_public_views.L1000_Level5`"

cmap_query.run_query(query=query, client=bq_client).result().to_dataframe()

# cmapBQ Utilities

`cmapBQ` provides a multitude of utility functions to survey and retrieve data hosted on BigQuery. Below we will demonstrate a workflow for finding and analyzing data pertaining to an MoA of interest as an example.

# Get Table Schema Information

In [None]:
cmap_query.list_tables()

In [None]:
cmap_query.get_table_info(bq_client, 'cmap-big-table.cmap_lincs_public_views.compoundinfo') 

In [None]:
config = cmap_config.get_default_config()
compoundinfo_table = config.tables.compoundinfo

QUERY = ( 'SELECT moa, ' 
'COUNT(DISTINCT(pert_id)) AS count ' 
'FROM `{}` ' 
'GROUP BY moa')

QUERY = QUERY.format(compoundinfo_table)
cmap_query.run_query(bq_client, QUERY).result().to_dataframe()
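
For readers more familiar with pandas than SQL, the aggregation performed by the query above can be sketched on a small in-memory table. The rows below are made up and only mimic the `pert_id` and `moa` columns; no BigQuery access is involved.

```python
import pandas as pd

# Made-up rows standing in for the compoundinfo table
toy_compoundinfo = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-1", "BRD-2", "BRD-3", "BRD-4"],
    "moa": ["EGFR inhibitor", "EGFR inhibitor", "EGFR inhibitor",
            "Glucocorticoid receptor agonist", None],
})

# Equivalent of: SELECT moa, COUNT(DISTINCT(pert_id)) AS count ... GROUP BY moa
# dropna=False keeps the group of compounds without an annotated MoA,
# similar to how SQL keeps a NULL group
moa_counts = (toy_compoundinfo
              .groupby("moa", dropna=False)["pert_id"]
              .nunique()
              .rename("count")
              .reset_index())
print(moa_counts)
```

Note that `BRD-1` appears twice with the same MoA but is counted once, which is the effect of `COUNT(DISTINCT(pert_id))`.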

### Metadata at a glance

In [None]:
moas = cmap_query.list_cmap_moas(bq_client)
display(moas)

In [None]:
targets = cmap_query.list_cmap_targets(bq_client)
display(targets.sort_values('count'))

In [None]:
display(
    moas[moas['moa'].str.contains('Glucocorticoid receptor agonist', na=False)]
)
display(
    targets[targets['target'].str.contains('EGFR', na=False)]
)
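
The filtering pattern used above relies on `str.contains` with `na=False`. A small sketch on made-up rows shows why the argument matters when some entries have no annotation; the counts are invented.

```python
import pandas as pd

toy_moas = pd.DataFrame({
    "moa": ["Glucocorticoid receptor agonist", "EGFR inhibitor", None,
            "Glucocorticoid receptor antagonist"],
    "count": [40, 25, 300, 5],
})

# na=False turns missing annotations into False so the mask is purely boolean
mask = toy_moas["moa"].str.contains("Glucocorticoid receptor", na=False)
print(toy_moas[mask])

# regex=False treats the pattern as a literal string, which is safer for
# names containing characters such as "(" or "+"
literal_mask = toy_moas["moa"].str.contains("receptor agonist", na=False, regex=False)
print(toy_moas[literal_mask])
```

The first mask matches both the agonist and antagonist rows because the search is a substring match; the second, more specific pattern keeps only the agonist row.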

# Analysis 1 - Comparison of MoA concordance across various cell lines

## Compound Information Table

<div style="font-size: 12pt;line-height:20px">

If the desired target or MoA is present, we can then query the compound table to find out which compounds are annotated with that MoA.

In [None]:
moa = 'Glucocorticoid receptor agonist'
#moa = 'EGFR inhibitor'

moa_cpinfo = cmap_query.cmap_compounds(client=bq_client,
  moa=moa, 
  #verbose=True,
)
moa_cpinfo.sample(10)

<div style="font-size: 12pt;line-height:20px">

Let's collect the compounds annotated with this MoA and see how many signatures are available for them. We can pass a list of compounds to the **cmap_sig** function, which then queries the dataset for signatures of matching compounds.

In [None]:
moa_cps = moa_cpinfo.cmap_name.unique()
moa_cps

## Cell Line information

In [None]:
core_cell_lines = ['A375', 'A549', 'HCC515', 'HEPG2', 'MCF7', 'PC3', 'VCAP', 'HT29', 'HA1E']

core_cellinfo = cmap_query.cmap_cell(bq_client, 
  cell_iname=core_cell_lines, 
  verbose=True,
)
core_cellinfo.head(10)

## Query Siginfo 

<div style="font-size: 12pt;line-height:20px">


The siginfo table provides information on the conditions for each experiment, such as compound, dose, timepoint, cell line, and more.

The table also includes information regarding the signal strength and replicate correlation of each signature. The `tas` field contains the signature's **Transcriptional Activity Score (TAS)**, an aggregate measure of strength and reproducibility. [More information about signature quality metrics can be found on Connectopedia](https://clue.io/connectopedia/signature_quality_metrics)

In [None]:
sample_cell_lines = list(core_cellinfo.cell_iname.unique()) #core_cell_lines

sample_compounds = list(moa_cps)
print("Compounds: {}".format(sample_compounds))
print("Cell Lines: {}".format(sample_cell_lines))

siginfo_sample = cmap_query.cmap_sig(     #Query the siginfo table
    bq_client, 
    cmap_name = sample_compounds,
    cell_iname = sample_cell_lines,
    return_fields = 'all'
)

siginfo_sample = siginfo_sample.loc[     #Filter returned table 
    (siginfo_sample.nsample >= 3) &
    (siginfo_sample.pert_dose >= 10) &
    (siginfo_sample.pert_itime.eq('24 h'))
]


siginfo_sample = siginfo_sample.merge(core_cellinfo, on='cell_iname') #join with cellinfo table to get cell lineage information
siginfo_sample.sample(5)

In [None]:
plt.figure(figsize=(8,6))

sorted_index = siginfo_sample.groupby('cell_iname')['tas'].median().sort_values().index #order cell lines by median TAS

sns.boxplot(
    data=siginfo_sample,
    x='cell_iname',
    y='tas',
    hue='cell_lineage',
    dodge=False,
    order=sorted_index
);

plt.title('Transcriptional Activity by Cell Line')
plt.xlabel('Cell Line')
plt.ylabel('Transcriptional Activity Score (TAS)')
plt.legend(loc='upper left', bbox_to_anchor=(1.01,1))
plt.show()

## Numerical Data

### Extracting Numerical data using `cmapBQ.query.cmap_matrix`

In [None]:
sig_ids = siginfo_sample.sig_id.unique() #sig_ids are unique for each signature and relates siginfo table to numerical data

sample_data_numerical = cmap_query.cmap_matrix(bq_client,
    data_level='level5',
    feature_space='landmark', #Choices ['landmark', 'bing', 'aig']
    cid=list(sig_ids), #columns are signatures
)

print( sample_data_numerical.data_df.shape )

`cmap_matrix` returns a GCToo object, part of the cmapPy resource. Documentation on the GCToo object structure and useful cmapPy utilities can be found in the cmapPy documentation here: https://cmappy.readthedocs.io/en/stable/

### Write to file as GCTX

In [None]:
from cmapPy.pandasGEXpress.write_gctx import write as write_gctx
from cmapPy.pandasGEXpress.write_gct import write as write_gct

#write_gctx(sample_data_numerical, filename)
#write_gct(sample_data_numerical, filename)

## Pairwise Correlations

In [None]:
corr_matrix = sample_data_numerical.data_df.corr()
print(corr_matrix.shape)
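
`DataFrame.corr()` computes correlations between columns, so with genes as rows and signatures as columns the result is a signature-by-signature matrix. A toy example with made-up values illustrates this:

```python
import numpy as np
import pandas as pd

# Toy matrix: 4 genes (rows) x 3 signatures (columns)
toy = pd.DataFrame(
    {"sig_a": [1.0, 2.0, 3.0, 4.0],
     "sig_b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with sig_a
     "sig_c": [4.0, 3.0, 2.0, 1.0]},   # perfectly anti-correlated with sig_a
    index=["g1", "g2", "g3", "g4"],
)

toy_corr = toy.corr()   # Pearson by default, computed between columns
print(toy_corr.shape)
print(toy_corr.round(2))

# Spearman rank correlation is one alternative that is less sensitive to outliers
toy_spearman = toy.corr(method="spearman")
```

Whether Pearson or Spearman is more appropriate depends on the analysis; the notebook uses the pandas default.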

### Function definitions

In [None]:
import scipy
import scipy.cluster.hierarchy as sch
import numpy as np


def get_off_diagonals(matrix):
    """
    Extract the off-diagonal (upper-triangle) elements of a symmetric matrix

    Parameters
    ----------
    matrix : pandas.DataFrame
        a NxN matrix

    Returns
    -------
    pandas.Series
        the values above the diagonal, one per unordered pair
    """
    return matrix.where(
        np.triu(np.ones(matrix.shape).astype(bool), k=1)
    ).stack().reset_index(drop=True)

def cluster_corr(corr_array, inplace=False):
    """
    Rearranges the correlation matrix, corr_array, so that groups of highly 
    correlated variables are next to each other 
    
    Parameters
    ----------
    corr_array : pandas.DataFrame or numpy.ndarray
        a NxN correlation matrix 
        
    Returns
    -------
    pandas.DataFrame or numpy.ndarray
        a NxN correlation matrix with the columns and rows rearranged
    """
    pairwise_distances = sch.distance.pdist(corr_array)
    linkage = sch.linkage(pairwise_distances, method='complete')
    cluster_distance_threshold = pairwise_distances.max()/2
    idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, 
                                        criterion='distance')
    idx = np.argsort(idx_to_cluster_array)
    
    if not inplace:
        corr_array = corr_array.copy()
    
    if isinstance(corr_array, pd.DataFrame):
        return corr_array.iloc[idx, :].T.iloc[idx, :]
    return corr_array[idx, :][:, idx]

def get_subplot_dimensions(num_plots):
    """
    Return (nrows, ncols) for a subplot grid. Uses 4, 3 or 2 columns,
    whichever divides num_plots evenly first, else a single column.
    """
    for ncols in (4, 3, 2):
        if num_plots % ncols == 0:
            return num_plots // ncols, ncols
    return num_plots, 1
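
The idea behind extracting off-diagonal values can be shown with NumPy alone. In a symmetric N x N correlation matrix each pair of signatures appears twice and the diagonal is always 1, so only the entries strictly above the diagonal carry distinct information. The matrix below is made up for illustration.

```python
import numpy as np
import pandas as pd

labels = ["s1", "s2", "s3", "s4"]
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.4],
     [0.1, 0.3, 1.0, 0.8],
     [0.2, 0.4, 0.8, 1.0]],
    index=labels, columns=labels)

# Row and column positions of the entries strictly above the diagonal
rows, cols = np.triu_indices(len(toy_corr), k=1)
upper = toy_corr.to_numpy()[rows, cols]

print(upper)        # one value per unordered pair of signatures
print(len(upper))   # N * (N - 1) / 2 pairs
```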




### Histograms

In [None]:
#Lists used for sns.boxplot
dist_list = []
cell_list = siginfo_sample.groupby('cell_iname')['tas'].median().sort_values().index #cell lines ordered by median TAS

lineage_list = []

#ncols=4
#nrows=2
subplot_size = 5
nrows, ncols = get_subplot_dimensions(len(cell_list))
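#squeeze=False keeps axes_dist two-dimensional even when the grid has a single row or column
fig_dist, axes_dist = plt.subplots(nrows, ncols, figsize=(subplot_size*ncols,subplot_size*nrows), squeeze=False)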


for i,cell in enumerate(cell_list):

  lineage = core_cellinfo.loc[
        core_cellinfo.cell_iname.eq(cell)
      ].cell_lineage.unique()[0]  #Lineage information for cell line

  cell_sig_ids = siginfo_sample.loc[
    siginfo_sample.cell_iname.eq(cell)
  ].sig_id.unique()  #Get sig_ids for cell line

  cell_corrs = corr_matrix.loc[cell_sig_ids, cell_sig_ids] #Extract correlations for cell line
#  cell_corrs = cluster_corr(cell_corrs) #Cluster cell correlations
  
  dist = get_off_diagonals(cell_corrs) #Extract off-diagonals
  
  dist_list.append(dist)
  lineage_list.append(lineage)

  sns.histplot(dist, 
               binwidth=0.1,
               ax=axes_dist[i // ncols][i % ncols])
  axes_dist[i // ncols][i % ncols].set_title(
    'Distribution of pairwise connections in {} \n({})'.format(
        cell, 
        lineage
      )
    )
  axes_dist[i // ncols][i % ncols].set_xlim(-1, 1)
  axes_dist[i // ncols][i % ncols].set_xlabel('Correlation')
  axes_dist[i // ncols][i % ncols].set_xticks([-1, -0.5, 0, 0.5, 1])

plt.tight_layout()
plt.show()

### Boxplots

In [None]:
plt.figure(figsize=(10,10))
colors_list = ['brown', 'red', 'orange', 'red', 'magenta', 'blue', 'cyan', 'green', 'green'] #manual coloring by cell_lineage

sns.boxplot(
    data=dist_list,
    palette=colors_list
    )

labels = ['{}\n({})'.format(cell_list[i], lineage_list[i]) for i in range(0,len(cell_list))]
plt.xticks(np.linspace(0,len(cell_list)-1, len(cell_list)), labels)

plt.ylabel('Pairwise Correlations')
plt.xlabel('Cell Line')
plt.ylim([-1,1])
plt.title('Comparison of {} concordance in Core Cell Lines'.format(moa))
plt.show()

### Heatmaps

#### Color Map Configuration

In [None]:
from matplotlib.colors import ListedColormap, LinearSegmentedColormap

# plt.get_cmap is used because cm.get_cmap has been removed from recent matplotlib releases
red_blue_map = plt.get_cmap('RdBu_r', 256)
tau_red_blue_90 = red_blue_map(np.linspace(0,1,200)) #sample 200 RGBA colors
white = np.array([1, 1, 1, 1])
tau_red_blue_90[70:130, :] = white #blank out the middle of the scale (weak correlations)
tau_red_blue_90 = ListedColormap(tau_red_blue_90)
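
The slice `70:130` determines which correlation values are drawn in white. Assuming the colormap is used with `vmin=-1` and `vmax=1`, as in the heatmaps in this notebook, each of the 200 colors covers an equal share of that range, and the band can be worked out as follows:

```python
import numpy as np

n_colors = 200                      # number of colors sampled from the base colormap
vmin, vmax = -1.0, 1.0              # limits assumed for the heatmap color scale
white_start, white_stop = 70, 130   # slice that is overwritten with white

# Color i covers the value interval from edges[i] to edges[i + 1]
edges = np.linspace(vmin, vmax, n_colors + 1)
lower, upper = edges[white_start], edges[white_stop]
print("white band from {:.2f} to {:.2f}".format(lower, upper))
```

Widening or narrowing the slice symmetrically around index 100 changes the magnitude below which correlations are hidden.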

#### Full Correlation Matrix

In [None]:
clustered_matrix = cluster_corr(sample_data_numerical.data_df.corr())

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(clustered_matrix, cmap=tau_red_blue_90, vmin=-1, vmax=1)
plt.show()
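
To see what the clustering step does to the row and column order, here is a small sketch on a made-up 4 x 4 correlation matrix. It follows the same sequence of SciPy calls as `cluster_corr` (pairwise distances, complete linkage, flat clusters at half the maximum distance), but on a NumPy array whose two groups are deliberately interleaved.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import pdist

# Toy correlation matrix with two interleaved groups: {0, 2} and {1, 3}
toy_corr = np.array([
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
])

distances = pdist(toy_corr)                           # distances between rows
linkage = sch.linkage(distances, method="complete")
cluster_ids = sch.fcluster(linkage, distances.max() / 2, criterion="distance")
order = np.argsort(cluster_ids, kind="stable")        # group rows by cluster label

reordered = toy_corr[order, :][:, order]
print(cluster_ids)
print(order)
print(reordered)
```

After reordering, members of the same group sit next to each other, which is what produces the block structure visible in the heatmap.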

#### Pairwise correlations by Cell Line

In [None]:
cell_list = siginfo_sample.groupby('cell_iname')['tas'].median().sort_values().index #sorted by median tas

nrows, ncols = get_subplot_dimensions(len(cell_list))
subplot_size = 5
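#squeeze=False keeps axes_hm two-dimensional even when the grid has a single row or column
fig_hm, axes_hm = plt.subplots(nrows, ncols, figsize=(subplot_size*ncols*1.4,subplot_size*nrows), squeeze=False)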

for i,cell in enumerate(cell_list):

  lineage = core_cellinfo.loc[
          core_cellinfo.cell_iname.eq(cell)
        ].cell_lineage.unique()[0]

  cell_siginfo = siginfo_sample.loc[
    siginfo_sample.cell_iname.eq(cell)
  ]

  cell_sig_ids = cell_siginfo.sig_id.unique()  
  
  cell_corrs = corr_matrix.loc[cell_sig_ids, cell_sig_ids]

  cell_corrs = cluster_corr(cell_corrs)

  #Reorder the metadata to match the clustered matrix so that labels line up with rows
  cell_siginfo = cell_siginfo.drop_duplicates('sig_id').set_index('sig_id').loc[cell_corrs.columns]
  
  sns.heatmap(cell_corrs,
              cmap=tau_red_blue_90, 
              vmin=-1, 
              vmax=1,
              xticklabels=False,
              yticklabels=cell_siginfo.cmap_name.to_list(), # Add compound names
              ax=axes_hm[i // ncols][i % ncols])
  axes_hm[i // ncols][i % ncols].set_xlabel('')
  axes_hm[i // ncols][i % ncols].set_ylabel('')
  axes_hm[i // ncols][i % ncols].set_title('Cell Line: {}\n Cell Lineage: {}'.format(cell, lineage))

plt.tight_layout()


# Analysis 2 - Gene Mod

## Functions

In [None]:
def genemod_histograms_by_cell(data, 
                               info,
                               cellinfo,
                               cell_order,
                               target,
                               level='level5', 
                               xlim=[-10, 10],
                               ylim=[0,10], 
                               metric_label='Normalized expression', 
                               plot_title=''):
    dist_list = []
    lineage_list = []

    subplot_size = 5
    nrows, ncols = get_subplot_dimensions(len(cell_order))
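    #squeeze=False keeps axes_dist two-dimensional even when the grid has a single row or column
    fig_dist, axes_dist = plt.subplots(nrows, ncols, figsize=(subplot_size*ncols,subplot_size*nrows), squeeze=False)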

    for i,cell in enumerate(cell_order):
        lineage = cellinfo.loc[
              cellinfo.cell_iname.eq(cell)
            ].cell_lineage.unique()[0]  #Lineage information for cell line

        if level == 'level5':
          cell_sig_ids = info.loc[
            info.cell_iname.eq(cell)
          ].sig_id.unique()  #Get sig_ids for cell line
        elif level == 'level3':
          cell_sig_ids = info.loc[
            info.cell_iname.eq(cell)
          ].sample_id.unique()  #Get sample_ids for cell line
        
        #Extract values for the target gene; target_gene_id is read from the notebook-level variable
        cell_genemod = data.loc[str(target_gene_id), cell_sig_ids]
  
        sns.histplot(cell_genemod, 
                    binwidth=0.1,
                    ax=axes_dist[i // ncols][i % ncols])
        
        axes_dist[i // ncols][i % ncols].set_title(
          '{} ({})'.format(
              cell, 
              lineage
            )
          )
        axes_dist[i // ncols][i % ncols].set_xlim(xlim)
        axes_dist[i // ncols][i % ncols].set_ylim(ylim)
        axes_dist[i // ncols][i % ncols].set_xlabel('{} for {}'.format(metric_label, target))

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.suptitle(plot_title, fontsize=15)
    plt.show()



def genemod_boxplot_by_cell(data, 
                            info,
                            cellinfo,
                            cell_order,
                            target,
                            colors_list = ['brown', 'red', 'orange', 'red', 'magenta', 'blue', 'cyan', 'green', 'green'],
                            level='level3', 
                            ylim=[0,14],
                            metric_label='Normalized expression',
                            plot_title=''):
    dist_list = []
    lineage_list = []

    for i,cell in enumerate(cell_order):
        lineage = cellinfo.loc[
              cellinfo.cell_iname.eq(cell)
            ].cell_lineage.unique()[0]  #Lineage information for cell line
        lineage_list.append(lineage)

        if level == 'level5':
          cell_sig_ids = info.loc[
            info.cell_iname.eq(cell)
          ].sig_id.unique()  #Get sig_ids for cell line
        elif level == 'level3':
          cell_sig_ids = info.loc[
            info.cell_iname.eq(cell)
          ].sample_id.unique()  #Get sample_ids for cell line
        
        #Extract values for the target gene; target_gene_id is read from the notebook-level variable
        cell_genemod = data.loc[str(target_gene_id), cell_sig_ids]
        dist_list.append(cell_genemod)

    plt.figure(figsize=(10,10))

    sns.boxplot(
        data=dist_list,
        palette=colors_list  #manual coloring by cell_lineage, taken from the colors_list argument
        )

    #Label boxes using the cell_order argument rather than the notebook-level cell_list
    labels = ['{}\n({})'.format(cell_order[i], lineage_list[i]) for i in range(0,len(cell_order))]
    plt.xticks(np.linspace(0,len(cell_order)-1, len(cell_order)), labels)

    plt.ylabel('{} {}'.format(target, metric_label))
    plt.xlabel('Cell Line')
    plt.ylim(ylim)
    plt.title(plot_title)
    plt.show()


## Level 5 Gene Mod

Take a look at the level 5 gene modulation of the target.

In [None]:
target = 'NR3C1'
target_info = cmap_query.cmap_genes(bq_client, gene_symbol='NR3C1')
target_gene_id = target_info.gene_id.item()

### Histograms

In [None]:
sample_data = sample_data_numerical.data_df

cell_sig_ids = siginfo_sample.loc[
  siginfo_sample.cell_iname.eq('MCF7')
].sig_id.unique()  #Get sig_ids for cell line

global_genemod = sample_data.loc[str(target_gene_id), :] #Extract target gene values across all signatures
sns.histplot(global_genemod, 
              binwidth=0.1).set_xlim(-10, 10)

plt.title('Global {} expression in core cell lines'.format(target))
plt.xlabel('Mod Z-Score for {}'.format(target))
plt.show()

In [None]:
genemod_histograms_by_cell(sample_data, 
                           siginfo_sample, 
                           core_cellinfo, 
                           cell_list,
                           target=target, 
                           level='level5',
                           xlim=[-5,5],
                           metric_label='Mod Z-Scores',
                           plot_title='Distribution of Mod Z-Score values in Glucocorticoid Receptor Agonist treated wells')

In [None]:
genemod_boxplot_by_cell(sample_data, 
                           siginfo_sample, 
                           core_cellinfo, 
                           cell_list,
                           target=target, 
                           level='level5',
                           metric_label='Mod Z-Scores', 
                           ylim=[-10, 10],
                           plot_title='Distribution of NR3C1 mod z-scores in Glucocorticoid Receptor Agonist treated wells')

## Comparison of Normalized Gene Expression (level 3) profiles 




Within the siginfo table, a few fields can be used to track down the level 3 and 4 profiles that were collapsed into the level 5 signature. 


1. First is `distil_ids`, which is a concatenation of `sample_id` values from the `instinfo` table. The individual `sample_id` values can be recovered by splitting on the `|` delimiter.

2. The `det_plates` field specifies the detection plate(s) associated with the profiles. In the `siginfo` table, this is often a concatenation of multiple values delimited by `|`. This can be useful for identifying control wells that come from the same plates as the treated wells.


In [None]:
siginfo_sample.sample(5)[['distil_ids', 'det_plates']]

In [None]:
distil_ids = siginfo_sample.apply(lambda row: row['distil_ids'].split('|'), axis=1)
sample_ids = [sample for id_list in distil_ids for sample in id_list]

In [None]:
print("Number of sig_ids: {}".format(len(siginfo_sample)))
print("Number of sample_ids: {}".format(len(sample_ids)))
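
An alternative way to split the delimited field, sketched here on made-up identifiers, is `str.split` followed by `explode`. Unlike a flat list, this keeps each `sample_id` next to the `sig_id` it came from, which can be convenient for joining back to signature-level metadata. The identifiers below only mimic the `|`-delimited layout and are not real CMap ids.

```python
import pandas as pd

toy_siginfo = pd.DataFrame({
    "sig_id": ["SIG_A", "SIG_B"],
    "distil_ids": ["PLATE1_X1:A01|PLATE1_X2:A01|PLATE1_X3:A01",
                   "PLATE2_X1:B05|PLATE2_X2:B05"],
})

# One row per (sig_id, sample_id) pair
toy_long = (toy_siginfo
            .assign(sample_id=toy_siginfo["distil_ids"].str.split("|"))
            .explode("sample_id")
            .drop(columns="distil_ids")
            .reset_index(drop=True))
print(toy_long)
```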

In [None]:
instinfo = cmap_query.cmap_profiles(bq_client, sample_id=sample_ids)

In [None]:
instinfo.sample(10)

### Get Level 3 data

In [None]:
level3_data = cmap_query.cmap_matrix(bq_client, data_level='level3', cid=instinfo.sample_id.to_list(), limit=len(instinfo))

In [None]:
level3_clustered_matrix = cluster_corr(level3_data.data_df.corr())

In [None]:
global_genemod = level3_data.data_df.loc[str(target_gene_id), :] #Extract target gene values across all profiles
sns.histplot(global_genemod, 
              binwidth=0.1)

plt.title('Global {} expression in core cell lines'.format(target))
plt.xlabel('Normalized expression for {}'.format(target))
plt.show()

### Histograms

In [None]:
genemod_histograms_by_cell(level3_data.data_df,
                           instinfo,
                           core_cellinfo,
                           cell_order=cell_list,
                           target=target,
                           level='level3',
                           xlim=[0, 14],
                           ylim=None,
                           metric_label='Normalized expression', 
                           plot_title='Distribution of normalized expression in Glucocorticoid Receptor Agonist treated wells')

### Boxplots

In [None]:
genemod_boxplot_by_cell(level3_data.data_df,
                           instinfo,
                           core_cellinfo,
                           cell_order=cell_list,
                           target=target,
                           level='level3',
                           metric_label='Normalized expression', 
                           plot_title='Distribution of normalized expression in Glucocorticoid Receptor Agonist treated wells')

## Comparison of Target Gene Expression in Control Wells

### Getting data for control wells within the same plate as treatments

In [None]:
print("Number of unique plates: {}".format(len(instinfo.det_plate.unique())))

In [None]:
types = [
         'ctl_vehicle', 
#         'ctl_untrt'
    ]

ctl_instinfo = cmap_query.cmap_profiles(bq_client, pert_type=types , det_plate=list(instinfo.det_plate.unique()))

print("Length of ctl instinfo: {}".format(len(ctl_instinfo)))
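
The selection logic of the query above (vehicle controls restricted to the plates that hold treated wells) can be sketched locally with pandas. The table below is made up; in particular, the `trt_cp` label for compound-treated wells is an assumption used for illustration.

```python
import pandas as pd

# Made-up sample metadata standing in for the instinfo table
toy_instinfo = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "det_plate": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "pert_type": ["trt_cp", "ctl_vehicle", "trt_cp", "ctl_vehicle",
                  "ctl_vehicle", "ctl_untrt"],
})

treated_plates = toy_instinfo.loc[
    toy_instinfo.pert_type.eq("trt_cp"), "det_plate"
].unique()

# Keep vehicle controls only from plates that also hold treated wells
toy_controls = toy_instinfo.loc[
    toy_instinfo.pert_type.isin(["ctl_vehicle"]) &
    toy_instinfo.det_plate.isin(treated_plates)
]
print(toy_controls)
```

The vehicle control on plate `P3` is excluded because that plate has no treated wells, which mirrors the purpose of passing `det_plate` to the query.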

In [None]:
ctl_instinfo.sample(10)

In [None]:
ctl_data = cmap_query.cmap_matrix(bq_client, data_level='level3', cid=list(ctl_instinfo.sample_id), limit=10000)

target_ctl_genemod = ctl_data.data_df.loc[str(target_gene_id), :]

### Histograms

In [None]:
sns.histplot(target_ctl_genemod, 
              binwidth=0.1)

plt.title('Distribution of normalized expression\n values for {} in core cell lines'.format(target))
plt.xlabel('Normalized expression values for {}'.format(target))
plt.show()

In [None]:
genemod_histograms_by_cell(ctl_data.data_df,
                           ctl_instinfo,
                           core_cellinfo,
                           cell_list,
                           target=target,
                           level='level3',
                           xlim=[0,14], ylim=None,
                           metric_label='normalized expression',
                           plot_title='Distribution of Normalized expression in Control Wells')

### Boxplots

In [None]:
genemod_boxplot_by_cell(ctl_data.data_df,
                           ctl_instinfo,
                           core_cellinfo,
                           cell_order=cell_list,
                           target=target,
                           level='level3',
                           metric_label='Normalized expression',
                           plot_title='Distribution of Normalized expression in Control Wells')