# Create Terra data tables

For this demonstration, we will use the image data, metadata, and [CellProfiler](https://cellprofiler.org/) pipelines from:

> [Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations](https://www.biorxiv.org/content/10.1101/2022.01.05.475090v1), Chandrasekaran et al., 2022

[Data tables](https://support.terra.bio/hc/en-us/articles/360025758392-Managing-data-with-tables-) define the collection of workflow instances to run. This notebook creates the Terra data table that supplies the workflow parameters for the transferred data. It takes less than a minute to run and creates:
* Data table "plate"

# Setup

In [None]:
import firecloud.api as fapi
from io import StringIO
import json
import pandas as pd
import os
import string

In [None]:
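# Use fully qualified option names; the short forms can be ambiguous in newer pandas releases.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)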

## Define constants

In [None]:
#---[ Buckets ]---
# This is the featured workspace bucket in https://app.terra.bio/#workspaces/bayer-pcl-cell-imaging/cellpainting
INPUT_BUCKET = 'gs://fc-e1e6b6ac-3d52-4041-964d-43ce9beb3352'
OUTPUT_BUCKET = os.getenv('WORKSPACE_BUCKET')

#---[ Inputs ]---
IMAGES = os.path.join(INPUT_BUCKET, 'source_4_images/images/2020_11_04_CPJUMP1/images/')
# Folder in the input bucket that holds the pe2loaddata configuration.
PE2LOADDATA_CONFIG = os.path.join(INPUT_BUCKET, 'pe2loaddata_config')
# Folder in the input bucket that holds the CellProfiler pipeline definition files.
CPPIPE_DEFINITIONS = os.path.join(INPUT_BUCKET, 'cellprofiler_pipelines')
# Folder in the input bucket that holds the plate maps.
PLATE_MAPS = os.path.join(INPUT_BUCKET, 'plate_maps')

#---[ Outputs ]---
CREATE_LOAD_DATA_RESULT_DESTINATION = os.path.join(OUTPUT_BUCKET, '0_create_load_data')
ILLUMINATION_CORRECTION_RESULT_DESTINATION = os.path.join(OUTPUT_BUCKET, '2_cp_illumination_pipeline')
ANALYSIS_RESULT_DESTINATION = os.path.join(OUTPUT_BUCKET, '3_cpd_analysis_pipeline')
CYTOMINING_RESULT_DESTINATION = os.path.join(OUTPUT_BUCKET, '4_cytomining')
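The constants above assume that `WORKSPACE_BUCKET` is set and that `os.path.join` produces forward slashes, which holds on the Linux runtime Terra provides. As a minimal sketch of a more defensive variant, the helpers below fail fast on a missing environment variable and join paths with `posixpath`. The names `require_env` and `gcs_join`, and the example bucket, are hypothetical and not part of the original notebook.

```python
import os
import posixpath


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value


def gcs_join(base, *parts):
    """Join path components with forward slashes regardless of the host OS."""
    return posixpath.join(base, *parts)


# Hypothetical bucket and plate ID, for illustration only.
example_destination = gcs_join('gs://example-bucket', '0_create_load_data', 'PLATE001')
print(example_destination)
```

In the notebook this could be used as `OUTPUT_BUCKET = require_env('WORKSPACE_BUCKET')`, so a missing variable surfaces immediately rather than as a `TypeError` inside `os.path.join`.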

# Create the "plate" Terra data table

Create a Terra data table holding the parameters for the CellProfiler workflows.

See also [Managing data with tables](https://support.terra.bio/hc/en-us/articles/360025758392-Managing-data-with-tables-).
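To make the table format concrete, the sketch below builds a one-row, tab-separated load file with the standard library only. The first column header carries the `entity:` prefix and `_id` suffix that Terra expects. The plate ID and bucket paths are hypothetical placeholders, not values from the dataset.

```python
import csv
import io

# One row per workflow instance; all values here are illustrative placeholders.
rows = [
    {
        'entity:plate_id': 'PLATE001',
        'images': 'gs://example-bucket/images/PLATE001/Images/',
        'config': 'gs://example-bucket/pe2loaddata_config/config.yml',
    }
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(rows[0]), delimiter='\t', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

load_file = buffer.getvalue()
print(load_file)
```

The cells that follow build the same kind of table with pandas, one column at a time.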

In [None]:
plates = !gsutil ls {IMAGES}* | grep Images

plates

In [None]:
plate_ids = [plate.replace(IMAGES, '').split('_')[0] for plate in plates]

plate_ids
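The plate ID is taken as the text before the first underscore in each listed folder name. The sketch below shows that step on a hypothetical listing and adds a uniqueness check, since the IDs become the table's row keys. The folder names, bucket, and the helper `plate_id_from_path` are illustrative assumptions.

```python
# Hypothetical prefix and listing, standing in for IMAGES and the gsutil output.
images_prefix = 'gs://example-bucket/images/'
listing = [
    'gs://example-bucket/images/PLATE001__2020-01-01T00_00_00-Measurement1/Images/',
    'gs://example-bucket/images/PLATE002__2020-01-02T00_00_00-Measurement1/Images/',
]


def plate_id_from_path(path, prefix):
    """Strip the prefix and keep everything before the first underscore."""
    if not path.startswith(prefix):
        raise ValueError(f'{path!r} does not start with {prefix!r}')
    return path[len(prefix):].split('_')[0]


example_ids = [plate_id_from_path(path, images_prefix) for path in listing]

# Entity IDs must be unique, otherwise later rows overwrite earlier ones on upload.
if len(set(example_ids)) != len(example_ids):
    raise ValueError('Duplicate plate IDs found')

print(example_ids)
```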

In [None]:
create_load_data_result_destinations = [os.path.join(CREATE_LOAD_DATA_RESULT_DESTINATION, plate_id) for plate_id in plate_ids]

create_load_data_result_destinations

In [None]:
illumination_correction_result_destinations = [os.path.join(ILLUMINATION_CORRECTION_RESULT_DESTINATION, plate_id) for plate_id in plate_ids]

illumination_correction_result_destinations

In [None]:
analysis_result_destinations = [os.path.join(ANALYSIS_RESULT_DESTINATION, plate_id) for plate_id in plate_ids]

analysis_result_destinations

In [None]:
cytomining_result_destinations = [os.path.join(CYTOMINING_RESULT_DESTINATION, plate_id) for plate_id in plate_ids]

cytomining_result_destinations

In [None]:
df = pd.DataFrame(data={
    'entity:plate_id': plate_ids, # Terra requires the 'entity:' prefix and the '_id' suffix.
    'images': plates,
    'create_load_data_result_destination': create_load_data_result_destinations,
    'illumination_correction_result_destination': illumination_correction_result_destinations,
    'analysis_result_destination': analysis_result_destinations,
    # Column name spelling kept as-is; existing workflow configurations may reference it.
    'cytoming_result_destination': cytomining_result_destinations
})

df

In [None]:
# This is the correct pe2loaddata configuration file for all four plates.
df['config'] = os.path.join(PE2LOADDATA_CONFIG, 'chandrasekaran_config.yml')

In [None]:
# This is the illumination correction CellProfiler pipeline to use for all four plates.
df['illum_cppipe'] = os.path.join(CPPIPE_DEFINITIONS, 'illum_without_batchfile.cppipe')

In [None]:
# This is the analysis CellProfiler pipeline to use for all four plates.
df['analysis_cppipe'] = os.path.join(CPPIPE_DEFINITIONS, 'CPJUMP1_analysis_without_batchfile_406.cppipe')

In [None]:
# From experiment-metadata.tsv this is the correct platemap for all four plates.
# See https://github.com/jump-cellpainting/2021_Chandrasekaran_submitted/blob/main/benchmark/output/experiment-metadata.tsv
df['plate_map'] = os.path.join(PLATE_MAPS, 'JUMP-Target-1_compound_platemap.tsv')

In [None]:
df
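Before uploading, it can help to sanity-check the tab-separated text that will be sent. The sketch below is one possible check, not part of the original notebook: it verifies the `entity:<type>_id` header, consistent column counts, and unique, non-empty IDs. The function name `validate_load_file` and the sample data are hypothetical.

```python
import csv
import io


def validate_load_file(tsv_text):
    """Return a list of problems found in a load file (empty list if none)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')
    header = next(reader, None)
    if not header:
        return ['file is empty']

    problems = []
    id_column = header[0]
    if not (id_column.startswith('entity:') and id_column.endswith('_id')):
        problems.append(f'first column {id_column!r} should look like entity:<type>_id')

    seen = set()
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f'line {line_number}: expected {len(header)} columns, got {len(row)}')
            continue
        entity_id = row[0]
        if not entity_id:
            problems.append(f'line {line_number}: empty entity ID')
        elif entity_id in seen:
            problems.append(f'line {line_number}: duplicate entity ID {entity_id!r}')
        seen.add(entity_id)
    return problems


# Illustrative inputs only.
good = 'entity:plate_id\timages\nPLATE001\tgs://example-bucket/a/\nPLATE002\tgs://example-bucket/b/\n'
bad = 'plate\timages\nPLATE001\tgs://example-bucket/a/\nPLATE001\tgs://example-bucket/b/\n'
print(validate_load_file(good))
print(validate_load_file(bad))
```

In this notebook the check could be applied as `validate_load_file(df.to_csv(sep='\t', index=False))`.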

In [None]:
response = fapi.upload_entities(
    namespace=os.getenv('WORKSPACE_NAMESPACE'),
    workspace=os.getenv('WORKSPACE_NAME'),
    entity_data=df.to_csv(path_or_buf=None, sep='\t', index=False),
    model='flexible')

response

In [None]:
response.content
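Displaying the response shows whether the upload worked, but a failed upload does not stop the notebook. As a sketch, assuming the returned object exposes `status_code` and `text` like a `requests.Response`, the hypothetical helper below raises on any non-2xx status. The fake responses exist only so the example runs without network access.

```python
from types import SimpleNamespace


def describe_upload(response):
    """Return a short success message, or raise if the status is not 2xx."""
    if 200 <= response.status_code < 300:
        return f'upload succeeded (HTTP {response.status_code})'
    raise RuntimeError(f'upload failed (HTTP {response.status_code}): {response.text}')


# Stand-ins for real responses, for illustration only.
fake_ok = SimpleNamespace(status_code=200, text='plate')
fake_error = SimpleNamespace(status_code=400, text='bad header')

print(describe_upload(fake_ok))
```

Calling `describe_upload(response)` after the upload cell would make a rejected table fail loudly instead of passing unnoticed.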

# Provenance

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze

Copyright 2022 The Broad Institute, Inc. and Verily Life Sciences LLC.

Use of this source code is governed by a BSD-style license that can be found in the LICENSE file or at https://developers.google.com/open-source/licenses/bsd