# Terra Data Repository on Azure Year 1 Demo

This notebook runs through the following steps:
- Authenticate using B2C
- Create an Azure *billing profile* in TDR
- Create a *dataset*
- Ingest 1000 Genomes data into the *dataset*
- Create a *snapshot* from the *dataset*
- Read the metadata from the *snapshot* into a Pandas data frame
- Read a DRS object from the metadata and use it to access file data


## Import dependencies

In [1]:
%%capture
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install --upgrade data_repo_client
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install fastparquet

import pandas as pd
import datetime, uuid, urllib.request, os, time
from tdr_utils import TdrUtils
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
from data_repo_client import ApiClient, ApiException, Configuration, DatasetsApi, SnapshotsApi, JobsApi, ResourcesApi, DataRepositoryServiceApi

## Set configuration

Obtain a JWT by going through the auth flow here:

https://terradevb2c.b2clogin.com/terradevb2c.onmicrosoft.com/b2c_1a_signup_signin_tdr/oauth2/v2.0/authorize?client_id=bbd07d43-01cb-4b69-8fd0-5746d9a5c9fe&nonce=defaultNonce&redirect_uri=https%3A%2F%2Fjwt.ms&scope=openid&response_type=id_token&prompt=login

Save the JWT in the `token` variable in the next cell.
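
JWTs issued by this flow are short-lived, so an expired token is a common reason for later calls to fail. The following is a minimal sketch, using only the standard library, of how the token's claims could be inspected before it is used. The helper names `jwt_claims` and `seconds_until_expiry` are hypothetical and not part of `tdr_utils`. The signature is not verified; this only reads the payload.

```python
import base64
import json
import time


def jwt_claims(jwt: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = jwt.split(".")[1]
    # base64url segments omit padding; restore it before decoding
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(jwt: str, now=None) -> float:
    """Return the number of seconds left before the token's `exp` claim."""
    now = time.time() if now is None else now
    return jwt_claims(jwt)["exp"] - now
```

A negative result from `seconds_until_expiry(token)` would indicate that a fresh JWT is needed.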

In [10]:
# Set up configuration
config = Configuration()
config.host = "http://localhost:8080"
# Paste in the JWT obtained via the auth link above
token = ""
config.access_token = token
api_client = ApiClient(configuration=config)
api_client.client_side_validation = False

# Obtain a SAS token for the folder that contains the source metadata to ingest
ingest_sas_token = ""

# Azure managed application configuration. These values are obtained from the Azure portal
application_deployment_name = "michael"
resource_group_name = "TDR"
subscription_id = "71d52ec1-5886-480a-9d6e-ed98cbf1f69f"
tenant_id = "efc08443-0082-4d6c-8931-c5794c156abd"

# Enter an existing billing profile ID, or leave blank to generate a new ID when the billing profile is created
billing_profile_id = "2444825a-733c-4306-ba16-c0f2423dc63e"

ingest_file_base = "https://tdrtestdatauscentral.blob.core.windows.net/1000genomes/metadata"

local_parquet_dir = "/tmp/az"

# Create required API Clients
jobs_api = JobsApi(api_client=api_client)
resources_api = ResourcesApi(api_client=api_client)
datasets_api = DatasetsApi(api_client=api_client)
snapshots_api = SnapshotsApi(api_client=api_client)
drs_api = DataRepositoryServiceApi(api_client=api_client)
tdr_utils = TdrUtils(jobs_api)
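
Every long-running call in this notebook is wrapped in `tdr_utils.wait_for_job`, whose implementation is not shown here. The following is only a sketch of what such a polling helper might look like. It assumes that the jobs API exposes `retrieve_job` and `retrieve_job_result`, and that a job model carries `id` and `job_status` attributes with the values `"running"` and `"succeeded"`; these are assumptions, not confirmed behavior of `TdrUtils`.

```python
import time


def wait_for_job(jobs_api, job, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll a job until it leaves the 'running' state, then return its result.

    Assumes `jobs_api.retrieve_job(id)` returns an object with `id` and
    `job_status`, and `jobs_api.retrieve_job_result(id)` returns the result.
    """
    deadline = time.monotonic() + timeout_seconds
    while job.job_status == "running":
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job.id} still running after {timeout_seconds}s")
        time.sleep(poll_seconds)
        job = jobs_api.retrieve_job(job.id)
    if job.job_status != "succeeded":
        raise RuntimeError(f"job {job.id} ended with status {job.job_status}")
    return jobs_api.retrieve_job_result(job.id)
```

Raising on a non-successful status, rather than returning the exception object, keeps failures visible at the cell that caused them.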

## Create the Terra Data Repo billing profile
This checks the currently logged-in user against the user specified when the TDR managed application was deployed.  This step can be skipped if TDR already has a billing profile that you wish to use; in that case, record its ID in the `billing_profile_id` variable.

In [11]:
# Create Billing Profile
if not billing_profile_id:
    billing_profile_id = str(uuid.uuid4())

profile_request = {
  "id": billing_profile_id,
  "biller": "direct",
  "description": "Billing profile that demonstrates use of Azure resources within TDR",
  "profileName": "azureprofile-michael",
  "cloudPlatform": "azure",
  "applicationDeploymentName": application_deployment_name,
  "resourceGroupName": resource_group_name,
  "subscriptionId": subscription_id,
  "tenantId": tenant_id
}


create_profile_result = tdr_utils.wait_for_job(resources_api.create_profile(billing_profile_request=profile_request))


## Create Dataset
This step creates a storage account in the managed resource group of the TDR managed application that an Azure user deployed through the Azure portal.  The storage account stores:
- Arbitrary table data as Parquet files
- Files that are ingested into the dataset
- Metadata for the files stored in Azure Storage Tables
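
The dataset request that follows spells out every column as a dictionary with `name`, `datatype`, and `array_of` keys. As an optional convenience, the column list could be generated from a compact mapping. This is a sketch only; `make_columns` is a hypothetical helper, and it assumes every column is a scalar (`array_of` is `False`), as is the case in this demo schema.

```python
def make_columns(datatypes: dict) -> list:
    """Expand a {column name: datatype} mapping into TDR column definitions."""
    return [
        {"name": name, "datatype": datatype, "array_of": False}
        for name, datatype in datatypes.items()
    ]


# A subset of the demo_pheno_data columns, in declaration order
columns = make_columns({
    "pheno_data_id": "string",
    "age": "integer",
    "bmi_baseline": "float",
    "bam_file": "fileref",
})
```

The resulting list could be assigned to the `"columns"` key of a table definition in place of the hand-written entries.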

In [7]:
dataset_request = {
  "defaultProfileId": billing_profile_id,
  "name": "1000genomesdataset_azure",
  "description": "Dataset to demonstrate storing TDR data and metadata in Azure resources",
  "cloudPlatform": "azure",
  "region": "westus2",
  "schema": {
    "tables": [
      {
        "name": "demo_pheno_data",
        "columns": [
          {
            "name": "pheno_data_id",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "age",
            "datatype": "integer",
            "array_of": False
          },
          {
            "name": "bmi_baseline",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "dbgap_accession_number",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "height_baseline",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "ldl",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "hdl",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "population",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "program_name",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "sample_specimen_id",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "sex",
            "datatype": "string",
            "array_of": False
          },
          {
            "name": "total_cholesterol",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "triglycerides",
            "datatype": "float",
            "array_of": False
          },
          {
            "name": "bam_file",
            "datatype": "fileref",
            "array_of": False
          },
          {
            "name": "bam_file_index",
            "datatype": "fileref",
            "array_of": False
          }
        ],
        "primaryKey": []
      }
    ],
    "assets": [
      {
        "name": "default",
        "tables": [
          {
            "name": "demo_pheno_data",
            "columns": []
          }
        ],
        "rootTable": "demo_pheno_data",
        "rootColumn": "pheno_data_id",
        "follow": []
      }
    ]
  }
}

create_dataset_result = tdr_utils.wait_for_job(datasets_api.create_dataset(dataset=dataset_request))


Retrieve the full details of the dataset that was just created.

In [None]:
dataset = datasets_api.retrieve_dataset(create_dataset_result['id'])
tdr_utils.pretty_print_tdr_object(dataset)

## Ingest Pedigree Data
Ingest metadata and file data into TDR.  This uses Azure Synapse to convert metadata stored in a storage account, as either CSV or newline-delimited JSON, into Parquet files.

If the input metadata is JSON, `fileref` fields can be specified as pointers to storage account files, which are then copied into the dataset's storage account.

An example metadata record may look like:

```{"pheno_data_id":"HG00096","age":75,"bam_file": {"description":"my bam file","mimeType":"application/octet-stream","sourcePath": "https://myaccount.blob.core.windows.net/mycontainer/path/to/blob.bam?<sas token>","targetPath":"/my/custom/path/file.bam"}}```
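
Records of this shape could be generated programmatically and written out one JSON object per line. The sketch below assumes the `fileref` keys shown in the example record above; `file_ref` and `to_ndjson` are hypothetical helper names, and the source URL in the usage note is a placeholder.

```python
import json


def file_ref(source_path, target_path, description="",
             mime_type="application/octet-stream"):
    """Build a fileref value using the keys from the example record."""
    return {
        "description": description,
        "mimeType": mime_type,
        "sourcePath": source_path,
        "targetPath": target_path,
    }


def to_ndjson(records):
    """Serialize records as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(record) for record in records) + "\n"
```

For example, `to_ndjson([{"pheno_data_id": "HG00096", "age": 75, "bam_file": file_ref("https://myaccount.blob.core.windows.net/mycontainer/path/to/blob.bam?<sas token>", "/my/custom/path/file.bam", "my bam file")}])` would produce a single-line record equivalent to the one above.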

In [None]:
ingest_request = {
  "format": "json",
  "ignore_unknown_values": True,
  "load_tag": "smallload",
  "max_bad_records": 0,
  "path": ingest_file_base + "/demo-pheno-data-small.json?" + ingest_sas_token,
  "profile_id": billing_profile_id,
  "resolve_existing_files": True,
  "table": "demo_pheno_data"
}
ingest_request_result = tdr_utils.wait_for_job(datasets_api.ingest_dataset(dataset.id, ingest=ingest_request))

## Read the Ingested Metadata

The following cells obtain access to the controlled storage account, download the metadata, and view it with Pandas.

In [None]:
dataset_access = datasets_api.retrieve_dataset(dataset.id, include=["ACCESS_INFORMATION"])
# Select the access information for the demo_pheno_data table by name
table = next(t for t in dataset_access.access_information.parquet.tables if t.name == "demo_pheno_data")
tdr_utils.pretty_print_tdr_object(table)

In [None]:
os.system("rm -r -f %s/%s" % (local_parquet_dir, table.name))
os.system("azcopy cp '%s?%s' '%s' --recursive > /dev/null" % (table.url, table.sas_token, local_parquet_dir))
df = pd.read_parquet("%s/%s" % (local_parquet_dir, table.name))
# Convert the UUID from binary to readable UUID
df["datarepo_row_id"] = df["datarepo_row_id"].apply(tdr_utils.UUID)
df.head(5)
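
The `tdr_utils.UUID` helper used above is not shown in this notebook. Assuming that `datarepo_row_id` is stored as the 16 raw bytes of a UUID, a conversion of this kind could be sketched with the standard library as follows; `bytes_to_uuid` is a hypothetical name.

```python
import uuid


def bytes_to_uuid(raw) -> str:
    """Convert 16 raw bytes into the canonical hyphenated UUID string."""
    return str(uuid.UUID(bytes=bytes(raw)))
```

Under that assumption, `df["datarepo_row_id"].apply(bytes_to_uuid)` would yield the same kind of readable column.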

## Create a Snapshot
Snapshots are the mechanism by which datasets are shared with users.  The process:
- Triggers the creation of a storage account to store metadata related to the snapshot (note: file data is _not_ copied into the snapshot storage account)
- Copies relevant metadata and file information into the new storage account
- Resolves DRS URIs

In [None]:
snapshot_request = {
  "contents": [
    {
      "datasetName": dataset.name,
      "mode": "byFullView"
    }
  ],
  "description": "Demonstration of Azure resource backed snapshot",
  "name": dataset.name+"_snp",
  "profileId": billing_profile_id,
  "readers": []
}

create_snapshot_result = tdr_utils.wait_for_job(snapshots_api.create_snapshot(snapshot=snapshot_request))

## Read the Generated Snapshot Metadata

In [None]:
snapshot = snapshots_api.retrieve_snapshot(create_snapshot_result['id'], include=["ACCESS_INFORMATION"])
# Select the access information for the demo_pheno_data table by name
table = next(t for t in snapshot.access_information.parquet.tables if t.name == "demo_pheno_data")
tdr_utils.pretty_print_tdr_object(table)

In [None]:
pd.set_option('display.max_colwidth', 1000)
os.system("rm -r  -f %s/%s.parquet" % (local_parquet_dir, table.name))
os.system("azcopy cp '%s?%s' '%s' --recursive > /dev/null" % (table.url, table.sas_token, local_parquet_dir))
df = pd.read_parquet("%s/%s.parquet" % (local_parquet_dir, table.name))
# Convert the UUID from binary to readable UUID
df["datarepo_row_id"] = df["datarepo_row_id"].apply(tdr_utils.UUID)
df.head(5)

## Share the Snapshot
Below is an example of how to share the created snapshot with a Terra user.

In [None]:
policy_member_request = {
    "email": "rtitle@azure.dev.envs-terra.bio"
}
tdr_utils.pretty_print_tdr_object(snapshots_api.add_snapshot_policy_member(snapshot.id, 'steward', policy_member = policy_member_request))


In [None]:
policy_member_request = {
    "email": "jdewar.broad.testing@gmail.com"
}
tdr_utils.pretty_print_tdr_object(snapshots_api.add_snapshot_policy_member(snapshot.id, 'steward', policy_member = policy_member_request))

## Given a DRS ID, access file data
From the snapshot metadata displayed earlier, copy a DRS URI, extract the object ID (in the format `v1_<uuid>_<uuid>`), and save it in the `drs_id` variable.

The following cell obtains a signed URL from TDR using a DRS ID then reads the first few bytes.
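
Extracting the object ID by hand is error-prone, so it could be automated. The sketch below assumes that the DRS URI has the form `drs://<host>/<object id>`, with the object ID as the last path segment; `drs_object_id` is a hypothetical helper and the host in the usage note is a placeholder.

```python
from urllib.parse import urlparse


def drs_object_id(drs_uri: str) -> str:
    """Return the object ID (last path segment) of a drs:// URI."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri!r}")
    object_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if not object_id:
        raise ValueError(f"no object ID in DRS URI: {drs_uri!r}")
    return object_id
```

For instance, `drs_object_id("drs://example.org/v1_<uuid>_<uuid>")` would return the `v1_<uuid>_<uuid>` portion, which can then be assigned to `drs_id`.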

In [None]:
# Get access to DRS object
drs_id = ""
drs_object = drs_api.get_object(drs_id)
drs_access = drs_api.get_access_url(drs_id, drs_object.access_methods[0].access_id)
tdr_utils.pretty_print_tdr_object(drs_access)

## Read the first 100 bytes of the file represented by DRS ID

In [None]:
# Read only the first 100 bytes of the file represented by the DRS ID
print(urllib.request.urlopen(drs_access.url).read(100))
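
For large files such as BAMs, a prefix can also be requested explicitly with an HTTP `Range` header, so that the server is asked for only the bytes needed. This is a sketch that assumes the signed URL's server honors range requests; `range_request` and `read_first_bytes` are hypothetical helper names.

```python
import urllib.request


def range_request(url: str, num_bytes: int) -> urllib.request.Request:
    """Build a GET request asking for only the first `num_bytes` bytes."""
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{num_bytes - 1}"})


def read_first_bytes(url: str, num_bytes: int = 100) -> bytes:
    """Fetch at most `num_bytes` bytes from the start of the resource."""
    with urllib.request.urlopen(range_request(url, num_bytes)) as response:
        # read(num_bytes) caps the result even if the server ignores Range
        return response.read(num_bytes)
```

With the objects from the cell above, this would be called as `read_first_bytes(drs_access.url, 100)`.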

## Delete Objects That Were Created
Run the following cells, in order, to delete the objects that were created in this notebook.

In [None]:
tdr_utils.wait_for_job(snapshots_api.delete_snapshot(snapshot.id))

In [None]:
tdr_utils.wait_for_job(datasets_api.delete_dataset(dataset.id))

In [None]:
# Delete the billing profile used in this notebook
tdr_utils.wait_for_job(resources_api.delete_profile(billing_profile_id))