# Assignment 3.1: Impacting the Business with a Distributed Data Science Pipeline (Part 2)

### Sources

The Beat Acute Myeloid Leukemia (AML) 1.0 dataset was accessed on 13 March 2023 from https://registry.opendata.aws/beataml. OHSU BeatAML datasets link: https://ctd2-data.nci.nih.gov/Public/OHSU-1/BeatAML_Waves1_2/

OpenCell Datasets Link: https://opencell.czbiohub.org/download

### Datasets S3 Location

The raw datasets are imported from AWS S3. Use the AWS Command Line Interface (CLI) to list the contents of the S3 bucket:

In [2]:
!aws s3 ls s3://team4rawdatasets/

                           PRE CSV/
                           PRE Output/
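
The same listing can be produced from Python. The following is a minimal sketch, assuming a boto3 S3 client and credentials with read access to the bucket; the helper name `list_top_level_prefixes` is hypothetical and not part of the original notebook. A single `list_objects_v2` call returns at most 1,000 entries, which is enough for a top-level listing like the one above.

```python
def list_top_level_prefixes(s3_client, bucket):
    """Return the top-level 'folder' prefixes in a bucket, similar to `aws s3 ls`."""
    # Delimiter="/" asks S3 to group keys by their first path segment.
    response = s3_client.list_objects_v2(Bucket=bucket, Delimiter="/")
    # "CommonPrefixes" is absent when the bucket has no grouped keys.
    return [entry["Prefix"] for entry in response.get("CommonPrefixes", [])]


# Example usage (requires AWS credentials):
#   import boto3
#   list_top_level_prefixes(boto3.client("s3"), "team4rawdatasets")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no AWS access is available.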


# Check Pre-Requisites from the 01_setup/ Folder

In [3]:
%store -r setup_instance_check_passed

In [4]:
try:
    setup_instance_check_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Instance Check.")
    print("+++++++++++++++++++++++++++++++")

In [5]:
print(setup_instance_check_passed)

True


In [6]:
%store -r setup_dependencies_passed

In [7]:
try:
    setup_dependencies_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++")

In [8]:
print(setup_dependencies_passed)

True


In [9]:
%store -r setup_s3_bucket_passed

In [10]:
try:
    setup_s3_bucket_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++")

In [11]:
print(setup_s3_bucket_passed)

True


In [12]:
%store -r setup_iam_roles_passed

In [13]:
try:
    setup_iam_roles_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++")

In [14]:
print(setup_iam_roles_passed)

True


In [15]:
if not setup_instance_check_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Instance Check.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_dependencies_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_s3_bucket_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_iam_roles_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
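
The four checks above follow the same pattern, so they could be collapsed into one loop. This is only a sketch: `REQUIRED_SETUP_FLAGS` and `missing_setup_steps` are hypothetical names, and the helper treats a flag that is missing from the namespace the same as one that is `False`.

```python
# Flag variable restored by %store -r  ->  label used in the error message.
REQUIRED_SETUP_FLAGS = {
    "setup_instance_check_passed": "Instance Check",
    "setup_dependencies_passed": "Setup Dependencies",
    "setup_s3_bucket_passed": "Setup S3 Bucket",
    "setup_iam_roles_passed": "Setup IAM Roles",
}


def missing_setup_steps(namespace, required=REQUIRED_SETUP_FLAGS):
    """Return labels of setup steps whose flag is absent or falsy in `namespace`."""
    return [label for flag, label in required.items() if not namespace.get(flag)]


# In the notebook, pass globals() after running the %store -r cells.
for step in missing_setup_steps(globals()):
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. "
          f"You are missing {step}.")
```

When every flag is `True`, the loop prints nothing, matching the behavior of the individual `if not ...` checks.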

In [16]:
import boto3
import sagemaker
import pandas as pd

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

# Set S3 Source Location (Public S3 Bucket)

In [17]:
s3_clsm_csv = "s3://team4rawdatasets/CSV/Input/OHSU_BeatAML_ClinicalSummary/OHSU_BeatAMLWaves1_2_Tyner_ClinicalSummary.csv"

In [19]:
s3_pi_csv = "s3://team4rawdatasets/CSV/Input/OpenCell_ProteinInteraction/opencell-protein-interactions.csv"

In [21]:
%store s3_clsm_csv
%store s3_pi_csv

Stored 's3_clsm_csv' (str)
Stored 's3_pi_csv' (str)
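
Later steps often need the bucket name and object key separately rather than the full URI. A minimal sketch using only the standard library follows; `split_s3_uri` is a hypothetical helper, and the result is stored in `bucket_name` rather than `bucket` so the SageMaker default bucket defined earlier is not overwritten.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri(
    "s3://team4rawdatasets/CSV/Input/OpenCell_ProteinInteraction/"
    "opencell-protein-interactions.csv"
)
print(bucket_name)
print(key)
```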


In [22]:
%store

Stored variables and their in-db values:
balanced_bias_data_jsonlines_s3_uri             -> 's3://sagemaker-us-east-1-424823189023/bias-detect
balanced_bias_data_s3_uri                       -> 's3://sagemaker-us-east-1-424823189023/bias-detect
bias_data_s3_uri                                -> 's3://sagemaker-us-east-1-424823189023/bias-detect
ingest_create_athena_db_passed                  -> True
s3_clsm_csv                                     -> 's3://team4rawdatasets/CSV/Input/OHSU_BeatAML_Clin
s3_path_csv                                     -> 's3://team4rawdatasets/CSV'
s3_pi_csv                                       -> 's3://team4rawdatasets/CSV/Input/OpenCell_ProteinI
s3_public_path_csv                              -> 's3://team4rawdatasets/CSV'
setup_dependencies_passed                       -> True
setup_iam_roles_passed                          -> True
setup_instance_check_passed                     -> True
setup_s3_bucket_passed                          -> True


# Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.

In [23]:
#!pip install --disable-pip-version-check -q PyAthena==2.1.0
from pyathena import connect
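
Athena writes query results to an S3 location, so a PyAthena connection needs a staging directory in addition to the region. The sketch below shows one way the connection might be used; the `athena/staging` prefix and both helper names are assumptions, not values taken from this notebook.

```python
def athena_staging_dir(bucket):
    """Hypothetical helper: S3 prefix where Athena writes query results."""
    return f"s3://{bucket}/athena/staging"


def run_athena_query(statement, bucket, region):
    """Run a SQL statement on Athena and return the rows as a list of tuples."""
    # Imported lazily so athena_staging_dir works without PyAthena installed.
    from pyathena import connect

    conn = connect(s3_staging_dir=athena_staging_dir(bucket), region_name=region)
    try:
        cursor = conn.cursor()
        cursor.execute(statement)
        return cursor.fetchall()
    finally:
        conn.close()


# Example usage (requires AWS credentials and an existing Athena database):
#   rows = run_athena_query("SHOW DATABASES", bucket, region)
```

Here `bucket` and `region` would be the values obtained from the SageMaker session earlier in the notebook.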

# Athena Database and Table

In [24]:
# Set Athena parameters
database_name = "bcr"
table_clsm = "ohsu_beataml_clinicalsummary"
table_pi = "opencell_proteininteraction"

In [25]:
# Retrieve datasets
!aws s3 cp 's3://team4rawdatasets/CSV/Input/OHSU_BeatAML_ClinicalSummary/OHSU_BeatAMLWaves1_2_Tyner_ClinicalSummary.csv' ./data/
!aws s3 cp 's3://team4rawdatasets/CSV/Input/OpenCell_ProteinInteraction/opencell-protein-interactions.csv' ./data/

download: s3://team4rawdatasets/CSV/Input/OHSU_BeatAML_ClinicalSummary/OHSU_BeatAMLWaves1_2_Tyner_ClinicalSummary.csv to data/OHSU_BeatAMLWaves1_2_Tyner_ClinicalSummary.csv
download: s3://team4rawdatasets/CSV/Input/OpenCell_ProteinInteraction/opencell-protein-interactions.csv to data/opencell-protein-interactions.csv
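
After the download, it can help to confirm that each file is readable and has the expected header. The following sketch uses the standard-library `csv` module (`pandas.read_csv` would work equally well); `preview_csv` is a hypothetical helper, and UTF-8 with replacement of undecodable bytes is an assumption about the file encoding.

```python
import csv
from pathlib import Path


def preview_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a CSV file."""
    with Path(path).open(newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # zip stops after n_rows without reading the rest of the file.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


for name in (
    "OHSU_BeatAMLWaves1_2_Tyner_ClinicalSummary.csv",
    "opencell-protein-interactions.csv",
):
    path = Path("data") / name
    if path.exists():
        header, rows = preview_csv(path)
        print(f"{name}: {len(header)} columns")
    else:
        print(f"{name}: not found in ./data/")
```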


In [32]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [33]:
# SQL statement
statement = """
SELECT *
FROM {}.{}
""".format(
    database_name, table_clsm
)

print(statement)


SELECT *
FROM bcr.ohsu_beataml_clinicalsummary
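
Because the statement is assembled by string formatting, a small builder that validates identifiers before interpolating them can guard against typos and malformed names. This is a sketch under stated assumptions: `build_select` is a hypothetical helper, and the identifier pattern (letters, digits, underscores) is a conservative choice rather than Athena's full naming rule.

```python
import re

# Conservative identifier pattern; an assumption, not Athena's complete grammar.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_select(database, table, columns=("*",), limit=None):
    """Build a SELECT statement after validating every interpolated name."""
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    for column in columns:
        if column != "*" and not _IDENTIFIER.match(column):
            raise ValueError(f"Invalid column name: {column!r}")
    statement = f"SELECT {', '.join(columns)}\nFROM {database}.{table}"
    if limit is not None:
        # int() rejects anything that is not a whole number.
        statement += f"\nLIMIT {int(limit)}"
    return statement


print(build_select("bcr", "ohsu_beataml_clinicalsummary", limit=5))
```

Adding a `LIMIT` while exploring keeps the amount of data scanned and returned small.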



# Release Resources

In [26]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [2]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}
