## <font color='black'> **AutoML Solubility Data Example** </font>
Based on the example in this Jupyter notebook:
https://notebook.community/GoogleCloudPlatform/python-docs-samples/tables/automl/notebooks/census_income_prediction/getting_started_notebook

After you download the solubility data from Kaggle to your local drive, go to AI Platform Notebooks. Select "Upload Files" in the top-left menu, then follow the steps below.

In [13]:
# Use the latest major GA version of the framework.
! pip install --upgrade --quiet --user google-cloud-automl

In [20]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from google.cloud import automl_v1beta1 as automl

import matplotlib.pyplot as plt
from ipywidgets import interact
import ipywidgets as widgets

In [4]:
# Read GCP project id from env.
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
PROJECT_ID=shell_output[0]
print("GCP project ID:" + PROJECT_ID)

from google.cloud.bigquery import magics
import google.auth
credentials, project = google.auth.default()

magics.context.credentials = credentials

GCP project ID:hpc-sandbox-306718


In [5]:
BUCKET_NAME = "ml-data-hpc" #@param {type:"string"}
! gsutil ls -al gs://$BUCKET_NAME
    
  
    

   3750208  2021-03-18T14:41:52Z  gs://ml-data-hpc/curated-solubility-dataset.csv#1616078512476369  metageneration=1
TOTAL: 1 objects, 3750208 bytes (3.58 MiB)


In [6]:
#@title Constants { vertical-output: true }

# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = 'curated_solubility_AqSolDB' #@param {type: 'string'}
# Name of the CSV object in the GCS bucket to import. It must already exist,
# because the copy step further down is commented out.
INPUT_CSV_NAME = 'curated-solubility-dataset.csv' #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = '6388444032355270656' #@param {type: 'string'}

COMPUTE_REGION='us-central1'

GCS_DATASET_URI='gs://ml-data-hpc/curated-solubility-dataset.csv'

assert all([
    PROJECT_ID,
    COMPUTE_REGION,
    DATASET_DISPLAY_NAME,
    GCS_DATASET_URI,
    INPUT_CSV_NAME,
    MODEL_DISPLAY_NAME,
    BUCKET_NAME
])

In [8]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)

In [9]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets

{'curated_solubility_AqSolDB': 'projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488',
 'C_elegans_live_dead_assay': 'projects/204898618097/locations/us-central1/datasets/ICN8549266405667635200'}
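
Display names are not unique: a later listing in this notebook shows three datasets named `curated_solubility_AqSolDB`, while the dictionary above keeps only one resource name per display name. A minimal sketch of a grouping that keeps every resource name follows. The helper `group_by_display_name` and the stand-in objects are hypothetical; in the notebook you would pass the result of `tables_client.list_datasets()`.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_display_name(datasets):
    """Map each display name to the list of resource names that carry it."""
    grouped = defaultdict(list)
    for ds in datasets:
        grouped[ds.display_name].append(ds.name)
    return dict(grouped)


# Stand-in objects for illustration only (shortened, made-up resource names).
example = [
    SimpleNamespace(display_name="curated_solubility_AqSolDB", name="datasets/TBL-A"),
    SimpleNamespace(display_name="curated_solubility_AqSolDB", name="datasets/TBL-B"),
    SimpleNamespace(display_name="C_elegans_live_dead_assay", name="datasets/ICN-C"),
]
grouped_names = group_by_display_name(example)
print(grouped_names)
```

Any display name that maps to more than one resource name is a sign that the dataset should be addressed by its full resource name instead.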

In [10]:
# Create dataset.
dataset = tables_client.create_dataset(
          dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset

name: "projects/204898618097/locations/us-central1/datasets/TBL4757833653747187712"
display_name: "curated_solubility_AqSolDB"
create_time {
  seconds: 1618934723
  nanos: 45765000
}
etag: "AB3BwFoig3KqQreDFTjKVv7u1GZkcNPdytuMQdimjvoZoFCkdm_OGQheBQZnXjCGfXo="
tables_dataset_metadata {
  stats_update_time {
  }
}
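
The `create_time` field above is a seconds/nanos pair counted from the Unix epoch. As a small sketch, it can be turned into a readable UTC timestamp with the standard library alone. The helper name `proto_timestamp_to_datetime` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def proto_timestamp_to_datetime(seconds, nanos=0):
    """Convert a (seconds, nanos) pair into a timezone-aware UTC datetime."""
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only resolves microseconds, so sub-microsecond digits are dropped.
    return base + timedelta(microseconds=nanos // 1000)


# Values copied from the create_time block above.
created = proto_timestamp_to_datetime(1618934723, 45765000)
print(created.isoformat())  # 2021-04-20T16:05:23.045765+00:00
```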

In [11]:
# Make sure the data is there.
# INPUT_CSV_NAME already ends in ".csv", so do not append the extension again.
GCS_DATASET_URI = 'gs://{}/{}'.format(BUCKET_NAME, INPUT_CSV_NAME)
! gsutil ls gs://$BUCKET_NAME || gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME
#! gsutil cp gs://cloud-ml-data-tables/notebooks/census_income.csv $GCS_DATASET_URI

gs://ml-data-hpc/curated-solubility-dataset.csv
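
Because `INPUT_CSV_NAME` already carries its `.csv` extension, a URI built from it should not add the extension a second time. The sketch below shows one way to join a bucket and an object name defensively. The helper `gcs_uri` is hypothetical and not part of any Google client library.

```python
def gcs_uri(bucket, object_name):
    """Join a bucket name and an object name into a gs:// URI."""
    bucket = bucket.strip()
    # Accept either a bare bucket name or one already prefixed with gs://.
    if bucket.startswith("gs://"):
        bucket = bucket[len("gs://"):]
    return "gs://{}/{}".format(bucket.strip("/"), object_name.lstrip("/"))


uri = gcs_uri("ml-data-hpc", "curated-solubility-dataset.csv")
print(uri)  # gs://ml-data-hpc/curated-solubility-dataset.csv
```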


In [65]:
# DO NOT RUN MORE THAN ONCE
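# Read the data source from GCS.
# The bucket must be a regional, Standard-class bucket in COMPUTE_REGION;
# otherwise the import fails with the FailedPrecondition error shown below.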
import_data_response = tables_client.import_data(
    dataset=dataset,
    gcs_input_uris=GCS_DATASET_URI
)
print('Dataset import operation: {}'.format(import_data_response.operation))

# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))



Dataset import operation: name: "projects/204898618097/locations/us-central1/operations/TBL1203651046967083008"
metadata {
  type_url: "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata"
  value: "\032\014\010\364\253\373\203\006\020\320\360\312\234\002\"\014\010\364\253\373\203\006\020\320\360\312\234\002z\000"
}



FailedPrecondition: 400 Cannot move files across regions. Please use a regional bucket in the same location and with same storage class as AutoML. Required Location: us-central1, required location type: Region, required storage class: Standard.
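
The error above names three requirements for the source bucket: location `us-central1`, location type `Region`, and storage class `Standard`. As a rough sketch, the check can be expressed as a pure function over the bucket's metadata. The helper `bucket_mismatches` is hypothetical, and the metadata values have to be looked up separately, for example with `gsutil ls -L -b gs://<bucket>`.

```python
def bucket_mismatches(location, location_type, storage_class,
                      required_region="us-central1"):
    """Return the reasons, if any, why the bucket would be rejected."""
    problems = []
    if location.lower() != required_region.lower():
        problems.append("location is {}, expected {}".format(location, required_region))
    if location_type.lower() != "region":
        problems.append("location type is {}, expected region".format(location_type))
    if storage_class.lower() != "standard":
        problems.append("storage class is {}, expected Standard".format(storage_class))
    return problems


# Hypothetical metadata values, for illustration only.
print(bucket_mismatches("US", "multi-region", "STANDARD"))
print(bucket_mismatches("US-CENTRAL1", "region", "STANDARD"))  # []
```

An empty list means the bucket meets all three requirements quoted in the error message.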

In [13]:
# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset

name: "projects/204898618097/locations/us-central1/datasets/TBL4757833653747187712"
display_name: "curated_solubility_AqSolDB"
create_time {
  seconds: 1618934723
  nanos: 45765000
}
etag: "AB3BwFqPqsQ_wWRo584kSnykrgOMN3ifadEl2zlVI5I3_oJEJzQ8yopHVqy5EQvZFlI="
tables_dataset_metadata {
  stats_update_time {
  }
}

In [40]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets['curated_solubility_AqSolDB']

'projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488'

In [50]:
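# Several datasets share this display name, so this comprehension returns a
# list of Dataset objects rather than a single Dataset.
dataset = [ dataset for dataset in list_datasets if dataset.display_name=='curated_solubility_AqSolDB' ]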
dataset

[name: "projects/204898618097/locations/us-central1/datasets/TBL4757833653747187712"
 display_name: "curated_solubility_AqSolDB"
 create_time {
   seconds: 1618934723
   nanos: 45765000
 }
 etag: "AB3BwFpxcnvQGGaYg8R6IQd0vkzUGirPQuMWZk7ZePa7f5JyvAyucp28kep5pVec1ew="
 tables_dataset_metadata {
   stats_update_time {
   }
 },
 name: "projects/204898618097/locations/us-central1/datasets/TBL1788272649449766912"
 display_name: "curated_solubility_AqSolDB"
 create_time {
   seconds: 1618873235
   nanos: 460565000
 }
 etag: "AB3BwFp9VMq0fpta68HGJ-8lmGrJT4uneBOIYTbyE1cc2UNq3PvTEN47waSmJQHe_HM="
 tables_dataset_metadata {
   stats_update_time {
   }
 },
 name: "projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488"
 display_name: "curated_solubility_AqSolDB"
 create_time {
   seconds: 1616071606
   nanos: 424428000
 }
 etag: "AB3BwFqxiantdmD8STznAHZNdmX9jaZ8zwwJA9lglK3TcF5_ZWlAN22qUuETtaUWSGPc"
 example_count: 9982
 tables_dataset_metadata {
   primary_table_spec_id: "5426
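
Of the three datasets listed above, only one reports an `example_count` (9982); the other two are empty. A minimal sketch of selecting that single populated dataset follows. The helper `pick_populated` is hypothetical, and the stand-in objects only mimic the two fields used here.

```python
from types import SimpleNamespace


def pick_populated(datasets, display_name):
    """Return the single dataset with this display name that has imported rows."""
    matches = [
        ds for ds in datasets
        if ds.display_name == display_name and getattr(ds, "example_count", 0) > 0
    ]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one populated dataset, found {}".format(len(matches)))
    return matches[0]


# Stand-ins built from the resource IDs and example counts shown above.
prefix = "projects/204898618097/locations/us-central1/datasets/"
listed = [
    SimpleNamespace(display_name="curated_solubility_AqSolDB",
                    name=prefix + "TBL4757833653747187712", example_count=0),
    SimpleNamespace(display_name="curated_solubility_AqSolDB",
                    name=prefix + "TBL1788272649449766912", example_count=0),
    SimpleNamespace(display_name="curated_solubility_AqSolDB",
                    name=prefix + "TBL3193642023793983488", example_count=9982),
]
chosen = pick_populated(listed, "curated_solubility_AqSolDB")
print(chosen.name)
```

Passing one object selected this way, rather than the whole list, is what the client calls that take a `dataset` argument expect.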

In [17]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models

{'curated_solubilit_20210318085445': 'projects/204898618097/locations/us-central1/models/TBL3336281117509550080',
 'untitled_16159270_20210316094815': 'projects/204898618097/locations/us-central1/models/TBL6654589617951473664'}

In [55]:
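# List table specs.
# NOTE: `dataset` is still the list built in cell In [50], which is what raises
# the AttributeError below; pass a single Dataset object instead.
list_table_specs_response = tables_client.list_table_specs(dataset=dataset)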
table_specs = [s for s in list_table_specs_response]

# List column specs.
list_column_specs_response = tables_client.list_column_specs(dataset=dataset)
column_specs = {s.display_name: s for s in list_column_specs_response}


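# Print features and their data types.
# NOTE: `data_types` is not imported anywhere in this notebook. The source
# example imports it with
#   import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
# (a module path that may differ between google-cloud-automl releases).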
features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) 
            for key, value in column_specs.items()]
print('Feature list:\n')
for feature in features:
    print(feature[0],':', feature[1])





AttributeError: 'list' object has no attribute 'name'

In [35]:
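# Count the columns of each data type and plot the distribution.
# This cell depends on `column_specs` from the column-spec cell (In [55]); the
# NameError below was raised because that variable had not been defined yet.
type_counts = {}
for column_spec in column_specs.values():
    type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
    type_counts[type_name] = type_counts.get(type_name, 0) + 1
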
plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()

NameError: name 'column_specs' is not defined

In [19]:
model = tables_client.get_model(model_display_name='curated_solubilit_20210318085445')
model

name: "projects/204898618097/locations/us-central1/models/TBL3336281117509550080"
display_name: "curated_solubilit_20210318085445"
dataset_id: "TBL3193642023793983488"
create_time {
  seconds: 1616072185
  nanos: 207897000
}
deployment_state: UNDEPLOYED
update_time {
  seconds: 1616076177
  nanos: 954451000
}
tables_model_metadata {
  target_column_spec {
    name: "projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488/tableSpecs/542600192214433792/columnSpecs/4300177331848216576"
    data_type {
      type_code: FLOAT64
    }
    display_name: "Solubility"
  }
  input_feature_column_specs {
    name: "projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488/tableSpecs/542600192214433792/columnSpecs/4444292519924072448"
    data_type {
      type_code: CATEGORY
    }
    display_name: "Group"
  }
  input_feature_column_specs {
    name: "projects/204898618097/locations/us-central1/datasets/TBL3193642023793983488/tableSpecs/542600192214433792/col

In [None]:
# DO NOT RUN IF THE MODEL IS ALREADY TRAINED
model_train_hours = 1 #@param {type:'integer'}

create_model_response = tables_client.create_model(
    model_display_name=MODEL_DISPLAY_NAME,
    dataset=dataset,
    train_budget_milli_node_hours=model_train_hours*1000
)

operation_id = create_model_response.operation.name

print('Create model operation: {}'.format(create_model_response.operation))
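
The training budget is passed in milli node hours, so one node hour corresponds to 1000. The sketch below wraps that conversion with a range check. The helper name is hypothetical, and the 1 to 72 hour bounds are an assumption taken from memory of the AutoML Tables documentation; verify them before relying on the check.

```python
def to_milli_node_hours(hours, min_hours=1, max_hours=72):
    """Convert node hours to milli node hours, rejecting out-of-range values."""
    # ASSUMPTION: the 1..72 hour limits are not stated in this notebook.
    if not min_hours <= hours <= max_hours:
        raise ValueError(
            "budget must be between {} and {} node hours".format(min_hours, max_hours))
    return int(round(hours * 1000))


budget = to_milli_node_hours(1)
print(budget)  # 1000
```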

In [20]:
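# Deploy the model. The call blocks until deployment finishes; the run below
# was interrupted manually.
tables_client.deploy_model(model=model).result()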

KeyboardInterrupt: 

In [32]:
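# Batch prediction input and output locations. These live in a different
# bucket (ml-data-biochem) from BUCKET_NAME above.
gcs_output_folder_name="pred"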
SAMPLE_INPUT = 'gs://ml-data-biochem/curated-solubility-dataset-batch-100.csv'
GCS_BATCH_PREDICT_OUTPUT = 'gs://ml-data-biochem/{}/'.format(gcs_output_folder_name)



In [34]:
batch_predict_response = tables_client.batch_predict(
    model=model, 
    gcs_input_uris=SAMPLE_INPUT,
    gcs_output_uri_prefix=GCS_BATCH_PREDICT_OUTPUT,
)
print('Batch prediction operation: {}'.format(
    batch_predict_response.operation))

# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata

Batch prediction operation: name: "projects/204898618097/locations/us-central1/operations/TBL6351546896028270592"
metadata {
  type_url: "type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata"
  value: "\032\014\010\361\324\374\203\006\020\270\265\214\306\002\"\014\010\361\324\374\203\006\020\270\265\214\306\002\202\001C\nA\n?\n=gs://ml-data-biochem/curated-solubility-dataset-batch-100.csv"
}



create_time {
  seconds: 1618946673
  nanos: 683875000
}
update_time {
  seconds: 1618947002
  nanos: 329507000
}
batch_predict_details {
  input_config {
    gcs_source {
      input_uris: "gs://ml-data-biochem/curated-solubility-dataset-batch-100.csv"
    }
  }
  output_info {
    gcs_output_directory: "gs://ml-data-biochem/pred/prediction-curated_solubilit_20210318085445-2021-04-20T19:24:33.552063Z"
  }
}

In [None]:
# CLEAN UP ALL RESOURCES USED
# Delete model resource.
#tables_client.delete_model(model_name=model.name)

# Delete dataset resource.
#tables_client.delete_dataset(dataset_name=dataset_name)

# Delete Cloud Storage objects that were created.
#! gsutil -m rm -r gs://$BUCKET_NAME
  
# If training model is still running, cancel it.
#automl_client.transport._operations_client.cancel_operation(operation_id)

In [57]:
# Sample rows from the input CSV: the header followed by two records.
## ID,Name,InChI,InChIKey,SMILES,Solubility,SD,Ocurrences,Group,MolWt,MolLogP,MolMR,HeavyAtomCount,NumHAcceptors,NumHDonors,NumHeteroatoms,NumRotatableBonds,NumValenceElectrons,NumAromaticRings,NumSaturatedRings,NumAliphaticRings,RingCount,TPSA,LabuteASA,BalabanJ,BertzCT
## A-3,"N,N,N-trimethyloctadecan-1-aminium bromide","InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22(2,3)4;/h5-21H2,1-4H3;1H/q+1;/p-1",SZEMGTQCPRNXEG-UHFFFAOYSA-M,[Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C,-3.6161271205000003,0.0,1,G1,392.5100000000002,3.9581000000000017,102.44540000000009,23.0,0.0,0.0,2.0,17.0,142.0,0.0,0.0,0.0,0.0,0.0,158.52060058500794,0.0,210.377334085253
## A-4,Benzo[cd]indol-2(1H)-one,"InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-11)10(7)8/h1-6H,(H,12,13)",GPYLCFQEKPUWLD-UHFFFAOYSA-N,O=C1Nc2cccc3cccc1c23,-3.2547670983,0.0,1,G1,169.18299999999996,2.405500000000001,51.90120000000002,13.0,1.0,1.0,2.0,0.0,62.0,2.0,0.0,1.0,3.0,29.1,75.18356289575135,2.5829963936570306,511.2292477154965

In [None]:
# Sample rows from the batch prediction output: the input columns plus a final predicted_Solubility column.
#ID,Name,InChI,InChIKey,SMILES,Solubility,SD,Ocurrences,Group,MolWt,MolLogP,MolMR,HeavyAtomCount,NumHAcceptors,NumHDonors,NumHeteroatoms,NumRotatableBonds,NumValenceElectrons,NumAromaticRings,NumSaturatedRings,NumAliphaticRings,RingCount,TPSA,LabuteASA,BalabanJ,BertzCT,predicted_Solubility
#A-65,"2-(benzotriazol-2-yl)-6-[[3-(benzotriazol-2-yl)-2-hydroxy-5-(2,4,4-trimethylpentan-2-yl)phenyl]methyl]-4-(2,4,4-trimethylpentan-2-yl)phenol","InChI=1S/C41H50N6O2/c1-38(2,3)24-40(7,8)28-20-26(36(48)34(22-28)46-42-30-15-11-12-16-31(30)43-46)19-27-21-29(41(9,10)25-39(4,5)6)23-35(37(27)49)47-44-32-17-13-14-18-33(32)45-47/h11-18,20-23,48-49H,19,24-25H2,1-10H3",FQUNFJULCYSSOP-UHFFFAOYSA-N,CC(C)(C)CC(C)(C)c1cc(Cc2cc(cc(n3nc4ccccc4n3)c2O)C(C)(C)CC(C)(C)C)c(O)c(c1)n5nc6ccccc6n5,-7.973715535399999,1.855923840245174,3,G4,658.8910000000001,9.583999999999996,198.0615999999994,49.0,8.0,2.0,8.0,8.0,256.0,6.0,0.0,0.0,6.0,101.88,289.349168262918,1.5526911798946945,1937.7168881281682,-7.0111432075500488
#A-24,2-(4-chloro-2-methylphenoxy)propanoic acid,"InChI=1S/C10H11ClO3/c1-6-5-8(11)3-4-9(6)14-7(2)10(12)13/h3-5,7H,1-2H3,(H,12,13)",WNTGYJSOUMFZEP-UHFFFAOYSA-N,CC(Oc1ccc(Cl)cc1C)C(O)=O,-2.4660307863,0.06062087680051357,4,G5,214.64799999999997,2.5003199999999994,53.91480000000002,14.0,2.0,1.0,4.0,3.0,76.0,1.0,0.0,0.0,1.0,46.53,87.26373888571173,2.817664884623791,349.2203893594689,-2.8079211711883545
#A-9,"4-({4-[bis(oxiran-2-ylmethyl)amino]phenyl}methyl)-N,N-bis(oxiran-2-ylmethyl)aniline","InChI=1S/C25H30N2O4/c1-5-20(26(10-22-14-28-22)11-23-15-29-23)6-2-18(1)9-19-3-7-21(8-4-19)27(12-24-16-30-24)13-25-17-31-25/h1-8,22-25H,9-17H2",FAUAZXVRLVIARB-UHFFFAOYSA-N,C1OC1CN(CC2CO2)c3ccc(Cc4ccc(cc4)N(CC5CO5)CC6CO6)cc3,-4.6620645831,0.0,1,G1,422.5250000000002,2.4854000000000003,119.07600000000004,31.0,6.0,0.0,6.0,12.0,164.0,2.0,4.0,4.0,6.0,56.6,183.18326845703425,1.0844273169718197,769.8999341256456,-3.538607120513916
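
To get a quick sense of how close the predictions are, the three rows above can be compared column against column. The sketch below computes the mean absolute error between `Solubility` and `predicted_Solubility` with the standard `csv` module. The rows are trimmed to three columns for brevity, and the helper `mean_absolute_error` is hypothetical. Three rows are far too few to judge the model; this only illustrates the calculation.

```python
import csv
import io

# Columns trimmed to ID, Solubility, and predicted_Solubility; the values are
# copied from the three prediction rows above.
sample = """ID,Solubility,predicted_Solubility
A-65,-7.973715535399999,-7.0111432075500488
A-24,-2.4660307863,-2.8079211711883545
A-9,-4.6620645831,-3.538607120513916
"""


def mean_absolute_error(csv_text, target="Solubility"):
    """Average |actual - predicted| over all rows of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    errors = [abs(float(r[target]) - float(r["predicted_" + target])) for r in rows]
    return sum(errors) / len(errors)


mae = mean_absolute_error(sample)
print(round(mae, 4))  # 0.8093
```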