<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/00%20-%20Setup/00%20-%20Environment%20Setup.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2F00%2520-%2520Setup%2F00%2520-%2520Environment%2520Setup.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/00%20-%20Setup/00%20-%20Environment%20Setup.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/00%20-%20Setup/00%20-%20Environment%20Setup.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# 00 - Environment Setup

This is the notebook that sets up the GCP project for the other notebooks in this repository.  Based on the [readme.md](../readme.md), you already have this repository of notebooks pulled as a local resource in your Vertex AI Workbench based notebook instance.

**Video Walkthrough of this notebook:**

Includes conversational walkthrough and more explanatory information than the notebook:
<p align="center" width="100%"><center><a href="https://youtu.be/pnQ5Rv4ZQfo" target="_blank" rel="noopener noreferrer"><img src="../architectures/thumbnails/playbutton/00.png" width="40%"></a></center></p>

**Conceptual Flow & Workflow**

<p align="center">
  <img alt="Conceptual Flow" src="../architectures/slides/00_arch.png" width="45%">
&nbsp; &nbsp; &nbsp; &nbsp;
  <img alt="Workflow" src="../architectures/slides/00_console.png" width="45%">
</p>

---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'prj-prod-dataplatform'

In [2]:
REGION = 'asia-southeast1'  # a region, not a zone such as 'asia-southeast1-a'
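
Cloud Storage bucket locations are regions (or dual- and multi-regions), not zones, and a zone name carries a trailing letter such as `-a`. The following is a minimal sketch, assuming the usual `<region>-<letter>` zone naming, that normalizes a value before it is used as a bucket location. `to_region` is a hypothetical helper, not part of any Google library.

```python
import re

# Assumption: zones are named "<region>-<letter>", e.g. "asia-southeast1-a".
_ZONE_PATTERN = re.compile(r"^(?P<region>[a-z]+-[a-z]+\d+)-[a-z]$")


def to_region(location: str) -> str:
    """Return the region for a zone name; leave any other value unchanged."""
    match = _ZONE_PATTERN.match(location)
    return match.group("region") if match else location


print(to_region("asia-southeast1-a"))
```

Values that do not look like a zone, including multi-region names such as `US`, pass through untouched.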

packages:

In [3]:
from google.cloud import storage
from google.cloud import bigquery

import pandas as pd
from sklearn import datasets

clients:

In [4]:
gcs = storage.Client(project = PROJECT_ID)
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [6]:
BUCKET = 'prod-tonik-dl-staging-data'

---
## Create Storage Bucket
Check whether the bucket already exists and create it if it is missing:
- [GCS Python Client](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.client.Client)

In [7]:
if not gcs.lookup_bucket(BUCKET):
    bucketDef = gcs.bucket(BUCKET)
    bucket = gcs.create_bucket(bucketDef, project=PROJECT_ID, location=REGION)
    print(f'Created Bucket: {gcs.lookup_bucket(BUCKET).name}')
else:
    bucketDef = gcs.bucket(BUCKET)
    print(f'Bucket already exist: {bucketDef.name}')

Bucket already exist: prod-tonik-dl-staging-data
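
The check-then-create logic above can also be wrapped in a small function so the lookup happens once and the caller learns whether a bucket was created. This is only a sketch: `ensure_bucket` is a hypothetical helper, and it assumes `client` behaves like `google.cloud.storage.Client`, whose `lookup_bucket` returns `None` for a missing bucket.

```python
def ensure_bucket(client, name, location):
    """Return (bucket, created) for the bucket called `name`.

    `client` is assumed to expose `lookup_bucket` and `create_bucket`
    like google.cloud.storage.Client does.
    """
    bucket = client.lookup_bucket(name)  # None when the bucket does not exist
    if bucket is not None:
        return bucket, False
    # Create the bucket in the requested location (a region or multi-region).
    return client.create_bucket(name, location=location), True
```

In this notebook the call would look like `bucket, created = ensure_bucket(gcs, BUCKET, REGION)`.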


In [8]:
print(f'Review the storage bucket in the console here:\nhttps://console.cloud.google.com/storage/browser/{BUCKET};tab=objects&project={PROJECT_ID}')

Review the storage bucket in the console here:
https://console.cloud.google.com/storage/browser/prod-tonik-dl-staging-data;tab=objects&project=prj-prod-dataplatform


In [17]:
DATANAME = "LoanmasterSample"

In [14]:
%%bigquery df
WITH
  b AS (
  SELECT
    loanAccountNumber,
    min_inst_def30,
    obs_min_inst_def30
  FROM
    prj-prod-dataplatform.risk_credit_mis.loan_deliquency_data
  WHERE
    obs_min_inst_def30 >= 2),
lmt as
(SELECT
  lmt.loanAccountNumber,
  lmt.customerId,
  lmt.digitalLoanAccountId,
  lmt.tsa_onboarding_time,
  lmt.startApplyDateTime,
  lmt.termsAndConditionsSubmitDateTime,
  lmt.isTermsAndConditionsAccepted,
  lmt.disbursementDateTime,
  lmt.flagDisbursement,
  lmt.loanPaidStatus,
  case when b.obs_min_inst_def30 >=2 and b.min_inst_def30 in (1,2) then lmt.loanAccountNumber end FSPD30_loancnt,
  case when b.obs_min_inst_def30 >=2 then lmt.loanAccountNumber end obsFSPD30_loancnt
FROM
  `risk_credit_mis.loan_master_table` lmt
INNER JOIN
  b
ON
  lmt.loanAccountNumber = b.loanAccountNumber 
)
select 
distinct
  lmt.customerId,
  lmt.digitalLoanAccountId,
  lmt.loanAccountNumber,
  lmt.tsa_onboarding_time,
  lmt.startApplyDateTime,
  lmt.termsAndConditionsSubmitDateTime,
  lmt.isTermsAndConditionsAccepted,
  lmt.disbursementDateTime,
  lmt.flagDisbursement,
  lmt.loanPaidStatus,
  t3.creditScoreUpdated   ,
  t3.fraudScore   ,	
  t3.fraudScoreUpdated    ,
  t3.calculateddate   ,
  t4.run_date ,
  ca.package_name ,
  ca.first_install_time    ,
  t4.GeneralInfo.brand     ,
  t4.Hardware.device__brand   ,
  t4.Hardware.device__manufacturer   ,
  t4.Hardware.device__model,
  t4.GeneralData.telephony_info__network_operator_name,
  t4.GeneralData.telephony_info__network_operator,
  t4.GeneralData.sim_operator_name,
  lmt.FSPD30_loancnt,     ---- FSPD30 = 1 when this value is not null (provided because package names cause duplicate rows in this dataset)
  lmt.obsFSPD30_loancnt   ---- obsFSPD30 = 1 when this value is not null (provided because package names cause duplicate rows in this dataset)
from lmt
LEFT JOIN
`prj-prod-dataplatform.dl_loans_db_raw.tdbk_digital_loan_application` t2
ON lmt.digitalLoanAccountId = t2.digitalLoanAccountId
LEFT JOIN
`prj-prod-dataplatform.dl_loans_db_raw.tdbk_credolab_track` t3
ON t2.credolabRefNumber = t3.refno
LEFT JOIN
`prj-prod-dataplatform.credolab_raw.android_credolab_datasets_struct_columns` t4
ON t3.refno = t4.deviceId
inner join
`prj-prod-dataplatform.core_raw.loan_accounts` loan
on loan.CUSTOMERID = lmt.customerId
 INNER JOIN
(select deviceId, af.package_name as package_name, af.first_install_time as first_install_time from `prj-prod-dataplatform.credolab_raw.android_credolab_Application`  ,
unnest(Application) as af) ca
ON ca.deviceId = t3.refno
where date(lmt.startApplyDateTime) >='2024-06-01'   ---- Please change the date as per your requirement. This is Loan Application Apply Date
and lmt.FSPD30_loancnt is not null
order by lmt.customerId
limit 1000   --- Please remove this when running the query
;


In [15]:
df.head()

|  | customerId | digitalLoanAccountId | loanAccountNumber | tsa_onboarding_time | startApplyDateTime | termsAndConditionsSubmitDateTime | isTermsAndConditionsAccepted | disbursementDateTime | flagDisbursement | loanPaidStatus | ... | first_install_time | brand | device__brand | device__manufacturer | device__model | telephony_info__network_operator_name | telephony_info__network_operator | sim_operator_name | FSPD30_loancnt | obsFSPD30_loancnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1302142 | d3038dd0-0e81-4983-91d8-9fc9065a8ed1 | 60813021420026 | 2022-04-04 21:16:19 | 2024-06-01 08:13:50 | 2024-06-05 09:08:57 | 1 | 2024-06-05 09:43:16 | 1 | Normal | ... | 2009-01-01 08:00:00+00:00 | vivo | vivo |  | V2352A |  | 51502 | GLOBE | 60813021420026 | 60813021420026 |
| 1 | 1302142 | d3038dd0-0e81-4983-91d8-9fc9065a8ed1 | 60813021420026 | 2022-04-04 21:16:19 | 2024-06-01 08:13:50 | 2024-06-05 09:08:57 | 1 | 2024-06-05 09:43:16 | 1 | Normal | ... | 2009-01-01 08:00:00+00:00 | vivo | vivo |  | V2352A |  | 51502 | GLOBE | 60813021420026 | 60813021420026 |
| 2 | 1302142 | d3038dd0-0e81-4983-91d8-9fc9065a8ed1 | 60813021420026 | 2022-04-04 21:16:19 | 2024-06-01 08:13:50 | 2024-06-05 09:08:57 | 1 | 2024-06-05 09:43:16 | 1 | Normal | ... | 2009-01-01 08:00:00+00:00 | vivo | vivo |  | V2352A |  | 51502 | GLOBE | 60813021420026 | 60813021420026 |
| 3 | 1302142 | d3038dd0-0e81-4983-91d8-9fc9065a8ed1 | 60813021420026 | 2022-04-04 21:16:19 | 2024-06-01 08:13:50 | 2024-06-05 09:08:57 | 1 | 2024-06-05 09:43:16 | 1 | Normal | ... | 2009-01-01 08:00:00+00:00 | vivo | vivo |  | V2352A |  | 51502 | GLOBE | 60813021420026 | 60813021420026 |
| 4 | 1302142 | d3038dd0-0e81-4983-91d8-9fc9065a8ed1 | 60813021420026 | 2022-04-04 21:16:19 | 2024-06-01 08:13:50 | 2024-06-05 09:08:57 | 1 | 2024-06-05 09:43:16 | 1 | Normal | ... | 2009-01-01 08:00:00+00:00 | vivo | vivo |  | V2352A |  | 51502 | GLOBE | 60813021420026 | 60813021420026 |
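
The repeated rows above come from the join on installed packages, and the query comments describe turning the two `*_loancnt` columns into 0/1 flags. A minimal pandas sketch of that step follows. The helper name `to_loan_level` and the sample rows are hypothetical. The query as written already filters to non-null `FSPD30_loancnt`, so the `None` row only illustrates the general case.

```python
import pandas as pd


def to_loan_level(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse package-level rows to one row per loan with 0/1 flags."""
    out = df.copy()
    # A flag is 1 when the corresponding loan-count column is not null.
    out["FSPD30"] = out["FSPD30_loancnt"].notna().astype(int)
    out["obsFSPD30"] = out["obsFSPD30_loancnt"].notna().astype(int)
    keep = ["customerId", "loanAccountNumber", "FSPD30", "obsFSPD30"]
    # Drop the duplicates introduced by one row per installed package.
    return out[keep].drop_duplicates(subset="loanAccountNumber").reset_index(drop=True)


# Hypothetical sample data, not taken from the real tables.
sample = pd.DataFrame({
    "customerId": [1, 1, 2],
    "loanAccountNumber": ["A", "A", "B"],
    "package_name": ["app.one", "app.two", "app.one"],
    "FSPD30_loancnt": ["A", "A", None],
    "obsFSPD30_loancnt": ["A", "A", "B"],
})
loan_level = to_loan_level(sample)
print(loan_level)
```

Keeping the package-level frame separately is still useful when the installed apps themselves are the features of interest.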


In [25]:
client = storage.Client()
bucket = client.get_bucket(BUCKET)
# Create a blob
blob = bucket.blob(f"report_dumps/'VertexAi'/data/{DATANAME}.csv")

# Save DataFrame to CSV
df.to_csv(f"{DATANAME}.csv", index=False)

# Upload the CSV to the bucket
blob.upload_from_filename(f"{DATANAME}.csv")
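
The upload above writes a local CSV first. As an alternative sketch, the CSV text can be sent straight from memory with `upload_from_string`. `upload_csv_text` is a hypothetical helper, and `bucket` is assumed to behave like a `google.cloud.storage` bucket.

```python
def upload_csv_text(bucket, blob_name, csv_text):
    """Upload CSV text to `blob_name` in `bucket` without a local file.

    `bucket` is assumed to expose `blob(name)` like
    google.cloud.storage.Bucket, and the blob `upload_from_string`.
    """
    blob = bucket.blob(blob_name)
    # content_type marks the object as CSV in Cloud Storage metadata.
    blob.upload_from_string(csv_text, content_type="text/csv")
    return blob
```

With the notebook's variables this could be called as `upload_csv_text(bucket, blob.name, df.to_csv(index=False))`, since `to_csv` returns a string when no path is given.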


---
<a id = 'permissions'></a>
## Service Account & Permissions

This notebook instance runs as a service account in GCP.  The same service account is also used to run other Vertex AI services, such as training jobs and pipelines.  It needs permission to interact with objects in Cloud Storage, which requires the role [roles/storage.objectAdmin](https://cloud.google.com/storage/docs/access-control/iam-roles).

Get the current service account:

In [9]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'32828934978-compute@developer.gserviceaccount.com'

Enable the Cloud Resource Manager API:

In [10]:
!gcloud services enable cloudresourcemanager.googleapis.com



List the service account's current roles:

In [11]:
!gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$SERVICE_ACCOUNT" --format='table(bindings.role)' --flatten="bindings[].members"

ERROR: (gcloud.projects.get-iam-policy) User [32828934978-compute@developer.gserviceaccount.com] does not have permission to access projects instance [prj-prod-dataplatform:getIamPolicy] (or it may not exist): The caller does not have permission


If the resulting list is missing `roles/storage.objectAdmin` or another role that contains this permission, like the basic role `roles/owner`, then it will need to be added for the service account. Use these instructions to complete this:

In [12]:
print(f'Go To IAM in the Google Cloud Console:\nhttps://console.cloud.google.com/iam-admin/iam?orgonly=true&project={PROJECT_ID}&supportedpurview=organizationId')

Go To IAM in the Google Cloud Console:
https://console.cloud.google.com/iam-admin/iam?orgonly=true&project=prj-prod-dataplatform&supportedpurview=organizationId


From the console link above, or by going to https://console.cloud.google.com and navigating to "IAM & Admin > IAM":
- Locate the row for the service account listed above: `<project number>-compute@developer.gserviceaccount.com`
- Under the `inheritance` column click the pencil icon to edit roles
- In the fly over menu, under `Assign roles` select `Add Another Role`
- Click the `Select a role` box and type `Storage Object Admin`, then select `Storage Object Admin`
- Click Save
- Rerun the role listing below and verify the role has been added:

In [13]:
!gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$SERVICE_ACCOUNT" --format='table(bindings.role)' --flatten="bindings[].members"

ERROR: (gcloud.projects.get-iam-policy) User [32828934978-compute@developer.gserviceaccount.com] does not have permission to access projects instance [prj-prod-dataplatform:getIamPolicy] (or it may not exist): The caller does not have permission

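The console steps above have a command-line counterpart in `gcloud projects add-iam-policy-binding`. The sketch below only builds the argument list. `add_role_command` is a hypothetical helper and the project and account values are placeholders. Running the command requires an identity that is allowed to change the project's IAM policy, which the notebook's service account may not be, given the permission error shown above.

```python
import shlex


def add_role_command(project_id, service_account, role="roles/storage.objectAdmin"):
    """Build the gcloud argument list that grants `role` to a service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account}",
        f"--role={role}",
    ]


# Placeholder values for illustration only.
cmd = add_role_command("my-project", "123-compute@developer.gserviceaccount.com")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` by someone with sufficient rights, or the printed string can be pasted into Cloud Shell.
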

---
## Install KFP
If a step fails with an error, rerun it.  The dependencies sometimes resolve on a second attempt.
- [Install the Kubeflow Pipelines SDK](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/install-sdk/)

In [26]:
!pip install kfp -U -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-pipeline-components 2.6.0 requires kfp<=2.4.0,>=2.0.0b10, but you have kfp 2.9.0 which is incompatible.

In [27]:
!pip install google-cloud-pipeline-components -U -q

In [29]:
!pip install plotly -q

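After installs like these, it can help to check which versions ended up in the environment and what one package declares about another. The sketch below uses only the standard library's `importlib.metadata`. Both helper names are hypothetical, and the keyword filter is a plain substring match, so `kfp` also matches names such as `kfp-pipeline-spec`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def declared_requirements(package, keyword):
    """Return the requirement strings of `package` that mention `keyword`."""
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    # Substring match: simple, but may include similarly named packages.
    return [req for req in requirements if keyword in req]


print(installed_versions(["kfp", "google-cloud-pipeline-components"]))
print(declared_requirements("google-cloud-pipeline-components", "kfp"))
```

If the second call lists a `kfp` range that excludes the installed version, reinstalling `google-cloud-pipeline-components` (as the next cell does) or pinning `kfp` are the usual ways out.
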
---
## Update AIPlatform Package:

The `google-cloud-aiplatform` package updates frequently.  Update it to get the latest functionality.

- [aiplatform Python Client](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform)
- [GitHub Repo for api-common-protos](https://github.com/googleapis/api-common-protos)

For a better understanding of the Vertex AI APIs client, version, and layers please review the tip here [aiplatform_notes.md](../Tips/aiplatform_notes.md).

In [33]:
!pip install googleapis-common-protos -U -q --user

In [34]:
!pip install google-cloud-aiplatform -U -q

In [35]:
from google.cloud import aiplatform
aiplatform.__version__

'1.68.0'