This Python application uses the Vertex AI library on Google Cloud Platform (GCP) to generate, from natural-language descriptions, Terraform code that deploys infrastructure and PySpark code that extracts, transforms, and loads data.
- BigQuery and Cloud Storage: serve as the data storage.
- Dataproc: Dataproc Batches executes the generated PySpark code to process the data.
- Vertex AI: the Vertex AI library generates the Terraform code for infrastructure deployment and the PySpark ETL code.
- Terraform: HCL code automates the creation and destruction of the infrastructure.
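At a high level, the generator builds a natural-language prompt, sends it to a Vertex AI text model, and writes the returned HCL and PySpark to `generated/<example>/`. A minimal sketch of that flow, assuming the `text-bison` model (the model name, prompt, and output path here are illustrative; see `ai_gen.py` for the actual implementation):

```python
# Minimal sketch of the generation flow; ai_gen.py may differ in model and prompts.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="your_project_name", location="europe-west1")
model = TextGenerationModel.from_pretrained("text-bison")  # assumed model

prompt = (
    "Generate Terraform (HCL) that creates a BigQuery dataset and table "
    "for a players ranking example."  # illustrative prompt
)
response = model.predict(prompt, temperature=0.2, max_output_tokens=2048)

# Write the generated HCL where the Terraform steps below expect it.
with open("generated/players/terraform/main.tf", "w") as f:
    f.write(response.text)
```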
- Python >= 3.8 (required by recent versions of google-cloud-aiplatform)
- GCP Account with Vertex AI API access
- Google Cloud SDK installed and configured
- Terraform installed
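If the SDK is not configured yet, a typical setup looks like this (the project name is a placeholder):

```bash
gcloud auth login
gcloud config set project your_project_name
```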
- Clone the repository:

```bash
git clone https://github.com/alfonsozamorac/etl-genai.git
cd etl-genai
```
- Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```
- Install the GCP Vertex AI library:

```bash
pip3 install --upgrade google-cloud-aiplatform
```
- Export your GCP credentials:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="your_credentials_file.json"
```
- Optionally, export your GCP project so the commands below can reference it:

```bash
export MY_GCP_PROJECT="your_project_name"
```
- Run the Python program to generate the HCL and PySpark code for the players example:

```bash
python3 ai_gen.py players ${MY_GCP_PROJECT}
```
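The generator writes each example into its own folder; the paths used by the commands below are:

```
generated/players/
├── terraform/   # HCL for the init/apply/destroy steps
└── python/
    └── etl.py   # PySpark job submitted to Dataproc Batches
```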
- Create the infrastructure with Terraform:

```bash
terraform -chdir="generated/players/terraform" init
terraform -chdir="generated/players/terraform" apply --auto-approve
```
- Insert data into the BigQuery table:

```bash
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False < example_data/players/insert_players_example.sql
```
- Create a Dataproc batch that runs the generated PySpark code:

```bash
gcloud dataproc batches submit pyspark generated/players/python/etl.py \
  --version=1.1 \
  --batch=players-genai-$RANDOM \
  --region="europe-west1" \
  --deps-bucket=gs://temporary-files-dataproc \
  --project=${MY_GCP_PROJECT} \
  --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```
- Check the results in the GCP console, or run the query below:

```bash
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM myligue.players_ranking"
```
- Destroy the infrastructure:

```bash
terraform -chdir="generated/players/terraform" destroy --auto-approve
```
- Run the Python program to generate the HCL and PySpark code for the customers example:

```bash
python3 ai_gen.py customers ${MY_GCP_PROJECT}
```
- Create the infrastructure with Terraform:

```bash
terraform -chdir="generated/customers/terraform" init
terraform -chdir="generated/customers/terraform" apply --auto-approve
```
- Insert data into the Cloud Storage folders:

```bash
gcloud storage cp example_data/customers/sales.csv gs://sales-etl-bucket/input/sales
gcloud storage cp example_data/customers/customers.csv gs://sales-etl-bucket/input/customers
```
- Create a Dataproc batch that runs the generated PySpark code:

```bash
gcloud dataproc batches submit pyspark generated/customers/python/etl.py \
  --version=1.1 \
  --batch=customers-genai-$RANDOM \
  --region="europe-west1" \
  --deps-bucket=gs://temporary-files-dataproc \
  --project=${MY_GCP_PROJECT} \
  --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```
- Check the results in the GCP console, or run the queries below:

```bash
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM raw_sales_data.customer_table"
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM raw_sales_data.sales_table"
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM master_sales_data.bigtable_info"
```
- Destroy the infrastructure:

```bash
terraform -chdir="generated/customers/terraform" destroy --auto-approve
```
- Add a new key/value entry to the `examples` dictionary in `ai_gen.py`, pairing your example name with its natural-language description (see the sketch below).
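For illustration, a new entry might look like this (the key and description are hypothetical; mirror the shape of the existing `players` and `customers` entries in `ai_gen.py`):

```python
# Hypothetical entry; match the structure of the existing examples in ai_gen.py.
examples["my_sales"] = (
    "Create a BigQuery dataset with a sales table, and a PySpark job that "
    "reads the table and writes the total revenue per customer."
)
```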
- Run the Python program to generate the HCL and PySpark code for your example:

```bash
python3 ai_gen.py ${your_example_name} ${MY_GCP_PROJECT}
```
- Create the infrastructure with Terraform:

```bash
terraform -chdir="generated/${your_example_name}/terraform" init
terraform -chdir="generated/${your_example_name}/terraform" apply --auto-approve
```
- Insert input data into Cloud Storage or BigQuery, depending on the infrastructure your example defines (see the sketch below).
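For example, depending on what your generated Terraform creates, loading the input data might look like one of these (bucket, file, and SQL names are placeholders):

```bash
# Cloud Storage input (placeholder bucket and file):
gcloud storage cp example_data/${your_example_name}/input.csv gs://your-bucket/input/
# BigQuery input (placeholder SQL file):
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False < example_data/${your_example_name}/insert_example.sql
```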
- Create a Dataproc batch that runs the generated PySpark code (note the path uses your example name):

```bash
gcloud dataproc batches submit pyspark generated/${your_example_name}/python/etl.py \
  --version=1.1 \
  --batch=${your_example_name}-genai-$RANDOM \
  --region="europe-west1" \
  --deps-bucket=gs://temporary-files-dataproc \
  --project=${MY_GCP_PROJECT} \
  --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```
- Check the results in Cloud Storage or BigQuery (see the sketch below).
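For example (placeholder bucket, dataset, and table names):

```bash
# Output files in Cloud Storage:
gcloud storage ls gs://your-bucket/output/
# Output table in BigQuery:
bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM your_dataset.your_table"
```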
- Destroy the infrastructure:

```bash
terraform -chdir="generated/${your_example_name}/terraform" destroy --auto-approve
```