PySpark ETL and Infrastructure as Code (IaC) with Vertex AI on GCP

Documentation

Overview

This Python application uses the Vertex AI library on Google Cloud Platform (GCP) to generate, from natural-language prompts, Terraform code that deploys the infrastructure and PySpark code that extracts, transforms, and loads the data.
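
As an illustration only, a generator like this can call Vertex AI roughly as in the sketch below; the model name (code-bison), the prompt text, and the parameters are assumptions and not necessarily what ai_gen.py uses.

    import vertexai
    from vertexai.language_models import CodeGenerationModel

    # Assumed model and parameters; ai_gen.py may use a different model,
    # prompt, or API surface.
    vertexai.init(project="your_project_name", location="us-central1")
    model = CodeGenerationModel.from_pretrained("code-bison")

    prompt = (
        "Generate a PySpark job that reads a BigQuery table, ranks players "
        "by points, and writes the result to another BigQuery table."
    )
    response = model.predict(prefix=prompt, max_output_tokens=2048, temperature=0.2)
    print(response.text)  # generated PySpark code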

Features

  • BigQuery and Cloud Storage: serve as the data storage layer.
  • Dataproc: Dataproc Batches execute the generated PySpark code to process the data.
  • Vertex AI Integration: the Vertex AI library on Google Cloud Platform generates the Terraform code for infrastructure deployment and the PySpark ETL code.
  • Terraform: HCL code automates the creation and destruction of the infrastructure.

Prerequisites

  • Python >= 3.5
  • GCP Account with Vertex AI API access
  • Google Cloud SDK installed and configured
  • Terraform installed

Installation

  1. Clone the repository:

    git clone https://github.com/alfonsozamorac/etl-genai.git
    cd etl-genai
  2. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  3. Install the Vertex AI client library:

    pip3 install --upgrade google-cloud-aiplatform
  4. Export your GCP credentials:

    export GOOGLE_APPLICATION_CREDENTIALS="your_credentials_file.json"
  5. Optionally, export your GCP project so that requests are built for it automatically:

    export MY_GCP_PROJECT="your_project_name"

Usage example: players

  1. Run the Python program to generate the HCL and PySpark code for the players example (a hedged sketch of the generated PySpark code appears after this list):

    python3 ai_gen.py players ${MY_GCP_PROJECT}
  2. Create the infrastructure with Terraform:

    terraform -chdir="generated/players/terraform" init
    terraform -chdir="generated/players/terraform" apply --auto-approve
  3. Insert data into the BigQuery table:

    bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False < example_data/players/insert_players_example.sql
  4. Create a Dataproc batch that runs the PySpark code:

    gcloud dataproc batches submit pyspark generated/players/python/etl.py --version=1.1 --batch=players-genai-$RANDOM --region="europe-west1" --deps-bucket=gs://temporary-files-dataproc --project=${MY_GCP_PROJECT} --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar  
  5. Check the results in the GCP console, or run the query below:

    bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM myligue.players_ranking"
  6. Destroy the infrastructure:

    terraform -chdir="generated/players/terraform" destroy --auto-approve
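
The generated generated/players/python/etl.py is produced by the model, so its exact content varies between runs. As a rough sketch only: the source table myligue.players, the points column, and the staging bucket below are assumptions; only the target table myligue.players_ranking comes from step 5 above.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("players-etl").getOrCreate()

    # Source table name and "points" column are assumptions; only
    # myligue.players_ranking is referenced in the README.
    players = spark.read.format("bigquery").option("table", "myligue.players").load()

    # Rank players by points, descending.
    ranked = players.withColumn(
        "ranking", F.rank().over(Window.orderBy(F.desc("points")))
    )

    # Write with the spark-bigquery connector supplied via --jars; reusing the
    # deps bucket as staging bucket is an assumption.
    (ranked.write.format("bigquery")
        .option("table", "myligue.players_ranking")
        .option("temporaryGcsBucket", "temporary-files-dataproc")
        .mode("overwrite")
        .save())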

Usage example: customers

  1. Run the Python program to generate the HCL and PySpark code for the customers example (a hedged sketch of the generated PySpark code appears after this list):

    python3 ai_gen.py customers ${MY_GCP_PROJECT}
  2. Create the infrastructure with Terraform:

    terraform -chdir="generated/customers/terraform" init
    terraform -chdir="generated/customers/terraform" apply --auto-approve
  3. Insert data into Cloud Storage folders:

    gcloud storage cp example_data/customers/sales.csv gs://sales-etl-bucket/input/sales
    gcloud storage cp example_data/customers/customers.csv gs://sales-etl-bucket/input/customers
  4. Create a Dataproc batch that runs the PySpark code:

    gcloud dataproc batches submit pyspark generated/customers/python/etl.py --version=1.1 --batch=customers-genai-$RANDOM --region="europe-west1" --deps-bucket=gs://temporary-files-dataproc --project=${MY_GCP_PROJECT} --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar  
  5. Check the results in the GCP console, or run the queries below:

    bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM raw_sales_data.customer_table"
    bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM raw_sales_data.sales_table"
    bq query --project_id=${MY_GCP_PROJECT} --use_legacy_sql=False "SELECT * FROM master_sales_data.bigtable_info"
  6. Destroy the infrastructure:

    terraform -chdir="generated/customers/terraform" destroy --auto-approve
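
Again, the generated generated/customers/python/etl.py varies between runs. The sketch below only illustrates the flow implied by steps 3 and 5; the join key customer_id, the CSV header/schema options, and the staging bucket are assumptions, while the input paths and table names come from those steps.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("customers-etl").getOrCreate()

    def write_bq(df, table):
        # Staging bucket is an assumption; any bucket the connector can use works.
        (df.write.format("bigquery")
            .option("table", table)
            .option("temporaryGcsBucket", "temporary-files-dataproc")
            .mode("overwrite")
            .save())

    # Load the CSV files copied in step 3 (header row and schema inference assumed).
    sales = (spark.read.option("header", True).option("inferSchema", True)
             .csv("gs://sales-etl-bucket/input/sales"))
    customers = (spark.read.option("header", True).option("inferSchema", True)
                 .csv("gs://sales-etl-bucket/input/customers"))

    # Raw layer: the tables queried in step 5.
    write_bq(sales, "raw_sales_data.sales_table")
    write_bq(customers, "raw_sales_data.customer_table")

    # Master layer: join on a hypothetical "customer_id" column.
    write_bq(sales.join(customers, on="customer_id", how="inner"),
             "master_sales_data.bigtable_info")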

Usage: create a new example

  1. Add a new key/value entry to the 'examples' dictionary in ai_gen.py (see the hypothetical sketch after this list).

  2. Run the Python program to generate the HCL and PySpark code for your example:

    python3 ai_gen.py ${your_example_name} ${MY_GCP_PROJECT}
  3. Create the infrastructure with Terraform:

    terraform -chdir="generated/${your_example_name}/terraform" init
    terraform -chdir="generated/${your_example_name}/terraform" apply --auto-approve
  4. Insert data into Cloud Storage or BigQuery.

  5. Create a Dataproc batch that runs the PySpark code:

    gcloud dataproc batches submit pyspark generated/${your_example_name}/python/etl.py --version=1.1 --batch=${your_example_name}-genai-$RANDOM --region="europe-west1" --deps-bucket=gs://temporary-files-dataproc --project=${MY_GCP_PROJECT} --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
  6. Check the results in Cloud Storage or BigQuery.

  7. Destroy the infrastructure:

    terraform -chdir="generated/${your_example_name}/terraform" destroy --auto-approve
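
The exact structure of the 'examples' dictionary is defined in ai_gen.py, so check it there first. Purely as a hypothetical illustration, a new entry could pair the example name with a natural-language description of the infrastructure and ETL to generate:

    # Hypothetical entry; verify the real shape of `examples` in ai_gen.py.
    examples = {
        # ... existing entries such as "players" and "customers" ...
        "orders": (
            "Create a BigQuery dataset with an orders table, a Cloud Storage "
            "bucket for input CSV files, and a PySpark job that loads the CSVs, "
            "aggregates revenue per customer, and writes the result to BigQuery."
        ),
    }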
