<a href="https://colab.research.google.com/github/easycloudapi/learn_gcp/blob/main/learning_resources/04_GCP_Dataproc_Service.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataproc Cluster

# Sample Pipeline 1: Preprocessing BigQuery Data with PySpark on Dataproc

Ref: https://codelabs.developers.google.com/codelabs/pyspark-bigquery#1

## Step 1: Enabling the Compute Engine, Dataproc and BigQuery Storage APIs

```shell
gcloud services enable compute.googleapis.com \
dataproc.googleapis.com \
bigquerystorage.googleapis.com
```

## Step 2: Configure the Environment

```shell
# Environment variable setup
PROJECT_ID=test-flask-api-372913
REGION=us-central1
CLUSTER_NAME=demo-cluster-01
# %M is minutes (%m would repeat the month); bucket names must be globally unique
BUCKET_NAME=demo-$(date +%Y%m%d%H%M)
```
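Cloud Storage bucket names must be 3 to 63 characters long, use only lowercase letters, digits, hyphens, underscores, and dots, and start and end with a letter or digit. As a minimal sketch, the check below validates the generated name before any resources are created. The helper name `is_valid_bucket_name` is hypothetical, and the regex covers only these basic rules (not, for example, the ban on names beginning with `goog`).

```shell
# Hypothetical helper: succeeds if the name passes basic GCS bucket naming rules.
is_valid_bucket_name() {
  local LC_ALL=C  # keep [a-z] from matching uppercase letters in some locales
  local name="$1"
  # 3-63 chars; lowercase letters, digits, '-', '_', '.'; alphanumeric at both ends
  [[ "$name" =~ ^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$ ]]
}

# Reuse BUCKET_NAME if it is already set, otherwise build a timestamped fallback.
candidate="${BUCKET_NAME:-demo-$(date +%Y%m%d%H%M)}"
if is_valid_bucket_name "$candidate"; then
  echo "bucket name OK: ${candidate}"
else
  echo "invalid bucket name: ${candidate}" >&2
fi
```

A name that passes this check can still be rejected at creation time if another project already owns it.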

```shell
# Project configuration setup (two separate commands)
gcloud config set project ${PROJECT_ID}
gcloud config set dataproc/region ${REGION}
```

## Step 3: Create a Dataproc Cluster

```shell
# Legacy variant (1.5 image, ANACONDA component); current Dataproc releases
# may no longer accept it, so prefer the working command below
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
    --worker-machine-type n1-standard-2 \
    --master-boot-disk-size 50GB \
    --num-workers 2 \
    --worker-boot-disk-size 100GB \
    --image-version 1.5-debian \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
    --metadata 'PIP_PACKAGES=google-cloud-storage' \
    --optional-components=ANACONDA \
    --enable-component-gateway
```

```shell
# Working command (2.1 image with the Jupyter component)
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --enable-component-gateway \
    --region ${REGION} \
    --project ${PROJECT_ID} \
    --master-machine-type n2-standard-2 \
    --master-boot-disk-size 100 \
    --num-workers 2 \
    --worker-machine-type n2-standard-2 \
    --worker-boot-disk-size 100 \
    --image-version 2.1-debian11 \
    --optional-components JUPYTER \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'
```
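Once the create command returns, it can help to confirm that the cluster actually reached the `RUNNING` state before submitting jobs. The sketch below polls `gcloud dataproc clusters describe` for `status.state`; the helper names `cluster_state` and `wait_for_cluster`, along with the default retry count and delay, are illustrative assumptions.

```shell
# Hypothetical helper: print the cluster's current state (for example RUNNING).
cluster_state() {
  gcloud dataproc clusters describe "$1" --region "$2" \
      --format='value(status.state)'
}

# Hypothetical helper: poll until the cluster is RUNNING or attempts run out.
# Usage: wait_for_cluster NAME REGION [ATTEMPTS] [DELAY_SECONDS]
wait_for_cluster() {
  local name="$1" region="$2" attempts="${3:-30}" delay="${4:-10}"
  local i state
  for ((i = 1; i <= attempts; i++)); do
    state="$(cluster_state "$name" "$region")"
    echo "attempt ${i}: ${state:-unknown}"
    [[ "$state" == "RUNNING" ]] && return 0
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_cluster "${CLUSTER_NAME}" "${REGION}"` returns 0 as soon as the cluster reports `RUNNING`, and 1 if it never does within the allotted attempts.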

## Step 3.1: Create a GCS Bucket

```shell
gsutil mb gs://${BUCKET_NAME}
```
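Running `gsutil mb` against a bucket that already exists fails, so a rerunnable setup may want to check first. A minimal sketch, assuming the hypothetical helper name `ensure_bucket` and that the bucket should live in the same region as the cluster:

```shell
# Hypothetical helper: create the bucket only if it does not exist yet.
# Usage: ensure_bucket BUCKET_NAME LOCATION
ensure_bucket() {
  local bucket="$1" location="$2"
  if gsutil ls -b "gs://${bucket}" >/dev/null 2>&1; then
    echo "bucket already exists: gs://${bucket}"
  else
    # -l sets the bucket location
    gsutil mb -l "${location}" "gs://${bucket}"
  fi
}
```

It would be called as `ensure_bucket "${BUCKET_NAME}" "${REGION}"`. Note that `gsutil ls -b` also fails when the bucket exists but belongs to a project you cannot access, in which case the subsequent `gsutil mb` reports the conflict.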

## Step 4: Data Exploration in **BigQuery**


```sql
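-- Backticks are required: the project ID contains a hyphen and the table name starts with a digit
SELECT * FROM `fh-bigquery.reddit_posts.2017_01` LIMIT 10;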
```
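The table queried above carries a `YYYY_MM` suffix (`2017_01`), so the same preview query can be generated for other months. The sketch below builds the query text in shell; the helper names `reddit_table` and `preview_query` are hypothetical, and it assumes every monthly table follows the same naming pattern.

```shell
# Hypothetical helper: build the table name for a given year and month.
reddit_table() {
  # 10# forces base 10 so that a month such as "08" is not parsed as octal
  printf 'fh-bigquery.reddit_posts.%04d_%02d' "$((10#$1))" "$((10#$2))"
}

# Hypothetical helper: build a preview query; the row limit defaults to 10.
preview_query() {
  printf 'SELECT * FROM `%s` LIMIT %d;' "$(reddit_table "$1" "$2")" "${3:-10}"
}

# Print the query for January 2017
preview_query 2017 1
echo
```

The generated text can be pasted into the BigQuery console, or passed to the `bq` CLI with `bq query --use_legacy_sql=false "$(preview_query 2017 1)"`.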

## Step 5: Clone **pyspark_bq_show_dataproc.py** from Repo And Submit the Job

[Click Here to Get **pyspark_bq_show_dataproc.py** Code](https://github.com/easycloudapi/learn_gcp/blob/main/python_helpers/pyspark_bq_show_dataproc.py)

```shell
# Prints each subreddit name and its count as output
gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=FATAL \
    pyspark_bq_show_dataproc.py
```

## Step 6: Clone **pyspark_bq_to_gcs_dataproc.py** from Repo And Submit the Job

[Click Here to Get The **pyspark_bq_to_gcs_dataproc.py** Code](https://github.com/easycloudapi/learn_gcp/blob/main/python_helpers/pyspark_bq_to_gcs_dataproc.py)

```shell
YEAR=2018
MONTH=12

# Move data from BQ to GCS bucket
gcloud dataproc jobs submit pyspark \
    --cluster ${CLUSTER_NAME} \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=FATAL \
    pyspark_bq_to_gcs_dataproc.py \
    -- ${YEAR} ${MONTH} ${BUCKET_NAME}
```
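Because the script takes the year, month, and bucket as positional arguments, loading several months is a matter of looping over the submit command. A minimal sketch, assuming the hypothetical helper name `backfill_months` and that `CLUSTER_NAME` is set as in Step 2; the jobs run one after another, and the loop stops at the first failure.

```shell
# Hypothetical helper: submit one job per month, stopping at the first failure.
# Usage: backfill_months YEAR BUCKET MONTH [MONTH...]
backfill_months() {
  local year="$1" bucket="$2" month
  shift 2
  for month in "$@"; do
    echo "submitting ${year}-${month}"
    gcloud dataproc jobs submit pyspark \
        --cluster "${CLUSTER_NAME:?CLUSTER_NAME must be set}" \
        --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
        --driver-log-levels root=FATAL \
        pyspark_bq_to_gcs_dataproc.py \
        -- "${year}" "${month}" "${bucket}" || return 1
  done
}
```

For example, `backfill_months 2018 "${BUCKET_NAME}" 10 11 12` would submit three jobs in sequence. Afterwards, `gsutil ls -r gs://${BUCKET_NAME}` lists whatever the jobs wrote to the bucket.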

## Step 7: Exploring the Dataproc and Spark UIs

**Dataproc UI**

![image](https://github.com/easycloudapi/learn_gcp/assets/108976294/81c846a1-0dd6-456a-9820-aee880069be0)

**Spark UI**

(Go to Dataproc **Clusters**, click the cluster name, then click "**WEB INTERFACES**".)

![image](https://github.com/easycloudapi/learn_gcp/assets/108976294/cf4b389e-7d15-411e-9651-ea490771c6fc)

