# Spark & Google Cloud

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

First we will move our local data to a bucket in GCS.
You need to authenticate:

`gcloud auth activate-service-account --key-file KEY.json`

Then navigate to the folder where the .parquet files are, to be able to copy them to the bucket:
`gsutil -m cp -r pq/ gs://BUCKET/pq`


**Explanation:**

- `-m`, multithreaded: since we're copying a lot of files, this runs the copy in parallel

- `cp`, copy

- `-r`, recursive, because we're copying a folder and everything inside it

----

## Reading files in the cloud

**Download Hadoop connector**


To read files stored in the bucket from a local Spark session, we need to download the GCS connector for Hadoop.

- Create a folder `lib` and, from inside it, run the following (the Spark configuration below expects the jar at `./lib/`):

`gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar gcs-connector-hadoop3-2.2.5.jar` 



In [12]:
credentials_location = 'peppy.json'
bucket_name = "imgabi-zoomcamp-kestra"

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set("spark.jars", "./lib/gcs-connector-hadoop3-2.2.5.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)

# the `spark.jars` setting points to the location of the connector jar file


In [13]:
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()

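# these settings tell Hadoop that:
# - when it sees a path on the `gs://` filesystem, it should use the GCS connector
# - it needs to authenticate with the Google credentials in the key file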
hadoop_conf.set("fs.AbstractFileSystem.gs.impl",  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
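Both the connector jar and the key file are referenced by path, and a wrong path can surface later as a less obvious error when reading from `gs://`. A minimal sketch of a pre-flight check follows, assuming the jar and key file paths used in the configuration above. The helper name `check_gcs_prerequisites` is hypothetical and not part of Spark or the connector.

```python
import json
from pathlib import Path


def check_gcs_prerequisites(jar_path, keyfile_path):
    """Return a list of problems found; an empty list means both files look usable."""
    problems = []

    # The connector jar must exist at the path given to `spark.jars`.
    jar = Path(jar_path)
    if not jar.is_file():
        problems.append(f"connector jar not found: {jar}")

    # The key file must exist and be a JSON service account key.
    keyfile = Path(keyfile_path)
    if not keyfile.is_file():
        problems.append(f"key file not found: {keyfile}")
    else:
        try:
            key = json.loads(keyfile.read_text())
        except json.JSONDecodeError:
            problems.append(f"key file is not valid JSON: {keyfile}")
        else:
            # Service account keys downloaded from Google Cloud carry this field.
            if not isinstance(key, dict) or key.get("type") != "service_account":
                problems.append(f"key file is not a service account key: {keyfile}")

    return problems
```

Calling `check_gcs_prerequisites("./lib/gcs-connector-hadoop3-2.2.5.jar", credentials_location)` before building the `SparkConf` and printing the returned list is one way to catch a misplaced file early.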

In [14]:
# create the session
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

In [16]:
# if the configuration works, we can read the data straight from the bucket

df_green = spark.read.parquet(f'gs://{bucket_name}/pq/green/*/*')

In [17]:
df_green.count()

                                                                                

1734051

In [19]:
spark.stop()
sc.stop()

## Create a Local Spark Cluster

First you need to find the location in which Spark is installed.
Since we installed it with brew, run:

`brew --prefix apache-spark`

In my case, I can navigate to this folder: `/usr/local/Cellar/apache-spark/3.5.4/`.

From there, run:

`./sbin/start-master.sh`

Then you will be able to open the Spark master dashboard at `localhost:8080`.

We will see that there are no workers listed. To create one, we can do the following:

`./sbin/start-worker.sh spark://<master-spark-URL>`
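The master URL to pass here is shown at the top of the dashboard. For a standalone master it has the form `spark://<host>:<port>`, with 7077 as the default port. A minimal sketch for building that URL and checking that the master port accepts connections before pointing a worker or a session at it; the helper names `master_url` and `is_reachable` are hypothetical.

```python
import socket


def master_url(host, port=7077):
    """Build a standalone master URL; 7077 is Spark's default master port."""
    return f"spark://{host}:{port}"


def is_reachable(host, port=7077, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unresolved host.
        return False
```

For example, `is_reachable("localhost")` returning `False` suggests the master was never started or has already been stopped. A successful connection only shows that something is listening on the port, not that it is a healthy Spark master.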

<img src="./img/spark-master.png" width="80%">

In [20]:
# now we can connect to the SparkMaster instead of using local[*]
# to do that we need to change the configuration of the SparkConf
import pyspark
from pyspark.sql import SparkSession

# start the session
spark = SparkSession.builder \
    .master("spark://localhost:7077") \
    .appName('test') \
    .getOrCreate()

25/02/24 18:21:46 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.178.241: Command exited with code 143
25/02/24 18:22:05 WARN StandaloneAppClient$ClientEndpoint: Connection to Gabis-iMac-Pro.local:7077 failed; waiting for master to reconnect...
25/02/24 18:22:05 WARN StandaloneSchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...


We can now see that we have one worker and one application (the one we just created above).

<img src="./img/spark-master-2.png" width="80%">

## Creating a Spark Job

We will reuse the code we created earlier, with the taxi schema. 
The code is adapted so it can receive parameters via the command line.
It can be found at [spark_parquet.py](./spark_parquet.py).
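The script itself is not reproduced here. As a rough sketch of how such parameters could be read, assuming the three flags used in the commands in this section (`--input_green`, `--input_yellow`, `--output`) and Python's standard `argparse` module; the actual script may handle this differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the three parameters the job expects (sketch, not the actual script)."""
    parser = argparse.ArgumentParser(description="Taxi report Spark job")
    parser.add_argument("--input_green", required=True,
                        help="path or glob of the green taxi parquet files")
    parser.add_argument("--input_yellow", required=True,
                        help="path or glob of the yellow taxi parquet files")
    parser.add_argument("--output", required=True,
                        help="folder where the report is written")
    # With argv=None, argparse reads the arguments from sys.argv.
    return parser.parse_args(argv)
```

Inside the job, the values would then be available as `args.input_green`, `args.input_yellow` and `args.output`, replacing any hard-coded paths.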

This is how you would run it manually:

```bash
python spark_parquet.py \
    --input_green="data/pq/green/2020/*" \
    --input_yellow="data/pq/yellow/2020/*" \
    --output="data/report-2020"
```


However, we can use `spark-submit` to send the job directly to the cluster. We also specify the master here, and run it in the shell:

```bash
# the globs are quoted so the shell passes them to the script unexpanded
spark-submit \
    --master="spark://<URL>" \
    spark_parquet.py \
        --input_green="data/pq/green/2020/*/" \
        --input_yellow="data/pq/yellow/2020/*/" \
        --output="data/report-2020"
```

Once we are done with the work we need to stop the worker and the master:

`./sbin/stop-worker.sh`

`./sbin/stop-master.sh`


## Setting up a Dataproc Cluster

- Go to the Google Cloud console and select Dataproc clusters (you might need to enable the Dataproc API if it's the first time you're using it)

- Create a new cluster with the default settings. Make sure it's in the same region as your bucket.

- Via the terminal, you can copy the code [spark_parquet.py](./spark_parquet.py) to your bucket:

`gsutil cp spark_parquet.py gs://YOUR_BUCKET/code/spark_parquet.py`

- Click on the cluster you created to submit a new job.
Remember we need to submit the arguments. This is done by adding one argument at a time. 
We also replace the values with our bucket name:

```
--input_green=gs://BUCKET/pq/green/2020/*/
--input_yellow=gs://BUCKET/pq/yellow/2020/*/
--output=gs://BUCKET/report-2020
```

<img src="./img/spark-job.png" width="70%">

Once the job finishes running, we can see a `report-2020` folder inside the bucket.


#### Submitting job via SDK

You can also submit jobs with the SDK. 
**NOTE:** Make sure your service account has enough permissions (*simple solution: the Dataproc Administrator role*)

```bash
gcloud dataproc jobs submit pyspark \
    --cluster=<your-cluster-name> \
    --region=<region-of-your-cluster> \
    gs://<url-of-your-script-in-bucket> \
    -- \
        --param1=<your-param-value> \
        --param2=<your-param-value>
```
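When submitting from a notebook or a script, the same command can be assembled in Python. A minimal sketch, assuming the `gcloud` CLI is installed and authenticated; the helper name `dataproc_submit_command` is hypothetical, and the parameter names passed in `job_args` are whatever your script expects.

```python
def dataproc_submit_command(cluster, region, script_uri, job_args):
    """Build the argument list for `gcloud dataproc jobs submit pyspark`."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        f"--cluster={cluster}",
        f"--region={region}",
        script_uri,
        "--",  # everything after this is passed to the script, not to gcloud
    ]
    # Turn {"output": "gs://..."} into "--output=gs://...".
    cmd.extend(f"--{name}={value}" for name, value in job_args.items())
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Because the arguments are passed as a list, no shell is involved, so a `*` in a `gs://` path reaches the job unchanged.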

