# Creating A Local Spark Cluster

Until now, we've been creating our Spark sessions using:
``` python
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("NYTaxi") \
    .getOrCreate()
```

Now we will start a cluster locally using Spark's "Standalone Mode" and connect to it, loosely following the instructions [here](https://spark.apache.org/docs/latest/spark-standalone.html). The referenced scripts are all in `$SPARK_HOME/sbin`. We started the master with `start-master.sh`.  
This time, we specify the master URL, which we can get either from the master's `.out` log file or from the MasterWebUI, which is now available on port `8080` (as opposed to the application UI on port `4040` that we used with the 'local' method above).

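If you prefer to pull the master URL out of the log programmatically, a minimal sketch follows. It assumes the master's log contains a line ending in `Starting Spark master at spark://<host>:<port>`; the helper name `find_master_url` and the sample log line are hypothetical and only for illustration. In practice you would read the text from the master's `.out` file instead of `sample_log`.

```python
import re

# Assumption: the master's log contains a line such as
# "... INFO Master: Starting Spark master at spark://<host>:<port>".
MASTER_URL_PATTERN = re.compile(r"spark://[^\s:/]+:\d+")


def find_master_url(log_text):
    """Return the first spark://host:port URL found in log_text, or None."""
    match = MASTER_URL_PATTERN.search(log_text)
    return match.group(0) if match else None


# Hypothetical log line, made up for this example.
sample_log = (
    "23/10/31 21:05:00 INFO Master: Starting Spark master at "
    "spark://my-host.internal:7077\n"
)
print(find_master_url(sample_log))
```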
In [1]:
import pyspark
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .master("spark://dezc-vm2.us-west1-a.c.brilliant-vent-400717.internal:7077") \
    .appName("NYTaxi") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/31 21:07:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

Let's run something!

In [4]:
df_green = spark.read.parquet("data/pq/green/*/*")
df_green.printSchema()

23/10/31 21:07:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
23/10/31 21:07:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
                                                                                

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- ehail_fee: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- trip_type: integer (nullable = true)
 |-- congestion_surcharge: double (nullable = true)



We started a job without starting any workers first, hence the warning messages. We first had to start a worker manually using `start-worker.sh <master-spark-URL>`. Once the worker registered with the master, the task ran and we got the output.

Next, we will convert this notebook to a script and modify it to accept parameters. I'm actually going to remove everything below this cell now, but you can see the resulting Python script, [10_spark_sql_script.py](10_spark_sql_script.py), under the same folder.

> **Note 1:** Before we attempt to run the script, we have to free up the resources this notebook is holding in Spark. Until we do, we will see the same warnings we saw in the code cell above: we never specified how many executors the first application (this notebook) needed, so it took everything that was available. The first application can be killed from the MasterWebUI. (It may already have been stopped if we shut down the Jupyter kernel.)

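One way to avoid this situation is to cap what an application may claim when the session is created. The sketch below is not what this notebook did; it assumes the standard `spark.cores.max` and `spark.executor.memory` settings, and the helper names `resource_conf` and `build_session` as well as the values `2` and `"1g"` are hypothetical.

```python
def resource_conf(max_cores, executor_memory):
    """Build Spark settings that cap what one application may claim."""
    return {
        "spark.cores.max": str(max_cores),          # total cores across the cluster
        "spark.executor.memory": executor_memory,   # memory per executor, e.g. "1g"
    }


def build_session(master_url, app_name, conf):
    """Create a SparkSession on the given master with the given settings."""
    # Imported here so resource_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master_url).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Illustrative values only; pick limits that fit your worker.
conf = resource_conf(max_cores=2, executor_memory="1g")
print(conf)

# With a running cluster (the URL below is a placeholder):
# spark = build_session("spark://<master-host>:7077", "NYTaxi", conf)
# ...
# spark.stop()  # releases the application's executors when we are done
```

Calling `spark.stop()` at the end of a notebook or script is the other half of the fix, since it hands the executors back to the cluster.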
## The Evolution of the Script

The first iteration of the script had essentially the same logic, with argument parsing added for flexibility. We ran it with:
``` bash
python 10_spark_sql_script.py \
    --input_green=data/pq/green/2020/*/ \
    --input_yellow=data/pq/yellow/2020/*/ \
    --output=data/report-2020
```
which resulted in:

In [5]:
!tree data/report-2020/

data/report-2020/
├── _SUCCESS
└── part-00000-c82ba606-5ad2-4ce7-86a1-f72094cf6aad-c000.snappy.parquet

0 directories, 2 files

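For reference, the argument parsing behind the command above can be sketched with `argparse` as follows. This is an approximation based only on the three flags shown in the command; the function name `parse_args` and the help texts are assumptions, and the actual script may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the three paths the report script needs."""
    parser = argparse.ArgumentParser(description="Build the taxi report.")
    parser.add_argument("--input_green", required=True,
                        help="path or glob for the green taxi parquet files")
    parser.add_argument("--input_yellow", required=True,
                        help="path or glob for the yellow taxi parquet files")
    parser.add_argument("--output", required=True,
                        help="directory to write the report to")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Same arguments as in the command above, passed explicitly for illustration.
args = parse_args([
    "--input_green=data/pq/green/2020/*/",
    "--input_yellow=data/pq/yellow/2020/*/",
    "--output=data/report-2020",
])
print(args.input_green, args.input_yellow, args.output)
```

The parsed values would then replace the hard-coded paths, for example `spark.read.parquet(args.input_green)`.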

But hard-coding the master like this
```python
spark = SparkSession.builder \
    .master("spark://dezc-vm2.us-west1-a.c.brilliant-vent-400717.internal:7077") \
    .appName("NYTaxi") \
    .getOrCreate()
```
is not ideal in practice. What if we wanted to run this using Airflow? What if we wanted more flexibility, specifying the number of executors we wanted and how much RAM we wanted them to have? That is why we have removed the line that specifies and hardcodes the Spark master above (`.master(...)`) in the final script. We then use the `spark-submit` script to submit our job to the cluster. See the full documentation [here](https://spark.apache.org/docs/latest/submitting-applications.html).
``` bash
spark-submit \
    --master="spark://dezc-vm2.us-west1-a.c.brilliant-vent-400717.internal:7077" \
    10_spark_sql_script.py \
        --input_green=data/pq/green/2021/*/ \
        --input_yellow=data/pq/yellow/2021/*/ \
        --output=data/report-2021
```
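If the job is launched from Python, for example by an orchestrator, the same command can be assembled as an argument list. The sketch below also shows where resource limits would go, assuming the `--total-executor-cores` and `--executor-memory` options of `spark-submit`; the helper name `spark_submit_command`, the placeholder master URL, and the values `2` and `"1g"` are hypothetical and were not used in the run above.

```python
import shlex


def spark_submit_command(master_url, script, script_args,
                         total_executor_cores=None, executor_memory=None):
    """Assemble a spark-submit argument list; resource limits are optional."""
    cmd = ["spark-submit", "--master", master_url]
    if total_executor_cores is not None:
        cmd += ["--total-executor-cores", str(total_executor_cores)]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    # Options for spark-submit must come before the script;
    # everything after the script is passed to the script itself.
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


cmd = spark_submit_command(
    master_url="spark://<master-host>:7077",  # placeholder, use your master URL
    script="10_spark_sql_script.py",
    script_args=[
        "--input_green=data/pq/green/2021/*/",
        "--input_yellow=data/pq/yellow/2021/*/",
        "--output=data/report-2021",
    ],
    total_executor_cores=2,   # illustrative value
    executor_memory="1g",     # illustrative value
)
print(shlex.join(cmd))

# To actually run it (requires spark-submit on the PATH and a running cluster):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so the `*` globs reach the script unchanged and are resolved when Spark reads the paths.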

We see a lot more console output with `spark-submit`, which is helpful because we didn't add much logging to our script.

In [7]:
!tree data/report-2021/

data/report-2021/
├── _SUCCESS
└── part-00000-8b6a8b04-408c-4c4e-bb23-d8a49767d31e-c000.snappy.parquet

0 directories, 2 files


Don't forget to:

In [8]:
!$SPARK_HOME/sbin/stop-master.sh

stopping org.apache.spark.deploy.master.Master


and:

In [9]:
!$SPARK_HOME/sbin/stop-worker.sh

stopping org.apache.spark.deploy.worker.Worker


In the next part, we're going to create a Spark cluster on GCP and use it.