# PySpark Basic: Standalone Cluster
## Previous Steps
Before running this example, it helps to work through the [PySpark basic (DataFrame) example on a local Spark environment](./dataframe.ipynb). This example also requires a Spark environment: follow [the instructions](spark.md) to make sure Spark is installed and configured before you begin.

## Launch a Standalone Cluster
Make sure that your standalone Spark cluster is running. If it is not, start a local standalone cluster:

In [1]:
# run a spark standalone cluster
!sh sparkctl.sh -r

starting org.apache.spark.deploy.master.Master, logging to /home/pen/.local/lib/spark-3.5.4-bin-hadoop3/logs/spark-pen-org.apache.spark.deploy.master.Master-1-emma.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/pen/.local/lib/spark-3.5.4-bin-hadoop3/logs/spark-pen-org.apache.spark.deploy.worker.Worker-1-emma.out
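
Before connecting, it can help to confirm that the master is actually accepting connections. The following is a minimal sketch, assuming the master listens on `localhost:7077` (Spark's default standalone master port, and the address used later in this notebook). The helper name `is_port_open` is hypothetical and not part of Spark or `sparkctl.sh`.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host not resolvable
        return False


# Assumption: the master runs on localhost with the default port 7077.
master_up = is_port_open("localhost", 7077)
print("master reachable:", master_up)
```

A successful TCP connection only shows that something is listening on the port. It does not prove that workers have registered with the master.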


## Initialize PySpark Environment

Make sure Spark is installed in your environment before you begin. PySpark locates the installation through the `SPARK_HOME` environment variable. The following shell snippet sets `SPARK_HOME` to a default install path when the variable is unset or empty:
```
# The SPARK_HOME environment variable must be set.
# Make sure the variable points to the correct path of your Spark installation.
if [ -z "$SPARK_HOME" ] ; then
  export SPARK_HOME="$HOME/.local/lib/spark-3.5.4-bin-hadoop3"
fi
```
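
The same fallback logic can be sketched in Python, which may be convenient inside a notebook where shell exports do not persist between cells. This is an illustration only: the helper name `resolve_spark_home` is hypothetical, and the default path simply mirrors the shell snippet above and should be changed to match your install.

```python
import os
from pathlib import Path

# Assumed default, mirroring the shell snippet; adjust to your installation.
DEFAULT_SPARK_HOME = Path.home() / ".local" / "lib" / "spark-3.5.4-bin-hadoop3"


def resolve_spark_home(env=os.environ, default=DEFAULT_SPARK_HOME) -> str:
    """Return SPARK_HOME from env if it is set and non-empty, else the default."""
    value = env.get("SPARK_HOME", "")
    return value if value else str(default)


spark_home = resolve_spark_home()
# Export the value so that libraries started from this process can see it.
os.environ["SPARK_HOME"] = spark_home
print(spark_home, "exists:", Path(spark_home).is_dir())
```

If the printed path does not exist, `findspark.init()` is unlikely to find Spark there. Passing the path explicitly, as in `findspark.init(spark_home)`, is another option.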

In [2]:
# validate findspark
!pip list | grep spark

findspark                                2.0.1


In [3]:
import findspark
findspark.init()

## PySpark on Standalone Cluster

In [4]:
import pyspark
from pyspark.sql import SparkSession

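# Option 1: create a SparkContext explicitly and wrap it in a SparkSession
#sc = pyspark.SparkContext(master="spark://localhost:7077", appName="pyspark-basic")
#spark = SparkSession(sc)
#
# Option 2: build (or reuse) a SparkSession connected to the standalone master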
spark = SparkSession.builder.master("spark://localhost:7077").appName("pyspark-basic").getOrCreate()

25/03/25 23:08:30 WARN Utils: Your hostname, emma resolves to a loopback address: 127.0.1.1; using 172.18.245.201 instead (on interface eth0)
25/03/25 23:08:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/25 23:08:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
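
The master URL passed to `master(...)` has the form `spark://HOST:PORT`, with 7077 as the default standalone master port. As a small sketch, the URL can be assembled from its parts instead of being hard-coded. The helper `standalone_master_url` is hypothetical and only illustrates the format.

```python
def standalone_master_url(host: str = "localhost", port: int = 7077) -> str:
    """Build a Spark standalone master URL of the form spark://HOST:PORT."""
    if not host:
        raise ValueError("host must be a non-empty string")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"spark://{host}:{port}"


master_url = standalone_master_url()
print(master_url)

# Usage with the session builder from the cell above (requires a running cluster):
# spark = SparkSession.builder.master(master_url).appName("pyspark-basic").getOrCreate()
```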


### Pi example


In [5]:
import random

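num_samples = 100000000

def inside(_):
    # draw a random point in the unit square; the RDD element itself is unused
    x, y = random.random(), random.random()
    # keep the point if it falls inside the quarter circle of radius 1
    return x*x + y*y < 1

# the fraction of points inside the quarter circle approximates pi/4
count = spark.sparkContext.parallelize(range(num_samples)).filter(inside).count()
pi = 4 * count / num_samples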

print(pi)

3.1415364
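
The estimate works because a point drawn uniformly from the unit square lands inside the quarter circle of radius 1 with probability π/4, so four times the observed fraction approximates π. The same logic can be sketched in plain Python without Spark, which is handy for checking the idea on a small sample. The function name `estimate_pi` and the seed parameter are illustrative additions, not part of the original notebook.

```python
import random
from typing import Optional


def estimate_pi(num_samples: int, seed: Optional[int] = None) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # the point lies inside the quarter circle
            hits += 1
    return 4 * hits / num_samples


print(estimate_pi(100_000, seed=42))
```

The Spark version distributes the same sampling loop across the cluster: `filter(inside)` plays the role of the `if` test and `count()` the role of the counter.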


                                                                                

## Clean up

In [6]:
# stop the current spark session for cleanup
spark.stop()

In [7]:
# terminate the running spark standalone cluster
!sh sparkctl.sh -t

stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master


## Additional Resources
- [Apache Spark Examples](https://spark.apache.org/examples.html)
- [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)

## References
- [Spark Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
- [Submitting Spark Applications](https://spark.apache.org/docs/latest/submitting-applications.html)
- [Cluster Mode Overview](https://spark.apache.org/docs/latest/cluster-overview.html)