# Smoke Testing a Spark Cluster

### Here is a minimal cell for connecting to an Apache Spark cluster
This code assumes only that the environment variable `SPARK_CLUSTER` is set, which Open Data Hub does by default.

In [1]:
from pyspark.sql import SparkSession
import os

# Hostname or IP address of your Spark cluster
spark_cluster = os.environ['SPARK_CLUSTER']

spark = SparkSession.builder \
    .master('spark://{cluster}:7077'.format(cluster=spark_cluster)) \
    .appName('Spark-Smoke-Test') \
    .getOrCreate()
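
If `SPARK_CLUSTER` is not set, the cell above fails with a bare `KeyError`. The following is a minimal sketch of a more explicit check. The helper name `spark_master_url` is hypothetical (it is not part of PySpark), and the port 7077 is simply the one used in the cell above.

```python
import os


def spark_master_url(env=None, port=7077):
    """Build a spark:// master URL from SPARK_CLUSTER (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is supplied
    env = os.environ if env is None else env
    host = env.get('SPARK_CLUSTER', '').strip()
    if not host:
        raise RuntimeError('SPARK_CLUSTER is not set; cannot locate the Spark cluster')
    return 'spark://{host}:{port}'.format(host=host, port=port)


# Example with an explicit mapping instead of the real environment
print(spark_master_url({'SPARK_CLUSTER': 'my-spark-cluster'}))
```

The returned string could then be passed to `.master(...)` in place of the inline `format` call.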

### A simple test that your Spark cluster is visible to you and operating
When you are unsure of your environment, it is useful to run a very simple test computation such as
the following, which declares an RDD of the integers 0 through 9, split into 2 partitions, and asks Spark to compute their sum.

In [2]:
rdd = spark.sparkContext.parallelize(range(10), 2)
rdd.sum()

45
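
The second argument to `parallelize` requests 2 partitions. As a rough mental model only (this is plain Python, not a description of how Spark distributes work internally), the sum can be pictured as per-partition partial sums that are combined into one total. The `split_evenly` helper is hypothetical and exists only for this illustration.

```python
def split_evenly(values, num_slices):
    """Split a sequence into num_slices contiguous chunks (illustrative only)."""
    values = list(values)
    n = len(values)
    # Chunk boundaries, e.g. [0, 5, 10] for 10 values in 2 slices
    bounds = [i * n // num_slices for i in range(num_slices + 1)]
    return [values[bounds[i]:bounds[i + 1]] for i in range(num_slices)]


partitions = split_evenly(range(10), 2)
partial_sums = [sum(p) for p in partitions]  # one partial result per partition
total = sum(partial_sums)                    # partial results combined

print(partitions)    # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(partial_sums)  # [10, 35]
print(total)         # 45
```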

### Spark transformations
Recall that Spark transformations are _lazy_: declaring a transformation does not cause Spark to run the actual computation.
The following transformations will not execute until we ask Spark for a result that requires them.

In [3]:
rdd2 = rdd.map(lambda x: 2 * x)
rdd3 = rdd2.map(lambda x: x + 1)
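
For intuition, Python 3's built-in `map` behaves in a loosely similar way: it returns a lazy iterator and runs nothing until the values are consumed. The sketch below is only an analogy in plain Python, not Spark's execution model. The `calls` list and the `double` function are introduced here purely to show when work actually happens.

```python
calls = []


def double(x):
    calls.append(x)  # record that the function actually ran
    return 2 * x


lazy2 = map(double, range(10))        # declared, but nothing has run yet
lazy3 = map(lambda x: x + 1, lazy2)   # still nothing has run
calls_before = len(calls)

result = list(lazy3)                  # consuming the iterator forces evaluation
calls_after = len(calls)

print(calls_before, calls_after)  # 0 10
print(result)                     # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```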

### Spark actions
Actions are Spark operations that force Spark to run the physical computations and return a concrete result.
One of the simplest actions, `collect`, tells Spark to execute the computations and return the final result to the driver program, which in this case is our Jupyter notebook:

In [4]:
rdd3.collect()

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]