# UAT to ensure Kubeflow notebooks can access Spark

This notebook verifies that Kubeflow notebooks carrying the label `access-spark-notebook=true` can access Spark from the notebook environment. The test imports `pyspark`, creates a trivial `SparkSession`, and runs a trivial job that counts the vowels in a sample string.

The expected outcome is that the number of vowels in the sample string is computed to be `130`.

This notebook requires that Kubeflow and Spark have already been deployed.

### Import required packages

In [None]:
import os
from operator import add

from pyspark.sql import SparkSession

### Ensure that environment variables have been injected

In [None]:
assert "SPARK_SERVICE_ACCOUNT" in os.environ, "SPARK_SERVICE_ACCOUNT is not set"
assert "SPARK_NAMESPACE" in os.environ, "SPARK_NAMESPACE is not set"
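
A failing `assert` stops at the first missing variable. As a minimal sketch, a hypothetical helper `missing_env_vars` (not part of the original notebook) could list every absent variable in one pass, which may make a failed run easier to diagnose. The optional `environ` parameter is an assumption added so the helper can be tried without touching the real environment.

```python
import os

# Variable names taken from the checks in this notebook.
REQUIRED_VARS = ("SPARK_SERVICE_ACCOUNT", "SPARK_NAMESPACE")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are absent from the environment."""
    # Hypothetical helper for illustration; falls back to os.environ by default.
    env = os.environ if environ is None else environ
    return [name for name in required if name not in env]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars(environ={"SPARK_NAMESPACE": "demo"}))
```

In the notebook itself, a single `assert not missing_env_vars()` would then cover both variables.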

### Define `count_vowels` function

In [None]:
def count_vowels(text: str) -> int:
    """Count the vowels (a, e, i, o, u) in `text`, ignoring case."""
    count = 0
    for char in text:
        if char.lower() in "aeiou":
            count += 1
    return count
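
As a quick sanity check, the function can be tried on short inputs before it is shipped to Spark. The sketch below uses an equivalent generator-expression formulation (an alternative spelling, not the notebook's own code) so that it runs on its own; the sample strings are illustrative.

```python
def count_vowels(text: str) -> int:
    """Count the vowels (a, e, i, o, u) in `text`, ignoring case."""
    return sum(1 for char in text if char.lower() in "aeiou")


# "Spark" has 1 vowel, "on" has 1, "Kubernetes" has 4.
print(count_vowels("Spark on Kubernetes"))
# Digits and consonants are ignored; only "i" and "o" count here.
print(count_vowels("MicroK8s"))
# An empty string contains no vowels.
print(count_vowels(""))
```

Note that `y` is never counted as a vowel under this definition.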

### Prepare sample data

In [None]:
lines = """Canonical's Charmed Data Platform solution for Apache Spark runs Spark jobs on your Kubernetes cluster.
You can get started right away with MicroK8s - the mightiest tiny Kubernetes distro around! 
The spark-client snap simplifies the setup process to run Spark jobs against your Kubernetes cluster. 
Spark on Kubernetes is a complex environment with many moving parts.
Sometimes, small mistakes can take a lot of time to debug and figure out.
"""

### Create a `SparkSession`

In [None]:
# Create a Spark session
session = SparkSession.builder.appName("CountVowels").getOrCreate()

### Perform computation and assert the correctness of result

In [None]:
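# Split the lines across 2 partitions, count the vowels in each line, then sum the per-line counts.
num_vowels = session.sparkContext.parallelize(lines.splitlines(), 2).map(count_vowels).reduce(add)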
print(f"The number of vowels in the string is {num_vowels}")

expected = count_vowels(lines)
assert num_vowels == expected, f"Expected {expected} vowels, but got {num_vowels}"
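
The check works because newline characters are not vowels, so summing per-line counts gives the same result as counting over the whole text. The following sketch mirrors the map and reduce steps in plain Python, without Spark; the short `sample` string is illustrative and is not the notebook's data.

```python
from functools import reduce
from operator import add


def count_vowels(text: str) -> int:
    """Count the vowels (a, e, i, o, u) in `text`, ignoring case."""
    return sum(1 for char in text if char.lower() in "aeiou")


sample = "Spark on Kubernetes\nis a complex environment\n"

# "Map" step: one count per line, as the Spark job computes per RDD element.
per_line = [count_vowels(line) for line in sample.splitlines()]

# "Reduce" step: combine the per-line counts with addition.
total = reduce(add, per_line)

print(per_line, total)
assert total == count_vowels(sample)
```

Because addition is associative and commutative, the result does not depend on how Spark distributes the lines across partitions.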

### Stop the `SparkSession`

In [None]:
session.stop()
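
If an earlier check fails while the notebook is run non-interactively, a final cell like this one may never execute. One way to guarantee cleanup, sketched here under the assumption that the work can be grouped into a single block, is a small context manager. `stopping` and `FakeSession` are hypothetical names introduced for illustration; `FakeSession` only stands in for a `SparkSession` so the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield `session` and call its `stop()` method on exit, even on error."""
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a SparkSession; records whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeSession()
try:
    with stopping(fake):
        # Simulate a failing check inside the block.
        raise AssertionError("simulated failed check")
except AssertionError:
    pass

print(fake.stopped)
```

With a real session, the same pattern would read `with stopping(session): ...` around the computation and assertion.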