# Spark Sample Notebook

This notebook demonstrates how to connect to the Bitnami Root Spark Docker container, whose master is expected at `spark://localhost:7077`, and run basic Spark operations.

In [None]:
# Step 1: Initialize Spark connection
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session connected to your Docker cluster
spark = SparkSession.builder \
    .appName("VSCode Sample Notebook") \
    .master("spark://localhost:7077") \
    .config("spark.driver.memory", "1g") \
    .getOrCreate()

# Verify connection
print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
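
If the cell above hangs or fails, a quick check is whether the master port accepts TCP connections at all. The following is a minimal sketch using only the standard library; `master_is_reachable` is a hypothetical helper name, and the default host and port are taken from the `.master("spark://localhost:7077")` setting above. A successful TCP connection only shows that something is listening on the port, not that it is a healthy Spark master.

```python
import socket


def master_is_reachable(host="localhost", port=7077, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(f"Spark master reachable: {master_is_reachable()}")
```

If this prints `False`, check that the container is running and that port 7077 is published to the host before retrying the session cell.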

## Creating and Working with DataFrames

The following cell creates a small Spark DataFrame and prints its contents and schema.

In [None]:
# Step 2: Create a sample DataFrame
data = [
    ("Alice", 34, "Data Scientist"),
    ("Bob", 45, "Software Engineer"),
    ("Charlie", 29, "Data Analyst"),
    ("Diana", 41, "DevOps Engineer"),
    ("Evan", 37, "ML Engineer")
]

# Column names only; Spark infers each column's type from the data
schema = ["Name", "Age", "Occupation"]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Display the DataFrame
print("Sample DataFrame:")
df.show()

# Print the schema
print("DataFrame Schema:")
df.printSchema()
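
The cell above passes only column names, so Spark infers the types and Python integers become `long`. When the types matter, they can be declared up front. The sketch below uses a DDL-style schema string, which `createDataFrame` accepts in place of a list of names; `SCHEMA_DDL` and `create_typed_dataframe` are hypothetical names, and the `session` and `rows` arguments are assumed to be the `spark` and `data` objects from the cells above.

```python
# Assumption: the caller passes the `spark` session and `data` rows
# created in the earlier cells.
# DDL-style schema string: each column name followed by its Spark SQL type.
SCHEMA_DDL = "Name STRING, Age INT, Occupation STRING"


def create_typed_dataframe(session, rows):
    """Create a DataFrame whose column types are declared rather than inferred."""
    return session.createDataFrame(rows, schema=SCHEMA_DDL)
```

Calling `create_typed_dataframe(spark, data).printSchema()` should then list `Age` as `integer` rather than `long`.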

## Data Transformation Examples

The next cell filters rows, adds a derived column, and aggregates by occupation.

In [None]:
# Step 3: Perform transformations
from pyspark.sql.functions import col, upper, avg, desc

# Example 1: Filter data
filtered_df = df.filter(col("Age") > 35)
print("People older than 35:")
filtered_df.show()

# Example 2: Transform data
transformed_df = df.withColumn("UPPERCASE_NAME", upper(col("Name")))
print("Names in uppercase:")
transformed_df.select("Name", "UPPERCASE_NAME").show()

# Example 3: Aggregations
avg_age = df.groupBy("Occupation").agg(avg("Age").alias("Average_Age"))
print("Average age by occupation:")
avg_age.orderBy(desc("Average_Age")).show()
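
Because the sample data is tiny, the expected results of the three examples can be cross-checked in plain Python without a cluster. This is only a sanity-check sketch, not how Spark computes the results; it repeats the `data` list from Step 2 so that it runs on its own.

```python
from collections import defaultdict

# Same rows as the Step 2 cell, repeated so this block is self-contained.
data = [
    ("Alice", 34, "Data Scientist"),
    ("Bob", 45, "Software Engineer"),
    ("Charlie", 29, "Data Analyst"),
    ("Diana", 41, "DevOps Engineer"),
    ("Evan", 37, "ML Engineer"),
]

# Example 1 equivalent: keep rows where Age > 35.
older_than_35 = [row for row in data if row[1] > 35]

# Example 2 equivalent: pair each name with its uppercase form.
uppercase_names = [(name, name.upper()) for name, _, _ in data]

# Example 3 equivalent: average age per occupation, highest first.
ages_by_occupation = defaultdict(list)
for _, age, occupation in data:
    ages_by_occupation[occupation].append(age)
average_age = sorted(
    ((occ, sum(ages) / len(ages)) for occ, ages in ages_by_occupation.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(older_than_35)
print(uppercase_names)
print(average_age)
```

Each occupation appears exactly once in the sample, so every group average equals that person's age; add rows that share an occupation to see the aggregation do real work.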

## Creating a Temporary View and Using SQL

Spark allows you to run SQL queries on DataFrames by creating temporary views.

In [None]:
# Step 4: Using SQL with Spark
# Create a temporary view
df.createOrReplaceTempView("employees")

# Run SQL queries
sql_result = spark.sql("""
    SELECT Occupation, COUNT(*) as Count, AVG(Age) as Avg_Age
    FROM employees
    GROUP BY Occupation
    ORDER BY Count DESC
""")

print("SQL Query Result:")
sql_result.show()

## Working with External Data

Spark can read from and write to several file formats; the next cell round-trips the DataFrame through CSV.

In [None]:
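# Step 5: Write DataFrame to CSV
# Write to the shared volume. Spark creates a directory named employees.csv
# containing one or more part files, not a single CSV file.
df.write.mode("overwrite").option("header", "true").csv("/data/employees.csv")
print("DataFrame written to /data/employees.csv")

# Read it back. inferSchema restores numeric types; without it every column
# (including Age) is read as a string.
read_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/employees.csv")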
print("DataFrame read from CSV:")
read_df.show()
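
Spark's CSV writer produces a directory rather than a single file: the path `/data/employees.csv` ends up holding `part-*` data files plus marker files such as `_SUCCESS`. The sketch below lists only the data files. `list_part_files` is a hypothetical helper name, and it assumes the output directory is visible from the machine where this Python code runs, which may not hold if only the container mounts `/data`.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the part files Spark wrote under output_dir, sorted by name."""
    # "part-*" skips _SUCCESS and the hidden .crc checksum files.
    return sorted(p for p in Path(output_dir).glob("part-*") if p.is_file())
```

For example, `list_part_files("/data/employees.csv")` returns the individual CSV parts, which is useful when another tool expects a plain file path.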

## Closing the Spark Session

Stop the Spark session when you are done so that the cluster releases the resources it reserved for this application.

In [None]:
# Step 6: Stop the Spark session when finished
spark.stop()
print("Spark session stopped")
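
If an earlier cell raises an error, the `spark.stop()` call above may never run. One way to guarantee cleanup in a script is to wrap the session in a context manager. This is a minimal sketch: `managed_session` is a hypothetical helper, and it only assumes that the builder passed in has `getOrCreate()` and that the resulting session has `stop()`, as `SparkSession.builder` and `SparkSession` do.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Yield a session from the builder and stop it on exit, even after an error."""
    session = builder.getOrCreate()
    try:
        yield session
    finally:
        # Runs whether the with-block finished normally or raised.
        session.stop()
```

Usage would look like `with managed_session(SparkSession.builder.appName("VSCode Sample Notebook").master("spark://localhost:7077")) as spark:` followed by the work to run. In a notebook, where the session is shared across cells, the explicit final `spark.stop()` cell remains the simpler choice.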