# PySpark Koans - Practice Notebook

This notebook contains all 59 koans as exercises. Fill in the blanks marked with `___` to complete each koan.

**Note**: These koans are designed to work with the browser-based pandas shim. To run with real PySpark, you'll need a Spark environment.

## Categories:
- **Koans 1-30**: PySpark Basics and Operations
- **Koans 101-110**: Delta Lake
- **Koans 201-210**: Unity Catalog
- **Koans 301-310**: Pandas API on Spark

In [None]:
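from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Koans 19 and 30 call spark_sum, which the wildcard import does not define
from pyspark.sql.functions import sum as spark_sum
from pyspark.sql.window import Window
import pyspark.pandas as ps

# In the browser-based version, `spark` is already initialized.
# With real PySpark, create the session yourself, for example:
# spark = SparkSession.builder.appName("pyspark-koans").getOrCreate()

print("✓ Environment ready")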

## Basics

In [None]:
# Koan 1: Creating a DataFrame
# Category: Basics

# Setup
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
columns = ["name", "age"]

# Exercise: Create a DataFrame from the data and columns
df = spark.___(data, columns)

# The DataFrame should have 3 rows
assert df.count() == 3, f"Expected 3 rows, got {df.count()}"
print("✓ DataFrame created with correct row count")

# The DataFrame should have 2 columns
assert len(df.columns) == 2, f"Expected 2 columns, got {len(df.columns)}"
print("✓ DataFrame has correct number of columns")

print("\n🎉 Koan complete! You've learned to create a DataFrame.")

In [None]:
# Koan 2: Selecting Columns
# Category: Basics

# Setup
data = [("Alice", 34, "NYC"), ("Bob", 45, "LA"), ("Charlie", 29, "Chicago")]
df = spark.createDataFrame(data, ["name", "age", "city"])

# Exercise: Select only the 'name' and 'city' columns
result = df.___("name", "___")

# Result should have exactly 2 columns
assert len(result.columns) == 2, f"Expected 2 columns, got {len(result.columns)}"
print("✓ Correct number of columns selected")

# Result should contain 'name' and 'city'
assert "name" in result.columns, "Missing 'name' column"
assert "city" in result.columns, "Missing 'city' column"
print("✓ Correct columns selected")

print("\n🎉 Koan complete! You've learned to select columns.")

In [None]:
# Koan 3: Filtering Rows
# Category: Basics

# Setup
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29), ("Diana", 52)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Filter to only include people over 35
result = df.___(col("age") ___ 35)

# Should have 2 people over 35
assert result.count() == 2, f"Expected 2 rows, got {result.count()}"
print("✓ Correct number of rows filtered")

# Collect and verify
rows = result.collect()
ages = [row["age"] for row in rows]
assert all(age > 35 for age in ages), "Some ages are not > 35"
print("✓ All remaining rows have age > 35")

print("\n🎉 Koan complete! You've learned to filter rows.")

In [None]:
# Koan 4: Adding Columns
# Category: Basics

# Setup
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Add a new column 'age_in_months' that multiplies age by 12
result = df.___("age_in_months", col("___") * 12)

# Should still have 3 rows
assert result.count() == 3
print("✓ Row count unchanged")

# Should now have 3 columns
assert len(result.columns) == 3, f"Expected 3 columns, got {len(result.columns)}"
print("✓ New column added")

# Check calculation is correct
first_row = result.filter(col("name") == "Alice").collect()[0]
assert first_row["age_in_months"] == 408, f"Expected 408, got {first_row['age_in_months']}"
print("✓ Calculation is correct (34 * 12 = 408)")

print("\n🎉 Koan complete! You've learned to add columns.")

In [None]:
# Koan 5: Grouping and Aggregating
# Category: Basics

# Setup
data = [
    ("Sales", "Alice", 5000),
    ("Sales", "Bob", 4500),
    ("Engineering", "Charlie", 6000),
    ("Engineering", "Diana", 6500),
    ("Engineering", "Eve", 5500)
]
df = spark.createDataFrame(data, ["department", "name", "salary"])

# Exercise: Group by department and calculate average salary
result = df.___("department").agg(
    round(___("salary"), 2).alias("avg_salary")
)

# Should have 2 departments
assert result.count() == 2, f"Expected 2 groups, got {result.count()}"
print("✓ Correct number of groups")

# Check Engineering average (6000 + 6500 + 5500) / 3 = 6000
eng_row = result.filter(col("department") == "Engineering").collect()[0]
assert eng_row["avg_salary"] == 6000.0, f"Expected 6000.0, got {eng_row['avg_salary']}"
print("✓ Engineering average salary is correct")

print("\n🎉 Koan complete! You've learned to group and aggregate.")
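
For intuition about what the assertions above check, here is a plain-Python sketch of a grouped average. It deliberately does not use the PySpark API, so it leaves the blanks unanswered. The helper name `average_by_key` and the variable names are hypothetical. It relies on `statistics.fmean` because the wildcard import in the setup cell replaces built-ins such as `sum` and `round` with column functions.

```python
from collections import defaultdict
from statistics import fmean

def average_by_key(records):
    """Hypothetical helper: average the last field of each record, grouped by the first field."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[0]].append(record[-1])
    # fmean always returns a float, matching the 6000.0 expected in the koan
    return {key: fmean(values) for key, values in buckets.items()}

salary_rows = [
    ("Sales", "Alice", 5000),
    ("Sales", "Bob", 4500),
    ("Engineering", "Charlie", 6000),
    ("Engineering", "Diana", 6500),
    ("Engineering", "Eve", 5500),
]
averages = average_by_key(salary_rows)
```

Each department collapses to a single value, which is why the koan expects exactly two result rows.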

In [None]:
# Koan 6: Dropping Columns
# Category: Basics

# Setup
data = [("Alice", 34, "NYC", "F"), ("Bob", 45, "LA", "M")]
df = spark.createDataFrame(data, ["name", "age", "city", "gender"])

# Exercise: Drop the 'gender' column
result = df.___("gender")

assert "gender" not in result.columns, "gender column should be dropped"
assert len(result.columns) == 3, f"Expected 3 columns, got {len(result.columns)}"
print("✓ Dropped gender column")

# Drop multiple columns
result2 = df.___("city", "gender")
assert len(result2.columns) == 2, f"Expected 2 columns, got {len(result2.columns)}"
print("✓ Dropped multiple columns")

print("\n🎉 Koan complete! You've learned to drop columns.")

In [None]:
# Koan 7: Distinct Values
# Category: Basics

# Setup
data = [("Alice", "NYC"), ("Bob", "LA"), ("Alice", "NYC"), ("Charlie", "NYC")]
df = spark.createDataFrame(data, ["name", "city"])

# Exercise: Get distinct rows
result = df.___()

assert result.count() == 3, f"Expected 3 distinct rows, got {result.count()}"
print("✓ Got distinct rows")

# Get distinct cities only
cities = df.select("city").___()
assert cities.count() == 2, f"Expected 2 distinct cities, got {cities.count()}"
print("✓ Got distinct cities (NYC, LA)")

print("\n🎉 Koan complete! You've learned to get distinct values.")

## Column Operations

In [None]:
# Koan 9: Renaming Columns
# Category: Column Operations

# Setup
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Rename 'name' to 'employee_name'
result = df.___(___, "employee_name")

assert "employee_name" in result.columns, "Should have employee_name column"
assert "name" not in result.columns, "Should not have name column anymore"
print("✓ Renamed name to employee_name")

# Rename using alias in select
result2 = df.select(col("name").___("full_name"), col("age"))
assert "full_name" in result2.columns, "Should have full_name column"
print("✓ Used alias in select")

print("\n🎉 Koan complete! You've learned to rename columns.")

In [None]:
# Koan 10: Literal Values
# Category: Column Operations

# Setup
data = [("Alice", 34), ("Bob", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Add a column 'country' with value 'USA' for all rows
result = df.withColumn("country", ___("USA"))

rows = result.collect()
assert all(row["country"] == "USA" for row in rows), "All rows should have country=USA"
print("✓ Added literal column")

# Add a numeric literal
result2 = df.withColumn("bonus", ___(1000))
assert result2.collect()[0]["bonus"] == 1000, "Bonus should be 1000"
print("✓ Added numeric literal")

print("\n🎉 Koan complete! You've learned to use literal values.")

In [None]:
# Koan 11: Conditional Logic with when/otherwise
# Category: Column Operations

# Setup
data = [("Alice", 34), ("Bob", 45), ("Charlie", 17), ("Diana", 65)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Create an 'age_group' column based on age
result = df.withColumn(
    "age_group",
    ___(col("age") < 18, "minor")
    .when(col("age") < 65, "adult")
    .___("senior")
)

rows = result.collect()
groups = {row["name"]: row["age_group"] for row in rows}

assert groups["Charlie"] == "minor", f"Charlie should be minor, got {groups['Charlie']}"
print("✓ Charlie (17) is minor")

assert groups["Alice"] == "adult", f"Alice should be adult, got {groups['Alice']}"
print("✓ Alice (34) is adult")

assert groups["Diana"] == "senior", f"Diana should be senior, got {groups['Diana']}"
print("✓ Diana (65) is senior")

print("\n🎉 Koan complete! You've learned conditional column logic.")
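
The chained conditions in this koan are evaluated top to bottom, and the first one that matches wins. A plain-Python sketch of that evaluation order (not the PySpark API; `first_matching_label` is a hypothetical name) may help explain why Diana, at exactly 65, falls through to the fallback:

```python
def first_matching_label(age):
    """Hypothetical stand-in for a chained condition: branches are tried in order."""
    branches = [
        (lambda a: a < 18, "minor"),
        (lambda a: a < 65, "adult"),  # only reached when a >= 18
    ]
    for predicate, label in branches:
        if predicate(age):
            return label
    return "senior"  # fallback when no branch matched

people = [("Alice", 34), ("Bob", 45), ("Charlie", 17), ("Diana", 65)]
labels = {name: first_matching_label(age) for name, age in people}
```

Because the second branch is only reached after the first has failed, it does not need to repeat the `>= 18` check.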

In [None]:
# Koan 12: Type Casting
# Category: Column Operations

# Setup
data = [("Alice", "34"), ("Bob", "45")]
df = spark.createDataFrame(data, ["name", "age_str"])

# Exercise: Cast age_str from string to integer
result = df.withColumn("age", col("age_str").cast("___"))

# Verify we can do math on the new column
result = result.withColumn("age_plus_10", col("age") + 10)

rows = result.collect()
assert rows[0]["age_plus_10"] == 44, f"Expected 44, got {rows[0]['age_plus_10']}"
print("✓ Cast to integer and performed math")

# Cast to double
result2 = df.withColumn("age_float", col("age_str").cast("___"))
print("✓ Cast to double")

print("\n🎉 Koan complete! You've learned to cast types.")

## String Functions

In [None]:
# Koan 13: String Functions - Case
# Category: String Functions

# Setup
data = [("alice smith",), ("BOB JONES",), ("Charlie Brown",)]
df = spark.createDataFrame(data, ["name"])

# Exercise: Convert to uppercase
result = df.withColumn("upper_name", ___(col("name")))
assert result.collect()[0]["upper_name"] == "ALICE SMITH"
print("✓ Converted to uppercase")

# Convert to lowercase
result = df.withColumn("lower_name", ___(col("name")))
assert result.collect()[1]["lower_name"] == "bob jones"
print("✓ Converted to lowercase")

# Convert to title case (capitalize first letter of each word)
result = df.withColumn("title_name", ___(col("name")))
assert result.collect()[0]["title_name"] == "Alice Smith"
print("✓ Converted to title case")

print("\n🎉 Koan complete! You've learned string case functions.")

In [None]:
# Koan 14: String Functions - Concatenation
# Category: String Functions

# Setup
data = [("Alice", "Smith"), ("Bob", "Jones")]
df = spark.createDataFrame(data, ["first", "last"])

# Exercise: Concatenate first and last name with a space
result = df.withColumn("full_name", ___(col("first"), lit(" "), col("last")))
assert result.collect()[0]["full_name"] == "Alice Smith"
print("✓ Concatenated with concat()")

# Use concat_ws (with separator) - cleaner for multiple values
result2 = df.withColumn("full_name", ___(" ", col("first"), col("last")))
assert result2.collect()[0]["full_name"] == "Alice Smith"
print("✓ Concatenated with concat_ws()")

print("\n🎉 Koan complete! You've learned string concatenation.")

In [None]:
# Koan 15: String Functions - Substring and Length
# Category: String Functions

# Setup
data = [("Alice",), ("Bob",), ("Charlotte",)]
df = spark.createDataFrame(data, ["name"])

# Exercise: Get the length of each name
result = df.withColumn("name_length", ___(col("name")))
lengths = [row["name_length"] for row in result.collect()]
assert lengths == [5, 3, 9], f"Expected [5, 3, 9], got {lengths}"
print("✓ Calculated string lengths")

# Get first 3 characters (substring is 1-indexed!)
result2 = df.withColumn("first_three", ___(col("name"), 1, 3))
firsts = [row["first_three"] for row in result2.collect()]
assert firsts == ["Ali", "Bob", "Cha"], f"Expected ['Ali', 'Bob', 'Cha'], got {firsts}"
print("✓ Extracted first 3 characters")

print("\n🎉 Koan complete! You've learned substring and length.")
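
The "1-indexed" remark in this koan is easy to trip over when coming from Python slicing, which counts from 0. The sketch below is plain Python, not the PySpark function. `one_based_substring` is a hypothetical helper that only handles positions of 1 or greater, and it shows how a (position, length) pair maps onto a slice.

```python
def one_based_substring(text, pos, length):
    """Hypothetical helper: (pos, length) substring where pos counts from 1."""
    if pos < 1:
        raise ValueError("this sketch only handles pos >= 1")
    start = pos - 1  # shift to Python's 0-based indexing
    return text[start:start + length]

first_three = [one_based_substring(name, 1, 3) for name in ["Alice", "Bob", "Charlotte"]]
middle = one_based_substring("Charlotte", 4, 3)
```

Position 1 corresponds to slice index 0, so `(1, 3)` selects the same characters as `text[0:3]`.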

In [None]:
# Koan 16: String Functions - Trim and Pad
# Category: String Functions

# Setup
data = [("  Alice  ",), ("Bob",), (" Charlie ",)]
df = spark.createDataFrame(data, ["name"])

# Exercise: Trim whitespace from both sides
result = df.withColumn("trimmed", ___(col("name")))
trimmed = [row["trimmed"] for row in result.collect()]
assert trimmed == ["Alice", "Bob", "Charlie"], f"Expected trimmed names, got {trimmed}"
print("✓ Trimmed whitespace")

# Pad names to 10 characters with asterisks
result2 = df.withColumn("trimmed", trim(col("name")))
result2 = result2.withColumn("padded", ___(col("trimmed"), 10, "*"))
assert result2.collect()[1]["padded"] == "*******Bob"
print("✓ Left-padded with asterisks")

print("\n🎉 Koan complete! You've learned trim and pad functions.")

## Aggregations

In [None]:
# Koan 17: Grouping and Aggregating
# Category: Aggregations

# Setup
data = [
    ("Sales", "Alice", 5000),
    ("Sales", "Bob", 4500),
    ("Engineering", "Charlie", 6000),
    ("Engineering", "Diana", 6500),
    ("Engineering", "Eve", 5500)
]
df = spark.createDataFrame(data, ["department", "name", "salary"])

# Exercise: Group by department and calculate average salary
result = df.___("department").agg(
    round(___("salary"), 2).alias("avg_salary")
)

# Should have 2 departments
assert result.count() == 2, f"Expected 2 groups, got {result.count()}"
print("✓ Correct number of groups")

# Check Engineering average (6000 + 6500 + 5500) / 3 = 6000
eng_row = result.filter(col("department") == "Engineering").collect()[0]
assert eng_row["avg_salary"] == 6000.0, f"Expected 6000.0, got {eng_row['avg_salary']}"
print("✓ Engineering average salary is correct")

print("\n🎉 Koan complete! You've learned to group and aggregate.")

In [None]:
# Koan 18: Multiple Aggregations
# Category: Aggregations

# Setup
data = [
    ("Sales", 5000), ("Sales", 4500), ("Sales", 6000),
    ("Engineering", 6000), ("Engineering", 6500)
]
df = spark.createDataFrame(data, ["department", "salary"])

# Exercise: Calculate min, max, avg, and count per department
result = df.groupBy("department").agg(
    ___("salary").alias("min_salary"),
    ___("salary").alias("max_salary"),
    avg("salary").alias("avg_salary"),
    ___("salary").alias("emp_count")
)

sales = result.filter(col("department") == "Sales").collect()[0]

assert sales["min_salary"] == 4500, f"Min should be 4500, got {sales['min_salary']}"
print("✓ Min salary correct")

assert sales["max_salary"] == 6000, f"Max should be 6000, got {sales['max_salary']}"
print("✓ Max salary correct")

assert sales["emp_count"] == 3, f"Count should be 3, got {sales['emp_count']}"
print("✓ Employee count correct")

print("\n🎉 Koan complete! You've learned multiple aggregations.")

In [None]:
# Koan 19: Aggregate Without Grouping
# Category: Aggregations

# Setup
data = [(100,), (200,), (300,), (400,), (500,)]
df = spark.createDataFrame(data, ["value"])

# Exercise: Calculate sum of all values without grouping
result = df.___(spark_sum("value").alias("total"))

total = result.collect()[0]["total"]
assert total == 1500, f"Expected 1500, got {total}"
print("✓ Sum calculated: 1500")

# Calculate multiple aggregates
result2 = df.agg(
    spark_sum("value").alias("total"),
    ___("value").alias("average"),
    count("value").alias("num_rows")
)

row = result2.collect()[0]
assert row["average"] == 300.0, f"Expected 300.0, got {row['average']}"
assert row["num_rows"] == 5, f"Expected 5, got {row['num_rows']}"
print("✓ Multiple aggregates calculated")

print("\n🎉 Koan complete! You've learned global aggregations.")

## Joins

In [None]:
# Koan 20: Inner Join
# Category: Joins

# Setup
employees = spark.createDataFrame([
    (1, "Alice", 101),
    (2, "Bob", 102),
    (3, "Charlie", 101)
], ["emp_id", "name", "dept_id"])

departments = spark.createDataFrame([
    (101, "Engineering"),
    (102, "Sales"),
    (103, "Marketing")
], ["dept_id", "dept_name"])

# Exercise: Join employees with departments on dept_id
result = employees.___(departments, ___, "inner")

# Should have 3 rows (all employees have matching departments)
assert result.count() == 3, f"Expected 3 rows, got {result.count()}"
print("✓ Correct number of joined rows")

# Should have columns from both DataFrames
assert "name" in result.columns, "Missing 'name' column"
assert "dept_name" in result.columns, "Missing 'dept_name' column"
print("✓ Columns from both DataFrames present")

# Alice should be in Engineering
alice = result.filter(col("name") == "Alice").collect()[0]
assert alice["dept_name"] == "Engineering", f"Expected Engineering, got {alice['dept_name']}"
print("✓ Join matched correctly")

print("\n🎉 Koan complete! You've learned inner joins.")

In [None]:
# Koan 21: Left Outer Join
# Category: Joins

# Setup
employees = spark.createDataFrame([
    (1, "Alice", 101),
    (2, "Bob", 102),
    (3, "Charlie", 999)  # No matching department!
], ["emp_id", "name", "dept_id"])

departments = spark.createDataFrame([
    (101, "Engineering"),
    (102, "Sales")
], ["dept_id", "dept_name"])

# Exercise: Left join to keep all employees, even without matching dept
result = employees.join(departments, "dept_id", "___")

# Should have 3 rows (all employees kept)
assert result.count() == 3, f"Expected 3 rows, got {result.count()}"
print("✓ All employees kept")

# Charlie should have null department name
charlie = result.filter(col("name") == "Charlie").collect()[0]
assert charlie["dept_name"] is None, f"Expected None, got {charlie['dept_name']}"
print("✓ Charlie has no matching department (null)")

print("\n🎉 Koan complete! You've learned left outer joins.")
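
To see why Charlie survives the join with a null department, it can help to model the right-hand table as a lookup. This is a plain-Python sketch, not the PySpark API. `left_join` is a hypothetical helper, and it assumes each `dept_id` appears at most once on the right side.

```python
def left_join(left_rows, right_lookup):
    """Hypothetical helper: keep every left row; unmatched keys get None."""
    joined = []
    for emp_id, name, dept_id in left_rows:
        # dict.get returns None when dept_id has no match, mirroring a null
        joined.append((emp_id, name, dept_id, right_lookup.get(dept_id)))
    return joined

employee_rows = [(1, "Alice", 101), (2, "Bob", 102), (3, "Charlie", 999)]
dept_lookup = {101: "Engineering", 102: "Sales"}
joined_rows = left_join(employee_rows, dept_lookup)
```

An inner join would instead skip any row whose key is missing from the lookup, leaving only two rows here.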

In [None]:
# Koan 22: Join on Multiple Columns
# Category: Joins

# Setup
orders = spark.createDataFrame([
    ("2024", "Q1", "Alice", 100),
    ("2024", "Q2", "Alice", 150),
    ("2024", "Q1", "Bob", 200)
], ["year", "quarter", "rep", "amount"])

targets = spark.createDataFrame([
    ("2024", "Q1", 120),
    ("2024", "Q2", 140)
], ["year", "quarter", "target"])

# Exercise: Join on both year and quarter
result = orders.join(targets, [___, ___], "inner")

# Should have 3 rows
assert result.count() == 3, f"Expected 3 rows, got {result.count()}"
print("✓ Joined on multiple columns")

# Check that Alice Q1 has target 120
alice_q1 = result.filter((col("rep") == "Alice") & (col("quarter") == "Q1")).collect()[0]
assert alice_q1["target"] == 120, f"Expected target 120, got {alice_q1['target']}"
print("✓ Targets matched correctly")

print("\n🎉 Koan complete! You've learned multi-column joins.")

## Window Functions

In [None]:
# Koan 23: Window Functions - Running Total
# Category: Window Functions

# Setup
data = [
    ("2024-01-01", 100),
    ("2024-01-02", 150),
    ("2024-01-03", 200),
    ("2024-01-04", 175)
]
df = spark.createDataFrame(data, ["date", "sales"])

# Exercise: Create a window that orders by date and includes all previous rows
window_spec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.___)

# Add running total column
result = df.withColumn("running_total", ___("sales").over(window_spec))

# Check the running totals
rows = result.orderBy("date").collect()

assert rows[0]["running_total"] == 100, "Day 1 should be 100"
print("✓ Day 1: 100")

assert rows[1]["running_total"] == 250, "Day 2 should be 250 (100+150)"
print("✓ Day 2: 250")

assert rows[3]["running_total"] == 625, "Day 4 should be 625"
print("✓ Day 4: 625 (cumulative)")

print("\n🎉 Koan complete! You've learned window running totals.")
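
A frame that runs from the start of the ordering up to the current row is a cumulative sum. As a plain-Python sketch (not the window API), `itertools.accumulate` produces the same four totals the assertions expect, assuming the values are already sorted by date:

```python
from itertools import accumulate

# Sales values in date order, taken from the koan's setup data
daily_sales = [100, 150, 200, 175]

# Each element is the total of everything up to and including that position
running_totals = list(accumulate(daily_sales))
```

The third value, 450, is not asserted in the koan but follows the same pattern (100 + 150 + 200).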

In [None]:
# Koan 24: Window Functions - Row Number
# Category: Window Functions

# Setup
data = [
    ("Sales", "Alice", 5000),
    ("Sales", "Bob", 5500),
    ("Engineering", "Charlie", 6000),
    ("Engineering", "Diana", 6500),
    ("Engineering", "Eve", 5500)
]
df = spark.createDataFrame(data, ["dept", "name", "salary"])

# Exercise: Rank employees within each department by salary (highest first)
window_spec = Window.partitionBy("___").orderBy(col("salary").desc())

result = df.withColumn("rank", ___().___(window_spec))

# Check rankings
eng = result.filter(col("dept") == "Engineering").orderBy("rank").collect()
assert eng[0]["name"] == "Diana", f"Diana should be #1 in Engineering, got {eng[0]['name']}"
assert eng[0]["rank"] == 1
print("✓ Diana is #1 in Engineering ($6500)")

assert eng[1]["name"] == "Charlie", f"Charlie should be #2, got {eng[1]['name']}"
print("✓ Charlie is #2 in Engineering ($6000)")

print("\n🎉 Koan complete! You've learned row_number().")

In [None]:
# Koan 25: Window Functions - Lag and Lead
# Category: Window Functions

# Setup
data = [
    ("2024-01-01", 100),
    ("2024-01-02", 150),
    ("2024-01-03", 120),
    ("2024-01-04", 200)
]
df = spark.createDataFrame(data, ["date", "price"])

# Exercise: Get yesterday's price and calculate daily change
window_spec = Window.orderBy("date")

result = df.withColumn("prev_price", ___("price", 1).over(window_spec))
result = result.withColumn("change", col("price") - col("prev_price"))

rows = result.orderBy("date").collect()

# First row has no previous
assert rows[0]["prev_price"] is None, "First row should have no prev_price"
print("✓ First row has no previous")

# Second row: prev=100, change=50
assert rows[1]["prev_price"] == 100, f"Expected prev=100, got {rows[1]['prev_price']}"
assert rows[1]["change"] == 50, f"Expected change=50, got {rows[1]['change']}"
print("✓ Day 2: prev=100, change=+50")

# Get tomorrow's price
result2 = df.withColumn("next_price", ___("price", 1).over(window_spec))
rows2 = result2.orderBy("date").collect()
assert rows2[0]["next_price"] == 150, f"Expected next=150, got {rows2[0]['next_price']}"
print("✓ Lead shows next day's price")

print("\n🎉 Koan complete! You've learned lag and lead.")
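
Looking one row back or one row ahead amounts to shifting an ordered list and padding the gap with a null. The following plain-Python sketch (not the window API; `shift` is a hypothetical helper) reproduces the values checked above:

```python
def shift(values, offset):
    """Hypothetical helper: positive offset looks back, negative looks ahead; gaps become None."""
    n = len(values)
    return [values[i - offset] if 0 <= i - offset < n else None for i in range(n)]

prices = [100, 150, 120, 200]  # already ordered by date
previous = shift(prices, 1)    # value from the row before
following = shift(prices, -1)  # value from the row after

# A change is undefined where there is no previous value
changes = [None if prev is None else cur - prev for cur, prev in zip(prices, previous)]
```

The first row has nothing before it and the last row has nothing after it, which is where the `None` values appear.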

## Null Handling

In [None]:
# Koan 26: Handling Nulls - Detection
# Category: Null Handling

# Setup
data = [("Alice", 34), ("Bob", None), ("Charlie", 29), (None, 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Filter to rows where age is not null
result = df.filter(col("age").___())

assert result.count() == 3, f"Expected 3 rows with age, got {result.count()}"
print("✓ Filtered to non-null ages")

# Filter to rows where age IS null
nulls = df.filter(col("age").___())
assert nulls.count() == 1, f"Expected 1 null age, got {nulls.count()}"
print("✓ Found rows with null age")

# Check for null name
null_names = df.filter(col("name").isNull())
assert null_names.count() == 1
print("✓ Found row with null name")

print("\n🎉 Koan complete! You've learned null detection.")

In [None]:
# Koan 27: Handling Nulls - Fill and Drop
# Category: Null Handling

# Setup
data = [("Alice", 34), ("Bob", None), (None, 29), ("Diana", None)]
df = spark.createDataFrame(data, ["name", "age"])

# Exercise: Fill null ages with 0
result = df.___(0, subset=["age"])

ages = [row["age"] for row in result.collect()]
assert None not in ages, "Should have no null ages"
assert ages.count(0) == 2, "Should have 2 zeros"
print("✓ Filled null ages with 0")

# Fill null names with "Unknown"
result2 = df.fillna("Unknown", subset=["name"])
names = [row["name"] for row in result2.collect()]
assert "Unknown" in names, "Should have Unknown name"
print("✓ Filled null names")

# Drop rows with ANY null values
result3 = df.___()
assert result3.count() == 1, f"Expected 1 complete row, got {result3.count()}"
print("✓ Dropped rows with nulls")

print("\n🎉 Koan complete! You've learned to handle nulls.")

## Advanced

In [None]:
# Koan 28: Union DataFrames
# Category: Advanced

# Setup
df1 = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df2 = spark.createDataFrame([("Charlie", 29), ("Diana", 52)], ["name", "age"])

# Exercise: Combine two DataFrames with the same schema
result = df1.___(df2)

assert result.count() == 4, f"Expected 4 rows, got {result.count()}"
print("✓ Combined DataFrames")

names = [row["name"] for row in result.collect()]
assert "Alice" in names and "Charlie" in names, "Should have names from both DFs"
print("✓ Contains data from both DataFrames")

print("\n🎉 Koan complete! You've learned to union DataFrames.")

In [None]:
# Koan 29: Explode Arrays
# Category: Advanced

# Setup
data = [("Alice", "python,sql,spark"), ("Bob", "java,scala")]
df = spark.createDataFrame(data, ["name", "skills_str"])

# First split the string into an array
df = df.withColumn("skills", split(col("skills_str"), ","))

# Exercise: Explode the skills array into separate rows
result = df.select("name", ___(col("skills")).alias("skill"))

assert result.count() == 5, f"Expected 5 rows, got {result.count()}"
print("✓ Exploded to 5 rows")

alice_skills = [row["skill"] for row in result.filter(col("name") == "Alice").collect()]
assert len(alice_skills) == 3, f"Alice should have 3 skills, got {len(alice_skills)}"
assert "spark" in alice_skills
print("✓ Alice has 3 skills including spark")

print("\n🎉 Koan complete! You've learned to explode arrays.")
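
Turning one row with an array into one row per element is the same shape as a nested comprehension. This plain-Python sketch (not the PySpark API; the variable names are hypothetical) shows where the five rows come from:

```python
skill_rows = [("Alice", "python,sql,spark"), ("Bob", "java,scala")]

# The outer loop walks the rows; the inner loop emits one (name, skill) pair per element
exploded = [
    (name, skill)
    for name, skills_str in skill_rows
    for skill in skills_str.split(",")
]
```

Alice contributes three pairs and Bob two, so the name column is repeated once per skill.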

In [None]:
# Koan 30: Pivot Tables
# Category: Advanced

# Setup
data = [
    ("Alice", "Q1", 100), ("Alice", "Q2", 150),
    ("Bob", "Q1", 200), ("Bob", "Q2", 180)
]
df = spark.createDataFrame(data, ["name", "quarter", "sales"])

# Exercise: Pivot to get quarters as columns
result = df.groupBy("name").___(___).agg(spark_sum("sales"))

# Should have columns: name, Q1, Q2
assert "Q1" in result.columns, "Should have Q1 column"
assert "Q2" in result.columns, "Should have Q2 column"
print("✓ Pivoted quarters to columns")

alice = result.filter(col("name") == "Alice").collect()[0]
assert alice["Q1"] == 100, f"Expected Q1=100, got {alice['Q1']}"
assert alice["Q2"] == 150, f"Expected Q2=150, got {alice['Q2']}"
print("✓ Values correctly placed in columns")

print("\n🎉 Koan complete! You've learned pivot tables.")
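
A pivot takes the distinct values of one column and turns them into column headers, aggregating whatever lands in each cell. As a plain-Python sketch (not the PySpark API; `pivot_totals` is a hypothetical helper), a dictionary of dictionaries captures the same layout:

```python
from collections import defaultdict

def pivot_totals(records):
    """Hypothetical helper: one entry per name, one key per quarter, amounts added up."""
    table = defaultdict(dict)
    for name, quarter, amount in records:
        # Accumulate in case a (name, quarter) pair occurs more than once
        table[name][quarter] = table[name].get(quarter, 0) + amount
    return dict(table)

sales_records = [
    ("Alice", "Q1", 100), ("Alice", "Q2", 150),
    ("Bob", "Q1", 200), ("Bob", "Q2", 180),
]
pivoted = pivot_totals(sales_records)
```

The outer keys play the role of the grouped rows and the inner keys the role of the new columns.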

## Delta Lake, Unity Catalog, and Pandas API on Spark

The remaining koans (101-110, 201-210, 301-310) require specialized environments and are documented in the solutions notebook.