## Interview Questions

### Q1. What is the difference between a transformation and an action in Spark? Give at least 3 examples of each.

**Transformation**:  

- A transformation defines a new RDD or DataFrame from an existing one but **does not trigger execution immediately**. 
- Transformations are **lazy**: Spark only records the lineage (the plan of operations) and waits until a result is actually needed. 
- This allows Spark to **optimize the execution plan before running it**. 
- Examples of transformations include `select()`, `filter()`, `map()`, `withColumn()`, and `groupBy()`.

**Action**:  

- An action **triggers the actual execution** of the Spark job and either **returns a result to the driver** or **writes data to external storage**. 
- Actions force Spark to materialize the computation defined by the transformations. 
- Examples of actions include `count()`, `collect()`, `show()`, `take()`, and `saveAsTextFile()`. 

In short, transformations build the plan, while actions execute it.



### Q2. Why does `spark.read.csv()` load all columns as `StringType` by default? What problems can this cause in aggregation or filtering?

By default, `spark.read.csv()` loads all columns as `StringType` because CSV files are plain text with no embedded schema, and Spark avoids guessing data types. Inferring a schema requires an extra pass over the data, which can be expensive for large datasets, so Spark skips it unless `inferSchema=true` is set or an explicit schema is provided.

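This behavior can **cause issues in aggregation and filtering**.  

For example: 
- Numeric aggregations like `sum()` or `avg()` can fail or produce incorrect results when the column is a string. Where they do run, they rely on an implicit string-to-number cast, so malformed values may silently become `null` or raise an error, depending on the Spark version and ANSI settings. Functions such as `min()` and `max()`, as well as sorting, compare the values as text. 
- Filter conditions can behave unexpectedly: when two strings are compared, `"100" < "20"` evaluates to true because the comparison is lexicographic. 

To avoid these problems, it is best practice to explicitly define a schema or cast columns to the appropriate data types before performing analytics.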





### Q3. What does `explode()` do in Spark? In what scenario can `explode()` cause serious performance issues?

The `explode()` function in Spark takes an array or map column and **creates a new row for each element** in that collection. Essentially, it “flattens” nested data so that each element becomes its own row while the values of the other columns are duplicated.    

This is commonly used when working with: 
- JSON data, 
- nested arrays, 
- or event logs.

However, `explode()` can cause serious performance issues when applied to columns with **very large arrays or highly skewed array sizes**: 
- A single input row can generate thousands or millions of output rows, leading to data explosion, increased memory usage, and expensive shuffles downstream. 
- If not carefully managed, this can result in slow jobs or even out-of-memory errors on executors.





### Q4. Why is `groupBy()` considered a wide transformation? What happens under the hood when a wide transformation is executed?

Unlike narrow transformations, where each output partition depends on a single input partition, wide transformations depend on multiple input partitions.   


`groupBy()` is considered a **wide transformation** because it requires all rows with the same key to be brought together, which usually means data must be redistributed across partitions. 

Under the hood, when a wide transformation like `groupBy()` is executed:  
- Spark performs a **shuffle**, which marks a stage boundary in the job. 
- The shuffle involves 
  - partitioning the data by the grouping key, 
  - writing intermediate shuffle files to local disk, 
  - transferring them across the network, 
  - and reading them on the destination executors. 
- This process is expensive in terms of disk I/O, network traffic, and latency, which is why wide transformations are often the **main performance bottleneck** in Spark applications.
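
The redistribution step can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: partitions are lists, the helper names are hypothetical, and CRC32 stands in for Spark's own partitioner only because it is deterministic. It shows why every output partition may need records from every input partition.

```python
import zlib
from collections import defaultdict

def target_partition(key: str, num_partitions: int) -> int:
    """Choose an output partition from the key (CRC32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_by_key(input_partitions, num_partitions):
    """Redistribute (key, value) records so equal keys land in the same output partition."""
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:       # each "map task" reads one input partition
        for key, value in partition:
            output[target_partition(key, num_partitions)].append((key, value))
    return output

def count_per_key(partition):
    """The "reduce" side: aggregate records that now sit in one partition."""
    counts = defaultdict(int)
    for key, _ in partition:
        counts[key] += 1
    return dict(counts)

# Three input partitions; the key "NYC" appears in all of them.
input_partitions = [
    [("NYC", 1), ("LA", 1)],
    [("NYC", 1), ("Boston", 1)],
    [("LA", 1), ("NYC", 1)],
]

shuffled = shuffle_by_key(input_partitions, num_partitions=2)
results = [count_per_key(p) for p in shuffled]
merged = {key: count for part in results for key, count in part.items()}
print(merged)
```

In real Spark the records are serialized to shuffle files and fetched over the network rather than appended to in-memory lists, which is where the cost comes from.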

## Coding

### Question 1
Assume you have a DataFrame peopleDF with the following schema:   
name: string   
age: integer  
city: string   

Task: 
1.  Remove duplicate rows  
2.  Keep only people whose age is between 20 and 40  
3.  Create a new column is_adult:  
    - 1 if age ≥ 18  
    - 0 otherwise  
4.  Select only the following columns:  
    - name  
    - age  
    - is_adult  
5.  Sort the result by age in descending order  

Use the DataFrame API only.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Initialize Spark
spark = SparkSession.builder.appName("Question1_Solution").getOrCreate()

# CREATE SAMPLE DATAFRAME
data = [
    ('Alice', 25, 'NYC'),
    ('Bob', 35, 'LA'),
    ('Charlie', 15, 'Chicago'),
    ('Diana', 28, 'Boston'),
    ('Eve', 45, 'Seattle'),
    ('Frank', 22, 'Austin'),
    ('Grace', 38, 'Denver'),
    ('Henry', 50, 'Miami'),
    ('Alice', 25, 'NYC'),      # Duplicate
    ('Bob', 35, 'LA'),         # Duplicate
    ('Ivy', 19, 'Portland'),
    ('Jack', 40, 'Phoenix'),
    ('Kate', 17, 'Dallas'),
    ('Leo', 30, 'Houston')
]

peopleDF = spark.createDataFrame(data, ['name', 'age', 'city'])

print("\n📋 Original DataFrame:")
peopleDF.show()
print(f"Total rows: {peopleDF.count()}")


📋 Original DataFrame:
+-------+---+--------+
|   name|age|    city|
+-------+---+--------+
|  Alice| 25|     NYC|
|    Bob| 35|      LA|
|Charlie| 15| Chicago|
|  Diana| 28|  Boston|
|    Eve| 45| Seattle|
|  Frank| 22|  Austin|
|  Grace| 38|  Denver|
|  Henry| 50|   Miami|
|  Alice| 25|     NYC|
|    Bob| 35|      LA|
|    Ivy| 19|Portland|
|   Jack| 40| Phoenix|
|   Kate| 17|  Dallas|
|    Leo| 30| Houston|
+-------+---+--------+

Total rows: 14


In [0]:
# before removing duplicates 
# Find duplicates: rows that appear more than once
print("\nRows with duplicates (showing all occurrences):")

# Group by all columns and count
duplicates = peopleDF.groupBy('name', 'age', 'city') \
    .count() \
    .filter(col('count') > 1) \
    .drop('count')

# Join back to get all duplicate rows
duplicate_rows = peopleDF.join(duplicates, ['name', 'age', 'city'], 'inner')
duplicate_rows.show()


Rows with duplicates (showing all occurrences):
+-----+---+----+
| name|age|city|
+-----+---+----+
|Alice| 25| NYC|
|  Bob| 35|  LA|
|Alice| 25| NYC|
|  Bob| 35|  LA|
+-----+---+----+



In [0]:
# 1. REMOVE DUPLICATE ROWS
peopleDF_no_duplicates = peopleDF.dropDuplicates()

print("\n✅ After removing duplicates:")
peopleDF_no_duplicates.show()
print(f"Total rows: {peopleDF_no_duplicates.count()}")


✅ After removing duplicates:
+-------+---+--------+
|   name|age|    city|
+-------+---+--------+
|  Alice| 25|     NYC|
|    Bob| 35|      LA|
|Charlie| 15| Chicago|
|  Diana| 28|  Boston|
|    Eve| 45| Seattle|
|  Frank| 22|  Austin|
|  Grace| 38|  Denver|
|  Henry| 50|   Miami|
|    Ivy| 19|Portland|
|   Jack| 40| Phoenix|
|   Kate| 17|  Dallas|
|    Leo| 30| Houston|
+-------+---+--------+

Total rows: 12


In [0]:
# 2. FILTER AGE BETWEEN 20 AND 40
peopleDF_filtered = peopleDF_no_duplicates.filter(
    (col('age') >= 20) & (col('age') <= 40)
)

print("\n✅ After filtering (age between 20 and 40):")
peopleDF_filtered.show()
print(f"Total rows: {peopleDF_filtered.count()}")


✅ After filtering (age between 20 and 40):
+-----+---+-------+
| name|age|   city|
+-----+---+-------+
|Alice| 25|    NYC|
|  Bob| 35|     LA|
|Diana| 28| Boston|
|Frank| 22| Austin|
|Grace| 38| Denver|
| Jack| 40|Phoenix|
|  Leo| 30|Houston|
+-----+---+-------+

Total rows: 7


In [0]:
# 3. CREATE NEW COLUMN is_adult
# Build on the filtered DataFrame from step 2 so the age filter carries through.

peopleDF_with_adult = peopleDF_filtered.withColumn(
    'is_adult',
    when(col('age') >= 18, 1).otherwise(0)
)

print("\n✅ After adding is_adult column:")
peopleDF_with_adult.show()


✅ After adding is_adult column:
+-----+---+-------+--------+
| name|age|   city|is_adult|
+-----+---+-------+--------+
|Alice| 25|    NYC|       1|
|  Bob| 35|     LA|       1|
|Diana| 28| Boston|       1|
|Frank| 22| Austin|       1|
|Grace| 38| Denver|       1|
| Jack| 40|Phoenix|       1|
|  Leo| 30|Houston|       1|
+-----+---+-------+--------+



In [0]:
# 4. SELECT SPECIFIC COLUMNS

peopleDF_selected = peopleDF_with_adult.select('name', 'age', 'is_adult')

print("\n✅ After selecting columns:")
peopleDF_selected.show()


✅ After selecting columns:
+-----+---+--------+
| name|age|is_adult|
+-----+---+--------+
|Alice| 25|       1|
|  Bob| 35|       1|
|Diana| 28|       1|
|Frank| 22|       1|
|Grace| 38|       1|
| Jack| 40|       1|
|  Leo| 30|       1|
+-----+---+--------+



In [0]:
# 5. SORT BY AGE DESCENDING

final_result = peopleDF_selected.orderBy(col('age').desc())

print("\n✅ FINAL RESULT:")
final_result.show()
spark.stop()


✅ FINAL RESULT:
+-----+---+--------+
| name|age|is_adult|
+-----+---+--------+
| Jack| 40|       1|
|Grace| 38|       1|
|  Bob| 35|       1|
|  Leo| 30|       1|
|Diana| 28|       1|
|Alice| 25|       1|
|Frank| 22|       1|
+-----+---+--------+



### Question 2
You are given a CSV file users.csv with the following columns:    

user_id,name,signup_date,score     
 
Example data:   
1,Alice,2023-01-01,90     
2,Bob,2023-02-10,85    
 
Task:    
1.  Define an explicit schema using StructType  
2.  Read the CSV using spark.read with:  
    - header = true  
    - your custom schema  
3.  Print the schema  
4.  Show the first 5 rows

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Question2_Solution").getOrCreate()

# users.csv was uploaded through Catalog; list the folder to confirm the file is present
dbutils.fs.ls('/Workspace/Users/shen.cheng@northeastern.edu/Drafts/')

[FileInfo(path='dbfs:/Workspace/Users/shen.cheng@northeastern.edu/Drafts/3_1_QA.ipynb', name='3_1_QA.ipynb', size=18816, modificationTime=1766149670995),
 FileInfo(path='dbfs:/Workspace/Users/shen.cheng@northeastern.edu/Drafts/sparkSQL Intro.ipynb', name='sparkSQL Intro.ipynb', size=88376, modificationTime=1766145497474),
 FileInfo(path='dbfs:/Workspace/Users/shen.cheng@northeastern.edu/Drafts/users.csv', name='users.csv', size=246, modificationTime=1766147665752)]

In [0]:
# Only the type classes are needed here; the Spark session was created in the previous cell
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# 1. DEFINE EXPLICIT SCHEMA

schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
    StructField("signup_date", DateType(), nullable=True),
    StructField("score", IntegerType(), nullable=True)
])

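# 2. READ THE DATA
# This cell reads the catalog table workspace.default.users rather than the CSV file,
# so the schema printed below is the table's own (long for user_id and score),
# not the explicit schema defined above.
df = spark.table("workspace.default.users")

# To read the CSV file directly with the explicit schema, one would use:
# df = spark.read.option("header", "true").schema(schema).csv("path/to/file.csv")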

# 3. Print schema
df.printSchema()

# 4. Show first 5 rows
df.show(5)

root
 |-- user_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- signup_date: date (nullable = true)
 |-- score: long (nullable = true)

+-------+-------+-----------+-----+
|user_id|   name|signup_date|score|
+-------+-------+-----------+-----+
|      1|  Alice| 2023-01-01|   90|
|      2|    Bob| 2023-02-10|   85|
|      3|Charlie| 2023-03-15|   78|
|      4|  Diana| 2023-04-20|   92|
|      5|    Eve| 2023-05-05|   88|
+-------+-------+-----------+-----+
only showing top 5 rows
