Employee Salary Analysis with PySpark

Problem Statement:
This script processes employee salary data to:
1. Filter employees earning above a specified salary threshold ($60,000)
2. Count how many high-earning employees exist in each department
3. Show how high earners are distributed across departments

Input Data Structure:
- Department: String (Engineering, HR, Marketing, Sales)
- Salary: Integer (annual salary amounts)

Key Operations:
1. Creates a Spark DataFrame from sample data
2. Applies the salary filter using a column expression
3. Aggregates results by department
4. Displays department-wise counts of high earners
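
The four operations above can be cross-checked without Spark. The sketch below repeats the filter and count in plain Python on the same sample rows, which gives the counts the PySpark job should report. The names `rows` and `high_earner_counts` are illustrative only.

```python
from collections import Counter

# Same sample rows as the PySpark script: (Department, Salary)
rows = [
    ("Engineering", 70000),
    ("Engineering", 80000),
    ("HR", 50000),
    ("HR", 55000),
    ("Marketing", 60000),
    ("Marketing", 65000),
    ("Sales", 40000),
]
salary_threshold = 60000

# Keep rows strictly above the threshold, then count them per department.
high_earner_counts = Counter(
    department for department, salary in rows if salary > salary_threshold
)

print(dict(high_earner_counts))  # {'Engineering': 2, 'Marketing': 1}
```

Because the comparison is strict, the Marketing employee earning exactly 60000 is not counted, and HR and Sales do not appear in the result at all.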

Business Applications:
- Compensation analysis
- Departmental budget planning
- Pay equity assessments
- Talent retention strategies

Technical Details:
- Uses PySpark DataFrame API
- Demonstrates filtering and grouping operations
- Runs on a local Spark session (the same DataFrame code scales to a cluster)

Note: In production, you would:
1. Read from an actual data source (CSV, database, etc.)
2. Parameterize the salary threshold
3. Add error handling and logging
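
Points 2 and 3 of the note could look roughly like the following. This is a sketch using only the standard library, not the original author's implementation: the flag name `--salary-threshold`, the helper `non_negative_int`, and the logger name are all hypothetical.

```python
import argparse
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("salary_analysis")  # hypothetical logger name


def non_negative_int(value: str) -> int:
    """Parse a command-line value as a non-negative integer salary."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not an integer: {value!r}")
    if number < 0:
        raise argparse.ArgumentTypeError(f"must be non-negative: {number}")
    return number


def parse_args(argv=None) -> argparse.Namespace:
    """Read the salary threshold from the command line (default: 60000)."""
    parser = argparse.ArgumentParser(
        description="Count high earners per department"
    )
    parser.add_argument(
        "--salary-threshold",  # hypothetical flag name
        type=non_negative_int,
        default=60000,
        help="count only employees earning strictly more than this amount",
    )
    return parser.parse_args(argv)


# An empty list selects the defaults; pass None to read sys.argv instead.
args = parse_args([])
logger.info("Using salary threshold %d", args.salary_threshold)
```

The parsed value would then replace the hard-coded `salary_threshold = 60000` assignment in the script.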

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder \
    .appName("Filter Employees by Salary and Count by Department") \
    .getOrCreate()

# Sample data: one (Department, Salary) tuple per employee
data = [
    ("Engineering", 70000),
    ("Engineering", 80000),
    ("HR", 50000),
    ("HR", 55000),
    ("Marketing", 60000),
    ("Marketing", 65000),
    ("Sales", 40000),
]
columns = ["Department", "Salary"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Define the salary threshold
salary_threshold = 60000

# Keep employees whose salary is strictly greater than the threshold
# (a salary of exactly 60000 is excluded)
filtered_df = df.filter(col("Salary") > salary_threshold)

# Count the remaining (high-earning) employees in each department;
# departments with no high earners do not appear in the result
count_by_department = filtered_df.groupBy("Department").count()

# Show the result
count_by_department.show()

# Stop the Spark session
spark.stop()