Guided Exercise - PySpark - PySpark for Data Engineering at Sun Life #46

akash-coded · 2024-08-28T08:55:24Z

akash-coded
Aug 28, 2024
Maintainer

Guided Exercise: PySpark for Data Engineers at Sun Life

Scenario

As a Data Engineer at Sun Life, you're responsible for transforming and analyzing large datasets to support business decisions. You work closely with the ETL (Extract, Transform, Load) team to build efficient data pipelines using PySpark. Your task is to prepare and analyze employee data to generate insights that will assist in HR decision-making, such as determining salary adjustments, identifying high-performing employees, and optimizing department budgets.

Objective

In this exercise, you'll apply PySpark DataFrame operations to perform data transformation and analysis. You'll start with basic operations and gradually move to more complex tasks, integrating the expr library for advanced transformations.

Step 1: Setting Up Your Environment

First, you'll set up your PySpark environment in Google Colab.

# Install PySpark if not already installed
!pip install pyspark

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

# Create a Spark session
spark = SparkSession.builder \
    .appName("Sun Life Data Engineering") \
    .getOrCreate()

Conceptual Explanation:

SparkSession: This is your entry point for working with PySpark. It allows you to create DataFrames, interact with Spark’s distributed computing capabilities, and manage configurations.

Step 2: Loading and Exploring the Dataset

You'll start by loading a sample dataset containing employee information. The dataset includes columns like EmployeeID, Name, Department, Age, Gender, Salary, and YearsAtCompany.

# Load sample data
data = [
    (1, "John Doe", "Finance", 29, "M", 60000, 5),
    (2, "Jane Smith", "IT", 34, "F", 75000, 10),
    (3, "Jake Brown", "HR", 41, "M", 50000, 15),
    (4, "Emily White", "Finance", 28, "F", 58000, 4),
    (5, "Michael Black", "IT", 39, "M", 72000, 12)
]

columns = ["EmployeeID", "Name", "Department", "Age", "Gender", "Salary", "YearsAtCompany"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

Conceptual Explanation:

DataFrame: PySpark DataFrame is a distributed collection of data organized into named columns. It’s similar to a table in a relational database.

Step 3: Basic Data Transformation

Your first task is to create a new column, Bonus, which calculates a 10% bonus based on the employee's salary.

# Add a Bonus column (10% of Salary)
df = df.withColumn("Bonus", ___("Salary") * 0.1)

# Show updated DataFrame
df.show()

Fill in the Blank:

Replace ___ with the appropriate function to select the Salary column.

Hint: Use col() to refer to the Salary column.

Conceptual Explanation:

withColumn: Adds a new column or replaces an existing one based on the specified transformation.

Step 4: Filtering Data

Filter the DataFrame to show only employees in the IT department who have been with the company for more than 5 years.

# Filter IT employees with more than 5 years at the company
df_filtered = df.filter((df.Department == "IT") & (df.YearsAtCompany > ___))

# Show the filtered DataFrame
df_filtered.show()

Fill in the Blank:

Replace ___ with the appropriate number of years.

Hint: You’re looking for employees with more than 5 years of experience.

Conceptual Explanation:

filter: Used to filter rows in the DataFrame based on the specified condition.

Step 5: Grouping and Aggregating Data

Group the data by Department and calculate the average salary for each department.

# Group by Department and calculate average salary
df_grouped = df.groupBy("Department").agg(___)

# Show the grouped DataFrame
df_grouped.show()

Fill in the Blank:

Replace ___ with the appropriate aggregation function.

Hint: Use the avg function from pyspark.sql.functions.

Conceptual Explanation:

groupBy: Groups the DataFrame based on a column and allows you to perform aggregate operations like sum, average, etc.

Step 6: Advanced Data Transformation Using `expr`

Now, you'll use the expr function to perform more complex operations. Create a new column, AdjustedSalary, which increases the salary by 5% if the employee has been with the company for more than 10 years.

# Create AdjustedSalary column using expr
df = df.withColumn("AdjustedSalary", expr("___"))

# Show updated DataFrame
df.show()

Fill in the Blank:

Replace ___ with the appropriate expression. You need to use a conditional statement to check the years at the company and apply the salary increase.

Hint: Use a combination of IF in the expr function to perform this operation.

Conceptual Explanation:

expr: Allows you to perform complex SQL expressions on DataFrame columns, enabling more advanced data transformations.

Step 7: Joining DataFrames

Imagine you have another dataset containing department budgets. Your task is to join this dataset with the existing employee data.

# Sample budget data
budget_data = [
    ("Finance", 150000),
    ("IT", 200000),
    ("HR", 100000)
]

budget_columns = ["Department", "Budget"]

# Create budget DataFrame
df_budget = spark.createDataFrame(budget_data, budget_columns)

# Join employee and budget data on Department
df_joined = df.join(df_budget, "Department", "___")

# Show the joined DataFrame
df_joined.show()

Fill in the Blank:

Replace ___ with the type of join you want to perform.

Hint: Consider whether you want to include all employees or just those with matching departments.

Conceptual Explanation:

join: Combines DataFrames based on a common column, allowing you to enrich your data with additional context.

Step 8: Saving Results

Finally, you'll save the transformed DataFrame to a CSV file for further analysis.

# Save the final DataFrame to a CSV file
df_joined.write.csv("sunlife_employee_data.csv", header=True)

Conceptual Explanation:

write: Allows you to save the DataFrame to various file formats like CSV, Parquet, JSON, etc.

Exercise Wrap-Up

Congratulations! You've successfully completed a series of PySpark tasks that simulate real-world scenarios faced by Data Engineers at Sun Life. These exercises not only cover basic and intermediate PySpark operations but also introduce advanced concepts using the expr library, all within the context of practical business applications.

Let's continue with some more advanced use-cases

Step 9: Handling Missing Data

In real-world datasets, you often encounter missing data. Your next task is to handle missing values in the dataset. Assume that the Salary column has some missing values, and you need to address this issue.

Detect Missing Values: First, identify if there are any missing values in the Salary column.
Fill Missing Values: Replace missing values with the average salary of the corresponding department.

# Introduce some missing values in the Salary column
data_with_missing = [
    (1, "John Doe", "Finance", 29, "M", None, 5),
    (2, "Jane Smith", "IT", 34, "F", 75000, 10),
    (3, "Jake Brown", "HR", 41, "M", 50000, 15),
    (4, "Emily White", "Finance", 28, "F", 58000, 4),
    (5, "Michael Black", "IT", 39, "M", None, 12)
]

# Create DataFrame with missing values
df_missing = spark.createDataFrame(data_with_missing, columns)

# Detect missing values in Salary column
df_missing.select("Name", "Salary").show()

# Fill missing values with the average salary of the department
df_filled = df_missing.withColumn("Salary", expr("CASE WHEN Salary IS NULL THEN ___ ELSE Salary END"))

# Show the DataFrame with filled values
df_filled.show()

Fill in the Blank:

Replace ___ with the SQL expression to calculate the average salary for each department where the Salary is missing.

Hint: Use a subquery or window function within the expr function to calculate the average salary for each department.

Conceptual Explanation:

Handling Missing Data: In data engineering, filling in missing values is critical for maintaining data quality. PySpark allows you to handle this using conditional logic within the expr function.

Step 10: Window Functions for Advanced Analysis

Window functions in PySpark are powerful tools for performing operations across a set of rows related to the current row. For instance, you might want to calculate a running total of salaries within each department.

Calculate Running Total: Calculate the cumulative salary within each department.
Rank Employees: Rank employees based on their salary within their department.

from pyspark.sql.window import Window
from pyspark.sql.functions import sum, rank

# Define window specification
window_spec = Window.partitionBy("Department").orderBy("Salary")

# Calculate cumulative salary
df_window = df_filled.withColumn("CumulativeSalary", sum("Salary").over(window_spec))

# Rank employees by salary within their department
df_window = df_window.withColumn("Rank", rank().over(window_spec))

# Show the DataFrame with window functions applied
df_window.show()

Conceptual Explanation:

Window Functions: Allow you to perform operations like ranking, cumulative sums, and moving averages within a specific window or partition of data. This is particularly useful in scenarios where you need to analyze data trends or compare entities within groups.

Internal Working:

Window functions are computed separately from the main query, reducing the complexity of handling large datasets. PySpark optimizes these operations to minimize data shuffling across partitions, ensuring efficient execution.

Step 11: Data Aggregation with Multiple Conditions

Suppose you need to provide a report that shows the total salary for male and female employees in each department and calculates the difference between these totals.

# Aggregate salary by gender within each department
df_gender_agg = df_filled.groupBy("Department", "Gender").agg(sum("Salary").alias("TotalSalary"))

# Pivot the data to have separate columns for Male and Female salaries
df_pivot = df_gender_agg.groupBy("Department").pivot("Gender").sum("TotalSalary")

# Calculate the difference between Male and Female salaries
df_pivot = df_pivot.withColumn("SalaryDifference", expr("M - F"))

# Show the pivoted DataFrame
df_pivot.show()

Conceptual Explanation:

Pivoting: Pivot tables are useful for converting categorical data into summary tables, allowing you to quickly compare values across categories.
Conditional Aggregation: This technique helps in generating specific insights by applying different conditions to the data during aggregation.

Step 12: Optimizing Performance

As your data grows, performance becomes critical. PySpark provides several ways to optimize your jobs. Here’s how you can cache your DataFrame and repartition it to improve performance:

Cache the DataFrame: Caching helps in scenarios where you repeatedly access the same data.
Repartition the DataFrame: Repartitioning can optimize the performance of joins and aggregations.

# Cache the DataFrame
df_cached = df_pivot.cache()

# Perform an action to trigger the cache
df_cached.count()

# Repartition the DataFrame based on Department
df_repartitioned = df_cached.repartition("Department")

# Show the repartitioned DataFrame
df_repartitioned.show()

Conceptual Explanation:

Caching: Stores the DataFrame in memory, reducing the need to recompute it each time it is accessed.
Repartitioning: Alters the partitioning of the DataFrame, which can reduce shuffling and improve performance during operations that involve grouping or joining.

Internal Working:

PySpark's lazy evaluation means that transformations aren't executed until an action is triggered. By caching data, you prevent re-execution of these transformations, speeding up the workflow. Repartitioning is particularly important when dealing with large datasets that can cause uneven load distribution across the cluster.

Step 13: Final Task – Creating a Comprehensive ETL Pipeline

For the final task, combine everything you've learned to create a full ETL pipeline. This pipeline will:

Extract: Load the employee data and a new dataset containing performance metrics.
Transform: Clean, filter, and enhance the data using the techniques covered so far.
Load: Save the transformed data to a new CSV file for further analysis.

# Load the performance metrics data
performance_data = [
    (1, 4.5),
    (2, 4.0),
    (3, 3.5),
    (4, 4.7),
    (5, 4.2)
]

performance_columns = ["EmployeeID", "PerformanceRating"]

df_performance = spark.createDataFrame(performance_data, performance_columns)

# Join the performance data with the employee data
df_etl = df_filled.join(df_performance, "EmployeeID")

# Filter for high-performing employees (PerformanceRating > 4.0)
df_etl = df_etl.filter(df_etl.PerformanceRating > 4.0)

# Add a column for salary increase based on performance
df_etl = df_etl.withColumn("SalaryIncrease", expr("CASE WHEN PerformanceRating >= 4.5 THEN Salary * 0.1 ELSE Salary * 0.05 END"))

# Calculate the new salary
df_etl = df_etl.withColumn("NewSalary", df_etl.Salary + df_etl.SalaryIncrease)

# Save the final transformed data to a CSV file
df_etl.write.csv("high_performers.csv", header=True)

# Show the final DataFrame
df_etl.show()

Conceptual Explanation:

ETL Pipeline: Extract, Transform, Load (ETL) is a common data engineering process where data is extracted from sources, transformed into the desired format, and then loaded into a target system. This exercise simulates a complete ETL pipeline for a typical data engineering task at Sun Life.

Exercise Summary

By completing this exercise, you’ve built a strong foundation in PySpark, covering everything from basic transformations to complex data engineering tasks. The scenarios provided are designed to reflect real-world business challenges you might face at Sun Life or similar organizations, ensuring that you’re well-prepared to apply these skills in your professional work.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Guided Exercise - PySpark - PySpark for Data Engineering at Sun Life #46

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Guided Exercise - PySpark - PySpark for Data Engineering at Sun Life #46

Uh oh!

akash-coded Aug 28, 2024 Maintainer

Guided Exercise: PySpark for Data Engineers at Sun Life

Scenario

Objective

Step 1: Setting Up Your Environment

Step 2: Loading and Exploring the Dataset

Step 3: Basic Data Transformation

Step 4: Filtering Data

Step 5: Grouping and Aggregating Data

Step 6: Advanced Data Transformation Using expr

Step 7: Joining DataFrames

Step 8: Saving Results

Exercise Wrap-Up

Step 9: Handling Missing Data

Step 10: Window Functions for Advanced Analysis

Step 11: Data Aggregation with Multiple Conditions

Step 12: Optimizing Performance

Step 13: Final Task – Creating a Comprehensive ETL Pipeline

Exercise Summary

Replies: 0 comments

akash-coded
Aug 28, 2024
Maintainer

Step 6: Advanced Data Transformation Using `expr`