Guided Exercise - PySpark - PySpark for Data Engineering at Sun Life #46
akash-coded started this conversation in Tasks
Guided Exercise: PySpark for Data Engineers at Sun Life
Scenario
As a Data Engineer at Sun Life, you're responsible for transforming and analyzing large datasets to support business decisions. You work closely with the ETL (Extract, Transform, Load) team to build efficient data pipelines using PySpark. Your task is to prepare and analyze employee data to generate insights that will assist in HR decision-making, such as determining salary adjustments, identifying high-performing employees, and optimizing department budgets.
Objective
In this exercise, you'll apply PySpark DataFrame operations to perform data transformation and analysis. You'll start with basic operations and gradually move to more complex tasks, integrating the `expr` function for advanced transformations.
Step 1: Setting Up Your Environment
First, you'll set up your PySpark environment in Google Colab.
Conceptual Explanation:
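The setup code from the original post didn't survive the page export, so here's a minimal sketch of a Colab setup: install PySpark and create the `SparkSession` that every DataFrame operation hangs off of (the app name is an arbitrary placeholder).

```python
# Install PySpark into the Colab runtime (once per session).
!pip install pyspark

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession -- the entry point for the DataFrame API.
spark = (
    SparkSession.builder
    .appName("SunLifeDataEngineering")  # placeholder app name
    .getOrCreate()
)
```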
Step 2: Loading and Exploring the Dataset
You'll start by loading a sample dataset containing employee information. The dataset includes columns like `EmployeeID`, `Name`, `Department`, `Age`, `Gender`, `Salary`, and `YearsAtCompany`.
Conceptual Explanation:
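The original loading snippet is missing; here's one way to build the described DataFrame in memory, with a few made-up rows standing in for the real data:

```python
# Made-up sample rows matching the columns described above.
data = [
    (1, "Alice", "IT", 30, "Female", 75000.0, 6),
    (2, "Bob", "Finance", 45, "Male", 90000.0, 12),
    (3, "Cara", "IT", 28, "Female", 65000.0, 3),
]
columns = ["EmployeeID", "Name", "Department", "Age", "Gender", "Salary", "YearsAtCompany"]

df = spark.createDataFrame(data, columns)
df.printSchema()  # inspect column names and inferred types
df.show()         # preview the rows
```

`printSchema()` and `show()` are the quickest sanity checks that the data loaded the way you expect.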
Step 3: Basic Data Transformation
Your first task is to create a new column, `Bonus`, which calculates a 10% bonus based on the employee's salary.
Fill in the Blank: Replace `___` with the appropriate function to select the `Salary` column.
Hint: Use `col()` to refer to the `Salary` column.
Conceptual Explanation:
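A sketch of the completed step, with `col("Salary")` filling the blank:

```python
from pyspark.sql.functions import col

# Bonus = 10% of Salary; col("Salary") refers to the existing column.
df = df.withColumn("Bonus", col("Salary") * 0.10)
df.select("Name", "Salary", "Bonus").show()
```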
Step 4: Filtering Data
Filter the DataFrame to show only employees in the IT department who have been with the company for more than 5 years.
Fill in the Blank: Replace `___` with the appropriate number of years.
Hint: You're looking for employees with more than 5 years of experience.
Conceptual Explanation:
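One possible completion, with `5` filling the blank:

```python
from pyspark.sql.functions import col

# IT employees with more than 5 years at the company.
it_veterans = df.filter((col("Department") == "IT") & (col("YearsAtCompany") > 5))
it_veterans.show()
```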
Step 5: Grouping and Aggregating Data
Group the data by `Department` and calculate the average salary for each department.
Fill in the Blank: Replace `___` with the appropriate aggregation function.
Hint: Use the `avg` function from `pyspark.sql.functions`.
Conceptual Explanation:
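A sketch with `avg` filling the blank:

```python
from pyspark.sql.functions import avg

# One row per department with its mean salary.
dept_avg = df.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))
dept_avg.show()
```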
Step 6: Advanced Data Transformation Using `expr`
Now, you'll use the `expr` function to perform more complex operations. Create a new column, `AdjustedSalary`, which increases the salary by 5% if the employee has been with the company for more than 10 years.
Fill in the Blank: Replace `___` with the appropriate expression. You need a conditional statement that checks the years at the company and applies the salary increase.
Hint: Use `IF` inside the `expr` function to perform this operation.
Conceptual Explanation:
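One way to complete the step, using Spark SQL's `IF(condition, then, else)` inside `expr`:

```python
from pyspark.sql.functions import expr

# 5% raise for anyone with more than 10 years at the company;
# everyone else keeps their current salary.
df = df.withColumn(
    "AdjustedSalary",
    expr("IF(YearsAtCompany > 10, Salary * 1.05, Salary)"),
)
df.select("Name", "YearsAtCompany", "Salary", "AdjustedSalary").show()
```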
Step 7: Joining DataFrames
Imagine you have another dataset containing department budgets. Your task is to join this dataset with the existing employee data.
Fill in the Blank: Replace `___` with the type of join you want to perform.
Hint: Consider whether you want to include all employees or just those with matching departments.
Conceptual Explanation:
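A sketch assuming a hypothetical `budgets` DataFrame; a left join fills the blank if you want to keep every employee:

```python
# Hypothetical department-budget data.
budgets = spark.createDataFrame(
    [("IT", 500000.0), ("Finance", 750000.0)],
    ["Department", "Budget"],
)

# "left" keeps all employees, even those whose department has no budget
# row; an "inner" join would drop them instead.
joined = df.join(budgets, on="Department", how="left")
joined.show()
```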
Step 8: Saving Results
Finally, you'll save the transformed DataFrame to a CSV file for further analysis.
Conceptual Explanation:
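A minimal save, assuming a Colab-local output path:

```python
# coalesce(1) produces a single CSV file, which is convenient for small
# results; skip it for large data so writes stay parallel.
(
    joined.coalesce(1)
    .write.mode("overwrite")
    .option("header", True)
    .csv("/content/employee_output")
)
```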
Exercise Wrap-Up
Congratulations! You've successfully completed a series of PySpark tasks that simulate real-world scenarios faced by Data Engineers at Sun Life. These exercises not only cover basic and intermediate PySpark operations but also introduce advanced concepts using the `expr` function, all within the context of practical business applications.
Let's continue with some more advanced use cases.
Step 9: Handling Missing Data
In real-world datasets, you often encounter missing data. Your next task is to handle missing values in the dataset. Assume that the `Salary` column has some missing values, and you need to address this issue by filling each gap with the department's average for the `Salary` column.
Fill in the Blank: Replace `___` with the SQL expression that calculates the average salary for each department wherever `Salary` is missing.
Hint: Use a subquery or window function within the `expr` function to calculate the average salary for each department.
Conceptual Explanation:
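The original solution is missing; here's a sketch that uses the DataFrame `Window` API rather than a raw `expr` string to compute each department's average and patch the gaps:

```python
from pyspark.sql import Window
from pyspark.sql.functions import avg, coalesce, col

# avg() ignores nulls, so the window mean is computed over the salaries
# that are actually present in each department.
dept_window = Window.partitionBy("Department")

# coalesce keeps the original Salary where it exists and falls back to
# the department average where it is null.
df = df.withColumn(
    "Salary",
    coalesce(col("Salary"), avg("Salary").over(dept_window)),
)
```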
Step 10: Window Functions for Advanced Analysis
Window functions in PySpark are powerful tools for performing operations across a set of rows related to the current row. For instance, you might want to calculate a running total of salaries within each department.
Conceptual Explanation:
Internal Working:
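A sketch of a running total per department, assuming `EmployeeID` as the ordering column:

```python
from pyspark.sql import Window
from pyspark.sql.functions import sum as spark_sum

# The frame starts at the first row of the department partition and
# extends to the current row, so the sum accumulates as rows go by.
running = (
    Window.partitionBy("Department")
    .orderBy("EmployeeID")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

df = df.withColumn("RunningSalaryTotal", spark_sum("Salary").over(running))
df.select("Department", "EmployeeID", "Salary", "RunningSalaryTotal").show()
```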
Step 11: Data Aggregation with Multiple Conditions
Suppose you need to provide a report that shows the total salary for male and female employees in each department and calculates the difference between these totals.
Conceptual Explanation:
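One way to build that report: conditional sums via `when`, then a derived difference column (the output column names here are illustrative):

```python
from pyspark.sql.functions import col, sum as spark_sum, when

# Sum only the rows matching each condition; otherwise contribute 0.
gender_totals = df.groupBy("Department").agg(
    spark_sum(when(col("Gender") == "Male", col("Salary")).otherwise(0)).alias("MaleSalaryTotal"),
    spark_sum(when(col("Gender") == "Female", col("Salary")).otherwise(0)).alias("FemaleSalaryTotal"),
)

gender_totals = gender_totals.withColumn(
    "SalaryDifference", col("MaleSalaryTotal") - col("FemaleSalaryTotal")
)
gender_totals.show()
```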
Step 12: Optimizing Performance
As your data grows, performance becomes critical. PySpark provides several ways to optimize your jobs. Here’s how you can cache your DataFrame and repartition it to improve performance:
Conceptual Explanation:
Internal Working:
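A sketch of both techniques (the partition count of 8 is an arbitrary example; tune it to your cluster and data size):

```python
# cache() keeps the DataFrame in memory after it is first materialized,
# so later actions reuse it instead of recomputing the whole lineage.
df.cache()
df.count()  # an action, which actually populates the cache

# repartition() returns a new DataFrame with rows redistributed;
# partitioning by Department co-locates rows for per-department work.
df_repart = df.repartition(8, "Department")
df_repart.count()

# Release the cached memory once you no longer need it.
df.unpersist()
```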
Step 13: Final Task – Creating a Comprehensive ETL Pipeline
For the final task, combine everything you've learned to create a full ETL pipeline. This pipeline will:
- Extract the raw employee data
- Transform it: handle missing salaries, derive the `Bonus` and `AdjustedSalary` columns, and join in the department budgets
- Load the final result to CSV for downstream analysis
Conceptual Explanation:
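Pulling the earlier steps into one script, here's a sketch of such a pipeline; the input and output paths are placeholders:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, coalesce, col, expr

spark = SparkSession.builder.appName("EmployeeETL").getOrCreate()

# Extract: read the raw inputs (placeholder paths).
employees = spark.read.option("header", True).option("inferSchema", True).csv("/content/employees.csv")
budgets = spark.read.option("header", True).option("inferSchema", True).csv("/content/budgets.csv")

# Transform: fill missing salaries with the department average...
dept_window = Window.partitionBy("Department")
cleaned = employees.withColumn(
    "Salary", coalesce(col("Salary"), avg("Salary").over(dept_window))
)

# ...derive the columns from the earlier steps...
enriched = (
    cleaned
    .withColumn("Bonus", col("Salary") * 0.10)
    .withColumn("AdjustedSalary", expr("IF(YearsAtCompany > 10, Salary * 1.05, Salary)"))
)

# ...and attach the department budgets.
final = enriched.join(budgets, on="Department", how="left")

# Load: write the result for downstream analysis.
final.write.mode("overwrite").option("header", True).csv("/content/employee_etl_output")
```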
Exercise Summary
By completing this exercise, you’ve built a strong foundation in PySpark, covering everything from basic transformations to complex data engineering tasks. The scenarios provided are designed to reflect real-world business challenges you might face at Sun Life or similar organizations, ensuring that you’re well-prepared to apply these skills in your professional work.