Complex Scenarios - ETL and ELT Pipelines using SQL - MySQL #33
akash-coded started this conversation in Guidelines
ETL Pipeline for Sales Data Analysis
In this scenario, let's assume you are tasked with developing an ETL (Extract, Transform, Load) pipeline that extracts sales data from the ClassicModels database, calculates the total monthly sales by product line and sales territory, and loads the data into a new table for analysis.
First, you would need to extract the relevant sales data from multiple tables in the ClassicModels database. This would involve joining multiple tables and selecting the necessary fields.
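The extraction might look like the following sketch, assuming the standard ClassicModels tables (`orders`, `orderdetails`, `products`, `customers`, `employees`, `offices`); the view name `sales_data` is illustrative:

```sql
-- Extract raw sales facts into a reusable view
CREATE OR REPLACE VIEW sales_data AS
SELECT
    o.orderNumber,
    o.orderDate,
    od.productCode,
    p.productLine,
    ofc.territory,
    od.quantityOrdered * od.priceEach AS sale_amount
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
JOIN products p      ON od.productCode = p.productCode
JOIN customers c     ON o.customerNumber = c.customerNumber
JOIN employees e     ON c.salesRepEmployeeNumber = e.employeeNumber
JOIN offices ofc     ON e.officeCode = ofc.officeCode;
```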
In the SQL above, we have used a view to store the extracted data; a view acts as a virtual table.
Next, you would perform the necessary transformations on the extracted data. In this case, the transformation involves calculating the total monthly sales by product line and sales territory.
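Building on the hypothetical `sales_data` view, the transformation could be:

```sql
-- Aggregate total sales per month, product line, and territory
SELECT
    DATE_FORMAT(orderDate, '%Y-%m') AS sales_month,
    productLine,
    territory,
    SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY sales_month, productLine, territory;
```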
Finally, you would load the transformed data into a new table for further analysis.
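One way to load the result is to create the `monthly_sales` table directly from the transformed query:

```sql
-- Load the transformed result into a new analysis table
CREATE TABLE monthly_sales AS
SELECT
    DATE_FORMAT(orderDate, '%Y-%m') AS sales_month,
    productLine,
    territory,
    SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY sales_month, productLine, territory;
```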
Here, `monthly_sales` is the new table that contains the transformed sales data.

Note: This scenario showcases a straightforward ETL pipeline. In the real world, ETL processes can be much more complex and may involve cleaning the data, dealing with missing values, integrating data from multiple sources, and performing complex transformations.
ELT Pipeline for Sales Data Analysis
The ELT (Extract, Load, Transform) process, on the other hand, first loads raw data into the target system and then performs transformations. This approach is often used with big data platforms and data lakes.
First, you would extract the data and load it into the target system as is.
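A sketch of the extract-and-load step; `raw_sales_data` is an illustrative staging table that holds the joined rows untransformed:

```sql
-- Load raw, untransformed rows into a staging table in the target system
CREATE TABLE raw_sales_data AS
SELECT
    o.orderNumber,
    o.orderDate,
    od.productCode,
    p.productLine,
    ofc.territory,
    od.quantityOrdered,
    od.priceEach
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
JOIN products p      ON od.productCode = p.productCode
JOIN customers c     ON o.customerNumber = c.customerNumber
JOIN employees e     ON c.salesRepEmployeeNumber = e.employeeNumber
JOIN offices ofc     ON e.officeCode = ofc.officeCode;
```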
After the raw data is loaded, you would perform the necessary transformations.
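The transformation then runs inside the target system, against the staged table:

```sql
-- Transform inside the target system: aggregate the staged raw data
CREATE TABLE monthly_sales AS
SELECT
    DATE_FORMAT(orderDate, '%Y-%m') AS sales_month,
    productLine,
    territory,
    SUM(quantityOrdered * priceEach) AS total_sales
FROM raw_sales_data
GROUP BY sales_month, productLine, territory;
```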
Here, `monthly_sales` is the new table that contains the transformed sales data.

Note: The ELT process is particularly beneficial when dealing with large volumes of data, as it enables transformations to be carried out in parallel and leverages the full computational power of the target system. However, it requires careful management of data storage and computational resources.
Advanced Data Pipeline for Market Segment Analysis
In this complex scenario, imagine you are working as a data engineer for a company that sells products worldwide and uses the ClassicModels database. Your company is interested in an advanced market segment analysis. You need to build a data pipeline to answer these questions:

1. Which are the top 3 products by sales in each territory for each month?
2. How does monthly sales growth develop for each product line in each territory?
3. How much does each sales representative contribute to the total sales of their territory?
Given the complexity and the fact that this requires several related SQL operations, you would break down this scenario into multiple stages.
Stage 1: Data Extraction and Transformation
Extract data from multiple tables in the ClassicModels database, and transform it to calculate monthly sales by product, product line, territory, and sales representative.
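A possible Stage 1 view; the name `monthly_sales_detail` and its exact columns are illustrative, and the territory is taken from the office of the customer's sales representative:

```sql
CREATE OR REPLACE VIEW monthly_sales_detail AS
SELECT
    DATE_FORMAT(o.orderDate, '%Y-%m') AS sales_month,
    p.productCode,
    p.productName,
    p.productLine,
    ofc.territory,
    e.employeeNumber AS sales_rep,
    SUM(od.quantityOrdered * od.priceEach) AS total_sales
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
JOIN products p      ON od.productCode = p.productCode
JOIN customers c     ON o.customerNumber = c.customerNumber
JOIN employees e     ON c.salesRepEmployeeNumber = e.employeeNumber
JOIN offices ofc     ON e.officeCode = ofc.officeCode
GROUP BY sales_month, p.productCode, p.productName, p.productLine,
         ofc.territory, e.employeeNumber;
```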
Stage 2: Calculating Top Products by Sales
To answer the first question, use the `RANK()` window function to rank products by sales in each month in each territory.
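A sketch of the `top_products` view, building on the hypothetical `monthly_sales_detail` view from Stage 1 (window functions require MySQL 8.0 or later):

```sql
CREATE OR REPLACE VIEW top_products AS
SELECT
    sales_month,
    territory,
    productName,
    SUM(total_sales) AS product_sales,
    -- rank products by sales within each month and territory
    RANK() OVER (
        PARTITION BY sales_month, territory
        ORDER BY SUM(total_sales) DESC
    ) AS sales_rank
FROM monthly_sales_detail
GROUP BY sales_month, territory, productName;
```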
You can then query the `top_products` view to get the top 3 products:
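For example:

```sql
-- Top 3 products by sales in each territory for each month
SELECT sales_month, territory, productName, product_sales
FROM top_products
WHERE sales_rank <= 3
ORDER BY sales_month, territory, sales_rank;
```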
Stage 3: Calculating Monthly Sales Growth

To answer the second question, use the `LAG()` window function to access the previous month's sales, and then calculate the monthly growth.
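A sketch, again assuming `monthly_sales_detail`; growth is expressed as the change relative to the previous month's sales:

```sql
CREATE OR REPLACE VIEW monthly_sales_growth AS
SELECT
    sales_month,
    territory,
    productLine,
    SUM(total_sales) AS monthly_sales,
    LAG(SUM(total_sales)) OVER w AS prev_month_sales,
    (SUM(total_sales) - LAG(SUM(total_sales)) OVER w)
        / LAG(SUM(total_sales)) OVER w AS monthly_growth
FROM monthly_sales_detail
GROUP BY sales_month, territory, productLine
WINDOW w AS (PARTITION BY territory, productLine ORDER BY sales_month);
```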
Stage 4: Calculating Sales Contribution

To answer the third question, calculate the sales contribution of each sales representative to the total sales in their respective territories.
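One way to express this uses a window total per territory; the view name `rep_sales_contribution` is illustrative:

```sql
CREATE OR REPLACE VIEW rep_sales_contribution AS
SELECT
    territory,
    sales_rep,
    SUM(total_sales) AS rep_sales,
    -- share of the representative's sales in the territory total
    SUM(total_sales) / SUM(SUM(total_sales)) OVER (PARTITION BY territory)
        AS territory_share
FROM monthly_sales_detail
GROUP BY territory, sales_rep;
```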
This scenario showcases the use of multiple related SQL operations, including joins, grouping, window functions, and the creation of multiple views. It also demonstrates how to use SQL to break down a complex task into manageable stages, which is a common requirement in real-world data engineering projects.
Revenue Projections and Employee Performance Analysis
In this scenario, let's assume you are a data engineer at the ClassicModels company and your task is to build a pipeline that calculates revenue projections for the next year and analyzes employee performance. You need to:

1. Calculate the yearly revenue growth for each product line.
2. Predict next year's revenue for each product line.
3. Rank employees by the revenue they generated in the current year.
Given the complexity, this scenario would be broken down into multiple stages:
Stage 1: Data Extraction and Transformation
Extract data from multiple tables in the ClassicModels database, and transform it to calculate yearly sales by product line and employee.
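A possible Stage 1 view; the name `yearly_sales` is illustrative:

```sql
CREATE OR REPLACE VIEW yearly_sales AS
SELECT
    YEAR(o.orderDate) AS sales_year,
    p.productLine,
    e.employeeNumber,
    CONCAT(e.firstName, ' ', e.lastName) AS employee_name,
    SUM(od.quantityOrdered * od.priceEach) AS total_sales
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
JOIN products p      ON od.productCode = p.productCode
JOIN customers c     ON o.customerNumber = c.customerNumber
JOIN employees e     ON c.salesRepEmployeeNumber = e.employeeNumber
GROUP BY sales_year, p.productLine, e.employeeNumber, employee_name;
```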
Stage 2: Calculating Yearly Revenue Growth
To answer the first question, use the `LAG()` window function to access the previous year's sales, and then calculate the yearly growth.
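A sketch building on the hypothetical `yearly_sales` view:

```sql
CREATE OR REPLACE VIEW yearly_revenue_growth AS
SELECT
    sales_year,
    productLine,
    SUM(total_sales) AS yearly_revenue,
    LAG(SUM(total_sales)) OVER w AS prev_year_revenue,
    (SUM(total_sales) - LAG(SUM(total_sales)) OVER w)
        / LAG(SUM(total_sales)) OVER w AS yearly_growth
FROM yearly_sales
GROUP BY sales_year, productLine
WINDOW w AS (PARTITION BY productLine ORDER BY sales_year);
```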
Stage 3: Predicting Next Year's Revenue

To predict next year's revenue for each product line, you could assume that the yearly revenue growth remains constant and apply it to the current year's revenue.
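For example, projecting from the most recent year present in the data:

```sql
-- Apply the latest growth rate to the latest year's revenue
SELECT
    productLine,
    sales_year,
    yearly_revenue,
    yearly_growth,
    yearly_revenue * (1 + yearly_growth) AS projected_revenue_next_year
FROM yearly_revenue_growth
WHERE sales_year = (SELECT MAX(sales_year) FROM yearly_revenue_growth);
```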
Stage 4: Ranking Employees
To rank the employees based on the revenue they generated in the current year, use the `RANK()` window function.
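A sketch of the `employee_rankings` view, again based on the hypothetical `yearly_sales` view:

```sql
CREATE OR REPLACE VIEW employee_rankings AS
SELECT
    employeeNumber,
    employee_name,
    SUM(total_sales) AS revenue,
    RANK() OVER (ORDER BY SUM(total_sales) DESC) AS revenue_rank
FROM yearly_sales
WHERE sales_year = (SELECT MAX(sales_year) FROM yearly_sales)
GROUP BY employeeNumber, employee_name;

-- Querying the view then returns the rankings
SELECT * FROM employee_rankings ORDER BY revenue_rank;
```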
You can then query the `employee_rankings` view to get the rankings.

In this scenario, we made use of multiple SQL operations, including joins, grouping, window functions, and the creation of multiple views. We have also shown how to use SQL to decompose a complex task into simpler stages, a frequent requirement in real-world data engineering projects.
Order Fulfillment and Inventory Management
In this complex scenario, you are tasked with building a system to manage order fulfillment and inventory for ClassicModels. The system should meet the following requirements:

1. Track the quantity of each product that is currently on order.
2. Identify products that may soon be out of stock.
3. Prioritize orders based on the order date and the total order value.
4. Track the fulfillment status of each order.
Given the complexity, we will again break this scenario into several stages.
Stage 1: Tracking Products on Order
You start by creating a view that calculates the quantity of each product that is currently on order:
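A sketch that treats orders with status `In Process` or `On Hold` as open; adjust the status filter to match your definition of "on order":

```sql
CREATE OR REPLACE VIEW products_on_order AS
SELECT
    od.productCode,
    SUM(od.quantityOrdered) AS quantity_on_order
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
WHERE o.status IN ('In Process', 'On Hold')  -- open orders only (assumed statuses)
GROUP BY od.productCode;
```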
Stage 2: Identifying Products That May Soon Be Out of Stock
Next, you join the `products_on_order` view with the `products` table to compare the quantity on order with the quantity in stock:
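For example (the 80% threshold for the low-stock warning is an arbitrary illustration):

```sql
-- Flag products whose open orders approach or exceed the stock on hand
SELECT
    p.productCode,
    p.productName,
    p.quantityInStock,
    poo.quantity_on_order,
    CASE
        WHEN poo.quantity_on_order >= p.quantityInStock       THEN 'RESTOCK NOW'
        WHEN poo.quantity_on_order >= 0.8 * p.quantityInStock THEN 'LOW STOCK'
        ELSE 'OK'
    END AS stock_status
FROM products p
JOIN products_on_order poo ON p.productCode = poo.productCode;
```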
Stage 3: Prioritizing Orders

Next, you create a view that ranks orders based on the order date and the total order value:
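A sketch that ranks open orders, oldest and highest-value first; the view name `order_priorities` is illustrative:

```sql
CREATE OR REPLACE VIEW order_priorities AS
SELECT
    o.orderNumber,
    o.orderDate,
    SUM(od.quantityOrdered * od.priceEach) AS order_value,
    RANK() OVER (
        ORDER BY o.orderDate ASC,
                 SUM(od.quantityOrdered * od.priceEach) DESC
    ) AS priority_rank
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
WHERE o.status IN ('In Process', 'On Hold')  -- only orders still to fulfill
GROUP BY o.orderNumber, o.orderDate;
```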
Stage 4: Tracking the Fulfillment Status of Each Order
Finally, you create a view that tracks the fulfillment status of each order:
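A sketch using the date and status columns of the `orders` table; the labels in the `CASE` expression are illustrative:

```sql
CREATE OR REPLACE VIEW order_fulfillment_status AS
SELECT
    orderNumber,
    orderDate,
    requiredDate,
    shippedDate,
    status,
    CASE
        WHEN shippedDate IS NOT NULL
             AND shippedDate <= requiredDate THEN 'Fulfilled on time'
        WHEN shippedDate IS NOT NULL         THEN 'Fulfilled late'
        WHEN status = 'Cancelled'            THEN 'Cancelled'
        ELSE 'Pending'
    END AS fulfillment_status
FROM orders;
```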
This scenario involves multiple complex SQL operations including joins, subqueries, aggregations, window functions, and conditional logic. It also illustrates how SQL can be used to build complex data pipelines that transform raw data into actionable insights.
Customer Lifetime Value Analysis
This complex scenario requires calculating the lifetime value of customers, which is a prediction of the net profit attributed to the entire future relationship with a customer. The calculation involves understanding purchase frequency, customer value, customer lifespan, and segmentation.
Stage 1: Calculate Average Purchase Frequency
This involves calculating how often a customer places an order on average.
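A minimal sketch, defining purchase frequency as total orders divided by distinct ordering customers; the view names in this section are illustrative:

```sql
CREATE OR REPLACE VIEW avg_purchase_frequency AS
SELECT COUNT(*) / COUNT(DISTINCT customerNumber) AS purchase_frequency
FROM orders;
```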
Stage 2: Calculate Average Order Value
This is the average amount a customer spends per order.
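For example:

```sql
CREATE OR REPLACE VIEW avg_order_value AS
SELECT SUM(quantityOrdered * priceEach) / COUNT(DISTINCT orderNumber) AS order_value
FROM orderdetails;
```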
Stage 3: Calculate Customer Value
This is the average purchase frequency multiplied by the average order value.
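Combining the two previous views:

```sql
CREATE OR REPLACE VIEW customer_value AS
SELECT f.purchase_frequency * v.order_value AS cust_value
FROM avg_purchase_frequency f
CROSS JOIN avg_order_value v;
```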
Stage 4: Calculate Customer Lifespan
This is the average number of years a customer continues to buy from the company.
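A sketch that measures each customer's span between first and last order and averages it in years:

```sql
-- Span of each customer's ordering history
CREATE OR REPLACE VIEW customer_order_span AS
SELECT customerNumber,
       MIN(orderDate) AS first_order,
       MAX(orderDate) AS last_order
FROM orders
GROUP BY customerNumber;

-- Average lifespan in years across all customers
CREATE OR REPLACE VIEW customer_lifespan AS
SELECT AVG(DATEDIFF(last_order, first_order) / 365) AS lifespan_years
FROM customer_order_span;
```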
Stage 5: Calculate Customer Lifetime Value (CLV)
Finally, calculate the CLV by multiplying the customer value by the customer lifespan.
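Putting it together:

```sql
CREATE OR REPLACE VIEW customer_lifetime_value AS
SELECT cv.cust_value * cl.lifespan_years AS clv
FROM customer_value cv
CROSS JOIN customer_lifespan cl;
```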
Here, you have built a complex customer lifetime value model using several SQL views, each building on the last. You've combined different aspects of SQL including joins, aggregations, mathematical operations, and date functions to calculate key business metrics. It's a great demonstration of how SQL can be used in practical data engineering scenarios to derive insights that drive business decisions.
Data Verification and Integrity Check in ETL Pipeline
In data engineering, ETL (Extract, Transform, Load) processes are common. They involve extracting data from different sources, transforming it to fit operational needs, then loading it into the database. SQL, especially with window functions and other advanced techniques, can play a critical role in data verification and integrity checks.
Here is an example where we create a composite ETL process using the `classicmodels` database.

Stage 1: Data Extraction
We are going to extract data from the `orders` and `orderdetails` tables. For simplicity, let's assume we are only interested in the order number, product code, quantity ordered, and order date.
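A sketch of the extraction view; the name `extracted_orders` is illustrative:

```sql
CREATE OR REPLACE VIEW extracted_orders AS
SELECT
    o.orderNumber,
    od.productCode,
    od.quantityOrdered,
    o.orderDate
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber;
```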
Stage 2: Data Transformation

Let's say our operational need requires month-wise and product-wise aggregation of quantity. We also want to flag months where a particular product has been ordered more than the average quantity.
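A sketch of the transformation, using window functions over the grouped result (MySQL 8.0+); the view name and the exact flag rule are illustrative:

```sql
CREATE OR REPLACE VIEW transformed_orders AS
SELECT
    productCode,
    DATE_FORMAT(orderDate, '%Y-%m') AS order_month,
    SUM(quantityOrdered) AS monthly_quantity,
    -- running average of the monthly quantity per product
    AVG(SUM(quantityOrdered)) OVER (
        PARTITION BY productCode ORDER BY MIN(orderDate)
    ) AS running_avg_quantity,
    -- sequence number of the month within each product's history
    ROW_NUMBER() OVER (
        PARTITION BY productCode ORDER BY MIN(orderDate)
    ) AS month_seq,
    -- flag months where the product was ordered more than its overall average
    CASE
        WHEN SUM(quantityOrdered) >
             AVG(SUM(quantityOrdered)) OVER (PARTITION BY productCode)
        THEN 1 ELSE 0
    END AS above_avg_flag
FROM extracted_orders
GROUP BY productCode, order_month;
```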
Stage 3: Data Loading
Assume we have a new table `product_performance` where we load our transformed data.
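A sketch of the load step:

```sql
-- Create and populate the target table from the transformed view;
-- for repeated loads you would use INSERT INTO ... SELECT instead
CREATE TABLE product_performance AS
SELECT * FROM transformed_orders;
```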
In this scenario, we used SQL window functions for calculating running averages and row numbers for each product. We also demonstrated how data can be flagged based on certain conditions during the transformation stage. This showcases SQL's power in dealing with complex ETL processes.
These are abstract examples. In real-world applications, the data extraction phase may involve pulling data from APIs, flat files, or different database systems. The transformation phase could involve complex business rules, and the loading phase might involve writing to a data warehouse or a cloud storage system. Regardless, SQL (with or without window functions) is a crucial tool in every data engineer's toolbox to ensure data integrity and quality in these processes.
Streamlined Data Analysis Workflow with Window Functions and Stored Procedures
In many scenarios, organizations might need to generate complex analyses and reports on a regular basis. For such tasks, SQL window functions and stored procedures can help in creating streamlined and efficient workflows.
Consider a scenario where the management of ClassicModels wants a monthly report of the top 5 products (by sales) for each product line. They also want to know if these top products have improved their ranking over time.
Step 1: Data Preparation using Window Functions
First, we need to prepare the data for analysis. We will create a view that shows the product sales by month and their rank within their product line.
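A sketch of such a view; the name `product_sales_ranks` is illustrative:

```sql
CREATE OR REPLACE VIEW product_sales_ranks AS
SELECT
    DATE_FORMAT(o.orderDate, '%Y-%m') AS sales_month,
    p.productLine,
    p.productName,
    SUM(od.quantityOrdered * od.priceEach) AS monthly_sales,
    -- rank each product within its product line for the month
    RANK() OVER (
        PARTITION BY DATE_FORMAT(o.orderDate, '%Y-%m'), p.productLine
        ORDER BY SUM(od.quantityOrdered * od.priceEach) DESC
    ) AS line_rank
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
JOIN products p      ON od.productCode = p.productCode
GROUP BY DATE_FORMAT(o.orderDate, '%Y-%m'), p.productLine, p.productName;
```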
Step 2: Analysis Workflow with Stored Procedure
Next, we create a stored procedure to extract the top 5 products for each product line for a given month.
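A sketch of the procedure, assuming the `product_sales_ranks` view above; it takes the month as a 'YYYY-MM' string and joins each product to its previous month's rank to show movement over time:

```sql
DELIMITER //

CREATE PROCEDURE GetTopProducts(IN target_month CHAR(7))
BEGIN
    -- Top 5 products per product line for the requested month,
    -- with the previous month's rank for comparison
    SELECT
        cur.productLine,
        cur.productName,
        cur.monthly_sales,
        cur.line_rank,
        prev.line_rank AS prev_month_rank
    FROM product_sales_ranks cur
    LEFT JOIN product_sales_ranks prev
           ON prev.productLine = cur.productLine
          AND prev.productName = cur.productName
          AND prev.sales_month = DATE_FORMAT(
                  STR_TO_DATE(CONCAT(cur.sales_month, '-01'), '%Y-%m-%d')
                  - INTERVAL 1 MONTH, '%Y-%m')
    WHERE cur.sales_month = target_month
      AND cur.line_rank <= 5
    ORDER BY cur.productLine, cur.line_rank;
END //

DELIMITER ;
```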
The team can now generate a monthly report by simply calling this stored procedure with the target month as an argument. The procedure will provide a list of top products, their sales, and their rankings within their product lines, helping the team to track and analyze product performance effectively.
```sql
CALL GetTopProducts('2023-07');
```

Again, this example shows how advanced SQL features can be integrated into complex data analysis workflows. Using window functions, we can perform complex calculations on data subsets, and stored procedures allow us to encapsulate and reuse those computations.
Please note, the examples shown here are simplified for clarity. Real-world use-cases would potentially involve more complex SQL queries, additional error handling, and further optimization techniques.