# Introduction to SQL for Excel Users – Part 21: Correlated Subqueries

[Original post](https://www.daveondata.com/blog/introduction-to-sql-for-excel-users-part-21-correlated-subqueries/)

## The Scenario

NOTE – This post contains no Excel content as the topic is “self-contained” to SQL.

For this post the hypothetical scenario is that the AdventureWorks sales manager would like to see the largest sale ever made for each of the AdventureWorks sales reps.

OK, you’re prolly thinking, “Easy peasy, Dave! I can just use the mighty ROW_NUMBER for this.”

Yep, that’s true.
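For reference, here's a minimal sketch of what that ROW_NUMBER approach could look like. To keep it runnable without the AdventureWorks database, it uses a handful of made-up rows (the reps, order numbers, and amounts are hypothetical), and the `RankedSales` and `SalesRank` names are just illustrative.

```sql
-- Minimal sketch of the ROW_NUMBER approach.
-- The rows below are made-up sample data, not AdventureWorks results.
WITH SalesRepData AS
(
    SELECT V.SalesRep, V.SalesOrderNumber, V.SalesAmount
    FROM (VALUES
         ('Doe, John', 'SO1', 100.00)
        ,('Doe, John', 'SO2', 300.00)
        ,('Doe, John', 'SO3', 600.00)
        ,('Roe, Jane', 'SO4', 250.00)
        ,('Roe, Jane', 'SO5', 750.00)
    ) AS V (SalesRep, SalesOrderNumber, SalesAmount)
),
RankedSales AS
(
    SELECT SRD.SalesRep
        ,SRD.SalesOrderNumber
        ,SRD.SalesAmount
        -- Number each rep's orders from largest to smallest
        ,ROW_NUMBER() OVER (PARTITION BY SRD.SalesRep
                            ORDER BY SRD.SalesAmount DESC) AS SalesRank
    FROM SalesRepData SRD
)
SELECT RS.SalesRep
    ,RS.SalesOrderNumber
    ,RS.SalesAmount AS LargestSalesAmount
FROM RankedSales RS
WHERE RS.SalesRank = 1   -- keep only each rep's largest order
ORDER BY RS.SalesRep;
```

With this sample data, the query should return one row per rep: order SO3 for Doe and order SO5 for Roe.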

So I’m going to make it more interesting.

Not only does the sales manager want to know the largest sale of each sales rep, but also what percentage of the rep’s total lifetime sales the largest sale represents.

For example, if sales rep John Doe’s largest sale ever was $100,000 and John Doe’s total lifetime sales are $1,000,000, the largest sale represents 10% of John’s total lifetime sales.

Pretty cool, eh? 😁
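The John Doe arithmetic is just the largest sale divided by the lifetime total. Here's a quick sketch using the hypothetical figures from the example (the column aliases are made up), plus one gotcha to watch for when both numbers are whole numbers.

```sql
-- Hypothetical John Doe figures from the example
SELECT 100000.0 / 1000000.0 AS LargestPctOfTotal;   -- decimal division: a value equal to 0.1 (10%)

-- Careful: with whole-number literals SQL Server performs integer division
SELECT 100000 / 1000000 AS IntegerDivisionResult;   -- 0
```

The integer-division surprise only bites when both operands are integer types, so it's worth checking the data types of the columns being divided.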

NOTE – For this post I’m going to ignore years of service of the sales reps (e.g., a rep with 10 years on the job is going to have high lifetime total sales).

## Query Prototype

When crafting a non-trivial SQL query it is often helpful to prototype the query in stages.

First up, I need to get all the sales orders for the AdventureWorks sales reps:

In [None]:
SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
    ,E.HireDate
    ,FRS.SalesOrderNumber
    ,CAST(FRS.OrderDate AS Date) AS OrderDate
    ,SUM(FRS.SalesAmount) AS SalesAmount
FROM FactResellerSales FRS 
    INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate

If I run the SQL ☝ in SQL Server Management Studio (SSMS), the results show that there are:

- Many sales reps
- Many sales orders per sales rep

Given this situation with the data, the simplest prototype is to write the SQL for a single, hard-coded sales rep.

I will choose Alberts, Amy as the hard-coded sales rep.

I can wrap the SQL ☝ as a CTE and then perform the needed calcs:



In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
)
SELECT SRD.SalesRep
      ,MAX(SRD.SalesAmount) AS LargestSalesAmount
      ,SUM(SRD.SalesAmount) AS TotalLifetimeSales
      ,MAX(SRD.SalesAmount) / SUM(SRD.SalesAmount) AS LargestPctOfTotal
FROM SalesRepData SRD
WHERE SRD.SalesRep = 'Alberts, Amy'
GROUP BY SRD.SalesRep;

Awesome!

It would be easy (although a bit tedious) to add more sales reps to the SQL code.

For example:

In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
)
SELECT SRD.SalesRep
      ,MAX(SRD.SalesAmount) AS LargestSalesAmount
      ,SUM(SRD.SalesAmount) AS TotalLifetimeSales
      ,MAX(SRD.SalesAmount) / SUM(SRD.SalesAmount) AS LargestPctOfTotal
FROM SalesRepData SRD
WHERE SRD.SalesRep IN ('Alberts, Amy', 'Campbell, David', 'Vargas, Garrett')
GROUP BY SRD.SalesRep;

Of course, this strategy of just adding more and more sales reps to the IN clause isn’t scalable.

What I want is the ability for the outer query to automatically be executed once for each sales rep.

Enter correlated subqueries.

NOTE – This is a contrived example. This scenario can be implemented in SQL a number of ways. I’m using correlated subqueries for teaching purposes. 👨‍🏫

## Correlated Subqueries

SQL supports writing subqueries in such a way that the inner query takes a dependency on the outer query.

Conceptually, this dependency means that the inner query is executed once for each row of the outer query.

As usual, some SQL code makes this more clear.

First, I will modify the query with a CTE to pull DISTINCT sales rep names:

In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
),
DistinctSalesRep AS
(
    SELECT DISTINCT SRD.SalesRep
    FROM SalesRepData SRD
)
SELECT DSR.SalesRep
FROM DistinctSalesRep DSR
ORDER BY DSR.SalesRep;

Sweet!

As you prolly surmised, using DISTINCT in your SQL produces only unique values.

In this case, the unique sales rep names.

Now that I have an outer query of DISTINCT sales rep names, I can use a correlated subquery to retrieve data for each sales rep name:

In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
),
DistinctSalesRep AS
(
    SELECT DISTINCT SRD.SalesRep
    FROM SalesRepData SRD
)
SELECT DSR.SalesRep
      ,(SELECT MAX(SRD.SalesAmount)
        FROM SalesRepData SRD
        WHERE SRD.SalesRep = DSR.SalesRep
        GROUP BY SRD.SalesRep) AS LargestSalesAmount
FROM DistinctSalesRep DSR
ORDER BY DSR.SalesRep;

The secret sauce in the SQL ☝ is two-fold:

1. The subquery in the SELECT list.
1. The WHERE condition of SRD.SalesRep = DSR.SalesRep.

Regarding #1, you can place a subquery in the SELECT list if it produces a virtual table of a single column and no more than a single row.

Regarding #2, the row-by-row behavior of correlated subqueries now makes sense.

Due to the WHERE, the inner query (aliased as SRD) can only return results based on the rows of the outer query (aliased as DSR).
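If you'd like to watch the row-by-row behavior without the AdventureWorks database, here's a minimal self-contained sketch. The sample rows are made up (hypothetical reps and amounts), but the CTE names and aliases mirror the ones used in this post.

```sql
-- Made-up sample data standing in for the AdventureWorks tables
WITH SalesRepData AS
(
    SELECT V.SalesRep, V.SalesOrderNumber, V.SalesAmount
    FROM (VALUES
         ('Doe, John', 'SO1', 100.00)
        ,('Doe, John', 'SO2', 300.00)
        ,('Doe, John', 'SO3', 600.00)
        ,('Roe, Jane', 'SO4', 250.00)
        ,('Roe, Jane', 'SO5', 750.00)
    ) AS V (SalesRep, SalesOrderNumber, SalesAmount)
),
DistinctSalesRep AS
(
    SELECT DISTINCT SRD.SalesRep
    FROM SalesRepData SRD
)
SELECT DSR.SalesRep
      -- Conceptually evaluated once per outer row:
      -- DSR.SalesRep acts like a parameter passed to the inner query
      ,(SELECT MAX(SRD.SalesAmount)
        FROM SalesRepData SRD
        WHERE SRD.SalesRep = DSR.SalesRep) AS LargestSalesAmount
FROM DistinctSalesRep DSR
ORDER BY DSR.SalesRep;
```

With this sample data, Doe's largest amount should come back as 600 and Roe's as 750. This sketch leaves out the GROUP BY inside the subquery: the WHERE already narrows the inner query to one rep, and an aggregate without a GROUP BY returns exactly one row, which is what a subquery in the SELECT list needs.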

I can now add the remaining correlated subqueries to fulfill the scenario:

In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
),
DistinctSalesRep AS
(
    SELECT DISTINCT SRD.SalesRep
    FROM SalesRepData SRD
)
SELECT DSR.SalesRep
      ,(SELECT MAX(SRD.SalesAmount)
        FROM SalesRepData SRD
        WHERE SRD.SalesRep = DSR.SalesRep
        GROUP BY SRD.SalesRep) AS LargestSalesAmount
      ,(SELECT SUM(SRD.SalesAmount)
        FROM SalesRepData SRD
        WHERE SRD.SalesRep = DSR.SalesRep
        GROUP BY SRD.SalesRep) AS TotalLifetimeSales
      ,(SELECT MAX(SRD.SalesAmount) / SUM(SRD.SalesAmount)
        FROM SalesRepData SRD
        WHERE SRD.SalesRep = DSR.SalesRep
        GROUP BY SRD.SalesRep) AS LargestPctOfTotal
FROM DistinctSalesRep DSR
ORDER BY DSR.SalesRep;

Voila!

The above produces the results needed for the scenario, but there’s a problem.

Correlated subqueries are ugly.

Correlated subqueries are hard to debug and maintain.

Guess what?

CTEs to the rescue!

## CTEs Rule!

There are times where correlated subqueries can be just the ticket in your SQL.

Often, you can avoid the use of correlated subqueries using JOINs or CTEs.
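As a quick sketch of the JOIN route, the per-rep totals can be calculated in a derived table and then joined back to the list of reps. The sample rows are made up so the code runs on its own, and the `RT` alias is just an illustrative name.

```sql
-- Made-up sample data standing in for the AdventureWorks tables
WITH SalesRepData AS
(
    SELECT V.SalesRep, V.SalesOrderNumber, V.SalesAmount
    FROM (VALUES
         ('Doe, John', 'SO1', 100.00)
        ,('Doe, John', 'SO2', 300.00)
        ,('Doe, John', 'SO3', 600.00)
        ,('Roe, Jane', 'SO4', 250.00)
        ,('Roe, Jane', 'SO5', 750.00)
    ) AS V (SalesRep, SalesOrderNumber, SalesAmount)
),
DistinctSalesRep AS
(
    SELECT DISTINCT SRD.SalesRep
    FROM SalesRepData SRD
)
SELECT DSR.SalesRep
      ,RT.LargestSalesAmount
      ,RT.TotalLifetimeSales
      ,RT.LargestSalesAmount / RT.TotalLifetimeSales AS LargestPctOfTotal
FROM DistinctSalesRep DSR
    -- The derived table holds one row of totals per rep;
    -- the JOIN matches each rep to their row instead of using a correlated subquery
    INNER JOIN (SELECT SRD.SalesRep
                      ,MAX(SRD.SalesAmount) AS LargestSalesAmount
                      ,SUM(SRD.SalesAmount) AS TotalLifetimeSales
                FROM SalesRepData SRD
                GROUP BY SRD.SalesRep) RT ON (RT.SalesRep = DSR.SalesRep)
ORDER BY DSR.SalesRep;
```

With this sample data, both reps total 1,000 in sales, so the percentages should work out to 60% for Doe and 75% for Roe.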

Here’s how I could implement the same results using CTEs.

NOTE – The following SQL could be written a number of ways. I wrote it this way for teaching purposes. 😉

In [None]:
WITH SalesRepData AS
(
    SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
        ,E.HireDate
        ,FRS.SalesOrderNumber
        ,CAST(FRS.OrderDate AS Date) AS OrderDate
        ,SUM(FRS.SalesAmount) AS SalesAmount
    FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
    GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
),
IndividualSalesRepData AS
(
    SELECT SRD.SalesRep
          ,MAX(SRD.SalesAmount) AS LargestSalesAmount
          ,SUM(SRD.SalesAmount) AS TotalLifetimeSales
          ,MAX(SRD.SalesAmount) / SUM(SRD.SalesAmount) AS LargestPctOfTotal
    FROM SalesRepData SRD
    GROUP BY SRD.SalesRep
)
SELECT ISRD.SalesRep
      ,ISRD.LargestSalesAmount
      ,ISRD.TotalLifetimeSales
      ,ISRD.LargestPctOfTotal
FROM IndividualSalesRepData ISRD
ORDER BY ISRD.SalesRep;

We’ve seen the SSMS results multiple times, so I won’t repeat them.

You can often use CTEs to replace correlated subqueries not only in SELECT lists, but also in WHERE clauses.

Yes, you can use correlated subqueries in WHERE clauses! 😲
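Here's a small taste of what that can look like, as a sketch only. It uses made-up sample rows (hypothetical reps and amounts), and the `SRD2` alias is just an illustrative name for the inner query.

```sql
-- Made-up sample data standing in for the AdventureWorks tables
WITH SalesRepData AS
(
    SELECT V.SalesRep, V.SalesOrderNumber, V.SalesAmount
    FROM (VALUES
         ('Doe, John', 'SO1', 100.00)
        ,('Doe, John', 'SO2', 300.00)
        ,('Doe, John', 'SO3', 600.00)
        ,('Roe, Jane', 'SO4', 250.00)
        ,('Roe, Jane', 'SO5', 750.00)
    ) AS V (SalesRep, SalesOrderNumber, SalesAmount)
)
SELECT SRD.SalesRep
      ,SRD.SalesOrderNumber
      ,SRD.SalesAmount
FROM SalesRepData SRD
-- Keep a row only when its amount equals that same rep's largest amount
WHERE SRD.SalesAmount = (SELECT MAX(SRD2.SalesAmount)
                         FROM SalesRepData SRD2
                         WHERE SRD2.SalesRep = SRD.SalesRep)
ORDER BY SRD.SalesRep;
```

With this sample data, the query should return order SO3 for Doe and order SO5 for Roe. If a rep had two orders tied for largest, both rows would come back.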

## The Learning Arc

The next post will continue coverage of correlated subqueries in WHERE clauses.

Remember – use CTEs when you can instead of correlated subqueries. 😉

Stay healthy and happy data sleuthing!