# Introduction to SQL for Excel Users – Part 20: Basic Subqueries

[Original post](https://www.daveondata.com/blog/introduction-to-sql-for-excel-users-part-20-basic-subqueries/)

## I See Virtual Tables. They’re Everywhere.

NOTE – This post contains no Excel content as the topic is “self-contained” to SQL.

As I’ve mentioned several times in this series, a critical concept in SQL is that of the virtual table.

I’ve worked with virtual tables quite a bit in this series via the use of CTEs and JOINs.

The concept of a virtual table is closely tied to the concept of SQL subqueries.

For example, CTEs are a great mechanism for writing subqueries.

Since this is all a little abstract, I will demonstrate with some code.

## Subqueries – the Inner and the Outer

When working with subqueries, it is helpful to think of inner queries and outer queries.

You know from the use of CTEs that the query inside the WITH is the inner query and that the outer query is the SELECT at the very end.

Take the following SQL from the last post:

WITH SalesRepData AS
(
     SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
           ,E.HireDate
           ,FRS.SalesOrderNumber
           ,CAST(FRS.OrderDate AS Date) AS OrderDate
           ,SUM(FRS.SalesAmount) AS SalesAmount
     FROM FactResellerSales FRS 
         INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
     GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
)
SELECT SRD.SalesRep
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN 1 ELSE 0 END) AS TotalSalesMadeFirst90Days
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN SRD.SalesAmount ELSE 0.0 END) AS TotalSalesAmountFirst90Days
FROM SalesRepData SRD
GROUP BY SRD.SalesRep
ORDER BY SRD.SalesRep;

In the SQL ☝, SalesRepData produces a virtual table that is the result of running the inner query.

Technically, SalesRepData is a subquery.

The following SQL is equivalent to the SQL ☝:

SELECT SRD.SalesRep
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN 1 ELSE 0 END) AS TotalSalesMadeFirst90Days
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN SRD.SalesAmount ELSE 0.0 END) AS TotalSalesAmountFirst90Days
FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
            ,E.HireDate
            ,FRS.SalesOrderNumber
            ,CAST(FRS.OrderDate AS Date) AS OrderDate
            ,SUM(FRS.SalesAmount) AS SalesAmount
      FROM FactResellerSales FRS 
        INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
      GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD
GROUP BY SRD.SalesRep
ORDER BY SRD.SalesRep;

While not the most attractive SQL in the world, the code ☝ is legit. You can execute both queries and both return the exact same output.

In fact, the second piece of SQL is, conceptually, what happens behind the scenes in the DB.

Using CTEs is mainly about making your SQL code easier to read and maintain.

You can use subqueries anywhere a physical table (e.g., DimEmployee) or a virtual table is legit in a piece of SQL.

To emphasize this point, I’m going to do something silly.

I’ll take the last piece of SQL and replace DimEmployee in the INNER JOIN with a subquery (which, of course, produces a virtual table):

SELECT SRD.SalesRep
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN 1 ELSE 0 END) AS TotalSalesMadeFirst90Days
      ,SUM(CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN SRD.SalesAmount ELSE 0.0 END) AS TotalSalesAmountFirst90Days
FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
            ,E.HireDate
            ,FRS.SalesOrderNumber
            ,CAST(FRS.OrderDate AS Date) AS OrderDate
            ,SUM(FRS.SalesAmount) AS SalesAmount
      FROM FactResellerSales FRS 
        INNER JOIN (SELECT *
                    FROM DimEmployee) E ON (FRS.EmployeeKey = E.EmployeeKey)
      GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD
GROUP BY SRD.SalesRep
ORDER BY SRD.SalesRep;

Notice how the SQL keeps getting uglier, but is still legit?

If you run this last piece of SQL you get the same results as the first two pieces of SQL, but the code is more difficult to understand.
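If you don’t have AdventureWorks handy, here’s a tiny sketch of the same idea that should run on SQL Server as-is. The rows come from made-up VALUES lists (the keys, names, and amounts are hypothetical, not AdventureWorks data). The first query uses subqueries in both the FROM and the INNER JOIN. The second expresses the same logic with CTEs. Both should return the same two rows.

```sql
-- Form 1: subqueries used wherever a table is legit (FROM and INNER JOIN).
-- All rows below are hypothetical toy data.
SELECT R.SalesRep
      ,T.TotalSalesAmount
FROM (SELECT *
      FROM (VALUES (1, 'Doe, Jane')
                  ,(2, 'Smith, John')) AS E (EmployeeKey, SalesRep)) R
    INNER JOIN (SELECT S.EmployeeKey
                      ,SUM(S.SalesAmount) AS TotalSalesAmount
                FROM (VALUES (1, 100.00)
                            ,(1, 250.00)
                            ,(2, 75.00)) AS S (EmployeeKey, SalesAmount)
                GROUP BY S.EmployeeKey) T ON (R.EmployeeKey = T.EmployeeKey)
ORDER BY R.SalesRep;

-- Form 2: the same logic, with each inner query given a name via a CTE.
WITH Reps AS
(
    SELECT E.EmployeeKey
          ,E.SalesRep
    FROM (VALUES (1, 'Doe, Jane')
                ,(2, 'Smith, John')) AS E (EmployeeKey, SalesRep)
),
RepTotals AS
(
    SELECT S.EmployeeKey
          ,SUM(S.SalesAmount) AS TotalSalesAmount
    FROM (VALUES (1, 100.00)
                ,(1, 250.00)
                ,(2, 75.00)) AS S (EmployeeKey, SalesAmount)
    GROUP BY S.EmployeeKey
)
SELECT R.SalesRep
      ,T.TotalSalesAmount
FROM Reps R
    INNER JOIN RepTotals T ON (R.EmployeeKey = T.EmployeeKey)
ORDER BY R.SalesRep;
```

Both forms total 350.00 for the first rep and 75.00 for the second. The only difference is that the CTE version reads top to bottom instead of inside out.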

Can you see why I started with CTEs and not subqueries? 😁

CTEs are awesome!

Lemme show you why…

## Beware the Subquery

I’m going to continue the hypothetical analysis from the last post.

The question being analyzed is whether AdventureWorks sales rep performance in the first 90 days of employment is associated with sales rep performance at the 1-year mark.

I’m going to implement the SQL for this analysis without any CTEs.

It ain’t gonna be pretty. 🤣

Take the following SQL that pulls the 1-year sales rep performance (yes, I could have used DATEDIFF in a WHERE instead):

SELECT SRD365.SalesRep
      ,COUNT(SRD365.SalesOrderNumber) AS TotalSalesMadeFirst365Days
      ,SUM(SRD365.SalesAmount) AS TotalSalesAmountFirst365Days
FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
            ,FRS.SalesOrderNumber
            ,CAST(FRS.OrderDate AS Date) AS OrderDate
            ,SUM(FRS.SalesAmount) AS SalesAmount
            ,CASE WHEN DATEDIFF(DAY, E.HireDate, FRS.OrderDate) <= 364 THEN 1 ELSE 0 END AS SaleMadeFirst365Days
     FROM FactResellerSales FRS 
         INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
     GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD365
WHERE SRD365.SaleMadeFirst365Days = 1
GROUP BY SRD365.SalesRep
ORDER BY SRD365.SalesRep;
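For the curious, here’s a rough sketch of the DATEDIFF-in-a-WHERE alternative mentioned in the parenthetical. So that it runs without AdventureWorks, I’ve swapped the FactResellerSales/DimEmployee join for a handful of made-up rows (the names, dates, order numbers, and amounts are all hypothetical). The filter moves into the inner query’s WHERE, so the flag column and the outer WHERE are no longer needed.

```sql
-- Hypothetical toy rows standing in for the FactResellerSales/DimEmployee join.
SELECT SRD365.SalesRep
      ,COUNT(SRD365.SalesOrderNumber) AS TotalSalesMadeFirst365Days
      ,SUM(SRD365.SalesAmount) AS TotalSalesAmountFirst365Days
FROM (SELECT T.SalesRep
            ,T.SalesOrderNumber
            ,T.OrderDate
            ,SUM(T.SalesAmount) AS SalesAmount
      FROM (VALUES ('Doe, Jane',   CAST('2020-01-01' AS Date), 'SO1', CAST('2020-02-15' AS Date), 100.00)
                  ,('Doe, Jane',   CAST('2020-01-01' AS Date), 'SO1', CAST('2020-02-15' AS Date), 50.00)
                  ,('Doe, Jane',   CAST('2020-01-01' AS Date), 'SO2', CAST('2021-06-01' AS Date), 500.00)
                  ,('Smith, John', CAST('2020-03-01' AS Date), 'SO3', CAST('2020-09-01' AS Date), 75.00)
           ) AS T (SalesRep, HireDate, SalesOrderNumber, OrderDate, SalesAmount)
      -- Filter rows before grouping instead of flagging them with a CASE.
      WHERE DATEDIFF(DAY, T.HireDate, T.OrderDate) <= 364
      GROUP BY T.SalesRep, T.HireDate, T.SalesOrderNumber, T.OrderDate) SRD365
GROUP BY SRD365.SalesRep
ORDER BY SRD365.SalesRep;
```

In this toy data, order SO2 was placed more than a year after the hire date, so the WHERE drops it before the grouping happens. Each rep ends up with one order in the first year.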

To make this post a bit simpler, I will only consider AdventureWorks sales reps who made sales in their first year of employment.

I shouldn’t make the assumption that every one of those sales reps made a sale in the first 90 days.

Therefore, I need to LEFT OUTER JOIN my 90-day performance subquery to the 1-year performance data.

The first step is to restructure my SQL so that the 1-year performance is itself a subquery that can serve as the left table of that join.

Subquery time!

SELECT SRD365.SalesRep
      ,SRD365.TotalSalesMadeFirst365Days
      ,SRD365.TotalSalesAmountFirst365Days
FROM (SELECT SRD365.SalesRep
            ,COUNT(SRD365.SalesOrderNumber) AS TotalSalesMadeFirst365Days
            ,SUM(SRD365.SalesAmount) AS TotalSalesAmountFirst365Days
      FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
                  ,FRS.SalesOrderNumber
                  ,CAST(FRS.OrderDate AS Date) AS OrderDate
                  ,SUM(FRS.SalesAmount) AS SalesAmount
                  ,CASE WHEN DATEDIFF(DAY, E.HireDate, FRS.OrderDate) <= 364 THEN 1 ELSE 0 END AS SaleMadeFirst365Days
            FROM FactResellerSales FRS 
                INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
            GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD365
      WHERE SRD365.SaleMadeFirst365Days = 1
      GROUP BY SRD365.SalesRep) SRD365
ORDER BY SRD365.SalesRep;

Sweet!

The subquery ☝ produces a virtual table named SRD365 that will serve as the left table of the join.

Now I can left join the 90-day data with another subquery:

SELECT SRD365.SalesRep
      ,SRD90.TotalSalesMadeFirst90Days
      ,SRD90.TotalSalesAmountFirst90Days
      ,SRD365.TotalSalesMadeFirst365Days
      ,SRD365.TotalSalesAmountFirst365Days
FROM (SELECT SRD365.SalesRep
            ,COUNT(SRD365.SalesOrderNumber) AS TotalSalesMadeFirst365Days
            ,SUM(SRD365.SalesAmount) AS TotalSalesAmountFirst365Days
      FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
                  ,FRS.SalesOrderNumber
                  ,CAST(FRS.OrderDate AS Date) AS OrderDate
                  ,SUM(FRS.SalesAmount) AS SalesAmount
                  ,CASE WHEN DATEDIFF(DAY, E.HireDate, FRS.OrderDate) <= 364 THEN 1 ELSE 0 END AS SaleMadeFirst365Days
            FROM FactResellerSales FRS 
                INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
            GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD365
      WHERE SRD365.SaleMadeFirst365Days = 1
      GROUP BY SRD365.SalesRep) SRD365
    LEFT OUTER JOIN (SELECT SRD.SalesRep
                           ,COUNT(SRD.SalesOrderNumber) AS TotalSalesMadeFirst90Days
                           ,SUM(SRD.SalesAmount) AS TotalSalesAmountFirst90Days
                     FROM (SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
                                 ,FRS.SalesOrderNumber
                                 ,CAST(FRS.OrderDate AS Date) AS OrderDate
                                 ,SUM(FRS.SalesAmount) AS SalesAmount
                                 ,CASE WHEN DATEDIFF(DAY, E.HireDate, FRS.OrderDate) <= 89 THEN 1 ELSE 0 END AS SaleMadeFirst90Days
                           FROM FactResellerSales FRS 
                            INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
                           GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate) SRD
                     WHERE SRD.SaleMadeFirst90Days = 1
                     GROUP BY SRD.SalesRep) SRD90 ON (SRD365.SalesRep = SRD90.SalesRep)
ORDER BY SRD365.SalesRep; 

OK, that’s some useful data, but that SQL is ugly.

I’ve been coding SQL for more than 20 years and I would not want to maintain the ☝ code.

CTEs to the rescue!

## The Beauty of CTEs

NOTE – I know the following SQL code could be structured more optimally in a number of ways. I wrote it this way to make everything crystal clear to learners. 😁

Here’s the CTE-beautified code:

WITH SalesRepData AS
(
     SELECT CONCAT(E.LastName, ', ', E.FirstName) AS SalesRep 
           ,E.HireDate
           ,FRS.SalesOrderNumber
           ,CAST(FRS.OrderDate AS Date) AS OrderDate
           ,SUM(FRS.SalesAmount) AS SalesAmount
     FROM FactResellerSales FRS 
         INNER JOIN DimEmployee E ON (FRS.EmployeeKey = E.EmployeeKey)
     GROUP BY E.FirstName, E.LastName, E.HireDate, FRS.SalesOrderNumber, FRS.OrderDate
),
SalesRepPerformance AS
(
    SELECT SRD.SalesRep
          ,CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN 1 ELSE 0 END AS SalesMadeFirst90Days
          ,CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 89 THEN SRD.SalesAmount ELSE 0.0 END AS SalesAmountFirst90Days
          ,CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 364 THEN 1 ELSE 0 END AS SalesMadeFirst365Days
          ,CASE WHEN DATEDIFF(DAY, SRD.HireDate, SRD.OrderDate) <= 364 THEN SRD.SalesAmount ELSE 0.0 END AS SalesAmountFirst365Days
    FROM SalesRepData SRD
),
RepPerformance365Days AS
(
    SELECT SRP.SalesRep
          ,SUM(SRP.SalesMadeFirst365Days) AS TotalSalesMadeFirst365Days
          ,SUM(SRP.SalesAmountFirst365Days) AS TotalSalesAmountFirst365Days
    FROM SalesRepPerformance SRP
    WHERE SRP.SalesMadeFirst365Days = 1
    GROUP BY SRP.SalesRep
 
),
RepPerformance90Days AS
(
    SELECT SRP.SalesRep
          ,SUM(SRP.SalesMadeFirst90Days) AS TotalSalesMadeFirst90Days
          ,SUM(SRP.SalesAmountFirst90Days) AS TotalSalesAmountFirst90Days
    FROM SalesRepPerformance SRP
    WHERE SRP.SalesMadeFirst90Days = 1
    GROUP BY SRP.SalesRep
 
)
SELECT RP365.SalesRep
      ,COALESCE(RP90.TotalSalesMadeFirst90Days, 0.0) AS TotalSalesMadeFirst90Days
      ,COALESCE(RP90.TotalSalesAmountFirst90Days, 0.0) AS TotalSalesAmountFirst90Days
      ,RP365.TotalSalesMadeFirst365Days
      ,RP365.TotalSalesAmountFirst365Days
FROM RepPerformance365Days RP365
    LEFT OUTER JOIN RepPerformance90Days RP90 ON (RP365.SalesRep = RP90.SalesRep)
ORDER BY RP365.SalesRep;

Notice how I used a new piece of SQL goodness – COALESCE.

You use COALESCE to transform NULLs.

You feed COALESCE a list of values to check (e.g., a column or a hard-coded value) and it returns the first value that isn’t NULL.

In this case, if the 90-day value from the LEFT OUTER JOIN is NULL, COALESCE returns 0.0.
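Here’s a minimal sketch of COALESCE in isolation, in case you want to poke at it without AdventureWorks. The first query feeds it literal values. The second mimics the LEFT OUTER JOIN situation using two made-up VALUES lists (the reps and counts are hypothetical), where one rep has no matching 90-day row.

```sql
-- COALESCE checks its arguments left to right and returns the first non-NULL one.
SELECT COALESCE(NULL, NULL, 3) AS FirstNonNull   -- 3
      ,COALESCE(5, 0)          AS KeepsValue     -- 5
      ,COALESCE(NULL, 0.0)     AS ReplacesNull;  -- 0.0

-- Hypothetical toy data: the second rep has no row on the right side of the join.
SELECT RP365.SalesRep
      ,RP90.TotalSalesMadeFirst90Days AS RawValue                   -- NULL for the second rep
      ,COALESCE(RP90.TotalSalesMadeFirst90Days, 0) AS CleanedValue  -- 0 for the second rep
FROM (VALUES ('Doe, Jane', 4)
            ,('Smith, John', 2)) AS RP365 (SalesRep, TotalSalesMadeFirst365Days)
    LEFT OUTER JOIN (VALUES ('Doe, Jane', 1)) AS RP90 (SalesRep, TotalSalesMadeFirst90Days)
        ON (RP365.SalesRep = RP90.SalesRep)
ORDER BY RP365.SalesRep;
```

The LEFT OUTER JOIN keeps the rep with no 90-day sales, and COALESCE turns the resulting NULL into a number that is easier to work with downstream.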

Awesome!

With the data ☝, the AdventureWorks sales manager can conduct some interesting analyses.

For example, the manager could copy-and-paste the SSMS results into Excel and perform a linear regression analysis.

I will mention again that this is a trivial example.

However, SQL skills allow you to retrieve and shape data at scale (e.g., 10s of millions of records).



## The Learning Arc

The next post will continue coverage of subqueries to show other examples of when they can be used in your queries.

Hopefully I’ve demonstrated that SQL allows you great flexibility in crafting code using subqueries.

Hopefully I’ve also demonstrated that you wanna use CTEs all the time. 😉

Stay healthy and happy data sleuthing!