# Introduction to SQL for Excel Users – Part 16: Basic INNER JOINs

[Original post](https://www.daveondata.com/blog/introduction-to-sql-for-excel-users-part-16-basic-inner-joins/)

## LEFT JOINs Revisited

_NOTE – There will be no Excel coverage in this post as it isn’t necessary for the concepts and I wanted to keep the post to a reasonable length._

In the previous post, I covered the basics of the mighty LEFT OUTER JOIN.

If you haven’t already, go read that post.

Here’s a tasty bit of SQL from the last post:

```sql
SELECT E.EmployeeKey
      ,E.FirstName
      ,E.LastName
      ,SQ.SalesAmountQuota
FROM DimEmployee E 
    LEFT OUTER JOIN FactSalesQuota SQ ON (E.EmployeeKey = SQ.EmployeeKey)
WHERE E.EmployeeKey IN (271, 274, 275, 277, 282, 283) AND
      SQ.SalesAmountQuota IS NOT NULL
```


The SQL code ☝ is conceptually executed as follows:

1. Grab all rows from DimEmployee…
1. LEFT OUTER JOIN all rows from FactSalesQuota…
1. WHERE EmployeeKey is IN the defined list…
1. AND SalesAmountQuota IS NOT NULL

You know that LEFT OUTER JOINs keep all rows from the left virtual table.

You also know when there are no matches from the right virtual table, NULLs are returned.

This makes step 4 ☝ super interesting.

The query results become only EmployeeKeys that have SalesAmountQuotas.

In other words, only return the rows where there are matches.

Not surprisingly, this kind of matching is done all the time in SQL.

It’s called an INNER JOIN.
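
To see that equivalence for yourself, here's a minimal sketch that doesn't need AdventureWorks at all. The table variables `@Employee` and `@SalesQuota` and their rows are made up for illustration (tiny stand-ins for DimEmployee and FactSalesQuota), and I'm assuming the quota column never stores a NULL of its own:

```sql
-- Hypothetical stand-ins for DimEmployee and FactSalesQuota.
DECLARE @Employee TABLE (EmployeeKey INT, FirstName VARCHAR(20));
INSERT INTO @Employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');

DECLARE @SalesQuota TABLE (EmployeeKey INT, SalesAmountQuota MONEY);
INSERT INTO @SalesQuota VALUES (1, 1000), (1, 1500), (3, 2000);

-- Query A: LEFT OUTER JOIN, then filter out the rows with no match.
SELECT E.EmployeeKey
      ,E.FirstName
      ,SQ.SalesAmountQuota
FROM @Employee E
    LEFT OUTER JOIN @SalesQuota SQ ON (E.EmployeeKey = SQ.EmployeeKey)
WHERE SQ.SalesAmountQuota IS NOT NULL
ORDER BY E.EmployeeKey, SQ.SalesAmountQuota;

-- Query B: INNER JOIN keeps only the matches in the first place.
SELECT E.EmployeeKey
      ,E.FirstName
      ,SQ.SalesAmountQuota
FROM @Employee E
    INNER JOIN @SalesQuota SQ ON (E.EmployeeKey = SQ.EmployeeKey)
ORDER BY E.EmployeeKey, SQ.SalesAmountQuota;
```

Run the whole script as one batch. Both queries should return the same three rows: two for employee 1 and one for employee 3. Employee 2 has no quota, so the filter removes it from Query A and the INNER JOIN never produces it in Query B.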

## INNER JOINs

You use the mighty INNER JOIN when you want to only keep matches, including duplicates, between two tables.
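
Here's a small sketch of what "only keep matches, including duplicates" means in practice. The table variables and rows below are hypothetical, not AdventureWorks data:

```sql
-- Hypothetical data: three resellers, four sales rows.
DECLARE @Reseller TABLE (ResellerKey INT, ResellerName VARCHAR(30));
INSERT INTO @Reseller VALUES (1, 'Bike Barn'), (2, 'Cycle City'), (3, 'Wheel World');

DECLARE @ResellerSales TABLE (ResellerKey INT, SalesAmount MONEY);
INSERT INTO @ResellerSales VALUES (1, 100), (1, 250), (2, 75), (9, 999);

-- Only rows whose ResellerKey appears in BOTH tables are returned.
SELECT R.ResellerName
      ,S.SalesAmount
FROM @ResellerSales S
    INNER JOIN @Reseller R ON (S.ResellerKey = R.ResellerKey)
ORDER BY R.ResellerName, S.SalesAmount;
```

I'd expect three rows back: Bike Barn twice (once per sale) and Cycle City once. Wheel World has no sales and the sale with ResellerKey 9 has no reseller, so neither side's unmatched row survives.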

To demonstrate the power of INNER JOINs, I will be using the following two tables:

- FactResellerSales
- DimReseller

The important thing to note is that both tables have a ResellerKey column.

This is great because it gives you a big hint as to what columns you can use for matching in your JOINs!

The syntax of all JOINs is essentially the same. Here’s some SQL that INNER JOINs FactResellerSales to DimReseller:

```sql
SELECT R.ResellerName 
      ,FRS.OrderDate
      ,FRS.SalesAmount
FROM FactResellerSales FRS
    INNER JOIN DimReseller R ON (FRS.ResellerKey = R.ResellerKey)
```

And the SSMS results:

## Pareto Analysis

The Pareto analysis is mos def part of the 20% of analytics that drive 80% of ROI.

At base, a Pareto analysis is the application of the 80/20 Rule to business.

Classic examples include:

- 80% of your sales come from 20% of your customers
- 80% of your revenues come from 20% of your products
- 80% of your defects come from 20% of the causes

You get the idea.

For this post, I will combine a number of topics covered so far in the series to conduct a Pareto analysis of AdventureWorks’ resellers:

- INNER JOINs
- CTEs
- GROUP BY & aggregate functions
- Window functions

Righteous!

## Pareto Analysis with INNER JOINs

In this analysis I will see if 20% of AdventureWorks’ resellers produce 80% of reseller sales.

I'll need to know:

- The number of resellers.
- The total sales for all resellers. This is the denominator when calculating what percentage of all reseller sales is attributed to a single reseller.
- The total sales for each reseller.


```sql
-- How many resellers?
SELECT COUNT(*) AS ResellerCount
FROM DimReseller;
```

```sql
-- Total sales for all resellers?
SELECT SUM(FRS.SalesAmount) AS TotalResellerSales
FROM FactResellerSales FRS
```

```sql
-- Total sales for each reseller?
SELECT R.ResellerName 
        ,SUM(FRS.SalesAmount) AS ResellerSales
FROM FactResellerSales FRS
    INNER JOIN DimReseller R ON (FRS.ResellerKey = R.ResellerKey)
GROUP BY R.ResellerName
```

Sweet!

I have the basic building blocks for conducting the Pareto analysis.

Since I have multiple queries/virtual tables I need to work with, I can leverage CTEs to structure my code.

Here’s a SQL snippet that ain’t legit on its own (CTEs need a final query that uses them). It’s an interim step to the final product:

```sql
WITH TotalResellerSales AS
(
    SELECT SUM(FRS.SalesAmount) AS TotalResellerSales
    FROM FactResellerSales FRS 
),
ResellerSales AS
(
    SELECT R.ResellerName 
            ,SUM(FRS.SalesAmount) AS ResellerSales
    FROM FactResellerSales FRS
        INNER JOIN DimReseller R ON (FRS.ResellerKey = R.ResellerKey)
    GROUP BY R.ResellerName
)
```

I can use the CTEs ☝ to start building my outer query to conduct the Pareto analysis.

I’ll start with calculating the cumulative total of all reseller sales by reseller:

```sql
WITH TotalResellerSales AS
(
    SELECT SUM(FRS.SalesAmount) AS TotalResellerSales
    FROM FactResellerSales FRS 
),
ResellerSales AS
(
    SELECT R.ResellerName 
            ,SUM(FRS.SalesAmount) AS ResellerSales
    FROM FactResellerSales FRS
        INNER JOIN DimReseller R ON (FRS.ResellerKey = R.ResellerKey)
    GROUP BY R.ResellerName
)
SELECT RS.ResellerName
       ,RS.ResellerSales
       ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC) AS CumProductSales
FROM ResellerSales RS;
```

In the results ☝, CumProductSales is a “running total” of reseller sales.

Also notice how the results are in descending order by individual reseller sales.

This magical result is achieved via combining the SUM aggregate function with a window defined by the OVER clause.

Pure awesomeness.
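
Two details are worth a hedged sketch here, using made-up data in a table variable. First, when you give OVER an ORDER BY but no frame, SQL Server's default frame is RANGE UNBOUNDED PRECEDING AND CURRENT ROW, so resellers with tied sales share the same running total. Second, the window's ORDER BY only controls the calculation. If you need the result rows in a guaranteed order, add an ORDER BY to the query itself:

```sql
-- Hypothetical per-reseller totals; B and C are deliberately tied.
DECLARE @ResellerSales TABLE (ResellerName VARCHAR(30), ResellerSales MONEY);
INSERT INTO @ResellerSales VALUES ('A', 500), ('B', 300), ('C', 300), ('D', 100);

SELECT RS.ResellerName
      ,RS.ResellerSales
       -- Default frame (RANGE): tied rows get the same running total.
      ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC) AS RunningTotalDefault
       -- Explicit ROWS frame plus a tie-breaker: one step per row.
      ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC, RS.ResellerName
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotalByRow
FROM @ResellerSales RS
ORDER BY RS.ResellerSales DESC, RS.ResellerName;  -- guarantees the display order
```

With this data, RunningTotalDefault comes out as 500, 1100, 1100, 1200 while RunningTotalByRow comes out as 500, 800, 1100, 1200.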

This ordering allows me to finish up the SQL code by adding a calculation for the cumulative percentage of all reseller sales:

```sql
WITH TotalResellerSales AS
(
    SELECT SUM(FRS.SalesAmount) AS TotalResellerSales
    FROM FactResellerSales FRS 
),
ResellerSales AS
(
    SELECT R.ResellerName 
            ,SUM(FRS.SalesAmount) AS ResellerSales
    FROM FactResellerSales FRS
        INNER JOIN DimReseller R ON (FRS.ResellerKey = R.ResellerKey)
    GROUP BY R.ResellerName
)
SELECT RS.ResellerName
       ,RS.ResellerSales
       ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC) AS CumProductSales
       ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC) / (SELECT TotalResellerSales FROM TotalResellerSales) AS CumPctSales
FROM ResellerSales RS;
```

The following snippet of code is a new idea, so I want to call it out:

```
SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC) / (SELECT TotalResellerSales FROM TotalResellerSales) AS CumPctSales
```
In the snippet immediately ☝, I’m using a subquery to pull the total amount of reseller sales to use as the denominator in the calculation.

You use subqueries all the time in SQL – sometimes even when you don’t know it. 😲

Conceptually, CTEs are subqueries that you use to make your code cleaner and easier to understand.
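
To make that concrete, here's a minimal sketch of the same aggregation written twice: once as a CTE and once as a subquery in the FROM clause. The `@Sales` table variable and its rows are made up for illustration:

```sql
-- Hypothetical sales rows.
DECLARE @Sales TABLE (ResellerName VARCHAR(30), SalesAmount MONEY);
INSERT INTO @Sales VALUES ('A', 500), ('A', 100), ('B', 300), ('C', 100);

-- Version 1: the aggregation lives in a named CTE.
WITH ResellerSales AS
(
    SELECT S.ResellerName
          ,SUM(S.SalesAmount) AS ResellerSales
    FROM @Sales S
    GROUP BY S.ResellerName
)
SELECT RS.ResellerName
      ,RS.ResellerSales
FROM ResellerSales RS
ORDER BY RS.ResellerName;

-- Version 2: the same aggregation as a subquery in the FROM clause.
SELECT RS.ResellerName
      ,RS.ResellerSales
FROM (
        SELECT S.ResellerName
              ,SUM(S.SalesAmount) AS ResellerSales
        FROM @Sales S
        GROUP BY S.ResellerName
     ) RS
ORDER BY RS.ResellerName;
```

Both versions should return the same three rows (A with 600, B with 300, C with 100). The CTE just gives the subquery a name and moves it to the top, which tends to read better as queries grow.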

Subqueries will come around again in the series, so I will move on to the SSMS results:

![ssms_subquery](images/16_ssms_subquery.png)

In the results ☝, I’ve scrolled down to where CumPctSales crosses the 80% threshold.

Notice that corresponds to result row 186.

In other words, the top 186 AdventureWorks resellers account for 80% of all reseller sales.

Quick calc here: 186 / 701 = 0.26534 = 26.53%

While not exactly 20%, it’s pretty close! 😉
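
Rather than scrolling through the results to find the threshold, the cut-off can also be computed. This is a sketch under assumptions, not the query used for the screenshot: the data is made up, and the reseller count comes from the rows in the result rather than from DimReseller:

```sql
-- Hypothetical per-reseller totals (they sum to 1000).
DECLARE @ResellerSales TABLE (ResellerName VARCHAR(30), ResellerSales MONEY);
INSERT INTO @ResellerSales VALUES ('A', 500), ('B', 250), ('C', 150), ('D', 60), ('E', 40);

WITH Cumulative AS
(
    SELECT RS.ResellerName
           -- Position of each reseller when sorted by sales, largest first.
          ,ROW_NUMBER() OVER (ORDER BY RS.ResellerSales DESC) AS ResellerRank
           -- Running total divided by the grand total (empty OVER = all rows).
          ,SUM(RS.ResellerSales) OVER (ORDER BY RS.ResellerSales DESC)
             / SUM(RS.ResellerSales) OVER () AS CumPctSales
          ,COUNT(*) OVER () AS ResellerCount
    FROM @ResellerSales RS
)
-- The first rank at or past 80% is how many resellers it takes.
SELECT MIN(C.ResellerRank) AS ResellersNeeded
      ,MIN(C.ResellerRank) * 1.0 / MAX(C.ResellerCount) AS PctOfResellers
FROM Cumulative C
WHERE C.CumPctSales >= 0.80;
```

For these five made-up resellers the query should return 3 resellers, or 60% of them: the cumulative share is 50%, 75%, then 90%, so the third reseller is the first to reach the 80% mark.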

BTW – Every data visualization tool worth using (including Excel and R) can connect to SQL Server and leverage the final results ☝ to create cool visualizations. 😁

## The Learning Arc

The next post will continue coverage of JOINs since they are central to using SQL.

Specifically, I will be talking about JOIN filtering in the ON clause.

Stay healthy and happy data sleuthing!