# T-SQL Fundamentals - Chapter 4: Subqueries

## Introduction

Subqueries are an essential tool in SQL that allow you to execute a query within another query. This chapter explores the different types of subqueries, including self-contained subqueries, correlated subqueries, and using subqueries with predicates like `EXISTS`. It also provides solutions for handling issues that might arise when working with subqueries.

### **Self-Contained Subqueries**

Self-contained subqueries are independent of the outer query and can be executed on their own.

#### Scalar Subqueries:
A scalar subquery returns a single value (one row and one column). Such a subquery can appear anywhere in the outer query where a single-valued expression can appear (such as the `WHERE` clause or the `SELECT` list).

<u>IMPORTANT</u>: For a scalar subquery to be valid, it must return **no more than one** value.

**Examples:**

In [7]:
SELECT orderid, orderdate, custid, empid
FROM TSQLV6.Sales.Orders
WHERE orderid = (
                    SELECT MAX(O.orderid)
                    FROM TSQLV6.Sales.Orders AS O
                );

orderid,orderdate,custid,empid
11077,2022-05-06,65,1


In [8]:
-- Scalar Subquery
SELECT MAX(O.orderid)
FROM TSQLV6.Sales.Orders AS O

(No column name)
11077


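In the first query, the scalar subquery `(SELECT MAX(O.orderid) FROM TSQLV6.Sales.Orders AS O)` returns the maximum order ID from the `Orders` table, and the outer query uses that value to return the matching order. The second cell runs the same subquery on its own, which is possible because it is self-contained.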
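Because a scalar subquery can stand wherever a single-valued expression is allowed, the same subquery can also be placed in the `SELECT` list. The following is a minimal sketch against the same sample database; the alias `maxorderid` and the filter on customer 85 are illustrative choices, not part of the original examples.

```sql
-- Scalar subquery used as an expression in the SELECT list.
-- The alias maxorderid and the custid filter are illustrative assumptions.
SELECT orderid, orderdate,
       (
            SELECT MAX(O.orderid)
            FROM TSQLV6.Sales.Orders AS O
       ) AS maxorderid
FROM TSQLV6.Sales.Orders
WHERE custid = 85;
```

Given the standalone result shown above, every returned row should carry 11077 in `maxorderid`, since the subquery does not depend on the outer row.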

In [16]:
SELECT orderid
FROM TSQLV6.Sales.Orders
WHERE empid =    (
                    SELECT E.empid
                    FROM TSQLV6.HR.Employees AS E
                    WHERE E.lastname LIKE N'C%'
                );

orderid
10262
10268
10276
10278
10279
10286
10287
10290
10301
10305


The subquery returns employee IDs of all employees whose last names start with the letter **C**. The outer query returns the orders where the employee ID is equal to the result of the subquery.

<u>IMPORTANT</u>: 
- This subquery can potentially return more than one value. But currently the `Employees` table contains only one employee whose last name starts with **C** (Maria Cameron with employee ID 8). 
- If a subquery that is compared with the equality operator (`=`) returns more than one value, *the query fails* at run time. For example, try running the query with `=` and with employees whose last names start with **D**.
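The three possible outcomes of an equality comparison against a subquery (one value, several values, no value) can be tried in one self-contained batch. This is a sketch only: the table variables and their rows are hypothetical stand-ins for `HR.Employees` and `Sales.Orders`, chosen so that the batch runs without the sample database.

```sql
-- Hypothetical sample data in table variables (not the TSQLV6 tables).
DECLARE @Employees TABLE (empid INT NOT NULL PRIMARY KEY, lastname NVARCHAR(20) NOT NULL);
DECLARE @Orders    TABLE (orderid INT NOT NULL PRIMARY KEY, empid INT NOT NULL);

INSERT INTO @Employees (empid, lastname) VALUES
    (1, N'Davis'), (2, N'Doyle'), (3, N'Cameron');
INSERT INTO @Orders (orderid, empid) VALUES
    (101, 1), (102, 2), (103, 3);

-- One matching employee (C%): the comparison works and returns order 103.
SELECT orderid
FROM @Orders
WHERE empid = (SELECT E.empid FROM @Employees AS E WHERE E.lastname LIKE N'C%');

-- Two matching employees (D%): the subquery is no longer scalar and the query fails.
-- TRY...CATCH is used here only to surface the error number and message.
BEGIN TRY
    SELECT orderid
    FROM @Orders
    WHERE empid = (SELECT E.empid FROM @Employees AS E WHERE E.lastname LIKE N'D%');
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER() AS errnum, ERROR_MESSAGE() AS errmsg;
END CATCH;

-- No matching employee (A%): the subquery yields NULL, the comparison
-- evaluates to UNKNOWN for every row, and the outer query returns an empty set.
SELECT orderid
FROM @Orders
WHERE empid = (SELECT E.empid FROM @Employees AS E WHERE E.lastname LIKE N'A%');
```

In SQL Server the second query is expected to raise error 512 ("Subquery returned more than 1 value"). When the subquery may legitimately return several values, the `IN` predicate is the appropriate tool.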

#### Multivalued Subqueries:

Multivalued subqueries return multiple values in a single column (*possibly many rows, but one column*) and can be used with predicates such as `IN`.

The form of the `IN` predicate is _`<scalar_expression> IN (<multivalued subquery>)`_.

The <u>predicate</u> evaluates to **TRUE** if *scalar_expression* is equal to any of the values returned by the subquery.

**Example:**

In [15]:
SELECT orderid
FROM TSQLV6.Sales.Orders
WHERE empid IN  (
                    SELECT E.empid
                    FROM TSQLV6.HR.Employees AS E
                    WHERE E.lastname LIKE N'D%'
                );

-- As with any other predicate, you can negate the IN predicate with the NOT operator: 'NOT IN'

orderid
10258
10270
10275
10285
10292
10293
10304
10306
10311
10314


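This query returns the orders that were handled by employees whose last name starts with the letter **D**. Because more than one employee can match, the subquery is multivalued and is used with `IN` rather than with `=`.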

In [17]:
-- Same result using JOIN

SELECT O.orderid
FROM TSQLV6.HR.Employees AS E
    INNER JOIN TSQLV6.Sales.Orders AS O ON E.empid = O.empid
WHERE E.lastname LIKE N'D%';

orderid
10258
10270
10275
10285
10292
10293
10304
10306
10311
10314


**<u>Note</u>** 
- SQL supports other predicates that operate on a multivalued subquery; those are `SOME`, `ANY`, and `ALL`. They are rarely used and therefore are not covered in this course.
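The negated form of `IN` mentioned in the first example of this section can be sketched as follows. It reuses the same tables as that example and only flips the predicate; no output is shown because it was not part of the original notebook run.

```sql
-- Orders handled by employees whose last name does NOT start with D.
SELECT orderid
FROM TSQLV6.Sales.Orders
WHERE empid NOT IN (
                    SELECT E.empid
                    FROM TSQLV6.HR.Employees AS E
                    WHERE E.lastname LIKE N'D%'
                   );
```

`NOT IN` behaves as expected here only as long as the subquery returns no NULLs, a point the section on misbehaving subqueries comes back to.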

### **Correlated Subqueries**

Correlated subqueries refer to columns of the table in the outer query.

This means the subquery depends on the outer query and cannot be run as a standalone query. Logically, it is evaluated separately for each outer row.

**Example:**

In [19]:
SELECT custid, orderid, orderdate, empid
FROM TSQLV6.Sales.Orders AS o1
WHERE orderid = 
                (
                    SELECT MAX(o2.orderid) 
                    FROM TSQLV6.Sales.Orders AS o2 
                    WHERE o2.custid = o1.custid
                );

-- To test the subquery on its own, replace o1.custid with a constant such as 85 in line 7

custid,orderid,orderdate,empid
91,11044,2022-04-23,4
90,11005,2022-04-07,2
89,11066,2022-05-01,7
88,10935,2022-03-09,4
87,11025,2022-04-15,6
86,11046,2022-04-23,8
85,10739,2021-11-12,3
84,10850,2022-01-23,1
83,10994,2022-04-02,2
82,10822,2022-01-08,6


- For each row in `o1`, the subquery returns the maximum `orderid` for the current customer. 
- If the outer order ID and the `orderid` returned by the subquery match, the query returns the outer row.
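Since a correlated subquery cannot run on its own, a common way to check it is to substitute a constant for the correlation. The sketch below does that for customer 85, the value suggested in the comment of the example; the alias `maxorderid` is an illustrative addition.

```sql
-- The inner query of the correlated example, with the correlation o1.custid
-- replaced by the constant 85 so that it can run as a standalone query.
SELECT MAX(o2.orderid) AS maxorderid
FROM TSQLV6.Sales.Orders AS o2
WHERE o2.custid = 85;
```

Based on the output listed above, where the row for customer 85 shows order 10739, this query should return 10739.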

### **Using the EXISTS Predicate**

The `EXISTS` predicate is used to check if a subquery returns any rows. It is commonly used with correlated subqueries.

It returns **TRUE** if the subquery returns any rows and **FALSE** otherwise.

**Example:**

In [53]:
SELECT custid, companyname
FROM TSQLV6.Sales.Customers AS C
WHERE country = N'Spain'
 AND EXISTS (
                SELECT * FROM TSQLV6.Sales.Orders AS O
                WHERE O.custid = C.custid
            );

-- You can negate the EXISTS predicate with the NOT operator: 'NOT EXISTS'

custid,companyname
22,Customer DTDMN


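The query returns customers from Spain who placed at least one order, that is, customers for whom at least one row in `Orders` has the same customer ID. Note that the output shown above lists only customer 22, who also appears among the customers without orders in the `NOT IN` example later in this chapter; that row is what the `NOT EXISTS` form of the query returns.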
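The negated form mentioned in the comment of the example can be sketched as below. It keeps the tables and the country filter of the original query and only changes `EXISTS` to `NOT EXISTS`; the qualified column names are a stylistic choice.

```sql
-- Customers from Spain for whom no matching order row exists.
SELECT C.custid, C.companyname
FROM TSQLV6.Sales.Customers AS C
WHERE C.country = N'Spain'
  AND NOT EXISTS (
                    SELECT * FROM TSQLV6.Sales.Orders AS O
                    WHERE O.custid = C.custid
                 );
```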

### **Returning Previous or Next Values**

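To return the **previous** or **next** value relative to the current row, you need a T-SQL expression that means “the maximum value that is smaller than the current value” (previous) or “the minimum value that is greater than the current value” (next).

The tricky part is that the concept of “previous” implies order, and rows in a table have no order. The order therefore has to come from the data itself, here from the `orderid` column.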

**Example:**

In [27]:
-- "previous"

SELECT orderid, orderdate, empid, custid,
       (
            SELECT MAX(O2.orderid) 
            FROM TSQLV6.Sales.Orders AS O2 
            WHERE O2.orderid < O1.orderid
        ) AS prevorderid
FROM TSQLV6.Sales.Orders AS O1;

orderid,orderdate,empid,custid,prevorderid
10248,2020-07-04,5,85,NULL
10249,2020-07-05,6,79,10248
10250,2020-07-08,4,34,10249
10251,2020-07-08,3,84,10250
10252,2020-07-09,4,76,10251
10253,2020-07-10,3,34,10252
10254,2020-07-11,5,14,10253
10255,2020-07-12,9,68,10254
10256,2020-07-15,3,88,10255
10257,2020-07-16,4,35,10256


Notice that because there's no order **before** the first order, the subquery returned a NULL for the <u>first</u> order.

In [29]:
-- "next"

SELECT orderid, orderdate, empid, custid,
       (
            SELECT MIN(O2.orderid)
            FROM TSQLV6.Sales.Orders AS O2
            WHERE O2.orderid > O1.orderid
        ) AS nextorderid
FROM TSQLV6.Sales.Orders AS O1;

orderid,orderdate,empid,custid,nextorderid
10248,2020-07-04,5,85,10249
10249,2020-07-05,6,79,10250
10250,2020-07-08,4,34,10251
10251,2020-07-08,3,84,10252
10252,2020-07-09,4,76,10253
10253,2020-07-10,3,34,10254
10254,2020-07-11,5,14,10255
10255,2020-07-12,9,68,10256
10256,2020-07-15,3,88,10257
10257,2020-07-16,4,35,10258


Notice that because there's no order **after** the last order, the subquery returned a NULL for the <u>last</u> order.
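As an alternative to the subquery approach used in this section, T-SQL also offers the window functions `LAG` and `LEAD`. The sketch below is not the chapter's technique; it is shown only for comparison and assumes the same `Sales.Orders` table, with `orderid` defining the order.

```sql
-- Alternative sketch: LAG and LEAD return the value from the previous and next row
-- in the order defined by the OVER clause, or NULL when no such row exists.
SELECT orderid, orderdate, empid, custid,
       LAG(orderid)  OVER (ORDER BY orderid) AS prevorderid,
       LEAD(orderid) OVER (ORDER BY orderid) AS nextorderid
FROM TSQLV6.Sales.Orders
ORDER BY orderid;
```

As with the subqueries, the first row gets a NULL in `prevorderid` and the last row gets a NULL in `nextorderid`.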

### **Using Running Aggregates**

Running aggregates accumulate values based on some order, for example a running **total** or a running **average**.

**Example:**

In [31]:
-- View

SELECT orderyear, qty
FROM TSQLV6.Sales.OrderTotalsByYear;

orderyear,qty
2021,25489
2022,16247
2020,9581


In [32]:
-- Aggregates that accumulate values based on some order.
-- Computes for each year the running total quantity up to and including that year’s.

SELECT orderyear, qty,
        (
            SELECT SUM(O2.qty)
            FROM TSQLV6.Sales.OrderTotalsByYear AS O2
            WHERE O2.orderyear <= O1.orderyear
        ) AS runqty
FROM TSQLV6.Sales.OrderTotalsByYear AS O1
ORDER BY orderyear;

orderyear,qty,runqty
2020,9581,9581
2021,25489,35070
2022,16247,51317


- For the earliest year recorded in the view (2020), the running total is equal to that year’s quantity. 
- For the second year (2021), the running total is the sum of the first year plus the second year, and so on.
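The running total can be reproduced without the sample database by feeding the three rows shown above into a common table expression. In this sketch the CTE named `OrderTotalsByYear` is a hypothetical stand-in for the view, and the window-function column is an alternative added for comparison, not the chapter's technique.

```sql
-- Stand-in for the Sales.OrderTotalsByYear view, using the three rows from the output above.
WITH OrderTotalsByYear AS
(
    SELECT orderyear, qty
    FROM (VALUES (2020, 9581), (2021, 25489), (2022, 16247)) AS V(orderyear, qty)
)
SELECT O1.orderyear, O1.qty,
       -- Correlated subquery: sums all years up to and including the current one.
       (
            SELECT SUM(O2.qty)
            FROM OrderTotalsByYear AS O2
            WHERE O2.orderyear <= O1.orderyear
       ) AS runqty_subquery,
       -- Window aggregate over the same ordering, for comparison.
       SUM(O1.qty) OVER (ORDER BY O1.orderyear
                         ROWS UNBOUNDED PRECEDING) AS runqty_window
FROM OrderTotalsByYear AS O1
ORDER BY O1.orderyear;
```

Both computed columns should show 9581, 35070, and 51317, matching the `runqty` values above.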

### **Dealing with Misbehaving Subqueries**

When working with subqueries, certain issues can arise, such as handling `NULL` values and dealing with substitution errors.

#### **NULL Trouble:**

Subqueries that return `NULL` values can cause unexpected results, especially with the `NOT IN` predicate, because a comparison with a NULL evaluates to UNKNOWN rather than TRUE or FALSE.

**Example:**

In [35]:
--  Return customers who did not place orders.

SELECT custid, companyname
FROM TSQLV6.Sales.Customers
WHERE custid NOT IN (
                        SELECT O.custid
                        FROM TSQLV6.Sales.Orders AS O
                    );

custid,companyname
22,Customer DTDMN
57,Customer WVAXS


-- Add an order with a NULL customer ID to the Orders table.

INSERT INTO TSQLV6.Sales.Orders
    (custid, empid, orderdate, requireddate, shippeddate, shipperid,
    freight, shipname, shipaddress, shipcity, shipregion,
    shippostalcode, shipcountry)
 VALUES
    (NULL, 1, '20220212', '20220212',
    '20220212', 1, 123.00, N'abc', N'abc', N'abc',
    N'abc', N'abc', N'abc');

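- Run the previous query again; this time it returns an **empty set**.
- The culprit is the NULL customer ID you added to the Orders table. The NULL is one of the elements returned by the subquery.
- For a customer without orders, comparing `custid` with the known values yields FALSE, but comparing it with the NULL yields UNKNOWN. The `IN` predicate is therefore UNKNOWN, `NOT IN` is UNKNOWN as well, and the `WHERE` clause discards the row.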
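Two common ways around this behavior are to exclude NULLs inside the subquery or to use `NOT EXISTS` instead of `NOT IN`. The batch below is a self-contained sketch: the table variables and their rows are hypothetical and merely mimic a customer list and an order list that contains one NULL customer ID.

```sql
-- Hypothetical sample data: customer 3 has no orders, and one order has a NULL custid.
DECLARE @Customers TABLE (custid INT NOT NULL PRIMARY KEY);
DECLARE @Orders    TABLE (orderid INT NOT NULL PRIMARY KEY, custid INT NULL);

INSERT INTO @Customers (custid) VALUES (1), (2), (3);
INSERT INTO @Orders (orderid, custid) VALUES (101, 1), (102, 2), (103, NULL);

-- Problem: the subquery returns 1, 2, NULL, so NOT IN is never TRUE. Returns no rows.
SELECT custid
FROM @Customers
WHERE custid NOT IN (SELECT O.custid FROM @Orders AS O);

-- Fix 1: filter out NULLs explicitly in the subquery. Returns customer 3.
SELECT custid
FROM @Customers
WHERE custid NOT IN (SELECT O.custid FROM @Orders AS O WHERE O.custid IS NOT NULL);

-- Fix 2: NOT EXISTS only asks whether a matching row exists. Returns customer 3.
SELECT C.custid
FROM @Customers AS C
WHERE NOT EXISTS (SELECT * FROM @Orders AS O WHERE O.custid = C.custid);

-- To remove the test row added to the sample database earlier in this section:
-- DELETE FROM TSQLV6.Sales.Orders WHERE custid IS NULL;
```

`NOT EXISTS` is often the safer default because the row with the NULL simply never matches the correlation `O.custid = C.custid`.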

#### **Substitution Errors:**

Substitution errors occur when a column name in a subquery does not resolve to the table you intended. If the name is not found in the tables of the subquery, SQL Server resolves it against the outer query instead, so the code runs without an error but returns the wrong result.

Logical bugs of this kind can be elusive.

**Example:**

In [56]:
DROP TABLE IF EXISTS TSQLV6.Sales.MyShippers;

CREATE TABLE TSQLV6.Sales.MyShippers
(
    shipper_id INT NOT NULL,
    companyname NVARCHAR(40) NOT NULL,
    phone NVARCHAR(24) NOT NULL,
    CONSTRAINT PK_MyShippers PRIMARY KEY(shipper_id)
);

INSERT INTO TSQLV6.Sales.MyShippers (shipper_id, companyname, phone) VALUES
 (1, N'Shipper GVSUA', N'(503) 555-0137'),
 (2, N'Shipper ETYNR', N'(425) 555-0136'),
 (3, N'Shipper ZHISN', N'(415) 555-0138');

In [41]:
SELECT shipper_id, companyname
FROM TSQLV6.Sales.MyShippers
WHERE shipper_id IN (
                        SELECT shipper_id
                        FROM TSQLV6.Sales.Orders
                        WHERE custid = 43
                    );

shipper_id,companyname
1,Shipper GVSUA
2,Shipper ETYNR
3,Shipper ZHISN


Only shippers 2 and 3 shipped orders to customer 43, but for some reason this query returned all shippers from the MyShippers table.

It turns out that the column holding the shipper ID in the Orders table is called `shipperid` (no underscore), not `shipper_id`. The column in the MyShippers table is called `shipper_id`, with an underscore.

SQL Server first looks for the column `shipper_id` in the table in the inner query, Orders. Such a column is not found there, so SQL Server looks for it in the table in the outer query, MyShippers. Such a column is found in MyShippers, so that is the one used.
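Writing out the name resolution makes the bug visible. The sketch below shows how the buggy query is effectively interpreted once `shipper_id` is bound to the outer table; the alias `S` is introduced here for illustration, and the query assumes the `MyShippers` table created above still exists.

```sql
-- How the buggy query is effectively resolved: shipper_id binds to the outer table,
-- so the subquery is unintentionally correlated and returns the outer row's own
-- shipper_id once for every order placed by customer 43.
SELECT S.shipper_id, S.companyname
FROM TSQLV6.Sales.MyShippers AS S
WHERE S.shipper_id IN (
                        SELECT S.shipper_id
                        FROM TSQLV6.Sales.Orders AS O
                        WHERE O.custid = 43
                      );
```

As long as customer 43 has at least one order, every shipper matches itself, which is why all rows of `MyShippers` come back.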

You can follow a couple of best practices to avoid such problems:
- Use **consistent attribute** names across tables.
- Prefix column names in subqueries with the source **table name or alias** (if you assigned one).

In [47]:
-- ERROR:

SELECT shipper_id, companyname
FROM TSQLV6.Sales.MyShippers
WHERE shipper_id IN (
                        SELECT O.shipper_id
                        FROM TSQLV6.Sales.Orders AS O
                        WHERE O.custid = 43
                    );

Msg 207, Level 16, State 1, Line 6
Invalid column name 'shipper_id'.

In [49]:
-- CORRECT:

SELECT shipper_id, companyname
FROM TSQLV6.Sales.MyShippers
WHERE shipper_id IN (
                        SELECT O.shipperid
                        FROM TSQLV6.Sales.Orders AS O
                        WHERE O.custid = 43
                    );

shipper_id,companyname
2,Shipper ETYNR
3,Shipper ZHISN


In [50]:
-- At the end:

DROP TABLE IF EXISTS TSQLV6.Sales.MyShippers;

### **Conclusion**

This chapter classified subqueries in two ways.

_Regarding DEPENDENCY_

- **_Self-contained_** subqueries, which are <u>independent</u> of their outer queries;
- **_Correlated_** subqueries, which are <u>dependent</u> on their outer queries.

_Regarding RESULT_

- scalar subqueries;
- multivalued subqueries.

It also covered:

- returning previous and next values;
- using running aggregates;
- dealing with misbehaving subqueries, including the **importance of prefixing column** names <u>in subqueries</u> with the _source table alias_.

Subqueries are a powerful tool in SQL that allow for more complex queries and greater flexibility. Understanding the different types of subqueries, as well as how to handle common issues, is **key to mastering T-SQL**.