<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 3: Advanced SELECT Statements** 
_Retrieving data from multiple tables._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The use of implicit joins, explicit joins, and subqueries
- The use of numeric surrogate keys instead of text names
- Variations of the SQL JOIN operator (natural, equijoin, theta join)
- The use of outer joins to work with optional table relationships
- Subqueries as SQL expressions 

### **Skills / Know how to ...**
- Use `JOIN` operators to connect data from multiple tables
- Write subqueries for common use cases where JOIN is insufficient
- Use `WITH` statements to simplify complex queries with subqueries

--------
## **LESSON 3 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/rsCrjQck_jQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

### **Run this boilerplate code before continuing on.** 
 

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016

Collecting pymysql
  Downloading https://files.pythonhosted.org/packages/4f/52/a115fe175028b058df353c5a3d5290b71514a83f67078a6482cff24d6137/PyMySQL-1.0.2-py3-none-any.whl (43kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2


'Connected: buan6510student@lahman2016'

**Rerun this code as needed to keep your software up to date and database connection fresh.**  

---
## **BIG PICTURE: Data as a Foreign Language**

Language takes on a new meaning when you are learning a second one. Suddenly you pay attention to things like grammar and spelling because, unless they are just like what's in the book, you are lost. You find yourself needing people to speak very, very slowly so you can translate what you are hearing in real time. Then, slowly, you start to get a more intuitive feel for things. You can go a little faster and fill in the gaps when the grammar and spelling aren't exactly perfect. Eventually, if you stick with it, you can even start to think in the language almost as well as a native speaker. Then, if you are really diligent, you start to speak totally unconsciously, sometimes forgetting what language you are speaking. (One way to tell that someone is natively bilingual is to watch them in a mixed crowd. If they get their languages crossed, so that they speak exactly the wrong language to each person, then you know the person is natively bilingual.)

So it is with data. We learn about data by working with datasets, learning conventions like data structures and data types, without really thinking about it much. Data is data, or so we think. Then we come across *foreign* data that doesn't obey all the rules we have been taking for granted. It is then that we *really* start to understand how data works. 

None of this was new to the people who developed SQL. The language has been around so long and had to be warped to fit so many use cases that there really isn't much it can't handle *with planning and effort*.  

In this lesson we will consider some of the many ways to combine data from multiple tables in SQL. Sometimes the integration is pretty straightforward. Sometimes it will take some work and perhaps a few carefully crafted subqueries. And, of course, sometimes it just doesn't work and trying to make it work is just futile and potentially risky. Hopefully, by the end of this lesson you will be able to recognize each case and know what to do.  

---
## **Multi-Table `SELECT` Queries**

We will explore three different ways to combine data from multiple tables in a single SQL query:
- Implicit joins that use relational algebra
- Explicit joins that use the JOIN operator
- Subqueries that nest *inside* other queries

We are again using the baseball database from Lesson 2, summarized below. For full details about the tables and what the columns mean, please consult Sean Lahman's [database documentation](http://www.seanlahman.com/files/database/readme2016.txt).

![Lahman 2016 ERD](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_baseball_stats_schema.png)


---
## **Implicit Joins**

Implicit joins do not have the `JOIN` keyword in them. That the code is performing a join is *implied*. 

The format of an implicit join is:
```
SELECT *
FROM TableA, TableB
WHERE TableA.columnX = TableB.columnX
```  

It looks pretty simple:
- List two tables in the `FROM` clause.
- In the `WHERE` clause match columns from one table with columns in the other table.

In fact, in the earliest versions of SQL, implicit joins were the only way to merge data from multiple tables. However, they come with some potentially serious problems if the `WHERE` clause is not right. 

The issue is not so much with the `WHERE` clause as with the `FROM` clause. Consider the following query, which omits the `WHERE` clause entirely: 
```
SELECT *
FROM TableA, TableB
```
This is a so-called **cross join**, which we will revisit in detail in Lesson 4. A cross join matches each row in the first table (`TableA`) with each row in the second table (`TableB`). The total number of rows in the result is then given by the product of the row counts for the two tables. 
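To see the row-count arithmetic on a small scale, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The `players` and `rosters` tables and their contents are made up for illustration; they are not part of the Lahman schema.

```python
import sqlite3

# Toy tables (hypothetical names and data, not the Lahman schema)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (playerID TEXT, nameLast TEXT);
    CREATE TABLE rosters (teamID TEXT, playerID TEXT);
    INSERT INTO players VALUES ('p1', 'Adams'), ('p2', 'Baker'), ('p3', 'Clark');
    INSERT INTO rosters VALUES ('BOS', 'p1'), ('NYA', 'p2');
""")

# Implicit join with no WHERE clause: every player is paired with every roster row
cross_rows = conn.execute(
    "SELECT nameLast, teamID FROM players, rosters"
).fetchall()

# Same FROM clause, but the WHERE clause keeps only the matching pairs
matched_rows = conn.execute(
    "SELECT nameLast, teamID FROM players, rosters "
    "WHERE players.playerID = rosters.playerID "
    "ORDER BY nameLast"
).fetchall()

print(len(cross_rows))    # 3 players x 2 roster rows
print(matched_rows)
```

With 3 players and 2 roster rows the cross join has 3 x 2 = 6 rows, while the version with the `WHERE` clause returns only the 2 matching pairs.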

Let's try this ourselves. The following code does a cross join of two of the smaller tables in our baseball database:
```
SELECT nameLast, teamid
FROM Master, Teams     -- note: draws rows from two tables
```



We can easily determine the number of rows in each table: 

In [None]:
%%sql
-- How many players are there?
SELECT count(*) 
FROM Master;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
19105


In [None]:
%%sql
-- How many teams are there?
SELECT count(*)
FROM Teams;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
2835


The total number of rows in the cross join would then be 19105 x 2835 = ...

In [None]:
%%sql
SELECT 19105 * 2835;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


19105 * 2835
54162675


That's a little over 54 million rows. And that was with two modestly sized tables! Imagine if one of the tables had a million rows? It would take virtually forever (i.e., a few minutes) for a cross join like that to complete. (Actually a huge cross join query would eventually die, having exhausted all storage and possibly [bricking](https://www.howtogeek.com/126665/htg-explains-what-does-bricking-a-device-mean/) the server, but who really wants that?)

While you may be thinking "But I would never forget the `WHERE` clause. I'm not that careless!" it is better to avoid the possibility of crashing a server due to a SQL bug. 

**So, while we include implicit joins here for completeness, they are inherently dangerous and to be avoided whenever possible.**

---
## **Explicit Joins**

Explicit joins use the `JOIN` operator to merge tables in the `FROM` clause. They were added to SQL many years ago to lessen the risk of unintentional cross joins and potentially improve the speed of performing table joins. 

There are several kinds of explicit joins, but the most common form is:
```
SELECT * 
FROM TableA JOIN TableB ON (TableA.columnX = TableB.columnX);
``` 

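You'll notice that it includes the same basic information as the implicit join (two tables plus a join condition that must be met), but the join condition now sits right next to the tables it connects, which makes it much harder to accidentally leave it out and create a fatal cross join. 

There are three possibilities:
- **Natural joins** where the join condition is omitted entirely and the `NATURAL` keyword is used instead: 
  ```
  SELECT * 
  FROM Master NATURAL JOIN Batting;
  ```
  In this case SQL will automatically match any columns from the two tables that have the same name. (Heads up: in MySQL a plain `JOIN` with no join condition at all is a cross join, not a natural join.) 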

- **Equijoins** where the join condition matches equal values from specific columns:
  ```
  SELECT * 
  FROM Master JOIN Batting ON (Master.playerID = Batting.playerID);
  ```
  Note that we use dot notation (from Lesson 1) to disambiguate the `playerID` columns by specifying the table names. 

- **Theta joins** where the join condition is not strict equality:
  ```
  SELECT DISTINCT m1.nameFirst, m1.nameLast 
  FROM Master as m1 JOIN Master as m2 ON (m1.birthYear > m2.birthYear)
  WHERE m2.nameLast = 'Jeter' AND m2.nameFirst = 'Derek';
  ```
  Here we are joining the Master table to itself (using aliases to disambiguate the tables) and looking for players who are younger than Derek Jeter *without having to know Derek Jeter's birth year*. What makes this a theta join is the condition `m1.birthYear > m2.birthYear`. 
  
  Theta joins are most often used when making "fuzzy" matches where strict equality won't work. For example, if we had an invoice for an employee's birthday lunch but forgot who it was for, then we could look for employees with birthdays within a week of the luncheon:   
  ```
  SELECT name, employeeID 
  FROM employees JOIN payables ON (employees.birthday BETWEEN payables.date-7 AND payables.date+7)
  WHERE payables.invoiceID = 1234;
  ```
  Notice that we are using the `BETWEEN` comparator to specify the range of dates we want. (Also, we took some liberties with date arithmetic, which can vary a bit from one DBMS vendor to the next. [Here's how it works in MySQL](https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_date-add).)
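The three kinds of join can be compared side by side on a tiny dataset. The sketch below uses Python's `sqlite3` module with made-up `players` and `batting` tables (hypothetical names and values, chosen only so the results are easy to check by eye).

```python
import sqlite3

# Toy data (hypothetical; only playerID is shared between the two tables)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (playerID TEXT, nameLast TEXT, birthYear INTEGER);
    CREATE TABLE batting (playerID TEXT, yearID INTEGER, H INTEGER);
    INSERT INTO players VALUES
        ('p1', 'Adams', 1970), ('p2', 'Baker', 1975), ('p3', 'Clark', 1980);
    INSERT INTO batting VALUES ('p1', 1995, 100), ('p2', 1999, 150);
""")

# Natural join: matches on the one column name the tables share (playerID)
natural = conn.execute(
    "SELECT nameLast, H FROM players NATURAL JOIN batting ORDER BY nameLast"
).fetchall()

# Equijoin: the same match, spelled out with an explicit equality condition
equi = conn.execute(
    "SELECT nameLast, H "
    "FROM players JOIN batting ON (players.playerID = batting.playerID) "
    "ORDER BY nameLast"
).fetchall()

# Theta join: self-join on an inequality to find players born after Baker
theta = conn.execute(
    "SELECT m1.nameLast "
    "FROM players AS m1 JOIN players AS m2 ON (m1.birthYear > m2.birthYear) "
    "WHERE m2.nameLast = 'Baker' "
    "ORDER BY m1.nameLast"
).fetchall()

print(natural)
print(equi)
print(theta)
```

The natural join and the equijoin return the same rows here because `playerID` is the only column name the two toy tables have in common. If the tables shared a second column name, the natural join would silently match on that column too, which is one reason the explicit forms are usually safer.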

#### **Quick Note about Surrogate Keys**

In order to make joins work efficiently and avoid surprises (bugs), we generally want short numeric primary keys (usually with ID in the name) that are generated by the database instead of humans. It is just too easy to accidentally assign a duplicate primary key value, so why not just let the system do it for us?  

The best practice is to use a **surrogate key** (a.k.a., "autonumbering") mechanism for primary key columns. Surrogate keys generate key values as integers, starting from 1. Each time a row is added to the table the surrogate key value is *incremented* by 1, causing the rows to be numbered in sequential order. The key values are never reused. If we delete a row in the middle of the table, then the surrogate key value is deleted with it. 
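Here is a minimal sketch of that behavior using SQLite's `AUTOINCREMENT` keyword through Python's `sqlite3` module. The `employees` table and the names in it are hypothetical, and other DBMSes spell the feature differently (MySQL, for example, uses `AUTO_INCREMENT`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# employeeID is a surrogate key: the database assigns it, not a human
conn.execute("""
    CREATE TABLE employees (
        employeeID INTEGER PRIMARY KEY AUTOINCREMENT,
        name       TEXT NOT NULL
    )
""")

# Insert three rows without ever mentioning employeeID
conn.executemany(
    "INSERT INTO employees (name) VALUES (?)",
    [("Ann",), ("Bob",), ("Cy",)],
)

# Delete the middle row, then add a new one
conn.execute("DELETE FROM employees WHERE name = 'Bob'")
conn.execute("INSERT INTO employees (name) VALUES ('Dee')")

rows = conn.execute(
    "SELECT employeeID, name FROM employees ORDER BY employeeID"
).fetchall()
print(rows)
```

After the delete, key value 2 is simply gone. The new row gets the next value in the sequence (4) rather than filling the gap.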

We will come back to this issue in Lesson 4, when we discuss the many different kinds of keys used in database design. 


### **`JOIN ... ON`**

The syntax for a standard `JOIN ... ON` operation is  
```JOIN table ON boolean-expression```

- In the examples above we use parentheses to make the join boolean expression stand out, but they are not strictly required. We can just treat anything after the `ON` like a `WHERE` clause.
- The boolean expression used as the join condition should compare a column from the first table (before the `JOIN`) with a column from the second table (after the `JOIN`).
- If needed we can join multiple columns at a time using `AND` in the boolean expression.

The following example calculates the batting averages of every player on the 1986 Red Sox with at least one at bat:

In [None]:
%%sql
SELECT playerID, Batting.AB, Batting.H, Batting.H/Batting.AB AS `Batting Average`
FROM Batting JOIN Teams ON (Batting.teamID = Teams.teamID AND Batting.yearID = Teams.yearID)
WHERE franchID = 'BOS' AND Teams.yearID = 1986 AND Batting.AB>0

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
23 rows affected.


playerID,AB,H,Batting Average
armasto01,425,112,0.2635
barrema02,625,179,0.2864
baylodo01,585,139,0.2376
boggswa01,580,207,0.3569
bucknbi01,629,168,0.2671
dodsopa01,12,5,0.4167
evansdw01,529,137,0.259
gedmari01,462,119,0.2576
greenmi01,35,11,0.3143
hendeda01,51,10,0.1961


Here's a nicer version of the same resultset. 

In [None]:
_.DataFrame() 

Unnamed: 0,playerID,AB,H,Batting Average
0,armasto01,425,112,0.2635
1,barrema02,625,179,0.2864
2,baylodo01,585,139,0.2376
3,boggswa01,580,207,0.3569
4,bucknbi01,629,168,0.2671
5,dodsopa01,12,5,0.4167
6,evansdw01,529,137,0.259
7,gedmari01,462,119,0.2576
8,greenmi01,35,11,0.3143
9,hendeda01,51,10,0.1961


### **`JOIN ... USING (...)`**

Long `JOIN ... ON` joins are subject to typos, which can trigger errors like "Unknown column 'nameLas' in 'field list'" that get old pretty fast. To minimize typing in situations where the column names match exactly (but a natural join won't work), we can use the following shorthand syntax:  
```JOIN table USING (columns)```

Any columns listed inside the parentheses (which are not optional) must exist in both tables. Here we repeat the batting average calculation with the simpler `USING` syntax:

In [None]:
%%sql
SELECT playerID, Batting.AB, Batting.H, Batting.H/Batting.AB AS `Batting Average`
FROM Batting JOIN Teams USING (teamID, yearID)
WHERE franchID = 'BOS' AND Teams.yearID = 1986 AND Batting.AB>0

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
23 rows affected.


playerID,AB,H,Batting Average
armasto01,425,112,0.2635
barrema02,625,179,0.2864
baylodo01,585,139,0.2376
boggswa01,580,207,0.3569
bucknbi01,629,168,0.2671
dodsopa01,12,5,0.4167
evansdw01,529,137,0.259
gedmari01,462,119,0.2576
greenmi01,35,11,0.3143
hendeda01,51,10,0.1961


### **`INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN`**

SQL joins have a *directional* component that can be useful in certain situations. Every join we have seen so far is an *inner* join, which is the default. Thus, we rarely see `INNER JOIN` used but we can if we want to be explicit about it. 

A so-called ***outer* join** can take on one of two directions:
- **Left join:** `TableA LEFT JOIN TableB` includes every row from `TableA` (to the left of the `JOIN`) and only the matching rows from `TableB` (to the right of the join).
- **Right join:** `TableA RIGHT JOIN TableB` is the reverse, including every row from `TableB` but only matching rows from `TableA`. 

**Heads up:** Some DBMSes like Google BigQuery, Oracle, and SQL Server also support a more general `FULL OUTER JOIN` syntax that combines the left and right joins, allowing every row from both tables to appear at least once. MySQL does not support full outer joins (and SQLite only added them in version 3.39), so **we will stick to left and right joins in this course.** 
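The difference between an inner join and a left join is easiest to see when one table has no match for a row in the other. The sketch below uses Python's `sqlite3` module with made-up `players` and `postseason` tables (hypothetical names and data) in which one player has no postseason row.

```python
import sqlite3

# Toy data (hypothetical): Baker has no row in the postseason table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players    (playerID TEXT, nameLast TEXT);
    CREATE TABLE postseason (playerID TEXT, AB INTEGER, H INTEGER);
    INSERT INTO players    VALUES ('p1', 'Adams'), ('p2', 'Baker');
    INSERT INTO postseason VALUES ('p1', 20, 6);
""")

# Inner join: players without a matching postseason row drop out entirely
inner_rows = conn.execute(
    "SELECT nameLast, AB, H "
    "FROM players JOIN postseason USING (playerID) "
    "ORDER BY nameLast"
).fetchall()

# Left join: every player is kept; missing postseason values come back as NULL
left_rows = conn.execute(
    "SELECT nameLast, AB, H "
    "FROM players LEFT JOIN postseason USING (playerID) "
    "ORDER BY nameLast"
).fetchall()

print(inner_rows)
print(left_rows)
```

In the left join result the unmatched player still appears, with `None` (Python's rendering of SQL `NULL`) in the columns that would have come from the right-hand table.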

Outer joins are often used when we allow NULL values in foreign keys. For example, what if we wanted to see the post season batting average of Adam Greenberg, the [most unlucky but plucky MLB player ever](https://www.cnn.com/2012/10/02/sport/baseball-greenberg-second-chance/index.html)?

In [None]:
%%sql
SELECT playerID, yearID, BattingPost.AB, BattingPost.H, BattingPost.H/BattingPost.AB AS `Batting Average`
FROM Master LEFT JOIN BattingPost USING (playerID) 
WHERE nameLast = 'Greenberg' and nameFirst = 'Adam'

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


playerID,yearID,AB,H,Batting Average
greenad01,,,,


Adam appeared in exactly two games in his MLB career. For a few years between the first appearance and the second, he was the only player in MLB history to have a plate appearance but no at bats! Neither of his MLB games was in the post season playoffs. If we had left off the `LEFT` direction of the join then we would have gotten exactly *nothing*:

In [None]:
%%sql
SELECT playerID, yearID, BattingPost.AB, BattingPost.H, BattingPost.H/BattingPost.AB AS `Batting Average`
FROM Master JOIN BattingPost USING (playerID) 
WHERE nameLast = 'Greenberg' and nameFirst = 'Adam'

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
0 rows affected.


playerID,yearID,AB,H,Batting Average


We would have needed to use `RIGHT JOIN` if we swapped the order of the `Master` and `BattingPost` tables:


In [None]:
%%sql
-- Swapped table order but kept `LEFT JOIN`
SELECT playerID, yearID, BattingPost.AB, BattingPost.H, BattingPost.H/BattingPost.AB AS `Batting Average`
FROM BattingPost LEFT JOIN Master USING (playerID) 
WHERE nameLast = 'Greenberg' and nameFirst = 'Adam'

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
0 rows affected.


playerID,yearID,AB,H,Batting Average


In [None]:
%%sql
-- Switched to a `RIGHT JOIN`
SELECT playerID, yearID, BattingPost.AB, BattingPost.H, BattingPost.H/BattingPost.AB AS `Batting Average`
FROM BattingPost RIGHT JOIN Master USING (playerID) 
WHERE nameLast = 'Greenberg' and nameFirst = 'Adam'

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


playerID,yearID,AB,H,Batting Average
greenad01,,,,


**Heads up: This works; however, it is generally better to favor `LEFT JOIN` over `RIGHT JOIN`. In fact, SQLite versions before 3.39 do not allow right joins at all!**

### **Chained Joins**

There are times when just one join is not enough. If we need columns from three or more tables, then we will have to **chain** them together, one at a time, with joins. The following query uses three chained `JOIN` operations to connect four tables.   

In [None]:
%%sql 
SELECT nameLast,nameFirst, Batting.AB, Batting.H, Batting.H/Batting.AB AS `Batting Average`
FROM Master 
  JOIN Batting ON (Master.playerID = Batting.playerID)
  JOIN Teams ON (Batting.teamID = Teams.teamID AND Batting.yearID = Teams.yearID)
  JOIN TeamsFranchises ON (Teams.franchID = TeamsFranchises.franchID)
WHERE franchName like 'Boston Red%' AND Batting.`yearID` = 1986;


 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
38 rows affected.


nameLast,nameFirst,AB,H,Batting Average
Armas,Tony,425,112,0.2635
Barrett,Marty,625,179,0.2864
Baylor,Don,585,139,0.2376
Boggs,Wade,580,207,0.3569
Boyd,Oil Can,0,0,
Brown,Mike,0,0,
Buckner,Bill,629,168,0.2671
Clemens,Roger,0,0,
Crawford,Steve,0,0,
Dodson,Pat,12,5,0.4167



Remarks:
- The last two joins were added so we could look up teams by name (`Boston Red%`) instead of the three letter `franchID`.
- The order of the `JOIN` operations within the chain matters. We will see exactly *why* in Lesson 4 when we discuss the mathematical underpinnings of the relational database model and again in Lesson 6 when we discuss strong and weak entities. 
- Even though no columns were returned from the `Teams` table, we needed to include it anyway. Without the `Teams` table there would be no way to connect the `Batting` table with the `TeamsFranchises` table. In other words, there are no keys in common to match in an equijoin. Refer to the ERD in Lesson 2 to see why.   
- Each `JOIN` is on a separate line with indentation used to indicate that they are in the same `FROM` clause. **Please follow this convention for every chained join you create in this course.**  

---
## **Subqueries**

A subquery is an entire `SELECT` query used as an expression inside another query. To convert any query into a query expression, just wrap it in parentheses like this:   
```(SELECT nameLast FROM Master)```  
For short queries it is okay to leave everything on one line, but for longer queries it is better to start each clause on a new line:
```
( SELECT nameLast
  FROM Master)
```
Notice how the clauses of the subquery are left-aligned (via spaces) so that they form a solid left vertical line when you read them. That makes it easier to tell where a subquery starts and ends. We want anything with the same indentation block to be in the same subquery. If we embed a subquery inside of another subquery (making the queries three deep), then we indent a little more to the right to keep the alignment clean. 

Below we consider all the various ways you can use a subquery within a `SELECT` statement. 

### **Subqueries in the `SELECT` Clause**

When used in the `SELECT` clause a subquery is treated like a calculated column: 
```
SELECT (SELECT ...) ... 
FROM ...
```
There is an important caveat, however: the subquery must return a single value (i.e., one row and one column). Also, while not required, always use an alias to give the calculated column a name.

Most often these sorts of subqueries are used to assemble collections of calculations on otherwise unrelated data. Here we are getting row counts for a couple of tables. The last subquery is useful to check for duplicates. Just compare the `teamCount` with `uniqueTeamCount` to see that there must be some duplicates in the `teamid` column.








In [None]:
%%sql
SELECT (SELECT count(*) FROM Master) AS playerCount, (SELECT count(teamid) FROM Teams) AS teamCount, (SELECT count(DISTINCT teamid) FROM Teams) as uniqueTeamCount;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


playerCount,teamCount,uniqueTeamCount
19105,2835,149


### **Subqueries in the `FROM` Clause**
When used in the `FROM` clause, a subquery acts as a virtual *pseudo*-table, created on the spot within the outer query: 
```
SELECT ...
FROM (SELECT ...) AS pseudo-table-name ...
```
Remarks:
- This usage is not very common. 
- The alias is not optional.  
- This form is most often used so that the outer query can "decorate" the results of the subquery, joining in more data from other tables or performing calculations on the subquery columns. 



In the example below, the subquery `subq` is calculating the number of years each player played. The outer query is then "decorating" the subquery results with the player names. This is *sometimes* more efficient than using a `GROUP BY` in the outer query. 

In [None]:
%%sql
SELECT nameLast, nameFirst, playerYears
FROM (SELECT playerID, count(DISTINCT yearId) AS playerYears FROM Batting GROUP BY playerID) AS subq
      JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,nameFirst,playerYears
Aardsma,David,9
Aaron,Hank,23
Aaron,Tommie,7
Aase,Don,13
Abad,Andy,3
Abad,Fernando,7
Abadie,John,1
Abbaticchio,Ed,9
Abbey,Bert,5
Abbey,Charlie,5


### **Subqueries in the `WHERE` Clause**
When used in the `WHERE` clause, subqueries appear as expressions within the boolean expressions:
```
SELECT ...
FROM ...
WHERE some-column operator subquery ...
```

So, for example, if we only wanted baseball players with above average batting averages on the 1986 Red Sox we could try something like this:


In [None]:
%%sql
SELECT playerID, Batting.AB, Batting.H, Batting.H/Batting.AB AS `BattingAverage`
FROM Batting JOIN Teams USING (teamID, yearID)
WHERE franchID = 'BOS' 
  AND Teams.yearID = 1986 
  AND Batting.H/Batting.AB > (SELECT sum(Batting.H)/sum(Batting.AB) FROM Batting WHERE yearid = 1986)
ORDER BY Batting.AB DESC;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


playerID,AB,H,BattingAverage
bucknbi01,629,168,0.2671
barrema02,625,179,0.2864
riceji01,618,200,0.3236
boggswa01,580,207,0.3569
evansdw01,529,137,0.259
armasto01,425,112,0.2635
greenmi01,35,11,0.3143
dodsopa01,12,5,0.4167
saxda01,11,5,0.4545
lollati01,1,1,1.0


Notice that the subquery has to return a single value in order for the comparison to work. 

### **Subqueries in the `WITH` Clause**

A `WITH` clause (also known as a Common Table Expression) is unusual because it appears before the `SELECT` clause:
```
WITH 
  subq1 AS (...),
  subq2 AS (...)
SELECT ...
FROM subq1 ...
```

A `WITH` clause allows us to give subqueries names so they can act like temporary tables within the remainder of the query. In this case the name comes before the `AS` and the subquery in parentheses afterward. We can list as many named subqueries as we need. 

Why do we need this? As far as SQL itself is concerned, we don't. However, a `WITH` clause can make some queries a lot easier to read and test. Instead of burying the subqueries in the `FROM` or `WHERE` clauses we can write and test them up front *before* writing the rest of the query. 
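As a rough sketch of what this looks like in practice, the following uses Python's `sqlite3` module (SQLite accepts `WITH` clauses) on a made-up `batting` table. The table contents and the subquery names `team_totals` and `team_avg` are hypothetical and chosen only for illustration.

```python
import sqlite3

# Toy batting data (hypothetical values)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batting (playerID TEXT, teamID TEXT, AB INTEGER, H INTEGER);
    INSERT INTO batting VALUES
        ('p1', 'BOS', 100, 30),
        ('p2', 'BOS', 100, 20),
        ('p3', 'NYA',  50, 10),
        ('p4', 'NYA',  50, 20);
""")

query = """
WITH
  team_totals AS (
    SELECT teamID, SUM(AB) AS teamAB, SUM(H) AS teamH
    FROM batting
    GROUP BY teamID),
  team_avg AS (
    -- multiply by 1.0 because SQLite truncates integer division
    SELECT teamID, teamH * 1.0 / teamAB AS teamAvg
    FROM team_totals)
SELECT playerID
FROM batting JOIN team_avg USING (teamID)
WHERE H * 1.0 / AB > teamAvg
ORDER BY playerID;
"""

above_team_avg = [row[0] for row in conn.execute(query)]
print(above_team_avg)
```

Each named subquery can be run and checked on its own before the main `SELECT` is written, and a later one (`team_avg`) can build on an earlier one (`team_totals`). Here the team averages work out to .250 for BOS and .300 for NYA, so only `p1` and `p4` beat their own team's average.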

**Heads up: `WITH` clauses were only recently added to MySQL (in version 8.0) and won't work in the older MySQL versions that are supported by most cloud providers. Unfortunately for us (see below), that includes the AWS RDS instance that hosts our *Lahman 2016* database. SQLite supports `WITH` clauses though.** 

In [None]:
%%sql
-- Determine the MySQL version
-- Need at least version 8.0 for `WITH` clauses to work
SELECT VERSION()

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


VERSION()
5.7.26-log


### **A Note about *Correlated* Subqueries**

A correlated subquery is a subquery that uses data from the outer query. They are not very common, but there are certain edge cases where they are pretty useful. 

If, for example, we repeated the "players with above average batting averages" query, except this time comparing to the team batting average instead of the league batting average, then we would need to include the teamID in the subquery. 

In [None]:
%%sql
SELECT playerID, Batting.AB, Batting.H, Batting.H/Batting.AB AS `BattingAverage`
FROM Batting JOIN Teams USING (teamID, yearID)
WHERE franchID = 'BOS' 
  AND Teams.yearID = 1986 
  AND Batting.H/Batting.AB > (SELECT sum(Batting.H)/sum(Batting.AB) 
                              FROM Batting 
                              WHERE yearid = 1986 and teamID = Teams.teamID) -- Note: teamID is from outer query
ORDER BY Batting.AB DESC;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
7 rows affected.


playerID,AB,H,BattingAverage
barrema02,625,179,0.2864
riceji01,618,200,0.3236
boggswa01,580,207,0.3569
greenmi01,35,11,0.3143
dodsopa01,12,5,0.4167
saxda01,11,5,0.4545
lollati01,1,1,1.0


**Heads up: correlated subqueries can be very slow to complete. Only use them when you must.**

## **`UNION`, `INTERSECT`, and `EXCEPT` Queries**

We conclude our tutorial on SQL `SELECT` queries with a quick introduction to three special "set theoretic" operators that, like joins and subqueries, allow us to combine data from multiple tables. 

First, note that joins and subqueries are inherently column-oriented. If, for example, we use a join to add a second table to our query, we get access to all its columns as well. Similarly, when we add a subquery in the `SELECT` or `FROM` clauses we are adding more columns into the mix. 

The `UNION`, `INTERSECT` and `EXCEPT` operators are row-oriented. They never add any columns. Instead, they add or subtract rows. We will get into this more in Lesson 4, but the key concept is that **a table is a *set* of rows.** With the set theoretic operators we can add rows or delete rows from a row set but we can't modify the rows themselves. Each row always has the same number of columns. 

**Heads up:** SQL operators *never* alter data. Thus the examples below do not change any table data; they apply the operators to generate a new result set, not a table. If we want to *create a new table* then we will need SQL DDL, which is covered in Lesson 7. If we want to *alter* table data then we will need SQL DML, which is covered in Lesson 8. 

We'll take the set operators one at a time.

The `UNION` operator combines the rows returned by two `SELECT` queries:
```
SELECT ...
UNION
SELECT ...
```
For example, the following query forms the union of the `Batting` table and the `BattingPost` table:
```
SELECT * FROM Batting
UNION
SELECT * FROM BattingPost
```
The top and bottom queries have the same number and types of columns, making them **union-compatible**. While the names of the columns don't have to match, the number of columns has to be the same and corresponding columns (first column, second column, etc.) have to have the same data types. If the second column in the top query is integer-valued, then the second column of the bottom query also has to be integer-valued. The column names themselves always come from the top query, ignoring the names in the bottom query.

The `INTERSECT` operator syntax is exactly the same as the `UNION` syntax. However, instead of adding rows, it returns only the rows that are in both the top and the bottom queries. 

The `EXCEPT` operator also has a top query and a bottom query, both union-compatible with the other. This time, however, `EXCEPT` only returns the rows in the top query that *are not* found in the bottom query. It is like taking the difference between the two tables. 
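The three set operators can be compared on a pair of tiny, union-compatible tables. This is a minimal sketch using Python's `sqlite3` module; the `regular` and `post` tables, the helper function `ids`, and the player IDs are all made up for illustration.

```python
import sqlite3

# Two union-compatible toy tables (hypothetical data): one column each
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regular (playerID TEXT);
    CREATE TABLE post    (playerID TEXT);
    INSERT INTO regular VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO post    VALUES ('p2'), ('p3'), ('p4');
""")

def ids(sql):
    """Run a query and return its first column as a plain list."""
    return [row[0] for row in conn.execute(sql)]

# UNION: rows found in either query (duplicates removed)
union_ids = ids(
    "SELECT playerID FROM regular UNION SELECT playerID FROM post ORDER BY playerID"
)

# INTERSECT: only rows found in both queries
intersect_ids = ids(
    "SELECT playerID FROM regular INTERSECT SELECT playerID FROM post ORDER BY playerID"
)

# EXCEPT: rows in the top query that are not in the bottom query
except_ids = ids(
    "SELECT playerID FROM regular EXCEPT SELECT playerID FROM post ORDER BY playerID"
)

print(union_ids)
print(intersect_ids)
print(except_ids)
```

Note that every result still has exactly one column. The operators only change which rows appear, and swapping the top and bottom queries changes the result of `EXCEPT` but not of `UNION` or `INTERSECT`.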


---
## **PRO TIPS: How to write queries correctly the first time, every time**
It's pretty easy to tell a novice SQL coder from a seasoned veteran. The novice always starts typing long queries from the top, working through the clauses one at a time until they reach the bottom. Then they run the query and 99% of the time it fails. SQL will give some sort of cryptic error message (nobody knows what they mean half the time) and then the novice spends 45 minutes or so staring at the code trying to spot the error.

There are lots of problems with this approach. Here are a few:
- Is there really just one bug? SQL will only report the first fatal error it finds. There may be many more bugs to fix before your code works. 
- Any time spent staring at the screen is not actually producing working code.
- Where do you start to fix the error? Sometimes a bug is spread over several lines; each line is "correct" but the combination of lines with each other is fatally wrong. 
- Sometimes you don't get an error message at all. Instead, the query just runs forever, until it bricks the server (and your professor gives you an F for destroying everybody's homework). Or, the query just returns nothing, a result set with zero rows. 

A seasoned veteran seemingly never makes these kinds of mistakes. Every query seems to work all the time. Your humble professor has gone entire semesters without having a live SQL bug in a class demo. It can be done! It just takes the discipline to follow the *right* process and the courage to tear apart code that doesn't work. No code is sacred. Sometimes you have to break things before you can fix them. 

Okay, so what is the process?
1. Start with the simplest possible query that does pretty much nothing. Run it to make sure that the SQL runtime is working and that you haven't made a trivial mistake like leaving out the second `E` in `SELECT` or forgetting the semicolon at the end. 
2. Code the most difficult, most likely to fail, part of the query, making whatever minimal changes to the rest of the query so that it will run. Then run it to make sure you didn't mess up. 
3. Code the next most difficult ... 
4. Repeat until the query works. 

Yes, that really is it! Instead of writing a huge query all at once, write in chunks that can be tested and debugged as you go along. You will write *more total lines of SQL code* but each change will be small and easy to get right before moving on.

Let's apply this to a fairly complicated query like this one:
```
SELECT nameLast,nameFirst, Batting.AB, Batting.H, Batting.H/Batting.AB AS `Batting Average`
FROM Master 
  JOIN Batting ON (Master.playerID = Batting.playerID)
  JOIN Teams ON (Batting.teamID = Teams.teamID AND Batting.yearID = Teams.yearID)
  JOIN TeamsFranchises ON (Teams.franchID = TeamsFranchises.franchID)
WHERE franchName like 'Boston Red%' AND Batting.`yearID` = 1986;
```

Even with the code printed out in front of you, if you were to type it in from top to bottom you would almost certainly have a typo or other error in your code. So, let's not even try to do that. Let's apply our process. 

We'll start with the simplest possible code that does as little as possible:


In [None]:
%%sql
SELECT *
FROM Master
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01


Great. It works. We didn't forget the `%%sql` magic and we spelled everything correctly. To avoid typos in the column names we used a wildcard. We even used a `LIMIT` clause so that we don't have to wait for it to list 19000 players. 

Now let's think about the hardest thing the query has to do. For most queries it is the joins. If even one join doesn't work then you get the dreaded zero row table! Let's try to join in another table. To minimize typing (and typing mistakes) we'll try `JOIN ... USING` instead of `JOIN ... ON` syntax.

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Teams USING (playerID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
(pymysql.err.OperationalError) (1054, "Unknown column 'playerID' in 'from clause'")
[SQL: SELECT *
FROM Master
  JOIN Teams USING (playerID)
LIMIT 5;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


Ugh, that didn't work. We know that `playerID` is on the `Master` table, so at first the error message doesn't seem to make sense. But `USING` requires the column to exist in *both* tables. Looking at the data model again, we realize that `Teams` has no `playerID` column, so there is no common key for joining `Master` and `Teams` directly. Let's try a different join, this time with the `Batting` table using the `playerID`.   

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1850,1,2,USA,OK,Fort Towson,1939,11,11,USA,PA,Philadelphia,Frank,Abercrombie,Francis Patterson,0,0,,,1871-10-21,1871-10-21,aberd101,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0,0,0,0,0,,,,,
addybo01,1842,2,0,CAN,ON,Port Hope,1910,4,9,USA,ID,Pocatello,Bob,Addy,Robert Edward,160,68,L,L,1871-05-06,1877-10-06,addyb101,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13,8,1,4,0,,,,,
allisar01,1849,1,29,USA,PA,Philadelphia,1916,2,25,USA,DC,Washington,Art,Allison,Arthur Algernon,150,68,,,1871-05-04,1876-10-05,allia101,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19,3,1,2,5,,,,,
allisdo01,1846,7,12,USA,PA,Philadelphia,1916,12,19,USA,DC,Washington,Doug,Allison,Douglas L.,160,70,R,R,1871-05-05,1883-07-13,allid101,allisdo01,1871,1,WS3,,27,133,28,44,10,2,2,27,1,1,0,2,,,,,
ansonca01,1852,4,17,USA,IA,Marshalltown,1922,4,14,USA,IL,Chicago,Cap,Anson,Adrian Constantine,227,72,R,R,1871-05-06,1897-10-03,ansoc101,ansonca01,1871,1,RC1,,25,120,29,39,11,3,0,16,6,2,2,1,,,,,


That worked! Booyah! So let's tack on another join that gets us closer to the `TeamsFranchises` table. Consulting the ER Diagram, this time it's the `Teams` table.

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


teamID,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,yearID_1,lgID_1,franchID,divID,Rank,G_1,Ghome,W,L,DivWin,WCWin,LgWin,WSWin,R_1,AB_1,H_1,2B_1,3B_1,HR_1,BB_1,SO_1,SB_1,CS_1,HBP_1,SF_1,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
PH3,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1875,1,,11,45,3,10,0,0,0,4,1,0,0,3,,,,,,1875,,CEN,,11,14,,2,12,,,N,,70,529,126,19,3,0,8,0,0,,,,138,55,3.93,14,0,0,378,170,0,3,0,164,,0.769,Philadelphia Centennials,Centennial Grounds,,92,98,CEN,PH3,PH3
BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1875,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,1872,,BRA,,6,37,,9,28,,,N,,237,1466,370,46,10,0,19,24,17,14.0,,,473,189,5.06,37,0,0,1008,561,6,19,0,358,,0.81,Brooklyn Atlantics,Capitoline Grounds,,115,122,BRA,BR2,BR2
BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1875,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,1873,,BRA,,6,55,,17,37,,,N,,366,2210,588,60,27,6,53,43,18,9.0,,,549,221,3.98,52,1,0,1500,737,8,42,15,505,,0.82,Brooklyn Atlantics,Union Grounds,,90,94,BRA,BR2,BR2
BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1875,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,1874,,BRA,,6,56,,22,33,,,N,,301,2165,497,50,11,1,32,0,0,,,,449,180,3.2,56,1,0,1518,621,16,13,0,500,,0.822,Brooklyn Atlantics,Union Grounds,,89,94,BRA,BR2,BR2
BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1875,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,1875,,BRA,,11,44,,2,42,,,N,,132,1547,304,32,9,2,10,0,0,,,,438,174,3.95,31,0,0,1188,531,9,22,0,426,,0.801,Brooklyn Atlantics,Union Grounds,,88,94,BRA,BR2,BR2


It ran but there seems to be a problem. It keeps repeating the first player over and over again. Again, after consulting the data model, we see that `Batting` has a `yearID` column, as does the `Teams` table. Aha! Each season is treated as a new team! Leaving `yearID` out of our join condition matched every batting row with every season its team ever played, which behaves like a partial cross join. If we had not had a `LIMIT` clause then we might have blown up AWS! (Not really.) So let's add `yearID` to the join conditions before we get too carried away with ourselves. 

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


yearID,teamID,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,stint,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,lgID_1,franchID,divID,Rank,G_1,Ghome,W,L,DivWin,WCWin,LgWin,WSWin,R_1,AB_1,H_1,2B_1,3B_1,HR_1,BB_1,SO_1,SB_1,CS_1,HBP_1,SF_1,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
1875,PH3,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1,,11,45,3,10,0,0,0,4,1,0,0,3,,,,,,,CEN,,11,14,,2,12,,,N,,70,529,126,19,3,0,8,0,0,,,,138,55,3.93,14,0,0,378,170,0,3,0,164,,0.769,Philadelphia Centennials,Centennial Grounds,,92,98,CEN,PH3,PH3
1875,BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,,BRA,,11,44,,2,42,,,N,,132,1547,304,32,9,2,10,0,0,,,,438,174,3.95,31,0,0,1188,531,9,22,0,426,,0.801,Brooklyn Atlantics,Union Grounds,,88,94,BRA,BR2,BR2
1871,TRO,abercda01,1850,1,2,USA,OK,Fort Towson,1939,11,11,USA,PA,Philadelphia,Frank,Abercrombie,Francis Patterson,0,0,,,1871-10-21,1871-10-21,aberd101,abercda01,1,,1,4,0,0,0,0,0,0,0,0,0,0,,,,,,,TRO,,6,29,,13,15,,,N,,351,1248,384,51,34,6,49,19,62,,,,362,153,5.51,28,0,0,750,431,4,75,12,198,,0.845,Troy Haymakers,Haymakers' Grounds,,101,100,TRO,TRO,TRO
1879,SR1,adamsge01,1855,1,26,USA,MA,Worcester,1920,10,11,USA,WA,Clarkston,George,Adams,George Henry,175,66,R,R,1879-06-14,1879-06-21,adamg102,adamsge01,1,NL,4,13,0,3,0,0,0,0,0,0,1,1,,,,,,NL,SYR,,7,71,,22,48,,,N,,276,2611,592,61,19,5,28,238,0,,,,462,230,3.19,64,5,0,1947,775,4,52,132,398,37.0,0.872,Syracuse Stars,Newell Park,,89,95,SYR,SR1,SR1
1871,RC1,addybo01,1842,2,0,CAN,ON,Port Hope,1910,4,9,USA,ID,Pocatello,Bob,Addy,Robert Edward,160,68,L,L,1871-05-06,1877-10-06,addyb101,addybo01,1,,25,118,30,32,6,0,0,13,8,1,4,0,,,,,,,ROK,,9,25,,4,21,,,N,,231,1036,274,44,25,3,38,30,53,,,,287,108,4.3,23,1,0,678,315,3,34,16,220,,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97,99,ROK,RC1,RC1


That's better. Now we get back the same rows but with more columns. Okay, now let's add in the final join to the `TeamsFranchises` table.

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (teamID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
(pymysql.err.OperationalError) (1054, "Unknown column 'teamID' in 'from clause'")
[SQL: SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (teamID)
LIMIT 5;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


Oops. Since everything worked just fine before the last join, the bug is in the code we just added. After consulting the data model again, we see that `franchID` appears on both the `Teams` and `TeamsFranchises` tables. We'll join using that instead. 

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


franchID,yearID,teamID,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,stint,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,lgID_1,divID,Rank,G_1,Ghome,W,L,DivWin,WCWin,LgWin,WSWin,R_1,AB_1,H_1,2B_1,3B_1,HR_1,BB_1,SO_1,SB_1,CS_1,HBP_1,SF_1,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,franchName,active,NAassoc
CEN,1875,PH3,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,1,,11,45,3,10,0,0,0,4,1,0,0,3,,,,,,,,11,14,,2,12,,,N,,70,529,126,19,3,0,8,0,0,,,,138,55,3.93,14,0,0,378,170,0,3,0,164,,0.769,Philadelphia Centennials,Centennial Grounds,,92,98,CEN,PH3,PH3,Philadelphia Centennials,,
BRA,1875,BR2,abadijo01,1850,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01,2,,1,4,1,1,0,0,0,1,0,0,0,0,,,,,,,,11,44,,2,42,,,N,,132,1547,304,32,9,2,10,0,0,,,,438,174,3.95,31,0,0,1188,531,9,22,0,426,,0.801,Brooklyn Atlantics,Union Grounds,,88,94,BRA,BR2,BR2,Brooklyn Atlantics,,
TRO,1871,TRO,abercda01,1850,1,2,USA,OK,Fort Towson,1939,11,11,USA,PA,Philadelphia,Frank,Abercrombie,Francis Patterson,0,0,,,1871-10-21,1871-10-21,aberd101,abercda01,1,,1,4,0,0,0,0,0,0,0,0,0,0,,,,,,,,6,29,,13,15,,,N,,351,1248,384,51,34,6,49,19,62,,,,362,153,5.51,28,0,0,750,431,4,75,12,198,,0.845,Troy Haymakers,Haymakers' Grounds,,101,100,TRO,TRO,TRO,Troy Haymakers,,
SYR,1879,SR1,adamsge01,1855,1,26,USA,MA,Worcester,1920,10,11,USA,WA,Clarkston,George,Adams,George Henry,175,66,R,R,1879-06-14,1879-06-21,adamg102,adamsge01,1,NL,4,13,0,3,0,0,0,0,0,0,1,1,,,,,,NL,,7,71,,22,48,,,N,,276,2611,592,61,19,5,28,238,0,,,,462,230,3.19,64,5,0,1947,775,4,52,132,398,37.0,0.872,Syracuse Stars,Newell Park,,89,95,SYR,SR1,SR1,Syracuse Stars,N,
ROK,1871,RC1,addybo01,1842,2,0,CAN,ON,Port Hope,1910,4,9,USA,ID,Pocatello,Bob,Addy,Robert Edward,160,68,L,L,1871-05-06,1877-10-06,addyb101,addybo01,1,,25,118,30,32,6,0,0,13,8,1,4,0,,,,,,,,9,25,,4,21,,,N,,231,1036,274,44,25,3,38,30,53,,,,287,108,4.3,23,1,0,678,315,3,34,16,220,,0.821,Rockford Forest Citys,Agricultural Society Fair Grounds,,97,99,ROK,RC1,RC1,Rockford Forest Citys,,


Perfect. Now that we have all the data columns we'll need, we can decorate our code with the final details. Let's start with the `WHERE` clause, which if we get right will allow us to cut out the `LIMIT` clause. 

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND frachName LIKE 'Boston Red%'
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
(pymysql.err.OperationalError) (1054, "Unknown column 'frachName' in 'where clause'")
[SQL: SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND frachName LIKE 'Boston Red%%';]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


Oops. Another typo. This time it's easy to catch because the error message tells us where it is. We left out the `n` in `franchName`.

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND franchName LIKE 'Boston Red%'
LIMIT 5;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
5 rows affected.


franchID,yearID,teamID,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,stint,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,lgID_1,divID,Rank,G_1,Ghome,W,L,DivWin,WCWin,LgWin,WSWin,R_1,AB_1,H_1,2B_1,3B_1,HR_1,BB_1,SO_1,SB_1,CS_1,HBP_1,SF_1,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,franchName,active,NAassoc
BOS,1986,BOS,armasto01,1953,7,2,Venezuela,Anzoategui,Puerto Piritu,,,,,,,Tony,Armas,Antonio Rafael,182,71,R,R,1976-09-06,1989-10-01,armat001,armasto01,1,AL,121,425,40,112,21,4,11,58,0,3,24,77,1,2,0,2,12,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,barrema02,1958,6,23,USA,CA,Arcadia,,,,,,,Marty,Barrett,Martin Glenn,175,70,R,R,1982-09-06,1991-05-07,barrm001,barrema02,1,AL,158,625,94,179,39,4,4,60,15,7,65,31,0,1,18,4,13,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,baylodo01,1949,6,28,USA,TX,Austin,,,,,,,Don,Baylor,Don Edward,190,73,R,R,1970-09-18,1988-10-01,bayld001,baylodo01,1,AL,160,585,93,139,23,1,31,94,3,5,62,111,8,35,0,5,12,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,boggswa01,1958,6,15,USA,NE,Omaha,,,,,,,Wade,Boggs,Wade Anthony,190,74,L,R,1982-04-10,1999-08-27,boggw001,boggswa01,1,AL,149,580,107,207,47,2,8,71,0,4,105,44,14,0,4,4,11,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,boydoi01,1959,10,6,USA,MS,Meridian,,,,,,,Oil Can,Boyd,Dennis Ray,155,73,R,R,1982-09-13,1991-10-01,boydo001,boydoi01,1,AL,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,


So far, so good. Each player is on Boston's roster, the year is 1986, and there are no duplicates. Let's try taking off the `LIMIT`. 

In [None]:
%%sql
SELECT *
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND franchName LIKE 'Boston Red%';

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
38 rows affected.


franchID,yearID,teamID,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,stint,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,lgID_1,divID,Rank,G_1,Ghome,W,L,DivWin,WCWin,LgWin,WSWin,R_1,AB_1,H_1,2B_1,3B_1,HR_1,BB_1,SO_1,SB_1,CS_1,HBP_1,SF_1,RA,ER,ERA,CG,SHO,SV,IPouts,HA,HRA,BBA,SOA,E,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro,franchName,active,NAassoc
BOS,1986,BOS,armasto01,1953,7,2,Venezuela,Anzoategui,Puerto Piritu,,,,,,,Tony,Armas,Antonio Rafael,182,71,R,R,1976-09-06,1989-10-01,armat001,armasto01,1,AL,121,425,40,112,21,4,11,58,0,3,24,77,1,2,0,2,12,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,barrema02,1958,6,23,USA,CA,Arcadia,,,,,,,Marty,Barrett,Martin Glenn,175,70,R,R,1982-09-06,1991-05-07,barrm001,barrema02,1,AL,158,625,94,179,39,4,4,60,15,7,65,31,0,1,18,4,13,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,baylodo01,1949,6,28,USA,TX,Austin,,,,,,,Don,Baylor,Don Edward,190,73,R,R,1970-09-18,1988-10-01,bayld001,baylodo01,1,AL,160,585,93,139,23,1,31,94,3,5,62,111,8,35,0,5,12,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,boggswa01,1958,6,15,USA,NE,Omaha,,,,,,,Wade,Boggs,Wade Anthony,190,74,L,R,1982-04-10,1999-08-27,boggw001,boggswa01,1,AL,149,580,107,207,47,2,8,71,0,4,105,44,14,0,4,4,11,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,boydoi01,1959,10,6,USA,MS,Meridian,,,,,,,Oil Can,Boyd,Dennis Ray,155,73,R,R,1982-09-13,1991-10-01,boydo001,boydoi01,1,AL,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,brownmi01,1959,3,24,USA,NJ,Camden County,,,,,,,Mike,Brown,Michael Gary,195,74,R,R,1982-09-16,1987-08-15,browm003,brownmi01,1,AL,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,bucknbi01,1949,12,14,USA,CA,Vallejo,,,,,,,Bill,Buckner,William Joseph,185,72,L,L,1969-09-21,1990-05-30,buckb001,bucknbi01,1,AL,153,629,73,168,39,2,18,102,6,4,40,25,9,4,0,8,25,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,clemero02,1962,8,4,USA,OH,Dayton,,,,,,,Roger,Clemens,William Roger,205,76,R,R,1984-05-15,2007-09-16,clemr001,clemero02,1,AL,33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,crawfst01,1958,4,29,USA,OK,Pryor,,,,,,,Steve,Crawford,Steven Ray,225,77,R,R,1980-09-02,1991-10-05,craws001,crawfst01,1,AL,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,
BOS,1986,BOS,dodsopa01,1959,10,11,USA,CA,Santa Monica,,,,,,,Pat,Dodson,Patrick Neal,210,76,L,L,1986-09-05,1988-06-22,dodsp001,dodsopa01,1,AL,9,12,3,5,2,0,1,3,0,0,3,3,0,0,0,0,0,AL,E,1,161,81,95,66,Y,,Y,N,794,5498,1488,320,21,144,595,707,41,34,,,696,624,3.93,36,6,41,4287,1469,167,474,1033,129,146,0.979,Boston Red Sox,Fenway Park II,2147641,101,100,BOS,BOS,BOS,Boston Red Sox,Y,


Great! That gives us all the players and their stats but really, that's way too many columns. It's time to replace the wildcard with the columns we need. 

In [None]:
%%sql
SELECT nameLast,nameFirst, Batting.AB, Batting.H
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND franchName LIKE 'Boston Red%';

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
38 rows affected.


nameLast,nameFirst,AB,H
Armas,Tony,425,112
Barrett,Marty,625,179
Baylor,Don,585,139
Boggs,Wade,580,207
Boyd,Oil Can,0,0
Brown,Mike,0,0
Buckner,Bill,629,168
Clemens,Roger,0,0
Crawford,Steve,0,0
Dodson,Pat,12,5


Okay now for the final step. We need to calculate the batting average. 

In [None]:
%%sql
SELECT nameLast,nameFirst, Batting.AB, Batting.H, Batting.H/Batting.AB AS `Batting Average`
FROM Master
  JOIN Batting USING (playerID)
  JOIN Teams USING (teamID, yearID)
  JOIN TeamsFranchises USING (franchID)
WHERE yearID = 1986 AND franchName LIKE 'Boston Red%';

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
38 rows affected.


nameLast,nameFirst,AB,H,Batting Average
Armas,Tony,425,112,0.2635
Barrett,Marty,625,179,0.2864
Baylor,Don,585,139,0.2376
Boggs,Wade,580,207,0.3569
Boyd,Oil Can,0,0,
Brown,Mike,0,0,
Buckner,Bill,629,168,0.2671
Clemens,Roger,0,0,
Crawford,Steve,0,0,
Dodson,Pat,12,5,0.4167


That looks great. It could look a little nicer. Let's use pandas to make it pretty.

In [None]:
_.DataFrame()

Unnamed: 0,nameLast,nameFirst,AB,H,Batting Average
0,Armas,Tony,425,112,0.2635
1,Barrett,Marty,625,179,0.2864
2,Baylor,Don,585,139,0.2376
3,Boggs,Wade,580,207,0.3569
4,Boyd,Oil Can,0,0,
5,Brown,Mike,0,0,
6,Buckner,Bill,629,168,0.2671
7,Clemens,Roger,0,0,
8,Crawford,Steve,0,0,
9,Dodson,Pat,12,5,0.4167
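
The blank entries in the `Batting Average` column are `NULL`s. Pitchers like Clemens had zero at-bats, and by default MySQL returns `NULL` when a `SELECT` divides by zero. The sketch below reproduces that behavior with three of the rows shown above, using an in-memory SQLite database as a stand-in for the MySQL server (an assumption made so the example is self-contained). The table name `stats` is hypothetical. Note that SQLite, unlike MySQL, truncates when dividing one integer by another, so the hits are multiplied by `1.0` first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (nameLast TEXT, AB INTEGER, H INTEGER)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?)",
    [("Boggs", 580, 207), ("Clemens", 0, 0), ("Dodson", 12, 5)],
)

# Division by zero yields NULL, which Python sees as None.
all_rows = conn.execute(
    "SELECT nameLast, ROUND(1.0 * H / AB, 4) FROM stats ORDER BY nameLast"
).fetchall()

# One way to hide those rows: keep only players with at least one at-bat.
hitters = conn.execute(
    "SELECT nameLast, ROUND(1.0 * H / AB, 4) FROM stats "
    "WHERE AB > 0 ORDER BY nameLast"
).fetchall()

print(all_rows)
print(hitters)
```

Whether to keep or filter the `NULL` rows depends on the question being asked. A roster listing should keep them, while a batting leaderboard probably should not.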


While it may have taken you a while to follow along with the example, in real time the coding took about 2 minutes. You'll get plenty of practice with this stuff in the homeworks. In the meantime, just follow the process. It really works. 
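
One lesson from the walkthrough deserves a second look: joining `Batting` to `Teams` on `teamID` alone ran without error but returned too many rows. A quick way to catch this is to compare row counts before and after adding a join, since a join that follows a foreign key to its parent table should not add rows. The sketch below illustrates the check on a toy in-memory SQLite database. The rows are made up, and only the column names echo the Lahman schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batting (playerID TEXT, yearID INTEGER, teamID TEXT);
    CREATE TABLE teams   (yearID INTEGER, teamID TEXT, name TEXT,
                          PRIMARY KEY (yearID, teamID));
    INSERT INTO batting VALUES ('p1', 1986, 'BOS'), ('p2', 1986, 'BOS');
    INSERT INTO teams   VALUES (1985, 'BOS', 'Boston'),
                               (1986, 'BOS', 'Boston'),
                               (1987, 'BOS', 'Boston');
""")

def count(sql):
    """Return the single number produced by a COUNT(*) query."""
    return conn.execute(sql).fetchone()[0]

base    = count("SELECT COUNT(*) FROM batting")
# Only half of the composite key: each batting row matches all three seasons.
partial = count("SELECT COUNT(*) FROM batting JOIN teams USING (teamID)")
# The full composite key: each batting row matches exactly one team season.
full    = count("SELECT COUNT(*) FROM batting JOIN teams USING (teamID, yearID)")

print(base, partial, full)
```

Here the partial join triples the row count (2 rows become 6), while the full-key join leaves it at 2. A jump like that is a strong hint that part of a composite key is missing from the join condition.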

---
## **SQL AND BEYOND: Google BigQuery**

BigQuery is a popular data hosting service offered as part of the Google Cloud Platform. 
>What is BigQuery?
>
>Storing and querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.

While BigQuery has been around for about a decade now (under various brands), it has recently gained a lot of interest with data scientists working with massive datasets. 

BigQuery is optimized for analytical processing:
- All data for a given query is kept in a single table (i.e., no joins)
- The data is loaded once and rarely modified
- Each row may comprise numerous columns of numerical statistics (facts) with accompanying contextual columns (dimensions)

Since the data is somewhat static, there is no need for most of the integrity protections provided by traditional databases. In fact, data redundancy and denormalization (Lesson 5) are desired in such cases. It makes the job of the analyst easier because *every query* is pretty much the same, either a simple sampling of rows or perhaps a PivotTable-like aggregation like we learned in Lesson 2. 
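
To make the "one wide table" idea concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for BigQuery. The table `play_facts_demo`, its columns, and its rows are all made up for illustration and are not the schema of the real dataset. It shows the two query shapes described above: a simple sampling of rows and a PivotTable-like aggregation, neither of which needs a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions (season, team) and facts (points, seconds) sit side by side.
    CREATE TABLE play_facts_demo (
        season INTEGER, team TEXT, points INTEGER, seconds REAL
    );
    INSERT INTO play_facts_demo VALUES
        (2016, 'GSW', 3, 14.0), (2016, 'GSW', 2, 20.5), (2016, 'CLE', 2, 18.0),
        (2017, 'GSW', 0, 24.0), (2017, 'CLE', 3, 9.5);
""")

# Shape 1: a simple sampling of rows.
sample = conn.execute("SELECT * FROM play_facts_demo LIMIT 3").fetchall()

# Shape 2: an aggregation of the facts, grouped by the dimensions.
totals = conn.execute("""
    SELECT season, team, COUNT(*) AS plays, SUM(points) AS points
    FROM play_facts_demo
    GROUP BY season, team
    ORDER BY season, team
""").fetchall()

print(sample)
print(totals)
```

The team abbreviation is repeated on every row, which a normalized design would avoid. For static analytical data that redundancy is the price of never having to write a join.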

Let's take a quick tour and then do a few queries. We'll use the *full* play-by-play dataset from the NBA Boxscore assignment. That's every play in every game since 2004, all 17 million of them. 

### **Data by the Bucket**
The dataset is way too big to work with in a single file. It also never changes so there is little need to keep a copy on your hard disk. Cloud providers like Google, Amazon, etc. provide cloud storage for just this sort of thing. 

Google Cloud Platform (GCP) stores data in buckets, priced out at a few cents per gigabyte. Here are the files we will be working with:

![GCP Buckets](https://github.com/christopherhuntley/DATA6510/raw/master/img/L3_GCP_Buckets.png)

A few remarks:
- GCP bundles services by the *project* so that teams can work out data access permissions. Files shown are for the `NBA Lineup Facts` project. 
- The bucket name is `nba_lineup_facts`. There can be multiple buckets per project. 
- The data files are pretty tame, with the biggest one at just over 300 megabytes. The `play_facts` folder, however, has a 300-400 megabyte file for each season. 
- All data is stored in CSV format, though BigQuery can also load other formats (e.g., Avro or Parquet) that typically load faster. 

### **Database as Dataset Container**
Since a given database may be split into many pieces (through a process called *partitioning*), GCP refers to them as *datasets*. A dataset, in turn, is a collection of BigQuery tables. Below we see the columns of the `play_facts_all` table that we will use for our queries. 

![GCP data explorer](https://github.com/christopherhuntley/DATA6510/raw/master/img/L3_GCP_data_explorer.png)

The data for a given table is loaded from one or more files. If they are small files then we can just upload them. However, for our database they are loaded from cloud storage. 

![GCP Create Table](https://github.com/christopherhuntley/DATA6510/raw/master/img/L3_GCP_create_table.png)

>For those who are wondering, BigQuery does not run on a conventional DBMS like MySQL. It uses Google's own distributed query engine (Dremel) on top of columnar cloud storage. Even so, the GCP software adheres to the three-tiered architecture we learned in Lesson 1:
>- Presentation: the GCP web interface *and* the `%%bigquery` magic 
>- Logic: the various cloud services that make up the GCP
>- Data: BigQuery's managed storage, running in the cloud ... somewhere 

### **BigQuery SQL Workspace**

BigQuery provides its own query tool, which looks a lot like just about any other. You can use it to compose queries, run them, and (optionally) save the results to a table. 

![GCP Query Tool](https://github.com/christopherhuntley/DATA6510/raw/master/img/L3_GCP_query_tool.png)

Remarks:
- The query aggregates all of the play-by-play data so that we can compare the performance of NBA lineups (i.e., groups of 5 players) across seasons.
- Table names in BigQuery (like `nba-lineup-facts.lineup_facts.play_facts_all`) always use dot notation; the format is  `project.dataset.table`.
- The query took 3.3 seconds to run. The same calculation took a few *minutes* in pandas, with results 'rolled up' one season at a time. 
- After running the query the results were saved as a new table called `lineups_w_200mins`. 

**While the query tool is nice for data that never leaves the Google Cloud, we of course want to use the data in our notebooks, where we will spend the rest of this BigQuery tutorial.**

### **Setting up BigQuery in Colab**

BigQuery uses its own variation of the `%%sql` magic we have been using to run our queries. It also requires that the user be logged in to the Google Cloud. We can do both with the cell below. 

In [1]:
# load the bigquery magics extension
%load_ext google.cloud.bigquery

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


#### **Why did we have to log in again?**
As mentioned before, Google organizes data by *project*. Read-only permission to view the data (i.e., run `SELECT` queries) has been granted to the `student.fairfield.edu` Google domain. Just to be sure that you are using an `@student.fairfield.edu` account, BigQuery requires one last login before you can run any queries.  

### **Running Queries  with `%%bigquery` Magic**

Instead of rerunning the *expensive* query we just ran in the GCP Query Tool, we will do a few simpler ones as a quick demonstration. 

As has by now become customary, we will phrase each query as a question, with the code below.

#### **Query 1. How many rows are there in the `play_facts_all` table?**

In [2]:
%%bigquery --project nba-lineup-facts
SELECT COUNT(*) as total_rows
FROM `nba-lineup-facts.lineup_facts.play_facts_all`;

Unnamed: 0,total_rows
0,16986388


That's a lot of rows! Imagine trying to do that with MS Excel. 

The `%%bigquery` magic takes a `--project` argument instead of the connection string we used with the classic `%%sql` magic. Except for that, everything is much like before.

#### **Query 2. Which lineups played the most total minutes together in a season?**

In [3]:
%%bigquery --project nba-lineup-facts
SELECT * 
FROM `nba-lineup-facts.lineup_facts.lineups_w_200mins`
ORDER BY minutes DESC
LIMIT 10;

Unnamed: 0,year,team,lineup,minutes,plus_minus_36m
0,2006,DET,"['Ben Wallace', 'Chauncey Billups', 'Rasheed W...",2051.433333,8.300538
1,2005,DET,"['Ben Wallace', 'Chauncey Billups', 'Rasheed W...",1906.866667,6.418907
2,2014,IND,"['David West', 'George Hill', 'Lance Stephenso...",1860.016667,6.464458
3,2013,IND,"['David West', 'George Hill', 'Lance Stephenso...",1628.316667,9.219337
4,2014,POR,"['Damian Lillard', 'LaMarcus Aldridge', 'Nicol...",1615.866667,4.700883
5,2017,WAS,"['Bradley Beal', 'John Wall', 'Marcin Gortat',...",1595.083333,7.831566
6,2010,BOS,"['Kendrick Perkins', 'Kevin Garnett', 'Paul Pi...",1535.333333,8.441164
7,2005,PHX,"[""Amar'e Stoudemire"", 'Joe Johnson', 'Quentin ...",1520.166667,11.414538
8,2008,BOS,"['Kendrick Perkins', 'Kevin Garnett', 'Paul Pi...",1494.8,11.824993
9,2010,MEM,"['Marc Gasol', 'Mike Conley', 'O.J. Mayo', 'Ru...",1468.333333,4.903519


Basketball aficionados will note that the top two lineups (rows 0 and 1) were from the great Detroit Pistons teams of the mid-2000s. They were kind of famous for playing their starting lineups for most of the game. Why? Because they won a lot of games that way. However, the "7 seconds or less" Phoenix Suns (row 7) from the same era actually won games by bigger margins. 

#### **Query 3. Which lineups were the most efficient (i.e., had the best `plus_minus_36m`)?**

In [4]:
%%bigquery --project nba-lineup-facts
SELECT * 
FROM `nba-lineup-facts.lineup_facts.lineups_w_200mins`
ORDER BY plus_minus_36m DESC
LIMIT 10;

Unnamed: 0,year,team,lineup,minutes,plus_minus_36m
0,2011,DAL,"['Dirk Nowitzki', 'Jason Kidd', 'Jason Terry',...",349.283333,21.541251
1,2018,UTA,"['Donovan Mitchell', 'Jae Crowder', 'Joe Ingle...",212.9,21.136684
2,2020,OKC,"['Chris Paul', 'Danilo Gallinari', 'Dennis Sch...",221.866667,19.471154
3,2008,LAL,"['Derek Fisher', 'Kobe Bryant', 'Lamar Odom', ...",226.55,19.068638
4,2017,GSW,"['Andre Iguodala', 'Draymond Green', 'Kevin Du...",287.583333,18.651985
5,2016,GSW,"['Andre Iguodala', 'Andrew Bogut', 'Draymond G...",234.483333,17.962897
6,2019,PHI,"['Ben Simmons', 'JJ Redick', 'Jimmy Butler', '...",333.6,17.913669
7,2016,CLE,"['J.R. Smith', 'Kevin Love', 'LeBron James', '...",205.95,17.82957
8,2016,GSW,"['Andre Iguodala', 'Draymond Green', 'Harrison...",295.766667,17.649048
9,2009,ORL,"['Courtney Lee', 'Dwight Howard', 'Hedo Turkog...",256.3,17.417089


Here we see the classic "death" lineups from the Golden State Warriors and others. They won games through pure efficiency. These lineups were among the best in the league on offense *and* on defense. While many argued that the Golden State lineups were unfair, it is noteworthy that four lineups were actually even better, and only one of them won a championship. The team with the top lineup, the 2011 Dallas Mavericks, was incredibly underrated, with its win over LeBron James and the rest of the Heatles for the championship framed as an *upset*. Not really. The Mavericks had the most efficient lineup in the dataset, which goes back to 2004.  

#### **Query 4. Speaking of LeBron James, how did his lineups do over the years?**

In [5]:
%%bigquery --project nba-lineup-facts
SELECT * 
FROM `nba-lineup-facts.lineup_facts.lineups_w_200mins`
WHERE lineup LIKE '%LeBron James%'
ORDER BY plus_minus_36m DESC
LIMIT 10;

Unnamed: 0,year,team,lineup,minutes,plus_minus_36m
0,2016,CLE,"['J.R. Smith', 'Kevin Love', 'LeBron James', '...",205.95,17.82957
1,2015,CLE,"['J.R. Smith', 'Kevin Love', 'Kyrie Irving', '...",527.283333,15.36176
2,2011,MIA,"['Chris Bosh', 'Dwyane Wade', 'Joel Anthony', ...",240.883333,13.450495
3,2013,MIA,"['Chris Andersen', 'LeBron James', 'Norris Col...",202.716667,11.89838
4,2009,CLE,"['Ben Wallace', 'Delonte West', 'LeBron James'...",495.1,11.706726
5,2013,MIA,"['Chris Bosh', 'Dwyane Wade', 'LeBron James', ...",332.366667,11.481296
6,2016,CLE,"['J.R. Smith', 'Kevin Love', 'Kyrie Irving', '...",835.216667,10.430826
7,2009,CLE,"['Anderson Varejao', 'Delonte West', 'LeBron J...",746.166667,9.938798
8,2020,LAL,"['Anthony Davis', 'Avery Bradley', 'Danny Gree...",387.95,9.279546
9,2011,MIA,"['Chris Bosh', 'Dwyane Wade', 'Erick Dampier',...",268.083333,8.594343


Here we see the 2016 Cleveland team that upset the Golden State Warriors for the title, a slightly less efficient team from the previous year that lost to the Warriors, James's first Miami team (2011), and then a lot of good but not historically great lineups. That LeBron James has won so many titles is truly remarkable. 
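
Query 4 works because the `lineup` column is stored as text, so `LIKE '%LeBron James%'` matches any row whose lineup string contains that name anywhere. Here is a minimal sketch of the same filter, using an in-memory SQLite table as a stand-in for BigQuery. The table name `lineups`, the team codes, and every row are made up for illustration, and details such as case sensitivity of `LIKE` can differ between database engines.

```python
import sqlite3

# Hypothetical toy rows (NOT from the real dataset): the lineup column
# holds a stringified list of player names, as in the query output above.
rows = [
    (2016, "AAA", "['Player One', 'LeBron James', 'Player Two']", 205.0, 17.8),
    (2016, "BBB", "['Player Three', 'Player Four']", 295.0, 17.6),
    (2011, "CCC", "['LeBron James', 'Player Five']", 240.0, 13.4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineups (year INTEGER, team TEXT, lineup TEXT, "
    "minutes REAL, plus_minus_36m REAL)"
)
conn.executemany("INSERT INTO lineups VALUES (?, ?, ?, ?, ?)", rows)

# The % wildcards on both sides match any text before and after the name.
lebron_lineups = conn.execute(
    """
    SELECT year, team, plus_minus_36m
    FROM lineups
    WHERE lineup LIKE '%LeBron James%'
    ORDER BY plus_minus_36m DESC
    """
).fetchall()
conn.close()

print(lebron_lineups)
```

Only the two rows whose lineup text contains the name come back, sorted from most to least efficient.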

### **Now the Money Shot ...**

The above queries were not accidental. They were used to create the scatter plot below. 
- Each dot represents a lineup with at least 200 minutes played. There are almost 900 of them. 
- The lineups to the far right, at the tip of the arrowhead, are the ones in Query 2. 
- The big gold dots represent the famous Golden State death lineups. Those are (mostly) listed in Query 3. 
- The smaller red dots are lineups with LeBron James, which we uncovered in Query 4.  
- The blue line at the top (called the tradeoff curve) can be found by rerunning Query 3 over and over again with different minute thresholds (200+ minutes, 350+ minutes, 660+ minutes, etc.). Each lineup on that curve could be said to be the best in history! Or at least since 2004.

![nba lineup tradeoff](https://github.com/christopherhuntley/DATA6510/raw/master/img/L3_nba_plot.png)
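
The tradeoff curve described in the list above can be sketched in a few lines: rerun Query 3 with a rising minutes threshold and keep the single best lineup each time. The sketch below is only an illustration under stated assumptions. It uses an in-memory SQLite table instead of BigQuery, the table `lineups` and its five rows are invented, and the thresholds are arbitrary.

```python
import sqlite3

# Hypothetical toy data (NOT the real dataset):
# (lineup label, minutes played, plus_minus_36m)
toy_lineups = [
    ("A", 2000.0, 8.0),
    ("B", 1500.0, 11.0),
    ("C", 900.0, 9.0),
    ("D", 350.0, 21.0),
    ("E", 250.0, 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineups (lineup TEXT, minutes REAL, plus_minus_36m REAL)"
)
conn.executemany("INSERT INTO lineups VALUES (?, ?, ?)", toy_lineups)


def best_lineup(min_minutes):
    """Query 3 with a minutes threshold, keeping only the top row."""
    return conn.execute(
        """
        SELECT lineup, minutes, plus_minus_36m
        FROM lineups
        WHERE minutes >= ?
        ORDER BY plus_minus_36m DESC
        LIMIT 1
        """,
        (min_minutes,),
    ).fetchone()


# Arbitrary thresholds chosen for the toy data.
thresholds = [200, 400, 1000, 1600]

curve = []
for threshold in thresholds:
    best = best_lineup(threshold)
    # Skip thresholds with no qualifying lineup and avoid repeating a point.
    if best is not None and best not in curve:
        curve.append(best)
conn.close()

print(curve)
```

Each point on the resulting curve is the most efficient lineup among those that played at least that many minutes, which is why efficiency falls as the minutes requirement rises.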

> **For those of you who aren't sports fans, take heart: that is the last of the sports examples for a while. How about movies instead?**

## **Congratulations! You've made it to the end of Lesson 3.**

You now know pretty much everything you need to know about `SELECT` queries. If there is anything else you need to know, then at least you have a solid foundation on which to build. 

Quiz 2 will test your understanding of the relevant theory and your ability to write short `SELECT` queries *without the ability to run them in Jupyter*.

## **On your way out ... Be sure to save your work.**
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.