<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 8: SQL DML** 
_Where SQL takes action_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- ACID transactions
- SQL's role in the data ETL process
- The various SQL DML statements
- How transaction controls can be used for multi-step database operations
- The basics of cloud-based RDBMS hosting

### **Skills / Know how to ...**
- Create normalized tables from denormalized datasets
- Use SQL DML for basic CRUD operations
- Avoid common load order issues 
- Use `CASE` expressions to implement complex conversion logic

--------
## **LESSON 8 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/6UsegPkeJKw" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

---
## **BIG PICTURE: CRUD on ACID**
The general theme of this lesson is transactional processing: putting, altering, and deleting data in databases. In past lessons we discussed the four basic CRUD actions in the abstract. Now we will get into the nitty-gritty details, or at least the ones that can be addressed with SQL.

In principle we'd like the data in our database to survive IT armageddon where the power shuts down with no notice, the database is in the middle of a lengthy operation, and the consequences of failure are catastrophic. Then perhaps we can begin to count on it being there when we need it. 

The gold standard for robustness in the face of catastrophic failure is ACID, four properties that together go as far as possible to keep our data safe:
- **Atomicity.** Lengthy transactions with lots of steps are treated as one unit. If any step fails then we can roll it back to the beginning, as if it never happened. 
- **Consistency.** We never want the database to be in an unexpected or unrecoverable state. If given data operations in any given order the result is always the same. There is no ambiguity or uncertainty introduced by the system itself. 
- **Isolation.** Just like people, databases often have to multitask, processing several transactions at once. Ideally, we want to keep failure of any transaction from causing failure of another. They should be running as independently as possible, *especially* when failure risk is high. 
- **Durability.** Once data has been committed by a transaction, it should persist until another transaction alters it. 

If you think about each of these things, you will realize just how fragile most software really is. 
- How far back in time does the "undo" on your word processor let you go? If everything you have written since yesterday afternoon was garbage, could you revert to what you had before, *even if you had never saved it anywhere?* 
- If your computer crashed while it was halfway through saving the latest draft of your senior thesis, would the file be recoverable? Would you lose half your work?
- If you and a classmate are editing the same Google doc and your partner falls asleep at the keyboard, typing an infinite string of J characters ..., can you regain control before everything is destroyed? Or do you have to start over with a new doc? 
- If you start recording a workout on your smart watch but then forget to end the workout before all of the power is drained from your battery, does any data get lost? 

ACID is how we prevent all of these things from blowing up your data. 

Okay, so where do you get ACID? From any modern relational database product. Even the most primitive ones like SQLite or MS Access are better than any other tool for keeping your data safe. They are CRUD on ACID, and that's a good thing. 


---
## **ETL = Extract $\rightarrow$ Transform $\rightarrow$ Load**

It makes the analyst's life much easier when data has been collected expressly for their use, but that is not the typical case. Data can come from anywhere and may require significant scrubbing before it can be trusted. In some cases there may be multiple data sources, with somewhat incompatible data that must be merged into a coherent dataset. 

The general process of working with such *dirty* data is called ETL:
- **Extract** data from the original sources.
- **Transform** and integrate it to fit the new purpose.
- **Load it** into a central data repository that will protect the data from corruption.

While there are certainly other tools for this purpose, SQL is a great place to start:
- Modern relational databases include utilities for working with data in various formats. 
- SQL includes plenty of functions for transforming data from one data type to another *plus* the power of SQL queries to bring it all together into a useful form. **If SQL is not enough, then use another tool as well.** SQL is already compatible with just about every programming language on earth. It has serious first mover advantage from decades of heavy use.
- When you are done, the data can reside safely in a database with the guarantees of full ACID compliance. 

After reviewing the syntax and function of SQL `INSERT`, `UPDATE`, and `DELETE` statements, we will consider a few special cases that put ACID principles to the test.  

 


---
## **SQL `INSERT` Statements** 
We use `INSERT` statements to add rows to a table. There are two basic forms:
- Adding new data (values) to the database
- Adding table data extracted (selected) from another table

### **`INSERT INTO ... VALUES`**

```sql
INSERT INTO tablename ( columnlist ) VALUES
  ( valuelist ); 
```

- `columnlist` and `valuelist` are comma-separated lists of column names and literal values. The two lists have exactly the same number of items, with the first column corresponding to the first value, etc.
- Any columns not included in the `columnlist` are assigned a NULL value unless a `DEFAULT` value is specified.
- If the table has a surrogate primary key, then generally we do not want to include the primary key column; the database will generate it for us.
- When we say *literal values* in the `valuelist` we mean values written as they would appear in a `WHERE` clause, such as the value to the right of the `=` in a boolean expression. 
- The parentheses and trailing semicolon are not optional. 

We can insert multiple rows at a time as follows. 

```sql
INSERT INTO tablename ( columnlist ) VALUES
  ( valuelist1 ),
  ( valuelist2 ), 
  ...
  ;
```

- That's a list of `valuelist` items, one per row.
- There is no comma just before the semicolon.

For example, the following adds two new movies to the Movies Tonight database:
```sql
INSERT INTO movies ( title, rating ) VALUES 
  ('Romeo and Juliet', 'PG-13'),
  ('A Time to Kill','PG-13');
```
Note that the `movieID` column was omitted because it is autogenerated by the database. 
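
To see the autogenerated key and the `DEFAULT` rule in action, here is a minimal sketch using Python's built-in `sqlite3` module and a throwaway in-memory database. The table definition is an assumption made for illustration (in particular, the `'NR'` default is hypothetical) and may not match your Lesson 7 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is saved to disk
conn.execute("""
    CREATE TABLE movies (
        movieID INTEGER PRIMARY KEY,   -- surrogate key, generated by SQLite
        title   TEXT NOT NULL,
        rating  TEXT DEFAULT 'NR'      -- hypothetical default, for illustration only
    )
""")

# Two rows in one statement; movieID is omitted and filled in by the database
conn.execute("""
    INSERT INTO movies (title, rating) VALUES
      ('Romeo and Juliet', 'PG-13'),
      ('A Time to Kill', 'PG-13')
""")

# rating is omitted here, so the DEFAULT value is used instead of NULL
conn.execute("INSERT INTO movies (title) VALUES ('Space Jam')")

rows = conn.execute(
    "SELECT movieID, title, rating FROM movies ORDER BY movieID"
).fetchall()
for row in rows:
    print(row)
```

Without the `DEFAULT` clause in the table definition, the third row's `rating` would be NULL instead.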


### **`INSERT INTO ... SELECT`**
If the data is already in the database in some form, then we can use a `SELECT` query to  gather the data values prior to insertion.

```sql
INSERT INTO tablename ( columnlist ) 
  SELECT ...
  ;
```

- As with inserting literal values, the columns returned by the `SELECT` query must correspond to the ones in the `columnlist`.
- The actual names of the columns returned by the `SELECT` query do not matter, though the data types should be compatible with what is already defined in the table. 

Another movies example, this time using data [imported from IMDB](https://www.imdb.com/interfaces/):

```sql
INSERT INTO movies ( title )  
  SELECT primaryTitle 
  FROM imdb_title_basics_import 
  WHERE startYear = '1996';
```

- Since IMDB does not provide US movie ratings in its public data dumps, the `rating` column was omitted from the `columnlist`. That also means that the `rating` column has to allow null values. Otherwise the insertion will fail. 
- If we want to keep track of the `tconst` movie identifier used by IMDB then we will have to add another column to the `movies` table. 
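
The sketch below tries out both points on a tiny stand-in for the IMDB import table. The three rows and their `demo-*` identifiers are made up, and the extra `tconst` column on `movies` is an assumption about how one might keep the IMDB identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the real import table; the rows are made-up placeholders
    CREATE TABLE imdb_title_basics_import (tconst TEXT, primaryTitle TEXT, startYear TEXT);
    INSERT INTO imdb_title_basics_import VALUES
      ('demo-1', 'Made-Up Title A', '1996'),
      ('demo-2', 'Made-Up Title B', '1995'),
      ('demo-3', 'Made-Up Title C', '1996');

    CREATE TABLE movies (
        movieID INTEGER PRIMARY KEY,
        title   TEXT,
        rating  TEXT,   -- must allow NULL because the import has no ratings
        tconst  TEXT    -- hypothetical extra column for the IMDB identifier
    );
""")

# The SELECT columns line up with the columnlist by position, not by name
conn.execute("""
    INSERT INTO movies (title, tconst)
      SELECT primaryTitle, tconst
      FROM imdb_title_basics_import
      WHERE startYear = '1996'
""")

rows = conn.execute(
    "SELECT title, rating, tconst FROM movies ORDER BY title"
).fetchall()
for row in rows:
    print(row)
```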



---
## **SQL `UPDATE` Statements**

SQL `UPDATE` statements set specific column values on selected rows. 

```sql
UPDATE tablename 
SET
  column1 = newvalue1,
  column2 = newvalue2,
  ...
WHERE ...
;
```
- Only the columns that are being updated need to be included.
- The `WHERE` clause works just like in a `SELECT` query.
- A new value can be any expression that returns a scalar value. That includes subqueries. 
- It is possible to use joins to update several tables at once. However, that is fairly new to the SQL standards and not likely to work in older (legacy) databases. It won't work in MySQL 5.7, for example, but it does work in MySQL 8.0. The workaround is to use subqueries (with joins) instead. 

Here we are updating the *Romeo + Juliet* movie title to its proper name.

```sql 
UPDATE movies 
SET
  title = 'Romeo + Juliet' 
WHERE
  movieID = 24;
```
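
A new value can also come from a subquery. The following is a sketch only: `ratings_import` is a hypothetical staging table, and the ratings reuse the values from the earlier `INSERT` example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movieID INTEGER PRIMARY KEY, title TEXT, rating TEXT);
    INSERT INTO movies (title) VALUES
      ('Romeo + Juliet'), ('A Time to Kill'), ('Space Jam');

    -- Hypothetical staging table holding ratings to be copied over
    CREATE TABLE ratings_import (title TEXT, rating TEXT);
    INSERT INTO ratings_import VALUES
      ('Romeo + Juliet', 'PG-13'), ('A Time to Kill', 'PG-13');
""")

# The scalar subquery is evaluated once per updated row.
# The WHERE clause keeps movies without an imported rating untouched.
conn.execute("""
    UPDATE movies
    SET rating = (SELECT r.rating
                  FROM ratings_import AS r
                  WHERE r.title = movies.title)
    WHERE title IN (SELECT title FROM ratings_import)
""")

rows = conn.execute(
    "SELECT movieID, title, rating FROM movies ORDER BY movieID"
).fetchall()
for row in rows:
    print(row)
```

Dropping the outer `WHERE` would still run, but any movie missing from `ratings_import` would have its rating overwritten with NULL.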

---
## **SQL `DELETE` Statements**

Deleting rows is about as easy as it gets. 

```sql
DELETE FROM tablename
WHERE ...
;
```

- There is no need to specify columns.
- If the `WHERE` clause is omitted then *every* row is deleted. 
- We can delete from multiple tables at a time with slightly altered syntax. However, it is not universally supported. 
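
A quick sketch of how much the `WHERE` clause matters, using made-up rows in an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movieID INTEGER PRIMARY KEY, title TEXT, rating TEXT);
    INSERT INTO movies (title, rating) VALUES
      ('Demo Movie A', 'PG'),
      ('Demo Movie B', 'R'),
      ('Demo Movie C', 'R');
""")

def count_movies():
    """Return the number of rows currently in the movies table."""
    return conn.execute("SELECT count(*) FROM movies").fetchone()[0]

# With a WHERE clause only the matching rows are removed
conn.execute("DELETE FROM movies WHERE rating = 'R'")
after_filtered_delete = count_movies()

# Without a WHERE clause every remaining row is removed
conn.execute("DELETE FROM movies")
after_unfiltered_delete = count_movies()

print(after_filtered_delete, after_unfiltered_delete)
```

One habit that helps: run the same `WHERE` clause in a `SELECT` first to see which rows would be deleted.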














---
## **Load Order and Transactions**
Maintaining referential integrity is a continual process. The DBMS is always on the lookout for integrity violations. Each query is treated as an *atomic transaction* that can be undone (rolled back) if it does not complete successfully. Thus, if an update query sets a foreign key to an impossible value or nullifies something that can't be null, then the database will immediately complain and return the database state to whatever it was before the query. 

While that is a very reasonable and safe way to approach data integrity, it has some implications for how and when we load data into a given table. We will first consider cases where **table load order** can be used to avoid referential integrity violations. Then we will consider cases where we have to go further, using custom transaction controls to force the database to do what we need it to. 

### **Strongest First Loading**
The vast majority of referential integrity problems can be prevented by taking care about the order in which we insert and delete records. 

Consider any parent-child relationship where the parent must exist before the child. 

The process to add a new child row is then:
1. If the parent doesn't exist then add the parent first.
2. Once the parent exists, then add the child. 

Deleting the parent can cause the opposite problem, as all children will need to be deleted before the parent. We can use `ON DELETE CASCADE` in the foreign key constraints to handle that automatically.
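
Both directions can be sketched in SQLite as follows. The two tables are simplified stand-ins rather than the full Movies Tonight schema. Note that SQLite only enforces foreign keys on connections where `PRAGMA foreign_keys = ON` has been run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE movies (movieID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE shows (
        showID   INTEGER PRIMARY KEY,
        showtime TEXT,
        movieID  INTEGER NOT NULL
                 REFERENCES movies(movieID) ON DELETE CASCADE
    );
    -- Parent first, then children
    INSERT INTO movies (movieID, title) VALUES (1, 'Space Jam');
    INSERT INTO shows (showtime, movieID) VALUES ('17:00', 1), ('18:00', 1);
""")

# A child whose parent does not exist is rejected
try:
    conn.execute("INSERT INTO shows (showtime, movieID) VALUES ('19:00', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError as err:
    orphan_rejected = True
    print("Rejected:", err)

# Deleting the parent removes its children automatically
conn.execute("DELETE FROM movies WHERE movieID = 1")
shows_left = conn.execute("SELECT count(*) FROM shows").fetchone()[0]
print("Shows left:", shows_left)
```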

When applied to a whole database the load order is strongest to weakest:
1. Load all the strong entities.
2. Load all weak entities that only depend on strong entities.
3. Load any entities that only rely on #1 and #2.
4. ...

We will see this strategy in place with the Movies Tonight case a little further down. 

### **Transaction Control**

Sometimes just taking care with load order is not enough. For those cases we use transaction control. 

Consider the classic parent-child-grandchild case, where there is a whole hierarchy of entities to be saved at once. This might happen, for example, when saving a new customer, the customer's order invoice, and the invoice line items. Based on the Strongest to Weakest rule, we would save the parent (customer record), then the child (invoice), and then the grandchildren (line items). 

The SQL code might look something like this:

```sql
INSERT INTO customers ...
INSERT INTO invoices ...  -- MySQL: use LAST_INSERT_ID() function to get the new customer id 
                          -- SQLite: use LAST_INSERT_ROWID()
INSERT INTO invoice_items ...
```

However, what happens if there is a problem saving one of the grandchildren? Then the entire transaction should be voided, including the invoice and new customer record. Instead of deleting them one by one, we can use a transaction block:

```sql
 
BEGIN;        -- Start a new Transaction
INSERT INTO customers ...  
INSERT INTO invoices ...
INSERT INTO invoice_items ...
COMMIT;       -- Finalize the Transaction
```

If the transaction fails before the `COMMIT` statement, then all changes made during the transaction are ignored. It's like it never happened. 
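
Here is a minimal sketch of that behavior from Python. The three tables and their columns are assumptions made for the example. The `with conn:` block commits if everything inside it succeeds and rolls back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoices (
        invoiceID  INTEGER PRIMARY KEY,
        customerID INTEGER NOT NULL REFERENCES customers(customerID)
    );
    CREATE TABLE invoice_items (
        itemID      INTEGER PRIMARY KEY,
        invoiceID   INTEGER NOT NULL REFERENCES invoices(invoiceID),
        description TEXT NOT NULL
    );
""")

try:
    with conn:  # one transaction: all three levels are saved, or none of them
        cur = conn.execute("INSERT INTO customers (name) VALUES ('Pat Example')")
        customer_id = cur.lastrowid  # the new parent's key
        cur = conn.execute(
            "INSERT INTO invoices (customerID) VALUES (?)", (customer_id,)
        )
        invoice_id = cur.lastrowid
        conn.execute(
            "INSERT INTO invoice_items (invoiceID, description) VALUES (?, ?)",
            (invoice_id, "Widget"),
        )
        # This grandchild violates NOT NULL, which voids the whole transaction
        conn.execute(
            "INSERT INTO invoice_items (invoiceID, description) VALUES (?, ?)",
            (invoice_id, None),
        )
except sqlite3.IntegrityError as err:
    print("Transaction rolled back:", err)

counts = [
    conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    for table in ("customers", "invoices", "invoice_items")
]
print(counts)
```

All three counts come back as zero: the customer, the invoice, and the first line item were undone along with the failed insert.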

Generally, multistep transactions like this are packaged together as **stored procedures**. However, since creating stored procedures is more of a task for data engineers than data analysts, it is beyond the scope of this course. 

#### **Solving the Twinning Problem**
Transaction control can be used to solve the mandatory twins issue from Lesson 6. Depending on the database vendor, there are two different approaches. 

The most common (and theoretically correct) approach is **deferred commit**, where specific foreign key checks are marked as *deferrable* if they are inside of a transaction block. The keys are then checked as part of the `COMMIT` at the end. [This is how it works in SQLite](https://www.sqlite.org/foreignkeys.html), for example. 
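
A sketch of the deferred approach in SQLite follows. The `persons` and `passports` tables are hypothetical twins invented for this example, not the ones from Lesson 6. Each table requires a matching row in the other, so neither row could be inserted first if the checks ran immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE persons (
        personID   INTEGER PRIMARY KEY,
        name       TEXT,
        passportID INTEGER NOT NULL
                   REFERENCES passports(passportID) DEFERRABLE INITIALLY DEFERRED
    );
    CREATE TABLE passports (
        passportID INTEGER PRIMARY KEY,
        personID   INTEGER NOT NULL
                   REFERENCES persons(personID) DEFERRABLE INITIALLY DEFERRED
    );
""")

# Both twins in one transaction: the FK checks wait until the COMMIT
with conn:
    conn.execute(
        "INSERT INTO persons (personID, name, passportID) VALUES (1, 'Pat Example', 10)"
    )
    conn.execute("INSERT INTO passports (passportID, personID) VALUES (10, 1)")

# Only one twin: the COMMIT itself fails
try:
    with conn:
        conn.execute(
            "INSERT INTO persons (personID, name, passportID) VALUES (2, 'Lee Example', 20)"
        )
    lonely_twin_rejected = False
except sqlite3.IntegrityError as err:
    conn.rollback()  # make sure the failed transaction is fully undone
    lonely_twin_rejected = True
    print("Commit refused:", err)

person_count = conn.execute("SELECT count(*) FROM persons").fetchone()[0]
print(person_count)
```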

A more risky approach is to **explicitly disable foreign key checks** during a given transaction block. In MySQL this looks like:

```sql 
BEGIN;
SET foreign_key_checks = 0; -- disables FK checking
...
SET foreign_key_checks = 1; -- reenables FK checking
COMMIT;
```

The risk here is that one might forget the `SET foreign_key_checks = 1;` step just before the `COMMIT`. 

---
## **Movies Tonight, Part 4**

We will finish off the Movies Tonight case by extracting, transforming, and loading the data from a single CSV file.  

![ERD from Lesson 5](https://github.com/christopherhuntley/DATA6510/raw/master/img/L6_MoviesTonight_v2.png)

- `Artists(`**`artistID`**, `name)`
- `Movies(`**`movieID`**, `title,rating)`
- `Theaters(`**`theaterID`**, `name, location, phone)`
- `Credits(`**`creditID`**, `credit_code`, <u>`movieID`</u>,<u>`artistID`</u>`)`
- `Shows(`**`showID`**, `showtime`, <u>`movieID`</u>,<u>`theaterID`</u>`)`

### **Setup (Again)**

The code below creates a folder in Google Drive for our SQLite database. 











In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the DATA6510/data/MoviesTonight folder in Google Drive
from pathlib import Path
data_root = Path("./drive/My Drive/Colab Notebooks/DATA6510")
if not data_root.exists():
  print(
      '''
      Warning! The folder '/Colab Notebooks/DATA6510' could not be found in the connected Google Drive. 
      Please make 100% sure that both Colab and Chrome are set up to use your @student.fairfield.edu account. 
      For now, a new folder with the correct path has been created in whatever Google Drive it found. 
      ''')
data_root = data_root / 'data' / 'MoviesTonight'
data_root.mkdir(parents=True, exist_ok=True)

Mounted at /content/drive


In [2]:
%%bash
ln -s drive/My\ Drive/Colab\ Notebooks/DATA6510 data6510

In [3]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Database connection
%sql sqlite:///data6510/data/MoviesTonight/MoviesTonight.db

'Connected: @data6510/data/MoviesTonight/MoviesTonight.db'

The database connection should reopen your database from Lesson 7. Now we just need to insert data.

### **Importing from CSV**


In [None]:
# retrieve the DATASET.csv file
dataset_df = pd.read_csv('https://raw.githubusercontent.com/christopherhuntley/DATA6510/master/data/MoviesTonight/DATASET.csv')
# use the lowercase data6510 symlink created above; Colab paths are case-sensitive
conn = sqlite3.connect('data6510/data/MoviesTonight/MoviesTonight.db') 
dataset_df.to_sql('DATASET',conn,if_exists='replace',index=False)

In [None]:
%%sql @data6510/data/MoviesTonight/MoviesTonight.db
SELECT * FROM DATASET LIMIT 10;

Done.


TName,Location,Phone,MTitle,ShowTime,Rating,CCode,CName
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Austin Pendleton
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Bebe Neuwirth
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Dianne Wiest
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Eli Wallach
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Kenny Kerr
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Lainie Kazan
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Tim Daly
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,A,Whoopi Goldberg
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",4:20 PM,PG-13,D,Donald Petrie
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899,"Associate, The",7:20 PM,PG-13,A,Austin Pendleton


### **Populating the Strong Entity Tables**
There are three tables without any foreign keys:
- Theaters
- Movies
- Artists

These can be created directly from the `DATASET` table. However, we won't do it all at once. **To be sure we know what will be inserted, always write the SELECT query first.** We'll do it in slow motion below but in real life you would just use one cell, rerunning with each step.

#### **The `theaters` Table**

**Pass 1: `SELECT` ONLY**

In [None]:
%%sql
-- Select the data for the theaters table 
-- Note the use of DISTINCT here; very important
-- There should be nine theaters with no duplicates
-- SQLite handles the theaterID for us
SELECT DISTINCT tname,location,phone 
FROM DATASET; 

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


TName,Location,Phone
Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899
Cinema Star The Ultraplex 14,"Mission Grove, Riverside",(909) 342-2256
General Cinema Rancho 6,"I-215 At Mt. Vernon S. At I-10, San Bernardino",(714) 370-2085
Pacific Inland Center,"Inland Center Mall, San Bernardino",(714) 381-1611
SOCAL Canyon Crest Cinema,"Central Avenue South Of 60 Freeway Near Ucr, Riverside",(909) 682-6900
SOCAL Canyon Springs Cinema,"East Of I-215 On 60 Freeway At Day Street Canyon, Moreno Valley",(909) 782-0800
SOCAL Marketplace Cinema,"University/mission Inn Exits East Of 91 Freeway On, Riverside",(909) 682-4040
United Artists Riverside (Galleria) Tyler Mall,"Riverside Fwy Tyler, Riverside",(714) 689-802
United Artists Riverside Park Sierra,"3600 Park Sierra Dr., Riverside",(909) 359-6995


**Pass 2: `INSERT` and test that it worked**

In [None]:
%%sql
-- Populating the theaters table
-- make sure the table is empty
DELETE FROM theaters;

-- insert selected data
INSERT INTO theaters (name,location,phone) 
SELECT DISTINCT tname,location,phone 
FROM DATASET;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.
Done.


[]

In [None]:
%%sql
-- There are 9 theaters
SELECT * 
FROM theaters;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


theaterID,name,location,phone
1,Akarakian Theatres Moreno 4 Cinemas,"The Intersection Of Alessandro + Perris Blvds, Moreno Valley",(909) 485-2899
2,Cinema Star The Ultraplex 14,"Mission Grove, Riverside",(909) 342-2256
3,General Cinema Rancho 6,"I-215 At Mt. Vernon S. At I-10, San Bernardino",(714) 370-2085
4,Pacific Inland Center,"Inland Center Mall, San Bernardino",(714) 381-1611
5,SOCAL Canyon Crest Cinema,"Central Avenue South Of 60 Freeway Near Ucr, Riverside",(909) 682-6900
6,SOCAL Canyon Springs Cinema,"East Of I-215 On 60 Freeway At Day Street Canyon, Moreno Valley",(909) 782-0800
7,SOCAL Marketplace Cinema,"University/mission Inn Exits East Of 91 Freeway On, Riverside",(909) 682-4040
8,United Artists Riverside (Galleria) Tyler Mall,"Riverside Fwy Tyler, Riverside",(714) 689-802
9,United Artists Riverside Park Sierra,"3600 Park Sierra Dr., Riverside",(909) 359-6995


We can then repeat the process for the other two tables, shown here in one cell each. Even here the code was originally written in two passes: first the `SELECT` to get the data right, then the `INSERT` added on the line above to populate the table.

#### **The `movies` Table**

In [None]:
%%sql
-- Populate the movies table
-- make sure the table is empty
DELETE FROM movies;

-- insert selected data
INSERT INTO movies(title,rating)
SELECT DISTINCT mtitle,rating
FROM DATASET; 

-- There are 23 movies
SELECT * FROM movies;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.
Done.
Done.


movieID,title,rating
1,"Associate, The",PG-13
2,"Ghost & The Darkness, The",R
3,Independence Day,PG-13
4,D3: The Mighty Ducks,PG
5,Dear God,
6,"First Wives Club, The",PG-13
7,High School High,PG-13
8,Larger Than Life,PG
9,"Mirror Has Two Faces, The",PG-13
10,Ransom,R


#### **The `artists` Table**

In [None]:
%%sql
-- Populate the artists table
-- make sure the table is empty
DELETE FROM artists;

-- insert selected data
INSERT INTO artists (name)
SELECT DISTINCT cname 
FROM DATASET;

-- There are 152 artists
SELECT * FROM artists;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.
Done.
Done.


artistID,name,bio
1,Austin Pendleton,
2,Bebe Neuwirth,
3,Dianne Wiest,
4,Eli Wallach,
5,Kenny Kerr,
6,Lainie Kazan,
7,Tim Daly,
8,Whoopi Goldberg,
9,Donald Petrie,
10,Bernard Hill,


### **Populating the Weak Entity Tables**
With the `theaters`, `movies`, and `artists` tables populated, including valid primary keys, we can now populate the `credits` and `shows` tables. However, there are two minor issues:
- How to set the foreign keys, given that the primary keys of the first three tables are not in the  `DATASET` table.
- The `ShowTime` values are nonstandard. 

We will address each of these issues as we populate the `credits` and `shows` tables one at a time. 

Note: The queries below were developed in several passes, just like with the first three tables.   

#### **The `credits` Table**

Since `movieID` and `artistID` do not exist in the `DATASET` table, we need to join the `movies` and `artists` tables with the `DATASET` table. If we are willing to assume that artist names and movie titles are unique (at least in the data we have on hand), we can do it like this:

In [None]:
%%sql
-- Populate the credits table
-- make sure the table is empty
DELETE FROM credits;

-- insert selected data
INSERT INTO credits(credit_code,movieID,artistID)
SELECT DISTINCT ccode, movieID, artistID 
FROM DATASET 
  JOIN movies ON (movies.title = DATASET.mtitle)  -- note: join on titles
  JOIN artists ON (artists.name = DATASET.cname); -- note: join on names; 

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.
Done.
161 rows affected.


[]

In [None]:
%%sql
SELECT * FROM credits LIMIT 10;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


creditID,credit_code,movieID,artistID
1,A,1,1
2,A,1,2
3,A,1,3
4,A,1,4
5,A,1,5
6,A,1,6
7,A,1,7
8,A,1,8
9,D,1,9
10,A,1,1


**If we were going to join on the names and titles anyway, then why use surrogate keys?** To handle the data we have not seen yet. After this initial load with a small number of movies and artists, every subsequent insertion will need to use transaction control to keep the keys matched. That means adding one row at a time, not in bulk. 

#### **The `shows` Table**
The `shows` table has the added complexity that the `ShowTime` column uses a nonstandard time format that is not supported by SQLite. We handle that with the somewhat ugly `CASE` expression below.

In [None]:
%%sql
-- Populate the shows table
-- make sure the table is empty
DELETE FROM shows;

-- insert selected data
INSERT INTO shows(movieID,theaterID,showtime)
SELECT DISTINCT
    movieID,
    theaterID,
    -- translates the showtime to ISO format
    CASE     
      WHEN upper(ShowTime) LIKE '%AM' AND substr(ShowTime,1,instr(ShowTime,':')-1) = '12' 
          THEN printf('00:%2s', substr(ShowTime,instr(ShowTime,':')+1,2))
      -- compare against the number 12: in SQLite an integer always sorts before a text value like '12'
      WHEN upper(ShowTime) LIKE '%PM' AND substr(ShowTime,1,instr(ShowTime,':')-1)+0 < 12 
          THEN printf('%2i:%2s', substr(ShowTime,1,instr(ShowTime,':')-1) + 12,substr(ShowTime,instr(ShowTime,':')+1,2))
      ELSE printf('%s:%2s', substr(ShowTime,1,instr(ShowTime,':')-1),substr(ShowTime,instr(ShowTime,':')+1,2))
    END AS iso_time
FROM DATASET 
      JOIN movies ON (movies.title = DATASET.mtitle) 
      JOIN theaters ON (theaters.name = DATASET.tname);


SELECT * FROM shows LIMIT 10;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.
Done.
Done.


showID,showtime,movieID,theaterID
1,16:20,1,1
2,19:20,1,1
3,21:40,1,1
4,17:10,2,1
5,21:00,2,1
6,21:05,2,1
7,19:00,3,1
8,16:40,1,2
9,19:30,1,2
10,14:30,4,2


> **Heads Up:** Details about the logic of the `CASE` expression are provided in the PRO TIPS section.

### **Kicking the Tires with a Few Queries**

The following queries test whether:
- We got the right number of rows in each table
- The foreign keys match

In [None]:
%%sql
-- There should be 9 theaters
SELECT count(*) FROM theaters; 

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


count(*)
9


In [None]:
%%sql
-- There should be 23 movies
SELECT count(*) FROM movies; 

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


count(*)
23


In [None]:
%%sql
-- There should be 152 artists
SELECT count(*) FROM artists;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


count(*)
152


In [None]:
%%sql
-- There should be 161 credits
SELECT count(*) FROM credits;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


count(*)
161


In [None]:
%%sql
-- There should be 131 shows
SELECT count(*) FROM shows;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


count(*)
131


In [None]:
%%sql
-- This query tries out artist --> credit --> movies
SELECT name,credit_code
FROM artists 
  JOIN credits USING (artistID) 
  JOIN movies USING (movieID)
WHERE title = 'Space Jam';

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name,credit_code
Bill Murray,A
Michael Jordan,A
Theresa Randle,A
Wayne Knight,A
Joe Pytka,D


In [None]:
%%sql
-- This query tries out theaters --> shows --> movie 
SELECT name, showtime, title
FROM theaters 
  JOIN shows USING (theaterID) 
  JOIN movies USING (movieID)
WHERE title = 'Space Jam';

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name,showtime,title
Cinema Star The Ultraplex 14,17:00,Space Jam
Cinema Star The Ultraplex 14,18:00,Space Jam
Cinema Star The Ultraplex 14,19:15,Space Jam
Cinema Star The Ultraplex 14,20:15,Space Jam
SOCAL Marketplace Cinema,16:45,Space Jam
SOCAL Marketplace Cinema,17:50,Space Jam
SOCAL Marketplace Cinema,19:15,Space Jam
SOCAL Marketplace Cinema,20:10,Space Jam
SOCAL Marketplace Cinema,21:30,Space Jam
United Artists Riverside (Galleria) Tyler Mall,17:25,Space Jam


In [None]:
%%sql
-- What shows are there after "22:00" (10pm)
SELECT title AS movie, name AS theater, showtime
FROM theaters 
  JOIN shows USING (theaterID) 
  JOIN movies USING (movieID)
WHERE showtime > '22:00'   -- takes advantage of ISO format's natural lexicographic ordering
ORDER BY showtime;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


movie,theater,showtime
"Ghost & The Darkness, The",Pacific Inland Center,22:05
Set It Off,United Artists Riverside Park Sierra,22:05
"Mirror Has Two Faces, The",United Artists Riverside Park Sierra,22:20
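
Another way to confirm that the foreign keys match is to look for orphans directly: child rows whose parent is missing. The sketch below plants one bad row on purpose in a simplified, made-up pair of tables. The same `LEFT JOIN ... IS NULL` pattern could be pointed at the real `shows` and `credits` tables, where it should return no rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movieID INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE shows (showID INTEGER PRIMARY KEY, showtime TEXT, movieID INTEGER);
    INSERT INTO movies VALUES (1, 'Space Jam');
    -- The third show points at a movie that does not exist
    INSERT INTO shows (showtime, movieID) VALUES
      ('17:00', 1), ('18:00', 1), ('19:00', 99);
""")

# LEFT JOIN keeps every show; unmatched shows get NULLs in the movies columns
orphans = conn.execute("""
    SELECT shows.showID, shows.movieID
    FROM shows
      LEFT JOIN movies ON (movies.movieID = shows.movieID)
    WHERE movies.movieID IS NULL
""").fetchall()
print(orphans)
```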


### **A few more queries just for fun (sort of)**

Make sure you understand why each of these works. Each relies on a thorough understanding of advanced `SELECT` queries. 

In [None]:
%%sql
-- Who appeared in more than one movie?
SELECT name
FROM movies 
  JOIN credits USING (movieID) 
  JOIN artists USING (artistID)
GROUP BY artistID, name
HAVING count(*) > 1;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name
Vivica Fox
Rob Lieberman
Bill Murray
Barbra Streisand
Minnie Driver
Diane Venora
Campbell Scott
Stanley Tucci
Jada Pinkett


In [None]:
%%sql
-- Who was an actor and director in the same movie?
SELECT name, title
FROM artists 
  JOIN credits AS c1 USING (artistID) 
  JOIN movies USING (movieID) 
  JOIN credits AS c2 using (movieID,artistID) -- note: credits joined in twice!
WHERE c1.credit_code = 'A' and c2.credit_code = 'D';

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name,title
Rob Lieberman,D3: The Mighty Ducks
Barbra Streisand,"Mirror Has Two Faces, The"
Campbell Scott,Big Night
Stanley Tucci,Big Night


In [None]:
%%sql
-- Who was an actor and director in the same movie? 
-- This time using aggregation instead of extra joins
SELECT name, title
FROM artists 
  JOIN credits USING (artistID) 
  JOIN movies USING (movieID)
WHERE credit_code in ('A','D')
GROUP BY artistID, movieID, name, title
HAVING count(credit_code) > 1;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name,title
Rob Lieberman,D3: The Mighty Ducks
Barbra Streisand,"Mirror Has Two Faces, The"
Campbell Scott,Big Night
Stanley Tucci,Big Night


In [None]:
%%sql
-- Which actors costarred with Eli Wallach?
SELECT a2.name
FROM artists AS a1 
  JOIN credits AS c1 USING (artistID) 
  JOIN credits AS c2 USING (movieID) 
  JOIN artists AS a2 ON (c2.artistID = a2.artistID)
WHERE a1.name = 'Eli Wallach' AND c1.credit_code ='A' AND c2.credit_code ='A' AND a2.name <> 'Eli Wallach';

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


name
Austin Pendleton
Bebe Neuwirth
Dianne Wiest
Kenny Kerr
Lainie Kazan
Tim Daly
Whoopi Goldberg


### **A little cleanup before we go**
Now that we are done with it, we should drop the `DATASET` table. It's an artifact of the ETL process and not intended to be used as data.

In [None]:
%%sql
DROP TABLE IF EXISTS DATASET;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


[]

> **Heads Up:** Why did we use ALL_CAPS for the `DATASET` table name? Because we wanted it to stand out from the others. It was only needed for as long as we were using it to load the other tables. ALL_CAPS was a reminder to drop it. 

---
## **PRO TIPS: How to handle nonstandard data formats**

**This is a somewhat advanced topic. Try to follow along but know that you will not be quizzed on it.**

Sometimes the source data will be formatted in a way that is not compatible with your database design. While the correct answer is usually to handle it before loading into the database $-$ Python and pandas are designed just for this sort of thing $-$ we can handle some of this using SQL itself. 
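
For comparison, the Python route might look like the sketch below. The helper name `to_iso_time` is made up for this example, and it assumes every value looks like `'4:20 PM'` (hour, colon, minutes, a space, then AM or PM).

```python
from datetime import datetime

def to_iso_time(show_time: str) -> str:
    """Convert a 12-hour time such as '4:20 PM' to 24-hour 'HH:MM'."""
    parsed = datetime.strptime(show_time.strip().upper(), "%I:%M %p")
    return parsed.strftime("%H:%M")

for sample in ("4:20 PM", "12:05 AM", "12:30 PM", "9:15 AM"):
    print(sample, "->", to_iso_time(sample))
```

With pandas, a function like this could be applied to the `ShowTime` column of the DataFrame before calling `to_sql`. Note that it zero-pads morning hours (`'09:15'`), which keeps the text sorting correctly.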

We saw an example in the Movies Tonight case, where we had to use a `CASE` expression to translate nonstandard show times to work in SQLite, which does not have native support for DATETIME data. From the docs:
> SQLite has no DATETIME data type. Instead, dates and times can be stored in any of these ways:
>- As a TEXT string in the ISO-8601 format. Example: '2018-04-02 12:13:46'.
>- As an INTEGER number of seconds since 1970 (also known as "unix time").
>- As a REAL value that is the fractional Julian day number.
>
>The [built-in date and time functions](https://sqlite.org/lang_datefunc.html) of SQLite understand date/times in all of the formats above, and can freely change between them. Which format you use, is entirely up to your application.

We have time string data in TEXT format. However, our time strings are not in ISO-8601 format, which uses 24-hour military times instead of AM and PM. Thus, the time '2:00 PM' should be '14:00' in ISO format. Unfortunately, SQLite does not provide a built-in function to do the conversion, so we will need to handle the translation ourselves using `CASE` expressions. 
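To see why the conversion matters, we can hand both formats to SQLite's `time()` function. This is a minimal sketch that uses Python's built-in `sqlite3` module with a throwaway in-memory database (an assumption for illustration, not the Movies Tonight database).

```python
import sqlite3

# Throwaway in-memory database, used only to evaluate expressions
conn = sqlite3.connect(":memory:")

# time() understands 24-hour ISO-8601 strings ...
iso_result = conn.execute("SELECT time('14:00')").fetchone()[0]

# ... but gives back NULL (None in Python) for a 12-hour AM/PM string
ampm_result = conn.execute("SELECT time('2:00 PM')").fetchone()[0]

print(iso_result, ampm_result)
```

The second result is `None`, which is SQLite's way of saying it could not parse the string as a time.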

The following shows the process used to develop the `CASE` expression for Movies Tonight. 

**Pass 1: Detecting AM/PM**  
Let's start with recognizing that 'AM' has different rules from 'PM':
- We will be converting the `ShowTime` column from the `DATASET` table. 
- The `upper()` function converts everything to uppercase. We can use that to eliminate uppercase vs lowercase bugs.
- The `LIKE` comparator allows us to look for patterns in strings; '%PM' matches any number of characters followed by 'PM'.



In [None]:
%%sql
SELECT
    ShowTime,
    CASE  
      WHEN upper(ShowTime) LIKE '%PM' THEN 'it is PM'
      ELSE 'it is AM'
    END AS `AM or PM`
FROM DATASET
LIMIT 10;
   

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


ShowTime,AM or PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
4:20 PM,it is PM
7:20 PM,it is PM
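The AM/PM test can also be tried on a few hand-picked strings. The sketch below assumes an in-memory SQLite database and a hypothetical helper named `is_pm`; the sample strings are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def is_pm(show_time):
    """Return True if upper(show_time) matches the pattern '%PM' in SQLite."""
    row = conn.execute("SELECT upper(?) LIKE '%PM'", (show_time,)).fetchone()
    return bool(row[0])

# Made-up samples: mixed case, an AM time, and a trailing space
for sample in ["4:20 PM", "4:20 pm", "11:00 AM", "4:20 PM "]:
    print(repr(sample), is_pm(sample))
```

Note that the last sample fails the test: '%PM' only matches strings that *end* in 'PM', so stray trailing whitespace in the source data would need to be trimmed first (for example with `trim()`).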


**Pass 2: Handling Hours and Minutes**  
Next we need to pick off the hour and minutes part of the time string. Since the hour part sometimes has one character and in other cases has two, we will need to take that into account in our code.
- The `instr(X,Y)` function returns the position of the first occurrence of `Y` in the string `X`.
- The `substr(X,Y,Z)` function returns the substring of `X`, starting at position `Y` with `Z` characters.
- A `GROUP BY` clause was used to make sure that both 1 digit and 2 digit hours are represented.
- SQLite automatically **coerces** numerical strings like "12" or "3.14" to equivalent numerical types when used in arithmetic expressions. So `"12"+12 = 24`. 



In [None]:
%%sql
SELECT 
    min(ShowTime), 
    instr(ShowTime,':')-1 AS hour_digits,
    substr(ShowTime,1,instr(ShowTime,':')-1) AS hours,
    substr(ShowTime,1,instr(ShowTime,':')-1)+12 AS hours_plus_12,
    substr(ShowTime,instr(ShowTime,':')+1,2) AS mins,
    CASE  
      WHEN upper(ShowTime) LIKE '%PM' THEN 'it is PM'
      ELSE 'it is AM'
    END AS `AM or PM`
FROM DATASET
GROUP BY hour_digits, `AM or PM`
LIMIT 10;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


min(ShowTime),hour_digits,hours,hours_plus_12,mins,AM or PM
2:30 PM,1,2,14,30,it is PM
10:05 PM,2,10,22,5,it is PM
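The string functions from Pass 2 can be tried on single values as well. This sketch assumes an in-memory SQLite database and a hypothetical helper named `split_time`; the column aliases mirror the ones used in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def split_time(show_time):
    """Split an 'H:MM AM' style string using SQLite's instr() and substr()."""
    sql = """
        SELECT
            instr(t, ':') - 1                    AS hour_digits,
            substr(t, 1, instr(t, ':') - 1)      AS hours,
            substr(t, 1, instr(t, ':') - 1) + 12 AS hours_plus_12,
            substr(t, instr(t, ':') + 1, 2)      AS mins
        FROM (SELECT ? AS t)
    """
    return conn.execute(sql, (show_time,)).fetchone()

print(split_time("2:30 PM"))
print(split_time("10:05 PM"))
```

Here `hours` and `mins` come back as text (so the minutes of '10:05 PM' are the string '05'), while `hours_plus_12` is an integer because the `+` forced the coercion.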


**Pass 3: Putting it all together**

We now have enough to create the full expression. There are three cases to handle:
- '12:00 AM' - '12:59 AM': The hour is '00'.
- '1:00 AM' - '11:59 AM' and '12:00 PM' - '12:59 PM': The hour is as given, padded to two characters if needed.
- '1:00 PM' - '11:59 PM': Add 12 to the hour. 

We can take the second case as the default and handle the other two separately. The final code is shown below.
- The `printf()` function is used to "pretty print" text in a fixed format; it is a holdover from the original C function.
- The expressions for `hour_digits`, etc., were used in the `CASE` expressions as needed.
- There were no midnight shows or morning matinees but the `CASE` expression should work for those cases too. 

In [None]:
%%sql
SELECT 
    ShowTime, 
    CASE  
      -- 12:xx AM becomes hour 00
      WHEN upper(ShowTime) LIKE '%AM' AND substr(ShowTime,1,instr(ShowTime,':')-1) = '12' 
          THEN printf('00:%s', substr(ShowTime,instr(ShowTime,':')+1,2))
      -- 1:xx PM to 11:xx PM: add 12 (compare with the number 12, not the text '12')
      WHEN upper(ShowTime) LIKE '%PM' AND substr(ShowTime,1,instr(ShowTime,':')-1)+0 < 12 
          THEN printf('%02d:%s', substr(ShowTime,1,instr(ShowTime,':')-1) + 12,substr(ShowTime,instr(ShowTime,':')+1,2))
      -- all other times keep their hour, zero-padded to two digits
      ELSE printf('%02d:%s', substr(ShowTime,1,instr(ShowTime,':')-1),substr(ShowTime,instr(ShowTime,':')+1,2))
    END AS iso_time
FROM DATASET
GROUP BY ShowTime;

 * sqlite:///buan6510/data/MoviesTonight/MoviesTonight.db
Done.


ShowTime,iso_time
10:05 PM,22:05
10:20 PM,22:20
2:30 PM,14:30
3:50 PM,15:50
4:00 PM,16:00
4:10 PM,16:10
4:15 PM,16:15
4:20 PM,16:20
4:25 PM,16:25
4:30 PM,16:30
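The `CASE` logic can be checked against times that do not appear in the dataset, such as midnight, morning, and noon shows. The sketch below is a stand-in rather than the notebook's own cell: it runs a version of the expression through Python's `sqlite3` module against an in-memory database, compares the hour as a number, and zero-pads it with `%02d`. The helper name `to_iso` and the sample times are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One-row derived table (t) stands in for the ShowTime column
TO_ISO_SQL = """
SELECT CASE
    WHEN upper(t) LIKE '%AM' AND substr(t, 1, instr(t, ':') - 1) = '12'
        THEN printf('00:%s', substr(t, instr(t, ':') + 1, 2))
    WHEN upper(t) LIKE '%PM' AND substr(t, 1, instr(t, ':') - 1) + 0 < 12
        THEN printf('%02d:%s', substr(t, 1, instr(t, ':') - 1) + 12,
                    substr(t, instr(t, ':') + 1, 2))
    ELSE printf('%02d:%s', substr(t, 1, instr(t, ':') - 1),
                substr(t, instr(t, ':') + 1, 2))
END
FROM (SELECT ? AS t)
"""

def to_iso(show_time):
    """Convert an 'H:MM AM/PM' string to 24-hour 'HH:MM' using SQLite."""
    return conn.execute(TO_ISO_SQL, (show_time,)).fetchone()[0]

for sample in ["12:15 AM", "9:30 AM", "12:00 PM", "2:30 PM", "10:05 PM"]:
    print(sample, "->", to_iso(sample))
```

The noon case is the one to watch. SQLite treats any integer as less than any text value when neither side has a column affinity, so comparing the hour to the string `'12'` would send '12:00 PM' down the "add 12" branch. Comparing to the number `12` avoids that.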


---
## **SQL AND BEYOND: Amazon Web Services RDS Hosting Service**

As you work with larger and larger datasets, your needs will eventually outgrow what will fit in SQLite or perhaps even your laptop's drive. For that you will want to use one of the budget-priced cloud storage services. We explored Google Cloud Platform in Lesson 3. It's great for one-table databases, but sometimes you are going to want the full relational model. Now it's Amazon's turn. 

Launched in 2009, Amazon's RDS service is a very popular choice for cloud-based RDBMS hosting, used for everything from the biggest enterprise systems to the smallest hobby projects. It supports a variety of relational DBMS platforms (Oracle, SQL Server, PostgreSQL, MySQL, MariaDB, etc.), with automated handling of system backups, user access controls, data migration, etc. that would normally require a trained database administrator. It also has virtually unlimited scale, pricing out storage capacity and data access bandwidth at commodity prices. 

Here we will walk through a managed MySQL server instance in RDS's Free Tier, which is plenty capable for most analytical projects. 

### **AWS Free Tier**
The AWS Free Tier is a great place to learn about how Amazon Web Services works. It takes a few minutes to get an account set up. Although a credit card is required, it is fairly easy to avoid charges by steering clear of "autoscaling" and similar options. 

![AWS Free Tier](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_AWS_Free_Tier.png)

### **RDS Instances**
RDS is Amazon's database hosting service. A database server (DBMS) in the RDS cloud is called an **instance**. The RDS Free Tier supports many of the most popular database options. The storage capacity is 20 gigabytes, with another 20 gigabytes set aside for backups, which is plenty for most purposes. 

![](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_RDS_MySQL_Engine.png)  

![RDS Instance](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_AWS_Database_Instance.png)

Once the instance is created and configured, we can add databases using standard SQL commands. Here we are using a Linux `mysql` client to create the database. We could have just as easily used MySQL Workbench for Windows or macOS. (See [here](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Workbench.png).) It's all SQL from here. 

![Create database](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_AWS_SQL_Create_Database.png)

Note: one server can host multiple databases. We just have to repeat the `CREATE DATABASE` command for each database. 

After a database is created, we can use SQL `CREATE USER` and `GRANT` statements to allow remote users (i.e., a Colab notebook) to access the data. In the example below, SQL is used to grant permission to students to run `SELECT` queries.

![Grant Permissions](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_AWS_User_Access.png)

### **Connect from Colab**
Connecting to the database from Colab is exactly like we have been doing already. However, if we want to run `INSERT`, `UPDATE`, etc., we will need to log in as a privileged user. In the connection below, the 'datamanager' user is logged in using a password typed in interactively (with a little help from the `getpass()` utility). That keeps the password out of the notebook code, where it could leak to people you really don't want rummaging through your databases. To be extra safe you may want to hide the username as well. 

> **Heads Up:** The queries below won't run without `datamanager` privileges. Look but don't touch. 




In [None]:
# install the pymysql driver needed to connect to MySQL
!pip install pymysql

# Import getpass, which silently asks for passwords
import getpass

# Load %%sql magic 
%load_ext sql

pw = getpass.getpass("Password:")
connection_string = 'mysql+pymysql://datamanager:{PW}@DATA6510demo.cuj5bhwwzkbm.us-east-2.rds.amazonaws.com/nba_play_log'.format(PW=pw)
%sql $connection_string

Collecting pymysql
  Downloading https://files.pythonhosted.org/packages/4f/52/a115fe175028b058df353c5a3d5290b71514a83f67078a6482cff24d6137/PyMySQL-1.0.2-py3-none-any.whl (43kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2
Password:··········


'Connected: datamanager@nba_play_log'
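One practical wrinkle: a password containing characters such as `@` or `/` can confuse the parsing of a connection URL. A common remedy is to percent-encode the password before formatting it in. The sketch below uses `urllib.parse.quote_plus` from the standard library; the helper name `build_connection_string`, the host, and the sample password are all hypothetical.

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, database):
    """Assemble a mysql+pymysql URL, percent-encoding the password."""
    return "mysql+pymysql://{u}:{p}@{h}/{d}".format(
        u=user, p=quote_plus(password), h=host, d=database
    )

# Hypothetical values for illustration only; in practice the password
# would come from getpass.getpass() rather than a literal
example = build_connection_string(
    "datamanager", "p@ss/word", "example-host.rds.amazonaws.com", "nba_play_log"
)
print(example)
```

With the encoding in place, the only `@` left in the URL is the one separating the credentials from the host.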

We are now able to modify the data in the tables as needed. However, the 'datamanager' user cannot drop the database or modify any table definitions. 

### **Create and Populate Tables from CSV Files**

CSV is a very popular data format. It is easy to import and export in/out of a spreadsheet, the files are fairly compact, and everybody has seen them before. However, there are some drawbacks for relational data:
- A CSV file can only contain one table 
- Data types have to be inferred (and software will sometimes get it wrong)
- It has no support for keys of any sort
- Not very convenient for binary data (BLOBs)

In order to make CSV files work properly for loading data into a relational database, the tried and true method is to:
1.  Create the table schemas (with `CREATE TABLE` statements).
2.  Import data into the database from CSV files. This will create tables that we will drop at the end. They are only needed to get the source data into the database. 
3.  Use SQL DML (`INSERT` and `UPDATE`) to populate the tables in the schema. The DML would be executed, one table at a time, following the "strongest entity first" rule to avoid referential integrity violations. However, if you are 100% sure there will be no violations, then it is recommended to temporarily turn off [referential integrity checks when bulk-loading data](https://dev.mysql.com/doc/refman/8.0/en/optimizing-innodb-bulk-data-loading.html).

![Bulk Load Performance](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_fk_check_off.png)

We'll take these one step at a time, with a load of a full season of NBA PlayLog data.

**1. Create Table Schema.**
The first step is exactly like we have done before:
![Create Table Schema](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_Create_Table_Schema.png)

Note that the `DROP TABLE` statements are in the reverse order of the `CREATE TABLE` statements. That's once again to avoid referential integrity violations.  

> **Heads up**: This method of keeping the DDL separate from the data files has an added advantage. It allows us to migrate data from one DBMS to another, even when they have different SQL dialects. The CSV files (or `INSERT` statements) would be the same for any vendor. We just have to change the data types and other details in the DDL to fit the SQL dialect we are migrating to. 

**2. Load from CSV files.**
The source data can be loaded into the database any number of ways. MySQL, for instance, has a `LOAD DATA` command that can [pull directly from a CSV file](https://dev.mysql.com/doc/refman/8.0/en/load-data.html). In the example below, pandas is used to load the data in a somewhat DBMS-agnostic way:

![pandas data load](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_Load_From_Pandas.png)

The data is loaded one year at a time, with one CSV file per database table (`teams_import`, `games_import`, etc.).

> **Heads up:** We can also load the data from one huge CSV file (e.g., the `PlayLog` table we have already explored) and then parse it out to several normalized tables. However, there are performance issues that can lock up the server if the CSV file is big enough. That is actually the case here. It's faster and less problematic to do some work before uploading to the server. 

**3. DML to Populate Tables.**
With the data already in the database, we can then use `INSERT INTO ... SELECT` statements to transfer the data to tables. The query below loads data into the `games` table. Notice that it requires a pair of joins to the `teams` table in order to look up `team_id` keys.  

![DML Populate Tables](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_DML_Populate_Tables.png)

Each of these `INSERT` statements is a database transaction **which can fail** for any of a number of reasons:
- Primary key or foreign key violation. 
- Uses up 100% of CPU, memory, or disk space.

![CPU Utilization](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_CPU_Utlization.png)

- Connection reset due to a timeout. 

![Connection Reset](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Connection_Reset.png)

- ... so many more ...

Behind the scenes, the DBMS is going to great lengths to monitor each transaction to warn if things are going awry. The key stat below is the "undo log entries," which shows that 4.5 million rows of data would have to be rolled back if the query were to fail. 

![Transaction Monitoring](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Transaction_Monitoring.png)

Average write performance was about 1500 rows per second. If that drops to 0, then the database may be in a deadlocked state. Breaking the deadlock requires rebooting the server and yes, there is some risk that the server won't reboot. That's what backups are for! 

![Insert Performance](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Insert_Performance.png)

If a rollback does occur, then expect the process to take a while. Reverting 4.2 million rows of data is not easy, especially if the server does not have enough swap space and in-memory cache.  

![](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Transaction_Rollback.png)
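The same all-or-nothing behavior can be seen on a toy scale. The sketch below is an assumption-laden miniature, not the RDS workload: it uses Python's `sqlite3` module, an in-memory database, and made-up `teams` and `players` tables. One bad row causes every insert in the transaction to be rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_code TEXT);
CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    name      TEXT,
    team_id   INTEGER NOT NULL REFERENCES teams(team_id)
);
""")

failure = None
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO teams VALUES (1, 'DET')")
        conn.execute("INSERT INTO players VALUES (1, 'Ben Wallace', 1)")
        # team_id 99 does not exist, so this row violates the foreign key
        conn.execute("INSERT INTO players VALUES (2, 'Nobody', 99)")
except sqlite3.IntegrityError as err:
    failure = str(err)

team_count = conn.execute("SELECT count(*) FROM teams").fetchone()[0]
player_count = conn.execute("SELECT count(*) FROM players").fetchone()[0]
print(failure, team_count, player_count)
```

Both counts end up at zero: the two valid inserts were undone along with the bad one, which is atomicity at work.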


**4. Clean up.**
After we are done with the import data, we can just drop the tables. 

![Clean Up Tables](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_Clean_Up.png)
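The four steps can be sketched end to end on a toy scale. Everything below is an illustrative assumption: an in-memory SQLite database instead of MySQL, Python's `csv` module instead of pandas, two tiny made-up CSV extracts, and hypothetical column names. Only the `teams_import` and `games_import` table names and the overall pattern come from the walkthrough.

```python
import csv
import io
import sqlite3

# Tiny made-up CSV extracts standing in for the real import files
TEAMS_CSV = "team_code,team_name\nDET,Detroit Pistons\nSAS,San Antonio Spurs\n"
GAMES_CSV = "game_date,home_code,away_code\n2005-06-09,SAS,DET\n2005-06-14,DET,SAS\n"

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# 1. Create the target schema (strongest entity first) plus staging tables
conn.executescript("""
CREATE TABLE teams (
    team_id   INTEGER PRIMARY KEY,
    team_code TEXT NOT NULL UNIQUE,
    team_name TEXT
);
CREATE TABLE games (
    game_id      INTEGER PRIMARY KEY,
    game_date    TEXT,
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id)
);
CREATE TABLE teams_import (team_code TEXT, team_name TEXT);
CREATE TABLE games_import (game_date TEXT, home_code TEXT, away_code TEXT);
""")

def load_csv(table, text):
    """2. Copy CSV rows (minus the header) into a staging table."""
    rows = list(csv.reader(io.StringIO(text)))[1:]
    placeholders = ", ".join("?" * len(rows[0]))
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

load_csv("teams_import", TEAMS_CSV)
load_csv("games_import", GAMES_CSV)

# 3. Populate the real tables, parents before children
with conn:
    conn.execute("""
        INSERT INTO teams (team_code, team_name)
        SELECT team_code, team_name FROM teams_import
    """)
    # Two joins to teams: one looks up the home key, the other the away key
    conn.execute("""
        INSERT INTO games (game_date, home_team_id, away_team_id)
        SELECT g.game_date, h.team_id, a.team_id
        FROM games_import AS g
        JOIN teams AS h ON h.team_code = g.home_code
        JOIN teams AS a ON a.team_code = g.away_code
    """)

# 4. Drop the staging tables, which were only scaffolding
conn.executescript("DROP TABLE games_import; DROP TABLE teams_import;")

games = conn.execute("SELECT game_date, home_team_id, away_team_id FROM games").fetchall()
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(games, tables)
```

Loading `games` before `teams` would fail here, because the joins would find no keys to look up. That is the load order issue in miniature.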

### **Working with Dump Files**

If the data already has been loaded into a database we can create a dump (archive) file. The process varies, but for MySQL it's packaged as a [separate utility](https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html), shown here in the MySQL Workbench app.

![MySQL Dumps from Workbench](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_MySQL_Dump_Workbench.png)

The resulting dump file includes lots of SQL DDL and DML code that has been tuned to load as quickly as possible without errors. It can be treated like any other file, except that it may be many, many gigabytes in size. The process of restoring from a backup is the reverse of the dump file export.  

![Restore from MySQL dump file](https://github.com/christopherhuntley/DATA6510/raw/master/img/L8_Restore_Workbench.png)

> **Heads up:** Since the dump file is an editable SQL script, we can alter the DDL statements (data types, constraints, etc.) to suit a different DBMS. The SQL DML is standard and does not need (much) fiddling. 
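To get a feel for what a dump file contains without a MySQL server, SQLite offers a rough analog. This is a sketch using the `iterdump()` method of Python's `sqlite3` connections, not `mysqldump` itself; the `teams` table and its rows are made up.

```python
import sqlite3

# A small source database to dump (made-up data)
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_code TEXT);
INSERT INTO teams VALUES (1, 'DET'), (2, 'SAS');
""")

# The dump is plain SQL text: CREATE TABLE and INSERT statements
dump_sql = "\n".join(source.iterdump())
print(dump_sql)

# Restoring is just running that script against an empty database
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
restored_rows = restored.execute("SELECT * FROM teams ORDER BY team_id").fetchall()
print(restored_rows)
```

Because the dump is ordinary text, it could be edited between the export and the restore, which is what makes the migration trick possible.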


### **Run a Few Queries**
Generally we would never use *ad hoc* `SELECT` queries with a large transactional database like this one, which is designed for writing data in small ACID transactions. Instead, we would also keep a data warehouse, with tables that have been selectively denormalized to better fit the needs of data analysts. The data warehouse would extract and load data in bulk from this database, perhaps as a part of a 'nightly build' process, knowing that the data had already been scrubbed of anomalies. (We will cover this common architectural design in Lessons 9 and 10.) Nonetheless, we can get surprisingly good `SELECT` query performance from even fairly complex queries.

Here, for example, is a query that totals up the playing time for each player in the 2004-05 regular season and playoffs. A few complexities:
- Four tables and three joins, with millions of rows of data being accessed and cross-referenced
- Grouping, with lots of subtotals tracked at once
- Sorting by a group aggregate (minutes) 

Total query time is 51 seconds, which is not exactly instantaneous but good enough for the occasional query. 


In [None]:
%%sql
SELECT team_code,players.name,sum(elapsed)/60 as minutes
FROM players JOIN lineups USING (player_id) JOIN play_segments USING (play_seg_id) JOIN teams ON (players.team_id = teams.team_id)
WHERE teams.team_year = 2005
GROUP BY player_id, players.name, team_code
ORDER BY minutes DESC;

 * mysql+pymysql://datamanager:***@buan6510demo.cuj5bhwwzkbm.us-east-2.rds.amazonaws.com/nba_play_log
526 rows affected.


team_code,name,minutes
DET,Tayshaun Prince,3987.1833
DET,Richard Hamilton,3943.3
DET,Chauncey Billups,3795.4
PHX,Shawn Marion,3781.0667
WAS,Gilbert Arenas,3729.7
SAS,Tony Parker,3602.1333
PHX,Joe Johnson,3597.6167
DET,Ben Wallace,3588.0833
MIA,Dwyane Wade,3554.6
DAL,Dirk Nowitzki,3548.9


Let's see how much data that really was:

In [None]:
%%sql
SELECT 
  (SELECT count(*) FROM players JOIN teams USING (team_id) WHERE team_year=2005) AS players,
  (SELECT count(*) FROM lineups JOIN play_segments USING (play_seg_id) JOIN games USING (game_id) WHERE game_year=2005) AS lineups,
  (SELECT count(*) FROM play_segments JOIN games USING (game_id) where game_year = 2005) AS plays,
  (SELECT count(*) FROM teams WHERE team_year = 2005) AS teams


 * mysql+pymysql://datamanager:***@buan6510demo.cuj5bhwwzkbm.us-east-2.rds.amazonaws.com/nba_play_log
1 rows affected.


players,lineups,plays,teams
526,5944039,594457,30


> **Heads up**: If we had just one cross join in there, the query would have quickly outstripped the server's memory and storage capacity. 5944039 x 594457 is a *very* large number. [Also, note that the number of lineups *should* be 10x the number of plays (because there are 10 players on the court at a time). However, there were some plays in the original logs with fewer than 10 players due to injuries, ejections, and data collection errors. It might be interesting to study those to see what happened.] 
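For a sense of scale, the arithmetic can be worked out directly from the two counts returned by the query. The variable names below are just for illustration.

```python
lineups = 5_944_039  # lineup rows counted for 2005
plays = 594_457      # play segment rows counted for 2005

# Rows produced by an unfiltered cross join of the two tables
cross_join_rows = lineups * plays

# Shortfall relative to 10 players on the court for every play
missing_lineup_rows = 10 * plays - lineups

print(f"{cross_join_rows:,} rows in the cross join")
print(f"{missing_lineup_rows} lineup rows short of 10 per play")
```

That is roughly 3.5 trillion rows for the cross join, while the lineups table is only 531 rows shy of a full 10 players per play.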

---
## **Congratulations! You've made it to the end of Lesson 8.**

That's a wrap on SQL DML. Just be sure to study for Quiz 4. There is just one more quiz after that. 



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `DATA6510` folder so you can find it next time.