<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 2: Basic SELECT Statements** 
_Retrieving data from a single table._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The sequence and purpose of each clause of SQL select queries (SELECT, FROM, WHERE, GROUP BY, etc.)
- How SELECT queries compare to similar operations in Excel or Pandas 

### **Skills / Know how to ...**
- Write basic SQL SELECT / FROM / WHERE queries
- Apply functions and conditionals where required
- Calculate aggregate quantities like AVG, SUM, etc.
- Group records using column selectors

---
## **BIG PICTURE: SQL as a universal data access language** 

SQL is just not like the other popular programming languages. Consider, for example, the January 2021 *Tiobe Index*, where SQL is listed as #12. Every other language on the list can be used to build apps and systems. We call them general purpose languages. SQL, however, is explicitly a special purpose language, designed for data management and only data management. The most similar in that respect is R, with its focus on data analytics. Everything else is strictly general purpose.  So, if SQL is so out of step with the rest of the programming language universe, why is it considered a critical data analytics skill? Because it does its job extremely well, of course, but there is more to it than that. 

![Tiobe Index](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_tiobe_index.png)

**SQL is an ancient programming language.** Of the others on the list, C is the only contemporary. (Assembly Language is technically not a single language but a kind of language, so it does not have an age *per se*.) Both C and SQL date back to the early 1970s. The others came along *decades* later. By then, C and SQL were already ubiquitous, in use by millions of programmers around the world. 

**Age and "first mover advantage" are not all of the SQL story though.** Recall the three-tiered model from Lesson 1. Each of the tiers is a *black box*, with internal implementation details hidden behind a *public interface* (often called an API). As long as a given tier can handle certain requests, phrased in such and such a way and producing results in a specified format, then nobody has to care *how* it works. That is extremely powerful! SQL happens to provide just such an interface. For basic CRUD operations in the data tier, SQL provides a complete and standard set of requests and responses that any application or system can rely on. Further, if a better implementation of the standards embodied by SQL comes along, we can upgrade to the newer technology without modifying *any* existing code.  

So, **when asked "Why SQL?" the best answer is "Why not SQL?"** because that is the right way to think of it. SQL works so well for many, many applications supported by millions of programmers. Thus, the burden of proof is to show that SQL can't get the job done and then propose something that does. Even then, the smart boss will instinctively ask for the same data integrity protections provided by SQL. In the end, using anything else is like swimming upstream, where relatively gentle currents on a clear day can become deadly in a storm. In other words, never assume that what works for local storage on an isolated iPhone will work for the cloud servers it connects with. The volume of data and the likelihood of corruption are just too high. Some companies, usually startups, that made the "No, not SQL" choice are having to live with their decisions and ... scaling up to SQL compliance. Why not start there from the beginning? 

## **SQL History, Standards, and Use Cases**

Structured Query Language (SQL) was developed by researchers at IBM in the early 1970s. The original name was Sequel, which is how many of us pronounce SQL to this very day. Unlike human language, programming languages have version numbers that track evolution over time. For many years, SQL was whatever IBM defined it to be, but by the mid-1980s SQL was an established standard (SQL-86) endorsed by the American National Standards Institute (ANSI). There have been newer versions over the many years since (most recently as SQL:2019) that extend the original SQL-86 standard to include things like XML data, binary large objects (BLOBs), and window functions. We will come back to these newer features in the latter portions of this course.  

SQL is not a full-featured language like C, C#, Java, Python, etc. Instead, it is considered a *data sublanguage* for creating and processing *relational* data and metadata. As discussed, for this purpose it is ubiquitous, supported by just about every other programming language on earth. 

While we think of SQL as a language, the SQL *standard* defines five different kinds of language, each with a different *use case*:
- Data Definition Language (DDL)
- Data Manipulation Language (DML)
- SQL/Persistent Stored Modules (SQL/PSM)
- Transaction Control Language (TCL)
- Data Control Language (DCL)

We are going to focus on SQL SELECT statements, which are a kind of DML. The rest are mostly for SQL DB Administrators/Engineers. However, it is good to know what they are when the engineers bring them up in a meeting. 

The difference between data definition and data manipulation lies in what kinds of data are being addressed. DDL is used to create, describe, modify, or discard *metadata*: 
- Tables, columns, etc. 
- Relationships, keys, etc. 

DML is used to retrieve, add, update, or delete *data*:
- SELECT statements retrieve data from tables
- INSERT, UPDATE, and DELETE statements manage the data in the tables

In addition to DDL and DML, we will also consider a few relevant SQL/PSM, TCL, and DCL statements but always with an eye toward how they interact with DDL and DML. We'll leave the rest for data engineering courses. 

---


## **Preliminaries** 

In this class we will try to work with live data whenever possible. That means each lesson will generally start with:
- Boilerplate code to link in any needed software.
- Connection to the live database. If it is a database that we have not seen before then we will also take a look at the data model before running any queries. 

### **Software Prep** 

Run the boilerplate cell below, which loads all the software we'll need for the rest of this lesson. 

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymysql
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2


You may have noticed that we used `pip` to install `pymysql`, a software driver we'll need to connect to MySQL databases from Python. We will be using the `pip` utility again later in the course. It finds and installs software from trusted Python package repositories. 

### **The Database**
We'll be working with the Lahman 2016 dataset, which has baseball statistics for every Major League player since ... forever. It goes all the way back to the very beginning! 

The data model follows the so-called Snowflake Schema pattern we'll explore in more detail in Lesson 10. The database is huge, with a couple dozen tables. For now we will have to content ourselves with the simplified Entity Relationship Diagram below. As the name implies, the `Master` table is of central importance. It represents the players, with a smattering of other personnel (coaches, owners, radio/TV announcers) mixed in for good measure. 

![Lahman 2016 ERD](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_baseball_stats_schema.png)

**The database itself is located in the cloud. Run the cell below to open a connection.**

In [None]:
%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016

'Connected: buan6510student@lahman2016'

Take note of the connection string, which follows the URL format used by SQLAlchemy, the Python library behind the `%sql` magic. The general format is 

`protocol :// username : password @ server / database`

Let's take this one apart:
- `protocol` $\rightarrow$ `mysql+pymysql`  (`mysql` DBMS using the just installed `pymysql` connector) 
- `username` $\rightarrow$ `buan6510student` (a dummy account with read-only access to the database)
- `password` $\rightarrow$ `buan6510` 
- `server` $\rightarrow$ `database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com` (notice the `rds.amazonaws` in the URL)
- `database` $\rightarrow$ `lahman2016`
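
To make the anatomy of the connection string concrete, here is a minimal sketch that pulls it apart with Python's standard `urllib.parse` module. It only parses the text; it does not connect to anything. The dictionary keys are our own labels, chosen to match the list above.

```python
from urllib.parse import urlsplit

# The connection string used in this lesson (a read-only teaching account).
url = (
    "mysql+pymysql://buan6510student:buan6510"
    "@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016"
)

parts = urlsplit(url)

# Map the generic URL pieces onto the labels used in the lesson.
connection = {
    "protocol": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "server": parts.hostname,
    "database": parts.path.lstrip("/"),
}

for name, value in connection.items():
    print(f"{name:>8}: {value}")
```

Each printed value should match the corresponding bullet in the breakdown above.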

The database resides in a virtualized RDS server hosted by Amazon Web Services. The DBMS is MySQL, a very popular choice explored more in the **SQL AND BEYOND** segment at the end of the lesson. 

> **Heads Up:** Normally, one would not disclose account credentials in this way, but since we are using a read-only account the risk is minimal. In a production setting we would take great pains to store database credentials in a separate file in a non-internet accessible location.

**Now that we have everything set up we can continue on to learning about SQL SELECT statements.** 

---

> **Super Important Heads Up**   
If a code cell returns an error like **'UsageError: Line magic function `%sql` not found'** or **'Environment variable $DATABASE_URL not set'** then your Colab session ("runtime") has timed out. When that happens just run the cell below to reload everything. 


In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


'Connected: buan6510student@lahman2016'

---
## **SQL SELECT Statements ... one clause at a time**

By far the most common use case for SQL is retrieving data from a relational database. For many of you, learning how to do just that is why you are taking this course. 

The good news is that we only need to consider one kind of SQL statement to do it. Nearly every data retrieval query has the following structure:
```
SELECT columns
FROM tables
WHERE row-conditions
GROUP BY columns
HAVING group-conditions
ORDER BY columns
LIMIT max-rows;
```
Each of the words in CAPITALS is a SQL keyword that introduces one *clause* of a SQL statement. These clauses are always given in exactly the same order, regardless of what the data is used for. The only clause that is strictly required is `SELECT`; all of the others are optional, added on when needed. Nearly all of these have been in SQL since the beginning (`LIMIT` is a widely supported extension rather than part of the original standard) and will be covered one at a time below. In Lesson 3 we will also explore a more recent addition, the `WITH` clause, that simplifies certain complex queries that draw from multiple tables. 

> **Heads up:** the closing semicolon `;` is technically required at the end of each SQL statement (query), regardless of the number of clauses. However, if there is only one query in a given cell then we can safely omit it. That said, you might as well get used to typing it each time.
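
To see all seven clauses working together, here is a small sketch that uses Python's built-in `sqlite3` module and a throwaway in-memory table rather than the course database. The `people` table, its column names, and its six rows are toy assumptions made up for illustration; SQLite's dialect is close enough to MySQL for this query to read the same way.

```python
import sqlite3

# Build a tiny in-memory table (toy data, not the Lahman database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Hank", "Aaron", 180),
        ("Tommie", "Aaron", 190),
        ("Don", "Aase", 190),
        ("Andy", "Abad", 184),
        ("Fernando", "Abad", 220),
        ("Bert", "Abbey", 175),
    ],
)

query = """
SELECT last_name, count(*) AS cnt      -- which columns we want
FROM people                            -- which table they come from
WHERE weight >= 180                    -- keep only rows that pass this test
GROUP BY last_name                     -- one output row per last name
HAVING count(*) > 1                    -- keep only groups that pass this test
ORDER BY last_name DESC                -- sort the surviving groups
LIMIT 5;                               -- cap the number of rows returned
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Try deleting clauses from the bottom up. The query keeps working as long as the remaining clauses stay in the same relative order.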

### **The `SELECT` Clause**

The `SELECT` clause is used to indicate which data columns we want. The columns are always provided as a comma-separated list. 

While the columns will *almost* always be selected from tables, we can actually use `SELECT` as an expression calculator:

In [None]:
%%sql
SELECT 1+1, 2*10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


1+1,2*10
2,20


The result is a tabular *resultset*, which reduces to a single scalar value when there is just one row and one column. If we like we can specify column names (or more properly, *aliases*) using the `AS` modifier. We'll look at some more advanced uses of `AS` later in this lesson.  

In [None]:
%%sql
SELECT 1+1 AS one, 2*10 AS two;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


one,two
2,20


> **Heads Up:** Aliases are actually kind of important, sort of like defining variables in Python. Make sure you understand what they do before moving on. 
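
One way to convince yourself that an alias really does rename a column is to inspect the resultset's metadata. The sketch below uses Python's built-in `sqlite3` module instead of the course database, so it runs anywhere. The alias names `total` and `product` are our own choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SELECT as an expression calculator, with an alias for each column.
cursor = conn.execute("SELECT 1+1 AS total, 2*10 AS product;")

# cursor.description holds one entry per column; the first field is the name.
column_names = [col[0] for col in cursor.description]
row = cursor.fetchone()

print(column_names, row)
```

The aliases become the column names of the resultset, much as a variable name labels a value in Python.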

### **The `FROM` Clause**
If we want to use data from tables then we'll need to specify them in the `FROM` clause. The cell below uses the Lahman2016 database of career baseball statistics, where the `Master` table contains a list of every Major League Player since ... forever. 


In [None]:
%%sql buan6510student@lahman2016
SELECT nameFirst, nameLast
FROM Master
LIMIT 10;

10 rows affected.


nameFirst,nameLast
David,Aardsma
Hank,Aaron
Tommie,Aaron
Don,Aase
Andy,Abad
Fernando,Abad
John,Abadie
Ed,Abbaticchio
Bert,Abbey
Charlie,Abbey


The connection identifier after `%%sql` was created when we first connected to the database. We only need to use it when switching which database we want to work with. We also used a `LIMIT` clause to restrict the number of rows returned. There have been a lot of major league baseball players over the years, too many to scroll through in this lesson. (And yes, Tommie Aaron was Hank Aaron's brother. Baseball talent tends to run in families.)

> **Heads up:** Table names are case sensitive on this MySQL server (as they are by default on Linux-hosted MySQL), so the following results in an error:


In [None]:
%%sql
SELECT nameFirst, nameLast
FROM master
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
(pymysql.err.ProgrammingError) (1146, "Table 'lahman2016.master' doesn't exist")
[SQL: SELECT nameFirst, nameLast
FROM master
LIMIT 10;]
(Background on this error at: https://sqlalche.me/e/14/f405)


If we want to select all columns from the `Master` table, we use the `*` wildcard:


In [None]:
%%sql
SELECT *
FROM Master
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
abadijo01,1850,11,4,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
abbated01,1877,4,15,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01
abbeybe01,1869,11,11,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
abbeych01,1866,10,14,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


> **Heads up:** When using a wildcard, the columns come back in the order in which they are defined in the table. If we want the columns in a particular order, then we need to list them that way in the `SELECT` clause. 

We can also use the wildcard to count the number of rows in the table. 

In [None]:
%%sql
SELECT count(*)
FROM Master;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
19105


The `count()` function does exactly what it appears to do. It counts the number of rows. In this case we are counting entire rows, but we can also just count the number of unique values in a given column.

In [None]:
%%sql
SELECT count(DISTINCT nameLast)
FROM Master; 

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(DISTINCT nameLast)
9822


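Be careful where the `DISTINCT` keyword goes. Placed before the `count()`, it removes duplicate rows from the *resultset* (here a single row holding the count) rather than from the rows being counted, so we get the same total as before: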

In [None]:
%%sql
SELECT DISTINCT count(*)
FROM Master; 

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
19105
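
The placement of `DISTINCT` matters more than the Lahman counts above reveal. The sketch below uses Python's built-in `sqlite3` module and a four-row toy table (our own made-up data, with one deliberately duplicated row) to compare the variations side by side. The `scalar` helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Hank", "Aaron"), ("Tommie", "Aaron"), ("Babe", "Adams"), ("Babe", "Adams")],
)

def scalar(sql):
    """Run a query that returns one row with one column and return that value."""
    return conn.execute(sql).fetchone()[0]

# Counts every row, duplicates included.
all_rows = scalar("SELECT count(*) FROM people")

# Counts the unique values in a single column.
distinct_last = scalar("SELECT count(DISTINCT last_name) FROM people")

# DISTINCT outside count() de-duplicates the one-row result, so nothing changes.
distinct_outside = scalar("SELECT DISTINCT count(*) FROM people")

# To count unique rows, de-duplicate first in a subquery and then count.
distinct_rows = scalar(
    "SELECT count(*) FROM (SELECT DISTINCT first_name, last_name FROM people) AS d"
)

print(all_rows, distinct_last, distinct_outside, distinct_rows)
```

With this toy data the four results are 4, 2, 4, and 3, which shows that only the subquery version counts unique rows.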


As we shall see in Lesson 3, **we can retrieve data from multiple tables using a `JOIN` operator.** The query below produces a list of people in the National Baseball Hall of Fame (a.k.a. "Cooperstown").

In [None]:
%%sql
SELECT nameFirst, nameLast
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,nameLast
Hank,Aaron
Jim,Abbott
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams


Oops. The query was supposed to list 10 people. Instead it lists just three, with Babe Adams listed multiple times. Actually, that's by design. We can see what's going on by including the `yearID` column from the `HallOfFame` table. 



In [None]:
%%sql
SELECT nameFirst, nameLast, yearID
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,nameLast,yearID
Hank,Aaron,1982
Jim,Abbott,2005
Babe,Adams,1937
Babe,Adams,1938
Babe,Adams,1939
Babe,Adams,1942
Babe,Adams,1945
Babe,Adams,1946
Babe,Adams,1947
Babe,Adams,1948


The `HallOfFame` table lists a player each year the player's induction was considered. An all-time great player like Hank Aaron will be considered just once and immediately inducted. Most are not so lucky. While Jim Abbott was inducted into the College Baseball Hall of Fame, he was considered only once (in 2005) for his professional career and did not meet the standard for further consideration. Babe Adams was considered repeatedly in the 1930s and 1940s and is eligible for consideration again in 2021. 

**If we want to see each name just once, we can use the `DISTINCT` modifier:**

In [None]:
%%sql
SELECT DISTINCT nameFirst, nameLast
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,nameLast
Hank,Aaron
Jim,Abbott
Babe,Adams
Bobby,Adams
Sparky,Adams
Tommie,Agee
Rick,Aguilera
Jack,Aker
Doyle,Alexander
Pete,Alexander


That's better. That it includes people not actually enshrined in the Hall of Fame is a bug that we will fix in the `WHERE` section. 

### **The `WHERE` Clause**

The `WHERE` clause is used to place conditions (restrictions) on which rows we want from the specified tables. SQL conditions are phrased as "boolean expressions" that resolve to either True or False. The following query returns only players (i.e., not coaches, media, etc.) that were ultimately inducted. 



In [None]:
%%sql 
SELECT nameFirst, nameLast, yearID as induction_year
FROM HallOfFame
    JOIN Master USING (playerID)
WHERE inducted='Y' and category='Player'
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,nameLast,induction_year
Hank,Aaron,1982
Pete,Alexander,1938
Roberto,Alomar,2011
Cap,Anson,1939
Luis,Aparicio,1984
Luke,Appling,1964
Richie,Ashburn,1995
Earl,Averill,1975
Jeff,Bagwell,2017
Home Run,Baker,1955


Don't you just love it when players go exclusively by their nicknames? Or do you really suppose somebody named their son "Home Run" just to be aspirational?

We'll come back to boolean expressions like `inducted='Y' and category='Player'` in a separate section later in this lesson.
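
The progression from a raw `JOIN`, to `DISTINCT`, to a `WHERE` filter can be replayed on a pair of toy tables. The sketch below uses Python's built-in `sqlite3` module. The table names, column names, and player IDs (`p1`, `p2`) are simplified stand-ins that we made up, not the real Lahman schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (player_id TEXT, name_last TEXT);
CREATE TABLE hall_of_fame (player_id TEXT, year_id INTEGER, inducted TEXT, category TEXT);
INSERT INTO master VALUES ('p1', 'Aaron'), ('p2', 'Adams');
INSERT INTO hall_of_fame VALUES
    ('p1', 1982, 'Y', 'Player'),
    ('p2', 1937, 'N', 'Player'),
    ('p2', 1938, 'N', 'Player');
""")

base = "SELECT {cols} FROM hall_of_fame JOIN master USING (player_id) {where} ORDER BY name_last"

# One row per ballot appearance, so Adams shows up twice.
joined = conn.execute(base.format(cols="name_last", where="")).fetchall()

# DISTINCT collapses the repeated names.
distinct = conn.execute(base.format(cols="DISTINCT name_last", where="")).fetchall()

# WHERE keeps only rows where both conditions are true.
inducted = conn.execute(
    base.format(cols="name_last", where="WHERE inducted = 'Y' AND category = 'Player'")
).fetchall()

print(joined, distinct, inducted)
```

Note that `DISTINCT` only hides the repetition, while `WHERE` changes which rows are considered in the first place.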

### **The `GROUP BY` Clause**

As we have already seen, SQL is capable of counting. It is also capable of calculating sums, like the total weight of all the people in the database:





In [None]:
%%sql
SELECT sum(weight) 
FROM Master;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


sum(weight)
3401541


Usually, however, we are not so interested in calculations over **all rows**. Instead, we will want totals over specific rows. To do that we can use a `WHERE` clause:

In [None]:
%%sql
SELECT count(*)
FROM Master 
WHERE nameFirst = 'Tony';

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
106


That's great, except we had to know exactly which first name we wanted to count up. If we want it to do the same thing but for every first name, we add a `GROUP BY` clause:

In [None]:
%%sql
SELECT nameFirst, Count(*) 
FROM Master
GROUP BY nameFirst
LIMIT 10;


 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,Count(*)
,37
A. J.,14
Aaron,37
Ab,1
Abbie,1
Abe,6
Abel,2
Abie,1
Abner,2
Abraham,3


Whoa. There are people in there without first names! Keep that in mind when we discuss data integrity in Lesson 4. 
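
A useful way to think about `GROUP BY` is as a loop that runs the `WHERE` version of the query once per distinct value. The sketch below checks that equivalence with Python's built-in `sqlite3` module and a toy table of six first names (made-up data, not the Lahman database).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Tony",), ("Bill",), ("Tony",), ("Bill",), ("Tony",), ("Abe",)],
)

# One query, one output row per first name.
grouped = conn.execute(
    "SELECT first_name, count(*) FROM people GROUP BY first_name ORDER BY first_name"
).fetchall()

# The long way: a separate WHERE query for each name we already know about.
one_at_a_time = []
for name in ["Abe", "Bill", "Tony"]:
    cnt = conn.execute(
        "SELECT count(*) FROM people WHERE first_name = ?", (name,)
    ).fetchone()[0]
    one_at_a_time.append((name, cnt))

print(grouped)
print(one_at_a_time)
```

Both approaches give the same counts, but `GROUP BY` does not require us to know the list of names in advance.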

Note that the `GROUP BY` clause had to specify which column to group by. We can of course use multiple columns:

In [None]:
%%sql
SELECT nameFirst, nameLast, count(*) as cnt
FROM Master
WHERE nameFirst <> ''
GROUP BY nameFirst, nameLast
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameFirst,nameLast,cnt
A. J.,Achter,1
A. J.,Burnett,1
A. J.,Cole,1
A. J.,Ellis,1
A. J.,Griffin,1
A. J.,Hinch,1
A. J.,Morris,1
A. J.,Murray,1
A. J.,Pierzynski,1
A. J.,Pollock,1


We used a `WHERE` clause to eliminate anyone with a blank first name. The `<>` comparator just means *not equal to*.  
Who knew that A. J. was such a popular name?  

### **The `HAVING` Clause**

There are times when we'll want to only show groups that meet specific conditions. For that we use the `HAVING` clause, which works like `WHERE` except for *groups* instead of *rows*. 



In [None]:
%%sql
SELECT nameFirst, count(*) AS cnt
FROM Master
GROUP BY nameFirst
HAVING cnt > 300
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
7 rows affected.


nameFirst,cnt
Bill,549
Bob,341
George,305
Jim,442
Joe,399
John,487
Mike,437


Here we only wanted groups with more than 300 rows in them. We also used the `AS` modifier to name the count as `cnt` so we could refer to it in the `HAVING` clause. We will come back to the use of `AS` to define *aliases* a little later on in this lesson.  
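
The difference between `WHERE` and `HAVING` is easiest to see when both are applied to the same data. The sketch below uses Python's built-in `sqlite3` module and a six-row toy table (made-up names and weights). It spells out `count(*)` in the `HAVING` clause rather than using the alias, an assumption on our part that this form is accepted by more SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Bill", 170), ("Bill", 180), ("Bill", 200), ("Bob", 190), ("Bob", 210), ("Abe", 160)],
)

# WHERE filters individual rows BEFORE they are grouped and counted.
where_then_group = conn.execute("""
    SELECT first_name, count(*) AS cnt
    FROM people
    WHERE weight >= 180
    GROUP BY first_name
    ORDER BY first_name
""").fetchall()

# HAVING filters whole groups AFTER the counts have been calculated.
group_then_having = conn.execute("""
    SELECT first_name, count(*) AS cnt
    FROM people
    GROUP BY first_name
    HAVING count(*) > 2
    ORDER BY first_name
""").fetchall()

print(where_then_group)
print(group_then_having)
```

In the first query one of the Bills is dropped before counting. In the second, all three Bills are counted and the smaller groups are dropped afterward.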

### **The `ORDER BY` Clause**

The `ORDER BY` clause specifies how we want the rows of the resultset sorted. So, if we wanted to see the 10 most popular last names in our database we could try this:



In [None]:
%%sql
SELECT nameLast, count(*) AS cnt
FROM Master
GROUP BY nameLast
ORDER BY cnt DESC
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,cnt
Smith,155
Johnson,112
Jones,98
Brown,90
Miller,89
Williams,78
Wilson,74
Davis,68
Taylor,52
Moore,52


Like in the `HAVING` example we used an alias for the count. We also specified `DESC` in the `ORDER BY` clause to indicate that we want to order the counts in descending order from biggest to littlest. If we leave that off then it assumes we want ascending order (`ASC`) instead. 

In [None]:
%%sql
SELECT nameLast, count(*) AS cnt
FROM Master
GROUP BY nameLast
ORDER BY cnt
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,cnt
Herrin,1
Pyznarski,1
Barbary,1
Gardenhire,1
Stabell,1
Levsen,1
Noble,1
Torreyes,1
DeFreites,1
Wacha,1


If we want to sort by multiple columns then we just provide a list. The query will return a *lexicographic* ordering based on the order in which the columns are listed.

In [None]:
%%sql
SELECT nameLast, count(*) AS cnt
FROM Master
GROUP BY nameLast
ORDER BY cnt DESC, nameLast
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,cnt
Smith,155
Johnson,112
Jones,98
Brown,90
Miller,89
Williams,78
Wilson,74
Davis,68
Moore,52
Taylor,52


All columns listed after the first are used as tie-breakers, just like alphabetizing words in the dictionary. (Note: *lexicon* is a synonym for *dictionary*, which is what gives lexicographic ordering its name.)
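
For anyone coming from Python, a multi-column `ORDER BY` behaves like sorting on a tuple key. The sketch below compares the two using Python's built-in `sqlite3` module. The `name_counts` table is a hypothetical pre-aggregated table that holds four of the counts shown above.

```python
import sqlite3

data = [("Taylor", 52), ("Smith", 155), ("Moore", 52), ("Davis", 68)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_counts (name_last TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO name_counts VALUES (?, ?)", data)

# SQL: sort by count (largest first), then break ties alphabetically.
sql_sorted = conn.execute(
    "SELECT name_last, cnt FROM name_counts ORDER BY cnt DESC, name_last"
).fetchall()

# Python: the same ordering expressed as a tuple key.
# Negating the count gives descending order for that part of the key.
py_sorted = sorted(data, key=lambda row: (-row[1], row[0]))

print(sql_sorted)
print(py_sorted)
```

Moore and Taylor tie at 52, so the second sort column decides their order in both versions.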

### **A Note about SQL Comments**

To allow us humans to better understand what each bit of code does, we can insert *comments* (or notes) that explain the structure and logic of the code. 

In SQL there are two different kinds of comments:
- **With double hyphens `--`.** Any text on a line that is to the right of the hyphens is ignored. Use this syntax to explain the **logic** of what a query does. 
- **With `/*` and `*/`.** Any text in between is ignored. Use this syntax to mark sections of your code or comments that flow on multiple lines. Typically you will only see these in longer SQL scripts with multiple sections and complex logic. 

See below for examples. Note that the comments do not affect the query results, but they can make it easier for us to understand what is going on. 

> **Heads up:** you will need comments of both types for your final project. They are in some ways the most important part of the code. They are worth an entire letter grade! So, pay attention to how comments are used in the examples and then practice using them in your code. 

In [None]:
%%sql
-- This is an example of a so-called group by query. 
SELECT nameLast, count(*) AS cnt  
FROM Master
GROUP BY nameLast                  -- The GROUP BY clause can have any number of columns
ORDER BY cnt DESC, nameLast        -- ORDER BY applies to the groups, not the rows.
LIMIT 10; 

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,cnt
Smith,155
Johnson,112
Jones,98
Brown,90
Miller,89
Williams,78
Wilson,74
Davis,68
Moore,52
Taylor,52


In [None]:
%%sql
/* ----- THIS IS A SECTION COMMENT -----*/
-- Notice the `-----` used to make the section comment stand out. 

/* 
  This is a 
  comment that spans 
  several lines. 
*/

SELECT nameLast, count(*) AS cnt  
FROM Master
GROUP BY nameLast                  -- The GROUP BY clause can have any number of columns
ORDER BY cnt DESC, nameLast        -- ORDER BY applies to the groups, not the rows.
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,cnt
Smith,155
Johnson,112
Jones,98
Brown,90
Miller,89
Williams,78
Wilson,74
Davis,68
Moore,52
Taylor,52


---
## **Names, Aliases, and Views**

A SQL *name* is a label (or variable) used to identify a SQL object like a table, column, or query. Most names are provided by the coder when they create the database tables. However, sometimes these names are not specific enough, and SQL must generate more specific names without our help. Usually the generated names are not exactly the most human friendly (unless you happen to be Elon Musk). In such cases we define *aliases*, which act as nicknames for whatever the given name is. 

We've used column aliases a couple times already. 

```SELECT nameLast, count(*) AS cnt```


Why would we want to use an alias? Here are a few common use cases:

- **To make queries shorter and less subject to typo bugs.** Given the sometimes long and abstruse variable names found in some data sets, it only makes sense to use names that coders can actually type.
- **To give a function call or other calculation a meaningful name.** Column names like `1+1` just don't make much sense, so we provide aliases to use instead.  
- **To disambiguate duplicate column names.** This sort of thing happens when querying from multiple tables that use the same column names. 
- **To make certain kinds of advanced queries possible.** For example, there are times when a table might be *joined with itself*, which happens more often than one would expect.

We'll see each of these soon enough. Now let's see where they appear in code.

### **Naming Conventions, Backticks, and Dot Notation**

SQL standards *recommend* the same naming conventions as any other mainstream programming language: 
- use short but descriptive names like `birthdate` or `weight`
- no spaces or other punctuation except possibly underscores between compound names like `first_name` or `birth_weight`
- no names that match SQL keywords or function names like `SELECT`, `FROM`, `AS`, `count(*)`, etc. 
- no duplicate names in the same table or resultset. 

However, SQL does not *enforce* any of these rules directly by flagging an invalid name. Instead, it will complain and possibly throw an obscure error that says nothing about the name being invalid. 

To make the errors go away there is an easy but awkward workaround. In MySQL, if a name would otherwise be invalid, then we can wrap it in **backticks** like this: \`name\`. (Other DBMSs typically use double quotes for the same purpose.) The backtick character is found just above the `tab` key on most keyboards. (Why backticks instead of quotes? So that column names can have quote characters in them.) So for example, if we *just had* to have a table column named `1+1` then we can just refer to it as \``1+1`\` wherever it is needed. Without the backticks, SQL would just run the calculation and (possibly) complain that a column name was not specified. 

Similarly, since we can construct new tables by combining several existing tables, it sometimes occurs that we end up with duplicate column names. In that case we can disambiguate the column name using **dot notation**:

```table.column```

If two tables `A` and `B` have the column `date`, then we can refer to the columns as `A.date` and `B.date`.  

In certain rare cases we may even be working with tables from multiple databases. We can do that by prepending the database name:  

```database.table.column```

Keep this in mind the next time you need to copy data from one database to another. 

### **Column Aliases**

The format of a column alias is 

``` column AS alias```

This is the simplest and most common kind of alias. 

### **Expression Aliases** 

A calculation like `1+1` is an *expression*. Expressions are bits of code that can be evaluated to produce a value. If the expression is complex we might enclose it in parentheses and then give it an alias:

```( expression ) AS alias```  

That makes the scope of the expression as clear as possible and gets around the silly name problems they can cause.

### **Table Aliases**

The format of a table alias is the same as for a column alias. The difference is that they are found in the `FROM` clause instead of the `SELECT` clause. We can then use the table alias anywhere we need to refer to the table. 

```sql
SELECT a.column1, a.column2, b.column1
FROM atablewithanimpossiblylongname AS a 
      JOIN anothertablewithalongname AS b USING (ID)
WHERE a.column1 > b.column1;
```

We can also, of course, use column aliases to disambiguate the two columns called `column1`. 
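
Here is a runnable version of the same pattern, sketched with Python's built-in `sqlite3` module. The `debuts` and `finals` tables are hypothetical. They split the debut and final game dates of two players from the `Master` sample above into two tables that share the column names `player_id` and `game_date`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE debuts (player_id TEXT, game_date TEXT);
CREATE TABLE finals (player_id TEXT, game_date TEXT);
INSERT INTO debuts VALUES ('aardsda01', '2004-04-06'), ('aaronha01', '1954-04-13');
INSERT INTO finals VALUES ('aardsda01', '2015-08-23'), ('aaronha01', '1976-10-03');
""")

# Table aliases (d, f) plus dot notation tell the two game_date columns apart.
# Column aliases then give each one a distinct name in the resultset.
rows = conn.execute("""
    SELECT d.player_id AS player_id,
           d.game_date AS debut,
           f.game_date AS final_game
    FROM debuts AS d
         JOIN finals AS f USING (player_id)
    WHERE d.game_date < f.game_date   -- ISO dates stored as text compare correctly
    ORDER BY d.player_id
""").fetchall()

print(rows)
```

Without the `d.` and `f.` prefixes the database would have no way to know which `game_date` we meant.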

### **Views (Query Aliases)**

We can give queries names so we can reuse them later. We call these recallable queries *views*. The syntax to create a view is: 

```sql
CREATE VIEW view-name AS 
SELECT ...
```

Notice that this time the alias (shown above as `view-name`) comes *before* the `AS`. This allows the query to be as long as needed without affecting readability. If the query is *really* long then we may treat it like any other long expression and enclose it in parentheses when creating the view. 

Once a view has been created we can use it just like a table. There are some times when that is the best way to break apart complex queries into smaller ones that are easier to debug.  
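
The sketch below creates a view and then queries it as if it were a table. It assumes an in-memory SQLite database; the `players` table and the view name `born_before_1950` are hypothetical.

```python
import sqlite3

# Assumption: SQLite in memory; the table and view names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (nameLast TEXT, nameFirst TEXT, birthYear INTEGER);
    INSERT INTO players VALUES
        ('Aaron', 'Hank', 1934), ('Aaron', 'Tommie', 1939), ('Abad', 'Andy', 1972);

    -- The view stores the query, not a copy of the data
    CREATE VIEW born_before_1950 AS
    SELECT nameLast, nameFirst FROM players WHERE birthYear < 1950;
""")

# Once created, the view is used just like a table
view_rows = conn.execute(
    "SELECT nameFirst FROM born_before_1950 ORDER BY nameFirst"
).fetchall()
print(view_rows)
```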


---
## **Boolean Expressions**

A boolean expression is anything that can be tested and found True or False. Boolean expressions are typically used as conditions to test in `WHERE` or `HAVING` clauses but there are other uses as well. 

In the examples below, each kind of boolean expression involves an **operator** that defines the True / False test. The general pattern is always:

`LE operator RE`

where left expression (LE) and right expression (RE) are anything that SQL can *evaluate* to determine values. The simplest example is something like

```1 > 2```

which evaluates to False, since 1 is not greater than 2. A more complicated example is

```10*3+(9-4)*8 = item_number```

where the left expression `10*3+(9-4)*8` is a calculation and the right expression `item_number` is a column. 

> **Heads up:** True and False as used throughout this course are actually the numbers 1 (True) and 0 (False). We're calling them True and False for clarity of exposition. 
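
To see the 1 / 0 convention in action, here is a minimal Python sketch that evaluates a few boolean expressions. It assumes an in-memory SQLite database (which follows the same convention as MySQL here) and a hypothetical `items` table.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Comparisons come back as the integers 0 (False) and 1 (True)
false_value, true_value = conn.execute("SELECT 1 > 2, 1 < 2").fetchone()

# Because they are numbers, truth values can even be added together
total = conn.execute("SELECT (1 < 2) + (2 < 3)").fetchone()[0]

# A boolean expression used as a WHERE condition on a hypothetical table
conn.execute("CREATE TABLE items (item_number INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(69,), (70,), (71,)])
matches = conn.execute(
    "SELECT item_number FROM items WHERE 10*3+(9-4)*8 = item_number"
).fetchall()

print(false_value, true_value, total, matches)
```

The left expression `10*3+(9-4)*8` works out to 70, so only the row with `item_number` 70 passes the test.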

### **Comparison Operators (a.k.a. Comparators)**

Except for a few SQL functions that return True or False, most boolean expressions are comparisons of one thing with another. In the table below LE and RE refer to the values of the expressions to the left and right of the operator. The first few operators are self-explanatory. The ones towards the bottom are explained in some detail below. 

| Operator | Usage: True only if ...                     | Example      |
| -------- |------------------------------------------- | -------------- |
| `=`      |  LE is equal to RE                          | `1=2` $\rightarrow$ False |
| `>`      |  LE is greater than  RE                   | `1>2` $\rightarrow$ False |
| `>=`      |  LE is greater than or equal to RE         | `1>=2` $\rightarrow$ False |
| `<`      |  LE is less than RE                      | `1<2` $\rightarrow$ True |
| `<=`      |  LE is less than or equal to RE            | `1<=2` $\rightarrow$ True |
| `<>`      |  LE is not equal to RE                     | `1<>2` $\rightarrow$ True |
| `IN`      |  LE is IN the set RE                       | `'Z' IN ('A','B', 'C')` $\rightarrow$ False |
| `NOT IN`  |  LE is NOT IN the set RE               | `'Z' NOT IN ('A','B', 'C')` $\rightarrow$ True |
| `BETWEEN`  |  LE is within the range RE             | `'C' BETWEEN  'A' AND 'Z'` $\rightarrow$ True |
| `NOT BETWEEN`  |  LE is not within the range RE             | `'C' NOT BETWEEN  'A' AND 'Z'` $\rightarrow$ False |
| `LIKE`  |  text LE matches the pattern RE             | `'Dump' LIKE  'D%'` $\rightarrow$ True |
| `NOT LIKE`  |  text LE does not match the pattern RE             | `'Dump' NOT LIKE  'D%'` $\rightarrow$ False |
| `IS NULL`  |  LE evaluates to NULL   | `2 IS NULL` $\rightarrow$ False |
| `IS NOT NULL`  |  LE does not evaluate to NULL   | `2 IS NOT NULL` $\rightarrow$ True |

#### **The `IN` and `NOT IN` Operators**

These operators require that the right expression be a *set* of values. The set can be defined in two ways:
- a `SELECT` query that returns a single column   
- a *set literal* like `('Amazon','Bed Bath & Beyond', 'Christmas Tree Shops')` 

A few examples are given below. Take note of the comments. The comparators are fairly forgiving, trying their best to evaluate to True whenever it makes logical sense.
 

In [None]:
%%sql 
-- Returns 0 (False); the code in parentheses is a subquery
SELECT 'Tobias' IN (SELECT nameFirst FROM Master);

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


'Tobias' IN (SELECT nameFirst FROM Master)
0


In [None]:
%%sql
-- Returns 1 (True); the text '12' is coerced to a number while truth testing
SELECT '12' IN (10,11,12,13);

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"'12' IN (10,11,12,13)"
1


In [None]:
%%sql
-- Returns 1 (True); the number 12 and the text values are coerced to a common type before comparing
SELECT 12 IN ('10','11','12','13');

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"12 IN ('10','11','12','13')"
1


#### **The `BETWEEN` and `NOT BETWEEN` Operators**

These operators only work with ranges of values. Like `IN`, the range is a set. However, `BETWEEN` and `NOT BETWEEN` require that the set have a natural ordering. For example, `'C' BETWEEN  'A' AND 'Z'` is True because 'C' falls between 'A' and 'Z' in the ASCII character set, where the letters appear in alphabetical order. Similarly, `2 BETWEEN 10 AND 20` is False because 2 is not between 10 and 20 on a number line. 




In [None]:
%%sql
-- Heads up: the endpoints of the range are included in the set
SELECT 20 BETWEEN 10 AND 20;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


20 BETWEEN 10 AND 20
1


In [None]:
%%sql
-- When working with text, the comparison is case insensitive (under the default collation)
SELECT 'a' BETWEEN 'A' AND 'Z';

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


'a' BETWEEN 'A' AND 'Z'
1


> **Heads Up:** The most common use of `BETWEEN` is when working with dates, which of course have a natural ordering.
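
A date range test might look like the sketch below. It assumes an in-memory SQLite database, where dates are stored as ISO-formatted text (which sorts chronologically), and a hypothetical `players` table holding three debut dates borrowed from the `Master` table.

```python
import sqlite3

# Assumption: SQLite in memory; dates are ISO 'YYYY-MM-DD' text, which sorts chronologically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (playerID TEXT, debut TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("aaronha01", "1954-04-13"), ("aaronto01", "1962-04-10"), ("aardsda01", "2004-04-06")],
)

# Both endpoints are included, so a debut on exactly 1962-04-10 still counts
rows = conn.execute(
    """
    SELECT playerID
    FROM players
    WHERE debut BETWEEN '1950-01-01' AND '1962-04-10'
    ORDER BY debut
    """
).fetchall()
print(rows)
```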

#### **The `LIKE` and `NOT LIKE` Operators**

`LIKE` compares an expression to the left and an expression to the right. The right hand expression is a text pattern that can include wildcard characters:
- `%` matches any number of characters
- `_` matches exactly one character 

If the left hand expression matches the pattern on the right then the operator returns 1; otherwise it returns 0. 

The examples below show a few of the infinite variations on a simple pattern match. 

In [None]:
%%sql
SELECT "ABC" LIKE "ABC";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""ABC"" LIKE ""ABC"""
1


In [None]:
%%sql
-- As before, it is case insensitive
SELECT "abc" LIKE "ABC";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""abc"" LIKE ""ABC"""
1


In [None]:
%%sql
-- The pattern includes characters not found in the LE
SELECT "AB" LIKE "ABC";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""AB"" LIKE ""ABC"""
0


In [None]:
%%sql
-- The `%` wildcard matches 0 or more characters
SELECT "AB" LIKE "AB%";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""AB"" LIKE ""AB%"""
1


In [None]:
%%sql
-- The `_` wildcard matches exactly one character
SELECT "AB" LIKE "AB_";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""AB"" LIKE ""AB_"""
0


In [None]:
%%sql
SELECT "ABC" LIKE "AB_";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""ABC"" LIKE ""AB_"""
1


In [None]:
%%sql
-- Wildcards can appear anywhere in patterns
SELECT "AB" LIKE "A%B";

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


"""AB"" LIKE ""A%B"""
1


#### **The `IS NULL` and `IS NOT NULL` Operators**

`IS NULL` and `IS NOT NULL` are special in that they don't have a right hand expression. Or, rather, the right hand expression is always `NULL`. We use `NULL` to represent missing values. A `NULL` value is not the same as `0` (zero) or `''` (empty text string) or similar expressions. It represents nothingness itself and is actually a defined SQL constant. So, technically, the operator could have been a single word like `ISNULL`, but writing it as `IS NULL` makes it look more like the other boolean operators. 

> **Heads Up:** There is really just one test here: `x IS NOT NULL` is equivalent to `NOT (x IS NULL)`.  

In [None]:
%%sql
-- False because the empty string is still a value, not NULL
SELECT '' IS NULL;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


'' IS NULL
0


In [None]:
%%sql
-- False because 0 is still a value, not NULL
SELECT 0 IS NULL;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 IS NULL
0


In [None]:
%%sql
-- False because 'NULL' is not coercible to NULL; use the constant `NULL` instead
SELECT 'NULL' IS NULL;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


'NULL' IS NULL
0


In [None]:
%%sql
-- Here we are testing whether the constant `NULL` is NULL.
SELECT NULL IS NULL;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


NULL IS NULL
1


> **Heads Up:** Don't confuse `IS` with `=`. The keyword `IS` must be followed by `NULL` or `NOT NULL` (or, less commonly, by a truth value such as `TRUE` or `FALSE`), never by an ordinary value. 

In [None]:
%%sql
-- This doesn't work; IS NOT is not a synonym for `<>`
SELECT 1 IS NOT 2;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
(pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '2' at line 2")
[SQL: -- This doesn't work; IS NOT is not a synonym for `<>`
SELECT 1 IS NOT 2;]
(Background on this error at: https://sqlalche.me/e/14/f405)
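
The reverse mistake, using `=` where `IS NULL` is needed, does not raise an error but quietly returns nothing. The sketch below illustrates this, assuming an in-memory SQLite database and a hypothetical two-row `players` table.

```python
import sqlite3

# Assumption: SQLite in memory; the two rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (playerID TEXT, deathYear INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("aardsda01", None), ("aaronto01", 1984)],  # Python's None is stored as NULL
)

# `= NULL` evaluates to NULL (neither True nor False), so no row passes the filter
with_equals = conn.execute(
    "SELECT playerID FROM players WHERE deathYear = NULL"
).fetchall()

# `IS NULL` is the test that actually finds the missing values
with_is_null = conn.execute(
    "SELECT playerID FROM players WHERE deathYear IS NULL"
).fetchall()

print(with_equals, with_is_null)
```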


### **The `AND`, `OR`, and `NOT` Operators**

Sometimes one truth test is not enough. For conditions that involve multiple tests, we use `AND`, `OR`, and `NOT` to *compose* complex boolean expressions from simpler ones.

In the truth table below, A, B, and C represent boolean expressions and T and F represent True or False. The first three columns specify the value of A, B, and C. The remaining columns are various composite expressions that use `AND`, `OR` and `NOT`. For example, **the third line says that if A is True, B is False, and C is True then the expression `A AND B OR C` evaluates to True.** 

| A | B | C | | NOT A | A AND B | A OR B | A AND B OR C | A OR B AND C | A AND (B OR C) |
|:-:|:-:|:-:|:-:|:-----:|:-------:|:------:|:------------:|:------------:|:------------:|
| **T** | **T** | **T** |  | F | T | T | T | T | T |
| **T** | **T** | **F** |  | F | T | T | T | T | T |
| **T** | **F** | **T** |  | F | F | T | T | T | T |
| **T** | **F** | **F** |  | F | F | T | F | T | F |
| **F** | **T** | **T** |  | T | F | T | T | T | F |
| **F** | **T** | **F** |  | T | F | T | F | F | F |
| **F** | **F** | **T** |  | T | F | F | T | F | F |
| **F** | **F** | **F** |  | T | F | F | F | F | F |

Unless parentheses are used, `NOT` is evaluated first, then `AND`, then `OR`; operators of equal precedence are evaluated left to right. So `A AND B OR C` means `(A AND B) OR C`, and `A OR B AND C` means `A OR (B AND C)`. 
The last column shows the use of parentheses to alter the order, with the parenthetic expressions getting precedence. Note that in a couple of cases (the fifth and seventh rows) the value of `A AND B OR C` does not match `A AND (B OR C)`. The parentheses matter!
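
One way to double-check a truth table like this is to let a SQL engine evaluate every combination. The sketch below does so with an in-memory SQLite database (an assumption; it parses `AND` and `OR` with the same precedence as MySQL). The helper name `evaluate` is made up for illustration.

```python
import sqlite3
from itertools import product

# Assumption: SQLite in memory stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
expressions = ["A AND B OR C", "A OR B AND C", "A AND (B OR C)"]

def evaluate(expression, a, b, c):
    """Evaluate a composite boolean expression for given 1/0 values of A, B, and C."""
    # The inner query names the three inputs so the expression can be used verbatim
    sql = f"SELECT {expression} FROM (SELECT ? AS A, ? AS B, ? AS C)"
    return conn.execute(sql, (a, b, c)).fetchone()[0]

# Same row order as the truth table: TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF
inputs = list(product((1, 0), repeat=3))
truth_table = {e: [evaluate(e, a, b, c) for a, b, c in inputs] for e in expressions}

for e, values in truth_table.items():
    print(f"{e:<16}", " ".join("T" if v else "F" for v in values))
```

Each printed line lists one column of the table from top to bottom, which makes it easy to spot where `AND` binding more tightly than `OR` changes the answer.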

In [None]:
%%sql 
-- The `NOT` operator
SELECT NOT 1;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


NOT 1
0


In [None]:
%%sql
-- The `AND` operator; both sides have to be true (1)
SELECT 0 AND 1;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 AND 1
0


In [None]:
%%sql 
-- The `OR` operator; only one side has to be true
SELECT 0 OR 1;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 OR 1
1


In [None]:
%%sql
-- `A AND B OR C` for the seventh row of the truth table
SELECT 0 AND 0 OR 1;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 AND 0 OR 1
1


In [None]:
%%sql
-- `A AND (B OR C)` for the seventh row of the truth table
SELECT 0 AND (0 OR 1);

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 AND (0 OR 1)
0


In [None]:
%%sql
-- Careful: `=` binds more tightly than `AND` and `OR`, so here it only compares `1 = 0`
SELECT 0 AND 0 OR 1 = 0 AND (0 OR 1);

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


0 AND 0 OR 1 = 0 AND (0 OR 1)
0


In [None]:
%%sql
-- Parentheses around the LE and RE make `=` compare the two whole expressions (1 vs. 0)
SELECT (0 AND 0 OR 1) = (0 AND (0 OR 1));

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


(0 AND 0 OR 1) = (0 AND (0 OR 1))
0


> **Heads up:** Do not skip ahead, thinking you won't be tested on how boolean expressions work. It's actually pretty important. SQL is all about logic, which ultimately boils down to *evaluating boolean expressions*. Put in the time now and save yourself a lot of headache later. 

---
## **SQL Functions**

A function is a named calculation or other bit of code that can be run just by invoking its name and perhaps providing a few input arguments. Two examples we have already seen are `count()` and `sum()`. Notice how we *always* include the parentheses (and no spaces) right after the function name? That is how SQL (or just about any other language) knows that we are calling a function. The format is

```function-name(arg1, arg2, ...)```

where `arg1`, `arg2`, etc. represent input parameters. The text immediately to the left of the first parenthesis (without *any* spaces in between) is the function name. It is possible for a function to have any number of arguments, depending on how it is defined. In SQL, however, most functions accept either one or zero arguments. 

The MySQL manual includes [extensive reference documentation](https://dev.mysql.com/doc/refman/8.0/en/sql-function-reference.html) for *several hundred* operators and functions. If in doubt about what a function does or which one to use in a given situation, always [RTFM](https://en.wikipedia.org/wiki/RTFM) before asking for help with things that are clearly covered in the manual. 

While there are too many to cover here, SQL functions come in several flavors:
- **[Aggregate functions](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html)** like `count()` and `sum()` that summarize columns of data  
- **[Mathematical functions](https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html)** like `log()` or `abs()` that perform calculations on numerical data 
- **[Date and Time functions](https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html)** like `dayofmonth()` that operate on temporal data 
- **[String functions](https://dev.mysql.com/doc/refman/8.0/en/string-functions.html)** like `substr()` that work with text data 
- **[Window functions](https://dev.mysql.com/doc/refman/8.0/en/window-functions.html)** like `lag()` that allow us to work with column-oriented serial data (like time series) 
- **[Cast functions](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html)** that convert data from one type to another 
- Various **utility functions** that handle things like data [encryption](https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html), [row locking](https://dev.mysql.com/doc/refman/8.0/en/locking-functions.html), and conversions to/from alternate data formats like [JSON](https://dev.mysql.com/doc/refman/8.0/en/json-functions.html) or [XML](https://dev.mysql.com/doc/refman/8.0/en/xml-functions.html), and handle [a few other miscellaneous things](https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functions.html). 

Instead of getting into the details of any of these functions, we will introduce a few of the most common ones as we go along. In the meantime, don't forget to RTFM if you have any questions. 
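
As a quick taste, the sketch below calls a handful of functions that exist under the same names in both MySQL and SQLite. It assumes an in-memory SQLite database and a hypothetical three-row `players` table.

```python
import sqlite3

# Assumption: SQLite in memory; these function names also exist in MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (nameLast TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Aardsma", 215), ("Aaron", 180), ("Aase", 190)],
)

# Scalar functions work on one value at a time
scalar = conn.execute(
    "SELECT abs(-3), upper('abc'), substr('baseball', 1, 4), length('baseball')"
).fetchone()

# Aggregate functions summarize a whole column
aggregate = conn.execute(
    "SELECT count(*), sum(weight), avg(weight) FROM players"
).fetchone()

print(scalar)
print(aggregate)
```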


---
## **CASE Expressions**

Conditional execution is one of four fundamental control structures found in any programming language. Most languages (like the 14 other languages on the TIOBE Index at the top of this lesson) do that with if statements: **if such and such is true then do this.** SQL, however, is different from all of those other languages because **`SELECT` statements specify the results instead of the step-by-step procedural logic to generate the results.** In other words, there is no way to say **then do this procedure** in SQL. Nonetheless there are times when we will want different results depending on what we find in the data. For that, we use `CASE` expressions.

A [`CASE` expression](https://dev.mysql.com/doc/refman/8.0/en/case.html) has the following syntax pattern:

```sql
CASE
  WHEN boolean1 THEN expression1
  WHEN boolean2 THEN expression2
  ...
  ELSE expressionN
END
```
The expressions `boolean1`, `boolean2`, etc. specify conditions under which to evaluate and return the corresponding `expression1`, `expression2`, etc. Each condition is tried one at a time, starting with `boolean1`. Once one is found to be true, the `CASE` expression returns the matching result immediately without trying the remaining conditions. If none of the conditions are met then the `ELSE` clause (if it exists) supplies the result; otherwise the `CASE` expression evaluates to `NULL`. 

If the conditions are just alternate possible values of the same expression then we can use a slightly more compact form:

```sql
CASE expression
  WHEN value1 THEN expression1
  WHEN value2 THEN expression2
  ...
  ELSE expressionN
END
```

where `value1`, `value2`, ... represent the possible values. Besides being slightly more compact, this also has the performance advantage that the `expression` just before the first `WHEN` is evaluated only once. 

We can use a `CASE` expression anywhere we can use a function. It is essentially a kind of function, just with different syntax. 

Let's see how it works. The following `CASE` expression returns a different phrase depending on the current month. Notice that we used an alias to avoid what would otherwise be a long and awkward column name. 


In [None]:
%%sql
SELECT  
  CASE month(now())
    WHEN 12 THEN 'Happy Holidays!'
    WHEN 1 THEN 'Happy New Year!'
    ELSE 'Thanks'
  END AS closing_phrase
   


 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


closing_phrase
Thanks
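
The cell above uses the compact form. For comparison, here is a sketch of the first form, where each `WHEN` has its own boolean test. It assumes an in-memory SQLite database and a hypothetical `players` table; the weight classes are invented for illustration.

```python
import sqlite3

# Assumption: SQLite in memory; the weight thresholds are arbitrary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (nameLast TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Aardsma", 215), ("Aaron", 180), ("Abbey", 169)],
)

query = """
    SELECT nameLast,
      CASE
        WHEN weight >= 200 THEN 'heavy'    -- tried first
        WHEN weight >= 180 THEN 'medium'   -- only tried if the first test fails
        ELSE 'light'
      END AS weight_class
    FROM players
    ORDER BY weight DESC;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

A weight of 215 satisfies both tests, but only the first matching `WHEN` counts, so that row is labeled 'heavy'.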


> **Heads Up:** MySQL also provides a function
 ```
 IF(boolean_expression, true_result, false_result)
 ``` 
 that works similarly to a two-way `CASE` expression. However, it is not part of standard SQL, and some database vendors use the name `IF` for a different construct, so it is safest to stick with `CASE` expressions for now. 

---
## **Grouping and Aggregation**

Aggregation queries with `GROUP BY` clauses should be pretty familiar to anyone who has created an Excel PivotTable. 

![MS Excel Pivot Table](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_excel_pivot_table.png)

In fact, there is a one-to-one correspondence between them:

| SQL | |Excel |
|-----|--- |-------|
| `SELECT` clause | $\Leftrightarrow$| PivotTable Value fields |
| `GROUP BY` clause | $\Leftrightarrow$ | PivotTable Row/Column fields |
| `WHERE` clause | $\Leftrightarrow$ | PivotTable Filter fields |

You can likely guess which came first. Excel only steals from the best. 

Just as there are constraints on what you can legally do with a PivotTable, there are three rules that govern aggregation queries:
- The `WHERE` clause is executed *before* forming the groups.
- Any clauses after `GROUP BY` (i.e., `HAVING`, `ORDER BY`, and `LIMIT`) refer to the groups, not the rows in the table.
- The column list in the `SELECT` clause can only include grouping columns (in the `GROUP BY` clause) and aggregate functions applied to columns *not* in the `GROUP BY` clause.
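
The three rules can be seen together in the sketch below, which assumes an in-memory SQLite database and a small hypothetical `players` table.

```python
import sqlite3

# Assumption: SQLite in memory; the rows are a made-up sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (nameLast TEXT, nameFirst TEXT, birthYear INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [
        ("Aaron", "Hank", 1934), ("Aaron", "Tommie", 1939),
        ("Aase", "Don", 1954),
        ("Abad", "Andy", 1972), ("Abad", "Fernando", 1985),
        ("Abbey", "Bert", 1869), ("Abbey", "Charlie", 1866),
    ],
)

query = """
    SELECT nameLast, count(*) AS cnt, min(birthYear) AS earliest
    FROM players
    WHERE birthYear >= 1900      -- rule 1: rows are filtered before grouping
    GROUP BY nameLast
    HAVING count(*) > 1          -- rule 2: this test applies to groups, not rows
    ORDER BY nameLast;
"""
# Rule 3: the SELECT list has only the grouping column and aggregate functions
rows = conn.execute(query).fetchall()
print(rows)
```

The two Abbeys are dropped by `WHERE` before any groups exist, while the lone Aase forms a group of one that `HAVING` then removes.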

The last rule can be a bit tricky. Technically, the following query should trigger an error. Can you guess why? 






In [None]:
%%sql 
SELECT nameLast, nameFirst, count(*) AS cnt
FROM Master
GROUP BY nameLast
ORDER BY cnt
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


nameLast,nameFirst,cnt
Honeycutt,Rick,1
Lauder,Billy,1
Allietta,Bob,1
Macko,Steve,1
Barnette,Tony,1
Bierbauer,Lou,1
Oravetz,Ernie,1
DeHaan,Kory,1
Tavener,Jackie,1
Bollo,Greg,1


The issue is with `nameFirst`, which is not named in the `GROUP BY` clause. Depending on the DBMS vendor and its settings, sometimes this triggers an error (as the SQL standards dictate), but some DBMSes quietly pick an arbitrary `nameFirst` value from each group instead of raising the error. So, sometimes a query will work just fine in one DBMS (e.g., MySQL with its `ONLY_FULL_GROUP_BY` mode turned off) but the same query will fail in another (e.g., MS Access). To be safe, always follow the standard. 

We finish with a very common *logic* bug that SQL cannot automagically fix for you. Consider the code below. Can you detect the logic error?

In [None]:
%%sql 
SELECT nameLast, nameFirst, count(*) AS cnt
FROM Master
ORDER BY cnt
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


nameLast,nameFirst,cnt
Aardsma,David,19105


Here we are mixing scalar data (i.e., data from specific rows) with aggregate calculations (i.e., over all rows in the table). This is illegal in ANSI standard SQL, though not every DBMS will flag the error (the MySQL server used here did not). If you are not careful, you may think that David Aardsma appears in the `Master` table 19105 times. The fix is to use a `GROUP BY` clause that has both the `nameFirst` and `nameLast` columns listed. 

**As a general rule, look for any queries that have `count()`, `sum()`, etc. in the `SELECT` clause. If the `SELECT` clause also has individual column names, then add a `GROUP BY` clause that includes each column name not in a function call.**
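
Applied to the query above, the rule gives something like the following sketch. It assumes an in-memory SQLite database and a hypothetical `players` table in which one name deliberately appears twice.

```python
import sqlite3

# Assumption: SQLite in memory; 'Hank Aaron' is duplicated on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (nameLast TEXT, nameFirst TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Aaron", "Hank"), ("Aaron", "Hank"), ("Abbey", "Bert"), ("Abbey", "Charlie")],
)

# Every non-aggregated column in SELECT also appears in GROUP BY
query = """
    SELECT nameLast, nameFirst, count(*) AS cnt
    FROM players
    GROUP BY nameLast, nameFirst
    ORDER BY cnt DESC, nameLast, nameFirst
    LIMIT 10;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Now each count describes one specific first-and-last-name combination instead of the whole table.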

---
## **PRO TIPS: How to use SQL in Python and Excel**

### **SQL in Python (and Pandas)**
As discussed in Lesson 1, support for SQL (and SQLite) is built into Python. The [PEP 249](https://www.python.org/dev/peps/pep-0249/) documentation provides all the nitty gritty details. However, for our purposes, where we just want to extract data from a relational database, there are simpler ways to do it that involve less code and risk of creating bugs.

We will assume that the final destination for the data is a [pandas](https://pandas.pydata.org/) DataFrame. The pandas library (package) was among the things we imported at the beginning of this lesson. The code looked like this:
```python 
import pandas as pd
```
Pretty simple, right? The syntax even looks sort of SQL-like, using `as` to specify the alias `pd` for the library name. (That actually happens a lot, with newer languages cribbing off of C and SQL for things like this.)

When using Jupyter the easiest way to extract data from a SQL database is just like we've been doing it. Use `%sql` magic and then use the results in a Python code cell. For short snippets of SQL we can do something like this:


 

In [None]:
# use %sql (with only one %) to get a resultset as a Python expression
rs = %sql SELECT * FROM Master LIMIT 10

# convert the resultset to a dataframe
df = rs.DataFrame()
df

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
5,abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
6,abadijo01,1850,11,4,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
7,abbated01,1877,4,15,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01
8,abbeybe01,1869,11,11,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
9,abbeych01,1866,10,14,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


If we want to run longer SQL queries that involve multiple lines, then we use a `%%sql` cell magic and the special `_` Python variable in separate cells:

In [None]:
%%sql 
SELECT * 
FROM Master 
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
10 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
abadijo01,1850,11,4,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
abbated01,1877,4,15,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01
abbeybe01,1869,11,11,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
abbeych01,1866,10,14,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


In [None]:
df = _.DataFrame()
df

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
5,abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
6,abadijo01,1850,11,4,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
7,abbated01,1877,4,15,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01
8,abbeybe01,1869,11,11,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
9,abbeych01,1866,10,14,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


When not using `%%sql` (e.g., from the command line or in an IDE) we can use pandas's built-in SQL support instead. Once the data is imported as DataFrames, one can then use pandas functions and methods to do the sorts of things SQL does (sorting, grouping, counting, joining, etc.) but in a Pythonic way. After we get a better handle on SQL we will return to pandas to see how some of these things are done, but for now we are just concerned about retrieving data from a SQL database. If you are curious how this works then refer to the [pandas IO Tools docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), where you will find the `read_sql()` function and its cousins `read_sql_table()` and `read_sql_query()`. The names should be pretty self-explanatory by now. 
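
For the curious, a minimal sketch of `read_sql_query()` is shown below. It assumes pandas is installed and uses an in-memory SQLite connection with a hypothetical `players` table; connecting to the course's MySQL server would need a different connection object.

```python
import sqlite3
import pandas as pd

# Assumption: SQLite in memory stands in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (nameLast TEXT, nameFirst TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Aaron", "Hank"), ("Aaron", "Tommie"), ("Aase", "Don")],
)

# Run the query and load the result set straight into a DataFrame
df = pd.read_sql_query(
    "SELECT nameLast, count(*) AS cnt FROM players GROUP BY nameLast ORDER BY nameLast",
    conn,
)
print(df)
```

The column aliases in the query become the DataFrame's column names.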

### **SQL in Excel**

Unlike Python, there is no direct way to use standard SQL in MS Excel. However, Excel does provide a utility called [MS Query](https://support.microsoft.com/en-us/office/use-microsoft-query-to-retrieve-external-data-42a2ea18-44d9-40b3-9c38-4c62f252da2e) that uses a *SQL-like* query syntax. For what it's worth, Google Sheets provides something similar with the [`QUERY()`](https://learnsql.com/blog/sql-in-google-sheets-query/) function. 

In MS Excel the place to start is with the Data tab. There you'll want to ask for external data from a non-SQL Server database. 
![MS Query 1](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_ms_query1.png)
Oops, unless you have done this before you get an error. 
![MS Query 2](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_ms_query2.png)
From there you'll find yourself going down a fairly deep rabbit hole of arcane tech like vendor-specific ODBC drivers, OLAP cubes, ...   
**On second thought, let's not do that. If you need SQL in Excel with anything but MS SQL Server then we wish you the best of luck. This is supposed to be a class in databases, not MS Office configuration. Ironically, it may be easier to do the query in Google Sheets and then export to an MS Excel workbook. It's a little easier to do this sort of thing in MS Windows but you still have a few extra steps.**



---
## **Tech Spotlight: MySQL DBMS**

In this lesson we have spent quite a bit of time working with MySQL. It is, after Oracle, the second most popular DBMS as of January 2021, according to [DB-engines.com](https://db-engines.com/en/ranking).

![MySQL ranking](https://github.com/christopherhuntley/DATA6510/raw/master/img/L2_db_engines_popularity.png)

Like SQLite, which we introduced in Lesson 1, MySQL has very modest origins. It was started in 1994 as free software under the GNU General Public License and found its first widespread usage as the "backend" of some of the earliest e-commerce websites. By 1996 it was the most deployed DBMS on the web. There were two primary reasons for this popularity:
- It was "free as in beer" (no cost) and free as in freedom (to debug and adapt as needed).
- It had "lightning fast" read performance on super cheap hardware. 

However, these obvious advantages came at a cost:
- It had "slow as molasses" write performance.
- It did not enforce referential integrity rules.
- It did not have transaction controls (rollback on error).

So, it could take a while to record a sale and even then it might corrupt the data if a query was badly written or the server lost power. That's what we lived with back then, when the choice was either i) pay a fortune for Oracle or Microsoft; or ii) take your chances with MySQL and make frequent backups. 

Over the years, MySQL has remained free and highly efficient, still a default choice for millions of websites. Meanwhile, it has also become standards compliant, often implementing all the required features of the latest standard *before* competing DBMS products. Finally, in 2008 it was acquired by Sun Microsystems and then by Oracle in 2010. So, for those keeping track, Oracle actually owns the top two DBMS products on the market. (And it sells paid support for MySQL for those who need it.)



## **Congratulations! You've made it to the end of Lesson 2.**

In Lesson 3 we will continue on to more advanced SELECT statements that involve multiple tables or subqueries. That's about as far as most data scientists go ... but we will of course continue on with even more advanced topics. There is *a lot more* to SQL than pulling rows of data from tables. 