## Anatomy of the SELECT statement

The general structure of a `SELECT` statement is

```
      SELECT   <columns> 
        FROM   <table>             
       WHERE   <predicate on rows> 
    GROUP BY   <columns> 
      HAVING   <predicate on groups> 
    ORDER BY   <columns> 
```

In this notebook we look at the various clauses of the `SELECT` statement.

In [None]:
%load_ext sql

# Windows users: include your password in the connection string below
%sql postgresql://isdb16@localhost/postgres

In [None]:
%%sql

DROP TABLE IF EXISTS Students CASCADE; 

CREATE TABLE Students (
    PRIMARY KEY(id),
    id     text,
    name   text,
    state  CHAR(2),
    gpa    NUMERIC(3,2)
);

INSERT INTO Students (id, name, state, gpa)
     VALUES
     ('s1', 'Mary',  'PA', 4.00),
     ('s2', 'Jack',  'CA', 3.25),
     ('s3', 'Pat',   'FL', 2.78),
     ('s4', 'Jill',  'FL', 2.5),
     ('s5', 'Joe',   'PA', 3.9),
     ('s6', 'Ellen', 'CA', 3.00),
     ('s7', 'Harry', 'PA', 1.50),
     ('s8', 'Sally', 'OH', 3.25);

**Note**: While not needed, the output of every SELECT query has been ordered for easy verification.

### Simple  SELECT / FROM queries

####  List all the fields of the Students table

In [None]:
%%sql 

  SELECT *
    FROM Students
ORDER BY id;

Think of the `FROM` clause as being equivalent to a 'loop' in
procedural languages. When the SQL processor executes the `FROM`
clause it iterates over each row of the table.

The result of any select statement is a table. The above `SELECT`
statement started from `Students`, an 8x4 table, and produced another
8x4 table.
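The claim that a query result is itself a table can be checked by inspecting its shape. The following is a minimal sketch, assuming an in-memory SQLite database (Python's standard `sqlite3` module) as a stand-in for the PostgreSQL server used in this notebook; the `gpa REAL` column type is also an assumption made for SQLite.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT, state TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("s1", "Mary", "PA", 4.00),
        ("s2", "Jack", "CA", 3.25),
        ("s3", "Pat", "FL", 2.78),
        ("s4", "Jill", "FL", 2.50),
        ("s5", "Joe", "PA", 3.90),
        ("s6", "Ellen", "CA", 3.00),
        ("s7", "Harry", "PA", 1.50),
        ("s8", "Sally", "OH", 3.25),
    ],
)

cursor = conn.execute("SELECT * FROM Students ORDER BY id")
column_names = [col[0] for col in cursor.description]  # names of result columns
rows = cursor.fetchall()                                # one tuple per result row

# The result has the same shape as the source table: 8 rows by 4 columns.
print(len(rows), "x", len(column_names))
```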

### Filtering rows with a WHERE clause

While the `FROM` clause iterates over the rows, the `WHERE` clause
functions as an 'if' statement: only those rows that satisfy the
predicate are passed on to the `SELECT` clause.

####  Which state is Jack from?

In [None]:
%%sql

SELECT  state
  FROM  Students
 WHERE  name = 'Jack';

The result is a 1x1 table.

#### What is Joe’s GPA?

In [None]:
%%sql 

SELECT  name, gpa
  FROM  Students
 WHERE  name = 'Joe';


The result is a 1x2 table.


#### Which students have a GPA of B (3.0) or above?


In [None]:
%%sql

  SELECT  name, gpa
    FROM  Students
   WHERE  gpa >= 3.0
ORDER BY name;


### How a SELECT statement is executed 

The structural order of clauses in a `SELECT` statement is

```
SELECT   <columns>
  FROM   <table>
 WHERE   <predicate>
```
This order is mandated by the SQL spec.

Even though we write the clauses in this order, this is not the
order in which they are (conceptually) executed.  The execution order is shown below:

```
      SELECT   <columns> ........... 3
        FROM   <table>  ............ 1            
       WHERE   <predicate> ......... 2
```

1. The SQL processor first identifies which table it is going to
   work on.  So far our tables come from CREATE TABLE statements.
   We'll call these _base tables_ (in contrast to the 'virtual
   tables' we'll see later on).
   
2. The table from step 1 is then filtered row-wise.  Those rows
   that satisfy the predicate form another (temporary) table.
   
3. The SQL processor takes the table from step 2 and extracts the
   specified columns.  At this point, if an aggregation function
   (`count, max, min, avg`) is used, the aggregate is calculated from
   the rows of the table from step 2 (see below).
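
The three steps above can be mimicked in plain Python. This is only an analogy for the conceptual order, not a description of how a database engine is implemented; the list of dictionaries below is a hand-written copy of the `Students` rows.

```python
# The Students rows, written as plain Python dictionaries (illustrative copy).
students = [
    {"id": "s1", "name": "Mary", "state": "PA", "gpa": 4.00},
    {"id": "s2", "name": "Jack", "state": "CA", "gpa": 3.25},
    {"id": "s3", "name": "Pat", "state": "FL", "gpa": 2.78},
    {"id": "s4", "name": "Jill", "state": "FL", "gpa": 2.50},
    {"id": "s5", "name": "Joe", "state": "PA", "gpa": 3.90},
    {"id": "s6", "name": "Ellen", "state": "CA", "gpa": 3.00},
    {"id": "s7", "name": "Harry", "state": "PA", "gpa": 1.50},
    {"id": "s8", "name": "Sally", "state": "OH", "gpa": 3.25},
]

# Step 1 (FROM): identify the table to work on.
table = students

# Step 2 (WHERE): keep only the rows that satisfy the predicate gpa >= 3.0.
filtered = [row for row in table if row["gpa"] >= 3.0]

# Step 3 (SELECT): extract the requested columns from each surviving row.
projected = [(row["name"], row["gpa"]) for row in filtered]

# An aggregate such as COUNT is computed from the step-2 table.
count_b_or_above = len(filtered)

print(projected)
print(count_b_or_above)
```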

### Aliases

Within a single table, column names need to be unique.  Hence when
a query works with a single table, there is no ambiguity as to which
column a name refers to.  But when working with multiple tables, the
same column name can be used across tables, e.g., the column name
'state' could appear in different tables.  To disambiguate and
indicate which table's 'state' column we are referring to, we prefix
the column name with the table name.  Hence the above query could be
written as:

In [None]:
%%sql

  SELECT  Students.name, Students.gpa
    FROM  Students
   WHERE  gpa >= 3.0
ORDER BY name;


and we get the same result set.

Of course, in this example we have only one table and hence don't
really need the prefix.  Using the full table name as a prefix is
often cumbersome, so we use a shorter alias to refer to
the table:

In [None]:
%%sql

  SELECT  s.name, s.gpa
    FROM  Students AS s
   WHERE  gpa >= 3.0
ORDER BY name;


Again, we get the same answer (as we should!)

### Sorting (aka ORDER BY)

The order of the rows and columns in a table is irrelevant.
(There is no such thing as a row that follows another row, or a
column to the right of another column.)  When we do want the output
to be arranged in a particular sequence we use the ORDER BY clause.

#### List the names of students who have a B or above in descending order of their GPA. Break ties by sorting in ascending order by name

In [None]:
%%sql 

  SELECT  s.name, s.gpa
    FROM  Students AS s
   WHERE  gpa >= 3.0
ORDER BY  s.gpa DESC, s.name ASC;


Since we are working with just one table, the column names are
unique and hence the alias prefix could be omitted.
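
To see the tie-breaking rule at work, the sketch below runs the same query against an in-memory SQLite database (an assumption; the notebook itself uses PostgreSQL) and compares it with an equivalent two-key sort in Python. Jack and Sally share a GPA of 3.25, so their relative order is decided by name.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT, state TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("s1", "Mary", "PA", 4.00),
        ("s2", "Jack", "CA", 3.25),
        ("s3", "Pat", "FL", 2.78),
        ("s4", "Jill", "FL", 2.50),
        ("s5", "Joe", "PA", 3.90),
        ("s6", "Ellen", "CA", 3.00),
        ("s7", "Harry", "PA", 1.50),
        ("s8", "Sally", "OH", 3.25),
    ],
)

# Sorting done by the database: GPA descending, then name ascending.
sorted_by_sql = conn.execute(
    "SELECT name, gpa FROM Students WHERE gpa >= 3.0 "
    "ORDER BY gpa DESC, name ASC"
).fetchall()

# The same ordering expressed as a Python sort key on unsorted rows.
unsorted_rows = conn.execute(
    "SELECT name, gpa FROM Students WHERE gpa >= 3.0"
).fetchall()
sorted_in_python = sorted(unsorted_rows, key=lambda row: (-row[1], row[0]))

print(sorted_by_sql)
```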

## Aggregation functions

#### What is the average GPA of all students?

In [None]:
%%sql
SELECT AVG(gpa)
  FROM students;
    

Note that the AVG aggregation function gives back a _single_
value.  To be precise we get back a table with just ONE column and
ONE row.  Recall that _all_ select statements return a table.

Since `AVG(gpa)` is a single value, we cannot select it alongside a
per-row column such as `name` in the same query, as we would be
trying to display tables of different sizes.
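
The size mismatch can be made concrete by running the two parts as separate queries and comparing their shapes. This is a minimal sketch, assuming an in-memory SQLite database in place of the notebook's PostgreSQL connection; it only compares result sizes and does not reproduce the error PostgreSQL reports for the combined query.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT, state TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("s1", "Mary", "PA", 4.00),
        ("s2", "Jack", "CA", 3.25),
        ("s3", "Pat", "FL", 2.78),
        ("s4", "Jill", "FL", 2.50),
        ("s5", "Joe", "PA", 3.90),
        ("s6", "Ellen", "CA", 3.00),
        ("s7", "Harry", "PA", 1.50),
        ("s8", "Sally", "OH", 3.25),
    ],
)

# The aggregate collapses the whole table into a single row and column.
avg_rows = conn.execute("SELECT AVG(gpa) FROM Students").fetchall()

# A plain column yields one row per student.
name_rows = conn.execute("SELECT name FROM Students").fetchall()

print(len(avg_rows), "row vs", len(name_rows), "rows")
```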

####  Another (even simpler) aggregation:  How many students are there?

In [None]:
%%sql

SELECT COUNT(NAME)
  FROM students;
    

Note that the SELECT statement returns a 1x1 table.

## Aggregations + WHERE


We can combine aggregation functions AND WHERE clauses.

#### How many students are from PA?

To get a count of students we could aggregate over any column
(provided it has no NULL values).  So we could also say
`SELECT COUNT(name)`.

In [None]:
%%sql

SELECT COUNT(state)
  FROM students
 WHERE STATE = 'PA';


Note that there is a subtle difference between
`COUNT(*)` and `COUNT(name)`.  `COUNT(*)` counts _all_ the rows of
the table, whereas `COUNT(name)` counts only the non-NULL values in
the column `name`.  Hence if the name of any student is
NULL, `COUNT(*)` and `COUNT(name)` will differ.
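
The difference only shows up when a NULL is present, which the `Students` data above does not contain. The sketch below therefore uses a small hypothetical table with one extra student (`s9`) whose name is NULL, again in an in-memory SQLite database assumed as a stand-in for PostgreSQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [
        ("s1", "Mary"),
        ("s2", "Jack"),
        ("s9", None),  # hypothetical student with a NULL name
    ],
)

# COUNT(*) counts rows; COUNT(name) skips rows where name is NULL.
count_all, count_name = conn.execute(
    "SELECT COUNT(*), COUNT(name) FROM Students"
).fetchone()

print(count_all, count_name)
```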


### Re-visiting how a SELECT statement is executed 

With `ORDER BY` included, the execution order is:
```
      SELECT   <columns> ........... 3
        FROM   <table>  ............ 1            
       WHERE   <predicate> ......... 2
    ORDER BY   <columns> ........... 4
```

  1. The SQL processor first identifies which table it is going to
     work on.  So far our tables come from CREATE TABLE statements.
     We'll call these _base tables_ (in contrast to the 'virtual
     tables' we'll see later on).
  2. The table from step 1 is then filtered row-wise.  Those rows
     that satisfy the predicate form another (temporary) table.
  3. The SQL processor takes the table from step 2 and extracts the
     specified columns.  At this point, if an aggregation function
     (count, max, min, avg) is used, the aggregate is calculated from
     the rows of the table from step 2.
  4. Once the relevant columns have been selected (or the aggregate
     calculated) we get one more temporary table, which is arranged
     by the ORDER BY clause to give our final result.

## GROUPS 

Earlier we saw queries along the following lines:
   - which state is a person from?
   - who are all the people from a particular state?
   - how many students are there in total?
   - how many students are from a particular state?

Now suppose we want to know:

#### How many students come from each state?


Conceptually, realizing this query is easy:

   1. Form groups of students from each state, e.g., all students
       from PA, all from FL, all from CA etc.
   2. Count the number of students in each group.
    
We realize this in SQL with:

In [None]:
%%sql 

  SELECT state, COUNT(STATE)
    FROM Students
GROUP BY state
ORDER BY state;


The execution of the above statement can be visually dissected along
the following lines.  Suppose we executed the following query:

In [None]:
%%sql 

SELECT *
  FROM Students
 ORDER BY STATE;
  

Students from the same state are grouped together.
So instead of viewing a table of 8 rows we could view the table
as having 4 groups:
```
        +----+-------+-------+------+
        | id | name  | state | gpa  |
        +----+-------+-------+------+
        | s6 | Ellen | CA    | 3.00 |
        | s2 | Jack  | CA    | 3.25 |
        +----+-------+-------+------+
        | s3 | Pat   | FL    | 2.78 |
        | s4 | Jill  | FL    | 2.50 |
        +----+-------+-------+------+
        | s8 | Sally | OH    | 3.25 |
        +----+-------+-------+------+
        | s5 | Joe   | PA    | 3.90 |
        | s1 | Mary  | PA    | 4.00 |
        | s7 | Harry | PA    | 1.50 |
        +----+-------+-------+------+
```
The order of execution of the various clauses is now:

```
      SELECT   <columns> ........... 4
        FROM   <table>  ............ 1           
       WHERE   <predicate> ......... 2
    GROUP BY   <columns> ........... 3
    ORDER BY   <columns> ........... 5
```

**Once we have formed groups, the SELECT clause is applied to the
groups and not the individual rows.**  Hence we can ask for the
count for each state:

In [None]:
%%sql

  SELECT state, COUNT(state)
    FROM Students
GROUP BY state
ORDER BY state;


But we **CAN NOT** select a row-level column such as `name`
alongside `GROUP BY state`, with the intention of listing each
person under the state they come from.

The column `name` is applicable at the row level.  A group does not
have a SINGLE value for the `name` column. 
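
One way to see why a group has no single `name` is to count the distinct names in each group; any column that is not grouped on has to be collapsed by an aggregate before it can appear in the output. The sketch below does this with `COUNT(DISTINCT name)`, `MIN(name)` and `MAX(name)`, assuming an in-memory SQLite database in place of the notebook's PostgreSQL server.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT, state TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("s1", "Mary", "PA", 4.00),
        ("s2", "Jack", "CA", 3.25),
        ("s3", "Pat", "FL", 2.78),
        ("s4", "Jill", "FL", 2.50),
        ("s5", "Joe", "PA", 3.90),
        ("s6", "Ellen", "CA", 3.00),
        ("s7", "Harry", "PA", 1.50),
        ("s8", "Sally", "OH", 3.25),
    ],
)

# Each aggregate reduces the many names in a group to one value per group.
per_state = conn.execute(
    "SELECT state, COUNT(DISTINCT name), MIN(name), MAX(name) "
    "FROM Students GROUP BY state ORDER BY state"
).fetchall()

for state, distinct_names, first_name, last_name in per_state:
    print(state, distinct_names, first_name, last_name)
```

The PA group, for instance, holds three different names, so a bare `name` would not identify one value for that group.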

**Question:** Suppose we group a table on attribute `a1`.  Write a SQL expression that will compute the number of groups that will be formed when grouping on `a1`

### HAVING

Similar to how we filtered rows with a `WHERE` clause, we can filter
groups with a `HAVING` clause.

#### Which states have more than one student?

In [None]:
%%sql 

  SELECT state, COUNT(name)
    FROM Students
GROUP BY STATE
  HAVING COUNT(NAME) > 1
ORDER BY State;


We can arrange the output in descending order of count with:

In [None]:
%%sql 

  SELECT state, COUNT(name)
    FROM Students
GROUP BY STATE
  HAVING COUNT(NAME) > 1
ORDER BY COUNT(NAME) DESC, STATE ASC; 


Including `HAVING`, the order of execution of SQL clauses is:
```
      SELECT   <columns> ................... 5
        FROM   <table>  .................... 1           
       WHERE   <predicate on rows> ......... 2
    GROUP BY   <columns> ................... 3
      HAVING   <predicate on groups> ....... 4
    ORDER BY   <columns> ................... 6
```
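
The six-step order can be traced by hand in plain Python for the `HAVING` query above. This is an analogy for the conceptual order only, not a description of a real query engine; the list of dictionaries is a hand-written copy of the relevant `Students` columns.

```python
from collections import defaultdict

# Illustrative copy of the Students rows (only the columns used here).
students = [
    {"name": "Mary", "state": "PA"},
    {"name": "Jack", "state": "CA"},
    {"name": "Pat", "state": "FL"},
    {"name": "Jill", "state": "FL"},
    {"name": "Joe", "state": "PA"},
    {"name": "Ellen", "state": "CA"},
    {"name": "Harry", "state": "PA"},
    {"name": "Sally", "state": "OH"},
]

# 1. FROM: identify the table.
table = students

# 2. WHERE: this query has no row predicate, so every row passes.
rows = list(table)

# 3. GROUP BY state: collect the rows of each state into one group.
groups = defaultdict(list)
for row in rows:
    groups[row["state"]].append(row)

# COUNT(name) counts the non-NULL names within one group.
def count_names(members):
    return sum(1 for member in members if member["name"] is not None)

# 4. HAVING COUNT(name) > 1: keep only groups that satisfy the predicate.
kept = {state: members for state, members in groups.items()
        if count_names(members) > 1}

# 5. SELECT state, COUNT(name): one output row per surviving group.
result = [(state, count_names(members)) for state, members in kept.items()]

# 6. ORDER BY COUNT(name) DESC, state ASC.
result.sort(key=lambda pair: (-pair[1], pair[0]))

print(result)
```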

## Subqueries

Suppose we want the answer to

#### Which students have a GPA above the average?

It is tempting to write the query as:

```
   SELECT name, gpa
     FROM Students
    WHERE gpa > avg(gpa)
```

We can NOT do the above because we have a dependency situation.
Recall that `WHERE` is a filter: from amongst all the rows in
the `Students` table it lets through only those rows that meet the
condition.  It is only after the filtering is done that the average
can be calculated!  Hence we need to solve this query in a two-step
process: (1) calculate the average GPA for the whole class and (2)
keep those rows whose GPA is greater than that average.

We could actually do this in two steps:

In [None]:
%%sql 

SELECT avg(gpa)        -- ..... query-1
  FROM Students;
    

Now filter based on the resulting value of 3.0225:

In [None]:
%%sql

  SELECT name, gpa     -- ..... query-2
    FROM Students
   WHERE gpa > 3.0225
ORDER BY name;

We can combine query-1 and query-2 into a single query as:

In [None]:
%%sql

  SELECT name, gpa
    FROM Students
   WHERE gpa > (SELECT AVG(gpa) FROM Students)    -- subquery computes the average GPA
ORDER BY name;    

The parentheses around the subquery are required.
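
Both halves of this argument can be checked side by side: the aggregate placed directly in `WHERE` is rejected, while the subquery form runs. The sketch below assumes an in-memory SQLite database in place of the notebook's PostgreSQL server; the exact error message differs between the two systems, so only the fact that an error is raised is recorded.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (id TEXT PRIMARY KEY, name TEXT, state TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        ("s1", "Mary", "PA", 4.00),
        ("s2", "Jack", "CA", 3.25),
        ("s3", "Pat", "FL", 2.78),
        ("s4", "Jill", "FL", 2.50),
        ("s5", "Joe", "PA", 3.90),
        ("s6", "Ellen", "CA", 3.00),
        ("s7", "Harry", "PA", 1.50),
        ("s8", "Sally", "OH", 3.25),
    ],
)

# An aggregate used directly in WHERE is rejected by the database.
try:
    conn.execute("SELECT name, gpa FROM Students WHERE gpa > AVG(gpa)")
    aggregate_in_where_rejected = False
except sqlite3.Error:
    aggregate_in_where_rejected = True

# The subquery computes the average first; the outer WHERE then filters on it.
above_average = conn.execute(
    "SELECT name, gpa FROM Students "
    "WHERE gpa > (SELECT AVG(gpa) FROM Students) "
    "ORDER BY name"
).fetchall()

print(aggregate_in_where_rejected)
print(above_average)
```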