# Querying Data

One of the most common tasks when you work with a database is querying data from tables with the `select` statement. The `select` statement has many clauses that you can combine to form a flexible query. While only one of them is mandatory (the `select` clause itself), you will usually include at least two or three of the available clauses. The table below shows the different clauses and their purposes:

**Clause name** | **Purpose**
--- | ---
`Select` | Determines which columns to include in the query's result set
`From` | Identifies the tables from which to draw data and how the tables should be joined
`Where` | Filters out unwanted data
`Group by` | Groups rows together by common column values
`Having` | Filters out unwanted groups
`Order by` | Sorts the rows of the final result set by one or more columns

***Note:*** *I have typed the SQL keywords in CAPITAL LETTERS in the examples, but this is not required. It simply helps you distinguish the SQL keywords from object references (tables, columns, etc.) in the statement/code. PostgreSQL, like most relational databases, treats SQL keywords as case-insensitive, so lowercase keywords work as well.*

The following sections delve into the uses of the six major query clauses.

## The `select` Clause

Although the `select` clause is the first clause of a SQL query statement, it is one of the last clauses that the database server evaluates. The reason is that before the server can determine what to include in the final result set, it needs to know all of the possible columns that could be included. The `select` clause is typically paired with the `from` clause. Here's a query to get started:

***Note:*** *Each cell demonstrating SQL code in a Jupyter Notebook needs to begin with `%%sql` so that the interpreter treats the code as SQL statements/queries. The `%%sql` prefix is not needed in other SQL/database-client tools.*

*Initial setup to load sql module in order to run sql statements on this Notebook:*

In [1]:
%load_ext sql
%sql postgresql://postgres:password@localhost/dvdrental

In [2]:
%%sql 

SELECT * FROM category;

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id,name,last_update
1,Action,2006-02-15 09:46:27
2,Animation,2006-02-15 09:46:27
3,Children,2006-02-15 09:46:27
4,Classics,2006-02-15 09:46:27
5,Comedy,2006-02-15 09:46:27
6,Documentary,2006-02-15 09:46:27
7,Drama,2006-02-15 09:46:27
8,Family,2006-02-15 09:46:27
9,Foreign,2006-02-15 09:46:27
10,Games,2006-02-15 09:46:27


In this query, the `from` clause lists a single table (`category`), and the `select` clause indicates that all columns (designated by \*) in the `category` table should be included in the result set. In plain English, the query could be described as follows:

> *Show me all columns and all rows in the **`category`** table*

Also note that each SQL statement has to end with a semicolon. Otherwise the SQL interpreter will continue to expect additional clauses as input.

You can also explicitly name the columns you are interested in, such as:

In [3]:
%%sql

SELECT category_id, name FROM category;

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id,name
1,Action
2,Animation
3,Children
4,Classics
5,Comedy
6,Documentary
7,Drama
8,Family
9,Foreign
10,Games


Thus the `select` clause determines which of all possible columns should be included in the query's result set.
Also, it is not good practice to use the asterisk (\*) in the `select` clause of SQL statements embedded in application code, for the following reasons:

+ For a large table with many columns, a `select` statement with the asterisk (\*) shorthand retrieves data from all columns of the table, which may not be necessary.
+ Retrieving unnecessary data from the database increases the traffic between the database and application layers. As a result, your applications may become slower and less scalable. Therefore, it is good practice to specify the column names explicitly in the `select` clause whenever possible, so that you get only the necessary data from a table.

For these reasons, you should use the asterisk (\*) shorthand only for ad-hoc queries that examine the data. When you need to select all columns to preview the data, you can use the `limit` clause to restrict the result set to a specific number of rows:

In [4]:
%%sql 

SELECT * FROM category LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


category_id,name,last_update
1,Action,2006-02-15 09:46:27
2,Animation,2006-02-15 09:46:27
3,Children,2006-02-15 09:46:27
4,Classics,2006-02-15 09:46:27
5,Comedy,2006-02-15 09:46:27
6,Documentary,2006-02-15 09:46:27
7,Drama,2006-02-15 09:46:27
8,Family,2006-02-15 09:46:27
9,Foreign,2006-02-15 09:46:27
10,Games,2006-02-15 09:46:27


***Note:*** *The `limit` clause is not part of the SQL standard, though it is supported by many relational database management systems such as PostgreSQL, MySQL, H2, and HSQLDB. Not all database systems support the `limit` clause.*

***2nd note:*** *I will also frequently use the `limit` clause in the code examples to avoid displaying too many rows in the output, thus keeping these learning notes concise.*

The `select` clause is not limited to columns. You can also include expressions, as shown below:
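To make the difference between `*` and an explicit column list concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` module with an in-memory database as a stand-in for the PostgreSQL `dvdrental` database, and the three rows it loads are hypothetical toy data:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in for the category table (toy rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (category_id INTEGER, name TEXT, last_update TEXT)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, "Action", "2006-02-15"),
        (2, "Animation", "2006-02-15"),
        (3, "Children", "2006-02-15"),
    ],
)

# SELECT * brings back every column; LIMIT caps the number of rows.
cur = conn.execute("SELECT * FROM category LIMIT 2")
all_columns = [d[0] for d in cur.description]
preview_rows = cur.fetchall()

# An explicit column list brings back only the columns that were named.
cur = conn.execute("SELECT category_id, name FROM category LIMIT 2")
named_columns = [d[0] for d in cur.description]

print(all_columns)        # ['category_id', 'name', 'last_update']
print(named_columns)      # ['category_id', 'name']
print(len(preview_rows))  # 2
```

Here `cursor.description` is used only to read the column labels of each result set, which makes the narrower result of the explicit column list easy to see.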

In [5]:
%%sql 

SELECT category_id, 
category_id * 3.14159, 
UPPER(name) 
FROM category;

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id,?column?,upper
1,3.14159,ACTION
2,6.28318,ANIMATION
3,9.42477,CHILDREN
4,12.56636,CLASSICS
5,15.70795,COMEDY
6,18.84954,DOCUMENTARY
7,21.99113,DRAMA
8,25.13272,FAMILY
9,28.27431,FOREIGN
10,31.4159,GAMES


To call certain built-in functions or to evaluate a simple expression, you can skip the `from` clause entirely. For example:

In [6]:
%%sql

SELECT NOW();

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


now
2020-06-14 18:30:48.508362+08:00
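
The same idea can be tried outside the notebook. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; because SQLite has no `NOW()` function, it evaluates an arithmetic expression and a call to `UPPER()` instead:

```python
import sqlite3

# Assumption: an empty in-memory SQLite database; no tables are needed at all.
conn = sqlite3.connect(":memory:")

# No FROM clause: the expressions are evaluated once and returned as one row.
row = conn.execute("SELECT 2 + 3, UPPER('action')").fetchone()
print(row)  # (5, 'ACTION')
```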


### Column Aliases

Although a SQL query generates labels for the columns it returns, you might still want to assign your own label to a column from a table. When columns in the result set are generated by expressions or built-in function calls, you will almost certainly want to assign your own labels to them. You can do so by placing the `as` keyword before the alias name, as in the example below:

In [7]:
%%sql

SELECT category_id, 
category_id * 3.14159 AS category_id_x_pi, 
UPPER(name) AS name_upper
FROM category;

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id,category_id_x_pi,name_upper
1,3.14159,ACTION
2,6.28318,ANIMATION
3,9.42477,CHILDREN
4,12.56636,CLASSICS
5,15.70795,COMEDY
6,18.84954,DOCUMENTARY
7,21.99113,DRAMA
8,25.13272,FAMILY
9,28.27431,FOREIGN
10,31.4159,GAMES


**Note:** The `as` keyword is optional before column aliases, but including it improves the readability of your SQL code.
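As a quick check of that equivalence, the following minimal sketch runs the same aliased expression with and without `as`. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a single hypothetical row:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in with one toy row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (category_id INTEGER, name TEXT)")
conn.execute("INSERT INTO category VALUES (1, 'Action')")

# Same alias, written once with AS and once without it.
with_as = conn.execute("SELECT UPPER(name) AS name_upper FROM category")
without_as = conn.execute("SELECT UPPER(name) name_upper FROM category")

labels_with_as = [d[0] for d in with_as.description]
labels_without_as = [d[0] for d in without_as.description]

print(labels_with_as, labels_without_as)  # ['name_upper'] ['name_upper']
```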

### Removing Duplicates

In some cases, a query might return duplicate rows of data, as in the example below:

In [8]:
%%sql

SELECT category_id FROM film_category LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


category_id
6
11
6
11
8
9
5
11
11
15


In this case, you would probably want a distinct set of film categories from the table, like this:

In [9]:
%%sql

SELECT DISTINCT category_id FROM film_category;

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id
4
14
3
10
7
13
9
1
5
2
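
The effect of `distinct` is easier to verify on a handful of rows. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, with five hypothetical `film_category` rows:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; the five rows are toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film_category (film_id INTEGER, category_id INTEGER)")
conn.executemany(
    "INSERT INTO film_category VALUES (?, ?)",
    [(1, 6), (2, 11), (3, 6), (4, 11), (5, 8)],
)

# Without DISTINCT, one value comes back per row, duplicates included.
all_ids = [r[0] for r in conn.execute("SELECT category_id FROM film_category")]

# With DISTINCT, each value appears once. The rows come back in no guaranteed
# order, so they are sorted in Python before printing.
distinct_ids = sorted(
    r[0] for r in conn.execute("SELECT DISTINCT category_id FROM film_category")
)

print(len(all_ids))   # 5
print(distinct_ids)   # [6, 8, 11]
```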


## The `from` Clause

Thus far, you have seen queries whose `from` clause contains a single table. The `from` clause *defines the tables used by a query, along with the means of linking the tables together*.

### Tables

The term *table* usually brings to mind a set of related rows stored in a database. In a relational database system, however, a query can draw on three types of tables:

1. Permanent tables (i.e. created using the `create table` statement)
2. Temporary tables (i.e. rows returned by a subquery)
3. Virtual tables (i.e. created using the `create view` statement)

Each of these table types may be included in a query's `from` clause. From the earlier examples, you should be comfortable with including a permanent table in a `from` clause. I will now briefly describe the other types of tables that can be referenced in a `from` clause. 

#### Subquery-generated tables

A subquery is a query contained within another query. Subqueries are surrounded by parentheses and can be found in various parts of a `select` statement; within the `from` clause, however, a subquery serves the role of generating a temporary table that is visible from all other query clauses and can interact with other tables named in the `from` clause. Here's an example:

In [10]:
%%sql

SELECT c.customer_id, c.first_name, c.last_name, c.create_date
FROM (SELECT customer_id, first_name, last_name, email, create_date
      FROM customer LIMIT 10) c;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,first_name,last_name,create_date
524,Jared,Ely,2006-02-14
1,Mary,Smith,2006-02-14
2,Patricia,Johnson,2006-02-14
3,Linda,Williams,2006-02-14
4,Barbara,Jones,2006-02-14
5,Elizabeth,Brown,2006-02-14
6,Jennifer,Davis,2006-02-14
7,Maria,Miller,2006-02-14
8,Susan,Wilson,2006-02-14
9,Margaret,Moore,2006-02-14


In the above example, the containing query references the subquery via its alias, which in this case is `c`. This is a deliberately simple, and not particularly useful, example of a subquery in a `from` clause.
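For a slightly more practical flavour, here is a minimal sketch in which the subquery does some work (counting films per category) before the containing query filters its rows. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the rows and the alias `fc` are hypothetical:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; the six rows are toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film_category (film_id INTEGER, category_id INTEGER)")
conn.executemany(
    "INSERT INTO film_category VALUES (?, ?)",
    [(1, 6), (2, 11), (3, 6), (4, 11), (5, 8), (6, 6)],
)

# The subquery builds a temporary table (aliased fc) with one row per category;
# the containing query then treats fc like any other table.
query = """
SELECT fc.category_id, fc.num_films
FROM (SELECT category_id, COUNT(*) AS num_films
      FROM film_category
      GROUP BY category_id) fc
WHERE fc.num_films >= 2
ORDER BY fc.category_id
"""
busy_categories = conn.execute(query).fetchall()
print(busy_categories)  # [(6, 3), (11, 2)]
```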

#### Views

A view is a query that is stored in the data dictionary. It looks and acts like a table, but there is in fact no data associated with a view (hence the term *virtual table*). When you issue a query against a view, your query is merged with the view definition to create a final query to be executed. Views are created for various reasons, including user convenience and simplifying complex database designs.

To demonstrate, here's an example of a view definition that queries the `customer` table and includes a call to a built-in function that extracts just the year from a date column:

In [11]:
%%sql

CREATE VIEW customer_vw AS 
SELECT customer_id, first_name, last_name, EXTRACT(YEAR FROM create_date) create_year
FROM customer;

 * postgresql://postgres:***@localhost/dvdrental
(psycopg2.errors.DuplicateTable) relation "customer_vw" already exists

[SQL: CREATE VIEW customer_vw AS 
SELECT customer_id, first_name, last_name, EXTRACT(YEAR FROM create_date) create_year
FROM customer;]
(Background on this error at: http://sqlalche.me/e/f405)


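When the view is created, no additional data is generated or stored; the database server simply tucks away the `select` statement for future use. (The `DuplicateTable` error in the output above only indicates that `customer_vw` already existed in the database when this cell was run, for example from an earlier run of the same cell.) You can now issue queries against the view, as in the example below: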

In [12]:
%%sql

SELECT customer_id, create_year 
FROM customer_vw LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,create_year
524,2006.0
1,2006.0
2,2006.0
3,2006.0
4,2006.0
5,2006.0
6,2006.0
7,2006.0
8,2006.0
9,2006.0
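
One way to see that a view holds no data of its own is to change the underlying table after the view has been created. The following sketch assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; since SQLite has no `EXTRACT`, it uses `SUBSTR` on a text date instead, and the two customers are hypothetical toy rows:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; dates are stored as text here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, create_date TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Mary', '2006-02-14')")

# The view stores only the query text, not a copy of the rows.
conn.execute("""
    CREATE VIEW customer_vw AS
    SELECT customer_id, SUBSTR(create_date, 1, 4) AS create_year
    FROM customer
""")
rows_before = conn.execute("SELECT COUNT(*) FROM customer_vw").fetchone()[0]

# A row added to the base table shows up in the view without recreating it.
conn.execute("INSERT INTO customer VALUES (2, 'Patricia', '2007-03-01')")
rows_after = conn.execute(
    "SELECT customer_id, create_year FROM customer_vw ORDER BY customer_id"
).fetchall()

print(rows_before)  # 1
print(rows_after)   # [(1, '2006'), (2, '2007')]
```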


### Table Links

The second deviation from the simple `from` clause definition is the requirement that, if more than one table appears in the `from` clause, the conditions used to *link* the tables must be included as well. Here's a simple illustration:

In [13]:
%%sql

SELECT customer.customer_id, customer.first_name, customer.last_name, address.address
FROM customer INNER JOIN address
ON customer.address_id = address.address_id LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,first_name,last_name,address
1,Mary,Smith,1913 Hanoi Way
2,Patricia,Johnson,1121 Loja Avenue
3,Linda,Williams,692 Joliet Street
4,Barbara,Jones,1566 Inegl Manor
5,Elizabeth,Brown,53 Idfu Parkway
6,Jennifer,Davis,1795 Santiago de Compostela Way
7,Maria,Miller,900 Santiago de Compostela Parkway
8,Susan,Wilson,478 Joliet Way
9,Margaret,Moore,613 Korolev Drive
10,Dorothy,Taylor,1531 Sal Drive


The previous query displays data from both the `customer` table and the `address` table, so both tables are included in the `from` clause. The mechanism for linking the two tables (referred to as a *join*) is the customer's address affiliation stored in the `customer` table. Thus the database server is instructed to use the value of the `address_id` column in the `customer` table to look up the associated address in the `address` table. Join conditions for two tables are found in the `on` subclause of the `from` clause; in this case, the join condition is `ON customer.address_id = address.address_id`.
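The lookup performed by the join condition can be sketched on a few rows. The example below assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `address_id` values are hypothetical, and the third customer deliberately points at an address that does not exist:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; the address_id values are toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, address_id INTEGER)")
conn.execute("CREATE TABLE address (address_id INTEGER, address TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Mary", 5), (2, "Patricia", 6), (3, "Linda", 99)],
)
conn.executemany(
    "INSERT INTO address VALUES (?, ?)",
    [(5, "1913 Hanoi Way"), (6, "1121 Loja Avenue")],
)

# Each customer row is paired with the address row that has the same address_id.
joined = conn.execute("""
    SELECT customer.customer_id, customer.first_name, address.address
    FROM customer INNER JOIN address
    ON customer.address_id = address.address_id
    ORDER BY customer.customer_id
""").fetchall()

print(joined)
# [(1, 'Mary', '1913 Hanoi Way'), (2, 'Patricia', '1121 Loja Avenue')]
```

Linda is absent from the result because no `address` row satisfies the join condition for her `address_id`, which is how an inner join behaves.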

#### Table Aliases

When joining multiple tables in a single query, you need a way to identify which table you are referring to when you reference columns in the `select`, `where`, `group by`, `having`, and `order by` clauses. You have two choices when referencing a table outside the `from` clause:

+ Use the entire table name, such as `customer.customer_id`
+ Assign each table an *alias* and use the alias throughout the query

The previous query uses the entire table names (i.e. `customer` and `address`). Here's what the same query looks like using table aliases:

In [14]:
%%sql

SELECT cust.customer_id, cust.first_name, cust.last_name, addr.address
FROM customer AS cust INNER JOIN address AS addr
ON cust.address_id = addr.address_id LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,first_name,last_name,address
1,Mary,Smith,1913 Hanoi Way
2,Patricia,Johnson,1121 Loja Avenue
3,Linda,Williams,692 Joliet Street
4,Barbara,Jones,1566 Inegl Manor
5,Elizabeth,Brown,53 Idfu Parkway
6,Jennifer,Davis,1795 Santiago de Compostela Way
7,Maria,Miller,900 Santiago de Compostela Parkway
8,Susan,Wilson,478 Joliet Way
9,Margaret,Moore,613 Korolev Drive
10,Dorothy,Taylor,1531 Sal Drive


***Note:*** *As with column aliases, the `as` keyword before each table alias is optional; I used it to enhance the readability of the code in the examples. You can omit it and the query will still work.*

## The `where` Clause
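To see why a column shared by both tables has to be qualified, consider the minimal sketch below. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, with one hypothetical row per table; the exact error text is SQLite's and will differ in other databases:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in with one toy row in each table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, address_id INTEGER)")
conn.execute("CREATE TABLE address (address_id INTEGER, address TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 5)")
conn.execute("INSERT INTO address VALUES (5, '1913 Hanoi Way')")

# address_id exists in both tables, so an unqualified reference is rejected.
try:
    conn.execute("""
        SELECT address_id
        FROM customer INNER JOIN address
        ON customer.address_id = address.address_id
    """)
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# Qualifying the column with a table alias removes the ambiguity.
qualified = conn.execute("""
    SELECT cust.address_id
    FROM customer AS cust INNER JOIN address AS addr
    ON cust.address_id = addr.address_id
""").fetchall()

print(ambiguous_error)
print(qualified)  # [(5,)]
```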

The queries shown thus far have selected every row from the table (except for the demonstration of `distinct`). Most of the time, however, you will not wish to retrieve *every* row from a table but will want a way to filter out the rows that are not of interest. This is a job for the `where` clause.

> *The `where` clause is the mechanism for filtering out unwanted rows from your result set.*

For example, the following query employs a `where` clause to retrieve *only* the non-active customers:

In [15]:
%%sql

SELECT customer_id, first_name, last_name, email, active
FROM customer 
WHERE active = 0;

 * postgresql://postgres:***@localhost/dvdrental
15 rows affected.


customer_id,first_name,last_name,email,active
16,Sandra,Martin,sandra.martin@sakilacustomer.org,0
64,Judith,Cox,judith.cox@sakilacustomer.org,0
124,Sheila,Wells,sheila.wells@sakilacustomer.org,0
169,Erica,Matthews,erica.matthews@sakilacustomer.org,0
241,Heidi,Larson,heidi.larson@sakilacustomer.org,0
271,Penny,Neal,penny.neal@sakilacustomer.org,0
315,Kenneth,Gooden,kenneth.gooden@sakilacustomer.org,0
368,Harry,Arce,harry.arce@sakilacustomer.org,0
406,Nathan,Runyon,nathan.runyon@sakilacustomer.org,0
446,Theodore,Culp,theodore.culp@sakilacustomer.org,0


In this case, the `where` clause contains a single filter condition, but you can include as many conditions as required; individual conditions are separated using operators such as `and`, `or` and `not`. Here's another example of a query that incorporates two filter conditions in the `where` clause:

In [16]:
%%sql 

SELECT film_id, title, rental_rate, rating 
FROM film 
WHERE rental_rate < 2.0 
AND rating = 'PG-13' LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


film_id,title,rental_rate,rating
18,Alter Victory,0.99,PG-13
36,Argonauts Town,0.99,PG-13
64,Beethoven Exorcist,0.99,PG-13
79,Blade Polish,0.99,PG-13
108,Butch Panther,0.99,PG-13
130,Celebrity Horn,0.99,PG-13
155,Cleopatra Devil,0.99,PG-13
157,Clockwork Paradise,0.99,PG-13
160,Club Graffiti,0.99,PG-13
163,Clyde Theory,0.99,PG-13


The first condition keeps only rows whose `rental_rate` is less than 2.0. The second condition keeps only rows that have the rating 'PG-13'. Combining both conditions with the `and` operator instructs the database to fetch only rows that meet *both* conditions. Let's see what happens if you change the operator from `and` to `or`:

In [17]:
%%sql 

SELECT film_id, title, rental_rate, rating 
FROM film 
WHERE rental_rate < 2.0 
OR rating = 'PG-13' LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


film_id,title,rental_rate,rating
98,Bright Encounters,4.99,PG-13
1,Academy Dinosaur,0.99,PG
7,Airplane Sierra,4.99,PG-13
9,Alabama Devil,2.99,PG-13
11,Alamo Videotape,0.99,G
12,Alaska Phantom,0.99,PG
213,Date Speed,0.99,R
14,Alice Fantasia,0.99,NC-17
17,Alone Trip,0.99,R
18,Alter Victory,0.99,PG-13


Looking at the output, you will see that each row has either a `rental_rate` less than 2.0 or the rating 'PG-13'. Thus at least one of the conditions is true for every row in the result set.

You can also combine the `or` and `and` operators, using parentheses to group conditions together, like this:
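The contrast between `and` and `or` can be checked on a small sample. This sketch assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and loads only four films as toy data:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in holding a four-row sample of films.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, title TEXT, rental_rate REAL, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?, ?)",
    [
        (1, "Academy Dinosaur", 0.99, "PG"),
        (3, "Adaptation Holes", 2.99, "NC-17"),
        (7, "Airplane Sierra", 4.99, "PG-13"),
        (18, "Alter Victory", 0.99, "PG-13"),
    ],
)

# AND: a row must satisfy both conditions to be kept.
and_ids = [r[0] for r in conn.execute(
    "SELECT film_id FROM film "
    "WHERE rental_rate < 2.0 AND rating = 'PG-13' ORDER BY film_id"
)]

# OR: a row is kept as soon as either condition holds.
or_ids = [r[0] for r in conn.execute(
    "SELECT film_id FROM film "
    "WHERE rental_rate < 2.0 OR rating = 'PG-13' ORDER BY film_id"
)]

print(and_ids)  # [18]
print(or_ids)   # [1, 7, 18]
```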

In [18]:
%%sql 

SELECT film_id, title, rental_rate, rating 
FROM film 
WHERE (rental_rate < 2.0 AND rating = 'PG')
OR (rental_rate > 2.0 AND rating = 'NC-17') LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


film_id,title,rental_rate,rating
133,Chamber Italian,4.99,NC-17
1,Academy Dinosaur,0.99,PG
3,Adaptation Holes,2.99,NC-17
10,Aladdin Calendar,4.99,NC-17
12,Alaska Phantom,0.99,PG
15,Alien Center,2.99,NC-17
16,Alley Evolution,2.99,NC-17
19,Amadeus Holy,0.99,PG
29,Antitrust Tomatoes,2.99,NC-17
31,Apache Divine,4.99,NC-17
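
Parentheses matter because `and` is evaluated before `or` when no grouping is given. The sketch below illustrates this under the assumption that Python's built-in `sqlite3` module stands in for PostgreSQL; the four films are toy data:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in holding a four-row sample of films.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, rental_rate REAL, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [(1, 0.99, "PG"), (3, 2.99, "NC-17"), (7, 4.99, "PG-13"), (18, 0.99, "PG-13")],
)

# No parentheses: read as rating = 'PG' OR (rating = 'PG-13' AND rental_rate > 2.0).
ungrouped = [r[0] for r in conn.execute(
    "SELECT film_id FROM film "
    "WHERE rating = 'PG' OR rating = 'PG-13' AND rental_rate > 2.0 "
    "ORDER BY film_id"
)]

# Parentheses force the OR to be evaluated first.
grouped = [r[0] for r in conn.execute(
    "SELECT film_id FROM film "
    "WHERE (rating = 'PG' OR rating = 'PG-13') AND rental_rate > 2.0 "
    "ORDER BY film_id"
)]

print(ungrouped)  # [1, 7]
print(grouped)    # [7]
```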


## The `group by` and `having` Clauses

All of the queries thus far retrieve raw data without any manipulation. In some cases, you will want to find trends in your data that require the database server to cook the data a bit before you retrieve your result set. One such mechanism is the `group by` clause, which is used to group data by column values. When using the `group by` clause, you may also use the `having` clause, which allows you to filter grouped data in the same way the `where` clause lets you filter raw data. Consider the example below, which queries the number of customers each rental store has:

In [19]:
%%sql 

SELECT s.store_id, COUNT(c.customer_id) num_customers
FROM store AS s INNER JOIN customer AS c
ON s.store_id = c.store_id
GROUP BY s.store_id
HAVING COUNT(c.customer_id) > 2;

 * postgresql://postgres:***@localhost/dvdrental
2 rows affected.


store_id,num_customers
1,326
2,273


**Note:** `COUNT()` is a built-in aggregate function that returns the number of rows matching the query's conditions; when combined with `group by`, it returns one count per group.
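The division of labour between `where` (filters rows before grouping) and `having` (filters groups after grouping) can be sketched as follows. The example assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the six customers are hypothetical toy rows:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; six toy customers across three stores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, store_id INTEGER, active INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, 1), (5, 2, 0), (6, 3, 1)],
)

# HAVING alone: count every customer per store, then drop the small groups.
having_only = conn.execute("""
    SELECT store_id, COUNT(customer_id) AS num_customers
    FROM customer
    GROUP BY store_id
    HAVING COUNT(customer_id) >= 2
    ORDER BY store_id
""").fetchall()

# WHERE + HAVING: inactive customers are removed before the groups are counted.
where_and_having = conn.execute("""
    SELECT store_id, COUNT(customer_id) AS num_customers
    FROM customer
    WHERE active = 1
    GROUP BY store_id
    HAVING COUNT(customer_id) >= 2
    ORDER BY store_id
""").fetchall()

print(having_only)       # [(1, 3), (2, 2)]
print(where_and_having)  # [(1, 2)]
```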

## The `order by` Clause

In general, the rows in a result set returned from a query are not in any particular order. If you want your result set in a particular order, you will need to instruct the server to sort the results using the `order by` clause.

> *The `order by` clause is the mechanism for sorting your result set using either raw column data or expressions based on column data.*

Let's look at the example below:

In [20]:
%%sql 

SELECT customer_id, first_name, last_name
FROM customer
ORDER BY last_name
LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,first_name,last_name
505,Rafael,Abney
504,Nathaniel,Adam
36,Kathleen,Adams
96,Diana,Alexander
470,Gordon,Allard
27,Shirley,Allen
220,Charlene,Alvarez
11,Lisa,Anderson
326,Jose,Andrew
183,Ida,Andrews


### Ascending Versus Descending Sort Order

When using `order by`, you have the option of specifying *ascending* or *descending* order via the `asc` and `desc` keywords. The default is ascending, so you need to add the `desc` keyword only if you want a descending sort, as in the example below:

In [21]:
%%sql 

SELECT customer_id, first_name, last_name
FROM customer
ORDER BY last_name DESC
LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


customer_id,first_name,last_name
28,Cynthia,Young
413,Marvin,Yee
402,Luis,Yanez
318,Brian,Wyman
31,Brenda,Wright
496,Tyler,Wren
107,Florence,Woods
78,Lori,Wood
581,Virgil,Wofford
541,Darren,Windham
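
To tie the sorting options together, here is a minimal sketch that sorts the same rows in ascending order, in descending order, and by an expression based on column data. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL and loads a four-row sample of customers as toy data:

```python
import sqlite3

# Assumption: in-memory SQLite stand-in holding a four-row sample of customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(413, "Marvin", "Yee"), (78, "Lori", "Wood"),
     (505, "Rafael", "Abney"), (581, "Virgil", "Wofford")],
)

def ids(order_by):
    """Return customer_id values sorted by the given ORDER BY expression."""
    sql = f"SELECT customer_id FROM customer ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]

ascending = ids("last_name")           # ASC is the default
descending = ids("last_name DESC")
by_length = ids("LENGTH(last_name)")   # sorting on an expression

print(ascending)   # [505, 581, 78, 413]
print(descending)  # [413, 78, 581, 505]
print(by_length)   # [413, 78, 505, 581]
```

The helper `ids` is purely illustrative; it interpolates a hard-coded sort expression, which is acceptable in a sketch but should not be done with untrusted input.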
