# Querying Data

One of the most common tasks when working with a database is querying data from tables with the `select` statement. The `select` statement has many clauses that you can combine to form flexible queries. Although only one of them, the `select` clause itself, is mandatory, you will usually include at least two or three of the available clauses. The table below shows some of the clauses and their purposes:

**Clause name** | **Purpose**
--- | ---
`Select` | Determines which columns to include in the query's result set
`From` | Identifies the tables from which to draw data and how the tables should be joined
`Where` | Filters out unwanted data
`Group by` | Used to group rows together by common column values
`Having` | Filters out unwanted groups
`Order by` | Sorts the rows of the final result set by one or more columns
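
To see how the clauses in the table fit together, here is a minimal sketch that runs one query using all six of them. It assumes Python's built-in `sqlite3` module with an in-memory database as a stand-in for the PostgreSQL `dvdrental` database, and the `amount` values are hypothetical rows invented for illustration only.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount REAL)")

# Hypothetical sample rows, invented for illustration only.
conn.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [
        (341, 7.99), (341, 1.99),
        (342, 5.99), (342, 0.50),
        (343, 4.99), (343, 2.99), (343, 6.99),
    ],
)

query = """
SELECT customer_id, COUNT(*) AS payments   -- columns in the result set
FROM payment                               -- table to draw data from
WHERE amount > 1.00                        -- filter out unwanted rows
GROUP BY customer_id                       -- group rows by customer
HAVING COUNT(*) >= 2                       -- filter out unwanted groups
ORDER BY payments DESC, customer_id        -- sort the final result set
"""
rows = conn.execute(query).fetchall()
print(rows)
```

With these sample rows, customer 342 is left with a single payment after the `WHERE` filter, so the `HAVING` clause removes that group from the result.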

***Note:*** *I have typed SQL keywords in CAPITAL LETTERS in the examples, but this is not required. It simply helps you distinguish the SQL keywords from object references (tables, columns, etc.) in a statement. PostgreSQL and other databases also accept SQL keywords in lowercase.*

The following sections delve into the uses of the six major query clauses.

## The `select` Clause

Although the `select` clause is the first clause of a SQL query, it is one of the last clauses that the database server evaluates. The reason is that before the server can determine what to include in the final result set, it needs to know all of the possible columns that could be included. The `select` clause is typically paired with a `from` clause. Here's a query to get started:

***Note:*** *Each cell demonstrating SQL code in Jupyter Notebook needs to begin with `%%sql` so that the interpreter treats the code as SQL statements/queries. The `%%sql` prefix is not needed in other SQL/database client tools.*

*Initial setup to load sql module in order to run sql statements on this Notebook:*

In [1]:
%load_ext sql
%sql postgresql://postgres:password@localhost/dvdrental

In [2]:
%%sql 

SELECT * FROM actor;

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


actor_id,first_name,last_name,last_update
1,Penelope,Guiness,2013-05-26 14:47:57.620000
2,Nick,Wahlberg,2013-05-26 14:47:57.620000
3,Ed,Chase,2013-05-26 14:47:57.620000
4,Jennifer,Davis,2013-05-26 14:47:57.620000
5,Johnny,Lollobrigida,2013-05-26 14:47:57.620000
6,Bette,Nicholson,2013-05-26 14:47:57.620000
7,Grace,Mostel,2013-05-26 14:47:57.620000
8,Matthew,Johansson,2013-05-26 14:47:57.620000
9,Joe,Swank,2013-05-26 14:47:57.620000
10,Christian,Gable,2013-05-26 14:47:57.620000


In this query, the `from` clause lists a single table (*actor*), and the `select` clause indicates that all columns (designated by \*) in the *actor* table should be included in the result set. In plain English, the query could be described as follows:

> *Show me all columns and all rows in the **actor** table*

Also note that each SQL statement should end with a semicolon. Otherwise, an interactive SQL interpreter will keep waiting for additional input.

You can also explicitly name the columns you are interested in, such as:

In [3]:
%%sql

SELECT first_name, last_name FROM actor;

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


first_name,last_name
Penelope,Guiness
Nick,Wahlberg
Ed,Chase
Jennifer,Davis
Johnny,Lollobrigida
Bette,Nicholson
Grace,Mostel
Matthew,Johansson
Joe,Swank
Christian,Gable


Thus, the `select` clause determines which of all possible columns are included in the query's result set.

Using the asterisk (\*) in the `select` clause is not good practice when you embed SQL statements in application code, for the following reasons:

+ For a large table with many columns, a `select` statement with the asterisk (\*) shorthand retrieves data from every column of the table, which may not be necessary.
+ Retrieving unnecessary data also increases the traffic between the database and application layers. As a result, your applications become slower and less scalable. It is therefore good practice to name the columns explicitly in the `select` clause whenever possible, so that you get only the data you need from a table.
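
The difference described in the list above can be observed from application code. The following is a minimal sketch, assuming Python's built-in `sqlite3` module and an in-memory copy of two *actor* rows rather than the PostgreSQL `dvdrental` database. The helper `column_names` is a hypothetical name introduced only for this illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor "
    "(actor_id INTEGER, first_name TEXT, last_name TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO actor VALUES (?, ?, ?, ?)",
    [
        (1, "Penelope", "Guiness", "2013-05-26 14:47:57.620000"),
        (2, "Nick", "Wahlberg", "2013-05-26 14:47:57.620000"),
    ],
)


def column_names(sql):
    """Return the names of the columns a query sends back to the application."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


# The asterisk pulls every column, whether the application uses it or not.
all_columns = column_names("SELECT * FROM actor")

# Naming the columns transfers only what the application needs.
needed_columns = column_names("SELECT first_name, last_name FROM actor")

print(all_columns)
print(needed_columns)
```

On a four-column table the saving is small, but the same pattern applies to wide tables where most columns are never read by the application.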

For these reasons, use the asterisk (\*) shorthand only for ad-hoc queries that examine the data. When you need to select all columns to preview the data, you can use the `limit` clause to cap the number of rows in the result set:

In [4]:
%%sql 

SELECT * FROM actor LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


actor_id,first_name,last_name,last_update
1,Penelope,Guiness,2013-05-26 14:47:57.620000
2,Nick,Wahlberg,2013-05-26 14:47:57.620000
3,Ed,Chase,2013-05-26 14:47:57.620000
4,Jennifer,Davis,2013-05-26 14:47:57.620000
5,Johnny,Lollobrigida,2013-05-26 14:47:57.620000
6,Bette,Nicholson,2013-05-26 14:47:57.620000
7,Grace,Mostel,2013-05-26 14:47:57.620000
8,Matthew,Johansson,2013-05-26 14:47:57.620000
9,Joe,Swank,2013-05-26 14:47:57.620000
10,Christian,Gable,2013-05-26 14:47:57.620000


***Note:*** *Not all database systems support the `limit` clause. PostgreSQL supports it, as does MySQL.*
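
As a rough sketch of a row-limited preview run from code, the example below uses Python's built-in `sqlite3` module (SQLite also accepts `LIMIT`) with an in-memory copy of a few *actor* rows. This is an assumption made to keep the example self-contained; it is not the PostgreSQL setup used in this notebook. An `ORDER BY` is added because, without one, the database does not guarantee which rows `LIMIT` returns.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [
        (1, "Penelope", "Guiness"),
        (2, "Nick", "Wahlberg"),
        (3, "Ed", "Chase"),
        (4, "Jennifer", "Davis"),
        (5, "Johnny", "Lollobrigida"),
    ],
)

# ORDER BY makes the preview deterministic; LIMIT caps the number of rows.
preview = conn.execute("SELECT * FROM actor ORDER BY actor_id LIMIT 3").fetchall()

for row in preview:
    print(row)
```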

The `select` clause is not limited to plain columns. You can also use expressions, as shown below:

In [5]:
%%sql 

SELECT actor_id, 
actor_id * 3.14159, 
UPPER(last_name) 
FROM actor LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


actor_id,?column?,upper
1,3.14159,GUINESS
2,6.28318,WAHLBERG
3,9.42477,CHASE
4,12.56636,DAVIS
5,15.70795,LOLLOBRIGIDA
6,18.84954,NICHOLSON
7,21.99113,MOSTEL
8,25.13272,JOHANSSON
9,28.27431,SWANK
10,31.4159,GABLE
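
Expressions in the `select` clause can take other forms as well. The sketch below, which assumes Python's built-in `sqlite3` module and an in-memory copy of three *actor* rows instead of the PostgreSQL database, combines integer arithmetic, string concatenation with the `||` operator, and the `LENGTH` function in one query.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [(1, "Penelope", "Guiness"), (2, "Nick", "Wahlberg"), (3, "Ed", "Chase")],
)

rows = conn.execute(
    """
    SELECT actor_id,
           actor_id + 1000,                 -- arithmetic on a column
           first_name || ' ' || last_name,  -- string concatenation
           LENGTH(last_name)                -- built-in function call
    FROM actor
    ORDER BY actor_id
    """
).fetchall()

for row in rows:
    print(row)
```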


To call certain built-in functions or to evaluate a simple expression, you can skip the `from` clause entirely. For example:

In [6]:
%%sql

SELECT version();

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


version
"PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, 64-bit"

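The same idea can be tried without any table at all. This is a minimal sketch assuming Python's built-in `sqlite3` module rather than PostgreSQL, so the version function is SQLite's `sqlite_version()` instead of `version()`.

```python
import sqlite3

# In-memory SQLite database with no tables at all (assumption: SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")

# A built-in function call needs no FROM clause.
(engine_version,) = conn.execute("SELECT sqlite_version()").fetchone()

# Neither does a simple expression.
(total,) = conn.execute("SELECT 2 + 3").fetchone()
(shouted,) = conn.execute("SELECT UPPER('guiness')").fetchone()

print(engine_version, total, shouted)
```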

### Column Aliases

Although a SQL query generates labels for the columns it returns, you may still want to assign a new label to a column from a table. When result columns are generated by expressions or built-in function calls, you will almost certainly want to assign your own labels to them. You can use the `as` keyword before the alias name, as in the example below:

In [7]:
%%sql

SELECT actor_id, 
actor_id * 3.14159 AS actor_id_x_pi, 
UPPER(last_name) AS last_name_upper
FROM actor LIMIT 10;

 * postgresql://postgres:***@localhost/dvdrental
10 rows affected.


actor_id,actor_id_x_pi,last_name_upper
1,3.14159,GUINESS
2,6.28318,WAHLBERG
3,9.42477,CHASE
4,12.56636,DAVIS
5,15.70795,LOLLOBRIGIDA
6,18.84954,NICHOLSON
7,21.99113,MOSTEL
8,25.13272,JOHANSSON
9,28.27431,SWANK
10,31.4159,GABLE
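
Aliases also matter to application code, which often looks up result columns by label. The sketch below assumes Python's built-in `sqlite3` module and an in-memory copy of two *actor* rows as a stand-in for the PostgreSQL database. It reads the labels back from the cursor to show that the aliases become the column names of the result set.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [(1, "Penelope", "Guiness"), (2, "Nick", "Wahlberg")],
)

cursor = conn.execute(
    """
    SELECT actor_id,
           actor_id * 3.14159 AS actor_id_x_pi,
           UPPER(last_name)   AS last_name_upper
    FROM actor
    ORDER BY actor_id
    """
)

# cursor.description holds one entry per result column; item 0 is its label.
labels = [col[0] for col in cursor.description]
first_row = cursor.fetchone()

print(labels)
print(first_row)
```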


### Removing Duplicates

In some cases, a query might return duplicate rows of data. For example, consider retrieving the IDs of customers from the *payment* table:

In [8]:
%%sql

SELECT customer_id FROM payment LIMIT 100;

 * postgresql://postgres:***@localhost/dvdrental
100 rows affected.


customer_id
341
341
341
341
341
341
342
342
342
343


Because some customers have more than one payment transaction, the same customer ID appears multiple times. In this case, you probably want a distinct set of customers that have payment transactions, which you can get by adding the `distinct` keyword directly after `select`:

In [9]:
%%sql

SELECT DISTINCT customer_id FROM payment;

 * postgresql://postgres:***@localhost/dvdrental
599 rows affected.


customer_id
184
87
477
273
550
51
394
272
70
190
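
The effect of `distinct` can be checked on a small scale. The sketch below assumes Python's built-in `sqlite3` module and an in-memory *payment* table holding only the ten `customer_id` values shown in the earlier output, not the full PostgreSQL table. An `ORDER BY` is added because `DISTINCT` on its own does not guarantee any particular row order.

```python
import sqlite3

# In-memory SQLite stand-in for the tutorial's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER)")

# The ten customer_id values shown in the earlier non-distinct output.
sample_ids = [341] * 6 + [342] * 3 + [343]
conn.executemany("INSERT INTO payment VALUES (?)", [(cid,) for cid in sample_ids])

# Without DISTINCT: one row per payment, so IDs repeat.
all_ids = conn.execute("SELECT customer_id FROM payment").fetchall()

# With DISTINCT: each customer appears once.
distinct_ids = conn.execute(
    "SELECT DISTINCT customer_id FROM payment ORDER BY customer_id"
).fetchall()

print(len(all_ids), "rows without DISTINCT")
print(distinct_ids)
```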
