In the previous file, we learned more about Postgres internal tables and how to build a database description from scratch. Using internal tables, we were able to describe metadata associated with our Postgres database. The metadata included data types, table names, created schemas, and plenty of other useful information.

In this file, we will examine queries and their plans of execution. Using Postgres tooling, we will introduce the basic techniques of database debugging. We will be working on the HUD database from the previous file.

We will focus on a [command called `EXPLAIN`](https://www.postgresql.org/docs/12/sql-explain.html). The `EXPLAIN` command takes in any SQL query and, instead of executing it, returns the query plan: the internal sequence of steps Postgres would follow to run the query.

Here's the format of an `EXPLAIN` query and an example of how to use the command:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    EXPLAIN 
    SELECT * FROM homeless_by_coc;
""")`

`print(cur.fetchall())`

Let's look at a cleaner version of the output from above.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    EXPLAIN 
    SELECT COUNT(*) FROM homeless_by_coc 
    WHERE year > '2012-01-01';
""")`

`query_plan = cur.fetchall()`

`for row in query_plan:
    print(row[0])`
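
Each row returned by `cur.fetchall()` for a text-format `EXPLAIN` is a one-element tuple holding one line of the plan. As a rough sketch, the numbers in those lines can be extracted with a regular expression. The helper name `parse_plan_line` and the sample rows below are illustrative assumptions (the `Aggregate` costs are made up), not output captured from the HUD database.

```python
import re

# Matches the "(cost=STARTUP..TOTAL rows=N width=W)" part of a text-format plan line.
COST_RE = re.compile(r"\(cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) width=(\d+)\)")

def parse_plan_line(line):
    """Return the node label and estimates from one plan line, or None if it has no cost."""
    match = COST_RE.search(line)
    if match is None:
        return None  # detail lines such as "Filter: ..." carry no cost information
    startup, total, rows, width = match.groups()
    # Drop the indentation and the "->" arrow that marks a child node.
    label = line[:match.start()].strip().lstrip("->").strip()
    return {
        "node": label,
        "startup_cost": float(startup),
        "total_cost": float(total),
        "rows": int(rows),
        "width": int(width),
    }

# Hypothetical rows, shaped like the result of cur.fetchall().
sample_rows = [
    ("Aggregate  (cost=2435.72..2435.73 rows=1 width=8)",),
    ("  ->  Seq Scan on homeless_by_coc  (cost=0.00..2363.61 rows=28843 width=0)",),
    ("        Filter: (year > '2012-01-01'::date)",),
]

parsed = [parse_plan_line(row[0]) for row in sample_rows]
for entry in parsed:
    print(entry)
```

Lines without a cost section, such as the filter description, come back as `None` and can simply be skipped.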

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`import json`

`cur.execute("""
    EXPLAIN (FORMAT json)
    SELECT COUNT(*) FROM homeless_by_coc 
    WHERE year > '2012-01-01';
""")`

`query_plan = cur.fetchone()`

`print(json.dumps(query_plan[0], indent=2))`

By outputting the query plan in JSON format, we are given additional information about each field. Here's the output once again:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
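
Because the JSON plan is a nested structure, it can also be traversed programmatically. The sketch below assumes the shape that psycopg2 hands back for `EXPLAIN (FORMAT json)`: a list containing one dictionary with a `"Plan"` key, where child nodes sit under `"Plans"`. The function name `walk_plan` and the numbers in `sample_result` are hypothetical, chosen only for illustration.

```python
def walk_plan(node, depth=0):
    """Yield (depth, node type, total cost) for a plan node and all of its children."""
    yield depth, node["Node Type"], node["Total Cost"]
    for child in node.get("Plans", []):  # leaf nodes have no "Plans" key
        yield from walk_plan(child, depth + 1)

# Hypothetical plan, shaped like query_plan[0] from the exercise above.
sample_result = [
    {
        "Plan": {
            "Node Type": "Aggregate",
            "Startup Cost": 2435.72,
            "Total Cost": 2435.73,
            "Plan Rows": 1,
            "Plans": [
                {
                    "Node Type": "Seq Scan",
                    "Relation Name": "homeless_by_coc",
                    "Startup Cost": 0.0,
                    "Total Cost": 2363.61,
                    "Plan Rows": 28843,
                }
            ],
        }
    }
]

nodes = list(walk_plan(sample_result[0]["Plan"]))
for depth, node_type, total_cost in nodes:
    print("  " * depth + f"{node_type} (total cost: {total_cost})")
```

Printing with indentation proportional to the depth gives a compact view of the tree without the rest of the JSON fields.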

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025
seq_page_cost = 1.0
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    SELECT reltuples, relpages
    FROM pg_class
    WHERE relname = 'homeless_by_coc';
""")`

`n_tuple, n_page = cur.fetchone()`

`total_cost = (cpu_tuple_cost + cpu_operator_cost) * n_tuple + seq_page_cost * n_page
print(total_cost)`

Above, we saw that the formula gave the value `2363.6125`, which matches the cost that we obtained with `EXPLAIN`. The only difference is that the `EXPLAIN` command rounds the cost to two decimal places, so it showed the value `2363.61`.
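
The same calculation can be wrapped in a small function so it is easy to try with other table sizes. This is only a sketch of the formula used in the exercise: the function name `seq_scan_cost` is an assumption, and the row and page counts passed to it are hypothetical round numbers, not values read from `pg_class`.

```python
def seq_scan_cost(n_tuple, n_page,
                  cpu_tuple_cost=0.01,
                  cpu_operator_cost=0.0025,
                  seq_page_cost=1.0):
    """Estimated cost of a filtered sequential scan, using the formula from the exercise."""
    cpu_cost = (cpu_tuple_cost + cpu_operator_cost) * n_tuple  # per-row processing
    io_cost = seq_page_cost * n_page                           # per-page reads
    return cpu_cost + io_cost

# Hypothetical table: 10,000 rows stored on 100 pages.
example_cost = seq_scan_cost(n_tuple=10_000, n_page=100)
print(example_cost)  # approximately 225

# EXPLAIN displays costs with two decimal places, which is why
# 2363.6125 appears as 2363.61 in the plan.
displayed = f"{2363.6125:.2f}"
print(displayed)
```

Keeping the cost constants as keyword arguments makes it possible to see how the estimate would move if the server were configured with different values.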

Let's discuss some drawbacks of `EXPLAIN`. First, when we run the `EXPLAIN` command, we only get an estimate of the cost of running a query and not actual running times. While the estimate can help us spot a potential bottleneck, we don't know if it is a real concern until we perform some benchmarking.

For example, above we looked at the query plan tree of the query `SELECT * FROM homeless_by_coc WHERE year > '2012-01-01'`. In the `Seq Scan` node, we can see the following data:

`"Plan Rows": 28843`

However, if we actually perform the query, we will obtain `50589` rows. The reason for this discrepancy is that, under the hood, `EXPLAIN` relies on statistics kept in internal tables to produce its estimates. As we saw, one of these tables is the [`pg_class` table](https://www.postgresql.org/docs/12/catalog-pg-class.html), where the estimated numbers of rows and pages are stored. Because this table only stores estimates (not actual values), the `EXPLAIN` command can only give us approximate values for our queries.
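
To get a feel for the size of the gap, the two row counts quoted above can be compared directly. This is plain arithmetic on the numbers from the text; the variable names are only illustrative.

```python
plan_rows = 28843    # "Plan Rows" estimated by EXPLAIN
actual_rows = 50589  # rows obtained when the query is actually executed

missed_rows = actual_rows - plan_rows  # rows the planner did not anticipate
ratio = actual_rows / plan_rows        # how many times larger the real result is

print(f"The estimate is off by {missed_rows} rows ({ratio:.2f}x the planned count).")
```

An estimate that is off by this much does not change the result of the query, but it can influence which plan the planner considers cheapest.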

![image.png](attachment:image.png)

One of the drawbacks of using the `ANALYZE` option is that, because the query is actually executed, the `EXPLAIN` command takes longer to return. Furthermore, since the query is executed, it might alter the database, depending on the query. Since these commands are mostly used to debug and identify slow queries, we usually don't want them to alter the database.

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
import json
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT COUNT(*) 
    FROM homeless_by_coc 
    WHERE year > '2012-01-01';
""")`

`query_plan = cur.fetchone()
print(json.dumps(query_plan[0], indent=2))`

Let's look at the result of the output from the above exercise (we added arrows `<-` to highlight the new lines provided by the `ANALYZE` option):

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
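
With the `ANALYZE` option, every node carries both the planner's estimate and the measured values, so the two can be compared side by side. The sketch below assumes the JSON keys `"Plan Rows"`, `"Actual Rows"`, and `"Execution Time"` found in this kind of output. The function name `estimate_vs_actual` is an assumption, and the timings in `sample_analyze` are made-up placeholders; only the two row counts of the `Seq Scan` node come from the text above.

```python
def estimate_vs_actual(node):
    """Yield (node type, estimated rows, actual rows) for every node in an ANALYZE plan."""
    yield node["Node Type"], node["Plan Rows"], node["Actual Rows"]
    for child in node.get("Plans", []):
        yield from estimate_vs_actual(child)

# Hypothetical result, shaped like query_plan[0] from the exercise above.
sample_analyze = [
    {
        "Plan": {
            "Node Type": "Aggregate",
            "Plan Rows": 1,
            "Actual Rows": 1,
            "Actual Total Time": 12.1,   # placeholder timing in milliseconds
            "Plans": [
                {
                    "Node Type": "Seq Scan",
                    "Relation Name": "homeless_by_coc",
                    "Plan Rows": 28843,
                    "Actual Rows": 50589,
                    "Actual Total Time": 9.8,  # placeholder timing in milliseconds
                }
            ],
        },
        "Planning Time": 0.1,    # placeholder
        "Execution Time": 12.3,  # placeholder
    }
]

comparison = list(estimate_vs_actual(sample_analyze[0]["Plan"]))
for node_type, planned, actual in comparison:
    print(f"{node_type}: planned {planned} rows, got {actual}")
print("Execution time (ms):", sample_analyze[0]["Execution Time"])
```

Nodes where the planned and actual row counts differ widely are usually the first place to look when a plan behaves worse than expected.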

**Task**

![image.png](attachment:image.png)

**Answer**

`import psycopg2
import json
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    DELETE FROM state_household_incomes;
""")`

`conn.rollback()
query_plan = cur.fetchone()
print(json.dumps(query_plan[0], indent=2))`
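
The rollback pattern from this exercise can be packaged so that the rollback happens even if the query fails. This is a minimal sketch, not part of the original lesson: the function name `explain_analyze_rolled_back` is hypothetical, and it assumes the connection is in psycopg2's default mode (not autocommit), so the statement runs inside a transaction that can be undone.

```python
def explain_analyze_rolled_back(conn, query):
    """Run EXPLAIN ANALYZE on `query`, then roll back so its effects are discarded."""
    cur = conn.cursor()
    try:
        cur.execute("EXPLAIN (ANALYZE, FORMAT json) " + query)
        # Read the plan before undoing the transaction.
        return cur.fetchone()[0]
    finally:
        # Always undo the changes made while measuring, even on errors.
        conn.rollback()
        cur.close()

# Example usage (requires a live connection, so it is left commented out):
# conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
# plan = explain_analyze_rolled_back(conn, "DELETE FROM state_household_incomes;")
```

Putting the rollback in a `finally` block means the table is left untouched whether the statement succeeds or raises an exception.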

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**


`import psycopg2
import json
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()`

`cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT homeless_by_coc.state, homeless_by_coc.coc_number, homeless_by_coc.coc_name, state_info.name 
    FROM homeless_by_coc, state_info
    WHERE homeless_by_coc.state = state_info.postal;
""")`

`query_plan = cur.fetchone()
print(json.dumps(query_plan[0], indent=2))`

Above, we obtained a query plan with the following structure (some lines are omitted because the plan is too long):


![image.png](attachment:image.png)

![image.png](attachment:image.png)


We will learn later on what the `Hash Join` and `Hash` nodes represent. But we can already see that a join query requires a much more complex plan. It will require, among other things, one sequential scan over the `homeless_by_coc` table and another one over the `state_info` table.
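
One way to confirm which tables are scanned sequentially is to search the plan tree for `Seq Scan` nodes. The sketch below is built on an assumed, heavily trimmed plan that mirrors the structure described above (a `Hash Join` with a `Seq Scan` child and a `Hash` child that wraps a second `Seq Scan`). The function name `find_seq_scans` is hypothetical, and real plans contain many more fields per node.

```python
def find_seq_scans(node):
    """Yield the table name of every sequential scan found in a plan tree."""
    if node["Node Type"] == "Seq Scan":
        yield node["Relation Name"]
    for child in node.get("Plans", []):
        yield from find_seq_scans(child)

# Assumed, trimmed-down shape of the join plan (costs and timings omitted).
sample_join_plan = {
    "Node Type": "Hash Join",
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "homeless_by_coc"},
        {
            "Node Type": "Hash",
            "Plans": [
                {"Node Type": "Seq Scan", "Relation Name": "state_info"},
            ],
        },
    ],
}

scanned_tables = list(find_seq_scans(sample_join_plan))
print(scanned_tables)
```

With a real plan, the same function would be called on `query_plan[0][0]["Plan"]`.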

Sequential scans like these can become extremely inefficient as the size of our tables increases. In the next file, we will address these concerns by adding a special option to our tables. This option will let Postgres replace the sequential scans with an approach that is less computationally expensive for reads and joins.

In this file, we learned about the path a Postgres query follows. Then, we focused on the planner/optimizer part of the path and introduced a Postgres command, `EXPLAIN`, that takes in a query and returns the plan produced by the planner/optimizer. Finally, we added the `ANALYZE` option to the `EXPLAIN` query to obtain the actual times of the query execution.

In the next file, we will introduce the concept of indexing to speed up table querying. Using the `EXPLAIN ANALYZE` command, we will benchmark queries with and without indexes and determine whether there are any performance gains. Finally, we will introduce the concept of multiple indexes, which might provide additional performance boosts.