# Bad optimizer estimations

In [None]:
from plotly.offline import init_notebook_mode
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
from sqlalchemy import create_engine
from query_flow.parsers.postgres_parser import PostgresParser
from query_flow.vizualizers.query_vizualizer import QueryVizualizer

**Problems caused by the optimizer are hard for regular users to detect. With QueryFlow we can visualize the optimizer's estimates and compare them with the actual statistics collected after execution.**

To visualize multiple metrics in the same Sankey diagram, we adjust the luminance of the color for each metric.  
Here we use QueryFlow to identify whether we have stale statistics and where they originate.  
The Sankey diagram that compares the estimated cardinality with the actual cardinality is produced in the next cell. To make it work without generating the database, we will use a mock.
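
The luminance adjustment mentioned above can be sketched with the standard-library `colorsys` module. This is only an illustration of the idea, not QueryFlow's actual implementation: the helper name `adjust_luminance`, the base color, and the scaling factors are all assumptions.

```python
import colorsys


def adjust_luminance(hex_color: str, factor: float) -> str:
    """Scale the HLS lightness of a '#rrggbb' color by factor, clamped to [0, 1]."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    l = max(0.0, min(1.0, l * factor))
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Hypothetical base color and factors: one hue, one shade per metric.
base_color = "#808080"
metric_colors = {
    "actual_rows": adjust_luminance(base_color, 0.6),  # darker shade
    "plan_rows": adjust_luminance(base_color, 1.4),    # lighter shade
}
print(metric_colors)
```

Deriving every shade from one base color keeps the edges of a single operator visually grouped while still telling the metrics apart.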

In [None]:
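con_str = 'postgresql:///etrabelsi_thesis'
# Rewrite every row of crew with its own value: the data is unchanged, but
# PostgreSQL leaves the old row versions behind as dead tuples.
with create_engine(con_str).connect() as con:
    con.execute("UPDATE crew set title_id=title_id")
query_renderer = QueryVizualizer(parser=PostgresParser())
query = """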
SELECT titles.title_id
FROM titles
INNER JOIN crew ON crew.title_id = titles.title_id
INNER JOIN people ON people.person_id = crew.person_id
WHERE genres like '%Comedy%' 
  AND name in ('Owen Wilson', 'Adam Sandler', 'Jason Segel')
"""
flow_df = query_renderer.get_flow_df(query, con_str=con_str)
query_renderer.vizualize(flow_df, title="Bad estimation for query 1", metrics=["actual_rows", "plan_rows"], open_=False)

We can see that each metric gets its own color: the darker gray represents the actual_rows metric and the lighter gray represents the plan_rows metric.

We can see that the optimizer was way off for the Crew scan: the light gray edge is much thicker than the darker one. The estimate is skewed because of the way PostgreSQL handles deleted and updated records. An update or a delete does not free the space of the old row version right away. Instead, PostgreSQL flags the old versions as “dead tuples”, and removing them requires the VACUUM command.
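
One way to see why leftover dead tuples skew the estimate: PostgreSQL's planner derives a table's row count from the tuple density stored in `pg_class` (`reltuples / relpages`) multiplied by the table's current number of pages. The sketch below is a simplified model of that calculation, not PostgreSQL's code; the function name and all numbers are made up for illustration.

```python
def estimated_rows(reltuples: float, relpages: int, current_pages: int) -> int:
    """Simplified planner estimate for an unfiltered scan:
    stored tuple density multiplied by the table's current page count."""
    if relpages == 0:
        return 0  # simplification; PostgreSQL uses other heuristics here
    density = reltuples / relpages
    return round(density * current_pages)


# Hypothetical statistics recorded before the UPDATE.
stats_rows, stats_pages = 1_000_000, 10_000

# Table unchanged since the statistics were collected.
print(estimated_rows(stats_rows, stats_pages, current_pages=10_000))  # 1000000

# Every row rewritten once: the old versions still occupy their pages,
# so the table is roughly twice as large and the estimate doubles.
print(estimated_rows(stats_rows, stats_pages, current_pages=20_000))  # 2000000
```

The number of live rows is the same in both cases; only the on-disk size changed, which is why the plan_rows edge grows while the actual_rows edge does not.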

We can clean up the dead tuples in the Crew relation by running the VACUUM command on that relation only. The query can be seen in the next cell.


In [None]:

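# VACUUM cannot run inside a transaction block, so use a raw psycopg2
# connection switched to autocommit mode.
engine = create_engine(con_str)
connection = engine.raw_connection()
connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
cursor = connection.cursor()
cursor.execute("VACUUM FULL crew")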
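
To check the dead tuples directly, one option is to read PostgreSQL's `pg_stat_user_tables` view, which reports approximate live and dead tuple counts per table. The helper below is a sketch and not part of QueryFlow; its name is hypothetical, and it assumes a DB-API cursor that uses the `%s` parameter style, as psycopg2 does.

```python
def dead_tuple_counts(cursor, table_name: str):
    """Return (n_live_tup, n_dead_tup) for table_name,
    or None if the table is not listed in pg_stat_user_tables."""
    cursor.execute(
        "SELECT n_live_tup, n_dead_tup "
        "FROM pg_stat_user_tables "
        "WHERE relname = %s",
        (table_name,),
    )
    return cursor.fetchone()
```

Calling `dead_tuple_counts(cursor, "crew")` before and after the VACUUM should show the dead tuple count dropping. The counts come from the statistics collector, so they are estimates and may lag slightly behind recent changes.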

To confirm that the optimizer statistics are up to date, we use QueryFlow to visualize the cardinality again. The Sankey diagram that compares the estimated cardinality with the actual cardinality after the VACUUM command can be seen in the next cell.

In [None]:
flow_df = query_renderer.get_flow_df(query, con_str=con_str)
query_renderer.vizualize(flow_df, title="Bad estimation for query 1", metrics=["actual_rows", "plan_rows"], open_=False)

We can immediately see that the Crew scan is no longer skewed: the darker and lighter edges of the Crew sub-expression are now proportional.
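
The visual comparison can also be expressed as a single number. A common choice is the q-error, the larger of the two ratios between the estimated and the actual row count. The sketch below is not part of QueryFlow, and the values passed to it are illustrative only, not measurements from this database.

```python
def q_error(plan_rows: float, actual_rows: float) -> float:
    """Return max(plan/actual, actual/plan); 1.0 means a perfect estimate."""
    # Clamp both sides to one row so empty results do not divide by zero.
    plan = max(plan_rows, 1.0)
    actual = max(actual_rows, 1.0)
    return max(plan / actual, actual / plan)


# Illustrative values only.
print(q_error(plan_rows=2_000_000, actual_rows=1_000_000))  # 2.0
print(q_error(plan_rows=1_000_000, actual_rows=1_000_000))  # 1.0
```

A q-error close to 1.0 corresponds to edges of similar thickness in the diagram; larger values correspond to the kind of mismatch seen before the VACUUM.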