# Cost Based Optimization

## Task: see how statistics are used

* turn the CBO on
* run a simple query and see the query plan with stats using EXPLAIN COST
    * since Spark 3.0 we can use `explain(mode='cost')`
* run ANALYZE TABLE and see the plan again
* compute stats for individual columns and see the difference
* compute the histogram

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('CBO')
    .enableHiveSupport()
    .getOrCreate()
)

#### Check the CBO

In [None]:
spark.conf.get('spark.sql.cbo.enabled')

#### Turn CBO on in case it was off

In [None]:
spark.conf.set('spark.sql.cbo.enabled', True)

#### See the query plan with stats

Hint:
* we will work with the table `users`
  * use [tableExists](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.tableExists.html#pyspark.sql.Catalog.tableExists) from the catalog API to verify that the table exists
* compose a query with the filter `user_id < -1000` (we know that there are no such records)
* use `explain` with `mode='cost'` to see the plan with stats

In [None]:
spark.catalog.tableExists('users')

In [None]:
(
    spark.table('users')
    .filter(col('user_id') < -1000)
).explain(mode='cost')
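
The cost-mode plan is printed rather than returned, so comparing it before and after ANALYZE TABLE means reading it by eye. As a minimal sketch, assuming `explain` writes to Python's standard output (which is how PySpark prints the plan in a regular session), we can capture the text and keep only the lines that carry a `Statistics(...)` annotation. The helper name `statistics_lines` is hypothetical, not part of the Spark API.

```python
import io
from contextlib import redirect_stdout


def statistics_lines(df):
    """Return the plan lines that carry a Statistics(...) annotation."""
    buffer = io.StringIO()
    # DataFrame.explain prints the plan instead of returning it,
    # so capture standard output while it runs.
    with redirect_stdout(buffer):
        df.explain(mode='cost')
    return [
        line.strip()
        for line in buffer.getvalue().splitlines()
        if 'Statistics(' in line
    ]
```

With a live session this could be called as `statistics_lines(spark.table('users').filter(col('user_id') < -1000))`, once before and once after computing the statistics.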

#### See the statistics for the table

Hint:
* use SQL `DESC EXTENDED`

In [None]:
spark.sql('DESC EXTENDED users').show(truncate=60, n=50)
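
The output of `DESC EXTENDED` mixes column definitions with table metadata, so the statistics entry is easy to miss. A small sketch for picking it out, assuming the result has the usual three columns (`col_name`, `data_type`, `comment`) and that table-level statistics, when present, appear in a row whose `col_name` is `Statistics`. The helper name `table_statistics` is hypothetical.

```python
def table_statistics(rows):
    """Return the value of the 'Statistics' row of DESC EXTENDED, or None."""
    for row in rows:
        # Each row is (col_name, data_type, comment); table-level metadata
        # puts the property name in col_name and its value in data_type.
        if row[0] == 'Statistics':
            return row[1]
    return None
```

It can be fed with `spark.sql('DESC EXTENDED users').collect()`, since each collected `Row` supports positional indexing. A `None` result means the catalog holds no statistics row for the table yet.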

#### Compute the statistics

Hint:
* run SQL `ANALYZE TABLE ... COMPUTE STATISTICS`

In [None]:
spark.sql('ANALYZE TABLE users COMPUTE STATISTICS')

#### See the stats again

In [None]:
spark.sql('DESC EXTENDED users').show(truncate=60, n=50)

#### See the query plan for the query again

In [None]:
(
    spark.table('users')
    .filter(col('user_id') < -1000)
).explain(mode='cost')

#### See column level stats

Hint:
* use `DESC EXTENDED table_name col_name` (no comma between the table and the column name)

In [None]:
spark.sql('DESC EXTENDED users user_id').show(truncate=60)

#### Compute column level stats

Hint:
* use ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS col_names

In [None]:
spark.sql('ANALYZE TABLE users COMPUTE STATISTICS FOR COLUMNS user_id, display_name')
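
The table-level and column-level commands differ only in the `FOR COLUMNS` clause. If we want to run them for several tables, a tiny builder keeps the statements consistent. This is a sketch only: the function name `analyze_statement` is hypothetical, and it does no quoting or validation, so it should only be given trusted identifiers.

```python
def analyze_statement(table, columns=None):
    """Build an ANALYZE TABLE statement for table-level or column-level stats."""
    statement = f'ANALYZE TABLE {table} COMPUTE STATISTICS'
    if columns:
        # Column-level statistics are requested by listing the column names.
        statement += ' FOR COLUMNS ' + ', '.join(columns)
    return statement
```

The cell above is then equivalent to `spark.sql(analyze_statement('users', ['user_id', 'display_name']))`.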

#### See the stats again

In [None]:
spark.sql('DESC EXTENDED users user_id').show(truncate=60)

#### See the plan again

In [None]:
(
    spark.table('users')
    .filter(col('user_id') < -1000)
).explain(mode='cost')
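
Why can the estimate for this filter change once column level stats exist? With the minimum and maximum of `user_id` known, the optimizer can tell that the range `user_id < -1000` lies entirely below the data. The sketch below is a simplified model of that reasoning, not Spark's actual code: it assumes values are spread uniformly between the minimum and the maximum, and the names `less_than_selectivity`, `col_min` and `col_max` are hypothetical.

```python
def less_than_selectivity(literal, col_min, col_max):
    """Estimate the fraction of rows with value < literal from min/max only."""
    if literal <= col_min:
        # The predicate range lies entirely below the data: no row qualifies.
        return 0.0
    if literal > col_max:
        # Every value is smaller than the literal: all rows qualify.
        return 1.0
    # Here col_min < literal <= col_max, so the denominator is positive.
    # Assume a uniform distribution and interpolate linearly.
    return (literal - col_min) / (col_max - col_min)
```

Multiplying the result by the table's row count gives the estimated number of rows after the filter. For a column whose minimum is above -1000, `less_than_selectivity(-1000, col_min, col_max)` is `0.0`, which matches the fact that no such records exist.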

#### Compute the histogram for specific cols

Hint:
* check whether histograms are enabled
* enable them if not
* compute the column level stats again

In [None]:
spark.conf.get('spark.sql.statistics.histogram.enabled')

In [None]:
spark.conf.set('spark.sql.statistics.histogram.enabled', True)

In [None]:
spark.sql('ANALYZE TABLE users COMPUTE STATISTICS FOR COLUMNS user_id')

#### See the stats again

In [None]:
spark.sql('DESC EXTENDED users user_id').show(truncate=60)
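
Spark's column histograms are equi-height: each bin covers roughly the same number of rows, so bins are narrow where the data is dense. The sketch below illustrates why this helps with skewed columns. It is a simplified pure-Python model (Spark derives the boundaries from approximate percentiles), and the function names and sample values are made up for illustration.

```python
def equi_height_bounds(values, num_bins):
    """Return num_bins + 1 boundaries so each bin holds about the same number of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_bins)] for i in range(num_bins + 1)]


def fraction_below(literal, bounds):
    """Estimate the share of values < literal, assuming uniformity inside each bin."""
    num_bins = len(bounds) - 1
    covered = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if literal <= lo:
            break  # this bin and all later ones start at or above the literal
        if literal >= hi:
            covered += 1.0  # the whole bin lies below the literal
        else:
            covered += (literal - lo) / (hi - lo)  # partial bin, lo < literal < hi
    return covered / num_bins


# Skewed sample: eight small values and one large outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 1000]
bounds = equi_height_bounds(sample, 4)
estimate = fraction_below(7, bounds)
```

For this sample, 6 of the 9 values are below 7. The histogram estimate is 0.75, while a min/max-only interpolation, `(7 - 1) / (1000 - 1)`, would give well under 1 percent.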

In [None]:
# Let's now try it again with the CBO OFF

spark.conf.set('spark.sql.cbo.enabled', False)

(
    spark.table('users')
    .filter(col('user_id') < -1000)
).explain(mode='cost')
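
The cell above leaves the CBO switched off for the rest of the session. If we only want to compare one plan, a context manager can restore the previous setting afterwards. This is a sketch under the assumption that the key already has a value or a default (true for `spark.sql.cbo.enabled`); the name `temporary_conf` is hypothetical, and it relies only on the `get` and `set` methods of `spark.conf`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set a config key for the duration of a with-block, then restore it."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        # Restore the earlier value even if the block raises.
        conf.set(key, previous)
```

In the notebook this would wrap the query: open the block with `with temporary_conf(spark.conf, 'spark.sql.cbo.enabled', False):` and call `explain(mode='cost')` inside it.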

To see more information about statistics in Spark, check my [article](https://towardsdatascience.com/statistics-in-spark-sql-explained-22ec389bf71b).

In [None]:
spark.stop()