# Bucketing & Partitioning

In this notebook you will see the advantages of partitioning and bucketing. It depends on the results of the two previous notebooks, so run them first:
* Partitioning I
* Bucketing I

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Bucketing II')
    .enableHiveSupport()
    .getOrCreate()
)

# Task I

* join users with questions
 * take questions only for selected years (the solution below uses 2016 to 2018)
 * see the query plan
* turn off broadcast hash join to see the consequence of bucketing 

In [None]:
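# Directory the notebook is started from
base_path = os.getcwd()

# The project root is two directory levels above the notebook directory
project_path = os.path.dirname(os.path.dirname(base_path))

# Partitioned questions data
questions_input_path = os.path.join(project_path, 'output/1/questions-partitioned')

# Bucketed users data
users_input_path = os.path.join(project_path, 'output/users-bucketed')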
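
The path cell above walks two directory levels up from the working directory to find the project root. The following is a minimal sketch of that derivation on a made-up directory. The helper name `project_root` and the example path are hypothetical and only serve as illustration. `posixpath` is used so the result is the same on every operating system.

```python
import posixpath


def project_root(cwd, levels_up=2):
    """Strip `levels_up` trailing directory names from `cwd`."""
    root = cwd
    for _ in range(levels_up):
        root = posixpath.dirname(root)
    return root


# Hypothetical working directory, not the real layout of this project
cwd = '/home/user/project/notebooks/bucketing'

root = project_root(cwd)
print(root)                                             # /home/user/project
print(posixpath.join(root, 'output/users-bucketed'))    # /home/user/project/output/users-bucketed
```

If the notebook is started from a different directory depth, the derived root changes with it, so it is worth printing the two input paths once before reading the data.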

#### Note:

If you are using a Derby database as the metastore (the default in local mode) and you have already connected to the metastore from another notebook, the next command will fail, because only one application can connect to the database at a time. In that case, shut down Jupyter Notebook and start it again (restarting the kernel is not enough).

In [None]:
spark.sql("show tables").show()

#### Read the data:

Hint:
* read users from the table
 * use spark.table(table_name)
* read questions from the partitioned layout

In [None]:
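# Users are read from the table registered in the metastore
usersDF = spark.table('users')

# Questions are read directly from the partitioned files;
# load() without an explicit format uses Spark's default data source
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)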

#### Turn off broadcast hash join:

Hint:
* set spark.sql.autoBroadcastJoinThreshold to -1

In [None]:
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1)

#### Write the join query:

Hint:
* use repartition(10, 'user_id') on the questions to achieve a 'one-side shuffle-free join' (the partition count should match the number of buckets of the users table)

In [None]:
(
    questionsDF
    .filter(col('year').isin([2018, 2017, 2016]))
    .repartition(10, 'user_id')
    .join(usersDF, 'user_id')
    .select('user_id', 'year')
).collect()
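
Why repartitioning only the questions can be enough is sketched below in plain Python. The sketch assumes that the users table was bucketed by user_id into 10 buckets in the previous notebook (the bucket count is not shown here, so this is an assumption). It also uses `zlib.crc32` as a stand-in hash, whereas Spark uses its own Murmur3-based hash. All function names are hypothetical. The point is only that when both sides use the same hash and the same number of partitions, equal keys end up at the same partition index, so the already bucketed side does not need to move.

```python
import zlib

NUM_BUCKETS = 10  # assumed bucket count of the users table


def partition_of(user_id, num_partitions=NUM_BUCKETS):
    """Stand-in for Spark's hash partitioning (crc32 instead of Murmur3)."""
    return zlib.crc32(str(user_id).encode('utf-8')) % num_partitions


def distribute(user_ids, num_partitions=NUM_BUCKETS):
    """Place every key into the partition chosen by partition_of."""
    partitions = [[] for _ in range(num_partitions)]
    for user_id in user_ids:
        partitions[partition_of(user_id, num_partitions)].append(user_id)
    return partitions


def joinable_locally(left, right):
    """True if every key present on both sides sits at the same partition index."""
    where_left = {uid: i for i, part in enumerate(left) for uid in part}
    where_right = {uid: i for i, part in enumerate(right) for uid in part}
    shared = where_left.keys() & where_right.keys()
    return all(where_left[uid] == where_right[uid] for uid in shared)


users = distribute([1, 2, 3, 42, 8440])          # plays the role of the bucketed table
questions = distribute([8440, 8440, 3, 42, 7])   # plays the role of repartition(10, 'user_id')
print(joinable_locally(users, questions))        # True
```

With a different partition count on one side the guarantee is lost, which is why the number passed to repartition matters.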

#### See the query plan

Hint:
* open the SQL tab in the Spark UI
* from the plan you should see:
 * partition pruning
 * Exchange only in one branch of the plan
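
As an alternative to the Spark UI, the two points above can also be checked on the text of the physical plan. The sketch below captures what `df.explain()` prints and searches it for shuffle and partition-filter entries. The helper names are hypothetical, and the strings `Exchange hashpartitioning` and `PartitionFilters: [...]` are assumptions about how the plan is printed, which can differ between Spark versions.

```python
import io
import re
from contextlib import redirect_stdout


def plan_text(df):
    """Return the physical plan that df.explain() prints, as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def shuffle_exchanges(plan):
    """List the plan lines that contain a hash-partitioning shuffle."""
    return [line.strip() for line in plan.splitlines()
            if 'Exchange hashpartitioning' in line]


def partition_filters(plan):
    """Return the contents of every non-empty 'PartitionFilters: [...]' entry."""
    found = re.findall(r'PartitionFilters: \[([^\]]*)\]', plan)
    return [entry for entry in found if entry]
```

If the join from Task I is assigned to a variable before calling collect, passing it to `plan_text` gives the string for the other two helpers. A single entry from `shuffle_exchanges` would correspond to an Exchange in only one branch, and a non-empty result from `partition_filters` mentioning year would indicate partition pruning.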

# Task II

Repeat the join from Task I, but this time filter for a specific user, user_id = 8440. Then look at the query plan: the Parquet scan node should show that only 1 bucket was selected and that much less data is scanned (see the input size of the first stage of the job).

Hint:
* use collect instead of show to see the real difference in the amount of data that is scanned

In [None]:
(
    questionsDF
    .filter(col('year').isin([2018, 2017, 2016]))
    .repartition(10, 'user_id')
    .join(usersDF, 'user_id')
    .filter(col('user_id') == 8440)
    .select('user_id', 'year')
).collect()
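
Bucket pruning can also be read from the plan text, for example from the output of `explain()` or from the details section of the SQL tab. The sketch below extracts the bucket counts from such a text. The helper name is hypothetical, and the `SelectedBucketsCount: X out of Y` format is an assumption about how the scan node is printed, which may vary between Spark versions.

```python
import re


def selected_buckets(plan):
    """Return (selected, total) from the first SelectedBucketsCount entry, or None."""
    match = re.search(r'SelectedBucketsCount: (\d+) out of (\d+)', plan)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For the query in this task the first number is expected to be 1, matching the single bucket that holds the filtered user.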

In [None]:
spark.stop()