# Bucketing

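In this notebook you will save a dataset as a bucketed table and see what bucketing buys you: a join that avoids shuffling the bucketed side, and a filter that reads only the relevant bucket.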

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Bucketing')
    .getOrCreate()
)

In [None]:
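base_path = os.getcwd()

# The project root is three directory levels above the notebook's working directory.
project_path = '/'.join(base_path.split('/')[:-3])

# Input datasets
questions_input_path = os.path.join(project_path, 'data/questions-json')
users_input_path = os.path.join(project_path, 'data/users')

# Destination for the files of the bucketed users table
users_output_path = os.path.join(project_path, 'output/users-bucketed')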

#### Note:

We will use an in-memory metastore, which is designed for testing purposes. Right now it should be empty; you can check this by running the SQL statement `show tables`.

In [None]:
spark.sql('show tables').show()

#### Read the data:

* we will need the users and questions datasets

In [None]:
# your code here:
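
One possible way to fill in the cell above is sketched below. It assumes the questions are stored as JSON (suggested by the `questions-json` directory name) and the users as Parquet; the users format is a guess, so swap in the matching reader if your copy of the data differs. The helper name `load_inputs` is hypothetical.

```python
def load_inputs(spark, questions_path, users_path):
    """Read both datasets and return them as (questions, users)."""
    # JSON is inferred from the 'questions-json' directory name.
    questions = spark.read.json(questions_path)
    # Assumption: users are stored as Parquet; change the reader if they are not.
    users = spark.read.parquet(users_path)
    return questions, users
```

In the notebook this could be called as `questions_df, users_df = load_inputs(spark, questions_input_path, users_input_path)`.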


# Task I: Save the users as a bucketed table

Hint:
* repartition by the same column that you use for bucketing (`user_id`)
    * choose the same number of partitions as the number of buckets you want (10)
* use [bucketBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html#pyspark.sql.DataFrameWriter.bucketBy) with 10 buckets
* use `users_output_path` as the destination for the data in the table

In [None]:
# your code here:
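
The hints above could be combined roughly as follows. This is a sketch, not the reference solution: the table name `users`, the helper name `save_users_bucketed`, and the `overwrite` mode are assumptions added for illustration, and the output format is left at Spark's default.

```python
def save_users_bucketed(users_df, output_path, num_buckets=10, table_name="users"):
    """Write users_df as a table bucketed by user_id."""
    (
        users_df
        # One partition per bucket, so each task should write a single bucket file.
        .repartition(num_buckets, "user_id")
        .write
        # Assumption: overwrite makes the cell safe to re-run.
        .mode("overwrite")
        # An explicit path keeps the table's files under output_path.
        .option("path", output_path)
        .bucketBy(num_buckets, "user_id")
        .saveAsTable(table_name)
    )
```

A call such as `save_users_bucketed(users_df, users_output_path)` would then register the table in the metastore. Bucketing only works with `saveAsTable`, because the bucketing information is stored as table metadata.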


Check what tables we have in the metastore now:

In [None]:
# your code here:
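
Besides running `spark.sql('show tables').show()` again, the catalog API can return the table names as a plain Python list, which is handy for a quick programmatic check. The helper name `table_names` is hypothetical.

```python
def table_names(spark):
    """Return the names of the tables visible in the current database."""
    return [table.name for table in spark.catalog.listTables()]
```

After Task I, the returned list should contain the table you just saved.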


# Task II

* join users with questions
* see the query plan
* turn off broadcast hash join to see the consequence of bucketing
    * set `spark.sql.autoBroadcastJoinThreshold` to -1

In [None]:
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1)
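
Setting the threshold to -1 affects every later query in the session. If you want the previous value back afterwards, one option is a small context manager like the sketch below; the name `broadcast_disabled` is hypothetical and not part of PySpark.

```python
from contextlib import contextmanager

BROADCAST_KEY = "spark.sql.autoBroadcastJoinThreshold"


@contextmanager
def broadcast_disabled(spark):
    """Disable broadcast joins inside the block, then restore the old value."""
    previous = spark.conf.get(BROADCAST_KEY)
    spark.conf.set(BROADCAST_KEY, -1)
    try:
        yield
    finally:
        # Restore whatever threshold was configured before entering the block.
        spark.conf.set(BROADCAST_KEY, previous)
```

Usage would look like `with broadcast_disabled(spark): joined.explain()`.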

#### Write the join query:

Hint:
* for users, read from the bucketed table now by calling `spark.table(table_name)`

In [None]:
# your code here:
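
A minimal sketch of the join, assuming the bucketed table was saved under the name `users` and that the questions dataset also has a `user_id` column (neither is confirmed by this notebook). The helper name `join_users_questions` is hypothetical.

```python
def join_users_questions(spark, questions_df, table_name="users"):
    """Join questions with the bucketed users table on user_id."""
    # Reading through the metastore is what lets Spark see the bucketing metadata.
    users = spark.table(table_name)
    # Assumption: both datasets have a user_id column.
    return questions_df.join(users, "user_id")
```

Reading the users directly from `users_output_path` instead of the table would lose the bucketing information, so the join would shuffle both sides.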


#### See the query plan

Hint:
* open the SQL tab in the Spark UI
* in the plan you should see:
    * an `Exchange` in only one branch of the plan, because the users branch leverages the bucketing and does not require an additional `Exchange`
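
If you prefer to check this from the notebook rather than the UI, the printed physical plan can be searched for shuffle nodes. The sketch below captures the text that `DataFrame.explain()` prints and counts hash-partitioning exchanges; it assumes the plan text labels them `Exchange hashpartitioning`, which may differ between Spark versions. The helper name `count_shuffles` is hypothetical.

```python
import io
from contextlib import redirect_stdout


def count_shuffles(df):
    """Count hash-partitioning Exchange nodes in the plan printed by df.explain()."""
    buffer = io.StringIO()
    # DataFrame.explain() prints the physical plan, so capture stdout.
    with redirect_stdout(buffer):
        df.explain()
    # Assumption: shuffles appear as 'Exchange hashpartitioning' in the plan text.
    return buffer.getvalue().count("Exchange hashpartitioning")
```

For the join with the bucketed table and broadcast joins turned off, the expectation stated above corresponds to a count of 1.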

# Task III

Do the same as before, but this time filter for a specific user (`user_id` = 476). Then look at the query plan: the Parquet scan node should show that only 1 bucket was selected and that much less data is scanned (see the input size of the first stage of the job).

In [None]:
# your code here:
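
One way to write the filtered query is sketched below, again assuming a table named `users` and a `user_id` column in the questions dataset. The filter is applied to the bucketed table before the join so that the scan can skip the other buckets; the helper name `questions_for_user` is hypothetical.

```python
def questions_for_user(spark, questions_df, user_id, table_name="users"):
    """Join questions with the single user identified by user_id."""
    # Filtering the bucketed table on its bucketing column enables bucket pruning.
    users = spark.table(table_name).filter(f"user_id = {int(user_id)}")
    # Assumption: questions_df also has a user_id column.
    return questions_df.join(users, "user_id")
```

Calling `questions_for_user(spark, questions_df, 476).explain()` should show the pruning in the scan node of the users table, typically as a `SelectedBucketsCount` entry.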


In [None]:
spark.stop()