<link rel='stylesheet' href='../assets/css/main.css'/>

# Broadcast Join 

## Overview

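In this lab we join a large dataset with a small one using a **broadcast join**. Spark ships a copy of the small table to every executor, so the large table does not have to be shuffled for the join.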
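
As a rough mental model, the idea can be sketched in plain Python. This is only an illustration, not Spark's implementation: the function name `broadcast_hash_join_counts` and the toy rows are made up for this sketch, and lists of dicts stand in for partitions of a table.

```python
from collections import Counter, defaultdict


def broadcast_hash_join_counts(transaction_partitions, rewards_rows, key="merchant_id"):
    """Count joined rows per key, similar in spirit to the query used later in this lab."""
    # "Broadcast" step: build one hash table from the small side.
    # In Spark, a copy of this table would be sent to every executor.
    lookup = defaultdict(list)
    for row in rewards_rows:
        lookup[row[key]].append(row)

    totals = Counter()
    # Each partition of the large side is joined locally against the lookup,
    # so the large side is never moved across the network for the join.
    for partition in transaction_partitions:
        for row in partition:
            matches = lookup.get(row[key], ())
            if matches:
                totals[row[key]] += len(matches)

    # Highest counts first, like ORDER BY ... DESC.
    return totals.most_common()


# Toy data (hypothetical values, only merchant_id matters here).
partitions = [
    [{"merchant_id": "m1"}, {"merchant_id": "m2"}],
    [{"merchant_id": "m1"}, {"merchant_id": "m3"}],
]
rewards_rows = [{"merchant_id": "m1"}, {"merchant_id": "m2"}]

print(broadcast_hash_join_counts(partitions, rewards_rows))
# [('m1', 2), ('m2', 1)]
```

Note that `m3` drops out because it has no match on the small side, which is what an inner join does.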

## Duration

30 mins

## Depends on

[Lab 9.1](9-1_join-1.ipynb)

## Step-1: Verify datasets

We have two datasets:

- transactions data (large).  Sample data is in `data/transactions/transactions-sample.csv`
- rewards data (small).  Sample data is in `data/reward-points/reward-points.csv`

Both datasets have the `merchant_id` field in common.

Optionally, also verify that you have this data in HDFS.
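
One possible way to confirm the shared column on the local sample files is to compare their header rows. The sketch below uses only the Python standard library and assumes both files start with a header row; the helper names `csv_header` and `common_columns` are made up for this example.

```python
import csv


def csv_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


def common_columns(path_a, path_b):
    """Return the column names present in both files, in path_a's order."""
    columns_b = set(csv_header(path_b))
    return [name for name in csv_header(path_a) if name in columns_b]


# Example call (adjust the relative paths to where you run this from):
# common_columns("data/transactions/transactions-sample.csv",
#                "data/reward-points/reward-points.csv")
```

If the datasets are laid out as described above, the returned list should include `merchant_id`.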


## Step-2: Start up Spark

In [None]:
try:
    spark
except NameError:
    import findspark
    findspark.init()  # uses SPARK_HOME
    print("Spark found in : ", findspark.find())

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # use a unique temp dir as the warehouse dir, so we can run multiple Spark sessions in one dir
    import tempfile
    tmpdir = tempfile.TemporaryDirectory()

    config = ( SparkConf()
             .setAppName("TestApp")
             .setMaster("local[*]")
             .set('spark.executor.memory', '2g')
             .set('spark.sql.warehouse.dir', tmpdir.name)
             .set("some_property", "some_value") # another example
             )

    spark = SparkSession.builder.config(conf=config).getOrCreate()
    sc = spark.sparkContext

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])

## Step-3: Load both datasets and register tables

In [None]:
transactions = spark.read.csv("../data/transactions/csv", header=True)
rewards = spark.read.csv("../data/reward-points/reward-points.csv", header=True)

transactions.createOrReplaceTempView("transactions")
rewards.createOrReplaceTempView("rewards")

## Step-4: Broadcast Join

We will give Spark a hint to use a broadcast join, telling it to broadcast the small table `rewards`.

In [None]:
import time

t1 = time.perf_counter()

s = """
SELECT /*+ BROADCAST (rewards) */ 
transactions.merchant_id, count(*) as total_rewards
from transactions join rewards 
ON (transactions.merchant_id = rewards.merchant_id)
group by transactions.merchant_id
order by total_rewards desc
"""

spark.sql(s).show()

t2 = time.perf_counter()

print("Join in {:,.2f} ms".format((t2 - t1) * 1000))

## Step-5: See the query plan

Use the DataFrame `explain` method.

Can you spot any optimizations?

Hint: Look at the physical plan.

In [None]:
joined = spark.sql(s)
joined.explain(extended=True)
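
If the printed plan is long, one option is to capture it as text and search it for join operators. The sketch below is an assumption-laden helper, not part of the Spark API: `explain_text` and `join_operators` are names made up here, and the capture relies on `explain` printing through Python's standard output. The operator names in the list are the physical join operators Spark SQL uses.

```python
import contextlib
import io


def explain_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def join_operators(plan):
    """Return the join operator names that appear in a plan string."""
    known = (
        "BroadcastHashJoin",
        "BroadcastNestedLoopJoin",
        "SortMergeJoin",
        "ShuffledHashJoin",
    )
    return [name for name in known if name in plan]


# In the notebook, after creating `joined`:
# join_operators(explain_text(joined))
```

Comparing the result with and without the `BROADCAST` hint in the query is a quick way to see whether the hint changed the join strategy.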

## Step-6: See the DAG on Spark UI

Go to the Spark UI and observe the DAG.

## Step-7: Now run this on a Hadoop cluster

Launch Spark on the Hadoop cluster, load both datasets, and run the join.

Here is some sample code to get you started.  Adjust TODO items as needed.

In [None]:
# TODO: adjust the data paths for your cluster
transactions = spark.read.csv("/user/me/transactions/csv", header=True)
rewards = spark.read.csv("/user/me/reward-points/reward-points.csv", header=True)

transactions.createOrReplaceTempView("transactions")
rewards.createOrReplaceTempView("rewards")

s = """
SELECT /*+ BROADCAST (rewards) */ 
transactions.merchant_id, count(*) as total_rewards
from transactions join rewards 
ON (transactions.merchant_id = rewards.merchant_id)
group by transactions.merchant_id
order by total_rewards desc
"""

spark.sql(s).show()