# Optimize the query plan II

Suppose we want to join badges with users (using the users table from the metastore). We also want to apply a UDF that does some computation on the badges.name field and, using a window, order each user's badges by creation date.

The query below does this in a suboptimal way. Try to rewrite it to get a more efficient plan. More specifically, try to eliminate the Exchange from the query plan.

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, udf, row_number
)

from pyspark.sql import Window
from pyspark.sql.types import IntegerType

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Optimize II')
    .enableHiveSupport()
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-2]) 

badges_input_path = os.path.join(project_path, 'data/badges')

In [None]:
usersDF = spark.table('users')

badgesDF = (
    spark
    .read
    .option('path', badges_input_path)
    .load()    
)

#### UDF:

The UDF below is just a simple function that returns the length of a string. This could easily be done with the native PySpark DataFrame function length. For the sake of this example, however, suppose that the function encapsulates some complex logic that cannot be expressed natively.

In [None]:
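@udf(IntegerType())
def get_length_udf(name):
    # The parameter is called name so it does not shadow the built-in str.
    # A null input yields null instead of raising a TypeError in len().
    return len(name) if name is not None else None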

In [None]:
badgesDF.show(truncate=False, n=10)

#### Window definition:

In [None]:
w = Window.partitionBy('user_id').orderBy('date')
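
To make the meaning of this window concrete, here is a minimal plain-Python sketch of what `row_number()` over it computes: rows are grouped by `user_id`, sorted by `date` within each group, and numbered from 1. The sample rows and the helper name `number_rows` are made up for illustration and are not taken from the real badges data.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows (user_id, date, badge name); not real data.
rows = [
    (2, "2020-03-01", "Teacher"),
    (1, "2020-02-01", "Student"),
    (1, "2020-01-15", "Editor"),
    (2, "2020-01-10", "Scholar"),
]

def number_rows(rows):
    """Mimic row_number() over partitionBy(user_id).orderBy(date)."""
    # Sort by user_id first, then by date, so each user's rows are contiguous.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        # Numbering restarts at 1 for every user.
        for n, row in enumerate(group, start=1):
            result.append(row + (n,))
    return result

numbered = number_rows(rows)
for row in numbered:
    print(row)
```

In Spark the same grouping requires all rows of one user to sit in the same partition, which is why the window's input partitioning matters for the plan.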

# Task:

The query below is suboptimal. Try to rewrite it to achieve a more optimal plan that leads to more efficient execution.

Hint:
* see the query plan
* eliminate the Exchange from the plan
* take advantage of the table users, which is bucketed on user_id
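
One way to follow the first two hints is to capture the text that `explain()` prints and count the Exchange operators in it, before and after a rewrite. The sketch below assumes that `explain()` writes through Python's standard output, as it does in PySpark. The helper names `explain_text` and `count_exchanges` are hypothetical and not part of any Spark API.

```python
import contextlib
import io
import re

def explain_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()

def count_exchanges(plan_text):
    """Count plan lines that mention a standalone Exchange operator.

    The word boundary keeps BroadcastExchange and ReusedExchange
    from being counted.
    """
    pattern = re.compile(r"\bExchange\b")
    return sum(1 for line in plan_text.splitlines() if pattern.search(line))

# Hypothetical usage in this notebook (needs the DataFrames defined above):
# query = usersDF.join(badgesDF, 'user_id')
# print(count_exchanges(explain_text(query)))
```

Comparing the count for the original query with the count for a rewritten one gives a quick check of whether a shuffle was actually removed, though reading the full plan is still the more reliable guide.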

In [None]:
(
    usersDF
    .join(badgesDF, 'user_id')
    .withColumn('name_len', get_length_udf('name'))
    .withColumn('question_n', row_number().over(w))
).collect()

#### Rewrite the query:

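Hint:
* move the UDF before the join, so that the window follows the join directly and can reuse the partitioning on user_id that the join already produces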

In [None]:
(
    usersDF
    .join(
        badgesDF.withColumn('name_len', get_length_udf('name')), 
        'user_id'
    )
    .withColumn('question_n', row_number().over(w))
).collect()

In [None]:
spark.stop()