# Higher Order Functions

In this notebook you will solve two tasks using higher-order functions (HOFs).

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, collect_list, expr, array_join

import os

In [2]:
spark = (
    SparkSession
    .builder
    .appName('HOF I')
    .getOrCreate()
)

# Task I

* Convert the `tags` field in the questions JSON dataset (stored as a string in the JSON file) to an array using HOFs.

In [3]:
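base_path = os.getcwd()

# The project root is three directory levels above the notebook's working directory.
project_path = '/'.join(base_path.split('/')[0:-3])

# Raw questions in JSON format (input for Task I).
questions_json_input_path = os.path.join(project_path, 'data/questions-json')

# Transformed questions, read with Spark's default data source (input for Task II).
questions_input_path = os.path.join(project_path, 'output/questions-transformed')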

#### Read the data from JSON:

In [4]:
questionsDF = (
    spark
    .read
    .format('json')
    .option('path', questions_json_input_path)
    .load()
)

#### Transform tags:

Hint:
* First split the string into an array.
  * Use [split](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.split).
* Then use [TRANSFORM](https://spark.apache.org/docs/latest/api/sql/index.html#transform) in a SQL expression to clean each element.
  * Use [regexp_replace](https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace) on each element to strip the leftover `<` and `>` characters.
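To see why both steps in the hint are needed, here is a minimal plain-Python sketch of the same idea. It does not use Spark: the `transform` helper is a hypothetical stand-in for the SQL `TRANSFORM` function, and the sample `raw_tags` value is an assumption about the `<tag1><tag2>` layout implied by the split pattern, not a row taken from the dataset.

```python
import re


def transform(array, fn):
    """Hypothetical stand-in for Spark SQL's TRANSFORM(array, x -> fn(x))."""
    return [fn(value) for value in array]


# Assumed sample value; the real data is only implied to look like this.
raw_tags = "<python><regex><machine-learning>"

# Step 1: split on the '><' boundary. The outermost '<' and '>' survive.
parts = raw_tags.split("><")

# Step 2: apply a function to every element to strip the leftover brackets.
tags = transform(parts, lambda value: re.sub(r"(>|<)", "", value))

print(parts)  # ['<python', 'regex', 'machine-learning>']
print(tags)   # ['python', 'regex', 'machine-learning']
```

Splitting alone leaves a stray bracket on the first and last elements, which is why the element-wise cleanup runs afterwards.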

In [5]:
(
    questionsDF
    .withColumn('tags', split('tags', '><'))
    .selectExpr(
        '*',
        "TRANSFORM(tags, value -> regexp_replace(value, '(>|<)', '')) AS tags_arr"
    )
    .drop('tags')
    .withColumnRenamed('tags_arr', 'tags')
    .select('question_id', 'title', 'tags')
).show(truncate=30, n=10)

+-----------+------------------------------+------------------------------+
|question_id|                         title|                          tags|
+-----------+------------------------------+------------------------------+
|   61416257|Ag-Grid wrong row orders in...|                     [ag-grid]|
|   61482176|Optional parameter & params...|[c#, function, methods, syn...|
|   61919808|      Matching Texts in python|[python, regex, machine-lea...|
|   60340057|Knockout custom binding for...| [knockout.js, fullcalendar-4]|
|   62001217|Python mysql autocommit dat...|[python, mysql, phpmyadmin,...|
|   61417491|Getting an error stating I ...|[python, keras, lstm, recur...|
|   59573018|Rxswift operator share repl...|[ios, swift, system.reactiv...|
|   60384286|JVM not taking Daylight Tim...|[java, linux, production-en...|
|   60781664|Is there a way to set the s...|                 [bash, slurm]|
|   59692353|tradingview close -close[1]...|      [pine-script, indicator]|
+-----------+------------------------------+------------------------------+

# Task II

* For each user, concatenate the titles of the questions they answered into a single string.
* First do it using HOFs.
* Then do it using the native function `array_join`.

In [6]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

#### Concat the titles:

Hint:
* Collect the titles into an array for each user.
  * Use `groupBy` and `collect_list`.
* Use [AGGREGATE](https://spark.apache.org/docs/latest/api/sql/index.html#aggregate) in a SQL expression to concatenate the array into a single string.
* Remove the leading separator (the first 3 characters, ` - `) using [substring](https://spark.apache.org/docs/latest/api/sql/index.html#substring).
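The fold described in the hint can be sketched in plain Python to show where the extra three characters come from. This is only an analogy, not Spark code: the `aggregate` helper is a hypothetical stand-in for the SQL `AGGREGATE` function, and the titles are made-up placeholders.

```python
from functools import reduce


def aggregate(array, start, merge):
    """Hypothetical stand-in for Spark SQL's AGGREGATE(array, start, merge)."""
    return reduce(merge, array, start)


# Placeholder titles, not taken from the dataset.
titles = ["Title one", "Title two", "Title three"]

# Starting from an empty buffer, every step prepends ' - ' to the next value,
# so the result begins with a separator that does not belong there.
folded = aggregate(titles, "", lambda buffer, value: buffer + " - " + value)

# SQL substring is 1-based, so position 4 corresponds to Python index 3.
total_title = folded[3:]

print(repr(folded))       # ' - Title one - Title two - Title three'
print(repr(total_title))  # 'Title one - Title two - Title three'
```

In this sketch the trimmed result equals `" - ".join(titles)`, which is the behaviour the `array_join` variant of the task relies on.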

In [7]:
(
    questionsDF
    .groupBy('user_id')
    .agg(
        collect_list('title').alias('title')
    )
    .selectExpr(
        '*',
        "AGGREGATE(title, cast('' AS string), (buffer, value) -> (concat(buffer, ' - ', value))) AS total_title"
    )
    .withColumn('total_title', expr("substring(total_title, 4, length(total_title))"))
).show(truncate=50, n=10)

+-------+--------------------------------------------------+--------------------------------------------------+
|user_id|                                             title|                                       total_title|
+-------+--------------------------------------------------+--------------------------------------------------+
| 127907|[Popup Window will not Close, MS Chart Control:...|Popup Window will not Close - MS Chart Control:...|
| 179205|[How do I return to my Windows Phone app from a...|How do I return to my Windows Phone app from a ...|
| 229832|[Custom error pages shown using IIS6 rather tha...|Custom error pages shown using IIS6 rather than...|
| 310455|[Mesh generation algorithm, Live Wallpaper touc...|Mesh generation algorithm - Live Wallpaper touc...|
| 636995|                              [C - fgets segfault]|                                C - fgets segfault|
| 720378|[casting Arrays.asList causing exception: java....|casting Arrays.asList causing exception: jav

#### Do the same using [array_join](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_join):

In [8]:
(
    questionsDF
    .groupBy('user_id')
    .agg(
        collect_list('title').alias('title')
    )
    .withColumn('total_title', array_join(col('title'), ' - '))
    .select('total_title')
).show(truncate=90, n=10)

+------------------------------------------------------------------------------------------+
|                                                                               total_title|
+------------------------------------------------------------------------------------------+
|Popup Window will not Close - MS Chart Control: Formatting Axis Labels - Hiding Previou...|
|How do I return to my Windows Phone app from a YouTube video? - WPF: How do I create a ...|
|Custom error pages shown using IIS6 rather than web.config settings - Web Forms Routing...|
|Mesh generation algorithm - Live Wallpaper touch - I don't want it when an app is launc...|
|                                                                        C - fgets segfault|
|casting Arrays.asList causing exception: java.util.Arrays$ArrayList cannot be cast to j...|
|"libaacdecoder.so" not found aacdecoder on Lollipop - Set default value for Bindy field...|
|                        How are Gimbals affected by the iOS 7.1 iBeac

In [9]:
spark.stop()