# Complex Data Types

In this notebook you will continue to improve the text categorization query implemented in the `Text categorization` and `User Defined Functions` notebooks. For each question, find out which category has the most occurrences in the text. Consider only questions for which there is at least one occurrence.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, lit, array, struct, reverse, array_sort
from pyspark.sql.types import IntegerType

import os
import re

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Complex Data Types')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# The project root is three directories above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [None]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

In [None]:
categories = ['java', 'sql', 'python', 'spark']

In [None]:
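@udf(IntegerType())
def count_occurences(message, category):
    # Null bodies contain no occurrences (re.findall raises a TypeError on None)
    if message is None:
        return 0
    return len(re.findall(re.escape(category), message, re.IGNORECASE))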
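The counting logic can be tried out without Spark. The sketch below is a plain-Python stand-in for the body of the UDF; `count_in_text` and the sample sentence are hypothetical and not part of the notebook, and the use of `re.escape` assumes that categories should be matched literally.

```python
import re

def count_in_text(message, category):
    """Count case-insensitive, literal occurrences of category in message."""
    if message is None:
        return 0
    return len(re.findall(re.escape(category), message, re.IGNORECASE))

# Made-up sample text, used only for illustration
sample = "Java and JavaScript are different; java streams are lazy."
counts = {c: count_in_text(sample, c) for c in ['java', 'sql', 'python', 'spark']}
print(counts)
```

Note that the match is a substring match, so `JavaScript` also counts as an occurrence of `java`.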

In [None]:
# Add one column per category holding the number of occurrences in the question body
def get_c(df):
    for category in categories:
        df = df.withColumn(category, count_occurences(col('body'), lit(category)))
    return df

In [None]:
result = get_c(questionsDF.select('question_id', 'body'))

### Find the most relevant category

* The result now contains the number of occurrences for each category.
* For each question, find out which category has the most occurrences.

Hint
* For each question, create an array of structs, where each struct has two subfields:
 * category_name
 * frequency (number of occurrences)
* Use a for-loop over the `categories` list to build the array
* Sort the array in descending order (put the `frequency` subfield first in the struct so that it drives the sort)
* Access the subfields of the first element
* Docs for array_sort https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_sort
* Docs for reverse https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.reverse
* Docs for struct https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.struct
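The steps in the hint can be sketched in plain Python before writing the Spark version. This is only an analogy: tuples stand in for structs and a list stands in for the array, and the `row` values are made up for illustration.

```python
# One row of the result, written as a plain dict (made-up numbers)
row = {'question_id': 1, 'java': 2, 'sql': 0, 'python': 5, 'spark': 1}
categories = ['java', 'sql', 'python', 'spark']

# "Array of structs": frequency first, category_name second
pairs = [(row[c], c) for c in categories]

# Descending sort; tuples are compared element by element, left to right
pairs_desc = sorted(pairs, reverse=True)

# Access the subfields of the first element
frequency, category = pairs_desc[0]
print(category, frequency)
```

Because the frequency is the first element of each tuple, the first entry after the descending sort is the most frequent category.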

In [None]:
# Using a for-loop, build the list of struct expressions that will be passed to the array function
# Use the struct function; frequency comes first so that the sort orders by it

s = []
for c in categories:
    s.append(struct(col(c).alias('frequency'), lit(c).alias('category_name')))

In [None]:
# Create the array using the array function
# Sort the array and take first element

(
    result
    .withColumn('categories', array(*s))
    .withColumn('categories', reverse(array_sort('categories')))
    .select(
        'question_id',
        col('categories.category_name')[0].alias('category'),
        col('categories.frequency')[0].alias('frequency')
    )
    .filter(col('frequency') > 0)
).show()

#### Note
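* When you sort an array of structs, the order of the subfields matters: structs are compared field by field from left to right, so `frequency` has to come first for the array to be ordered by frequency.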
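The effect of the subfield order can be illustrated with Python tuples, which are also compared element by element. This is a hedged analogy rather than Spark code, and the counts are made up.

```python
# Made-up counts for a single question
counts = {'java': 2, 'sql': 0, 'python': 5, 'spark': 1}

# Frequency first: the descending sort orders by frequency
freq_first = sorted(((n, c) for c, n in counts.items()), reverse=True)

# Name first: the descending sort orders alphabetically by category name
name_first = sorted(((c, n) for c, n in counts.items()), reverse=True)

print(freq_first[0])
print(name_first[0])
```

With the name in the first position, the top element is the alphabetically last category, regardless of how often it occurs.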

In [None]:
spark.stop()