# User Defined Functions

In this notebook you will use a UDF (user-defined function) to count the number of occurrences of a word inside a text. We will improve the text categorization query implemented in the `Text categorization` notebook: for each category, find how many times the category word is contained in the question's body.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, lit, sum

from pyspark.sql.types import IntegerType

import os
import re

In [None]:
spark = (
    SparkSession
    .builder
    .appName('UDFs I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [None]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

In [None]:
categories = ['java', 'sql', 'python', 'spark']

#### Implement UDF

* Implement a UDF that takes two arguments, both of which are columns (which is why a UDF is needed):
 * the body of the question
 * the category that we look for
* The function should return the number of occurrences of the category in the question's body

Hint:
* use Python `re.findall(r"category", message_string)` to get a list of all occurrences
* use Python `len` function to get the list size
* you can first try the plain function on some mock data and if it works, then make a UDF from it

In [None]:
# define the python function

def count_occurences(message, category):
    return len(re.findall(category, message, re.IGNORECASE))

In [None]:
# test on mock data:

mock_data = "python first, python second, java first"

print(count_occurences(mock_data, 'python'))
print(count_occurences(mock_data, 'java'))
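
Note that `re.findall` with a bare category pattern counts substrings, so `java` is also found inside `javascript`. If whole-word matches are wanted instead, one possible variant is sketched below. The helper name `count_whole_words` and the sample text are made up for illustration and are not part of the original notebook.

```python
import re

def count_whole_words(message, category):
    # Hypothetical helper: count case-insensitive whole-word matches only.
    # re.escape guards against regex metacharacters in the category name.
    pattern = r"\b{}\b".format(re.escape(category))
    return len(re.findall(pattern, message, re.IGNORECASE))

mock_text = "Java and JavaScript are different; java is not javascript"

substring_count = len(re.findall("java", mock_text, re.IGNORECASE))
whole_word_count = count_whole_words(mock_text, "java")

print(substring_count)   # counts "java" inside "JavaScript" as well
print(whole_word_count)  # counts only the standalone word
```

Whether substring or whole-word counting is the right choice depends on how the categories are meant to be interpreted; the rest of this notebook keeps the substring behaviour.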

In [None]:
# make a UDF from it:

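@udf(IntegerType())
def count_occurences(message, category):
    # A null body reaches the UDF as None; re.findall would raise a TypeError on it
    if message is None or category is None:
        return 0
    return len(re.findall(category, message, re.IGNORECASE))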

#### Use the UDF

* Use the UDF inside the `get_c` function that we implemented in the `Text categorization` notebook

In [None]:
def get_c(df):
    for category in categories:
        df = df.withColumn(category, count_occurences(col('body'), lit(category)))
    return df
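
To make the effect of the loop concrete, here is a plain-Python sketch of what one row of the result should contain: one count per category, computed from the same body text. The helper `count_row` and the sample sentence are illustrative assumptions; the Spark version adds the same values as one column per category.

```python
import re

categories = ['java', 'sql', 'python', 'spark']

def count_row(body):
    # Hypothetical helper mirroring a single result row: category -> count
    return {c: len(re.findall(c, body, re.IGNORECASE)) for c in categories}

row = count_row("Spark SQL vs. plain SQL in Python")
print(row)
```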

In [None]:
result = get_c(questionsDF.select('body'))

In [None]:
result.show(n=5)

Sum the occurrences for each category:

In [None]:
# `sum` here is pyspark.sql.functions.sum (imported above), not Python's built-in
result.select([sum(c).alias(c) for c in categories]).show()

In [None]:
spark.stop()