# User Defined Functions

In this notebook you will use a UDF to count the occurrences of a word in a text. Specifically, we will count how often a particular technology is mentioned in the question text.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, lit, regexp_count, desc

from pyspark.sql.types import IntegerType

import os
import re

In [None]:
spark = (
    SparkSession
    .builder
    .appName('UDFs I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [None]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

#### Implement UDF

* Implement a [UDF](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html#pyspark.sql.functions.udf) that takes two arguments (both of them need to be columns):
 * the body of the question
 * the technology that we are looking for
* The function should return the number of occurrences of the technology in the question's body

Hint:
* use Python `re.findall(r"category", message_string)` to get a list of all occurrences
* use the Python `len` function to get the size of that list
* you can first try the plain function on some mock data and, if it works, make a UDF from it
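
The hints above can be tried out without Spark. Below is a minimal sketch, assuming a hypothetical helper name `count_matches` that is not part of this notebook. It treats the technology as literal text via `re.escape`, so that a name such as `c++` is not interpreted as regular-expression syntax, and it returns 0 for a missing body. Both choices are assumptions that go beyond what the exercise asks for.

```python
import re


def count_matches(message, technology):
    """Count case-insensitive, literal occurrences of `technology` in `message`."""
    # Hypothetical guard: a missing body or technology yields 0 instead of a TypeError.
    if message is None or technology is None:
        return 0
    # re.escape makes characters such as '+' or '.' match literally.
    pattern = re.escape(technology)
    return len(re.findall(pattern, message, flags=re.IGNORECASE))


sample = "Python first, python second, C++ third"
print(count_matches(sample, "python"))  # 2
print(count_matches(sample, "c++"))     # 1
print(count_matches(None, "python"))    # 0
```

Whether to escape the search term depends on the intent: escaping is appropriate when the technology is a plain name, while an unescaped pattern allows callers to pass a real regular expression.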

In [None]:
# define the Python function

def count_occurences(message, technology):
    # the technology is used as a regex pattern; matching ignores case
    return len(re.findall(technology, message, re.IGNORECASE))

In [None]:
# test on mock data:

mock_data = "python first, python second, java first"

print(count_occurences(mock_data, 'python'))
print(count_occurences(mock_data, 'java'))

In [None]:
# make a UDF from it:

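@udf(IntegerType())
def count_occurences(message, technology):
    # Spark passes NULL column values as None; return None rather than fail in re.findall
    if message is None or technology is None:
        return None
    return len(re.findall(technology, message, re.IGNORECASE))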

#### Use the UDF

* Apply the function to the DataFrame as a column transformation
 * add a new column 'python' that counts the occurrences of this word in the text body

In [None]:
(
    questionsDF.withColumn('python', count_occurences(col('body'), lit('python')))
).show()

#### Verify the result

Verify the result with the native function [regexp_count](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_count.html), which was added in Spark 3.5.0.
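
One thing to watch when comparing the two columns is case sensitivity: the UDF passes `re.IGNORECASE`, while a plain pattern such as `python` is matched case-sensitively unless a flag says otherwise. The sketch below illustrates the difference in plain Python on made-up sample text. The inline `(?i)` flag shown here is also part of Java regular-expression syntax, which Spark's regex functions rely on, but it is worth confirming the behavior on your Spark version.

```python
import re

# Made-up sample text with three different casings of the same word.
text = "Python is fun. I use python daily. PYTHON rocks."

# Plain pattern: only the exact lowercase spelling matches.
case_sensitive = len(re.findall(r"python", text))

# Same pattern with the IGNORECASE flag, as used in the UDF.
case_insensitive = len(re.findall(r"python", text, flags=re.IGNORECASE))

# Inline flag at the start of the pattern, equivalent to IGNORECASE.
inline_flag = len(re.findall(r"(?i)python", text))

print(case_sensitive, case_insensitive, inline_flag)  # 1 3 3
```

If the two counts in the DataFrame differ for some rows, differing case handling is a likely first place to look.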

In [None]:
(
    questionsDF
    .withColumn('python', count_occurences(col('body'), lit('python')))
    # (?i) makes the pattern case-insensitive, mirroring re.IGNORECASE in the UDF
    .withColumn('python_2', regexp_count('body', lit('(?i)python')))
    .select('question_id','body', 'python', 'python_2')
    .orderBy(desc('python_2'), 'question_id')
).show(truncate=30)

In [None]:
spark.stop()