# Task - simple text search

* Write a function (a wrapper over native PySpark DataFrame functions) that detects whether a given name is contained in the question text.
* The function should return one of these names if it is contained in the text: einstein, newton, maxwell, dirac, gauss. If none of them is contained in the text, return other.
* Write the function in such a way that it can handle a large array of words.
* We will use this function again in a streaming application later on.

Note:
* In this notebook you will programmatically build a column expression and do a simple text search.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, array, lit, when

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Text search')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [None]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

#### Implement function for text search

Hint:
* use a when-otherwise condition
* use the like function for the text search
* the function should take a column as input and return another column as output

In [None]:
def get_person(message):
    return (
        when(message.like('%einstein%'), 'einstein')
        .when(message.like('%newton%'), 'newton')
        .when(message.like('%maxwell%'), 'maxwell')
        .when(message.like('%dirac%'), 'dirac')
        .when(message.like('%gauss%'), 'gauss')
        .otherwise('other')
    )

#### Apply the function

* also group by the result to see how many occurrences there are for each name

In [None]:
(
    questionsDF
    .withColumn('physicist', get_person(col('body')))
    .groupBy('physicist')
    .agg(count('*'))
    .orderBy('physicist')
).show(truncate=False)

#### Now implement the function more dynamically

Hint:
* define a list of names that we look for
* iterate over the list and build the condition that is used in the function

In [None]:
names = ['einstein', 'newton', 'maxwell', 'dirac', 'gauss']

In [None]:
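def get_person_dynamic(message):
    # seed the chain with a condition that never matches, so that every name
    # can be appended with .when() inside the loop
    col_exp = when(lit(False), '')
    for name in names:
        # earlier names in the list take priority over later ones
        col_exp = col_exp.when(message.like('%{}%'.format(name)), name)
    return col_exp.otherwise('other')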

In [None]:
(
    questionsDF
    .withColumn('physicist', get_person_dynamic(col('body')))
    .groupBy('physicist')
    .agg(count('*'))
    .orderBy('physicist')
).show(truncate=False)

In [None]:
spark.stop()

Note:
* other options for text search are the functions:
    * rlike
    * regexp_extract