# Simple text categorization

* In this notebook we will see how to dynamically process many columns by wrapping native Spark functions in a Python function

## Task

* Assume we have some categories (labels or tags) and we want to do a simple text categorization. For each of these categories find out if the category is contained in the body of the question
* Implement a function that will add a new column for each of these categories. The name of the column should be the name of the category and the value is 1/0 depending on whether the word is contained in the text.
* According to this rule each question can belong to multiple groups
* As the final result, compute a sum for each of these categories, to see how many questions belong there

### Example

* categories: [java, python]
* for each category we will add one column to the DataFrame
* text message: "I prefer coding in python"
* word `python` is present in the text so in `python` column we will have 1, in `java` column we will have 0 (on this particular row) 
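
As a quick sanity check of the rule above, here is a plain-Python sketch (no Spark involved) that computes the 1/0 flags for the example message. The helper name `flag_categories` is hypothetical and exists only for this illustration; it is not part of the Spark solution.

```python
# Plain-Python illustration of the 1/0 rule; not the Spark solution.
categories = ['java', 'python']
text = "I prefer coding in python"


def flag_categories(text, categories):
    """Return {category: 1 or 0} depending on whether the category occurs in the text."""
    return {category: int(category in text) for category in categories}


flags = flag_categories(text, categories)
print(flags)  # {'java': 0, 'python': 1}
```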

In [None]:
from pyspark.sql import SparkSession
# note: this `sum` is the Spark aggregate and shadows Python's built-in `sum`
from pyspark.sql.functions import col, lit, when, sum

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Text search')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# go three directory levels up from the working directory to reach the project root
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'output/questions-transformed')

In [None]:
questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
)

#### First do it without a function

Hint:
* use `when` - `otherwise` condition together with the `like` function
 * [like](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.like)
 * [when](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when)
* at the end, sum the occurrences using `df.select(sum(..).alias(..), sum(..).alias(..), ...)`
    * [alias](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.alias) lets you rename the column
* the output should be a DataFrame with one row

* Here are the categories: `['java', 'sql', 'python', 'spark']`
* Here are the corresponding column names: `['java', 'sql', 'python', 'spark']` (it is the same as the category names)
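
To make the behavior of a pattern such as `'%java%'` concrete: in a SQL `LIKE` pattern, `%` matches any sequence of characters, so the pattern acts as a case-sensitive substring test. The sketch below is a rough plain-Python emulation of that matching. The function `sql_like` is a hypothetical helper written for this note, it ignores escape characters, and it is not how Spark implements `like`.

```python
import re


def sql_like(text, pattern):
    """Roughly emulate SQL LIKE: % matches any run of characters, _ matches exactly one."""
    regex = ''.join(
        '.*' if ch == '%' else '.' if ch == '_' else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


print(sql_like('I prefer coding in python', '%python%'))  # True
print(sql_like('I prefer coding in Python', '%python%'))  # False (case-sensitive)
print(sql_like('I use javascript daily', '%java%'))       # True (substring match)
```

The last line shows a consequence of this simple rule: a question that only mentions `javascript` would still be counted in the `java` category.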

In [None]:
(
    questionsDF
    .select('body')
    .withColumn('java', when(col('body').like('%java%'), lit(1)).otherwise(0))
    .withColumn('sql', when(col('body').like('%sql%'), lit(1)).otherwise(0))
    .withColumn('python', when(col('body').like('%python%'), lit(1)).otherwise(0))
    .withColumn('spark', when(col('body').like('%spark%'), lit(1)).otherwise(0))
    .select(
        sum('java').alias('java'), 
        sum('sql').alias('sql'), 
        sum('python').alias('python'), 
        sum('spark').alias('spark')
    )
).show()

## Implement a function

* Now do the same using a function
* The function should take 2 arguments:
    * df...a DataFrame that will be transformed
    * input_col...name of the column with the text message
* The function should return a new DataFrame with new columns
* The function will simply add one new column per category using a `for` loop

Hint:
* in the `for` loop you will iterate over the categories and need to dynamically create a pattern string for the `like` function. In Python you can use curly braces with the `format` method, for example: `'%{}%'.format(category)`
* call the final function as follows:
    * `fun_name(df=questionsDF, input_col='body')`
    * or use [transform](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.transform) (available since Spark 3.0):
        * `questionsDF.transform(lambda x: fun_name(df=x, input_col='body'))`
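
The pattern construction mentioned in the hint can be tried on its own, without Spark. The following sketch only builds the strings and assumes the same four categories as the task; an f-string is shown as an equivalent alternative to `format`.

```python
categories = ['java', 'sql', 'python', 'spark']

# One LIKE pattern per category; '%' on both sides means "anywhere in the text".
patterns = ['%{}%'.format(category) for category in categories]
print(patterns)  # ['%java%', '%sql%', '%python%', '%spark%']

# An f-string builds the same pattern.
fstring_patterns = [f'%{category}%' for category in categories]
print(patterns == fstring_patterns)  # True
```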

In [None]:
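categories = ['java', 'sql', 'python', 'spark']

def get_c(df, input_col='body'):
    """Add one 1/0 column per category, named after the category."""
    for category in categories:
        # build the LIKE pattern, e.g. '%java%', and flag rows whose text matches it
        pattern = '%{}%'.format(category)
        df = df.withColumn(category, when(col(input_col).like(pattern), lit(1)).otherwise(0))
    return df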

In [None]:
result = get_c(df=questionsDF.select('body'), input_col='body')

# or since Spark 3.0:
result = questionsDF.select('body').transform(lambda x: get_c(df=x, input_col='body'))

In [None]:
result.filter(col('spark') > 0).show(n=3)

## Final sum

* Use Python `map + lambda` function to iterate over the column names and apply the `sum` with `alias` 
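
To see what `map` with a `lambda` produces here, the sketch below applies the same idea to plain strings instead of Spark columns: one aggregate expression per category name. The SQL-style strings are only an illustration of the shape of the result; the notebook's own solution builds `Column` objects with `sum(x).alias(x)`.

```python
categories = ['java', 'sql', 'python', 'spark']

# map applies the lambda to every category name; list() materializes the result.
expressions = list(map(lambda x: 'sum({0}) AS {0}'.format(x), categories))
print(expressions)
# ['sum(java) AS java', 'sum(sql) AS sql', 'sum(python) AS python', 'sum(spark) AS spark']
```

Strings of this form could be passed to `DataFrame.selectExpr` as an alternative, although that variant is not used in this notebook.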

In [None]:
result.select(list(map(lambda x: sum(x).alias(x), categories))).show()

In [None]:
spark.stop()