## Do high-reputation users tend to write more readable content?

In this notebook, we will use a Python library inside Spark user-defined functions (UDFs) to analyze text.

You will:
* Do some basic text cleaning of the answers
* Compute readability of answers
* Join answers with users
* Compute average readability per user
* Compute the correlation between user reputation and the average readability of their answers

In [None]:
from textstat import flesch_reading_ease
import os
import re

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, udf, desc, corr, avg, first, regexp_replace, trim


In [None]:
spark = (
    SparkSession
    .builder
    .appName('Text Analysis')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# The project root is assumed to sit three directories above the notebook's working directory.
project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')
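
The path arithmetic above assumes the notebook runs three directory levels below the project root. As a minimal sketch of the same idea with `pathlib`, the helper below walks up a configurable number of levels. The name `project_root` and the example path are hypothetical, introduced only for illustration.

```python
from pathlib import PurePosixPath


def project_root(base_path, levels_up=3):
    """Return the ancestor that sits `levels_up` directories above base_path."""
    # parents[0] is the direct parent, so parents[levels_up - 1] is `levels_up` levels up.
    return str(PurePosixPath(base_path).parents[levels_up - 1])


# Hypothetical working directory, used only to illustrate the behavior.
example_cwd = "/home/user/project/notebooks/text/solutions"
root = project_root(example_cwd)
```

For POSIX-style paths this gives the same result as splitting on `/` and dropping the last three components, while making the number of levels explicit.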

## Compute readability of each answer

Hint:
* we will work with the answers dataset. Check the body of the answers: you will see that it contains HTML tags and possibly other characters that are best removed before the analysis
* implement a function that will do (at least some basic) text cleaning
  * this function can be a thin wrapper over the native [regexp_replace](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html) function, which can be used to remove the unwanted characters
* implement a User Defined Function that will use the `flesch_reading_ease` function from the `textstat` Python library
  * for more info about textstat see the [docs](https://pypi.org/project/textstat/)
  * for more info about the methodology for computing readability with Flesch Reading Ease see the [wiki](https://simple.wikipedia.org/wiki/Flesch_Reading_Ease)
* Note:
  * if you want more robust text cleaning and regexp_replace is not sufficient or too cumbersome for you, you can do the text cleaning in Python and make it part of your UDF.
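
The Python-side cleaning mentioned in the note could look roughly like the sketch below. It uses only the standard library; `clean_html` and the two compiled patterns are hypothetical names, and this is one possible approach rather than the required solution.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")    # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")  # runs of spaces, tabs, and newlines


def clean_html(text):
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    if text is None:
        return None
    # Replace tags with a space so words on either side of a tag do not merge.
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; only after the tags are gone, so an
    # escaped "&lt;b&gt;" in the text is not mistaken for a real tag.
    decoded = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()


sample = "<p>Use <code>df.show()</code> &amp; check.</p>\n<p>Done.</p>"
cleaned = clean_html(sample)
```

A function like this could be called at the start of the UDF, before the text is passed to `flesch_reading_ease`. The trade-off is that all cleaning then runs in Python workers instead of in Spark's native expressions.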

In [None]:
# create the input dataframes for answers and users:

answersDF = spark.read.parquet(answers_input_path)

usersDF = spark.read.parquet(users_input_path)

In [None]:
# implement the text cleaning function:

def clean_text(df: DataFrame) -> DataFrame:
    return (
        df.withColumn("body", regexp_replace("body", r"<[^>]*>", ""))  # Remove HTML tags
        # Replace literal "\n", "\r", "\t" sequences and real newlines/tabs with a space
        .withColumn("body", regexp_replace("body", r"\\n|\\r|\\t|\n|\r|\t", " "))
        .withColumn("body", regexp_replace("body", r"\s+", " "))  # Collapse repeated whitespace
        .withColumn("body", trim("body"))  # Trim leading/trailing spaces
    )

In [None]:
# implement the udf to compute the readability

@udf('double')
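def readability_udf(text):
    # Return null for missing or empty bodies instead of scoring them.
    if not text:
        return None
    # Higher Flesch Reading Ease scores indicate easier-to-read text.
    return flesch_reading_ease(text.strip())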
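
For intuition about the score the UDF returns: Flesch Reading Ease is defined as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), so higher values mean easier text. The sketch below applies that formula with a deliberately naive syllable counter that counts groups of consecutive vowels. That counter is an assumption made for illustration, not how `textstat` counts syllables, so the numbers will differ from `flesch_reading_ease`. The names `count_syllables` and `flesch_reading_ease_sketch` are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease_sketch(text):
    """Apply the Flesch Reading Ease formula using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = flesch_reading_ease_sketch("The cat sat. The dog ran.")
dense = flesch_reading_ease_sketch(
    "Institutional considerations necessitate comprehensive organizational restructuring."
)
```

Short sentences made of short words score high, while long sentences full of polysyllabic words score low, which is why `simple` ends up well above `dense`.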

## Compute correlation between average readability and reputation of users

Hint:
* apply the udf to compute the readability
* join answers with users to bring the info about reputation
* group by user and compute the average readability for each user (this describes how readable each user's published answers are on average)
* compute the Pearson correlation coefficient for average readability and reputation
  * see [docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.corr.html) for corr 
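
The steps above can be sketched in plain Python on toy data, which may help when checking the Spark result by hand. Everything here is hypothetical: the function names, the `(user_id, readability)` input shape, and the sample values. Answers with no score are skipped, mirroring how Spark's `avg` ignores nulls, and users missing from the reputation mapping are dropped, as an inner join would do. Returning `None` when the correlation is undefined is a choice of this sketch.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one variable is constant
    return cov / sqrt(var_x * var_y)


def reputation_readability_correlation(answers, reputation_by_user):
    """Average readability per user, then correlate the averages with reputation."""
    scores = defaultdict(list)
    for user_id, readability in answers:
        # Skip unscored answers and users without a reputation entry.
        if readability is not None and user_id in reputation_by_user:
            scores[user_id].append(readability)
    users = sorted(scores)
    avg_readability = [sum(scores[u]) / len(scores[u]) for u in users]
    reputation = [reputation_by_user[u] for u in users]
    return pearson(avg_readability, reputation)


# Toy data: (user_id, readability) pairs and a user_id -> reputation mapping.
toy_answers = [(1, 60.0), (1, 80.0), (2, 50.0), (3, 30.0), (3, None), (4, 99.0)]
toy_reputation = {1: 300, 2: 200, 3: 100}
r = reputation_readability_correlation(toy_answers, toy_reputation)
```

In the toy data the per-user averages (70, 50, 30) rise in exact proportion to reputation (300, 200, 100), so `r` comes out at 1 up to floating-point rounding. Real data will give a value somewhere between -1 and 1.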

In [None]:
(
    answersDF
    .transform(clean_text)
    .withColumn('readability', readability_udf(col('body')))
    .join(usersDF, 'user_id')
    .groupBy('user_id')
    .agg(
        avg('readability').alias('avg_readability'),
        first('reputation').alias('reputation')
    )
    .agg(corr('avg_readability', 'reputation'))
).show()

In [None]:
spark.stop()