# Data exploration

The goal of this notebook is to get familiar with the datasets that will be used throughout the training. 

Explore these three datasets:
* questions (JSON format)
* answers (Parquet format)
* users (Parquet format)


1. For each of them:
  * see the schema
  * see 10 records
  * find the total count
    
2. For users, find out how many distinct locations there are.
3. Who asked the question with the highest score?
4. Is the answer with the highest score accepted?
5. Identify the question with the most occurrences of the word `spark` in the body, case-insensitive.
6. Compute the response time for the spark-related question whose answer has the highest score.

In [None]:
from pyspark.sql import SparkSession
# Note: regexp_count is available in PySpark 3.5 and later
from pyspark.sql.functions import col, desc, count, regexp_count, lower, lit

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Data inspection')
    .getOrCreate()
)

In [None]:
print(spark.version)

In [None]:
# Paths to the datasets:

base_path = os.getcwd()

# The project root is three directory levels above the current working directory
project_path = ('/').join(base_path.split('/')[0:-3])

questions_input_path = os.path.join(project_path, 'data/questions-json')

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')

In [None]:
usersDF = spark.read.parquet(users_input_path)

questionsDF = spark.read.json(questions_input_path)

answersDF = spark.read.parquet(answers_input_path)

### 1. Explore the data: Check the schemas, counts and some records

Hint:
* use `printSchema`, `count`, `show`

In [None]:
usersDF.printSchema()

In [None]:
usersDF.show(n=10)

In [None]:
usersDF.count()

In [None]:
questionsDF.printSchema()

In [None]:
questionsDF.show(n=10, truncate=10)

In [None]:
questionsDF.count()

In [None]:
answersDF.printSchema()

In [None]:
answersDF.show(n=10, truncate=10)

In [None]:
answersDF.count()

### 2. For users, find out how many distinct locations there are

Hint:
* use `distinct` or `dropDuplicates`
* docs for [distinct](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html#pyspark.sql.DataFrame.distinct)
* docs for [dropDuplicates](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html#pyspark.sql.DataFrame.dropDuplicates)
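
One detail worth keeping in mind when counting distinct values: Spark's `distinct` treats a missing value (`null`) as a value of its own, so if any user has no location, the count includes one entry for it. The plain-Python sketch below mimics that behaviour on a made-up list of locations. The sample data and the helper names are illustrative assumptions and are not taken from the users dataset.

```python
# Made-up sample standing in for the `location` column; None plays the role of null.
sample_locations = ["Prague", None, "Berlin", "Prague", None, "berlin"]


def distinct_count(values):
    """Number of distinct values, counting None as one value."""
    return len(set(values))


def distinct_non_null_count(values):
    """Number of distinct values after dropping None first."""
    return len({v for v in values if v is not None})


print(distinct_count(sample_locations))           # 4: Prague, Berlin, berlin, None
print(distinct_non_null_count(sample_locations))  # 3: Prague, Berlin, berlin
```

The comparison is also case-sensitive, so `Berlin` and `berlin` count as two locations. In the notebook, filtering with `col('location').isNotNull()` before calling `distinct` would be one way to leave the missing value out of the count.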

In [None]:
usersDF.select('location').distinct().count()

### 3. Who asked the question with the highest score?

Hint:
* [join](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join) questions with users on the `user_id` column
* use [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html#pyspark.sql.DataFrame.orderBy) + [desc](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.desc.html#pyspark.sql.functions.desc)
* after sorting, select only the question score and the user-specific attributes

In [None]:
(
    questionsDF
    .join(usersDF.alias('users'), 'user_id')
    .orderBy(desc('score'))
    .select('users.*', 'score')
).show(n=1)

### 4. Is the answer that has the highest score accepted?

Hint:
* join answers with questions
* sort in descending order by the answer's score to get the answer with the highest score
* check the `accepted_answer_id` column: if its value is the same as the value in the `answer_id` column, then the answer is accepted

In [None]:
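(
    # The left join keeps every answer, even if its question is not in the questions dataset
    answersDF.alias('answers')
    .join(questionsDF.alias('questions'), 'question_id', 'left')
    .orderBy(desc('answers.score'))
    # The top answer is accepted if its answer_id equals accepted_answer_id
    .select('answers.question_id', 'answer_id', 'accepted_answer_id')
).show()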

### 5. Identify the question with the most occurrences of the word `spark` in the body, case-insensitive.

Hint:
* check the functions:
  * [regexp_count](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_count.html)
  * [lower](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lower.html)
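
To make the counting rule concrete, here is a plain-Python sketch of the same idea: lowercase the text, then count the non-overlapping matches of the pattern `spark`. The helper name `count_spark` and the sample strings are made up for illustration. The sketch only mirrors the logic of combining `lower` with `regexp_count` and does not run on the questions dataset.

```python
import re


def count_spark(body: str) -> int:
    """Count non-overlapping, case-insensitive matches of 'spark' in body."""
    return len(re.findall("spark", body.lower()))


# Made-up sample bodies, not taken from the questions dataset.
print(count_spark("Spark and PySpark on SPARK"))  # 3
print(count_spark("sparkling water"))             # 1: substrings match too
print(count_spark("no match here"))               # 0
```

Because the pattern is a plain substring, words such as `pyspark` or `sparkling` are counted as well. If only the standalone word should count, a word-boundary pattern such as `\bspark\b` would be one option to try.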

In [None]:
(
    questionsDF
    .withColumn('spark_count', regexp_count(lower(col('body')), lit('spark')))
    .orderBy(desc('spark_count'))
    .select('question_id', 'title', 'body', 'spark_count')
).show(truncate=50, n=1)

### 6. Compute the response time for the spark-related question whose answer has the highest score

Hint:
* in our context, spark-related means that at least one tag of the question contains the expression `spark`
* identify which of these questions has the answer with the highest score
* for this particular question, compute the response time (the time between posting the question and posting its answer) and convert it to minutes
* what is the question about? Apart from the response time, also select the title of the question
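
The response-time arithmetic can be checked in plain Python: convert both timestamps to whole seconds since the epoch, subtract, and divide by 60. The helper name `response_time_minutes` and the two sample timestamps are made up for illustration. This is a sketch of the calculation only, not of the Spark query.

```python
from datetime import datetime, timezone


def response_time_minutes(question_created: datetime, answer_created: datetime) -> float:
    """Minutes between two timestamps, computed from whole epoch seconds."""
    question_seconds = int(question_created.timestamp())
    answer_seconds = int(answer_created.timestamp())
    return (answer_seconds - question_seconds) / 60


# Made-up timestamps: the answer arrives 30 minutes and 30 seconds after the question.
asked = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
answered = datetime(2020, 1, 1, 12, 30, 30, tzinfo=timezone.utc)
print(response_time_minutes(asked, answered))  # 30.5
```

In Spark, casting a timestamp column to `long` gives the seconds since the epoch, so the same subtraction and division apply there.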

In [None]:
(
    # Cast creation_date of the questions to timestamp so it can be converted to epoch seconds
    questionsDF.withColumn('creation_date', col('creation_date').cast('timestamp'))
    .filter(col('tags').like('%spark%'))
    .alias('questions')
    .join(answersDF.alias('answers'), 'question_id')
    .orderBy(desc('answers.score'))
    # The difference of the epoch seconds divided by 60 gives the response time in minutes
    .withColumn('response_time', (col('answers.creation_date').cast('long') - col('questions.creation_date').cast('long')) / 60)
    .select('title', 'answers.creation_date', 'questions.creation_date', 'response_time')
).show(truncate=70, n=1)

In [None]:
spark.stop()