# Data exploration

The goal of this notebook is to get familiar with the datasets that will be used throughout the training. 

Explore these three datasets:
* questions (JSON format)
* answers (Parquet format)
* users (Parquet format)


1. For each of them:
  * inspect the schema
  * show the first 10 records
  * find the total record count


2. For users, find out how many distinct locations there are
3. Compute how many users we have in each location
4. Find the top three users with the highest reputation
5. Find the question with the highest score
6. Count how many questions have an accepted answer
7. Find the question that has the most answers

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Data inspection')
    .getOrCreate()
)

In [None]:
# Paths to the datasets:

base_path = os.getcwd()

# Project root: three directory levels above the notebook's working directory
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'data/questions-json')

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')
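
The path arithmetic above drops the last three components of the working directory to reach the project root. A minimal sketch of what that does, using a made-up directory (`example_cwd` is purely hypothetical and not the real location of this notebook), together with an equivalent `pathlib` formulation:

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only to illustrate the path arithmetic.
example_cwd = "/home/user/project/notebooks/exploration/exercises"

# The notebook's approach: split on '/', drop the last three parts, re-join.
via_split = "/".join(example_cwd.split("/")[:-3])

# Equivalent with pathlib: parents[2] is the directory three levels up.
via_pathlib = str(PurePosixPath(example_cwd).parents[2])

# Build a dataset path relative to the derived project root.
example_questions_path = str(PurePosixPath(via_split) / "data" / "questions-json")

print(via_split)
print(example_questions_path)
```

Both variants assume POSIX-style paths and that the notebook sits exactly three levels below the project root; if the notebook is moved, the offset has to change with it.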

In [None]:
usersDF = spark.read.parquet(users_input_path)

questionsDF = spark.read.json(questions_input_path)

answersDF = spark.read.parquet(answers_input_path)

#### Check the schemas, counts and some records

Hint:
* use `printSchema`, `count`, `show`

In [None]:
# your code here: 


#### For users, find out how many distinct locations there are

Hint:
* use `distinct` or `dropDuplicates`
* docs for [distinct](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.distinct.html#pyspark.sql.DataFrame.distinct)
* docs for [dropDuplicates](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html#pyspark.sql.DataFrame.dropDuplicates)

In [None]:
# your code here: 


#### Compute how many users we have in each location

Hint:
* [filter](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html#pyspark.sql.DataFrame.filter) out nulls using [isNotNull](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.isNotNull.html#pyspark.sql.Column.isNotNull)
* use [groupBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html#pyspark.sql.DataFrame.groupBy) + [count](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.count.html#pyspark.sql.functions.count)

In [None]:
# your code here: 


#### Find the top three users with the highest reputation

Hint:
* use [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html#pyspark.sql.DataFrame.orderBy) + [desc](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.desc.html#pyspark.sql.functions.desc) + [limit](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.limit.html#pyspark.sql.DataFrame.limit)

In [None]:
# your code here: 


#### Find the question with the highest score

In [None]:
# your code here: 


#### Count how many questions have an accepted answer

In [None]:
# your code here: 


#### Find the question that has the most answers

Hint:
* group answers by question_id
* count
* orderBy the count in desc order
* limit 1
* collect the question_id and use it in a filter on questions

In [None]:
# your code here: 


In [None]:
 # your code here: 


In [None]:
spark.stop()