# Data exploration

The goal of this notebook is to get familiar with the datasets that will be used throughout the training. 

Explore these three datasets:
* questions (JSON format)
* answers (Parquet format)
* users (Parquet format)


1. For each of them:
  * see the schema
  * see the first 10 records
  * find the total count


2. For users, find out how many distinct locations we have
3. Compute how many users we have in each location
4. Find the three users with the highest reputation
5. Find the question with the highest score
6. Count how many questions have an accepted answer
7. Find the question that has the most answers

In [20]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count

import os

In [21]:
spark = (
    SparkSession
    .builder
    .appName('Data inspection')
    .getOrCreate()
)

In [22]:
# Paths to the datasets:

base_path = os.getcwd()

# The project root is three directory levels above the notebook's working directory.
project_path = '/'.join(base_path.split('/')[:-3])

questions_input_path = os.path.join(project_path, 'data/questions-json')

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')
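The path handling in the cell above can also be written with `pathlib`. This is a minimal sketch, assuming POSIX-style paths; `project_root`, `levels_up`, and the example working directory are hypothetical names and values chosen for illustration, not part of the training material.

```python
from pathlib import PurePosixPath


def project_root(base_path: str, levels_up: int = 3) -> str:
    """Return the directory `levels_up` levels above `base_path`."""
    # parents[0] is the immediate parent, so three levels up is parents[2].
    return str(PurePosixPath(base_path).parents[levels_up - 1])


# Hypothetical notebook location, used only to illustrate the computation.
example_cwd = '/home/user/project/notebooks/section-a/exercises'

root = project_root(example_cwd)
questions_path = str(PurePosixPath(root) / 'data' / 'questions-json')

print(root)
print(questions_path)
```

For this example directory, the result matches what the string-splitting expression in the notebook produces.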

In [23]:
usersDF = spark.read.parquet(users_input_path)

questionsDF = spark.read.json(questions_input_path)

answersDF = spark.read.parquet(answers_input_path)

#### Check the schemas, counts and some records

Hint:
* use `printSchema`, `count`, `show`

In [24]:
usersDF.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- display_name: string (nullable = true)
 |-- about: string (nullable = true)
 |-- location: string (nullable = true)
 |-- downvotes: long (nullable = true)
 |-- upvotes: long (nullable = true)
 |-- reputation: long (nullable = true)
 |-- views: long (nullable = true)



In [25]:
usersDF.show(n=10)

+-------+------------------+--------------------+--------------------+---------+-------+----------+-----+
|user_id|      display_name|               about|            location|downvotes|upvotes|reputation|views|
+-------+------------------+--------------------+--------------------+---------+-------+----------+-----+
| 189509|           tgmjack|                null|                null|        0|      0|        45|   18|
|  22619|       user1956641|                null|St.Petersburg, Ru...|        0|      0|         1|    1|
| 109648|            Dexter|                null|                null|        0|      0|         1|    0|
|  27389|              Cala|                    |              Greece|        0|      7|       387|   42|
| 242785|Bitthal Maheshwari|                null|                null|        0|      0|         1|    0|
| 178788|        J. Mossman|                null|                null|        0|      0|         1|    0|
| 172407|        Elise Bond|                nu

In [26]:
usersDF.count()

190014

In [27]:
questionsDF.printSchema()

root
 |-- accepted_answer_id: long (nullable = true)
 |-- answers: long (nullable = true)
 |-- body: string (nullable = true)
 |-- comments: long (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- question_id: long (nullable = true)
 |-- score: long (nullable = true)
 |-- tags: string (nullable = true)
 |-- title: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- views: long (nullable = true)



In [28]:
questionsDF.show(n=10, truncate=10)

+------------------+-------+----------+--------+-------------+-----------+-----+----------+----------+-------+-----+
|accepted_answer_id|answers|      body|comments|creation_date|question_id|score|      tags|     title|user_id|views|
+------------------+-------+----------+--------+-------------+-----------+-----+----------+----------+-------+-----+
|              null|      0|<h2>Que...|       6|   2016-12...|     296663|    0|<radiat...|Probabi...| 123260|   33|
|              6517|      6|<p>I'm ...|       9|   2011-03...|       6505|   12|<classi...|Are wat...|   1272|12769|
|            122806|      1|<p>How ...|       2|   2014-07...|     122781|    3|<statis...|Evaluat...|  52646|  365|
|              null|      1|<p>The ...|       3|   2017-01...|     304107|    0|   <waves>|How do ...| 106228|  200|
|              null|      2|<p>When...|       1|   2015-01...|     160063|    0|<centri...|Why do ...|  70599|  269|
|              null|      1|<p>Supp...|       0|   2017-12...|  

In [29]:
questionsDF.count()

154905

In [30]:
answersDF.printSchema()

root
 |-- answer_id: long (nullable = true)
 |-- creation_date: timestamp (nullable = true)
 |-- body: string (nullable = true)
 |-- comments: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- score: long (nullable = true)
 |-- question_id: long (nullable = true)



In [31]:
answersDF.show(n=10, truncate=10)

+---------+-------------+----------+--------+-------+-----+-----------+
|answer_id|creation_date|      body|comments|user_id|score|question_id|
+---------+-------------+----------+--------+-------+-----+-----------+
|   334556|   2017-05...|<p>Prev...|       2| 156813|    0|     334548|
|   391386|   2018-03...|<p>Grav...|       9| 150025|   10|     391381|
|   201434|   2015-08...|<p>Sinc...|       3|  24022|    1|     201428|
|   408789|   2018-05...|<p>Dual...|       2|   2525|    1|     408737|
|    81529|   2013-10...|<p>Have...|       1|   1325|    1|      81503|
|    63988|   2013-05...|<p>I) P...|       1|   2451|    2|      63950|
|     6401|   2011-03...|<p><str...|       0|   1257|    0|       6370|
|   479648|   2019-05...|<p>In a...|       0|  37364|    1|     479619|
|    47562|   2012-12...|<p>The ...|       0|  14473|    0|      47074|
|   290761|   2016-11...|<p>The ...|       0|   4272|    1|     290515|
+---------+-------------+----------+--------+-------+-----+-----

#### For users find out how many distinct locations we have

Hint:
* use `distinct` or `dropDuplicates`
* docs for distinct: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct

In [32]:
usersDF.select('location').distinct().count()

10457

#### Compute how many users we have in each location

Hint:
* filter out nulls
* use `groupBy` + `count`

In [33]:
(
    usersDF
    .filter(col('location').isNotNull())
    .groupBy('location')
    .agg(count('*').alias('cnt'))
    .orderBy(desc('cnt'))
).show()

+--------------------+----+
|            location| cnt|
+--------------------+----+
|               India|1604|
|       United States| 932|
|             Germany| 792|
|      United Kingdom| 595|
|London, United Ki...| 583|
|              Canada| 404|
|                 USA| 403|
|              France| 367|
|         Netherlands| 311|
|                  UK| 307|
|           Australia| 294|
|Bangalore, Karnat...| 280|
|              Brazil| 280|
|               Italy| 279|
|    Bangalore, India| 254|
|       Paris, France| 252|
|     Berlin, Germany| 243|
|               Earth| 227|
|          London, UK| 223|
|           Singapore| 218|
+--------------------+----+
only showing top 20 rows



#### Find the three users with the highest reputation

Hint:
* use `orderBy` + `desc` + `limit`

In [34]:
usersDF.orderBy(desc('reputation')).limit(3).show()

+-------+------------+--------------------+--------------------+---------+-------+----------+-----+
|user_id|display_name|               about|            location|downvotes|upvotes|reputation|views|
+-------+------------+--------------------+--------------------+---------+-------+----------+-----+
|   1325| John Rennie|<p>My career in s...|Chester, United K...|     2594|   2861|    294805|70188|
|   1492|      anna v|<p>Retired experi...|              Greece|      137|   8726|    177737|36752|
|   1236|  Luboš Motl|<p>Hi, I am a str...|      Czech Republic|      917|   2013|    159837|83704|
+-------+------------+--------------------+--------------------+---------+-------+----------+-----+



#### Find the question with the highest score

In [35]:
(
    questionsDF.orderBy(desc('score')).select('title', 'score').limit(1)
).show(truncate=200)

+--------------------------------------------+-----+
|                                       title|score|
+--------------------------------------------+-----+
|Cooling a cup of coffee with help of a spoon|  703|
+--------------------------------------------+-----+



#### Count how many questions have an accepted answer

In [36]:
questionsDF.filter(col('accepted_answer_id').isNotNull()).count()

66417

#### Find the question that has the most answers

Hint:
* join questions with answers on `question_id`
* `groupBy` on `question_id`
* `count`
* `orderBy` the count in descending order
* `limit` 1
* `collect` the `question_id` and use it in the next query

In [37]:
question_id = (
    questionsDF.alias('q')
    .join(answersDF.alias('a'), 'question_id')
    .groupBy('question_id')
    .agg(count('*').alias('cnt'))
    .orderBy(desc('cnt'))
    .limit(1)
).collect()[0]['question_id']

In [38]:
(
    questionsDF.filter(col('question_id') == question_id)
    .select('title')
).show(truncate=False)

+-------------------------------+
|title                          |
+-------------------------------+
|Common false beliefs in Physics|
+-------------------------------+

