# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

In [3]:
spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()

In [4]:
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [5]:
user_log.createOrReplaceTempView("user_log_table")

# Question 1

Which pages did user id "" (empty string) NOT visit?

In [10]:
# Pages in the log that never appear in a row with an empty userId
spark.sql("""SELECT DISTINCT page
            FROM user_log_table
            WHERE page NOT IN (SELECT page FROM user_log_table WHERE userId = '')
          """).show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
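
One caveat with `NOT IN` and a subquery: if the subquery returns any NULL, the predicate is never true and the query returns no rows. This is standard SQL three-valued logic, which Spark SQL also follows. The sketch below illustrates the behavior with Python's built-in `sqlite3` module as a stand-in for Spark, on a hypothetical toy table named `log` (not the Sparkify data). The table, its rows, and the helper function names are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical toy table standing in for user_log_table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (userId TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [("", "Home"), ("", "Login"), ("1", "Home"), ("1", "NextSong"), ("2", "Logout")],
)


def pages_not_in(connection):
    """Same shape as the quiz query: plain NOT IN over a subquery."""
    rows = connection.execute(
        """SELECT DISTINCT page FROM log
           WHERE page NOT IN (SELECT page FROM log WHERE userId = '')
           ORDER BY page"""
    ).fetchall()
    return [row[0] for row in rows]


def pages_not_in_null_safe(connection):
    """Variant that drops NULL pages inside the subquery."""
    rows = connection.execute(
        """SELECT DISTINCT page FROM log
           WHERE page NOT IN (SELECT page FROM log
                              WHERE userId = '' AND page IS NOT NULL)
           ORDER BY page"""
    ).fetchall()
    return [row[0] for row in rows]


before = pages_not_in(conn)

# Add one row with a NULL page for the empty-string user
conn.execute("INSERT INTO log VALUES ('', NULL)")
after = pages_not_in(conn)
after_safe = pages_not_in_null_safe(conn)

print(before)      # ['Logout', 'NextSong']
print(after)       # []
print(after_safe)  # ['Logout', 'NextSong']
```

If the `page` column can contain NULLs, filtering them out inside the subquery (as in the second helper) keeps the result from silently becoming empty.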



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [11]:
# Count each female user once, however many log rows they have
spark.sql("""SELECT count(DISTINCT userId)
            FROM user_log_table
            WHERE gender = 'F'
          """).show()

+----------------------+
|count(DISTINCT userId)|
+----------------------+
|                   462|
+----------------------+



# Question 4

How many songs were played by the most played artist?

In [16]:
# count(song) skips rows where song is NULL, so only song plays are counted
spark.sql("""SELECT artist, count(song) AS songs_played
            FROM user_log_table
            GROUP BY artist
            ORDER BY count(song) DESC
            LIMIT 1
          """).show()

+--------+------------+
|  artist|songs_played|
+--------+------------+
|Coldplay|          83|
+--------+------------+



# Question 5 (challenge)

How many songs do users listen to on average between visits to our home page? Please round your answer to the closest integer.

In [22]:
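# Label each row with a running count of the user's Home visits (period),
# then count NextSong rows per (userId, period) and average those counts.
# The page filter is parenthesized because AND binds tighter than OR;
# without the parentheses, Home rows with an empty userId would slip through.
spark.sql("""SELECT avg(song_counts)
            FROM
            (SELECT count(page) song_counts
            FROM
            (SELECT userId, page, ts,
            SUM(CASE WHEN page = 'Home' THEN 1 ELSE 0 END) OVER (PARTITION BY userId ORDER BY ts) AS period
            FROM user_log_table
            WHERE userId != ''
            AND (page = 'NextSong' OR page = 'Home'))
            WHERE page = 'NextSong'
            GROUP BY userId, period)
          """).show()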

+------------------+
|  avg(song_counts)|
+------------------+
|6.9558333333333335|
+------------------+
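
To make the window logic in this query easier to follow, here is a minimal plain-Python sketch of the same idea: a running count of `Home` rows per user labels each period, and `NextSong` rows are then counted per `(userId, period)` before averaging. The `toy_log` rows and the function name `average_songs_per_period` are hypothetical and are not taken from the Sparkify data.

```python
from collections import defaultdict

# Hypothetical toy log of (userId, ts, page) rows, for illustration only
toy_log = [
    ("1", 1, "NextSong"),
    ("1", 2, "Home"),
    ("1", 3, "NextSong"),
    ("1", 4, "NextSong"),
    ("1", 5, "Home"),
    ("1", 6, "NextSong"),
    ("2", 1, "Home"),
    ("2", 2, "NextSong"),
    ("2", 3, "NextSong"),
    ("2", 4, "NextSong"),
]


def average_songs_per_period(log):
    """Mimic the SQL: cumulative Home count per user, then songs per period."""
    homes_seen = defaultdict(int)   # running SUM(...) OVER (PARTITION BY userId ORDER BY ts)
    song_counts = defaultdict(int)  # count(page) ... GROUP BY userId, period
    for user_id, ts, page in sorted(log, key=lambda row: (row[0], row[1])):
        if user_id == "":
            continue  # matches WHERE userId != ''
        if page == "Home":
            homes_seen[user_id] += 1
        elif page == "NextSong":
            song_counts[(user_id, homes_seen[user_id])] += 1
    average = sum(song_counts.values()) / len(song_counts)
    return average, dict(song_counts)


average, counts = average_songs_per_period(toy_log)
print(counts)   # {('1', 0): 1, ('1', 1): 2, ('1', 2): 1, ('2', 1): 3}
print(average)  # 1.75
```

As in the SQL version, a period with no songs (two `Home` visits in a row) produces no group and therefore does not pull the average down.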

