# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark Data Frames.

In [1]:
from pyspark.sql import SparkSession

# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

spark = SparkSession\
        .builder\
        .appName("Data wrangling with Spark SQL")\
        .getOrCreate()

path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)
user_log.createOrReplaceTempView("user_log_table")
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [2]:
# TODO: write your code to answer question 1
spark.sql("""SELECT DISTINCT page 
             FROM user_log_table
             WHERE page NOT IN (SELECT DISTINCT page 
                                FROM user_log_table
                                WHERE userId = '')
""").show()


+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+
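The query above is a set difference: all distinct pages minus the pages seen for the empty-string user. The following is a minimal sketch of that pattern, assuming Python's built-in `sqlite3` as a stand-in for Spark SQL and a handful of hypothetical rows (not the Sparkify data). It also shows an `EXCEPT` formulation, which Spark SQL supports as well.

```python
import sqlite3

# Hypothetical toy log: user '' is a logged-out visitor, user '42' is logged in.
rows = [
    ("", "Home"),
    ("", "Login"),
    ("", "About"),
    ("42", "Home"),
    ("42", "NextSong"),
    ("42", "Settings"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO user_log_table VALUES (?, ?)", rows)

# Same shape as the quiz query: keep pages that never appear for userId = ''.
not_in_query = """
    SELECT DISTINCT page
    FROM user_log_table
    WHERE page NOT IN (SELECT DISTINCT page
                       FROM user_log_table
                       WHERE userId = '')
"""

# Equivalent set-difference formulation; EXCEPT removes duplicates itself.
except_query = """
    SELECT page FROM user_log_table
    EXCEPT
    SELECT page FROM user_log_table WHERE userId = ''
"""

pages_not_in = sorted(r[0] for r in conn.execute(not_in_query))
pages_except = sorted(r[0] for r in conn.execute(except_query))
print(pages_not_in)
print(pages_except)
```

One caveat worth keeping in mind: if the subquery of a `NOT IN` ever returns a NULL, the predicate matches no rows at all, so the `EXCEPT` form can be the safer choice when the column is nullable.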



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [3]:
# TODO: write your code to answer question 3
spark.sql("""
    SELECT gender, count(gender)
    FROM (SELECT DISTINCT userId, gender
          FROM user_log_table)
    GROUP BY gender
""").show()

+------+-------------+
|gender|count(gender)|
+------+-------------+
|     F|          462|
|  null|            0|
|     M|          501|
+------+-------------+
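The inner `SELECT DISTINCT userId, gender` matters because the log has one row per event, so counting `gender` directly would count events rather than users. A minimal sketch of the difference, assuming `sqlite3` as a stand-in for Spark SQL and hypothetical rows:

```python
import sqlite3

# Hypothetical toy log: one row per event, so the same user appears many times.
rows = [
    ("1", "F"), ("1", "F"), ("1", "F"),
    ("2", "M"),
    ("3", "F"), ("3", "F"),
    ("", None),  # logged-out visitor with no gender recorded
    ("", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, gender TEXT)")
conn.executemany("INSERT INTO user_log_table VALUES (?, ?)", rows)

# Counting straight from the log counts events, not users.
per_event = dict(conn.execute("""
    SELECT gender, COUNT(gender)
    FROM user_log_table
    GROUP BY gender
"""))

# De-duplicating (userId, gender) pairs first counts each user once.
per_user = dict(conn.execute("""
    SELECT gender, COUNT(gender)
    FROM (SELECT DISTINCT userId, gender
          FROM user_log_table)
    GROUP BY gender
"""))

print(per_event)
print(per_user)
```

In both results the NULL group has a count of 0, because `COUNT(gender)` skips NULL values. That is the same reason the `null` row in the quiz output shows 0.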



# Question 4

How many songs were played from the most played artist?

In [4]:
# TODO: write your code to answer question 4
spark.sql("""
    SELECT artist, COUNT(artist) AS count
    FROM user_log_table
    WHERE userId IS NOT NULL
    GROUP BY artist
    ORDER BY count DESC
    LIMIT 1
""").show()

+--------+-----+
|  artist|count|
+--------+-----+
|Coldplay|   83|
+--------+-----+
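The query above relies on `COUNT(artist)` ignoring NULLs: rows for pages other than song plays have no artist, and they would win the ranking if every row were counted. A minimal sketch of that behavior, assuming `sqlite3` as a stand-in for Spark SQL and hypothetical artist names:

```python
import sqlite3

# Hypothetical toy log: NULL artist marks events that are not song plays.
rows = [
    ("Artist A",), ("Artist A",),
    ("Artist B",),
    (None,), (None,), (None,),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (artist TEXT)")
conn.executemany("INSERT INTO user_log_table VALUES (?)", rows)

# COUNT(artist) skips NULLs, so the NULL group scores 0 and cannot rank first.
top_counting_artist = conn.execute("""
    SELECT artist, COUNT(artist) AS plays
    FROM user_log_table
    GROUP BY artist
    ORDER BY plays DESC
    LIMIT 1
""").fetchone()

# COUNT(*) counts every row, so the NULL group comes out on top here.
top_counting_rows = conn.execute("""
    SELECT artist, COUNT(*) AS plays
    FROM user_log_table
    GROUP BY artist
    ORDER BY plays DESC
    LIMIT 1
""").fetchone()

print(top_counting_artist)
print(top_counting_rows)
```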



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [5]:
# Step 1: flag every 'Home' page visit with 1 and every other page with 0
spark.sql("""
        SELECT userId, sessionId, itemInSession, page,
                 CASE WHEN page = 'Home' THEN 1
                 ELSE 0 
                 END AS flag
        FROM user_log_table
""").show()

+------+---------+-------------+--------+----+
|userId|sessionId|itemInSession|    page|flag|
+------+---------+-------------+--------+----+
|  1046|     5132|          112|NextSong|   0|
|  1000|     5027|            7|NextSong|   0|
|  2219|     5516|            6|NextSong|   0|
|  2373|     2372|            8|NextSong|   0|
|  1747|     1746|            0|    Home|   1|
|  1747|     1746|            1|Settings|   0|
|  1162|     4406|            0|NextSong|   0|
|  1061|     1060|            2|NextSong|   0|
|   748|     5661|            2|    Home|   1|
|   597|     3689|            0|    Home|   1|
|  1806|     5175|           23|NextSong|   0|
|   748|     5661|            3|NextSong|   0|
|  1176|     1175|           82|NextSong|   0|
|  2164|     2163|           28|NextSong|   0|
|  2146|     5272|            3|NextSong|   0|
|  2219|     5516|            7|NextSong|   0|
|  1176|     1175|           83|    Home|   1|
|  2904|     2903|            0|NextSong|   0|
...

In [6]:
# Step 2: running total of the Home flag per user, ordered by time.
# Rows between two consecutive Home visits share the same "running" value.
spark.sql("""
    SELECT userId, sessionId, itemInSession, page, flag,
           SUM(flag) OVER (PARTITION BY userId ORDER BY ts) AS running
    FROM (SELECT userId, sessionId, itemInSession, ts, page,
                 CASE WHEN page = 'Home' THEN 1
                 ELSE 0 
                 END AS flag
          FROM user_log_table)
""").take(20)

[Row(userId='1436', sessionId=1435, itemInSession=0, page='NextSong', flag=0, running=0),
 Row(userId='1436', sessionId=1435, itemInSession=1, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=0, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=1, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=2, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=3, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=4, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=5, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=6, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=7, page='NextSong', flag=0, running=0),
 Row(userId='2088', sessionId=2087, itemInSession=8, page='NextSong', flag=0, running=0),
 ...

In [7]:
# Step 3: count the songs (NextSong rows) in each (userId, running) period
spark.sql("""
    SELECT userId, running, COUNT(page)
    FROM (
        SELECT userId, sessionId, itemInSession, page, flag,
               SUM(flag) OVER (PARTITION BY userId ORDER BY ts) AS running
        FROM (SELECT userId, sessionId, itemInSession, ts, page,
                     CASE WHEN page = 'Home' THEN 1
                     ELSE 0 
                     END AS flag
              FROM user_log_table))
    WHERE page = 'NextSong'
    GROUP BY userId, running
""").take(20)

[Row(userId='1436', running=0, count(page)=2),
 Row(userId='2088', running=0, count(page)=13),
 Row(userId='2162', running=0, count(page)=15),
 Row(userId='2162', running=2, count(page)=19),
 Row(userId='2294', running=0, count(page)=11),
 Row(userId='2294', running=1, count(page)=4),
 Row(userId='2294', running=2, count(page)=16),
 Row(userId='2294', running=3, count(page)=3),
 Row(userId='2294', running=4, count(page)=17),
 Row(userId='2294', running=5, count(page)=4),
 Row(userId='2904', running=0, count(page)=1),
 Row(userId='691', running=0, count(page)=3),
 Row(userId='1394', running=0, count(page)=17),
 Row(userId='1394', running=1, count(page)=9),
 Row(userId='2275', running=1, count(page)=3),
 Row(userId='2756', running=0, count(page)=1),
 Row(userId='2756', running=2, count(page)=4),
 Row(userId='451', running=1, count(page)=1),
 Row(userId='451', running=2, count(page)=1),
 Row(userId='800', running=0, count(page)=2)]

In [8]:
# Step 4: average the per-period song counts across all users and periods
spark.sql("""
    SELECT AVG(count)
    FROM (
        SELECT userId, running, COUNT(page) AS count
        FROM (
            SELECT userId, sessionId, itemInSession, page, flag,
                   SUM(flag) OVER (PARTITION BY userId ORDER BY ts) AS running
            FROM (SELECT userId, sessionId, itemInSession, ts, page,
                         CASE WHEN page = 'Home' THEN 1
                         ELSE 0
                         END AS flag
                  FROM user_log_table))
        WHERE page = 'NextSong'
        GROUP BY userId, running)
""").show()

+------------------+
|        avg(count)|
+------------------+
|6.9558333333333335|
+------------------+
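The nested query above flags Home visits, takes a running sum of the flag per user, counts the NextSong rows in each (userId, running) period, and averages those counts. The same logic can be sketched in plain Python on a small hypothetical event list (the user ids, timestamps, and pages below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical toy log of (userId, ts, page) events.
events = [
    ("A", 1, "NextSong"),
    ("A", 2, "NextSong"),
    ("A", 3, "Home"),
    ("A", 4, "NextSong"),
    ("A", 5, "NextSong"),
    ("A", 6, "NextSong"),
    ("A", 7, "Home"),
    ("A", 8, "NextSong"),
    ("B", 1, "Home"),
    ("B", 2, "NextSong"),
    ("B", 3, "NextSong"),
]

running = defaultdict(int)           # Home visits seen so far, per user
songs_per_period = defaultdict(int)  # songs per (userId, running) period

# Sorting by (userId, ts) mirrors PARTITION BY userId ORDER BY ts.
for user_id, ts, page in sorted(events):
    if page == "Home":
        running[user_id] += 1        # equivalent to SUM(flag) OVER (...)
    elif page == "NextSong":
        songs_per_period[(user_id, running[user_id])] += 1

counts = list(songs_per_period.values())
average = sum(counts) / len(counts)
print(dict(songs_per_period))
print(round(average))
```

As in the SQL version, periods that contain no songs never appear in the grouped result, so they do not pull the average down. The question asks for the nearest integer, and rounding the quiz result of 6.9558333333333335 gives 7.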

