# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." This time, however, answer the questions with Spark SQL instead of Spark DataFrames.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, count, when, col, desc, udf, sort_array, asc, avg
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

In [2]:
spark = SparkSession \
    .builder \
    .appName("Wrangling Data with Spark SQL") \
    .getOrCreate()

In [3]:
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

In [6]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [7]:
user_log.createOrReplaceTempView('log_table')

# Question 1

Which page did the user with id "" (empty string) NOT visit?

In [10]:
# TODO: write your code to answer question 1
# All pages in the log, minus the pages visited by the empty-string user
spark.sql('''
    SELECT DISTINCT page
    FROM log_table
    EXCEPT
    SELECT DISTINCT page
    FROM log_table
    WHERE userId = ''
''').show()

+----------------+
|            page|
+----------------+
|Submit Downgrade|
|       Downgrade|
|          Logout|
|   Save Settings|
|        Settings|
|        NextSong|
|         Upgrade|
|           Error|
|  Submit Upgrade|
+----------------+



# Question 3

How many female users do we have in the data set?

In [13]:
# TODO: write your code to answer question 3
spark.sql('''
    SELECT COUNT(DISTINCT userId) AS female_count
    FROM log_table
    WHERE gender = 'F'
''').show()

+------------+
|female_count|
+------------+
|         462|
+------------+



# Question 4

How many songs were played by the most played artist?

In [17]:
# TODO: write your code to answer question 4
# Count plays per artist, then take the largest count
spark.sql('''
    SELECT MAX(cnt) AS Max_CNT
    FROM (
        SELECT artist, COUNT(*) AS cnt
        FROM log_table
        WHERE artist IS NOT NULL
        GROUP BY artist
    ) AS a
''').show()

+-------+
|Max_CNT|
+-------+
|     83|
+-------+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [None]:
# TODO: write your code to answer question 5
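
One possible approach to question 5, offered as a sketch rather than the reference solution: flag each `Home` visit, take a running sum of that flag per user in reverse time order so that every song gets a period id, count songs per user and period, then average the counts. The page value `'Home'` is an assumption (only `'NextSong'` appears in the output above), and the names `AVG_SONGS_BETWEEN_HOME_SQL`, `avg_songs_between_home`, and `period_id` are hypothetical helpers introduced here. The function expects a `SparkSession` on which `log_table` is already registered.

```python
# Sketch for question 5. Assumes page values 'Home' and 'NextSong'
# and a temp view named log_table with userId, ts, and page columns.
AVG_SONGS_BETWEEN_HOME_SQL = '''
WITH flagged AS (
    -- Keep only home visits and song plays; mark home visits with 1
    SELECT userId, ts, page,
           CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM log_table
    WHERE page IN ('Home', 'NextSong')
),
periods AS (
    -- Running count of home visits per user, newest event first,
    -- so songs between the same two home visits share a period_id
    SELECT userId, page,
           SUM(is_home) OVER (
               PARTITION BY userId
               ORDER BY ts DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period_id
    FROM flagged
),
counts AS (
    -- Number of songs in each (user, period) pair
    SELECT userId, period_id, COUNT(*) AS songs
    FROM periods
    WHERE page = 'NextSong'
    GROUP BY userId, period_id
)
SELECT ROUND(AVG(songs)) AS avg_songs
FROM counts
'''


def avg_songs_between_home(spark):
    """Run the query on a SparkSession that already has the log_table view."""
    return spark.sql(AVG_SONGS_BETWEEN_HOME_SQL)
```

In the notebook this would be called as `avg_songs_between_home(spark).show()`. Note the design choices baked into the sketch: songs played before a user's first home visit or after the last one still form a period, and periods with no songs at all are not counted, so a different reading of "between" would change the result.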