# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." This time, however, use Spark SQL instead of Spark Data Frames.

In [22]:
# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

from pyspark.sql import SparkSession

import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Instantiate (or reuse) a Spark session
spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()

# Read the log data and register it as a temporary view for SQL queries
log_data = spark.read.json("data/sparkify_log_small.json")
log_data.createOrReplaceTempView("log_table")

# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [2]:
# TODO: write your code to answer question 1
log_data.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [16]:
# Anti-join: keep every page that has no match among the pages
# visited by the user whose id is the empty string.
spark.sql("""
            SELECT *
            FROM (
                    SELECT DISTINCT page
                    FROM log_table
                    WHERE userId = ''
                 ) AS user_pages
            RIGHT JOIN (
                    SELECT DISTINCT page
                    FROM log_table
                 ) AS all_pages
            ON user_pages.page = all_pages.page
            WHERE user_pages.page IS NULL
          """).show()

+----+----------------+
|page|            page|
+----+----------------+
|null|Submit Downgrade|
|null|       Downgrade|
|null|          Logout|
|null|   Save Settings|
|null|        Settings|
|null|        NextSong|
|null|         Upgrade|
|null|           Error|
|null|  Submit Upgrade|
+----+----------------+
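
The same anti-join can also be written as a set difference, which returns a single `page` column instead of two. The sketch below is only an illustration, not the quiz's reference solution. To stay runnable without a Spark installation it uses Python's built-in `sqlite3` as a stand-in engine, and the rows in its `log_table` are hypothetical. The `EXCEPT` query is standard SQL and is assumed to work unchanged when passed to `spark.sql` against the real view.

```python
import sqlite3

# Hypothetical toy rows (userId, page); NOT taken from sparkify_log_small.json.
toy_rows = [
    ("", "Home"), ("", "Login"), ("", "About"),
    ("1", "Home"), ("1", "NextSong"),
    ("2", "Settings"), ("2", "NextSong"),
]

# sqlite3 is a stand-in for Spark SQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO log_table VALUES (?, ?)", toy_rows)

# All pages, minus the pages the empty-string user visited.
query = """
    SELECT DISTINCT page FROM log_table
    EXCEPT
    SELECT DISTINCT page FROM log_table WHERE userId = ''
"""

# Sort in Python so the result order is deterministic.
unvisited = sorted(row[0] for row in conn.execute(query))
print(unvisited)
conn.close()
```

In the notebook, the equivalent call would be `spark.sql(query).show()`, assuming Spark accepts the query string as written.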



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

# Question 3

How many female users do we have in the data set?

In [20]:
# TODO: write your code to answer question 3
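# Filter with WHERE before grouping; the result is the same as
# filtering the groups afterwards with HAVING.
spark.sql("""
            SELECT gender, COUNT(DISTINCT userId) AS n
            FROM log_table
            WHERE gender = 'F'
            GROUP BY gender
           """).show()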

+------+----+
|gender|   n|
+------+----+
|     F|1905|
+------+----+
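
`COUNT(DISTINCT userId)` matters here because a log table holds one row per event, so a plain row count would count events rather than users. The sketch below illustrates the difference on a few hypothetical rows. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL so that it runs without a Spark installation; the table contents are invented for the example.

```python
import sqlite3

# Hypothetical toy rows (userId, gender): user "1" appears three times.
toy_rows = [
    ("1", "F"), ("1", "F"), ("1", "F"),
    ("2", "F"),
    ("3", "M"), ("3", "M"),
]

# sqlite3 is a stand-in for Spark SQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (userId TEXT, gender TEXT)")
conn.executemany("INSERT INTO log_table VALUES (?, ?)", toy_rows)

# COUNT(*) counts log rows; COUNT(DISTINCT userId) counts users.
n_events, n_users = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT userId)
    FROM log_table
    WHERE gender = 'F'
""").fetchone()

print(n_events, n_users)
conn.close()
```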



# Question 4

How many songs were played by the most-played artist?

In [25]:
# TODO: write your code to answer question 4
# Drop rows without an artist before grouping, then keep the top artist.
spark.sql("""
            SELECT artist, COUNT(artist) AS n
            FROM log_table
            WHERE artist IS NOT NULL
            GROUP BY artist
            ORDER BY n DESC
            LIMIT 1
           """).show()

+--------+---+
|  artist|  n|
+--------+---+
|Coldplay| 83|
+--------+---+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [None]:
# TODO: write your code to answer question 5
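
One possible way to approach this question is with a window function: flag each `Home` event, take a running sum of those flags per user so that every stretch between home-page visits gets its own period number, count the `NextSong` rows in each period, and average the counts. The sketch below shows that idea under stated assumptions. It runs the query on Python's built-in `sqlite3` (which needs SQLite 3.25 or newer for window functions) as a stand-in for Spark SQL, the rows are hypothetical, and the CTE names `events`, `periods`, and `songs_per_period` are invented for illustration. It is one interpretation of "between visiting our home page", not the official answer.

```python
import sqlite3

# Hypothetical toy rows (userId, page, ts); NOT the real Sparkify log.
toy_rows = [
    ("u1", "Home", 1), ("u1", "NextSong", 2), ("u1", "NextSong", 3),
    ("u1", "Home", 4), ("u1", "NextSong", 5), ("u1", "Settings", 6),
    ("u1", "NextSong", 7), ("u1", "NextSong", 8),
    ("u2", "NextSong", 1), ("u2", "Home", 2), ("u2", "NextSong", 3),
]

# sqlite3 is a stand-in for Spark SQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_table (userId TEXT, page TEXT, ts INTEGER)")
conn.executemany("INSERT INTO log_table VALUES (?, ?, ?)", toy_rows)

query = """
    WITH events AS (
        -- Keep only home visits and song plays; flag the home visits.
        SELECT userId, page, ts,
               CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
        FROM log_table
        WHERE page IN ('Home', 'NextSong')
    ),
    periods AS (
        -- Running sum of home flags, newest event first, per user:
        -- rows sharing a value belong to the same stretch between visits.
        SELECT userId, page,
               SUM(is_home) OVER (
                   PARTITION BY userId
                   ORDER BY ts DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS period
        FROM events
    ),
    songs_per_period AS (
        -- Count song plays in each (user, period) stretch.
        SELECT userId, period, COUNT(*) AS n_songs
        FROM periods
        WHERE page = 'NextSong'
        GROUP BY userId, period
    )
    SELECT AVG(n_songs) AS avg_songs
    FROM songs_per_period
"""

avg_songs = conn.execute(query).fetchone()[0]
print(avg_songs, round(avg_songs))
conn.close()
```

On the toy rows, the stretches contain 3, 2, 1, and 1 songs, so the average is 1.75, which rounds to 2. In the notebook, the same query string could be passed to `spark.sql(query).show()` against `log_table`, assuming Spark accepts it unchanged; the result on the real data set is not shown here.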
