# Data Wrangling with Spark SQL Quiz

This quiz uses the same dataset and most of the same questions as the earlier "Quiz - Data Wrangling with Data Frames Jupyter Notebook." For this quiz, however, use Spark SQL instead of Spark data frames.

In [3]:
# TODOS: 
# 1) import any other libraries you might need
# 2) instantiate a Spark session 
# 3) read in the data set located at the path "data/sparkify_log_small.json"
# 4) create a view to use with your SQL queries
# 5) write code to answer the quiz questions 

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, udf
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType

import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

spark = SparkSession \
    .builder \
    .appName("Data wrangling with Spark SQL") \
    .getOrCreate()

path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)

user_log.createOrReplaceTempView("user_log_table")

spark.sql("SELECT * FROM user_log_table LIMIT 2").show()

+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|Showaddywaddy|Logged In|  Kenneth|     M|          112|Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|   Lily Allen|Logged In|Elizabeth|     F|            7|   Chase|195.23873| free|Shreveport-Bossie...|   PUT|NextSong|1512718541284|     5027|      

# Question 1

Which pages did the user with id "" (empty string) NOT visit?

In [19]:
# TODO: write your code to answer question 1
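# Anti-join pattern: LEFT JOIN the full list of pages to the pages visited by
# the user whose userId is the empty string, then keep only the rows with no
# match on the right-hand side.
spark.sql('''
    SELECT *
    FROM (
        SELECT DISTINCT page FROM user_log_table
    ) AS all_pages
    LEFT JOIN (
        SELECT DISTINCT page
        FROM user_log_table
        WHERE userId = ''
    ) AS user_pages
    ON user_pages.page = all_pages.page
    WHERE user_pages.page IS NULL
    ''').show()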


+----------------+----+
|            page|page|
+----------------+----+
|Submit Downgrade|null|
|       Downgrade|null|
|          Logout|null|
|   Save Settings|null|
|        Settings|null|
|        NextSong|null|
|         Upgrade|null|
|           Error|null|
|  Submit Upgrade|null|
+----------------+----+



# Question 2 - Reflect

Why might you prefer to use SQL over data frames? Why might you prefer data frames over SQL?

## SQL over data frames

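  - SQL is declarative and widely known, so a query is often easier to read than the equivalent chain of data frame method calls.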
  
## Data frames over SQL
  
  - The SQL for Question 1 has two nested queries. They are simple right now but might grow over time. With data frames, each nested query can be split into its own function or variable and unit tested to confirm that it returns the expected data frame, which reduces risk when releasing to production.
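
As a minimal sketch of that last point, the Question 1 logic could be split into small data frame functions that are each checked on a toy input. The function names (`all_pages`, `pages_visited_by`, `pages_not_visited_by`, `run_toy_check`) and the four-row toy log are hypothetical and not part of the quiz; the sketch assumes PySpark is installed and can start a local session.

```python
def all_pages(log):
    """Return every distinct page that appears in the log data frame."""
    return log.select("page").distinct()


def pages_visited_by(log, user_id):
    """Return the distinct pages visited by a single user."""
    return log.where(log["userId"] == user_id).select("page").distinct()


def pages_not_visited_by(log, user_id):
    """Return pages in the log that the given user never visited (anti join)."""
    return all_pages(log).join(
        pages_visited_by(log, user_id), on="page", how="left_anti"
    )


def run_toy_check():
    """Check each piece on a made-up log; Spark is imported only when called."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("toy check")
        .getOrCreate()
    )
    # Hypothetical data: the empty-string user visits Home and Login only.
    toy_log = spark.createDataFrame(
        [("", "Home"), ("", "Login"), ("1", "Home"), ("1", "NextSong")],
        ["userId", "page"],
    )

    # Each intermediate result can be asserted on its own.
    assert sorted(r.page for r in all_pages(toy_log).collect()) == [
        "Home", "Login", "NextSong"
    ]
    assert sorted(r.page for r in pages_visited_by(toy_log, "").collect()) == [
        "Home", "Login"
    ]
    return sorted(r.page for r in pages_not_visited_by(toy_log, "").collect())
```

Calling `run_toy_check()` in the notebook exercises all three functions. In a real project, each assertion would more likely live in its own test case.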

# Question 3

How many female users do we have in the data set?

In [30]:
# TODO: write your code to answer question 3
spark.sql('''
    SELECT COUNT(*)
    FROM (
        SELECT DISTINCT userId, gender FROM user_log_table
    ) AS unique_users
    WHERE unique_users.gender = 'F'
    ''').show()

+--------+
|count(1)|
+--------+
|     462|
+--------+



# Question 4

How many songs were played by the most played artist?

In [38]:
# TODO: write your code to answer question 4
spark.sql('''
    SELECT artist, COUNT(artist) AS plays
    FROM user_log_table
    GROUP BY artist
    ORDER BY plays DESC
    LIMIT 1
    ''').show()

+--------+-----+
|  artist|plays|
+--------+-----+
|Coldplay|   83|
+--------+-----+



# Question 5 (challenge)

How many songs do users listen to on average between visiting our home page? Please round your answer to the closest integer.

In [None]:
# TODO: write your code to answer question 5
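
One possible approach, sketched here as an assumption rather than as the official answer, is to flag each `Home` visit, take a running sum of those flags per user so that every song falls into a "period" between home visits, count the songs in each period, and average the counts. To keep the sketch runnable without a Spark cluster, it uses Python's built-in `sqlite3` module as a stand-in for Spark SQL (window functions need SQLite 3.25 or newer) and a made-up toy table. The same query text should carry over to `spark.sql(...)` against `user_log_table`, but that has not been verified here. The names `QUERY`, `TOY_ROWS`, and `average_songs_between_home_visits` are hypothetical.

```python
import sqlite3

# CTE-based query: flag Home visits, build per-user periods, count songs.
QUERY = """
WITH events AS (
    SELECT userId, page, ts,
           CASE WHEN page = 'Home' THEN 1 ELSE 0 END AS is_home
    FROM user_log_table
    WHERE page IN ('Home', 'NextSong')
),
periods AS (
    -- Running sum of Home flags, newest event first, so all songs that
    -- share a value of "period" sit between the same two Home visits.
    SELECT userId, page,
           SUM(is_home) OVER (
               PARTITION BY userId
               ORDER BY ts DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS period
    FROM events
),
song_counts AS (
    SELECT userId, period, COUNT(*) AS songs
    FROM periods
    WHERE page = 'NextSong'
    GROUP BY userId, period
)
SELECT AVG(songs) FROM song_counts
"""

# Made-up rows (userId, page, ts); not taken from sparkify_log_small.json.
TOY_ROWS = [
    ("a", "Home", 1), ("a", "NextSong", 2), ("a", "NextSong", 3),
    ("a", "Home", 4), ("a", "NextSong", 5),
    ("b", "Home", 1), ("b", "NextSong", 2), ("b", "NextSong", 3),
    ("b", "NextSong", 4),
]


def average_songs_between_home_visits(rows):
    """Load rows into an in-memory table and run the query on them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_log_table (userId TEXT, page TEXT, ts INTEGER)"
    )
    conn.executemany("INSERT INTO user_log_table VALUES (?, ?, ?)", rows)
    (avg_songs,) = conn.execute(QUERY).fetchone()
    conn.close()
    return avg_songs


# The toy periods contain 2, 1 and 3 songs, so the average is 2.0.
print(round(average_songs_between_home_visits(TOY_ROWS)))
```

Note that this sketch also counts songs played after a user's last home visit as a period of their own. Whether that matches the intended reading of the question is a judgment call.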

