# Data Wrangling with DataFrames (Q and A)

Each question below is answered twice: with an imperative approach (`Spark Data Frames`) and with a declarative approach (`Spark SQL`).


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, udf
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType

import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

spark = SparkSession \
    .builder \
    .appName("Wrangling Data") \
    .getOrCreate()
path = "data/sparkify_log_small.json"
user_log = spark.read.json(path)


# create a view to use for SQL queries for the declarative approach
user_log.createOrReplaceTempView("user_log_table")

In [2]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [3]:
user_log.show(2)

+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|Showaddywaddy|Logged In|  Kenneth|     M|          112|Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|   Lily Allen|Logged In|Elizabeth|     F|            7|   Chase|195.23873| free|Shreveport-Bossie...|   PUT|NextSong|1512718541284|     5027|      

## 1. Which page did user id "" (empty string) NOT visit?

In [4]:

page_all = user_log.select('page').dropDuplicates().sort('page')
page_all.show()


+----------------+
|            page|
+----------------+
|           About|
|       Downgrade|
|           Error|
|            Help|
|            Home|
|           Login|
|          Logout|
|        NextSong|
|   Save Settings|
|        Settings|
|Submit Downgrade|
|  Submit Upgrade|
|         Upgrade|
+----------------+



In [5]:
# filter on userId first, then keep only the page column
page_none = user_log.where(user_log.userId == "").select('page').dropDuplicates().sort('page')
page_none.show()

+-----+
| page|
+-----+
|About|
| Help|
| Home|
|Login|
+-----+



In [6]:
set(page_all.collect())-set(page_none.collect())

{Row(page='Downgrade'),
 Row(page='Error'),
 Row(page='Logout'),
 Row(page='NextSong'),
 Row(page='Save Settings'),
 Row(page='Settings'),
 Row(page='Submit Downgrade'),
 Row(page='Submit Upgrade'),
 Row(page='Upgrade')}

In [7]:
# SQL
spark.sql("""
          SELECT *
            FROM (
                SELECT DISTINCT page
                FROM user_log_table
                WHERE userId = '') AS user_pages
            RIGHT JOIN (
                SELECT DISTINCT page
                FROM user_log_table) AS all_pages
            ON user_pages.page = all_pages.page
            WHERE user_pages.page IS NULL
         """
         ).show()

+----+----------------+
|page|            page|
+----+----------------+
|null|Submit Downgrade|
|null|       Downgrade|
|null|          Logout|
|null|   Save Settings|
|null|        Settings|
|null|        NextSong|
|null|         Upgrade|
|null|           Error|
|null|  Submit Upgrade|
+----+----------------+
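
The `RIGHT JOIN ... WHERE user_pages.page IS NULL` query above is an anti-join. The same question can also be phrased as a set difference with `EXCEPT`, which Spark SQL accepts as well and which returns a single `page` column. The sketch below is only an illustration of that query shape: it assumes Python's built-in `sqlite3` as a stand-in engine and uses a handful of invented rows, not the Sparkify log.

```python
import sqlite3

# Toy stand-in for user_log_table; these rows are invented for illustration.
rows = [
    ("", "Home"),
    ("", "Login"),
    ("1046", "Home"),
    ("1046", "NextSong"),
    ("1000", "Settings"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (userId TEXT, page TEXT)")
conn.executemany("INSERT INTO user_log_table VALUES (?, ?)", rows)

# All pages seen in the log, minus the pages seen by the empty-string user.
query = """
    SELECT page FROM user_log_table
    EXCEPT
    SELECT page FROM user_log_table WHERE userId = ''
    ORDER BY page
"""
not_visited = [page for (page,) in conn.execute(query)]
print(not_visited)
```

`EXCEPT` removes duplicates on its own, so no `DISTINCT` is needed in either branch.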



## 2. How many female users do we have in the data set?

In [9]:

user_log.groupby("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|     F| 3820|
|  null|  336|
|     M| 5844|
+------+-----+



In [10]:
user_log.select('userId', 'gender').filter(user_log.gender == 'F').dropDuplicates().count()

462

In [11]:
# SQL
spark.sql(
        """
        SELECT COUNT(DISTINCT userId) AS Count
        FROM user_log_table
        WHERE gender = 'F'
        """
         ).show()

+-----+
|Count|
+-----+
|  462|
+-----+
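
The gap between 3820 (log rows with gender `F`) and 462 (distinct female users) arises because each user produces many log rows. A plain-Python analogue of the filter and `dropDuplicates` steps, using invented `(userId, gender)` events rather than the real data, may make the difference concrete:

```python
# Invented log events as (userId, gender) pairs; not taken from the Sparkify log.
events = [
    ("1", "F"), ("1", "F"),
    ("2", "M"),
    ("3", "F"), ("3", "F"), ("3", "F"),
    ("", None),
]

# Counting rows: one entry per log event, so active users are counted many times.
female_rows = [event for event in events if event[1] == "F"]

# Counting users: a set keeps each userId once, like dropDuplicates / COUNT(DISTINCT).
female_users = {user_id for user_id, gender in events if gender == "F"}

print(len(female_rows), len(female_users))
```

Here five events belong to only two distinct users, which is the same effect that separates 3820 from 462 above.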



## 3. How many songs were played from the most played artist?

In [12]:

from pyspark.sql.functions import col

# groupby().count() counts rows, so events without an artist appear as a null group
user_log.select("artist").groupby('artist').count().sort(col('count').desc()).show()

+--------------------+-----+
|              artist|count|
+--------------------+-----+
|                null| 1653|
|            Coldplay|   83|
|       Kings Of Leon|   69|
|Florence + The Ma...|   52|
|               Björk|   46|
|       Dwight Yoakam|   45|
|       Justin Bieber|   43|
|      The Black Keys|   40|
|         OneRepublic|   37|
|                Muse|   36|
|        Jack Johnson|   36|
|           Radiohead|   31|
|        Taylor Swift|   29|
|          Lily Allen|   28|
|               Train|   28|
|Barry Tuckwell/Ac...|   28|
|          Nickelback|   27|
|           Daft Punk|   27|
|           Metallica|   27|
|          Kanye West|   26|
+--------------------+-----+
only showing top 20 rows



In [13]:
user_log.select("artist").groupby('artist').count().sort(col('count').desc()).show(2)

+--------+-----+
|  artist|count|
+--------+-----+
|    null| 1653|
|Coldplay|   83|
+--------+-----+
only showing top 2 rows



In [14]:
# SQL
spark.sql(
        """
        SELECT Artist, COUNT(Artist) AS Count
        FROM user_log_table
        GROUP BY Artist
        ORDER BY Count DESC
        LIMIT 1
        """
         ).show()

+--------+-----+
|  Artist|Count|
+--------+-----+
|Coldplay|   83|
+--------+-----+
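
The DataFrame result lists a `null` artist with 1653 rows first, while the SQL query returns Coldplay directly. The reason is that `groupby('artist').count()` counts rows per group, whereas `COUNT(Artist)` counts only non-null values, so the null group gets a count of 0 and drops to the bottom of the ordering. The sketch below shows this behaviour on invented rows, assuming `sqlite3` as a stand-in for Spark SQL; both engines follow the standard rule that `COUNT(column)` skips nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_log_table (artist TEXT)")

# Invented rows: three events with no artist (e.g. page visits) and two song plays.
conn.executemany(
    "INSERT INTO user_log_table VALUES (?)",
    [(None,), (None,), (None,), ("Coldplay",), ("Coldplay",)],
)

# COUNT(*) counts every row in the group; COUNT(artist) ignores NULL values.
query = """
    SELECT artist, COUNT(*) AS row_count, COUNT(artist) AS artist_count
    FROM user_log_table
    GROUP BY artist
    ORDER BY row_count DESC
"""
result = conn.execute(query).fetchall()
print(result)
```

Ordering by `row_count` puts the null group first, as in the DataFrame output; ordering by `artist_count` would put the real artist first, as in the SQL cell above.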



## 4. How many songs does the most played artist have?

In [15]:
user_log.select(['artist','song']).filter(user_log.artist == "Coldplay").dropDuplicates().show(50)

+--------+--------------------+
|  artist|                song|
+--------+--------------------+
|Coldplay|       The Scientist|
|Coldplay|          One I Love|
|Coldplay|             Fix You|
|Coldplay|        See You Soon|
|Coldplay|     Bigger Stronger|
|Coldplay|    Strawberry Swing|
|Coldplay|A Rush Of Blood T...|
|Coldplay|      Glass Of Water|
|Coldplay|               Lost!|
|Coldplay|             Trouble|
|Coldplay|Everything's Not ...|
|Coldplay|                 Yes|
|Coldplay|              Clocks|
|Coldplay|              Shiver|
|Coldplay|Now My Feet Won't...|
|Coldplay|          I Ran Away|
|Coldplay|              Yellow|
|Coldplay|    Til Kingdom Come|
|Coldplay| Life In Technicolor|
|Coldplay|         In My Place|
|Coldplay|            Daylight|
|Coldplay|God Put A Smile U...|
|Coldplay|       White Shadows|
+--------+--------------------+



In [16]:
# filter on artist first, then count the distinct song titles
user_log.filter(user_log.artist == "Coldplay").select('song').dropDuplicates().count()

24

In [17]:
# SQL
spark.sql(
        """
        SELECT Artist, COUNT(Artist) AS PlayCount, COUNT(DISTINCT song) AS SongCount
        FROM user_log_table
        GROUP BY Artist
        ORDER BY PlayCount DESC
        LIMIT 1
        """
         ).show()

+--------+---------+---------+
|  Artist|PlayCount|SongCount|
+--------+---------+---------+
|Coldplay|       83|       24|
+--------+---------+---------+



**Both (1) `Spark Data Frames` and (2) `Spark SQL` are part of the Spark SQL library. I prefer SQL over Data Frames for two reasons: the syntax is clearer and can be shared with a wider community, since many analysts and data scientists already use SQL, and Spark automatically optimizes the query to speed up manipulating and retrieving data. Spark Data Frames, on the other hand, give more control, such as breaking a query down into smaller steps, which can make debugging easier.**