# SQL Options in PySpark with Google PlayStore Data
* Notebook by Adam Lang
* Date: 4/8/2025

# Overview
* This notebook continues exploring Spark SQL options in PySpark, using a dataset from the Google Play Store.



## Setup Spark Session
* Before we can run any Spark SQL, we need to create a Spark session.

In [None]:
## imports
import pyspark
from pyspark.sql import SparkSession

In [None]:
## create spark objects
spark = SparkSession.builder.appName("SparkSQL_techniques").getOrCreate()


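## get cores (note: this counts the executors registered with the SparkContext, used here as a stand-in for the core count)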
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print(f"Cores you are working with are: {cores} core(s)")
spark

Cores you are working with are: 1 core(s)


## Load Data
* For this notebook we will use a dataset from the Google Play Store
* About this dataset:
  * Contains a list of Google Play Store Apps and info about the apps like the category, rating, reviews, size, etc.

**Source:** https://www.kaggle.com/lava18/google-play-store-apps

In [None]:
## set path to data
data_path = '/content/drive/MyDrive/Colab Notebooks/PySpark Data Science/Spark_Dataframes/'

In [None]:
## load data
google_data = spark.read.csv(data_path+"googleplaystore.csv",
                            header=True,
                            inferSchema=True)


## Exploratory Data Analysis

Let's check out the first few rows of the DataFrame to see what we are working with.

In [None]:
## view data
google_data.limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Let's also print the schema to check whether the column types were inferred correctly.

In [None]:
## schema
google_data.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



Summary
* Every column was inferred as a string, so some of the data types need to be changed.
* For now we will cast:
  * Rating to float, and Reviews and Price to integer. Size and Installs need more cleaning first, so they stay as strings for now.
  


In [None]:
## data type transformations
from pyspark.sql.types import IntegerType, FloatType

## create new df with dtype transformations
## (note: IntegerType cannot represent fractional prices)
new_df = google_data.withColumn("Rating", google_data["Rating"].cast(FloatType())) \
            .withColumn("Reviews", google_data["Reviews"].cast(IntegerType())) \
            .withColumn("Price", google_data["Price"].cast(IntegerType()))

## view new_df (printSchema() prints the schema itself and returns None, so it is not wrapped in print())
new_df.printSchema()
new_df.limit(5).toPandas()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: float (nullable = true)
 |-- Reviews: integer (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: integer (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Looks like that worked! Great! Let's dig in.

## 1. Create Tempview
* Let's register the DataFrame as a temporary view so we can query it with Spark SQL.

In [None]:
## tempview
new_df.createOrReplaceTempView('tempview')

## 2. Select all apps with ratings above 4.1
* Using the tempview, let's select all apps with ratings above 4.1.

In [None]:
## SQL query
spark.sql('SELECT * FROM tempview WHERE Rating > 4.1').limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
1,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
2,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
3,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
4,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up


## 3. Pass results to an object -- create a spark dataframe
* Select just the App and Rating columns where the Category is `COMICS` and the Rating is above 4.5, and store the result as a Spark DataFrame.

In [None]:
## spark df
app_df = spark.sql("SELECT App, Rating FROM tempview WHERE Category = 'COMICS' AND Rating > 4.5 ORDER BY Rating ASC")
app_df.limit(5).toPandas()

Unnamed: 0,App,Rating
0,Manga Master - Best manga & comic reader,4.6
1,Manga - read Thai translation,4.6
2,"Laftel - Watching and Announcing Snooping, Str...",4.6
3,Children's cartoons (Mithu-Mina-Raju),4.6
4,Faustop Sounds,4.7


## 4. Which category has the most cumulative reviews
* We will only select the one category with the most reviews.
* *Note: this requires adding up all the reviews for each category.*

In [20]:
## group by query
spark.sql('SELECT Category, sum(Reviews) as Total_Reviews FROM tempview GROUP BY Category ORDER BY Total_Reviews DESC').limit(1).toPandas()

Unnamed: 0,Category,Total_Reviews
0,GAME,1585422349


Summary
* `GAME` is the top Category, with 1,585,422,349 total reviews.

## 5. Which App has the most reviews?
* We will display ONLY the top result.
* We will include only the App column and the Reviews column.

In [22]:
## spark sql query
spark.sql("SELECT App, Reviews FROM tempview ORDER BY Reviews DESC").show(1)

+--------+--------+
|     App| Reviews|
+--------+--------+
|Facebook|78158306|
+--------+--------+
only showing top 1 row



## 6. Select all apps that contain the word 'dating' anywhere in the title


In [25]:
## all apps that contain 'dating'
spark.sql("SELECT * FROM tempview WHERE App LIKE '%dating%'").limit(5).toPandas()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,"Meet, chat & date. Free dating app - Chocolate...",DATING,3.9,8661,9.5M,"1,000,000+",Free,0,Mature 17+,Dating,"April 3, 2018",0.1.11,4.0 and up
1,Friend Find: free chat + flirt dating app,DATING,,23,11M,100+,Free,0,Mature 17+,Dating,"July 31, 2018",1.0,4.4 and up
2,Spine- The dating app,DATING,5.0,5,9.3M,500+,Free,0,Teen,Dating,"July 14, 2018",4.0,4.0.3 and up
3,Princess Closet : Otome games free dating sim,FAMILY,4.5,29495,56M,"1,000,000+",Free,0,Teen,Simulation,"May 24, 2018",1.11.0,4.0.3 and up
4,happn – Local dating app,LIFESTYLE,4.3,1118201,Varies with device,"10,000,000+",Free,0,Mature 17+,Lifestyle,"July 24, 2018",Varies with device,Varies with device


## 7. Use SQL Transformer to display how many free apps there are in this list

In [26]:
## import SQL transformer
from pyspark.ml.feature import SQLTransformer

## create transformer object
sqlTrans = SQLTransformer(
    statement="SELECT count(*) FROM __THIS__ WHERE Type = 'Free'"
)
sqlTrans.transform(google_data).show()

+--------+
|count(1)|
+--------+
|   10037|
+--------+



## 8. What is the most popular Genre?
* Let's see which genre appears most often in the DataFrame. The first row is the most popular genre; we show the top 10 for context.

In [29]:
## create transformer object
sqlTrans = SQLTransformer(
    statement="SELECT Genres, count(*) as Total FROM __THIS__ GROUP BY Genres ORDER BY Total DESC"
)
sqlTrans.transform(google_data).show(10)

+---------------+-----+
|         Genres|Total|
+---------------+-----+
|          Tools|  842|
|  Entertainment|  623|
|      Education|  549|
|        Medical|  463|
|       Business|  460|
|   Productivity|  424|
|         Sports|  398|
|Personalization|  392|
|  Communication|  387|
|      Lifestyle|  381|
+---------------+-----+
only showing top 10 rows



## 9. Select all the apps in the 'Tools' genre that have more than 100 reviews

In [28]:
## create transformer object
sqlTrans = SQLTransformer(
    statement="SELECT App, Reviews FROM __THIS__ WHERE Genres = 'Tools' AND Reviews > 100"
)
sqlTrans.transform(google_data).show(10)

+--------------------+--------+
|                 App| Reviews|
+--------------------+--------+
|   Moto File Manager|   38655|
|              Google| 8033493|
|    Google Translate| 5745093|
|        Moto Display|   18239|
|      Motorola Alert|   24199|
|     Motorola Assist|   37333|
|Cache Cleaner-DU ...|12759663|
|  Moto Suggestions ™|     308|
|          Moto Voice|   33216|
|          Calculator|   40770|
+--------------------+--------+
only showing top 10 rows

