# Table of Contents
* [Importing of PySpark Libraries](#importandread)
* [Statistics about dataset and Spark Dataframe Preview](#statsanddf)
* [Data Cleaning](#datac)
* [PySpark SQL Queries](#pysparksql)
    * [Preview of data in SQL tables format](#sqltbl)
    * [Top 10 Applications Ranked by Total Number of Reviews](#top10reviews)
    * [Top 10 Applications Ranked by Type (Paid or Free)](#top10type)
    * [Distribution of Applications Categories by Total Number of Installs](#distcat)
    * [Top Paid Applications](#toppaid)

## Importing of PySpark Libraries <a class="anchor" id="importandread"></a>

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Named imports avoid shadowing Python built-ins such as sum() and max().
from pyspark.sql.functions import col, regexp_replace

In [None]:
# `spark` is the SparkSession that Databricks predefines in each notebook.
df = spark.read.load('/FileStore/tables/googleplaystore.csv', format='csv', sep=',',
                     header='true', escape='"', inferSchema='true')

## Statistics about dataset and Spark Dataframe Preview <a class="anchor" id="statsanddf"></a>

In [None]:
df.count()

Out[4]: 10841

In [None]:
df.show()

+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|                 App|      Category|Rating|Reviews|Size|   Installs|Type|Price|Content Rating|              Genres|      Last Updated|       Current Ver| Android Ver|
+--------------------+--------------+------+-------+----+-----------+----+-----+--------------+--------------------+------------------+------------------+------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159| 19M|    10,000+|Free|    0|      Everyone|        Art & Design|   January 7, 2018|             1.0.0|4.0.3 and up|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967| 14M|   500,000+|Free|    0|      Everyone|Art & Design;Pret...|  January 15, 2018|             2.0.0|4.0.3 and up|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510|8.7M| 5,000,000+|Free|    0|      Everyone|        Art & Design|    August 1, 2018|             1.2.4|4.0.3 

In [None]:
df.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Size: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Content Rating: string (nullable = true)
 |-- Genres: string (nullable = true)
 |-- Last Updated: string (nullable = true)
 |-- Current Ver: string (nullable = true)
 |-- Android Ver: string (nullable = true)



## Data Cleaning <a class="anchor" id="datac"></a>

In [None]:
# Drop the columns that the queries below do not use.
df = df.drop("Size", "Content Rating", "Last Updated", "Android Ver", "Current Ver")

In [None]:
df.show(5)

+--------------------+--------------+------+-------+-----------+----+-----+--------------------+
|                 App|      Category|Rating|Reviews|   Installs|Type|Price|              Genres|
+--------------------+--------------+------+-------+-----------+----+-----+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|    10,000+|Free|    0|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|   500,000+|Free|    0|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510| 5,000,000+|Free|    0|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|50,000,000+|Free|    0|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|   100,000+|Free|    0|Art & Design;Crea...|
+--------------------+--------------+------+-------+-----------+----+-----+--------------------+
only showing top 5 rows



In [None]:
df.printSchema()

root
 |-- App: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Rating: double (nullable = true)
 |-- Reviews: string (nullable = true)
 |-- Installs: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Genres: string (nullable = true)



In [None]:
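from pyspark.sql.functions import regexp_replace, col

# Convert the string columns to numbers.
# Installs: strip everything except digits ("10,000+" -> "10000") before casting.
# Price: strip the dollar sign before casting; IntegerType discards any fractional part.
df = df.withColumn("Reviews", col("Reviews").cast(IntegerType()))\
    .withColumn("Installs", regexp_replace(col("Installs"), "[^0-9]", ""))\
    .withColumn("Installs", col("Installs").cast(IntegerType()))\
    .withColumn("Price", regexp_replace(col("Price"), "[$]", ""))\
    .withColumn("Price", col("Price").cast(IntegerType()))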
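
The two regular expressions in the cell above can be tried outside Spark. The following is a minimal plain-Python sketch, not the notebook's own code: the helper names `clean_installs` and `clean_price` are hypothetical, and returning `None` for a value with no digits is an assumption meant to resemble a failed cast producing a null. The price helper returns a float so that the effect of a later integer conversion is visible.

```python
import re
from typing import Optional


def clean_installs(raw: str) -> Optional[int]:
    """Keep only the digits, e.g. '10,000+' -> 10000 (hypothetical helper)."""
    digits = re.sub(r"[^0-9]", "", raw)
    # Assumption: a value without any digits maps to None, like a null.
    return int(digits) if digits else None


def clean_price(raw: str) -> float:
    """Remove the dollar sign, e.g. '$399.99' -> 399.99 (hypothetical helper)."""
    return float(re.sub(r"[$]", "", raw))


samples = ["10,000+", "500,000+", "50,000,000+"]
cleaned = [clean_installs(s) for s in samples]
print(cleaned)  # [10000, 500000, 50000000]

price = clean_price("$399.99")
print(price, int(price))  # 399.99 399 (the integer conversion drops the cents)
```

If the cents matter for an analysis, casting the price to a floating-point or decimal type instead of an integer would preserve them.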

In [None]:
df.show(5)

+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|                 App|      Category|Rating|Reviews|Installs|Type|Price|              Genres|
+--------------------+--------------+------+-------+--------+----+-----+--------------------+
|Photo Editor & Ca...|ART_AND_DESIGN|   4.1|    159|   10000|Free|    0|        Art & Design|
| Coloring book moana|ART_AND_DESIGN|   3.9|    967|  500000|Free|    0|Art & Design;Pret...|
|U Launcher Lite –...|ART_AND_DESIGN|   4.7|  87510| 5000000|Free|    0|        Art & Design|
|Sketch - Draw & P...|ART_AND_DESIGN|   4.5| 215644|50000000|Free|    0|        Art & Design|
|Pixel Draw - Numb...|ART_AND_DESIGN|   4.3|    967|  100000|Free|    0|Art & Design;Crea...|
+--------------------+--------------+------+-------+--------+----+-----+--------------------+
only showing top 5 rows



## PySpark SQL Queries <a class="anchor" id="pysparksql"></a>

In [None]:
df.createOrReplaceTempView("Apps")

### Preview of data in SQL tables format <a class="anchor" id="sqltbl"></a>

In [None]:
%sql select * from Apps

App,Category,Rating,Reviews,Installs,Type,Price,Genres
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,10000,Free,0,Art & Design
Coloring book moana,ART_AND_DESIGN,3.9,967,500000,Free,0,Art & Design;Pretend Play
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,5000000,Free,0,Art & Design
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,50000000,Free,0,Art & Design
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,100000,Free,0,Art & Design;Creativity
Paper flowers instructions,ART_AND_DESIGN,4.4,167,50000,Free,0,Art & Design
Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,50000,Free,0,Art & Design
Infinite Painter,ART_AND_DESIGN,4.1,36815,1000000,Free,0,Art & Design
Garden Coloring Book,ART_AND_DESIGN,4.4,13791,1000000,Free,0,Art & Design
Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,10000,Free,0,Art & Design;Creativity


### Top 10 Applications Ranked by Total Number of Reviews <a class="anchor" id="top10reviews"></a>

In [None]:
%sql select App,sum(Reviews) from Apps
group by 1
order by 2 desc
limit 10

App,sum(Reviews)
Instagram,266241989
WhatsApp Messenger,207348304
Clash of Clans,179558781
Messenger – Text and Video Chat for Free,169932272
Subway Surfers,166331958
Candy Crush Saga,156993136
Facebook,156286514
8 Ball Pool,99386198
Clash Royale,92530298
Snapchat,68045010
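
In the query above, `group by 1` and `order by 2` refer to the first and second columns of the select list. As a rough illustration, the same shape of query is sketched below with the columns named explicitly. It runs against Python's built-in `sqlite3` module instead of Spark, and the rows are hypothetical sample data, not values from the dataset.

```python
import sqlite3

# Hypothetical sample rows (not taken from the dataset).
rows = [
    ("App A", 100),
    ("App B", 250),
    ("App A", 100),  # a repeated row for the same app
    ("App C", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apps (App TEXT, Reviews INTEGER)")
conn.executemany("INSERT INTO Apps VALUES (?, ?)", rows)

# Same pattern as the notebook query, with column names instead of positions.
top = conn.execute(
    """
    SELECT App, SUM(Reviews) AS total_reviews
    FROM Apps
    GROUP BY App
    ORDER BY total_reviews DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(top)  # [('App B', 250), ('App A', 200), ('App C', 50)]
```

Note that `SUM` adds every row in a group, so an app that appears in more than one row has its review counts added together, as `App A` does in this sample.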



### Top 10 Applications Ranked by Type (Paid or Free) <a class="anchor" id="top10type"></a>

In [None]:
%sql select App,Type,sum(Installs) from Apps
group by 1,2
order by 3 desc
limit 10

App,Type,sum(Installs)
Subway Surfers,Free,6000000000.0
Instagram,Free,4000000000.0
Google Drive,Free,4000000000.0
Hangouts,Free,4000000000.0
Google Photos,Free,4000000000.0
Google News,Free,4000000000.0
Candy Crush Saga,Free,3500000000.0
WhatsApp Messenger,Free,3000000000.0
Gmail,Free,3000000000.0
Temple Run 2,Free,3000000000.0


### Distribution of Applications Categories by Total Number of Installs <a class="anchor" id="distcat"></a>

In [None]:
%sql select Category,sum(Installs) from Apps
group by 1
order by 2 desc

Category,sum(Installs)
GAME,35086024415.0
COMMUNICATION,32647276251.0
PRODUCTIVITY,14176091369.0
SOCIAL,14069867902.0
TOOLS,11452771915.0
FAMILY,10258263505.0
PHOTOGRAPHY,10088247655.0
NEWS_AND_MAGAZINES,7496317760.0
TRAVEL_AND_LOCAL,6868887146.0
VIDEO_PLAYERS,6222002720.0


### Top Paid Applications <a class="anchor" id="toppaid"></a>

In [None]:
%sql select App,sum(Price) from Apps
where Type='Paid'
group by 1
order by 2 desc

App,sum(Price)
I'm Rich - Trump Edition,400
I am Rich Plus,399
I AM RICH PRO PLUS,399
I'm Rich/Eu sou Rico/أنا غني/我很有錢,399
I Am Rich Premium,399
most expensive app (H),399
I Am Rich Pro,399
I am rich(premium),399
I am Rich,399
I am Rich!,399
