# I. Setting up PySpark on Colab

Links I referenced:

*   https://medium.com/grabngoinfo/install-pyspark-3-on-google-colab-the-easy-way-577ec4a2bcd8
*   https://towardsdatascience.com/pyspark-on-google-colab-101-d31830b238be



### 1 &nbsp; Install Java, since Spark is written in Scala and requires a Java Virtual Machine (JVM)

In [10]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

### 2 &nbsp; Download and unzip **Apache Spark** from https://spark.apache.org/downloads.html

In [15]:
!wget -q https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
!tar xf spark-3.3.1-bin-hadoop3.tgz

### 3 &nbsp; Set up the environment for Spark

In [16]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
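
If the Java package or the Spark archive ends up in a different location, later steps fail with errors that are hard to trace back to a wrong path. A minimal sketch of a check that both variables point at real directories, using only the standard library (the helper name `missing_dirs` is hypothetical, not part of any library):

```python
import os


def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point at an existing directory."""
    return [name for name in var_names if not os.path.isdir(environ.get(name, ""))]


# Hypothetical helper: report any variable whose directory is not there.
problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before calling findspark.init():", problems)
```

An empty list means both directories exist, so `findspark.init()` should be able to pick up `SPARK_HOME`.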

### 4 &nbsp; Find Spark

In [17]:
!pip install -q findspark
import findspark
findspark.init()
findspark.find()

'/content/spark-3.3.1-bin-hadoop3'

### 5 &nbsp; Create and check a Spark session

In [18]:
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

# II. Sample EDA with Spark

In [22]:
!wget --continue https://raw.githubusercontent.com/GarvitArya/pyspark-demo/main/sample_books.json -O /tmp/sample_books.json

--2022-11-28 23:17:51--  https://raw.githubusercontent.com/GarvitArya/pyspark-demo/main/sample_books.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1565 (1.5K) [text/plain]
Saving to: ‘/tmp/sample_books.json’


2022-11-28 23:17:51 (14.2 MB/s) - ‘/tmp/sample_books.json’ saved [1565/1565]



In [23]:
sample_books = spark.read.json("/tmp/sample_books.json")

In [24]:
sample_books.printSchema()

root
 |-- author: string (nullable = true)
 |-- edition: string (nullable = true)
 |-- price: double (nullable = true)
 |-- title: string (nullable = true)
 |-- year_written: long (nullable = true)



In [25]:
sample_books.show(4)

+---------------+--------------+-----+----------------+------------+
|         author|       edition|price|           title|year_written|
+---------------+--------------+-----+----------------+------------+
|   Austen, Jane|       Penguin| 18.2|Northanger Abbey|        1814|
|   Tolstoy, Leo|       Penguin| 12.7|   War and Peace|        1865|
|   Tolstoy, Leo|       Penguin| 13.5|   Anna Karenina|        1875|
|Woolf, Virginia|Harcourt Brace| 25.0|   Mrs. Dalloway|        1925|
+---------------+--------------+-----+----------------+------------+
only showing top 4 rows



In [28]:
sample_books.count()

13

### Select books with `price > 17`, SQL style

In [32]:
sample_books_over_17 = sample_books.filter("price > 17")

In [33]:
sample_books_over_17.show(100)

+---------------+--------------+-----+-------------------+------------+
|         author|       edition|price|              title|year_written|
+---------------+--------------+-----+-------------------+------------+
|   Austen, Jane|       Penguin| 18.2|   Northanger Abbey|        1814|
|Woolf, Virginia|Harcourt Brace| 25.0|      Mrs. Dalloway|        1925|
|Woolf, Virginia|       Penguin| 29.0|A Room of One's Own|        1922|
|  Rowling, J.K.|Harcourt Brace|19.95|       Harry Potter|        2000|
|  Tolkien, J.R.|       Penguin|27.45|  Lord of the Rings|        1937|
+---------------+--------------+-----+-------------------+------------+



### Use an aggregate function in PySpark

In [35]:
from pyspark.sql.functions import avg

In [42]:
# agg() returns a one-row DataFrame; collect()[0][0] extracts the average as a Python float
avg_price = sample_books.agg(avg("price")).collect()[0][0]

In [45]:
sample_books_above_avg_price = sample_books.filter(sample_books.price > avg_price)
sample_books_above_avg_price.show()

+---------------+--------------+-----+-------------------+------------+
|         author|       edition|price|              title|year_written|
+---------------+--------------+-----+-------------------+------------+
|   Austen, Jane|       Penguin| 18.2|   Northanger Abbey|        1814|
|Woolf, Virginia|Harcourt Brace| 25.0|      Mrs. Dalloway|        1925|
|Woolf, Virginia|       Penguin| 29.0|A Room of One's Own|        1922|
|  Rowling, J.K.|Harcourt Brace|19.95|       Harry Potter|        2000|
|  Tolkien, J.R.|       Penguin|27.45|  Lord of the Rings|        1937|
+---------------+--------------+-----+-------------------+------------+

