# What is Spark?

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. (Source: Wikipedia)

# What is PySpark?
PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. (Source: PySpark docs)

# Getting the data

For this exercise we will use the Shakespeare dataset (`shakespeare.txt`).

In [1]:
!wget https://raw.githubusercontent.com/garyongguanjie/learning-pyspark/main/data/shakespeare.txt

--2022-06-23 06:14:21--  https://raw.githubusercontent.com/garyongguanjie/learning-pyspark/main/data/shakespeare.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94275 (92K) [text/plain]
Saving to: ‘shakespeare.txt’


2022-06-23 06:14:21 (4.90 MB/s) - ‘shakespeare.txt’ saved [94275/94275]



# Install pyspark

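Recent PySpark releases on PyPI bundle Apache Spark itself, so `pip install pyspark` installs both. A Java runtime is still required.

Note: older code often uses `SparkContext` as the entry point instead of `SparkSession`. `SparkSession`, introduced in Spark 2.0, wraps a `SparkContext` (available as `spark.sparkContext`), so the two are used in similar ways.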

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 46 kB/s 
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 47.4 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=871f5bbaf9118ef0fdce5b93ee07100e990f3bf89dfd1c962ed4d5738205c2f5
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


# Spark Basics

Spark has four types of APIs:

1. RDD
    - The oldest API, used for lower-level transformations (e.g. on unstructured data).
2. DataFrame
    - A newer API that is similar to Python's pandas.
    - Used for tabular datasets.
3. Dataset
    - Not available in Python.
    - Can be thought of as a strongly typed DataFrame.
4. pandas API on Spark
    - Tries to emulate the pandas API in PySpark so that it is familiar to data scientists.

Occasionally you may also see pandas-like syntax in the DataFrame API.

Most people write PySpark in a Scala-like style, so you will more often see `df.filter(df.someColumn == 1)` than `df = df[df.someColumn == 1]`. The latter form is unfamiliar to Scala developers, and Scala is the language Spark itself is written in.

# Spark hello world

Almost every big data course starts with a word count. Here we implement it with both the RDD and the DataFrame APIs.

In [3]:
from pyspark.sql import SparkSession 

In [4]:
# Create SparkSession
spark = SparkSession.builder.appName('wordcount').getOrCreate()
# The SparkContext is the entry point for the RDD API
sc = spark.sparkContext

## RDD


In [12]:
lines = sc.textFile("shakespeare.txt")
words = lines.flatMap(lambda s: s.split(" "))
counts = words.map(lambda x:(x,1))
wordCounts = counts.reduceByKey(lambda a, b: a + b)

`wordCounts` (like `counts`) is a PySpark `PipelinedRDD`, a subclass of `RDD`. You can see the RDD documentation here (note that this link points to the old 0.8.0 docs):

 https://spark.apache.org/docs/0.8.0/api/pyspark/pyspark.rdd.RDD-class.html

In [13]:
type(wordCounts)

pyspark.rdd.PipelinedRDD

In [14]:
wordCounts.take(10)

[('', 559),
 ('Shakespeare', 1),
 ('fairest', 5),
 ('creatures', 2),
 ('we', 14),
 ('increase,', 4),
 ('That', 83),
 ('thereby', 1),
 ("beauty's", 17),
 ('rose', 3)]
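The `('', 559)` entry most likely comes from blank lines and from repeated or trailing spaces, because `s.split(" ")` splits on every single space. The plain-Python sketch below illustrates this with a made-up line; it does not read the actual file.

```python
line = "From  fairest creatures "  # made-up line: double space and a trailing space
blank = ""                          # a blank line, as it would be read from the file

tokens_single_space = line.split(" ")
print(tokens_single_space)  # ['From', '', 'fairest', 'creatures', '']
print(blank.split(" "))     # ['']

# With no argument, split() splits on runs of whitespace and drops empty tokens.
tokens_whitespace = line.split()
print(tokens_whitespace)    # ['From', 'fairest', 'creatures']
print(blank.split())        # []
```

If the empty token is unwanted, one option is to pass `lambda s: s.split()` to `flatMap` instead.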

In [15]:
wordCounts = wordCounts.sortBy(lambda x:-x[1])

In [16]:
wordCounts.take(10)

[('', 559),
 ('my', 358),
 ('the', 354),
 ('of', 347),
 ('I', 335),
 ('to', 330),
 ('in', 286),
 ('thy', 258),
 ('and', 248),
 ('And', 242)]

## Dataframe API

According to the PySpark docs, this is the DataFrame (or 'untyped' Dataset) API.
Its operations, such as `select`, `groupBy` and `sort`, resemble SQL.

In [17]:
import pyspark.sql.functions as F

In [18]:
textFile = spark.read.text("shakespeare.txt")

`textFile` is a PySpark DataFrame, which is similar to a pandas DataFrame.

You can see the documentation here:

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html

In [19]:
type(textFile)

pyspark.sql.dataframe.DataFrame

Show the first rows of the DataFrame (by default `show()` prints 20 rows and truncates long strings):

In [43]:
textFile.show()

+--------------------+
|               value|
+--------------------+
|         THE SONNETS|
|                    |
|by William Shakes...|
|                    |
|From fairest crea...|
|That thereby beau...|
|But as the riper ...|
|His tender heir m...|
|But thou contract...|
|Feed'st thy light...|
|Making a famine w...|
|Thy self thy foe,...|
|Thou that art now...|
|And only herald t...|
|Within thine own ...|
|And tender churl ...|
|Pity the world, o...|
|To eat the world'...|
|                    |
|When forty winter...|
+--------------------+
only showing top 20 rows



You can also convert it to a pandas DataFrame, but this is not advisable for large data: `toPandas()` collects every row into the memory of the local notebook runtime (the driver). If you need pandas, reduce the data first, for example with `limit()`.

In [46]:
textFile.toPandas() # Not advisable

      value
0     THE SONNETS
1
2     by William Shakespeare
3
4     From fairest creatures we desire increase,
...   ...
2464  Came there for cure and this by that I prove,
2465  Love's fire heats water, water cools not love.
2466
2467


We can use SQL-like functions here to count the words:

In [56]:
wordCounts = (
    textFile.select(F.explode(F.split(textFile.value, r"\s+")).alias("word"))
    .groupBy("word").count()
    .sort("count", ascending=False)
)

In [61]:
type(wordCounts)

pyspark.sql.dataframe.DataFrame

In [62]:
wordCounts.take(10)

[Row(word='', count=428),
 Row(word='my', count=358),
 Row(word='the', count=354),
 Row(word='of', count=347),
 Row(word='I', count=335),
 Row(word='to', count=330),
 Row(word='in', count=286),
 Row(word='thy', count=258),
 Row(word='and', count=248),
 Row(word='And', count=242)]

In [66]:
%%time
wordCounts.count()

CPU times: user 3.33 ms, sys: 18 µs, total: 3.35 ms
Wall time: 311 ms


4579

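Cache the DataFrame to avoid recomputation. The data is cached in the memory of the Spark executors (the cluster), not in the notebook runtime. Note that `cache()` is lazy: the data is only stored when the next action runs.

This is useful for DataFrames that are reused, such as lookup tables for joins.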

In [70]:
wordCounts.cache() 

DataFrame[word: string, count: bigint]

In [72]:
%%time
wordCounts.count()  # still takes time, most likely because cache() is lazy and this action is what computes and stores the result

CPU times: user 2.72 ms, sys: 0 ns, total: 2.72 ms
Wall time: 301 ms


4579