This notebook was taken from [here](https://github.com/MoranReznik/PySpark-Reference-Notebook/blob/main/PySpark%20Tutorial.ipynb) and edited. It is the accompanying notebook to Moran Reznik's excellent YouTube video crash course on PySpark [here](https://youtu.be/cZS5xYYIPzk?si=eUfy8cPsKZiJKFN4).

# What is PySpark?
PySpark is the Python API for Apache Spark. I will first explain what I mean by a "Python API", and then what Apache Spark itself is.

What I mean by **'Python API'** is that you can use the syntax and agility of Python to interact with and send commands to a system that is not based, at its core, on Python. 

With PySpark, you interact with Apache Spark - a system designed for analyzing, modeling and working with immense amounts of data across many computers at the same time. Put differently, Apache Spark lets you run computations in parallel instead of sequentially: it divides one very large task into many smaller tasks and runs each of them on a different machine. This lets you finish an analysis in a reasonable time, even when it would not have been feasible on a single machine.

As a rule of thumb, data is a good fit for PySpark when it would not fit in a single machine's permanent storage (let alone its RAM).

**Important relevant concepts:** 
1. Distributed computing: splitting a task into several smaller tasks that run at the same time. This is what PySpark lets you do across many machines; the same idea can also be applied on a single machine, for example with several threads.
2. Cluster: A network of machines that can take on tasks from a user, interact with one another and return results. These provide the computing resources that PySpark will use to make the computations.
3. Resilient Distributed Dataset (RDD): an immutable, distributed collection of data. Unlike the DataFrames we will work with later, an RDD is not tabular and has no schema. For tabular data wrangling, DataFrames therefore offer more API options and more under-the-hood optimizations. Still, you may encounter RDDs as you learn more about Spark, so you should be aware of them.
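
The split-then-combine idea from item 1 can be sketched on a single machine with nothing but the Python standard library. This is only an illustration, not how Spark itself is implemented: the helper names `chunked` and `sum_of_squares` are made up for this sketch, and because of Python's global interpreter lock the threads here mostly demonstrate the pattern rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    """The small task that each worker runs on its own piece of the data."""
    return sum(x * x for x in chunk)

data = list(range(1_000))
pieces = chunked(data, n_chunks=4)

# Each piece is handed to a worker thread; the partial results come back in order
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, pieces))

# Combine the partial results into the final answer
total = sum(partial_sums)
print(total)
```

Spark follows the same overall shape (partition the data, run a task per partition, combine the results), but spreads the partitions over the machines of a cluster.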

**Parts of PySpark we will cover:**
1. PySpark SQL - contains commands for data processing and manipulation.
2. PySpark MLlib - includes a variety of models, model training and related commands.

**Spark Architecture:**
To send commands to a cluster and receive results from it, you need to start a Spark 'session'. This object is your gateway for interacting with Spark. Each user of the cluster has their own Spark session, which lets them use the cluster in isolation from other users. All sessions communicate with a Spark 'context', which lives in the driver program. The driver acts as the master node of your application - that is, it splits the work into tasks, assigns them to the computers in the cluster and coordinates them. Each computer in the cluster that performs tasks for a master node is called a 'worker' node. Before the master node can use a worker node, a cluster 'manager', which is responsible for distributing the cluster's resources, must allocate that node's compute power to it. Inside each worker node, executor processes run the tasks - an executor can run multiple tasks simultaneously and has its own cache for storing results. So each master node can have multiple worker nodes, each of which can in turn run multiple tasks.

In [2]:
# a SparkSession object can perform the most common data processing tasks
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate() # Will return existing session if one was
                                                           # created before and was not closed


In [343]:
spark


**dataset:**
https://www.kaggle.com/fedesoriano/heart-failure-prediction

In [3]:
# Read csv, all columns will be of type string
df = spark.read.option('header','true').csv('heart.csv')

# Tell PySpark the column types up front - saves time on large datasets. Note: there are other ways to do this.
schema = 'Age INTEGER, Sex STRING, ChestPainType STRING'
df = spark.read.csv('heart.csv', schema=schema, header=True)

# Let PySpark infer the schema
df = spark.read.csv('heart.csv', inferSchema=True, header=True)

# Treat the string 'NA' as null at reading time (`nullValue` names the text that stands for a missing value)
df = spark.read.csv('heart.csv', nullValue='NA', inferSchema=True, header=True)

# Save data. Note: Spark writes a directory named heart_save.csv that contains part files, not a single file
df.write.format("csv").save("heart_save.csv")

# If you want to overwrite the file
df.write.format("csv").mode("overwrite").save("heart_save.csv")


In [345]:
# Show head of table
df.show(3)


+---+---+-------------+
|Age|Sex|ChestPainType|
+---+---+-------------+
| 40|  M|          ATA|
| 49|  F|          NAP|
| 37|  M|          ATA|
+---+---+-------------+
only showing top 3 rows



In [346]:
# Count number of rows
df.count()


918

In [347]:
# Show parts of the table
df.select('Age').show(3)
df.select(['Age','Sex']).show(3)


+---+
|Age|
+---+
| 40|
| 49|
| 37|
+---+
only showing top 3 rows

+---+---+
|Age|Sex|
+---+---+
| 40|  M|
| 49|  F|
| 37|  M|
+---+---+
only showing top 3 rows



## Pandas DataFrame vs. PySpark DataFrame

Both represent a table of data with rows and columns. Under the hood, however, they are different, because a PySpark DataFrame needs to support distributed computations. As we move forward, we will see more and more of its features that are not present in Pandas DataFrames. That being said - if you know how to use Pandas, then moving to PySpark will feel like a natural transition.

## DAG
A Directed Acyclic Graph (DAG) is how Spark represents a computation. When you give it a series of transformations to apply to a dataset, it builds a graph out of those transformations, so it knows what to do. But it does not execute those commands immediately if it does not have to. Rather, it is lazy - it walks through the DAG and applies the transformations only when it must, in order to provide the required results. This allows better performance, since Spark knows what lies ahead of a given computation and can optimize the whole process accordingly.

## Transformations vs. Actions
In PySpark, there are two types of commands: transformations and actions. Transformation commands (such as `select` or `filter`) are added to the DAG, but do not get executed. They transform one DataFrame into another without changing the input DataFrame. Actions (such as `show`, `count` or `collect`), on the other hand, make PySpark execute the DAG but do not create a new DataFrame - instead, they output the result of the DAG.
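
To make the distinction concrete, here is a toy sketch in plain Python. It is not Spark code and not how Spark is implemented: `LazyRows` is a hypothetical class invented for this illustration, and its "plan" is a simple linear list of steps, which is the simplest possible DAG. The point is only that transformations record work, while an action performs it.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows   # the source data
        self._plan = plan   # recorded transformations, in order

    # Transformations: return a new object with a longer plan; nothing runs yet
    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyRows(self._rows, self._plan + (("map", func),))

    # Action: walk the plan and produce a concrete result
    def collect(self):
        rows = iter(self._rows)
        for kind, func in self._plan:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)


calls = []  # records every time the mapped function actually runs

def add_one(age):
    calls.append(age)
    return age + 1

ages = LazyRows([40, 49, 37, 12])
adults = ages.filter(lambda age: age > 18)  # transformation
fixed = adults.map(add_one)                 # transformation

print(len(calls))         # 0: nothing has been computed so far
result = fixed.collect()  # action: the plan is executed now
print(result)             # [41, 50, 38]
print(ages.collect())     # [40, 49, 37, 12]: the input was not changed
```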

## Caching
Every time you run an action, the DAG behind it is recomputed from the beginning. That is, intermediate results are not kept in memory. So, if we want to keep a result so it won't have to be recomputed, we can use the cache command. Note that this occupies space in the worker nodes' memory, so be careful with the size of the datasets you cache! A cached DataFrame is kept in RAM in deserialized form (not converted into a stream of bytes); in recent Spark versions the default storage level for DataFrames also spills to disk whatever does not fit in memory. You can change both of these with `persist()` and a storage level - store data on hard disk, serialize it, or both!

## Collecting
Even after you cache a DataFrame, it still sits in the worker nodes' memory. If you want to gather its pieces, assemble them and keep them on the master node so you won't have to pull them every time, use the collect command. Again, be very careful with this, since the collected data has to fit in the master node's memory! You will rarely issue this command directly.

In [None]:
df.cache()
df.collect()


In [348]:
# Convert PySpark DataFrame to Pandas DataFrame
pd_df = df.toPandas()

# Convert it back
spark_df = spark.createDataFrame(pd_df)


In [349]:
# Show first three rows as three row objects, which is how Spark represents single rows from a table.
# We will learn more about this in a bit
df.head(3)


[Row(Age=40, Sex='M', ChestPainType='ATA'),
 Row(Age=49, Sex='F', ChestPainType='NAP'),
 Row(Age=37, Sex='M', ChestPainType='ATA')]

In [350]:
# Show the types of columns and whether they are nullable
df.printSchema()


root
 |-- Age: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)



In [351]:
# Column dtypes as list of tuples
df.dtypes


[('Age', 'int'), ('Sex', 'string'), ('ChestPainType', 'string')]

In [352]:
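# Cast a column from one type to another
from pyspark.sql.types import FloatType
df = df.withColumn("Age", df.Age.cast(FloatType()))
# df has only three columns at this point, so this line creates RestingBP as a float copy of Age,
# which gives us a second numeric column to work with below
df = df.withColumn("RestingBP", df.Age.cast(FloatType()))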


In [353]:
# Compute summary statistics
df.select(['Age','RestingBP']).describe().show()


+-------+------------------+------------------+
|summary|               Age|         RestingBP|
+-------+------------------+------------------+
|  count|               918|               918|
|   mean|53.510893246187365|53.510893246187365|
| stddev|  9.43261650673202|  9.43261650673202|
|    min|              28.0|              28.0|
|    max|              77.0|              77.0|
+-------+------------------+------------------+



In [354]:
# Add a new column or replace an existing one
AgeFixed = df['Age'] + 1  # `select` always returns a DataFrame object, but we need a column object
df = df.withColumn('AgeFixed', AgeFixed)


In [355]:
df.select(['AgeFixed','Age']).describe().show()


+-------+------------------+------------------+
|summary|          AgeFixed|               Age|
+-------+------------------+------------------+
|  count|               918|               918|
|   mean|54.510893246187365|53.510893246187365|
| stddev|  9.43261650673202|  9.43261650673202|
|    min|              29.0|              28.0|
|    max|              78.0|              77.0|
+-------+------------------+------------------+



In [356]:
# Remove columns
df.drop('AgeFixed').show(1) # Add `df = ` to get the new DataFrame into a variable


+----+---+-------------+---------+
| Age|Sex|ChestPainType|RestingBP|
+----+---+-------------+---------+
|40.0|  M|          ATA|     40.0|
+----+---+-------------+---------+
only showing top 1 row



In [357]:
# Rename a column
df.withColumnRenamed('Age','age').select('age').show(1)

# To rename more than a single column, I would suggest a loop
name_pairs = [('Age','age'),('Sex','sex')]
for old_name, new_name in name_pairs:
    df = df.withColumnRenamed(old_name,new_name)


+----+
| age|
+----+
|40.0|
+----+
only showing top 1 row



In [358]:
df.select(['age','sex']).show(1)


+----+---+
| age|sex|
+----+---+
|40.0|  M|
+----+---+
only showing top 1 row



In [359]:
# Drop all rows that contain any NA
df = df.na.drop()
df.count()

# Drop all rows where all values are NA
df = df.na.drop(how='all')

# Keep only rows with at least 2 non-null values (i.e. drop rows that have fewer than 2)
df = df.na.drop(thresh=2)

# Drop all rows where any of the values in specific columns are NAs
df = df.na.drop(how='any', subset=['age','sex']) # 'any' is the default


In [360]:
# Fill missing values in a specific column with a '?'
df = df.na.fill(value='?',subset=['sex'])

# Replace NAs with the mean of the column
from pyspark.ml.feature import Imputer # In statistics, imputation is the process of
                                       # replacing missing data with substituted values
imptr = Imputer(inputCols=['age','RestingBP'],
                outputCols=['age','RestingBP']).setStrategy('mean') # Can also be 'median', etc.

df = imptr.fit(df).transform(df)
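
What mean imputation does to a single column can be sketched in plain Python. The sample values below are made up for illustration, and the two steps only mirror the fit/transform split used above; this is not the `Imputer` implementation.

```python
from statistics import mean

ages = [40.0, None, 37.0, 49.0, None]  # made-up sample; None marks a missing value

# Step 1 ("fit"): learn the fill value from the observed entries only
observed = [a for a in ages if a is not None]
fill_value = mean(observed)

# Step 2 ("transform"): replace each missing entry with the learned value
imputed = [fill_value if a is None else a for a in ages]
print(imputed)  # [40.0, 42.0, 37.0, 49.0, 42.0]
```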


In [361]:
# Filter to adults only (three equivalent ways)
df.filter('age > 18')
df.where('age > 18')  # 'where' is an alias of 'filter'
df.where(df['age'] > 18)  # same condition, written as a column expression

# Add another condition ('&' means and, '|' means or)
df.where((df['age'] > 18) | (df['ChestPainType'] == 'ATA'))

# Take every record where the 'ChestPainType' is NOT 'ATA'
df.filter(~(df['ChestPainType'] == 'ATA'))


DataFrame[age: float, sex: string, ChestPainType: string, RestingBP: float, AgeFixed: float]

In [5]:
df.filter('age > 18').show()


+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
|Age|Sex|ChestPainType|RestingBP|Cholesterol|FastingBS|RestingECG|MaxHR|ExerciseAngina|Oldpeak|ST_Slope|HeartDisease|
+---+---+-------------+---------+-----------+---------+----------+-----+--------------+-------+--------+------------+
| 40|  M|          ATA|      140|        289|        0|    Normal|  172|             N|    0.0|      Up|           0|
| 49|  F|          NAP|      160|        180|        0|    Normal|  156|             N|    1.0|    Flat|           1|
| 37|  M|          ATA|      130|        283|        0|        ST|   98|             N|    0.0|      Up|           0|
| 48|  F|          ASY|      138|        214|        0|    Normal|  108|             Y|    1.5|    Flat|           1|
| 54|  M|          NAP|      150|        195|        0|    Normal|  122|             N|    0.0|      Up|           0|
| 39|  M|          NAP|      120|        339|        0| 

In [365]:
# Evaluate a string expression into a command
from pyspark.sql.functions import expr
exp = 'age + 0.2 * AgeFixed'
df.withColumn('new_col', expr(exp)).select('new_col').show(3)


+-------+
|new_col|
+-------+
|   48.2|
|   59.0|
|   44.6|
+-------+
only showing top 3 rows



In [24]:
# Group by age; `.mean()` averages every numeric column, and we keep only HeartDisease
disease_by_age = df.groupby('age').mean().select(['age','avg(HeartDisease)'])
# Sort values in descending/ascending order
from pyspark.sql.functions import desc, asc
disease_by_age.orderBy(desc("age")).show(5)


+---+------------------+
|age| avg(HeartDisease)|
+---+------------------+
| 77|               1.0|
| 76|               0.5|
| 75|0.6666666666666666|
| 74|0.7142857142857143|
| 73|               1.0|
+---+------------------+
only showing top 5 rows



In [25]:
disease_by_age = df.groupby('age').mean().select(['age','avg(HeartDisease)'])
disease_by_age.orderBy(desc("age")).show(3)


+---+------------------+
|age| avg(HeartDisease)|
+---+------------------+
| 77|               1.0|
| 76|               0.5|
| 75|0.6666666666666666|
+---+------------------+
only showing top 3 rows



In [28]:
# Aggregate to get several statistics for several columns
# Common aggregate functions include avg, max, min, sum and count
from pyspark.sql import functions as F
df.agg(F.min(df['age']),F.max(df['age']),F.avg(df['sex'])).show()


DataFrame[min(age): int, max(age): int, avg(sex): double]

In [29]:
# Same aggregations per group; avg over the string column 'sex' comes out as null
df.groupby('HeartDisease').agg(F.min(df['age']),F.max(df['age']),F.avg(df['sex'])).show()


+------------+--------+--------+--------+
|HeartDisease|min(age)|max(age)|avg(sex)|
+------------+--------+--------+--------+
|           1|      31|      77|    null|
|           0|      28|      76|    null|
+------------+--------+--------+--------+



In [370]:
# Run an SQL query on the data
df.createOrReplaceTempView("df") # tell PySpark what the table will be called in the SQL query
spark.sql("""SELECT sex from df""").show(2)

# We can also choose columns using SQL syntax with a command that combines '.select()' and '.sql()'
df.selectExpr("age >= 40 as older", "age").show(2)


+---+
|sex|
+---+
|  M|
|  F|
+---+
only showing top 2 rows

+-----+----+
|older| age|
+-----+----+
| true|40.0|
| true|49.0|
+-----+----+
only showing top 2 rows



In [36]:
df.groupby('age').pivot('sex', ("M", "F")).count().show(3)


+---+---+---+
|age|  M|  F|
+---+---+---+
| 31|  1|  1|
| 65| 17|  4|
| 53| 27|  6|
+---+---+---+
only showing top 3 rows



In [38]:
# pivot - Note: this is a very expensive operation. You can mitigate this by
# specifying the values to pivot on, as we've done below.
df.selectExpr("age >= 40 as older", "age",'sex').groupBy("sex")\
                    .pivot("older", ("true", "false")).count().show()


+---+----+-----+
|sex|true|false|
+---+----+-----+
|  F| 174|   19|
|  M| 664|   61|
+---+----+-----+



In [42]:
df.select(['age','MaxHR','Cholesterol']).show(4)


+---+-----+-----------+
|age|MaxHR|Cholesterol|
+---+-----+-----------+
| 40|  172|        289|
| 49|  156|        180|
| 37|   98|        283|
| 48|  108|        214|
+---+-----+-----------+
only showing top 4 rows



In [45]:
# Divide dataset into training features and target
X_column_names = ['Age','Cholesterol']
target_column_name = ['MaxHR']

# Convert the feature columns into a single column whose values are feature vectors
from pyspark.ml.feature import VectorAssembler
v_asmblr = VectorAssembler(inputCols=X_column_names, outputCol='Fvec')
df = v_asmblr.transform(df)
X = df.select(['Age','Cholesterol','Fvec','MaxHR'])
X.show(3)


+---+-----------+------------+-----+
|Age|Cholesterol|        Fvec|MaxHR|
+---+-----------+------------+-----+
| 40|        289|[40.0,289.0]|  172|
| 49|        180|[49.0,180.0]|  156|
| 37|        283|[37.0,283.0]|   98|
+---+-----------+------------+-----+
only showing top 3 rows



In [46]:
# Divide dataset into training and testing sets
trainset, testset = X.randomSplit([0.8,0.2])


In [47]:
# Predict using linear regression
from pyspark.ml.regression import LinearRegression
model = LinearRegression(featuresCol='Fvec', labelCol='MaxHR')
model = model.fit(trainset)
print(model.coefficients)
print(model.intercept)


[-0.9981223334822935,0.04620857054247365]
181.31618579521276


In [48]:
# Evaluate model
model.evaluate(testset).predictions.show(3)


+---+-----------+------------+-----+------------------+
|Age|Cholesterol|        Fvec|MaxHR|        prediction|
+---+-----------+------------+-----+------------------+
| 28|        132|[28.0,132.0]|  185|159.46829176931507|
| 30|        237|[30.0,237.0]|  170|162.32394700931022|
| 34|        210|[34.0,210.0]|  192|157.08382627073425|
+---+-----------+------------+-----+------------------+
only showing top 3 rows



In [None]:
# Handle categorical features with ordinal indexing
from pyspark.ml.feature import StringIndexer
indxr = StringIndexer(inputCol='ChestPainType', outputCol='ChestPainTypeIndexed')
indxr.fit(df).transform(df).select('ChestPainTypeIndexed').show(3)
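
To see what such an index looks like, here is a rough plain-Python sketch of frequency-based indexing. It assumes the indexer's default ordering, which to my knowledge is by descending label frequency (the most common label gets index 0). The sample labels are made up, and this is an illustration rather than the `StringIndexer` implementation.

```python
from collections import Counter

labels = ["ATA", "NAP", "ASY", "ASY", "NAP", "ASY"]  # made-up sample

# Most frequent label gets index 0, the next one index 1, and so on
# (ties, absent here, would need an extra rule such as alphabetical order)
ordered = [label for label, _ in Counter(labels).most_common()]
label_to_index = {label: float(i) for i, label in enumerate(ordered)}

indexed = [label_to_index[label] for label in labels]
print(label_to_index)  # {'ASY': 0.0, 'NAP': 1.0, 'ATA': 2.0}
print(indexed)         # [2.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

Keep in mind that these numbers impose an order on the categories that a linear model will treat as meaningful, which may or may not suit the feature.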
