# Apache Spark 3.2 (PySpark) Tutorial
- Author: Akira Takihara Wang (https://github.com/akiratwang)

Tutorial Operating System(s):
- Windows 10 and WSL2
- Linux

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PySpark Pandas API") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()

21/11/24 15:52:26 WARN Utils: Your hostname, NeonEx resolves to a loopback address: 127.0.1.1; using 10.1.1.247 instead (on interface eth0)
21/11/24 15:52:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/24 15:52:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# `pandas`-on-Spark (Spark's `pandas` API)
Since Apache Spark 3.2, you can use the `pandas` API on Spark:
- https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

Essentially, this implements the `pandas.DataFrame` API on top of PySpark. For example, `pandas` users can scale out existing workloads largely by changing an import (`import pyspark.pandas as ps` instead of `import pandas as pd`).

**Terminology:**
- `pandas-on-Spark` refers to Spark's `pandas` API.
- `sdf`: Spark DataFrame
- `pdf`: `pandas` DataFrame (named this way to distinguish it from `sdf`)
- `psdf`: `pandas-on-Spark` DataFrame (PySpark's version of the `pandas.DataFrame`)
- `df`: DataFrame (usually from the `pandas` library)
- `ps`: Spark's `pandas` API (`pyspark.pandas`)

Previously, we read a CSV file with the following code:

In [2]:
sdf = spark.read.csv('../data/sample.csv', header=True)
sdf.limit(5)

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,1/12/15 0:00,1/12/15 0:05,5,0.96,-73.97994232,40.76538086,1,N,-73.96630859,40.76308823,1,5.5,0.5,0.5,1.0,0.0,0.3,7.8
2,1/12/15 0:00,1/12/15 0:00,2,2.69,-73.97233582,40.76237869,1,N,-73.99362946,40.74599838,1,21.5,0.0,0.5,3.34,0.0,0.3,25.64
2,1/12/15 0:00,1/12/15 0:00,1,2.62,-73.96884918,40.76453018,1,N,-73.97454834,40.79164124,1,17.0,0.0,0.5,3.56,0.0,0.3,21.36
1,1/12/15 0:00,1/12/15 0:05,1,1.2,-73.99393463,40.74168396,1,N,-73.99766541,40.74746704,1,6.5,0.5,0.5,0.2,0.0,0.3,8.0
1,1/12/15 0:00,1/12/15 0:09,2,3.0,-73.98892212,40.72698975,1,N,-73.97559357,40.6968689,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3


Now, we can do something like this:

In [45]:
import pyspark.pandas as ps

# ps.read_csv returns a pandas-on-Spark DataFrame, so we name it psdf
psdf = ps.read_csv('../data/sample.csv')
psdf.head()

21/11/24 16:09:43 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/11/24 16:09:43 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,1/12/15 0:00,1/12/15 0:05,5,0.96,-73.979942,40.765381,1,N,-73.966309,40.763088,1,5.5,0.5,0.5,1.0,0.0,0.3,7.8
1,2,1/12/15 0:00,1/12/15 0:00,2,2.69,-73.972336,40.762379,1,N,-73.993629,40.745998,1,21.5,0.0,0.5,3.34,0.0,0.3,25.64
2,2,1/12/15 0:00,1/12/15 0:00,1,2.62,-73.968849,40.76453,1,N,-73.974548,40.791641,1,17.0,0.0,0.5,3.56,0.0,0.3,21.36
3,1,1/12/15 0:00,1/12/15 0:05,1,1.2,-73.993935,40.741684,1,N,-73.997665,40.747467,1,6.5,0.5,0.5,0.2,0.0,0.3,8.0
4,1,1/12/15 0:00,1/12/15 0:09,2,3.0,-73.988922,40.72699,1,N,-73.975594,40.696869,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3


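Before we dive into this new API, there is one more concept you need to know. Running the code above produces warnings from `WindowExec`. They appear because `pandas-on-Spark` attaches a default sequential index to the DataFrame and, with the default settings, computes it with a window function that has no partition defined, so Spark moves all the data to a single partition.

What is a window function? From https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html: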
- Window functions allow users of Spark SQL to calculate results such as the rank of a given row or a moving average over a range of input rows. 
- At its core, a window function calculates a return value for every input row of a table based on a group of rows, called the `Frame`. 
- Every input row can have a unique `frame` associated with it. 
- This characteristic of window functions makes them more powerful than other functions and allows users to express various data processing tasks that are hard (if not impossible) to be expressed without window functions in a concise way. 
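
To make the idea of a frame concrete, here is a minimal pure-Python sketch (not Spark code). For every input row it computes a moving average over a frame made of the current row and the two preceding rows. The function name `moving_average` is a hypothetical helper, and the sample values are the first five `fare_amount` entries shown earlier.

```python
def moving_average(values, preceding=2):
    """Return one result per input row, computed over that row's frame.

    The frame of row i is rows max(0, i - preceding) .. i, which mimics
    a "ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW" window.
    """
    results = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]  # rows visible to row i
        results.append(sum(frame) / len(frame))
    return results


fares = [5.5, 21.5, 17.0, 6.5, 11.0]  # first five fare_amount values
averages = moving_average(fares)
```

Each row gets its own frame, so the result has exactly one value per input row. This is what separates a window function from an aggregation, which collapses many rows into one.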

## Options and Settings
Like `pandas`, `pandas-on-Spark` allows for options such as the maximum number of rows to display and/or compute.

In [4]:
# original pandas way to change max rows
import pandas as pd

pd.options.display.max_rows = 100
pd.options.display.max_rows

100

In [5]:
# it works the same way in Spark's pandas API
import pyspark.pandas as ps

ps.options.display.max_rows = 100
ps.options.display.max_rows

100

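Options are read and changed through three functions, where each option is named by a string such as `'display.max_rows'`:
1. `ps.get_option(option)`: return the current value
2. `ps.set_option(option, new_value)`: set a new value
3. `ps.reset_option(option)`: restore the default value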

The code above has the following equivalent:
```python
ps.set_option('display.max_rows', 100)
ps.get_option('display.max_rows')
```

In [6]:
# let's set the max compute rows to 2000
ps.set_option('compute.max_rows', 2000)

In [7]:
# now let's see it 
ps.get_option('compute.max_rows')

2000

In [8]:
# reset it to default (1000 rows)
ps.reset_option('compute.max_rows')

# and verify it
ps.get_option('compute.max_rows')

1000
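
When an option should change only for a few lines, resetting it by hand is easy to forget. The sketch below uses plain `pandas` so that it runs without a Spark session. `pyspark.pandas` also provides an `option_context` function, so on the assumption that it follows the same calling convention, the pattern should carry over by swapping `pd` for `ps`.

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# the option is changed only inside the with-block
with pd.option_context('display.max_rows', 100):
    inside = pd.get_option('display.max_rows')

# on exit, the previous value is restored automatically
after = pd.get_option('display.max_rows')
```

Here `inside` holds the temporary value and `after` matches `default_rows` again, with no explicit call to `reset_option`.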

# `.transform()` and `.apply()`
The main difference between the two is the length of the result:
- `.transform()` requires the function to return output of the same length as its input (`n` rows in, `n` rows out)
- `.apply()` has no such requirement (`n` rows in, an arbitrary number of rows out)

Let's create a sample `psdf` and show the differences.

In [22]:
psdf = ps.DataFrame({'Col A': range(5), 'Col B':range(6, 11)})
psdf

Unnamed: 0,Col A,Col B
0,0,6
1,1,7
2,2,8
3,3,9
4,4,10


In [23]:
def add_one_transform(series):
    """
    adds one to the series and returns all rows
    """
    return series + 1

psdf.transform(add_one_transform)

Unnamed: 0,Col A,Col B
0,1,7
1,2,8
2,3,9
3,4,10
4,5,11


In [27]:
def add_one_apply(series):
    """
    adds one to the series, then returns only the rows whose values are odd
    """
    # build a new series rather than modifying the input in place
    series = series + 1
    return series.loc[series % 2 == 1]

psdf.apply(add_one_apply)

Unnamed: 0,Col A,Col B
0,1,7
2,3,9
4,5,11
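
The length contract can be checked explicitly. The sketch below uses plain `pandas` (so it runs without a Spark session) on the same sample data. The names `add_one` and `add_one_keep_odd` are illustrative, and the behaviour shown is that of `pandas`, which the outputs above suggest `pandas-on-Spark` mirrors here.

```python
import pandas as pd

pdf = pd.DataFrame({'Col A': range(5), 'Col B': range(6, 11)})


def add_one(series):
    # returns as many rows as it receives, so it is valid for transform
    return series + 1


def add_one_keep_odd(series):
    # may return fewer rows than it receives, so it needs apply
    result = series + 1
    return result.loc[result % 2 == 1]


transformed = pdf.transform(add_one)
applied = pdf.apply(add_one_keep_odd)

# compare the row counts of the input and of both results
print(len(pdf), len(transformed), len(applied))
```

The transformed frame keeps every input row, while the applied frame keeps only the rows that survive the filter.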


You can also apply a function along a specific axis, that is, to each row or to each column.

Here are some examples:
1. Combine the values of each row into a list
2. Sum the values of each row (across the columns)
3. Take the max of each column (across the rows)

In [36]:
psdf.apply(list, axis='columns')

0     [0, 6]
1     [1, 7]
2     [2, 8]
3     [3, 9]
4    [4, 10]
dtype: object

In [37]:
psdf.apply(sum, axis='columns')

0     6
1     8
2    10
3    12
4    14
dtype: int64

In [41]:
psdf.apply(max)

Col A     4
Col B    10
dtype: int64
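
The `axis` argument decides what each call of the function receives. Below is a minimal sketch in plain `pandas`, used so that the snippet runs without a Spark session. It assumes `pandas-on-Spark` follows the same `axis` convention, as the outputs above suggest.

```python
import pandas as pd

pdf = pd.DataFrame({'Col A': range(5), 'Col B': range(6, 11)})

# axis='columns' (same as axis=1): the function receives one row at a time
row_sums = pdf.apply(sum, axis='columns')

# axis='index' (same as axis=0, the default): the function receives one column at a time
col_maxes = pdf.apply(max, axis='index')

# the built-in reductions give the same values without a Python-level function call
same_sums = row_sums.tolist() == pdf.sum(axis=1).tolist()
same_maxes = col_maxes.tolist() == pdf.max(axis=0).tolist()
```

When a built-in reduction such as `.sum()` or `.max()` exists, it is usually the simpler choice. `.apply()` is most useful for logic that has no built-in equivalent.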