# Apache Spark 3.2 (PySpark) Tutorial
- Author: Akira Takihara Wang (https://github.com/akiratwang)

Tutorial Operating System(s):
- Windows 10 and WSL2
- Linux

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PySpark Conversion and Transformations") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()

21/11/24 16:23:47 WARN Utils: Your hostname, NeonEx resolves to a loopback address: 127.0.1.1; using 10.1.1.247 instead (on interface eth0)
21/11/24 16:23:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/24 16:23:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
schema = """
`VendorID` INT,  
`tpep_pickup_datetime` STRING, 
`tpep_dropoff_datetime` STRING,
`passenger_count` INT, 
`trip_distance` DOUBLE, 
`pickup_longitude` DOUBLE, 
`pickup_latitude` DOUBLE,
`RateCodeID` INT, 
`store_and_fwd_flag` STRING, 
`dropoff_longitude` DOUBLE, 
`dropoff_latitude` DOUBLE,
`payment_type` INT, 
`fare_amount` DOUBLE, 
`extra` DOUBLE, 
`mta_tax` DOUBLE, 
`tip_amount` DOUBLE,
`tolls_amount` DOUBLE, 
`improvement_surcharge` DOUBLE, 
`total_amount` DOUBLE
"""

# Transformations and Lazy Evaluation
Transformations in PySpark will transform a Spark DataFrame into a _new_ DataFrame _without_ altering the original data. This means that Spark is **immutable** (i.e there is no `inplace=True` argument like some `pandas` methods).

For example, operations will return transformed results rather than mutating the original. Therefore, it is quite common to see:
- `sdf = sdf.with_some_transformation()`

Finally, Spark operations are evaluated lazily. This is because there is a driver under-the-hood which looks to optimize and make your operations more efficient. This means that your data _does not "move"_ until called upon.

## Renaming Columns
If we work with the full 2015 dataset, you'll notice some inconsistencies in the column names (`RateCodeID` vs `RatecodeID`). To rename fields, we need to use the `.withColumnRenamed(original name, new name)` method.

In [7]:
sdf = spark.read.csv('../data/sample.csv', schema=schema, header=True)

In [None]:
sdf.limit(5)

# Data Type Conversions
If we look at our DataFrame, our `datetime` field is of form `1/12/15 0:00` which follows neither formats. Let's resolve this now!

In [8]:
sdf.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)



There are several ways to do it, but we will go through the most simple approach:
- `.withColumn(new column name, expression for the new column)`
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html#pyspark.sql.DataFrame.withColumn

In [9]:
from pyspark.sql.functions import to_timestamp

dtime_format = 'd/M/yy H:mm' 
dtime_cols = ('tpep_pickup_datetime', 'tpep_dropoff_datetime')

In [10]:
for dtime_col in dtime_cols:
    sdf = sdf.withColumn(dtime_col, 
                         to_timestamp(sdf['tpep_pickup_datetime'], dtime_format)
    )

In [11]:
sdf.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- pickup_longitude: double (nullable = true)
 |-- pickup_latitude: double (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_longitude: double (nullable = true)
 |-- dropoff_latitude: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)



Now, we've converted it into a `TimestampType` as required. 

Let's try some more advanced conversions. For example, if we look at the `store_and_fwd_flag`, it actually represents a boolean condition. According to the Data Dictionary though, we currently have `N` and `Y` representing No and Yes respectively. 

In `pandas`, we would have done something like this:
```python
df['store_and_fwd_flag_bool'] = (df['store_and_fwd_flag'] == 'Y').astype(bool)
```

In Spark, we can use the `.cast()` method to cast a column into a specific data type.

In [20]:
sdf = sdf.withColumn('store_and_fwd_flag',
                    (sdf["store_and_fwd_flag"] == 'Y').cast('BOOLEAN')
      )
sdf.limit(5)

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-12-01 00:00:00,2015-12-01 00:00:00,5,0.96,-73.97994232,40.76538086,1,False,-73.96630859,40.76308823,1,5.5,0.5,0.5,1.0,0.0,0.3,7.8
2,2015-12-01 00:00:00,2015-12-01 00:00:00,2,2.69,-73.97233582,40.76237869,1,False,-73.99362946,40.74599838,1,21.5,0.0,0.5,3.34,0.0,0.3,25.64
2,2015-12-01 00:00:00,2015-12-01 00:00:00,1,2.62,-73.96884918,40.76453018,1,False,-73.97454834,40.79164124,1,17.0,0.0,0.5,3.56,0.0,0.3,21.36
1,2015-12-01 00:00:00,2015-12-01 00:00:00,1,1.2,-73.99393463,40.74168396,1,False,-73.99766541,40.74746704,1,6.5,0.5,0.5,0.2,0.0,0.3,8.0
1,2015-12-01 00:00:00,2015-12-01 00:00:00,2,3.0,-73.98892212,40.72698975,1,False,-73.97559357,40.6968689,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3


# TODO TODO TODO

# User Defined Functions (UDF) and Pandas UDFs
So far, all the functions and methods have been about simple aggregations or filtering rows. However, preprocessing and data cleansing usually requires more powerful tools such as `regex`.

Unlike Pandas's `apply()` method (and also `rdd.map()`), we need to do a "bit" more work to generate UDFs.

1. Create a function with a `@udf()` decorator.
2. Specify an output data type (i.e `StringType()`) as format `@udf("string")` or `@udf(StringType())`.
3. Apply onto column(s) of choice (remembering that Spark is immutable).

Alternatively, if we want to use Pandas framework:
1. Create a function with a `@pandas_udf()` decorator and format as required.
2. Apply onto column(s) of choice.

In the following example, we will create a tuple consisting of pickup lat/lon to 4 decimal places.

In [None]:
# using UDF
@F.udf(ArrayType(DoubleType(), True))
def create_coords(lat, lon):
    return round(lat, 4), round(lon, 4)

In [None]:
small_sdf.withColumn("pickup_coords", create_coords(col("pickup_latitude"), col("pickup_longitude"))) \
    .limit(10)

And here's an example of mapping values from our data dictionary using a Pandas UDF:
- Type definition Syntax: https://www.python.org/dev/peps/pep-0484/#type-definition-syntax
- Function Decorators: https://johnpaton.net/posts/clean-spark-udfs/

The Pandas UDF is also quite new so there isn't much *help* other than the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html?highlight=pandas%20udf

Syntax:
```python
@pandas_udf(THE DATATYPE OF THE OUTPUT)
def FUNCTION_NAME(ARGUMENTS: INPUT DATA FORMAT) -> OUTPUT DATA FORMAT:
    ...
    return ...

sdf.withColumn(COLUMN OUT, FUNCTION_NAME(col(COLUMN IN)))
```

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType

In [None]:
vendors = {1: 'Creative Mobile Technologies, LLC', 2: 'VeriFone Inc.'}

@pandas_udf("string")
def vendorMap(vid_col: pd.Series) -> pd.Series:
    return vid_col.map(vendors)

In [None]:
small_sdf.withColumn("VendorName", vendorMap(col("VendorID"))) \
    .limit(10)

And that's the basics of PySpark! If you would like to further increase your scope, here are some pathways:
- Data Science: Continue with Spark's MLlib to perform machine learning.
- Data Engineering: Learn Spark SQL and Spark Connectors (i.e connecting to data sources such as S3 buckets)