A beginner's guide to Apache Spark 3.2 (PySpark) for Data Engineering.
For now, this is essentially an updated version of my MAST30034 PySpark advanced tutorials; it will be reworked into a more general tutorial in future.
- Installation (Windows 11 + WSL2, Linux, macOS)
- Fundamentals (Spark Session, reading data in, filtering, aggregating, Spark SQL basics, saving data)
- PySpark's Pandas API (Basics of PySpark's new pandas API as of Apache Spark 3.2)
- Transformations and Functions (Converting data types, creating User-Defined Functions, and also pandas UDFs)
- Common methods, attributes, and functions for Data Engineering
- AWS, PySpark, and Redshift (or postgres) (EMR clusters)
- Common libraries:
  - psycopg2
  - cloudpathlib
  (both used in conjunction with Spark)