These are notes based on Wenqiang Feng's PySpark material, using real air quality data. Below I collect practical, ready-to-use commands.
An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions (smaller subsets with the same structure), which may be computed on different nodes of the cluster.
- Start the Spark environment
- Import data from different sources and transform it into RDDs
- Commands for performing different actions on RDDs
Source: https://runawayhorse001.github.io/LearningApacheSpark/rdd.html