Welcome to this series of notebooks on Apache Spark! The main goal of the series is to get familiar with Apache Spark, and in particular with its Python API, PySpark. It introduces a few functionalities of interest (and is by no means complete!). I may have made mistakes, left typos, and some parts might be deprecated, so feel free to open an issue (@JulienPeloton) if that is the case!
Under construction
-
Part I: Installation and first steps
- Apache Spark: what is it?
- Installation @ HOME
- Using the PySpark shell
- Your first Spark program (see the sketch below)
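As a teaser for this part, a minimal sketch of a first PySpark program; the application name and the numbers are purely illustrative, not taken from the notebooks.

```python
from pyspark.sql import SparkSession

# Entry point: the SparkSession (it wraps the SparkContext since Spark 2.x)
spark = SparkSession.builder.appName("first_steps").getOrCreate()

# Distribute a small Python list as an RDD and sum its elements on the executors
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.reduce(lambda x, y: x + y))  # prints 45

spark.stop()
```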
-
Part II: Spark SQL and DataFrames
- Apache Spark Data Sources
- Spark SQL and DataFrames
- Loading and distributing data: RDD vs DataFrame, partitioning, limits, ... (see the sketch below)
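To give a flavour of this part, a minimal sketch of building and querying a DataFrame, through both the DataFrame API and plain SQL; the column names and values are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_dataframes").getOrCreate()

# Small DataFrame built from local data; in practice you would load a
# distributed data source (CSV, Parquet, FITS, ...) with spark.read
df = spark.createDataFrame(
    [("alpha", 1.0), ("beta", 2.5), ("gamma", 4.0)],
    ["name", "value"],
)

# The DataFrame API and SQL are two front-ends to the same engine
df.filter(df["value"] > 1.5).show()

df.createOrReplaceTempView("mytable")
spark.sql("SELECT name, value FROM mytable WHERE value > 1.5").show()
```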
-
Part III: From Scala to Python
- Strict vs non-strict (lazy) evaluation
- Transformations vs actions
- User Defined Functions
- To cache or not to cache? That is the question (see the sketch below).
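A minimal sketch of the lazy-evaluation and caching ideas listed above; the numbers are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_eval").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000000))

# map is a transformation: it is lazy, nothing is computed here
squared = rdd.map(lambda x: x * x)

# cache marks the RDD to be kept in memory once it is first materialised
squared.cache()

# count and sum are actions: they trigger the actual computation
print(squared.count())  # first action: computes and caches
print(squared.sum())    # second action: reuses the cached data
```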
-
Part IV: PySpark, NumPy, Pandas and co.
- Can I use my regular Python packages with PySpark?
- More complex UDFs (see the sketch below)
- Visualising data faster than ever
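As an appetiser, a sketch of a vectorised (pandas) UDF. It assumes Spark 3.x with pyarrow installed, and the column names are invented for the example.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas_udf_example").getOrCreate()

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

# A pandas UDF receives whole pandas Series instead of single rows,
# letting NumPy/pandas do the heavy lifting with much less overhead
@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    return s * 2.0

df.withColumn("y", times_two("x")).show()
```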
-
Part V: Spark UI and debugging
- Monitoring logs and understanding them (see the sketch below)
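Two small knobs that come up when debugging, as a sketch; the UI address is Spark's default and shifts to 4041, 4042, ... if the port is already taken.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logging").getOrCreate()

# Tame the console output: only WARN and above from Spark's logger
spark.sparkContext.setLogLevel("WARN")

# While the application runs, the driver serves the Spark UI
# (http://localhost:4040 by default); uiWebUrl reports the actual address
print(spark.sparkContext.uiWebUrl)
```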
-
Part VI: Testing Spark applications
- Testing using doctest (see the sketch below).
- Automating with Travis.
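A minimal, plain-Python doctest sketch of the idea; testing actual Spark code additionally needs a SparkSession set up in the test fixture, which is left to the notebooks.

```python
def square(x):
    """Return the square of x.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```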
-
Part VII: Interfacing with Spark
- From Spark to PySpark: understanding log4j (see the sketch below)
- Interfacing C++ libraries with PySpark
- C/C++/Fortran to Scala
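A sketch of reaching the JVM-side log4j logger from Python through the py4j gateway. Note that `_jvm` is an internal handle rather than a public API, and the log4j classes exposed depend on the Spark version, so treat this as an illustration of the mechanism rather than a recipe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log4j_bridge").getOrCreate()

# Reach the JVM through the py4j gateway held by the SparkContext
# (_jvm is internal: widely used for this, but not officially public)
log4j = spark.sparkContext._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_pyspark_app")
logger.info("Hello from Python, logged on the JVM side")
```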
-
Appendix A: Apache Spark @ NERSC
- Apache Spark and HPC machines
- Batch jobs @ NERSC (see the sketch below)
- JupyterLab @ NERSC
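Only as a hedged sketch: in a batch job one typically starts a standalone Spark cluster first and then points the application at its master URL. The environment variable name SPARKURL below is purely illustrative; the actual NERSC workflow is described in the appendix notebook.

```python
import os
from pyspark.sql import SparkSession

# Hypothetical: the batch script exports the master URL of a standalone
# cluster; fall back to local mode when running interactively
master = os.environ.get("SPARKURL", "local[*]")

spark = (
    SparkSession.builder
    .master(master)
    .appName("batch_job")
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)
spark.stop()
```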
-
Appendix B: Apache Spark @ DESC
- Some examples of how to use Spark with LSST data (see the sketch below)
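A sketch of reading a FITS catalogue as a distributed DataFrame with the spark-fits connector; the file path is a placeholder, and spark-fits must be on the Spark classpath (the exact Maven coordinates depend on your Spark and Scala versions).

```python
from pyspark.sql import SparkSession

# spark-fits must be available, e.g. added via spark.jars.packages
spark = SparkSession.builder.appName("lsst_fits").getOrCreate()

# Read the first HDU of a FITS file as a DataFrame (the path is a placeholder)
df = spark.read.format("fits").option("hdu", 1).load("/path/to/catalog.fits")

df.printSchema()
df.show(5)
```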
-
Useful resources
- Databricks: research papers on Apache Spark (e.g. foundation, RDD)
- HDFS: the Hadoop Distributed File System
- Scientific Spark: spark-fits, Apache Spark for physicists (see also the bibliographies at the end of those papers)
- Books: Spark, Scala
- Apache Spark documentation: https://spark.apache.org/docs/latest/