Welcome to this series of notebooks on Apache Spark! The main goal of the series is to get familiar with Apache Spark, and in particular with its Python API, PySpark. It introduces a few functionalities of interest (and is by no means complete!). I may have made mistakes, left typos, and some parts might be deprecated, so feel free to open an issue (@JulienPeloton) if that is the case!
Under construction
-
Part I: Installation and first steps
- Apache Spark: what is it?
- Installation @ HOME
- Using the PySpark shell
- Your first Spark program (see the sketch below)
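As a teaser for this part, a minimal sketch of a first PySpark program; the application name and the numbers are purely illustrative, not taken from the notebooks.

```python
from pyspark.sql import SparkSession

# Entry point: the SparkSession (it wraps the SparkContext since Spark 2.x)
spark = SparkSession.builder.appName("first_steps").getOrCreate()

# Distribute a small Python list as an RDD and sum its elements on the executors
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.reduce(lambda x, y: x + y))  # prints 45

spark.stop()
```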
-
Part II: Spark SQL and DataFrames
- Apache Spark Data Sources
- Spark SQL and DataFrames
- Loading and distributing data: RDD vs DataFrame, partitioning, limits, ... (see the sketch below)
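To give a flavour of this part, a minimal sketch of building and querying a DataFrame, through both the DataFrame API and plain SQL; the column names and values are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_dataframes").getOrCreate()

# Small DataFrame built from local data; in practice you would load a
# distributed data source (CSV, Parquet, FITS, ...) with spark.read
df = spark.createDataFrame(
    [("alpha", 1.0), ("beta", 2.5), ("gamma", 4.0)],
    ["name", "value"],
)

# The DataFrame API and SQL are two front-ends to the same engine
df.filter(df["value"] > 1.5).show()

df.createOrReplaceTempView("mytable")
spark.sql("SELECT name, value FROM mytable WHERE value > 1.5").show()
```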
-
Part III: From Scala to Python
- Strict vs non-strict (lazy) evaluation
- Transformations vs actions
- User Defined Functions
- To cache or not to cache? That is the question (see the sketch below).
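A minimal sketch of the lazy-evaluation and caching ideas listed above; the numbers are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_eval").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000000))

# map is a transformation: it is lazy, nothing is computed here
squared = rdd.map(lambda x: x * x)

# cache marks the RDD to be kept in memory once it is first materialised
squared.cache()

# count and sum are actions: they trigger the actual computation
print(squared.count())  # first action: computes and caches
print(squared.sum())    # second action: reuses the cached data
```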
-
Part IV: PySpark, NumPy, Pandas and co.
- Can I use my regular Python packages with PySpark?
- More complex UDFs (see the sketch below)
- Visualising data faster than ever
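As an appetiser, a sketch of a vectorised (pandas) UDF. It assumes Spark 3.x with pyarrow installed, and the column names are invented for the example.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas_udf_example").getOrCreate()

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

# A pandas UDF receives whole pandas Series instead of single rows,
# letting NumPy/pandas do the heavy lifting with much less overhead
@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    return s * 2.0

df.withColumn("y", times_two("x")).show()
```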
-
Part V: Spark UI and debugging
- Monitoring logs and understanding them (see the sketch below)
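Two small knobs that come up when debugging, as a sketch; the UI address is Spark's default and shifts to 4041, 4042, ... if the port is already taken.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logging").getOrCreate()

# Tame the console output: only WARN and above from Spark's logger
spark.sparkContext.setLogLevel("WARN")

# While the application runs, the driver serves the Spark UI
# (http://localhost:4040 by default); uiWebUrl reports the actual address
print(spark.sparkContext.uiWebUrl)
```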
-
Part VI: Testing Spark applications
- Testing using doctest (see the sketch below).
- Automating with Travis.
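A minimal, plain-Python doctest sketch of the idea; testing actual Spark code additionally needs a SparkSession set up in the test fixture, which is left to the notebooks.

```python
def square(x):
    """Return the square of x.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```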
-
Part VII: Interfacing with Spark
- From Spark to PySpark: understanding log4j (see the sketch below)
- Interfacing C++ libraries with PySpark
- C/C++/Fortran to Scala
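A sketch of reaching the JVM-side log4j logger from Python through the py4j gateway. Note that `_jvm` is an internal handle rather than a public API, and the log4j classes exposed depend on the Spark version, so treat this as an illustration of the mechanism rather than a recipe.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log4j_bridge").getOrCreate()

# Reach the JVM through the py4j gateway held by the SparkContext
# (_jvm is internal: widely used for this, but not officially public)
log4j = spark.sparkContext._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_pyspark_app")
logger.info("Hello from Python, logged on the JVM side")
```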
-
Appendix A: Apache Spark @ NERSC
- Apache Spark and HPC machines
- Batch jobs @ NERSC (see the sketch below)
- JupyterLab @ NERSC
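Only as a hedged sketch: in a batch job one typically starts a standalone Spark cluster first and then points the application at its master URL. The environment variable name SPARKURL below is purely illustrative; the actual NERSC workflow is described in the appendix notebook.

```python
import os
from pyspark.sql import SparkSession

# Hypothetical: the batch script exports the master URL of a standalone
# cluster; fall back to local mode when running interactively
master = os.environ.get("SPARKURL", "local[*]")

spark = (
    SparkSession.builder
    .master(master)
    .appName("batch_job")
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)
spark.stop()
```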
-
Appendix B: Apache Spark @ DESC
- Some examples of how to use Spark with LSST data (see the sketch below)
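A sketch of reading a FITS catalogue as a distributed DataFrame with the spark-fits connector; the file path is a placeholder, and spark-fits must be on the Spark classpath (the exact Maven coordinates depend on your Spark and Scala versions).

```python
from pyspark.sql import SparkSession

# spark-fits must be available, e.g. added via spark.jars.packages
spark = SparkSession.builder.appName("lsst_fits").getOrCreate()

# Read the first HDU of a FITS file as a DataFrame (the path is a placeholder)
df = spark.read.format("fits").option("hdu", 1).load("/path/to/catalog.fits")

df.printSchema()
df.show(5)
```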
-
Useful resources
- Databricks: research papers on Apache Spark (e.g. foundation, RDD)
- HDFS: the Hadoop Distributed File System
- Scientific Spark: spark-fits, Apache Spark for physicists (see also the bibliographies at the end of those papers)
- Books: Spark, Scala
- Apache Spark documentation: https://spark.apache.org/docs/latest/