PySpark ML Crashcourse
This repository contains exercises and solutions for a one-day crash course on PySpark and Spark ML. It contains only Jupyter notebooks, which assume a working PySpark kernel with Python 3.5 and Spark 2.1.
All notebooks were created by Kaya Kupferschmidt @ dimajix. If you have any questions, feel free to contact me at email@example.com.
01 - PySpark DataFrame Introduction
This notebook contains some simple snippets that give a basic understanding of how to interact with Spark DataFrames in Python.
02 - PySpark Word Count (exercise + solution)
These notebooks contain the classic word count, implemented with DataFrames.
03 - Linear Regression (skeleton + solution)
These notebooks contain a simple linear regression exercise as an introduction to machine learning with Spark.
04 - Text Classification (exercise + solution)
Building on the simple linear regression, these notebooks contain an exercise in simple statistical text classification.
05 - Hyper Parameter Tuning (exercise + solution)
Like many complex algorithms and ML pipelines, the text classification pipeline has many hyper parameters. These notebooks show how to perform hyper parameter tuning with PySpark.