Skip to content
This is a repo documenting the best practices in PySpark.
Branch: master
Clone or download
publisher
publisher moar changes
Latest commit 3342f8c Mar 20, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github/ISSUE_TEMPLATE Update issue templates Feb 26, 2019
gitbook moar changes Mar 21, 2019
src Revert "rename all" (#20) Mar 21, 2019
.gitignore
README.md Revert "rename all" (#20) Mar 21, 2019

README.md

Spark-Syntax

This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark for 3 years. This will mainly focus on the Spark DataFrames and SQL library.

you can also visit ericxiao251.github.io/spark-syntax/ for a online book version.

Contributing/Topic Requests

If you notice an improvements in terms of typos, spellings, grammar, etc. feel free to create a PR and I'll review it 😁, you'll most likely be right.

If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it 😁.

Acknowledgement

Huge thanks to Levon for turning everything into a gitbook. You can follow his github at https://github.com/tumregels.

Table of Contexts:

Chapter 1 - Getting Started with Spark:

Chapter 2 - Exploring the Spark APIs:

Chapter 3 - Aggregates:

Chapter 4 - Window Objects:

Chapter 5 - Error Logs:

Chapter 6 - Understanding Spark Performance:

  • 6.1 - Primer to Understanding Your Spark Application

    • 6.1.1 - Understanding how Spark Works

    • 6.1.2 - Understanding the SparkUI

    • 6.1.3 - Understanding how the DAG is Created

    • 6.1.4 - Understanding how Memory is Allocated

  • 6.2 - Analyzing your Spark Application

    • 6.1 - Looking for Skew in a Stage

    • 6.2 - Looking for Skew in the DAG

    • 6.3 - How to Determine the Number of Partitions to Use

  • 6.3 - How to Analyze the Skew of Your Data

Chapter 7 - High Performance Code:

  • 7.0 - The Types of Join Strategies in Spark

    • 7.0.1 - You got a Small Table? (Broadcast Join)
    • 7.0.2 - The Ideal Strategy (BroadcastHashJoin)
    • 7.0.3 - The Default Strategy (SortMergeJoin)
  • 7.1 - Improving Joins

  • 7.2 - Repeated Work on a Single Dataset (caching)

    • 7.2.1 - caching layers
  • 7.3 - Spark Parameters

    • 7.3.1 - Running Multiple Spark Applications at Scale (dynamic allocation)
    • 7.3.2 - The magical number 2001 (partitions)
    • 7.3.3 - Using a lot of UDFs? (python memory)
  • 7. - Bloom Filters :o?

You can’t perform that action at this time.