This story offers an easy path for beginners to process data using PySpark. Along with the transformations, Spark memory management is also covered. Freddie Mac Acquisition and Performance data from 1999–2018 is used to create a single output file, which can then be used for data analysis or for building machine learning models.
Tools/Software Used: Services - Databricks, EMR. Storage - local PC, S3. Languages - PySpark, Python, SAS (for reference).
Checklist Followed: understanding the data and the transformation logic; Spark parallelism and the job life cycle; Spark memory management; data processing with spark-submit.
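As a rough illustration of the last checklist item, a spark-submit invocation for a job of this scale might look like the sketch below. The script name, bucket paths, memory sizes, and partition count are all assumptions to be tuned against your own cluster, not values from this story.

```shell
# Hypothetical spark-submit for the Freddie Mac join job.
# All names, paths, and sizes below are illustrative assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 4 \
  --num-executors 20 \
  --conf spark.sql.shuffle.partitions=800 \
  process_freddie_mac.py s3://my-bucket/input/ s3://my-bucket/output/
```

The executor memory and shuffle-partition settings are the usual first knobs for a shuffle-heavy join over ~145 GB of input; too few partitions risks executor out-of-memory errors, too many adds task-scheduling overhead.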
The two input files total 145 GB and the output file is about 5 GB at full volume. With the sample files, the total input and output size is nearly 3 GB. The output file has the same shape as the Acquisition file, but with additional columns such as Loan Default Status and Default UPB derived from the Performance data.