
An-Introduction-to-PySpark

This project offers beginners an easy path to processing data with PySpark. Along with transformations, Spark memory management is also covered. Freddie Mac Acquisition and Performance data from the years 1999–2018 is used to create a single output file, which can then be used for data analysis or for building machine learning models.

Tools/Software Used: Services - Databricks, EMR. Storage - local PC, S3. Languages - PySpark, Python, SAS (reference).

Checklist Followed: understanding the data and transformation logic; Spark parallelism and the job life cycle; Spark memory management; data processing with spark-submit.
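The spark-submit step above might look like the sketch below. The flag values are illustrative (tune them to your cluster), and `process_loans.py` is a hypothetical script name, not a file from this repository.

```shell
# Illustrative spark-submit invocation; memory and core sizes are
# assumptions to be tuned for the actual cluster and data volume.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.sql.shuffle.partitions=400 \
  process_loans.py
```

Raising `spark.sql.shuffle.partitions` and executor memory matters here because joining a 145 GB Performance dataset triggers large shuffles.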

At full volume the two input files total 145 GB and the output file is about 5 GB. With the sample files, the total input and output size is roughly 3 GB. The output file is similar to the Acquisition file, but with additional columns such as Loan Default Status and Default UPB derived from the Performance data.
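The enrichment described above can be sketched in plain Python for clarity; the actual job would express the same grouping and join with PySpark DataFrames (`groupBy("loan_id").agg(...)` followed by a left join onto the Acquisition DataFrame). The field names (`loan_id`, `delinquency_status`, `current_upb`) and the 3-months-delinquent default rule are assumptions, not the repository's actual schema.

```python
# Plain-Python sketch of the Performance -> Acquisition enrichment.
# Field names and the default rule are illustrative assumptions.

def summarize_performance(performance_rows):
    """Reduce per-month Performance records to one summary per loan."""
    summary = {}
    for row in performance_rows:
        loan = summary.setdefault(
            row["loan_id"], {"default_status": "N", "default_upb": 0.0})
        # Treat 3+ months delinquent as a default event (illustrative rule).
        if row["delinquency_status"] >= 3:
            loan["default_status"] = "Y"
            loan["default_upb"] = row["current_upb"]
    return summary

def enrich_acquisitions(acquisition_rows, performance_rows):
    """Append default columns from Performance data to each Acquisition row."""
    perf = summarize_performance(performance_rows)
    no_default = {"default_status": "N", "default_upb": 0.0}
    return [{**row, **perf.get(row["loan_id"], no_default)}
            for row in acquisition_rows]
```

In PySpark the per-loan summary becomes a distributed aggregation, so only one summary row per loan is shuffled into the join rather than every monthly record.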
