This story offers an easy path for beginners to process data using PySpark. Along with the transformations, Spark memory management is also covered. Freddie Mac Acquisition and Performance data from 1999–2018 is used to create a single output file, which can then be used for data analysis or for building machine learning models.
Tools/Software Used: Services - Databricks, EMR. Storage - local PC, S3. Languages - PySpark, Python, SAS (for reference).
Checklist Followed: understanding the data and the transformation logic; Spark parallelism and the job life cycle; Spark memory management; data processing with spark-submit.
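As a rough illustration of the last checklist item, a spark-submit invocation for a job of this scale might look like the sketch below. The script name, bucket paths, memory sizes, and partition count are all assumptions to be tuned against your own cluster, not values from this story.

```shell
# Hypothetical spark-submit for the Freddie Mac join job.
# All names, paths, and sizes below are illustrative assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 4 \
  --num-executors 20 \
  --conf spark.sql.shuffle.partitions=800 \
  process_freddie_mac.py s3://my-bucket/input/ s3://my-bucket/output/
```

The executor memory and shuffle-partition settings are the usual first knobs for a shuffle-heavy join over ~145 GB of input; too few partitions risks executor out-of-memory errors, too many adds task-scheduling overhead.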
The two input files total 145 GB and the output file is about 5 GB at full volume. With the sample files, the total input and output size is nearly 3 GB. The output file has the same shape as the Acquisition file, but with additional columns such as Loan Default Status and Default UPB derived from the Performance data.