A boilerplate for writing PySpark Jobs
For details, see the accompanying blog post at https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f#.wg3iv4kie
- Set up the following environment files, which reside in the `./src/config/fetch_from_mongo` folder:
  - `config.env`: Credentials are stored in this file. THIS FILE IS NOT PUSHED TO GIT.
    ```
    MONGO_USER_NAME=<mongo-db-user-name>
    MONGO_USER_PWD=<password>
    POSTGRES_USER_NAME=<username>  # usually postgres
    POSTGRES_USER_PWD=<password>   # usually postgres
    ```
  - `config.json`: This configuration varies per job; all job settings except credentials are stored in this file.
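
At startup a job needs to combine both files. Below is a minimal sketch of one way to do that; the manual `config.env` parsing (a library such as python-dotenv is a common alternative) and the merged key names `mongo_user`/`mongo_pwd` are assumptions for illustration, not this repo's actual API.

```python
import json
import os
from pathlib import Path

CONFIG_DIR = Path("src/config/fetch_from_mongo")

def load_config():
    """Merge job settings from config.json with credentials from config.env."""
    # Job-specific, non-secret settings live in config.json.
    with open(CONFIG_DIR / "config.json") as f:
        config = json.load(f)

    # Credentials live in config.env (never pushed to git).
    # Parsed by hand here to keep the sketch dependency-free.
    for line in (CONFIG_DIR / "config.env").read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

    # Hypothetical merged keys; adapt to whatever your job expects.
    config["mongo_user"] = os.environ["MONGO_USER_NAME"]
    config["mongo_pwd"] = os.environ["MONGO_USER_PWD"]
    return config
```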
- Build the `dist` folder, which contains the config files, libraries, and source code:

  ```
  sh deploy.sh
  ```
- Go to the `dist` folder:

  ```
  cd dist
  ```
- Run the job (a sketch of how `main.py` resolves the `--job` flag follows below):

  ```
  spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 --jars ~/Downloads/postgresql-42.2.24.jar --py-files jobs.zip,libs.zip main.py --job fetch_from_mongo
  ```
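
The `--job` flag tells `main.py` which job package (shipped in `jobs.zip`) to run. A minimal sketch of that dispatch, following the pattern described in the blog post; the `analyze` entry-point name is an assumption taken from that post and may differ in your jobs:

```python
import argparse
import importlib
import time

from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run a PySpark job")
    parser.add_argument("--job", required=True, dest="job_name",
                        help="name of the job package to run, e.g. fetch_from_mongo")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(args.job_name).getOrCreate()

    # Each job is a package under jobs/ exposing an entry point;
    # `analyze` is the convention from the blog post (an assumption here).
    job_module = importlib.import_module("jobs." + args.job_name)
    start = time.time()
    job_module.analyze(spark)
    print("Job %s finished in %.1fs" % (args.job_name, time.time() - start))
```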