GitHub - carmel-wenga/Apache-Beam-Pipeline-To-load-CSV-data-into-a-PostgreSQL-Table: This sample project aims at loading data from a csv file into a PostgreSQL

What is this sample beam pipeline about ?

This sample project aims at loading data from a csv file into a PostgreSQL database.

The project structure is the following

├── database
│   ├── Dockerfile
│   └── init.sql
├── pipeline.py
├── README.md
├── requirements.txt
└── source
    └── salaries_data.csv

The data that will be loaded into the postgres database are in the source/salaries_data.csv file. I downloaded the salary data from kaggle. This dataset contains salary data based on experience, age, gender, function and level of education. You can also use other data in csv format.

The apache beam pipeline itself is defined by the pipeline.py file, which reads the csv file and loads its contents into a postgres table.

Finally, the postgres database is defined with a Dockerfile (database/Dockerfile). The init.sql file creates the salaries table in the database at startup.

Create Virtual Environment

python3.10 -m venv .env

Install Requirements

The main requirements are the following:

apache-beam==2.52.0
beam-postgres-connector==0.1.3
psycopg2-binary==2.9.9

The requirements.txt file contain the complete list of requirements for this sample project.

source .env/bin/activate && pip install -r requirements.txt

Build and Run the database container

cd database && docker build -t pipelinedb .

Run the pipelinedb container with the command below

docker run -it -p 5432:5432 --name pipelinedb --env-file .env pipelinedb

The .env file contain environment variables (DB, USER, PASSWORD) useful for the postgres container. Note that the .env is not versioned in the repository. Go to the database folder and run the following command to create the .env file

cat <<EOF >> .env
POSTGRES_DB=beamdb
POSTGRES_USER=beamdb
POSTGRES_PASSWORD=ra5hoxetRami5
EOF

The postgres database will be accessible from localhost with the following uri postgres://localhost:5432/beamdb

Run the pipeline

Go back in the project's root directory and run the beam pipeline with the following command

python pipeline.py --input source/salaries_data.csv --dbhost localhost --dbport 5432

Output

After running the pipeline, salary data must be loaded into the salaries table in the postgres database.

To inspect your database, run the commands below

Connect to the container

docker exec -it pipelinedb bash

Connect to the database

psql -U beamdb -d beamdb

Run the query below to check that the data has been loaded into the salaries table

select * from salaries;

The result of the above query should look something like the following

 age | gender | education_level |               job_title               | year_of_experience | salary 
-----+--------+-----------------+---------------------------------------+--------------------+--------
  32 | Male   | Bachelor        | Software Engineer                     |                  5 |  90000
  28 | Female | Master          | Data Analyst                          |                  3 |  65000
  45 | Male   | PhD             | Senior Manager                        |                 15 | 150000
  36 | Female | Bachelor        | Sales Associate                       |                  7 |  60000
  52 | Male   | Master          | Director                              |                 20 | 200000
  29 | Male   | Bachelor        | Marketing Analyst                     |                  2 |  55000
  42 | Female | Master          | Product Manager                       |                 12 | 120000
  31 | Male   | Bachelor        | Sales Manager                         |                  4 |  80000
  26 | Female | Bachelor        | Marketing Coordinator                 |                  1 |  45000
  38 | Male   | PhD             | Senior Scientist                      |                 10 | 110000
  29 | Male   | Master          | Software Developer                    |                  3 |  75000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is this sample beam pipeline about ?

Create Virtual Environment

Install Requirements

Build and Run the database container

Run the pipeline

Output

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.idea		.idea
database		database
source		source
.gitignore		.gitignore
README.md		README.md
pipeline.py		pipeline.py
requirements.txt		requirements.txt

carmel-wenga/Apache-Beam-Pipeline-To-load-CSV-data-into-a-PostgreSQL-Table

Folders and files

Latest commit

History

Repository files navigation

What is this sample beam pipeline about ?

Create Virtual Environment

Install Requirements

Build and Run the database container

Run the pipeline

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages