- The motivation of this project is to provide the ability to process data in real time from various sources like OpenMRS, EID, etc.
Make sure you have the latest Docker and Docker Compose:
- Install Docker.
- Install Docker-compose.
- Make sure you have Spark 2.3.0 running
- Clone this repository
You only have to run 3 commands to get the entire cluster running. Open up your terminal and run these commands:
# this will pull and start 7 containers (mysql, kafka, connect (debezium), openmrs, zookeeper, portainer and cAdvisor)
# cd /openmrs-etl
export DEBEZIUM_VERSION=0.8
docker-compose -f docker-compose.yaml up
# register the Debezium MySQL connector with Kafka Connect
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-mysql.json
# build and run the Spark application (real-time streaming and processing)
# https://www.youtube.com/watch?v=MNPI925PFD0
sbt package
sbt run
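For orientation, here is a minimal sketch of what the streaming job could look like, assuming the spark-sql-kafka-0-10 dependency is declared in build.sbt. The object name `ObsStream`, the broker address `localhost:9092`, and the use of a local master are assumptions; the actual sources under `src/` may differ.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point; the real job in src/ may be organised differently.
object ObsStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("openmrs-obs-stream")
      .master("local[*]") // or the cluster master URL
      .getOrCreate()

    // Read Debezium change events for the obs table from Kafka
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed host-mapped broker address
      .option("subscribe", "dbserver1.openmrs.obs")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Print every change event to the console as it arrives
    events.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```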
If everything runs as expected, you should see all these containers running:
You can access Portainer here: http://localhost:9000
The OpenMRS application will eventually be accessible at http://localhost:8080/openmrs. Credentials for the shipped demo data:
- Username: admin
- Password: Admin123
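# open a MySQL client session inside the mysql container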
docker-compose -f docker-compose.yaml exec mysql bash -c 'mysql -u $MYSQL_USER -p$MYSQL_PASSWORD inventory'
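# watch the database schema-change history topic produced by Debezium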
docker-compose -f docker-compose.yaml exec kafka /kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --from-beginning --property print.key=true --topic schema-changes.openmrs
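# list the connectors currently registered with Kafka Connect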
curl -H "Accept:application/json" localhost:8083/connectors/
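# shut down the cluster and remove the containers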
docker-compose -f docker-compose.yaml down
- All you have to do is change the topic to `--topic dbserver1.openmrs.<table_name>`, e.g. the obs table as shown below:
docker-compose -f docker-compose.yaml exec kafka /kafka/bin/kafka-console-consumer.sh \
--bootstrap-server kafka:9092 \
--from-beginning \
--property print.key=true \
--topic dbserver1.openmrs.obs
- This section attempts to explain how the clusters work by breaking everything down.
- Everything here has been dockerized, so you don't need to perform these steps manually.
project
│ README.md
│ kafka.md
│ debezium.md
│ spark.md
│ docker-compose.yaml
│ build.sbt
│
└───src
│ │ file011.txt
│ │ file012.txt
│ │
│ └───subfolder1
│ │ file111.txt
│ │ file112.txt
│ │ ...
│
└───project
│ file021.txt
│ file022.txt
- How many brokers will we have? This will determine how scalable and fast the cluster will be.
- How many producers & consumers will we need in order to ingest and process encounter, obs, orders, person, etc.?
- How many partitions will we have per topic?
  - we will definitely need to come up with an intelligent way of calculating the number of partitions per topic
  - keeping in mind that this is correlated with "fault tolerance" and speed of access (see the topic-creation sketch below)
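One way to pin these numbers down is to pre-create the topics with an explicit partition count and replication factor (the RF question below) instead of relying on auto-creation. A minimal sketch using the Kafka admin client, assuming a broker at `localhost:9092`; the partition count and replication factor are placeholder values:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopics {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address

    val admin = AdminClient.create(props)
    // 6 partitions and a replication factor of 1 are placeholders;
    // the real numbers depend on the broker count and throughput targets discussed above
    val obsTopic = new NewTopic("dbserver1.openmrs.obs", 6, 1.toShort)
    admin.createTopics(Collections.singletonList(obsTopic)).all().get()
    admin.close()
  }
}
```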
- Will we allow automatic partition assignment or go manual?
  - going manual is crucial for parallel processing (see the sketch below)
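A quick sketch of the difference in the Kafka consumer API; partition 0 of the obs topic and the broker address are just illustrations:

```scala
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object ManualAssignment {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    // manual: this consumer reads exactly the partitions we hand it; no rebalancing happens
    consumer.assign(Arrays.asList(new TopicPartition("dbserver1.openmrs.obs", 0)))
    // automatic would instead be: consumer.subscribe(Arrays.asList("dbserver1.openmrs.obs"))
  }
}
```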
- Will we need a consumer group in this design?
  - keep in mind that the obs producer will generate many transactions in parallel
- What replication factor (RF)? RF is the number of copies of each partition stored on different brokers.
  - keeping in mind that the replication factor is used to achieve fault tolerance
  - it also depends on the number of brokers we will have
  - it should be predetermined and set during topic creation
- Kafka doesn't retain data forever; that's not its job. Two properties, log.retention.ms and log.retention.bytes, determine retention; the default is 7 days.
  - log.retention.ms - retention by time (default is 7 days); data is deleted once it is older than this
  - log.retention.bytes - retention by size (the limit applies per partition)
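Those two properties are broker-level defaults; the same limits can be overridden per topic with the topic-level keys retention.ms and retention.bytes. A minimal sketch using the admin client, with an assumed broker address and placeholder limits:

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, Config, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object SetRetention {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    val admin = AdminClient.create(props)

    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "dbserver1.openmrs.obs")
    // keep data for 7 days or up to ~1 GiB per partition, whichever limit is hit first
    val retention = new Config(java.util.Arrays.asList(
      new ConfigEntry("retention.ms", "604800000"),
      new ConfigEntry("retention.bytes", "1073741824")))

    admin.alterConfigs(Collections.singletonMap(topic, retention)).all().get()
    admin.close()
  }
}
```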
- How many times should we set the producer to retry after getting an error? (the default is 0; see the producer sketch below)
- The order of delivery with asynchronous sends is not guaranteed; could this be a potential threat?
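Both of these questions come down to producer configuration: retries controls how many times a failed send is retried, and capping max.in.flight.requests.per.connection at 1 prevents retried sends from being reordered. A minimal sketch, assuming we also run hand-written producers alongside Debezium; the broker address, topic name, and values are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ObsProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.RETRIES_CONFIG, "3")                        // retry failed sends instead of the 0 default
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // preserve ordering when retries kick in
    props.put(ProducerConfig.ACKS_CONFIG, "all")                         // wait for all in-sync replicas

    val producer = new KafkaProducer[String, String](props)
    // hypothetical topic and payload, only to show the send call
    producer.send(new ProducerRecord[String, String]("obs-events", "obs-id-1", """{"value": 42}"""))
    producer.close()
  }
}
```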
- Do we need to use consumer groups? (this can scale up processing speed; see the consumer sketch after the autocommit question)
  - we will have to consider designing for rebalancing using offsets
  - why do we even need one?
    - allows you to process a topic in parallel
    - automatically manages partition assignment
    - detects entry/exit/failure of a consumer and performs partition rebalancing
- What about autocommit? Should we override it to false?
  - this will allow us to ensure we don't lose data from the pipeline in case our permanent storage service goes down right after data processing
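A sketch covering both of the last two questions: every instance started with the same group.id joins one consumer group and shares the topic's partitions, and with enable.auto.commit disabled the offsets are committed only after the records have reached permanent storage. The group name, broker address, and the saveToStorage step are placeholders:

```scala
import java.time.Duration
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

object ObsConsumer {
  // placeholder for whatever the permanent storage write ends up being
  def saveToStorage(record: ConsumerRecord[String, String]): Unit =
    println(s"${record.key} -> ${record.value}")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "obs-processors")          // same id on every instance => one consumer group
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")         // commit only after processing succeeds
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("dbserver1.openmrs.obs"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500)).asScala
      records.foreach(saveToStorage)
      consumer.commitSync() // offsets advance only once the batch is persisted
    }
  }
}
```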
- Schema evolution design strategy
  - so that our producers and consumers can evolve independently; otherwise we will have to create duplicate producers and consumers whenever the schema changes