This repository contains the code and the report for the Doodle Hiring Challenge described here: https://github.com/tamediadigital/hiring-challenges/tree/master/data-engineer-challenge
⇨ see report.md for details ⇦
Folders:

- `cmd`: contains the main program
- `baseline`: contains a basic baseline program that just reads frames from Kafka
- `benchmarks`: contains the script + results of a little benchmark experiment
Files:

- `report.md`: explains and discusses the current solution and what else we could do
- `docker-compose.yml`: what I used to run Kafka on my Mac
To build and run:

- have a working Go installation
- have Kafka running on `localhost:9092`
- have two Kafka topics, `doodle` and `doodle-out` (see the constants in `cmd/main.go`)
- have the content of `stream.jsonl` in the Kafka topic `doodle` (see the challenge readme; do it only once!)
- (if using my docker-compose) have a folder `data` with the `stream.jsonl` file (see below)
- clone this repo
- install dependencies: `go get github.com/segmentio/kafka-go` or `go get ./cmd/...`
- build and run using `go run cmd/*.go`, or build then run using `go build -o doodlechallenge cmd/*.go && ./doodlechallenge` (a minimal consumer sketch follows this list)
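For orientation only, here is a minimal consumer sketch using `segmentio/kafka-go` against the broker and topic named above. This is not the actual code in `cmd/main.go` (see `report.md` for that); the consumer group id and the print loop are purely illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Broker address and topic match the defaults described above;
	// the group id is hypothetical.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "doodle",
		GroupID: "doodle-sketch",
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("offset %d: %s\n", msg.Offset, msg.Value)
	}
}
```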
Get the data:
mkdir data
cd data
wget http://tx.tamedia.ch.s3.amazonaws.com/challenge/data/stream.jsonl.gz
gunzip -k stream.jsonl.gz
cd ..
Launch the containers:
docker-compose up -d
Open a bash shell inside the Kafka container:
docker exec -it doodlechallenge_kafka_1 bash
Set up the topics and the data (still inside the kafka container):
/opt/bitnami/kafka/bin/kafka-topics.sh --create \
--zookeeper zookeeper:2181 \
--replication-factor 1 --partitions 1 \
--topic doodle
/opt/bitnami/kafka/bin/kafka-topics.sh --create \
--zookeeper zookeeper:2181 \
--replication-factor 1 --partitions 1 \
--topic doodle-out
# optionally, set a very low retention policy on the out topic, in case we run the program many times
/opt/bitnami/kafka/bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--topic doodle-out \
--config retention.ms=1000
cat /data/stream.jsonl | /opt/bitnami/kafka/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic doodle
Note that the last command may take a while: there are 1M records to ingest!
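If the console producer feels too slow for those 1M records, a possible alternative is to load the file from the host with a small Go program using kafka-go's writer. This is only a sketch under the assumptions of this README (broker reachable from the host at `localhost:9092`, topic `doodle` already created, file at `data/stream.jsonl`); it is not part of the repository.

```go
package main

import (
	"bufio"
	"context"
	"log"
	"os"

	"github.com/segmentio/kafka-go"
)

func main() {
	f, err := os.Open("data/stream.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Writer pointed at the broker/topic used in this README.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "doodle",
	}
	defer w.Close()

	ctx := context.Background()
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // allow long JSON lines

	batch := make([]kafka.Message, 0, 1000)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		if err := w.WriteMessages(ctx, batch...); err != nil {
			log.Fatal(err)
		}
		batch = batch[:0]
	}

	for scanner.Scan() {
		// copy the line: the scanner reuses its internal buffer
		line := append([]byte(nil), scanner.Bytes()...)
		batch = append(batch, kafka.Message{Value: line})
		if len(batch) == cap(batch) {
			flush()
		}
	}
	flush()
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Handing `WriteMessages` a slice of records at a time keeps the number of produce round trips low, which is the main reason this tends to go faster than piping the file line by line.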
Derlin @ 2020 (during the coronavirus outbreak)