thought-process.md

Notes

Farming of abalone began in the late 1950s and early 1960s in Japan and China!

Pre-coding

  • I tried to understand the streaming mindset and the landscape around it. One of the most important questions I tried to answer was: "Why is streaming better than triggering the database after each update?"
  • Then I figured out the usual stack in the Java world, and found the corresponding stack in the Python world. I spent some time learning about the ELK/EKK stack. I decided to use Kafka in Docker, and Python with faust and confluent_kafka. Here is a great introductory series on Kafka.
  • I tried to figure out which parts of Kafka are opinionated, and where the limits of its configurability lie. I read some of Kafka's source code and found that Scala looks very nice. I realized that Kafka is very configurable: Partitioner.java, for example, is very concise, and some low-level JVM settings are exposed.

Coding

At this point I went on with complete trust in the stack and in my general approach.

  • I first divided the task into 5 or so pieces, the first one being "produce and consume the abalone_full csv file with just confluent_kafka". This was basically a matter of reading the confluent-kafka documentation and connecting everything together (see the first sketch after this list).
  • I also needed to set up a Kafka environment, for which I decided to use Docker; this turned out to be painful, however. These notes were helpful.
  • For processing, I first tried a trivial faust app that simply writes the stream to a csv file, generating a file equivalent to the input (the second sketch below). Then I slowly found my way into the actual tasks.
  • One of the first problems I faced was that the group_by keyword in Faust does not seem to work on int-typed field descriptors. This made me use a hack instead of that keyword (the third sketch below), which is probably not healthy given the whole async world underneath.
  • Another problem was deciding when to close the csv files. Since the process is in theory infinite, I thought a decent approach was to check every few (say 10) seconds whether the new csv files have received updates, and close them if not (the last sketch below). I didn't want to make this depend on the input file, but it still feels extremely hacky. I couldn't get at the buffered data without closing the file.
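
The first piece looked roughly like the sketch below. It assumes a broker on localhost:9092, a topic named abalone, and an input file called abalone_full.csv; the names are illustrative rather than the exact code.

```python
# A minimal produce/consume round trip with confluent_kafka.
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Stream the input csv into the topic, one message per row.
with open("abalone_full.csv") as infile:
    for line in infile:
        producer.produce("abalone", value=line.strip().encode("utf-8"))
producer.flush()  # block until every queued message is delivered

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "abalone-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["abalone"])

while True:
    msg = consumer.poll(1.0)  # wait up to one second for a message
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    print(msg.value().decode("utf-8"))
```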
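
The trivial faust app was along these lines. It's a sketch, not the exact app: it assumes the rows arrive as JSON-serialized records (faust's default serializer) and the field names are hypothetical. The worker then runs with `faust -A <module> worker`.

```python
# A trivial faust app: consume the abalone topic and append every
# record to a csv file, reproducing the input file over time.
import faust

app = faust.App("abalone-echo", broker="kafka://localhost:9092")

class Abalone(faust.Record):
    sex: str
    length: float
    rings: int

topic = app.topic("abalone", value_type=Abalone)

@app.agent(topic)
async def echo(stream):
    with open("abalone_copy.csv", "a") as out:
        async for abalone in stream:
            out.write(f"{abalone.sex},{abalone.length},{abalone.rings}\n")
```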
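
A sketch of the kind of hack I mean: repartitioning manually through an intermediate topic, keyed by the stringified int field. The topic names and fields are hypothetical, and this is not necessarily the exact workaround I ended up with.

```python
import faust

app = faust.App("abalone-group", broker="kafka://localhost:9092")

class Abalone(faust.Record):
    sex: str
    length: float
    rings: int

source = app.topic("abalone", value_type=Abalone)
# Intermediate topic, keyed by the stringified int field.
by_rings = app.topic("abalone-by-rings", key_type=str, value_type=Abalone)

# What I wanted (fails for me on int-typed field descriptors):
#   async for abalone in stream.group_by(Abalone.rings): ...

@app.agent(source)
async def repartition(stream):
    async for abalone in stream:
        await by_rings.send(key=str(abalone.rings), value=abalone)

@app.agent(by_rings)
async def per_key(stream):
    async for abalone in stream:
        ...  # per-key processing goes here
```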
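
A sketch of the idle check, using faust's app.timer. The bookkeeping dicts are hypothetical; the agents doing the writing would update them on every append.

```python
import time

import faust

app = faust.App("abalone-writer", broker="kafka://localhost:9092")

open_files = {}   # csv path -> open file handle
last_write = {}   # csv path -> timestamp of the most recent append

@app.timer(interval=10.0)
async def close_idle_files():
    # Every 10 seconds, close any output csv that has not been
    # appended to recently, so its buffered rows finally reach disk.
    now = time.time()
    for path, ts in list(last_write.items()):
        if now - ts >= 10.0:
            open_files.pop(path).close()
            del last_write[path]
```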

Assumptions

  • Concurrency concerns are mentioned in the faust user guide. I decided not to change that variable, since it would only affect the individual agent, and one can then no longer use the group_by keyword (see the sketch below). I noticed that the csv files for the different subtasks were being created simultaneously, so I assume Faust provides some level of concurrency by default. There are some vague comments about this here.
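
For reference, the variable in question is the agent's concurrency setting; a minimal sketch, with hypothetical names:

```python
import faust

app = faust.App("abalone-concurrent", broker="kafka://localhost:9092")
topic = app.topic("abalone")

# Two coroutine instances of this agent consume the same stream
# concurrently; as the user guide warns, such an agent processes
# events out of order and cannot be combined with group_by.
@app.agent(topic, concurrency=2)
async def handle(stream):
    async for event in stream:
        ...  # per-event work goes here
```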

Comments

  • Even though it's very rigid and discrete, Kafka having concepts like a Stream Processing Topology is exciting, and I have a vague idea that it could be generalized (at least to simplicial complexes).
  • Faust was not well documented, and it was buggy. Sending a kill signal and restarting the worker was a pain. I see that it has very recent GitHub issues on these topics; however, I can't tell whether those are really just about asyncio.
  • It was surprising that Spotify's Docker image for Kafka + Zookeeper is documented worse (and is hence harder to use) than wurstmeister's image.
  • I was looking for the low-level algorithms and efficiency hacks that Kafka uses. The sendfile operation is one of them (see the sketch below).
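
For illustration, the same zero-copy primitive is exposed in Python as os.sendfile; this is a sketch of the idea, not how Kafka (a JVM program) actually invokes it.

```python
# Zero-copy transfer with sendfile(2): the kernel moves bytes from the
# file straight to the socket, skipping userspace buffers entirely.
# This is the primitive Kafka brokers lean on to serve log segments.
import os
import socket

def serve_file(conn: socket.socket, path: str) -> None:
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # os.sendfile returns the number of bytes actually sent
            offset += os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
```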

Post-coding

  • Here is a great article about distributed consensus.