real-world dataflow problems

Some problems I observed in real dataflow pipelines

ETL

SQL

SQL nowadays is so much more than SQL-92 (which most people are familiar with): arrays, JSON, XML and more can be handled. In distributed systems, ordering (total ordering vs. partial ordering within partitions) turns out to be an important concept to master as well; the sketch below illustrates both points.
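
A minimal sketch of both points, using DuckDB as a stand-in for any modern SQL engine (my choice here, not from the notes): JSON is queried directly, and row_number() over a partitioned window yields a partial ordering per user rather than a total ordering of the whole table.

```python
# Sketch assuming the duckdb package; any modern SQL engine works similarly.
import duckdb

duckdb.sql("""
    WITH events(user_id, ts, payload) AS (
        VALUES (1, 10, '{"action": "login"}'),
               (1, 20, '{"action": "click"}'),
               (2, 15, '{"action": "login"}')
    )
    SELECT user_id,
           ts,
           json_extract_string(payload, '$.action') AS action,
           -- ordering is only guaranteed within each user_id partition
           row_number() OVER (PARTITION BY user_id ORDER BY ts) AS seq
    FROM events
    ORDER BY user_id, seq
""").show()
```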

dataflow

  • understand the limits of your architecture / technology and dataflows regarding speed, capacity, latency, the types of data that can be handled, ...
  • define clear APIs (schemata) between different data sources and pipelines to allow building workflows on top of the ingested data streams
  • be clear about what type of data you want to process (batch vs. streaming). Do not force batch semantics onto a stream processor. Idempotent jobs will make your life much easier (see the sketch after this list).
  • for batch workloads consider Oozie over NiFi. It just works™
  • some great NiFi tips: https://pierrevillard.com/best-of-nifi/
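
A minimal sketch of the idempotency point, assuming a simple file-based batch job (the paths and the trivial transform are hypothetical placeholders): output for a logical run date is written to a staging directory and published in one replace step, so re-running overwrites instead of duplicating.

```python
# Idempotent batch job sketch: re-running for the same run_date yields the
# same output instead of appending duplicates. Paths and the transform are
# hypothetical placeholders.
import os
import shutil
import tempfile

def process_partition(run_date: str) -> str:
    # Placeholder transform; a real job would read the day's input here.
    return f"run_date,value\n{run_date},42\n"

def run_batch(run_date: str, out_root: str = "/tmp/output") -> str:
    os.makedirs(out_root, exist_ok=True)
    final_dir = os.path.join(out_root, f"date={run_date}")  # deterministic target
    staging = tempfile.mkdtemp(prefix=f"staging-{run_date}-")
    with open(os.path.join(staging, "part-0000.csv"), "w") as f:
        f.write(process_partition(run_date))
    shutil.rmtree(final_dir, ignore_errors=True)  # replace, never append
    shutil.move(staging, final_dir)               # single publish step
    return final_dir

run_batch("2024-01-01")  # running this twice produces identical output
```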

security

Project management

When working on a data project it is even more important to convey a story. https://www.youtube.com/watch?v=plFPTDwk66s shows six points on how to improve visualizations by telling a clear story.

operations

scale out over multiple datacenters

monitoring

  • build in monitoring E2E by design (a minimal sketch follows below)
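
One way to bake monitoring in by design is to route every pipeline stage through a single instrumented wrapper, so end-to-end coverage comes for free instead of being bolted on later. A minimal sketch; the print() sink and the stage names are assumptions standing in for a real metrics backend:

```python
# Every stage reports record counts and wall-clock duration through one helper.
import functools
import time

def monitored(stage_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(records):
            start = time.monotonic()
            result = fn(records)
            print(f"stage={stage_name} in={len(records)} "
                  f"out={len(result)} seconds={time.monotonic() - start:.3f}")
            return result
        return wrapper
    return decorator

@monitored("deduplicate")
def deduplicate(records):
    return list(dict.fromkeys(records))

deduplicate(["a", "b", "a"])  # emits one metrics line for the stage
```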

llap

LLAP might not start (e.g. on a small development cluster) if not enough memory is available or a node is down. However, as of HDP 2.6.4 no meaningful error message is displayed.

hardware

machine learning

  • no proper strategy for a holdout group and for preventing feedback loops
  • model serving not scalable, no fully fledged ML solution available
  • when working on a machine learning prototype, the chance is high - if the results look promising - that the model will need to be deployed into a production environment. Business stakeholders will expect a smooth & quick transition to production mode (the results already look so great). Therefore, make sure to only use data sources which are actually available in a production setting, and make sure to get the data directly at the source
  • understand the problem domain. Very often regular k-fold cross validation is not a good fit as there is a dependency on time. Use time series cross validation (possibly customized) to perform CV in a setting which resembles the actual business use case (see the sketch after this list)
  • proper understanding of the data. Check for errors (too much / not enough / wrong units / ...)
  • work with and talk to the department to prevent data leakage into the model
  • reproducibility is a problem (https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/) and the overall pipeline needs to be thought out well E2E
  • model management is a big problem. This book describes it nicely: https://mapr.com/ebooks/machine-learning-logistics/
  • planning: http://deon.drivendata.org
  • spark pipeline example: https://engineering.autotrader.co.uk/2018/10/03/productionizing-days-to-sell.html
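
A minimal sketch of time series cross validation with scikit-learn's TimeSeriesSplit (the synthetic data is purely illustrative): each fold trains only on data that precedes its test window, which resembles how the model is actually used in production.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # feature rows in chronological order
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # train always ends before test begins: no peeking into the future
    print(f"fold {fold}: train [0..{train_idx[-1]}] -> "
          f"test [{test_idx[0]}..{test_idx[-1]}]")
```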

Why do many analytics projects fail? https://www.fast.ai/2018/07/12/auto-ml-1/

It is not always about the algorithm: https://www.youtube.com/watch?v=kYMfE9u-lMo; the details and the thinking around it are what matter. Think about a whole system (ideally a simple one) which actually delivers value vs. a very complex algorithm only running in a notebook: https://www.youtube.com/watch?v=68ABAU_V8qI. And more importantly: make the results tangible, for example using tools like https://github.com/gradio-app/gradio, to build trust with business units.
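
For example, a minimal gradio sketch (the predict function is a trivial placeholder standing in for a real model):

```python
import gradio as gr

def predict(text: str) -> str:
    # Placeholder "model": real code would call the trained pipeline here.
    return "positive" if "good" in text.lower() else "negative"

demo = gr.Interface(fn=predict, inputs="text", outputs="text",
                    title="Toy sentiment demo for business stakeholders")
demo.launch()  # serves a local web UI stakeholders can click around in
```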

https://arxiv.org/abs/2107.00079 collects various MLOps anti-patterns; https://towardsdatascience.com/how-not-to-do-mlops-96244a21c35e gives a short summary of them.

business value

  • evaluation of models and their presentation to a non-technical audience: https://modelplot.github.io, https://github.com/gradio-app/gradio
  • before starting out with a data science use case, clearly define what constitutes success and how it is measured (oftentimes this means figuring out how to obtain labels in a setting where no labelled data was collected before)

hiring

teams & organization

culture

big data

DO NOT do big data! Unless you really have big data and fully understand all the consequences of a distributed system, invest a couple of $$ into beefier single-node computers instead. High single-thread performance + lots of RAM will make you so much more productive.

scalability

Sometimes extreme scalability is not required! Do not get stuck in thinking you actually need it. Think of a scenario with many events for each user but an almost constant number of users. Such a scenario can warrant different algorithms to optimally process the data, as the sketch below shows.
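
A minimal sketch of that scenario, under the assumption of simple (user_id, value) events: because the number of users is bounded, a plain single-node dict of per-user aggregates handles an unbounded event stream with no distributed shuffle at all.

```python
# Memory grows with the (near-constant) number of users, not with the
# (unbounded) number of events. Event shape is an illustrative assumption.
from collections import defaultdict

def aggregate(events):
    """events: iterable of (user_id, value) pairs, arbitrarily many."""
    totals = defaultdict(float)
    for user_id, value in events:
        totals[user_id] += value
    return totals

aggregate([("alice", 1.0), ("bob", 2.5), ("alice", 0.5)])
```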

Still, if required, build for scale, i.e. for many users. But even more important is a scalable architecture of small and reusable components. Git submodules can be a tool which supports this, even for otherwise hard-to-version artifacts.

small files problem

Many small files (a lot smaller than the HDFS block size) cause a performance degradation, since every file costs a NameNode metadata entry and typically its own task at read time. Common workarounds: periodically compact small files into larger ones (sketched below), pack them into Hadoop archives (HAR), or use container formats such as SequenceFile/Avro/Parquet that hold many records per file.
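
A minimal sketch of the compaction workaround, assuming PySpark and hypothetical HDFS paths: read the many small files and rewrite them as a handful of larger ones.

```python
# Compaction sketch (paths and the target of 8 output files are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.parquet("hdfs:///data/raw/events/")   # many tiny files
(df.coalesce(8)                                       # merge into 8 output files
   .write.mode("overwrite")
   .parquet("hdfs:///data/compacted/events/"))
```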

specific helpful issues

serving results of big data computation

java stuff

containers

great quotes

innovation

google learnings

architecture and cloud

https://thefrugalarchitect.com/

project management - general