Big Data Architecture

fcrimins edited this page Apr 21, 2017 · 4 revisions

Out-of-core data options (4/21/17)

  • Why not just use TensorFlow for everything?
    • Dask creates its own execution graphs, but why is this necessary when TF already has them?
    • In particular, TF can read its input directly from files, so why not just write the files and start the TF graph there?
    • .tfrecords file format: all records for an entire training/validation/test set are intended to be written to a single file. See the example here (which also includes good example usage of argparse and tf.app).
  • Dask
    • Out-of-core functional, NumPy-style array, and dataframe collections, promoted by @jakevdp, so it must be good.
  • Xray + Dask: Out-of-Core, Labeled Arrays in Python
    • Xray seems to have a clunky interface.
    • And doesn't Dask have the same functionality?
  • Good YouTube talk tracing the history from relational dbs (SQL) -> semi-structured -> document stores (NoSQL) and describing the differences between them, with a description of Hadoop (an architecture paradigm) along the way.
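
To make "out-of-core" in the notes above concrete, here is a minimal sketch in plain Python of the underlying idea: stream a file in fixed-size chunks and combine per-chunk partial results, so only one chunk is in memory at a time. This is not Dask's or TensorFlow's API. The helper names (write_numbers, iter_chunks, out_of_core_mean, demo) and the one-number-per-line file layout are assumptions made up for illustration.

```python
import os
import tempfile
from itertools import islice


def write_numbers(path, n):
    """Write the integers 0..n-1 to a text file, one per line (hypothetical layout)."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def iter_chunks(path, chunk_size):
    """Yield lists of at most chunk_size floats, reading the file lazily."""
    with open(path) as f:
        while True:
            # islice pulls only the next chunk_size lines from the file iterator.
            chunk = [float(line) for line in islice(f, chunk_size)]
            if not chunk:
                return
            yield chunk


def out_of_core_mean(path, chunk_size=1000):
    """Compute the mean while holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(path, chunk_size):
        total += sum(chunk)  # partial sum for this chunk
        count += len(chunk)  # partial count for this chunk
    if count == 0:
        raise ValueError("no records found in " + path)
    return total / count


def demo():
    """Write 10,000 numbers to a temporary file and average them chunk by chunk."""
    path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
    write_numbers(path, 10_000)
    return out_of_core_mean(path, chunk_size=1000)
```

Calling demo() averages the integers 0..9999 in ten chunks. Roughly speaking, a library like Dask automates this pattern (per-chunk work plus a combine step) and schedules it as a task graph, which is one way to frame the question above of whether TF's own graphs could do the same job.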