What the hell is dat? #305

Closed
balupton opened this issue Apr 22, 2015 · 15 comments

Comments

@balupton

I've been reading the various links and I can't find anything on what the hell dat is. There are only some high-level features mentioned on the dat homepage, and then instructions on how to use it.

Is there a talk or something on it?

@ErieMeyer

ErieMeyer commented Apr 24, 2015 via email

@dfockler

As of right now, it seems like an easily sharable, syncing file format, essentially git for large datasets. It's kind of weirdly structured because it's meant to hook into other scripts and applications to actually do data transformations. It also seems like they made a move to work on sharing scientific data, which was not the first intended purpose. But the docs don't really explain a clear use case, and it's still in alpha so that could have something to do with it.

@nichoth

nichoth commented Apr 27, 2015

@gobengo

gobengo commented Apr 30, 2015

@okdistribute
Collaborator

Haha! You made me lol.

There are 3 of us working on it now, and we're in the process of finishing beta. We'll update the website with some tutorials, getting started guides, and stuff like that. Thanks @dfockler @nichoth and @gobengo :)

@balupton
Author

balupton commented May 8, 2015

Thanks everyone. The talk became the best resource for me, as it goes into the use case, what it does, when to use it, why it is important, etc. The walkthrough isn't that helpful initially, as it covers how to do it rather than the whys, whats, and what-ifs, which are needed before investing in the hows.

That being said, now that I know what it is, it seems really nifty. Keen to follow this project. Looking forward to the new website and marketing.

@nylen

nylen commented May 8, 2015

I can't quite figure out what the hell it is either... but it's really cool and has amazing amounts of potential.

It seems like a lot of the promised features are not implemented yet. See #296 and #300 for recent examples.

@gobengo

gobengo commented May 12, 2015

I found the 'get-dat' walkthrough very useful (and very technically impressive...).

I realized a cool use case, which is as a 'sink' for any script you would run that spits newline-delimited JSON to stdout. For example, JSON sourced from an API.

Last week I made something like that for work, livefyre-geo-collection. In the near future I'll take a stab at piping it into dat as a way to persist the results of the 'archive' command.
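For anyone unfamiliar with the format, here is a minimal sketch of a script that emits newline-delimited JSON to stdout. The `fetch_records` function and its sample records are hypothetical stand-ins for data sourced from an API; how the output is then piped into dat is not shown, since that depends on the dat version you are running.

```python
import json
import sys


def fetch_records():
    """Hypothetical stand-in for records sourced from an API."""
    yield {"id": 1, "city": "Portland"}
    yield {"id": 2, "city": "Oakland"}


def ndjson_lines(records):
    """Serialize each record as one compact JSON document per line."""
    for record in records:
        yield json.dumps(record, separators=(",", ":"))


if __name__ == "__main__":
    # One JSON object per line, so a downstream consumer can read it as a stream.
    for line in ndjson_lines(fetch_records()):
        sys.stdout.write(line + "\n")
```

Because each line is a complete JSON document, the consumer can process records one at a time without buffering the whole output.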

@okdistribute
Collaborator

@balupton @nylen for some background, I just wrote this whitepaper draft: https://github.com/maxogden/dat/blob/master/docs/whitepaper.md

I'd be happy to receive questions/comments/PRs/suggestions/etc if you have the time! :)

@balupton
Author

balupton commented Jun 7, 2015

@Karissa cool that was useful.

@balupton
Author

balupton commented Jun 7, 2015

So... what do you do with the data once it is in dat?

It seems like the following:

  1. There are external data sources
  2. You pull these external data sources into dat; dat versions and merges them into a local dat database
  3. You then export the dat database into a json file so you can import the data into a database that you can query and work with

Item 3 here, seems more like it should be:

  • Dat can then be configured with automatic export of the latest final data to an external database, so you can query and work with the data

Or something along the lines of being able to work with the current latest/final data in our ideal structures for querying/rendering/etc
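To make the idea concrete, here is a rough sketch of "export the latest data to a database you can query". It is not dat's API: the `CHANGES` log, `latest_rows`, and `export_to_sqlite` are all hypothetical names, and SQLite is used only because it ships with Python.

```python
import json
import sqlite3

# Hypothetical change log: (key, version, row) tuples, as a versioned store might export them.
CHANGES = [
    ("post-1", 1, {"title": "Hello", "source": "tumblr"}),
    ("post-2", 1, {"title": "Track A", "source": "soundcloud"}),
    ("post-1", 2, {"title": "Hello, world", "source": "tumblr"}),
]


def latest_rows(changes):
    """Keep only the highest-version row for each key."""
    latest = {}
    for key, version, row in changes:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, row)
    return {key: row for key, (version, row) in latest.items()}


def export_to_sqlite(rows, conn):
    """Replace the contents of a 'rows' table with the latest data, stored as JSON text."""
    conn.execute("CREATE TABLE IF NOT EXISTS rows (key TEXT PRIMARY KEY, data TEXT)")
    conn.execute("DELETE FROM rows")
    conn.executemany(
        "INSERT INTO rows (key, data) VALUES (?, ?)",
        [(key, json.dumps(row)) for key, row in rows.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
export_to_sqlite(latest_rows(CHANGES), conn)

# Query the exported snapshot like any other database table.
titles = [json.loads(data)["title"] for (data,) in conn.execute("SELECT data FROM rows ORDER BY key")]
```

Re-running the export after each pull would keep the queryable copy in step with the latest versioned data, which is roughly the "automatic export" behaviour described above.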


My actual use case here is to develop a static site generator whose database is a local leveldb, mongodb, or pouchdb database, and which uses dat to import data into that local database from several external sources (prismic.io, wordpress.com, ghost.org, tumblr.com, soundcloud.com, etc). That would make it possible to consolidate a person's online data and render it nicely for an always up-to-date personal portfolio website. Idea brief.

@joyrexus
Contributor

@Karissa thanks for the whitepaper. I found it helpful.

I'd like to offer some general feedback: You close the first section with ...

We introduce Dat, a version-controlled distributed database and data tool that has the user interface of a version control system (VCS).

... and follow with a nice summary of key features. You then compare/contrast w/ Kafka. As a result, the reader has an initial mental model of a distributed database and/or messaging system, both of which may be somewhat misleading.

FWIW, I'd suggest that a somewhat better mental model to put right up front might be a CSV file / spreadsheet table ... with row-level versioning. Whenever I try to explain dat I always start with that, because everyone's familiar with a table of data in Excel and can grok the idea of the table's row-by-row change history. Once that sinks in, I mention that the various versions of the table are "clone-able": so, table of data ... that's versioned ... and replicable.

Anyway ... you do a nice job of explaining dat, but I think putting the idea of a tabular-sheet-of-data-with-change-history front-and-center might be useful to newcomers looking for a simple, concrete mental model. Everything else (blob storage, the REST interface, etc.) can hang off that.
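A toy sketch of that mental model, assuming nothing about dat's actual implementation or API: a table keyed by row, where every write appends to the row's history, and where the whole thing (history included) can be cloned. The `VersionedTable` class and its method names are made up for illustration.

```python
import copy


class VersionedTable:
    """Toy table keyed by row id; every put() appends to that row's history."""

    def __init__(self):
        self._history = {}  # row key -> list of row dicts, oldest first

    def put(self, key, row):
        """Store a new version of a row and return its 1-based version number."""
        versions = self._history.setdefault(key, [])
        versions.append(dict(row))
        return len(versions)

    def get(self, key, version=None):
        """Return the latest version of a row, or a specific 1-based version."""
        versions = self._history[key]
        return dict(versions[-1] if version is None else versions[version - 1])

    def history(self, key):
        """Return every stored version of a row, oldest first."""
        return [dict(v) for v in self._history[key]]

    def clone(self):
        """Return an independent copy, including the full change history."""
        other = VersionedTable()
        other._history = copy.deepcopy(self._history)
        return other


table = VersionedTable()
table.put("alice", {"city": "Chicago"})
table.put("alice", {"city": "Berlin"})

replica = table.clone()                   # table of data ... that's versioned ... and replicable
replica.put("alice", {"city": "Tokyo"})   # edits to the clone do not touch the original
```

The point of the sketch is only the shape of the idea: rows, per-row change history, and clones that carry that history with them.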

@okdistribute
Collaborator

@joyrexus awesome, thanks for your feedback. I made some quick edits, but I'd like to go through at some point and do a more thorough edit on that point!

@tbuckl

tbuckl commented Jul 13, 2015

Hi all, I love the spirit of this project. I agree with @joyrexus: an example of versioning the row and column changes of a simple, small CSV would be required to convince me of @dfockler's comment that Dat has the focus or capacity to be "essentially git for...datasets."

Maybe I am just missing something? I realize this is a nascent enormous project, and like I said: love the spirit of it, but it would take less, in a way, for me to start playing with it, if that makes sense.
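To illustrate the kind of example I mean, here is a rough sketch of what row and column changes between two versions of a small CSV look like. It uses only Python's standard library and is not how dat does it; the sample data and the `read_rows` and `diff_rows` helpers are hypothetical.

```python
import csv
import io


def read_rows(csv_text, key_column):
    """Parse CSV text into a dict mapping each row's key to the full row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_column]: row for row in reader}


def diff_rows(old, new):
    """Return (added, removed, changed) between two keyed row mappings."""
    added = {k: new[k] for k in new if k not in old}
    removed = {k: old[k] for k in old if k not in new}
    changed = {}
    for k in old:
        if k in new and old[k] != new[k]:
            columns = sorted(set(old[k]) | set(new[k]))
            # Record (old value, new value) for each column that differs;
            # a column missing from one version shows up as None.
            changed[k] = {
                c: (old[k].get(c), new[k].get(c))
                for c in columns
                if old[k].get(c) != new[k].get(c)
            }
    return added, removed, changed


# Version 2 edits row 1, drops row 2, adds row 3, and adds a "role" column.
V1 = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
V2 = "id,name,city,role\n1,Ada,Paris,engineer\n3,Linus,Helsinki,dev\n"

added, removed, changed = diff_rows(read_rows(V1, "id"), read_rows(V2, "id"))
```

Something this small, shown end to end with the real tool, is what would get me playing with it.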

@okdistribute
Collaborator

After a whole year of feedback from the community, we recently published a new version of dat. Would you mind trying out the new dat with npm install -g dat? You can read about the new dat announcement on the website and how it works in the docs.
