Consider renaming CsvProvider #48

Closed
ovatsus opened this issue Jan 29, 2013 · 5 comments

Comments

@ovatsus

ovatsus commented Jan 29, 2013

I know this suggestion is a little bold, but thinking about it, CsvProvider currently works not only with CSV files but also with tab-separated files and any other similar textual format. In the future it might well support more formats of tabular data (like xls/xlsx, hdf5/netCDF4, .rdata, .mat, etc.), either directly or maybe as plugins (I have some ideas about how to make that work without changing the API or creating dependencies). The inference and generation of typed properties is the same across all the formats.
Both the R tools and the several Python libraries that work with all those kinds of files are usually called read.table or read_table (even though they have variants called read.csv or read_csv that mainly just change defaults, such as setting the separator to ',').
Do you think renaming CsvProvider to TabularDataProvider would be a good idea? Or are people expecting the current name, in which case we could always make the same type provider available under additional names (like we do with Freebase and WorldBank, which have two versions each)?
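
To illustrate the point about read_table and read_csv, here is a minimal sketch in Python using only the standard library. The function names mirror the ones mentioned above but are defined here for illustration; this is not the FSharp.Data API, and it ignores type inference entirely.

```python
import csv
import io
from functools import partial


def read_table(text, sep="\t"):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return list(reader)


# read_csv is the same reader; only the default separator differs.
read_csv = partial(read_table, sep=",")

tsv_rows = read_table("name\tage\nAnn\t31\n")
csv_rows = read_csv("name,age\nAnn,31\n")
```

Both calls go through the same parsing code, which is the argument for a format-neutral name: the separator is a parameter, not a different kind of reader.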

@ovatsus
Author

ovatsus commented Feb 6, 2013

Related to this, I like the fact that we have a JsonValue that's useful on its own, and then build a type provider on top of it to add type safety. This way we always have access to the underlying JsonValue for edge cases when needed. The XmlProvider is similar, giving access to the underlying XElement.
I think we should have the same pattern for CsvProvider. I'm prototyping something that replaces the base classes CsvRow and CsvFile under RuntimeImplementation and promotes them to first-class citizens, giving them more functionality and getting closer to what R's DataFrame offers (this includes dynamic lookup as described in #64). Then CsvProvider can be built on top, adding type safety, but you can always escape to the underlying values and do the .AsXxx conversions like in the JSON provider.
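
The layering described above can be sketched in a language-neutral way. The following Python is only an illustration of the pattern: `Row`, `Cell`, `TypedPerson`, and the `as_int`/`as_float` methods are hypothetical names invented here, standing in for an untyped CsvRow with .AsXxx conversions and a typed wrapper generated on top of it.

```python
from dataclasses import dataclass


class Cell:
    """A raw string value with explicit conversions (the .AsXxx idea)."""

    def __init__(self, raw):
        self.raw = raw

    def as_int(self):
        return int(self.raw)

    def as_float(self):
        return float(self.raw)


class Row:
    """Untyped row: values stay strings until the caller converts them."""

    def __init__(self, headers, values):
        self._data = dict(zip(headers, values))

    def __getitem__(self, column):
        return Cell(self._data[column])


@dataclass
class TypedPerson:
    """Typed view over a Row; the 'row' field is the untyped escape hatch."""

    row: Row

    @property
    def name(self) -> str:
        return self.row["name"].raw

    @property
    def age(self) -> int:
        return self.row["age"].as_int()


person = TypedPerson(Row(["name", "age", "score"], ["Ann", "31", "4.5"]))
typed_age = person.age                          # typed access
raw_score = person.row["score"].as_float()      # dynamic access to a column
                                                # the typed view does not expose
```

The typed layer adds safety for the known columns, while the underlying row remains usable on its own for edge cases.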

@tpetricek
Member

  • I think keeping CSV in the name is probably a good idea. I expect that people know the name and realize that it actually works for a wider range of tabular data sources, and I think nobody really expects that the comma in Comma-Separated-Values has to be a comma :-)
  • I think we do not need multiple providers. WorldBank needs two because all of its parameters have default values, so one version is not parameterized (WorldBank.Countries....). For CSV (etc.) we always need at least the input.

But:

  • I really like the idea of changing CsvProvider and CsvRow to follow the same style as JsonValue and be standalone types that people can use for dynamic access (I think we can pretty much follow the same pattern and have a module that adds a dynamic operator and various AsXxx extensions).

    If you're happy to look into that, I'll leave it to you (if we do this, we'll need to add another *.fsx file with some documentation for the dynamic access).

@ovatsus
Author

ovatsus commented Feb 7, 2013

I've been writing a bunch of R code lately, so I'll try to convert some of it to use FSharp.Data instead, to get a feel for what would work better as a JsonValue-like API.

@ovatsus
Author

ovatsus commented Apr 10, 2013

With the latest changes from #122, we already have a decent enough dynamic API. I did a comparison between using the type provider, using the dynamic api, and using R here: https://gist.github.com/ovatsus/5354187

One advantage the dynamic version has is that we can slice the columns directly (https://gist.github.com/ovatsus/5354187#file-csvfile-fsx-L45), but we could eventually do something like that with the typed version too. In both cases, the average by column is not very easy to do unless we treat a CSV file as having operations similar to a matrix, and that's not easy to do unless all the columns are of the same type.

The R code is still more concise when doing filtering and mapping on the datasets, so I think we have a lot of room for improvement here. FMat is able to get a Matlab/R-like syntax; maybe we could get some of that too. A possible idea would be something like this: https://gist.github.com/ovatsus/5355630. I'm using the dynamic API and hardcoded a few things to make it look like the typed API. But even if we could make that work on the type provider version, I'm not very happy with it either. @tpetricek do you have any bright ideas?
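
As a rough illustration of the column slicing and per-column average discussed above, here is a small Python sketch using only the standard library. The sample data and the helper names `select` and `column_means` are made up for this example; it is not taken from the linked gists and does not reflect the FSharp.Data API.

```python
import csv
import io
from statistics import mean

# Made-up sample data: one text column and two numeric columns.
text = "city,temp,rain\nOslo,4.0,1.5\nRome,16.0,0.5\n"
rows = list(csv.DictReader(io.StringIO(text)))


def select(rows, columns):
    """Slice the table down to the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


def column_means(rows):
    """Average each column, skipping columns that are not fully numeric."""
    means = {}
    for column in rows[0]:
        try:
            means[column] = mean(float(r[column]) for r in rows)
        except ValueError:
            pass  # non-numeric column such as 'city'
    return means


numeric_slice = select(rows, ["temp", "rain"])
averages = column_means(rows)
```

The `try`/`except` is where the mixed-type problem shows up: a matrix-style average only makes sense for the columns that happen to be numeric, so the text column has to be skipped or handled separately.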

@ovatsus
Author

ovatsus commented Apr 12, 2013

I think the api is good enough for now, and the csv name is not ideal but it's ok, so I'm closing this. Let's keep things minimal until we have more real world feedback

ovatsus closed this as completed Apr 12, 2013
2 participants