[SYSTEMML-153] Read file with extension 'csv' not require mtd file #66

Closed

deroneriksson
Member

If a read statement reads a file with a .csv extension and no format parameter is specified, as in read("m.csv"), the file is treated as a csv file, so a metadata file is optional rather than required.
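
The rule described above can be sketched in a few lines of shell. This is only an illustration of the decision logic, not SystemML's implementation (which is Java); the function name `infer_format` and the `unknown` fallback value are hypothetical, and the match is assumed to be case-sensitive.

```shell
# Hypothetical helper, not SystemML code: choose a read format from the file
# name when the caller did not pass an explicit format.
infer_format() {
  infer_path="$1"
  infer_explicit="$2"   # optional; an explicit format always wins
  if [ -n "$infer_explicit" ]; then
    printf '%s\n' "$infer_explicit"
    return 0
  fi
  case "$infer_path" in
    *.csv) printf 'csv\n' ;;        # extension alone is enough, no .mtd needed
    *)     printf 'unknown\n' ;;    # caller would fall back to the .mtd file
  esac
}

infer_format "m.csv"          # prints: csv
infer_format "m.data"         # prints: unknown
infer_format "m.data" "csv"   # prints: csv
```

An explicit format is checked first so that existing scripts passing format="csv" would behave exactly as before; the extension is consulted only when nothing else is given.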

@mboehm7
Contributor

mboehm7 commented Feb 10, 2016

Did we actually arrive at a consensus on this JIRA?

@dusenberrymw
Contributor

Tried this out on Spark and it works great. Having the ability to automatically infer a CSV format for a ".csv" file extension is smart, and avoids confusion for the new user. This doesn't affect any existing options such as supplying a format="csv" argument to the read(...) statement, and instead will simply add a much-needed convenience.

Also, I ran this using our standalone distribution, and received a java.lang.ClassNotFoundException: org.apache.commons.io.FilenameUtils error. No problems on Spark though.

@deroneriksson
Member Author

Great catch, @dusenberrymw. I had assumed commons-io was already among the libs for standalone since it is such a common library, but I was mistaken. I'll update the standalone assembly to include it.

@mboehm7 I don't believe there was a consensus. My main concern with this issue is that I would prefer not to intimidate new users by requiring them to do things like echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd when they first try to use one of the main algorithms. The required metadata file for csv (if there was no format parameter) was one of the things I first felt frustrated by when I started using SystemML. If I created a csv file containing "1,2,3,4", I did not want to also have to create a JSON-formatted metadata file. Perhaps we can get feedback from some new users to find out if others feel the same way?

@ethanyxu
Contributor

As a new user of SystemML I vote for the idea. When I first started to use SystemML, accidental errors in metadata were the most frustrating part. It took me a while to search through the documentation for my mistakes, only to find that the error was a typo in an mtd file.

To build a simple sample I had to write three mtd files manually:

  • For the original data file:
echo '{
    "data_type": "frame",
    "format": "csv",
    "sep": ",",
    "header": false,
    "na.strings": [ "NA", "null", "NULL", "NaN" ]
}' | hadoop fs -put - my.data.file.csv.mtd
  • For the type file as an input argument of the transform() function:
echo '1,2,2,1,1' | hadoop fs -put - file-type.csv
echo '{"rows": 1, "cols" : 5, "format":"csv"}' | hadoop fs -put - file-type.csv.mtd
  • For the split-percentage file used as input to 'sample.dml':
printf "0.7\n0.3" | hadoop fs -put - split-perc.csv
echo '{"rows": 2, "cols": 1, "format": "csv"}' | hadoop fs -put - split-perc.csv.mtd
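
The row and column counts in the second and third examples can be derived from the file instead of typed by hand, which avoids the kind of typo described above. The following is a minimal sketch under stated assumptions: `csv_mtd` is a hypothetical helper name, the file is a local, comma-separated matrix with no header row and no quoted fields containing commas, and only the three keys used in the commands above are written.

```shell
# Hypothetical helper: print matrix metadata for a local CSV file in the same
# JSON shape as the hand-written .mtd files above.
# Assumes no header row and no quoted fields that contain commas.
csv_mtd() {
  awk -F',' '
    NR == 1 { cols = NF }   # column count taken from the first row
    END { printf "{\"rows\": %d, \"cols\": %d, \"format\": \"csv\"}\n", NR, cols }
  ' "$1"
}

printf '0.7\n0.3' > split-perc.csv
csv_mtd split-perc.csv > split-perc.csv.mtd
cat split-perc.csv.mtd
```

awk is used rather than `wc -l` because it still counts a last line that has no trailing newline, as produced by the printf command above. The generated file would then be uploaded with `hadoop fs -put` in the same way as the hand-written ones.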

@ethanyxu
Contributor

I think the first one is not avoidable though.

@deroneriksson
Member Author

@ethanyxu Thank you for your comments. I had a very similar experience. Your second and third examples are exactly the kinds of situations where I don't want to supply metadata.

@deroneriksson
Member Author

Since no consensus could be reached and the MLContext API allows data input without what I would consider burdensome mandatory metadata requirements, I'll close this PR.

asfgit pushed a commit that referenced this pull request Mar 27, 2020
Add construction of federated object with `federated`
Adds a new function `federated` which takes two parameters `addresses`
and `ranges`.

Closes #66.