SPARK-12347 [ML][WIP] Add a script to test Spark ML examples. #15279
|
So at first glance this seems like it might be a work in progress PR - if that is the case you can tag it with [WIP] in the title. I think if we had a blacklist of examples not to run this could be a good start, and then we would want to add it to jenkins script. I'll try and follow up next week with a more thorough examination :) |
|
Can you please change the title to have: "SPARK-12347" -> "[SPARK-12347]"? |
|
ok to test |
|
Test build #68646 has finished for PR 15279 at commit
|
|
Hi, @ethanluoyc . |
|
Hi - how are we on this PR? |
|
@felixcheung Last time I was working on this PR, the main challenge was to work out a nice way to record the examples that require input and pass those arguments properly. Do you have any suggestions on that? |
|
perhaps a test driver .yml file? |
|
Test build #70088 has finished for PR 15279 at commit
|
|
@felixcheung I uploaded the yaml file for configuring the arguments passed into examples. |
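A minimal sketch of how a driver might turn such a configuration into commands. The mapping below is a hypothetical stand-in for the parsed yaml file, and `build_command` and `build_all` are illustrative names, not the script in this PR; the only assumption about Spark itself is that examples are launched through `bin/spark-submit`.

```python
import os

# Hypothetical stand-in for the parsed yaml: example path -> arguments to pass.
EXAMPLE_ARGS = {
    "examples/src/main/python/sort.py": ["examples/src/main/resources/people.txt"],
    "examples/src/main/python/pi.py": [],
}


def build_command(spark_home, example, args):
    """Return the argv list that would launch one example via spark-submit."""
    submit = os.path.join(spark_home, "bin", "spark-submit")
    return [submit, os.path.join(spark_home, example)] + list(args)


def build_all(spark_home, example_args):
    """Build one command per configured example, in a stable order."""
    return [
        build_command(spark_home, example, args)
        for example, args in sorted(example_args.items())
    ]
```

Each returned list could then be handed to `subprocess.call`, with a non-zero exit code treated as a failing example.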
|
Test build #71768 has finished for PR 15279 at commit
|
|
@ethanluoyc this seems reasonable, although I'm not 100% sure how this will go yet. |
|
@felixcheung I actually agree with you on examples being self-contained. In that case, the example authors can just declare the arguments in the examples. What I think may be appropriate is some macro-like text in the examples themselves (e.g. like a shebang line, but for input files). Then, instead of looking for the arguments in the yml, we can look for them in the examples. Another way to do it would be to have the examples supply their own default arguments. But I would not say this is a good idea, because then all examples would need to be modified. |
|
What if we have a bunch of default values when arguments are not set, and those are the values we could test with? This way the same sample code can run with and without arguments? |
|
@felixcheung I think we can, although it requires changes to every example. If you are fine with that, I can start on some of the examples from now onwards :D |
|
I think that should be reasonable, but let's start with one or two examples and discuss. thanks! |
|
@felixcheung I changed two examples in python. These are the simple ones. However, some of the examples take really complicated arguments and I am not sure if doing it in this way would scale. What do you think? |
|
Test build #72787 has finished for PR 15279 at commit
|
| Usage: sort <file> | ||
| """ | ||
|
|
||
| filename = "../resources/people.txt" if len(sys.argv) != 2 else sys.argv[1] |
I think maybe better to reference relative to SPARK_HOME?
you probably would have one check if len(sys.argv) > 1 and then set all parameters in the if block - we can assume if any parameter is there the user needs to specify all of them?
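A minimal sketch of the two suggestions above, assuming the default input is resolved against the `SPARK_HOME` environment variable and that passing any argument means passing all of them. `resolve_args` and `default_input` are hypothetical helper names, not part of this PR.

```python
import os


def default_input(relative_path, env=os.environ):
    """Resolve a bundled resource relative to SPARK_HOME (falls back to '.')."""
    spark_home = env.get("SPARK_HOME", ".")
    return os.path.join(spark_home, relative_path)


def resolve_args(argv, defaults):
    """Use the defaults when no arguments are given; otherwise require all of them."""
    user_args = list(argv[1:])
    if not user_args:
        return list(defaults)
    if len(user_args) != len(defaults):
        raise ValueError(
            "expected %d argument(s), got %d" % (len(defaults), len(user_args))
        )
    return user_args
```

In an example such as sort.py this might be used as `filename, = resolve_args(sys.argv, [default_input("examples/src/main/resources/people.txt")])`, so the same file runs with or without arguments.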
|
that looks reasonable. could you point me to the ones with complicated argument you were referring to? |
|
For example https://github.com/apache/spark/blob/master/examples/src/main/python/parquet_inputformat.py#L26. It requires that we pass in jar files, which are not so easy to detect directly. |
|
shall we use |
|
Hi @ethanluoyc, is it still WIP? I would rather like to propose to close this if it is inactive. |
|
@HyukjinKwon I am really sorry but please close this since I have not found the time to complete all the cases. |
This PR addresses SPARK-12347 and may also be helpful for SPARK-15571.
What changes were proposed in this pull request?
This PR adds a python script to drive all the examples located in the examples subdirectory. It should be able to streamline the testing of the examples in R, Python, Scala and Java to see if any of them has incompatible behavior with the codebase.
How was this patch tested?
This PR is not yet fully ready for merging. I would like to have some reviews for how it works best for those who will indeed be using this script. For now, it introduces the following features:
Note that the last TODO is really important; I would like to hear suggestions from the reviewers on how it should best be implemented. For now, I think one good way would be to have comments as directives in the example code, like how the # $example on$ markers were introduced to facilitate doc generation. We can do something similar to hint at what arguments should be passed in for testing. Otherwise, we can always fall back to the approach we discussed in the JIRA SPARK-12347.
Also, some of the functionality replicates that in run-tests.py. Perhaps we can find a way to integrate both?
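To illustrate the comment-directive idea, here is a minimal sketch that scans an example's source for test arguments. The `$example args: ...$` syntax is purely hypothetical, modeled on the existing `$example on$` markers; nothing in Spark defines it, and `find_test_args` is an illustrative name.

```python
import re
import shlex

# Hypothetical directive: "# $example args: <arguments>$" (or "//" for Scala/Java).
DIRECTIVE = re.compile(r"^\s*(?:#|//)\s*\$example args:\s*(.*?)\$\s*$")


def find_test_args(source):
    """Return the argument list from the first directive, or None if absent."""
    for line in source.splitlines():
        match = DIRECTIVE.match(line)
        if match:
            # shlex.split honours shell-style quoting inside the directive.
            return shlex.split(match.group(1))
    return None
```

A driver could call this on each example file and fall back to running the example with no arguments when it returns None.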