Conversation

@ethanluoyc

@ethanluoyc ethanluoyc commented Sep 28, 2016

This PR addresses SPARK-12347 and may also be helpful for SPARK-15571.

What changes were proposed in this pull request?

This PR adds a Python script to drive all the examples located in the examples subdirectory. It should streamline testing of the examples in R, Python, Scala, and Java, to see whether any of them have behavior that is incompatible with the codebase.

How was this patch tested?

This PR is not yet fully ready for merging. I would like some reviews on how it would work best for those who will actually be using this script. For now, it introduces the following features (a sketch follows this list):

  • Configuration via environment variables.
  • Running examples that require arguments to be passed in.
  • Selecting examples of a specific language to run (R, Python, Scala, or Java).
  • Running in parallel (like run-tests.py does).
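
Roughly, the driver could be structured along these lines; names like LANG_GLOBS, run_example, and EXAMPLES_LANG are placeholders for illustration, not from the actual patch:

    #!/usr/bin/env python
    # Hypothetical driver sketch. Only Python and R globs are shown here,
    # since Scala and Java examples go through bin/run-example instead.
    import glob
    import os
    import subprocess
    from multiprocessing.pool import ThreadPool

    SPARK_HOME = os.environ.get("SPARK_HOME", ".")
    SPARK_SUBMIT = os.path.join(SPARK_HOME, "bin", "spark-submit")

    LANG_GLOBS = {
        "python": "examples/src/main/python/**/*.py",
        "r": "examples/src/main/r/**/*.R",
    }

    def run_example(path):
        # Submit one example and report its exit status.
        return path, subprocess.call([SPARK_SUBMIT, path])

    def run_all(language, parallelism=4):
        paths = glob.glob(os.path.join(SPARK_HOME, LANG_GLOBS[language]),
                          recursive=True)
        # Run examples in parallel, similar in spirit to run-tests.py.
        pool = ThreadPool(parallelism)
        for path, code in pool.imap_unordered(run_example, paths):
            print(("PASS" if code == 0 else "FAIL") + " " + path)
        pool.close()
        pool.join()

    if __name__ == "__main__":
        # Language selection and configuration via environment variables.
        run_all(os.environ.get("EXAMPLES_LANG", "python"))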

Note that the last TODO is really important, and I would like to hear suggestions from the reviewers on how it would best be implemented. For now, I think one good way would be to use comments as directives in the example code, like how # $example on$ was introduced to facilitate doc generation. We could do something similar to hint at what arguments should be passed in for testing. Otherwise, we can always fall back to the approach we discussed in the JIRA, SPARK-12347.
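
To make the directive idea concrete, the driver could scan each example for a hint comment along these lines; the $example argv: ...$ marker is hypothetical, modeled on the existing $example on$ directives:

    import re
    import shlex

    # Hypothetical marker, e.g.:
    # "# $example argv: data/mllib/sample_libsvm_data.txt$"
    ARGV_DIRECTIVE = re.compile(r"\$example argv:(.*?)\$")

    def test_arguments(example_path):
        # Return the argv a test run should pass to this example.
        with open(example_path) as f:
            for line in f:
                match = ARGV_DIRECTIVE.search(line)
                if match:
                    return shlex.split(match.group(1))
        return []  # no directive: run the example without arguments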

Also, some of the functionality replicates that in run-tests.py. Perhaps we can find a way to integrate both?

@holdenk
Contributor

holdenk commented Sep 29, 2016

So at first glance this seems like it might be a work-in-progress PR - if that is the case, you can tag it with [WIP] in the title. I think if we had a blacklist of examples not to run, this could be a good start, and then we would want to add it to the Jenkins script. I'll try and follow up next week with a more thorough examination :)

@ethanluoyc ethanluoyc changed the title SPARK-12347 [ML] Add a script to test Spark ML examples. SPARK-12347 [ML][WIP] Add a script to test Spark ML examples. Oct 11, 2016
@jkbradley
Member

Can you please change the title to have: "SPARK-12347" -> "[SPARK-12347]"?

@jkbradley
Member

ok to test

@SparkQA

SparkQA commented Nov 15, 2016

Test build #68646 has finished for PR 15279 at commit c566a5b.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Nov 29, 2016

Hi, @ethanluoyc .
Could you add the Apache License header to your file, too?
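
For reference, this is the standard ASF header carried at the top of Spark's Python sources:

    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #    http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #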

@felixcheung
Member

Hi - how are we on this PR?

@ethanluoyc
Author

@felixcheung Last time I was working on this PR, the main challenge was working out a nice way to record which examples require input, and to pass those arguments properly. Do you have any suggestions on that?

@felixcheung
Member

perhaps a test driver .yml file?
or perhaps it is better to change the examples to have default input values, so they run standalone if no input is passed?

@SparkQA

SparkQA commented Dec 13, 2016

Test build #70088 has finished for PR 15279 at commit 7b9fe15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ethanluoyc
Author

@felixcheung I uploaded the YAML file for configuring the arguments passed into the examples.
Basically, what we need to do now is fill in the YAML with all the examples, and find a way to exclude examples as well.
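
For illustration, the config could take roughly this shape; the entries and the exclude list below are hypothetical, with PyYAML doing the parsing:

    import yaml  # PyYAML

    # Hypothetical shape of the driver .yml; real entries may differ.
    CONFIG = yaml.safe_load("""
    exclude:                     # examples the driver should skip
      - python/parquet_inputformat.py
    examples:
      python/wordcount.py:
        args: [README.md]
      python/sort.py:
        args: [examples/src/main/resources/people.txt]
    """)

    def args_for(example):
        # Look up the test arguments for one example, if any.
        entry = CONFIG["examples"].get(example) or {}
        return entry.get("args", [])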

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71768 has finished for PR 15279 at commit 7d5ea6d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

@ethanluoyc this seems reasonable, although I'm not 100% sure how this will go yet. wordcount.py is in the yml file - does that mean one yml for all examples? I think perhaps we need a way for each example to be self-contained, say on disk, so someone can take a subdirectory and run it by itself, with default values for parameters?

@ethanluoyc
Author

@felixcheung I actually agree with you on examples being self-contained. In that case, the example authors can just declare the arguments in the examples.

What I think may be appropriate is some macro-like text in the examples themselves (e.g., like a shebang line, but for files). Then instead of looking for the arguments in the yml, we can look for them in the examples.

Another way to do it would be to let the examples pass in their own default arguments. But I would not say this is a good idea, because then all the examples would need to be modified.

@felixcheung
Member

What if we have a bunch of default values for when arguments are not set, and those are the values we test with? That way, the same sample code can run both with and without arguments.

@ethanluoyc
Author

@felixcheung I think we can, although it requires changes to every example. If you are fine with that, I can start on some of the examples from now onwards :D

@felixcheung
Member

felixcheung commented Feb 1, 2017

I think that should be reasonable, but let's start with one or two example and discuss. thanks!

@ethanluoyc
Author

@felixcheung I changed two examples in Python; these are the simple ones. However, some of the examples take really complicated arguments, and I am not sure whether doing it this way would scale. What do you think?

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72787 has finished for PR 15279 at commit 276eddd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Usage: sort <file>
"""

filename = "../resources/people.txt" if len(sys.argv) != 2 else sys.argv[1]
Member

I think maybe better to reference relative to SPARK_HOME?

Member

you would probably have one check, if len(sys.argv) > 1, and then set all parameters inside the if block - we can assume that if any parameter is there, the user needs to specify all of them?
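
A sketch combining both suggestions (default path resolved relative to SPARK_HOME, and all parameters set behind a single argv check):

    import os
    import sys

    if len(sys.argv) > 1:
        # The user passed arguments, so take the full list from argv.
        filename = sys.argv[1]
    else:
        # Default input resolved relative to SPARK_HOME.
        filename = os.path.join(os.environ["SPARK_HOME"],
                                "examples", "src", "main", "resources",
                                "people.txt")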

@felixcheung
Member

that looks reasonable. could you point me to the ones with the complicated arguments you were referring to?

@ethanluoyc
Author

For example, https://github.com/apache/spark/blob/master/examples/src/main/python/parquet_inputformat.py#L26. It requires that we pass in jar files, which are not so easy to detect directly.

@felixcheung
Member

shall we use spark.driver.extraClassPath in that case?
in Python we could set these as options on the SparkSession
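
One caveat: spark.driver.extraClassPath only takes effect if it is set before the driver JVM starts, so the test driver may need to supply it on the spark-submit command line instead; the jar and input paths below are illustrative only:

    import subprocess

    # Hypothetical paths, for illustration.
    EXAMPLES_JAR = "examples/target/spark-examples.jar"

    subprocess.call([
        "bin/spark-submit",
        # Supplied up front, since setting extraClassPath on an already
        # running driver has no effect.
        "--conf", "spark.driver.extraClassPath=" + EXAMPLES_JAR,
        "--jars", EXAMPLES_JAR,
        "examples/src/main/python/parquet_inputformat.py",
        "examples/src/main/resources/users.parquet",
    ])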

@HyukjinKwon
Member

Hi @ethanluoyc, is this still WIP? I would like to propose closing this if it is inactive.

@ethanluoyc
Author

@HyukjinKwon I am really sorry but please close this since I have not found the time to complete all the cases.

@ethanluoyc ethanluoyc closed this May 12, 2017