
MAHOUT-1863: Several fixes to cluster-syntheticcontrol.sh to fix "Input path does not exist" error#235

Closed
chu11 wants to merge 2 commits intoapache:masterfrom
chu11:MAHOUT-1863

Conversation

@chu11
Contributor

@chu11 chu11 commented May 26, 2016

No description provided.

echo "Uploading Synthetic control data to HDFS"
$DFSRM ${WORK_DIR}/testdata
$DFS -mkdir ${WORK_DIR}/testdata
$DFS -mkdir -p ${WORK_DIR}/testdata
Member

the problem with $DFS -mkdir -p is that it's not backwards compatible with Hadoop 1, which Mahout still technically supports.

Contributor Author

I wasn't aware that wasn't compatible with Hadoop 1. However, it's already used in some other examples, so presumably those won't work with Hadoop 1 either?

>grep DFS *.sh | grep mkdir
classify-20newsgroups.sh:    $DFS -mkdir -p ${WORK_DIR}
classify-20newsgroups.sh:    $DFS -mkdir ${WORK_DIR}/20news-all
classify-wikipedia.sh:    $DFS -mkdir -p ${WORK_DIR}
cluster-reuters.sh:  $DFS -mkdir -p $WORK_DIR
cluster-reuters.sh:        $DFS -mkdir -p ${WORK_DIR}/
cluster-reuters.sh:        $DFS -mkdir ${WORK_DIR}/reuters-sgm
cluster-reuters.sh:        $DFS -mkdir ${WORK_DIR}/reuters-out
cluster-syntheticcontrol.sh:    $DFS -mkdir -p ${WORK_DIR}/testdata

Member

Hmm.. I suppose not :). I guess that's something to look at. I'm not sure when those crept in (I may have done it myself). Anyway, I'm not sure that Hadoop 1 incompatibility is a dealbreaker going forward, at least not in the example scripts. This is something that we need to discuss on dev@. Thanks for pointing this out. I will bring it up.

@andrewpalumbo
Member

The commit that you mention, as well as the -mkdir -p introductions, look like they were part of MAHOUT-1794. I've started a discussion on dev@ about Hadoop 1, so we'll probably see how that ends up before committing a fix for this. Like I said, not a deal breaker IMO if the examples are not backwards compatible, but we should make an effort.

An easy alternative to -mkdir -p ${WORK_DIR}/testdata should just be:

$DFS -mkdir /tmp/${WORK_DIR}
$DFS -mkdir /tmp/${WORK_DIR}/testdata

That should work with both Hadoop 1 and 2.

@chu11
Contributor Author

chu11 commented May 26, 2016

Oh, it turns out MAHOUT-1794 was my patch. I guess those -mkdir -p's were mine :-) The cluster-syntheticcontrol.sh I patched against only created the relative path 'testdata', which is why I never added it in there.

I think your alternative to -mkdir -p would only work if WORK_DIR were a path only one level deep? If it were longer, like /tmp/foo/bar/, then the -p is still needed. Obviously you could loop through the path components, but that seems like a rat hole no one should go down.
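(For illustration only, not part of the patch: the per-component loop mentioned above could be sketched roughly like this. `MKDIR_CMD` and `mkdir_p_compat` are hypothetical names; in the real scripts `MKDIR_CMD` would be something like `"$DFS -mkdir"`, while here plain `mkdir` stands in so it can be tried locally.)

```shell
# Hypothetical sketch: emulate `-mkdir -p` using only single-level `-mkdir`
# calls, for DFS clients that lack `-p` (e.g. Hadoop 1).
MKDIR_CMD="${MKDIR_CMD:-mkdir}"

mkdir_p_compat() {
  target="$1"
  partial=""
  # Split the path on '/' without permanently clobbering IFS.
  old_ifs=$IFS; IFS='/'
  set -- $target
  IFS=$old_ifs
  for component in "$@"; do
    [ -z "$component" ] && continue
    partial="$partial/$component"
    # Ignore "already exists" errors, as -p would.
    $MKDIR_CMD "$partial" 2>/dev/null || true
  done
}

# Local demonstration with plain mkdir standing in for the DFS client:
mkdir_p_compat "/tmp/mahout-mkdir-demo/foo/bar/testdata"
```

This is exactly the kind of extra machinery the comment calls a rat hole, which is why leaving `-mkdir -p` in place and requiring Hadoop 2 is the simpler path.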

@andrewpalumbo
Member

Oh yes, I'd forgotten that we're allowing for user-defined directories in these scripts now, so we don't know how long the path will be. So that simple alternative won't work without, as you mentioned, a loop, and these scripts are already complicated enough. We've discussed tearing them down completely and re-doing them but haven't had a chance. (Would you be interested? :))

So I'll have to test this out, but I'm for committing this as is. It needs to at least be working on Hadoop 2.

We can then look at getting all the scripts back to Hadoop 1 compatibility later.

@chu11
Contributor Author

chu11 commented May 26, 2016

I certainly don't mind helping out. If anything it's an opportunity to dig into Mahout more :-)

I think some fixes/cleanup to the Java files would be good too. Originally I hacked up Kmeans.java to have code to handle a whole bunch of default situations, i.e. things like this:

-    int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+    int maxIterations;
+    if (hasOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION)) {
+      maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+    }
+    else {
+      maxIterations = 10;
+    }

But eventually gave up when I realized:

A) The example required either all of the arguments or none.

B) When handling defaults, a collection of options were still required to be passed in by the user. For example, t1 & t2 are required even though in Kmeans you can just specify the number of clusters.

C) I didn't want to go down the rat hole of figuring out what to change in DefaultOptionCreator.java and what would be ok w/ the rest of the examples.

And there's cleanup stuff too (e.g. not hard-coding defaults like I did above; making a constant somewhere for that).

So is the Mahout team interested in a wide range of "cleanup" kind of patches?

@andrewpalumbo
Member

We are actually not doing anything in the way of MapReduce anymore. Since Mahout 0.9 (MAHOUT-1510), we're doing all of our new work in what we've called the Mahout "Samsara" environment: http://mahout.apache.org/users/sparkbindings/home.html

aka "Mahout on Spark", "Mahout on Flink", "Mahout on H2O", etc..

Many of the committers responsible for maintaining the MapReduce code are not currently active, making it hard to review patches.

That being said, there may be some interest in cleanup of certain MapReduce algorithms from the maintainers who are still around, if you have some obvious bug fixes that you've found. Maybe you could shoot an email to dev@ if you have some in mind?

If you'd like to get more involved in Mahout, you'll find yourself very welcome! Your time would probably be better spent working on the new framework. We have a good number of JIRAs started for the next (0.13.0) release and will be adding more, as we just finished up the milestone 0.12.x releases.

Thank you again for the patch!

@asfgit asfgit closed this in 1f3566d May 26, 2016