
MAHOUT-1863: Several fixes to cluster-syntheticcontrol.sh to fix "Input path does not exist" error#235

Closed
chu11 wants to merge 2 commits intoapache:masterfrom
chu11:MAHOUT-1863

Conversation

@chu11
Contributor

@chu11 chu11 commented May 26, 2016

No description provided.

echo "Uploading Synthetic control data to HDFS"
$DFSRM ${WORK_DIR}/testdata
$DFS -mkdir ${WORK_DIR}/testdata
$DFS -mkdir -p ${WORK_DIR}/testdata
Member

the problem with $DFS -mkdir -p is that it's not backwards compatible with Hadoop 1, which Mahout still technically supports.

Contributor Author

I wasn't aware that wasn't compatible with Hadoop 1. However, it's already used in some other examples, so presumably those won't work with Hadoop 1 either?

>grep DFS *.sh | grep mkdir
classify-20newsgroups.sh:    $DFS -mkdir -p ${WORK_DIR}
classify-20newsgroups.sh:    $DFS -mkdir ${WORK_DIR}/20news-all
classify-wikipedia.sh:    $DFS -mkdir -p ${WORK_DIR}
cluster-reuters.sh:  $DFS -mkdir -p $WORK_DIR
cluster-reuters.sh:        $DFS -mkdir -p ${WORK_DIR}/
cluster-reuters.sh:        $DFS -mkdir ${WORK_DIR}/reuters-sgm
cluster-reuters.sh:        $DFS -mkdir ${WORK_DIR}/reuters-out
cluster-syntheticcontrol.sh:    $DFS -mkdir -p ${WORK_DIR}/testdata

Member

Hmm.. I suppose not :). I guess that's something to look at. I'm not sure when those crept in (I may have done it myself). Anyway, I'm not sure that Hadoop 1 incompatibility is a dealbreaker going forward, at least not in the example scripts. This is something that we need to discuss on dev@. Thanks for pointing this out. I will bring it up.

@andrewpalumbo
Member

The commit that you mention, as well as the -mkdir -p introductions, look like they were part of MAHOUT-1794. I've started a discussion on dev@ about Hadoop 1, so we'll probably see how that ends up before committing a fix for this. Like I said, not a deal breaker IMO if the examples are not backwards compatible, but we should make an effort.

An easy alternative to -mkdir -p ${WORK_DIR}/testdata should just be:

$DFS -mkdir /tmp/${WORK_DIR}
$DFS -mkdir /tmp/${WORK_DIR}/testdata

That should work with both Hadoop 1 and 2.

@chu11
Contributor Author

chu11 commented May 26, 2016

Oh, it turns out MAHOUT-1794 was my patch. I guess those -mkdir -p's were mine :-) The cluster-syntheticcontrol.sh I patched against only created the relative path 'testdata', which is why I never added it in there.

I think your alternative to -mkdir -p would only work if WORK_DIR were a path only one level deep? If it were longer, like /tmp/foo/bar/, then the -p is still needed. Obviously you could loop through the path components, but that seems like a rat hole no one should go down.
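(For illustration only, not part of the patch: the per-component loop mentioned above could be sketched roughly like this. `MKDIR_CMD` and `mkdir_p_compat` are hypothetical names; in the real scripts `MKDIR_CMD` would be something like `"$DFS -mkdir"`, while here plain `mkdir` stands in so it can be tried locally.)

```shell
# Hypothetical sketch: emulate `-mkdir -p` using only single-level `-mkdir`
# calls, for DFS clients that lack `-p` (e.g. Hadoop 1).
MKDIR_CMD="${MKDIR_CMD:-mkdir}"

mkdir_p_compat() {
  target="$1"
  partial=""
  # Split the path on '/' without permanently clobbering IFS.
  old_ifs=$IFS; IFS='/'
  set -- $target
  IFS=$old_ifs
  for component in "$@"; do
    [ -z "$component" ] && continue
    partial="$partial/$component"
    # Ignore "already exists" errors, as -p would.
    $MKDIR_CMD "$partial" 2>/dev/null || true
  done
}

# Local demonstration with plain mkdir standing in for the DFS client:
mkdir_p_compat "/tmp/mahout-mkdir-demo/foo/bar/testdata"
```

This is exactly the kind of extra machinery the comment calls a rat hole, which is why leaving `-mkdir -p` in place and requiring Hadoop 2 is the simpler path.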

@andrewpalumbo
Member

Oh yes, I'd forgotten that we're allowing for user-defined directories in these scripts now, so we don't know how long the path will be. So that simple alternative won't work without, as you mentioned, a loop, and these scripts are already complicated enough. We've discussed tearing them down completely and re-doing them but haven't had a chance. (Would you be interested? :))

So I'll have to test this out, but I'm for committing this as is. It needs to at least be working on Hadoop 2.

We can then look at getting all the scripts back to Hadoop 1 compatibility later.

@chu11
Contributor Author

chu11 commented May 26, 2016

I certainly don't mind helping out. If anything it's an opportunity to dig into Mahout more :-)

I think some fixes/cleanup to the Java files would be good too. Originally I hacked up Kmeans.java to have code to handle a whole bunch of default situations, i.e. things like this:

-    int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+    int maxIterations;
+    if (hasOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION)) {
+      maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
+    }
+    else {
+      maxIterations = 10;
+    }

But eventually gave up when I realized:

A) The example required either all of the arguments or none.

B) When handling defaults, a collection of options were still required to be passed in by the user. For example, t1 & t2 are required even though in Kmeans you can just specify the number of clusters.

C) I didn't want to go down the rat hole of figuring out what to change in DefaultOptionCreator.java and what would be ok w/ the rest of the examples.

And there's cleanup stuff too (e.g. not hard-coding defaults like I did above; making a constant somewhere for that).

So is the Mahout team interested in a wide range of "cleanup" kind of patches?

@andrewpalumbo
Member

We are actually not doing anything in the way of MapReduce anymore. Since Mahout 0.9 (MAHOUT-1510), we're doing all of our new work in what we've called the Mahout "Samsara" environment: http://mahout.apache.org/users/sparkbindings/home.html

aka "Mahout on Spark", "Mahout on Flink", "Mahout on H2O", etc..

Many of the committers responsible for maintaining the MapReduce code are not currently active, making it hard to review patches.

That being said, there may be some interest in cleanup of certain MapReduce algorithms from the maintainers who are still around, if you have some obvious bug fixes that you've found. Maybe you could shoot an email to dev@ if you have some in mind?

If you'd like to get more involved in Mahout, you'll find yourself very welcome! Your time would probably be better spent working on the new framework. We have a good number of JIRAs started for the next (0.13.0) release and will be adding more, as we just finished up the milestone 0.12.x releases.

Thank you again for the patch!

@asfgit asfgit closed this in 1f3566d May 26, 2016