MAHOUT-1863: Several fixes to cluster-syntheticcontrol.sh to fix "Input path does not exist" error #235
chu11 wants to merge 2 commits into apache:master
Conversation
  echo "Uploading Synthetic control data to HDFS"
  $DFSRM ${WORK_DIR}/testdata
- $DFS -mkdir ${WORK_DIR}/testdata
+ $DFS -mkdir -p ${WORK_DIR}/testdata
the problem with $DFS -mkdir -p is that it's not backwards compatible with Hadoop 1, which Mahout still technically supports.
Wasn't aware that was not compatible with Hadoop 1. However, it's already used in some other examples, so presumably those won't work with Hadoop 1 either?
>grep DFS *.sh | grep mkdir
classify-20newsgroups.sh: $DFS -mkdir -p ${WORK_DIR}
classify-20newsgroups.sh: $DFS -mkdir ${WORK_DIR}/20news-all
classify-wikipedia.sh: $DFS -mkdir -p ${WORK_DIR}
cluster-reuters.sh: $DFS -mkdir -p $WORK_DIR
cluster-reuters.sh: $DFS -mkdir -p ${WORK_DIR}/
cluster-reuters.sh: $DFS -mkdir ${WORK_DIR}/reuters-sgm
cluster-reuters.sh: $DFS -mkdir ${WORK_DIR}/reuters-out
cluster-syntheticcontrol.sh: $DFS -mkdir -p ${WORK_DIR}/testdata
Hmm.. I suppose not :). I guess that's something to look at. I'm not sure when those crept in (I may have done it myself). Anyway, I'm not sure that Hadoop 1 incompatibility is a dealbreaker going forward, at least not in the example scripts. This is something we need to discuss on dev@. Thanks for pointing this out. I will bring it up.
|
The commit that you mention, as well as the other scripts in that grep output, already use `-mkdir -p`, so those would need fixing too. An easy alternative to `-mkdir -p` is to create each directory level explicitly with plain `-mkdir`. That should work with both Hadoop 1 and 2.
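To illustrate the alternative being discussed, here is a hedged sketch: each directory level is created explicitly with plain `-mkdir`, which both Hadoop 1 and 2 support. The `DFS` function and `work_dir_demo` path are hypothetical local stand-ins (real `mkdir` instead of `hadoop fs -mkdir`) so the sketch runs without a cluster; in the actual scripts these would be `$DFS -mkdir` calls.

```shell
#!/bin/sh
# DFS is a hypothetical local stand-in for `hadoop fs`, used only so this
# sketch can run without an HDFS cluster.
DFS() {
  cmd=$1; shift
  if [ "$cmd" = "-mkdir" ]; then
    mkdir "$@"
  fi
}

WORK_DIR=work_dir_demo
DFS -mkdir "${WORK_DIR}"            # create the parent first (no -p needed)
DFS -mkdir "${WORK_DIR}/testdata"   # then the child
ls -d "${WORK_DIR}/testdata"
```

This only works when the number of path components is known in advance, which is the limitation raised below for user-configurable work directories.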
|
Oh, it turns out MAHOUT-1794 was my patch. I guess those `-mkdir -p` calls are where they crept in. I think your alternative to `-mkdir -p` would work, although since the scripts allow user-defined directories, it might require a loop to create each level.
|
Oh yes- I'd forgotten that we're allowing for user-defined directories now in these scripts, so we don't know how long the path will be. So that simple alternative won't work without, as you mentioned, a loop, and these scripts are already complicated enough.. We've discussed tearing them down completely and re-doing them but haven't had a chance (would you be interested? :)). So I'll have to test this out, but I'm for committing this as is. It needs to at least be working on Hadoop 2. We can then look at getting all the scripts back to Hadoop 1 compatibility later.
|
I certainly don't mind helping out. If anything it's an opportunity to dig into Mahout more :-) I think some fixes/cleanup to the java files would be good too. Originally I hacked up Kmeans.java to have code to handle a whole bunch of default situations, i.e. things like this. But I eventually gave up when I realized: A) The example required either all of the arguments or none. B) When handling defaults, a collection of options was still required to be passed in by the user. C) I didn't want to go down the rat hole of figuring out what to change in DefaultOptionCreator.java and what would be ok with the rest of the examples. And there's cleanup stuff too (e.g. not hard-coding defaults like I did above, making a constant somewhere for that). So is the Mahout team interested in a wide range of "cleanup" kind of patches?
|
We are actually not doing anything in the way of MapReduce anymore. Since Mahout 0.9: MAHOUT-1510, we're doing all of our new work in what we've called the Mahout "Samsara" environment: http://mahout.apache.org/users/sparkbindings/home.html aka "Mahout on Spark", "Mahout on Flink", "Mahout on H2O", etc.. Many of the committers responsible for maintaining the MapReduce code are not currently active, making it hard to review patches. That said, there may be some interest in cleanup of certain MapReduce algorithms whose maintainers are still around, if you have some obvious bug fixes that you've found. Maybe you could shoot an email to dev@ if you have some in mind? If you'd like to get more involved in Mahout, you'll find yourself very welcomed! Your time would probably be better spent working on the new framework. We have a good amount of JIRAs started for the next (0.13.0) release and will be adding more, as we just finished up the milestone 0.12.x releases. Thank you again for the patch!