NUTCH-2184 Enable IndexingJob to function with no crawldb #95

Closed
wants to merge 1 commit into from


lewismc
Member

@lewismc lewismc commented Mar 2, 2016

OK folks, this issue addresses https://issues.apache.org/jira/browse/NUTCH-2184 by

  • rebasing the NUTCH-2184v2.patch against master branch
  • making the IndexerMapReduceMapper and IndexerMapReduceReducer in IndexerMapReduce code explicit so that these functions can be tested
  • adding in some mrunit tests for testing the IndexerMapReduceMapper and IndexerMapReduceReducer
  • removing some trivial imports which are unused
  • formatting ivy.xml which has somehow (again) become a dog's dinner
  • adding default constructor to NutchIndexAction()

Any questions, please let me know. I would really appreciate it if people could pull this code and try it out in their test or local environments.
Thanks, also thanks Markus for the original suggestions for tests, etc.

* Implementation of {@link org.apache.hadoop.mapred.Reducer}
* which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
* from combinations of various Nutch data structures. Essentially
* teh result is a key representing a URL and a value representing a
Contributor

typo teh -> the

@chrismattmann
Contributor

@lewismc what's the status of this patch? are we close to merging?

@chrismattmann
Contributor

ping @lewismc any status on this?

@lewismc
Member Author

lewismc commented Mar 27, 2016

I think I will fire a PR up today.

On Sun, Mar 27, 2016 at 11:56 AM, Chris Mattmann notifications@github.com
wrote:

ping @lewismc https://github.com/lewismc any status on this?



Lewis

@chrismattmann
Contributor

@lewismc happy to review anything here later tonight if you have it. Cheers aye

@chrismattmann
Contributor

@lewismc ping: if you're ready on this, please let me know. Happy to get this sorted.

@naegelejd
Contributor

Also looking forward to this. Anything I can do to help get it in 1.12?

naegelejd added a commit to naegelejd/nutch that referenced this pull request May 25, 2016
this patch is the one from Jan 2016, but has been updated at
apache#95, so once 1.12 is released
check if the PR made it in.
Option noCommitOpt = OptionBuilder
.withArgName("noCommit")
.withDescription(
"do the commits once and for all the reducers in one go (optional)")
Contributor

@naegelejd naegelejd May 27, 2016

The description for -noCommit is backward: this option tells the Indexer not to do a final commit after the job finishes.
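
To make the intended semantics concrete, here is a minimal sketch in plain Java, without the commons-cli `OptionBuilder` used in the patch. The class name `NoCommitFlagSketch`, the helper `shouldCommit`, and the wording of the corrected description are all assumptions for illustration; they are not taken from the patch or from the merged code.

```java
import java.util.Arrays;

public class NoCommitFlagSketch {

    // Hypothetical corrected help text: -noCommit suppresses the final commit.
    // The exact wording is an assumption, not the text that was merged.
    static final String NO_COMMIT_DESCRIPTION =
        "do not send a final commit to the index after the job finishes (optional)";

    /** Returns true unless the -noCommit flag appears among the arguments. */
    static boolean shouldCommit(String[] args) {
        return !Arrays.asList(args).contains("-noCommit");
    }

    public static void main(String[] args) {
        // Print the help text and the decision derived from the arguments.
        System.out.println("-noCommit: " + NO_COMMIT_DESCRIPTION);
        System.out.println("final commit: " + shouldCommit(args));
    }
}
```

The point of the sketch is only that the flag's presence turns the final commit off, so the help text should say so rather than describe committing "once and for all".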

@lewismc
Member Author

lewismc commented May 27, 2016

Hi @naegelejd @chrismattmann the PR has conflicts.
I was working on a patch but got upended and worked on other things. I am now working on a full codebase Hadoop API upgrade... which is taking time in between the pub.

  1. I acknowledge my neglect of this issue.
  2. Our tests are pretty lacking on this issue; that was the problem. I absolutely agree with Markus Jelsma that we need to have tests here, as this is a patch which affects a critical workflow.

@naegelejd are you able to rebase and provide a test? If not then I can go back to my branch.

@naegelejd
Contributor

I can't promise anything. I'm not familiar with mrunit yet but I may find time soon to continue work on this in addition to a handful of other issues I'm hoping to fix or have merged soon.

sebastian-nagel added a commit to sebastian-nagel/nutch that referenced this pull request Nov 22, 2019
- make the CrawlDb argument passed to indexing job optional
- improve command-line help
- pick various improvements from PR apache#95
@sebastian-nagel
Contributor

Closed in favor of #486

4 participants