Unable to initialize the Nutch object #4

Closed
antrikss opened this issue Oct 9, 2015 · 15 comments

Comments

@antrikss

antrikss commented Oct 9, 2015

I used the following command to initialize the Nutch object.

nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')

But it gave me the following error

nutch.py: GET Endpoint: /config/crawlTest
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 10:26:13 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {}
Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 609, in __init__
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 302, in __getitem__
KeyError

I expected this to work: the constructor should have used the default configuration and found it on the server. Instead, it throws a KeyError.

I even tried passing the default config explicitly (which shouldn't matter, since it's the default parameter), but in vain.

nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')

The above gave me the following error.

Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')
TypeError: __init__() got multiple values for keyword argument 'confId'
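
Both errors are consistent with the first positional argument being bound to confId rather than to a crawl ID: the log for the first call requests /config/crawlTest, and the second call then supplies confId twice. The following is a minimal sketch of that binding, assuming the constructor signature is the one IPython reports later in this thread. FakeNutch is a hypothetical stand-in written only for illustration; it is not the nutch-python client and makes no requests.

```python
class FakeNutch(object):
    """Hypothetical stand-in that copies only the constructor signature quoted in the thread."""

    def __init__(self, confId='default', serverEndpoint='http://localhost:8081',
                 raiseErrors=True, **args):
        self.confId = confId
        self.serverEndpoint = serverEndpoint
        self.args = args  # unrecognised keywords such as urlDir end up here


# First call from the report: 'crawlTest' is positional, so it is bound to confId.
first = FakeNutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
print(first.confId)
print(first.args)

# Second call from the report: confId is given positionally and again by keyword.
try:
    FakeNutch('crawlTest', confId='default', urlDir='urls/',
              serverEndpoint='http://localhost:8081')
    second_error = None
except TypeError as exc:
    second_error = exc
print(type(second_error).__name__)
```

Under that assumption, the first call would look up a configuration named crawlTest, which matches the empty JSON response in the log above.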
@chrismattmann
Owner

@ahmadia any ideas?

@ahmadia
Contributor

ahmadia commented Oct 9, 2015

@antrikss - Thanks for the report!

What version of Nutch are you running against? What's the output of the server? What happens when you run:

nt = Nutch()

Which tests from py.test pass/fail?

@antrikss
Author

antrikss commented Oct 9, 2015

Hi @ahmadia, thanks for the prompt reply!
I'm using Nutch version 1.11-SNAPSHOT.
I'm sorry, I don't understand what you mean by the output of the server.
When I run
nt = Nutch()
I get the following response:

nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}
nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}

I could work with the above Nutch initialization, but my main aim was to give my crawl a different crawlId.
Also, py.test passes all of its tests.

================================================================================ test session starts ================================================================================
platform darwin -- Python 2.7.10, pytest-2.8.2, py-1.4.30, pluggy-0.3.1
rootdir: /Users/Antrromet/Documents/USC/Fall2015/IR/nutch-python, inifile: 
collected 15 items 

test_nutch.py ...............

============================================================================ 15 passed in 21.84 seconds =============================================================================

@ahmadia
Contributor

ahmadia commented Oct 9, 2015

Okay, cool. I haven't tried to do what you're doing before, so I'll need to take a look.

@antrromet

Just a note:
nt = Nutch()
calls GET /config/default twice, as you can see from the logs above.
nutch.py: GET Endpoint: /config/default
I'm not sure whether that is intended, but I wanted you to know that it's not a copy-paste error.

@ahmadia
Contributor

ahmadia commented Oct 9, 2015

@antrromet - our documentation is out of date and needs to be updated. Refer to the test_nutch.py file for a complete tour of functionality. For now:

In [6]: Nutch?
Init signature: Nutch(self, confId='default', serverEndpoint='http://localhost:8081', raiseErrors=True, **args)
Docstring:      <no docstring>
Init docstring:
Nutch client for interacting with a Nutch instance over its REST API.

Constructor:

nt = Nutch()

Optional arguments:

confID - The name of the default configuration file to use, by default: nutch.DefaultConfig
serverEndpoint - The location of the Nutch server, by default: nutch.DefaultServerEndpoint
raiseErrors - raise exceptions if server response is not 200

Provides functions:
    server - getServerStatus, stopServer
    config - get and set parameters for this configuration
    job - get list of running jobs, get job metadata, stop/abort a job by id, and create a new job

To start a crawl job, use:
    Crawl() - or use the methods inject, generate, fetch, parse, updatedb in that order.

To run a crawl in one method, use:
-- nt = Nutch()
-- response, status = nt.crawl()

To override a confId, you'd need to create a configuration first. To use default:

nt = Nutch('default')

To use a custom configuration:

nt = Nutch() # a little wonky, we assume a configuration for interacting with Nutch (default here)
nt.Configs().create('custom_conf', {'override_param': 'here'})  # placeholder key/value: put your overrides here
nt = Nutch('custom_conf')
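
The Init signature above also suggests why the original call looked up /config/crawlTest: the first positional argument is confId, not a crawl id. A minimal sketch of that binding, using a stand-in function (nutch_stub is hypothetical and is not the real nutch.Nutch class) that copies only the parameter order from the signature:

```python
def nutch_stub(confId='default', serverEndpoint='http://localhost:8081',
               raiseErrors=True, **args):
    """Hypothetical stand-in with the same parameter order as the Init
    signature shown above. It only reports how the arguments were bound."""
    return {'confId': confId, 'serverEndpoint': serverEndpoint, 'extra': args}


# The first positional argument binds to confId, so 'crawlTest' is treated
# as the name of a configuration, and urlDir falls into **args.
bound = nutch_stub('crawlTest', urlDir='urls/')
print(bound['confId'])  # crawlTest
print(bound['extra'])   # {'urlDir': 'urls/'}
```

With the real client, a configuration named 'crawlTest' would have to exist on the server before it could be passed this way, which matches the "create a configuration first" advice above.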

@antrromet

Got it. Yes, this will work. But can you tell me how to change the crawlId?

I tried the REST APIs given here for creating a job, where you can specify a parameter like
"crawlId":"crawl01"
Those APIs seem to work perfectly fine when I tried them from a REST client.
But can this be done using nutch-python?
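
For reference, the kind of REST request described above can be sketched with the Python standard library alone. This is only an illustration under assumptions: the /job/create path, the type, confId and args fields, and the url_dir argument name are my reading of the Nutch REST API rather than anything confirmed in this thread, and build_job_request is a hypothetical helper. Only the "crawlId":"crawl01" parameter comes from the text above. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_job_request(server, crawl_id, job_type, conf_id="default", args=None):
    """Build (but do not send) a job-creation request.

    Assumption: the server accepts a JSON body with crawlId, type, confId
    and args at /job/create. Check this against your Nutch version.
    """
    payload = {
        "crawlId": crawl_id,   # the parameter quoted in the comment above
        "type": job_type,      # assumed field name
        "confId": conf_id,     # assumed field name
        "args": args or {},    # assumed field name
    }
    return Request(
        server.rstrip("/") + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


# 'url_dir' is an assumed argument name for the seed directory.
req = build_job_request("http://localhost:8081", "crawl01", "INJECT",
                        args={"url_dir": "urls/"})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing req to urllib.request.urlopen would submit it to a running Nutch server; that step is left out so the sketch runs without one.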

@ahmadia
Contributor

ahmadia commented Oct 9, 2015

@antrromet - Our documentation is seriously lacking here, but you can create a custom JobClient with:

nt = Nutch()
jc = nt.Jobs('your_crawl_id')

You can then use the job client to submit jobs, and use it with the CrawlClient as well. There are a few examples of using the job client in test.python.

@antrromet

@ahmadia Got it Aron, thanks a lot! Really appreciate your help.

@ahmadia
Contributor

ahmadia commented Oct 9, 2015

No problem. We need to fix our documentation :(

@chrismattmann
Owner

@antrikss can you update our docs for this? Would you be willing to submit a PR?

@chrismattmann
Owner

@antrromet

@antrromet

@chrismattmann Sure thing. I'll look into it.

@chrismattmann
Owner

@ayberk if you have time, would appreciate a PR

@chrismattmann
Owner

See the wiki; I think this takes care of it.
