diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 273cfccc57..084ad86d07 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -72,9 +72,18 @@
  http.agent.name
- HTTP 'User-Agent' request header. MUST NOT be empty -
+ 'User-Agent' name: a single word uniquely identifying your crawler.
+
+ The value is used to select the group of robots.txt rules addressing your
+ crawler. It is also sent as part of the HTTP 'User-Agent' request header.
+
+ This property MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
+ Following RFC 9309 the 'User-Agent' name (aka. 'product token')
+ "MUST contain only uppercase and lowercase letters ('a-z' and
+ 'A-Z'), underscores ('_'), and hyphens ('-')."
+
  NOTE: You should also check other related properties:
  http.robots.agents
@@ -84,7 +93,6 @@
  http.agent.version
  and set their values appropriately.
-
@@ -95,13 +103,13 @@
  parser would look for in robots.txt. Multiple agents can be provided using
  comma as a delimiter. eg. mybot,foo-spider,bar-crawler
- The ordering of agents does NOT matter and the robots parser would make
- decision based on the agent which matches first to the robots rules.
- Also, there is NO need to add a wildcard (ie. "*") to this string as the
- robots parser would smartly take care of a no-match situation.
+ The ordering of agents does NOT matter and the robots.txt parser combines
+ all rules to any of the agent names. Also, there is NO need to add
+ a wildcard (ie. "*") to this string as the robots parser would smartly
+ take care of a no-match situation.
  If no value is specified, by default HTTP agent (ie. 'http.agent.name')
- would be used for user agent matching by the robots parser.
+ is used for user-agent matching by the robots parser.
@@ -166,9 +174,9 @@
  http.agent.url
- A URL to advertise in the User-Agent header. This will
+ A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
- should be a URL of a page explaining the purpose and behavior of this
+ should be a URL to a page that explains the purpose and behavior of this
  crawler.
@@ -176,9 +184,9 @@
  http.agent.email
- An email address to advertise in the HTTP 'From' request
- header and User-Agent header. A good practice is to mangle this
- address (e.g. 'info at example dot com') to avoid spamming.
+ An email address to advertise in the HTTP 'User-Agent' (and
+ 'From') request headers. A good practice is to mangle this address
+ (e.g. 'info at example dot com') to avoid spamming.
@@ -202,7 +210,7 @@
  http.agent.rotate.file
  agents.txt
- File containing alternative user agent names to be used instead of
+ File containing alternative user-agent names to be used instead of
  http.agent.name on a rotating basis if http.agent.rotate is true.
  Each line of the file should contain exactly one agent
  specification including name, version, description, URL, etc.
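
The RFC 9309 constraint quoted in the new http.agent.name description can be checked before a crawl is started. The following is a minimal sketch and not part of this patch or of Nutch itself: the helper names `is_valid_agent_name` and `check_agent_names` are hypothetical, and the sketch assumes the comma-delimited format described for http.robots.agents.

```python
import re

# RFC 9309 product token as quoted in the description above:
# only 'a-z', 'A-Z', underscores and hyphens; at least one character.
PRODUCT_TOKEN = re.compile(r"[A-Za-z_-]+")


def is_valid_agent_name(name: str) -> bool:
    """Hypothetical helper: True if name is a non-empty product token."""
    return PRODUCT_TOKEN.fullmatch(name) is not None


def check_agent_names(agent_name: str, robots_agents: str = "") -> list[str]:
    """Hypothetical helper: return the configured names that are not valid.

    agent_name    -- value intended for http.agent.name (must not be empty)
    robots_agents -- comma-delimited value intended for http.robots.agents
    """
    extra = [a.strip() for a in robots_agents.split(",") if a.strip()]
    return [n for n in [agent_name] + extra if not is_valid_agent_name(n)]


if __name__ == "__main__":
    # The example names come from the http.robots.agents description.
    print(check_agent_names("mybot", "mybot,foo-spider,bar-crawler"))
```

An empty http.agent.name is reported as invalid, which matches the "MUST NOT be empty" requirement; names containing spaces, slashes, or version numbers are rejected as well.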