diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 273cfccc57..084ad86d07 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -72,9 +72,18 @@
http.agent.name
- HTTP 'User-Agent' request header. MUST NOT be empty -
+ 'User-Agent' name: a single word uniquely identifying your crawler.
+
+ The value is used to select the group of robots.txt rules addressing your
+ crawler. It is also sent as part of the HTTP 'User-Agent' request header.
+
+ This property MUST NOT be empty -
please set this to a single word uniquely related to your organization.
+ Following RFC 9309, the 'User-Agent' name (a.k.a. 'product token')
+ "MUST contain only uppercase and lowercase letters ('a-z' and
+ 'A-Z'), underscores ('_'), and hyphens ('-')."
+
NOTE: You should also check other related properties:
http.robots.agents
@@ -84,7 +93,6 @@
http.agent.version
and set their values appropriately.
-
@@ -95,13 +103,13 @@
parser would look for in robots.txt. Multiple agents can be provided using
comma as a delimiter. eg. mybot,foo-spider,bar-crawler
- The ordering of agents does NOT matter and the robots parser would make
- decision based on the agent which matches first to the robots rules.
- Also, there is NO need to add a wildcard (ie. "*") to this string as the
- robots parser would smartly take care of a no-match situation.
+ The ordering of agents does NOT matter: the robots.txt parser combines
+ all rules addressed to any of the agent names. Also, there is NO need to
+ add a wildcard (i.e. "*") to this string, as the robots.txt parser
+ handles the no-match situation itself.
If no value is specified, by default HTTP agent (ie. 'http.agent.name')
- would be used for user agent matching by the robots parser.
+ is used for user-agent matching by the robots parser.
@@ -166,9 +174,9 @@
http.agent.url
- A URL to advertise in the User-Agent header. This will
+ A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
- should be a URL of a page explaining the purpose and behavior of this
+ should be a URL to a page that explains the purpose and behavior of this
crawler.
@@ -176,9 +184,9 @@
http.agent.email
- An email address to advertise in the HTTP 'From' request
- header and User-Agent header. A good practice is to mangle this
- address (e.g. 'info at example dot com') to avoid spamming.
+ An email address to advertise in the HTTP 'User-Agent' (and
+ 'From') request headers. A good practice is to mangle this address
+ (e.g. 'info at example dot com') to avoid spamming.
@@ -202,7 +210,7 @@
http.agent.rotate.file
agents.txt
- File containing alternative user agent names to be used instead of
+ File containing alternative user-agent names to be used instead of
http.agent.name on a rotating basis if http.agent.rotate is true.
Each line of the file should contain exactly one agent
specification including name, version, description, URL, etc.
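
For illustration only: site-specific values for the properties described above are normally set as overrides in conf/nutch-site.xml rather than by editing nutch-default.xml. The fragment below is a minimal sketch under assumptions. The crawler name 'mybot' is borrowed from the http.robots.agents example, and the URL is a hypothetical placeholder. Neither is a value shipped with Nutch.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed name: a single word using only letters, '_' and '-',
       as RFC 9309 requires for a product token. -->
  <property>
    <name>http.agent.name</name>
    <value>mybot</value>
  </property>
  <!-- Hypothetical page describing the crawler's purpose and behavior. -->
  <property>
    <name>http.agent.url</name>
    <value>https://example.com/bot.html</value>
  </property>
  <!-- Mangled address, following the practice suggested in the description. -->
  <property>
    <name>http.agent.email</name>
    <value>info at example dot com</value>
  </property>
</configuration>
```

http.robots.agents is left unset in this sketch, so the value of http.agent.name is used for robots.txt matching, as the description of that property states.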