Skip to content

crawler-commons-1.4

Latest
Compare
Choose a tag to compare
@sebastian-nagel sebastian-nagel released this 18 Jul 11:56
· 21 commits to master since this release

Important Changes

  • Java 11 is now required to run or build crawler-commons
  • the robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309

Full List of Changes

  • [Robots.txt] Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (sebastian-nagel, Richard Zowalla) #245, #360
  • [Robots.txt] Close groups of rules as defined in RFC 9309 (kkrugler, garyillyes, jnioche, sebastian-nagel) #114, #390, #430
  • [Robots.txt] Empty disallow statement not to clear other rules (sebastian-nagel, jnioche) #422, #424
  • [Robots.txt] SimpleRobotRulesParser main() to follow five redirects (sebastian-nagel, jnioche) #428
  • [Robots.txt] Add more spelling variants and typos of robots.txt directives (sebastian-nagel, jnioche) #425
  • [Robots.txt] Document effect of rules merging in combination with multiple agent names (sebastian-nagel, Richard Zowalla) #423, #426
  • [Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) (sebastian-nagel, Richard Zowalla) #427
  • [Robots.txt] Rename default user-agent / robot name in unit tests (sebastian-nagel, Richard Zowalla) #429
  • [Robots.txt] Add units test based on examples in RFC 9309 (sebastian-nagel, Richard Zowalla) #420
  • [BasicNormalizer] Query parameters normalization in BasicURLNormalizer (aecio, sebastian-nagel, Richard Zowalla) #308, #421
  • [Robots.txt] Deduplicate robots rules before matching (sebastian-nagel, jnioche) #416
  • [Robots.txt] SimpleRobotRulesParser main to use the new API method (sebastian-nagel, jnioche) #413
  • Generate JaCoCo reports when testing (jnioche) #409, #412
  • Push Code Coverage to Coveralls (Richard Zowalla, jnioche) #414
  • [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters (tkalistratov, sebastian-nagel, Richard Zowalla) #195, #408
  • [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters (sebastian-nagel, Richard Zowalla, aecio) #389, #401
  • [Robots.txt] Improve readability of robots.txt unit tests (sebastian-nagel, Richard Zowalla) #383
  • Upgrade project to use Java 11 (Avi Hayun, Richard Zowalla, aecio, sebastian-nagel) #320, #376
  • [Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (sebastian-nagel, Richard Zowalla) #362
  • [Robots.txt] Matching user-agent names does not conform to robots.txt RFC (YossiTamari, sebastian-nagel) #192
  • [Robots.txt] Improve robots check draft rfc compliance (Eduardo Jimenez) #351
  • Upgrade dependencies (dependabot) #379, #384, #394, #399, #404, #419
  • Upgrade Maven plugins (dependabot) #377, #381, #386, #396, #397, #398, #400, #402, #403, #405, #406, #407, #415, #418
  • Javadoc: ensure Javascript search is working (sebastian-nagel, Richard Zowalla, aecio) #378, #380