NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId"

git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1396796 13f79535-47bb-0310-9956-ffa450edef68
1 parent 582919d, commit ddca587784eba9a52fb2537a10660d995f8832d2, sebastian-nagel committed Oct 10, 2012
@@ -2,6 +2,8 @@ Nutch Change Log
(trunk) Current Development:
+* NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Meghna Kukreja via snagel)
+
* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel)
* NUTCH-1441 AnchorIndexingFilter should use plain HashSet (ferdy via lewismc)
@@ -29,7 +29,7 @@
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
- <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
+ <pattern>([;_]?\b((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$4</substitution>
</regex>
@@ -11,6 +11,8 @@ http://www.foo.com/foo.html;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED http://w
http://www.foo.com/foo.html?param=1&another=2;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED http://www.foo.com/foo.html?param=1&another=2
http://www.foo.com/foo.html;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED?param=1&another=2 http://www.foo.com/foo.html?param=1&another=2
http://www.foo.com/foo.php?&x=1&sid=xyz&something=1 http://www.foo.com/foo.php?x=1&something=1
+# but NewsId is not a session id (NUTCH-706, NUTCH-1328)
+http://www.foo.com/fa/newsdetail.aspx?NewsID=1567539 http://www.foo.com/fa/newsdetail.aspx?NewsID=1567539
# test removal default pages
http://www.foo.com/home/index.html http://www.foo.com/home/
@@ -13,7 +13,7 @@
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
- <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
+ <pattern>([;_]?\b((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$4</substitution>
</regex>
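
The effect of the added \b can be checked outside Nutch with plain java.util.regex. The following is a minimal sketch, not the normalizer plugin's own code: the class name SessionIdPatternDemo and the helper normalize are hypothetical, and the XML entity &amp; is written as a literal & because the pattern is compiled from a Java string rather than read from the XML file. It applies only this one rule, so other rules in the file (for example the cleanup of "?&") are not reflected.

```java
import java.util.regex.Pattern;

class SessionIdPatternDemo {
    // Pattern before the change: "sid" may match in the middle of a word such as "NewsID".
    static final Pattern OLD = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Pattern after the change: \b requires a word boundary before the parameter name.
    static final Pattern NEW = Pattern.compile(
        "([;_]?\\b((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical helper: replaces every match with group 4 (the trailing delimiter),
    // mirroring <substitution>$4</substitution> for this single rule.
    static String normalize(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String news = "http://www.foo.com/fa/newsdetail.aspx?NewsID=1567539";
        String jsession =
            "http://www.foo.com/foo.html;jsessionid=1E6FEC0D14D044541DD84D2D013D29ED";

        System.out.println(normalize(OLD, news));     // old rule truncates the URL
        System.out.println(normalize(NEW, news));     // new rule leaves it unchanged
        System.out.println(normalize(NEW, jsession)); // real session ids are still removed
    }
}
```

With the old pattern the NewsID URL is cut to http://www.foo.com/fa/newsdetail.aspx?New, because "sID=1567539" is treated as a session id. With the new pattern there is no word boundary between "New" and "sID", so the URL passes through unchanged, while ";jsessionid=..." is still stripped because ";" is followed by a word boundary.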
