NUTCH-1043 Add pattern for filtering .js in default url filters

git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1147798 13f79535-47bb-0310-9956-ffa450edef68
jnioche committed Jul 18, 2011 (1 parent de02071, commit 6ea0ba8279ded6140d1f02f7aa61a08db6654f46)
Showing with 6 additions and 2 deletions.
  1. +2 −0 CHANGES.txt
  2. +2 −1 conf/automaton-urlfilter.txt.template
  3. +2 −1 conf/regex-urlfilter.txt.template
CHANGES.txt
@@ -2,6 +2,8 @@ Nutch Change Log
Release 2.0 - Current Development
+* NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
+
* NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
* NUTCH-1011 Normalize duplicate slashes in URL's (markus)
conf/automaton-urlfilter.txt.template
@@ -25,7 +25,8 @@
-(file|ftp|mailto):.*
# skip image and other suffixes we can't yet parse
--.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
+# for a more extensive coverage use the urlfilter-suffix plugin
+-.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
conf/regex-urlfilter.txt.template
@@ -26,7 +26,8 @@
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+# for a more extensive coverage use the urlfilter-suffix plugin
+-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
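
To illustrate the effect of the new suffix rule, here is a minimal Python sketch that evaluates the three regex-urlfilter rules shown in the hunk above. It is not Nutch code: the `accepts` helper is hypothetical, the trailing `+.` accept-all rule is an assumption added so that unmatched URLs get a verdict, and the sketch assumes that the first matching rule decides and that patterns are searched anywhere in the URL unless anchored.

```python
import re

# Suffix rule copied from the updated regex-urlfilter.txt.template hunk.
SUFFIX_RULE = (
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)

# Rules in file order. The final "+." rule is an assumption (not part of
# the hunk) so that URLs matching no exclusion are accepted.
RULES = [
    r"-^(file|ftp|mailto):",
    SUFFIX_RULE,
    r"-[?*!@=]",
    r"+.",
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule starts with '+'.

    Hypothetical helper: the leading character of each rule is the sign,
    the remainder is a regular expression searched within the URL.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject


if __name__ == "__main__":
    for url in (
        "http://example.com/static/app.js",
        "http://example.com/static/APP.JS",
        "http://example.com/data.json",
        "http://example.com/index.html",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Because the suffix pattern is anchored with `$`, only URLs that end in one of the listed extensions are rejected by it; a URL such as `app.js?v=1` is instead caught by the later `[?*!@=]` rule.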
