This repository has been archived by the owner on Nov 14, 2019. It is now read-only.

Nothing is crawled while using ExcludeFilter #70

Open
Choumy opened this issue Oct 13, 2014 · 2 comments

@Choumy

Choumy commented Oct 13, 2014

My setting is:
"excludeFilter" : ["http://www.mysite.com/keyword1/keyword2/my_web-page-number/545577.html","http://www.mysite.com/keyword1/keyword2/my_web-page-number/545577.html"]

but nothing is crawled. Is that normal?

@marevol
Contributor

marevol commented Oct 13, 2014

Hmm, I could not reproduce it.
I think you are missing a mapping.
In the following example, the team-list.html page is excluded.

curl -XPUT "localhost:9200/web?pretty" 
curl -XPUT "localhost:9200/web/sample/_mapping?pretty" -d '
{
  "sample" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}'
curl -XPUT 'localhost:9200/_river/sample/_meta?pretty' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "web",
        "url" : ["http://fess.codelibs.org/"],
        "includeFilter" : ["http://fess.codelibs.org/.*"],
        "excludeFilter" : ["http://fess.codelibs.org/team-list.html"],
        "maxDepth" : 1,
        "maxAccessCount" : 100,
        "numOfThread" : 3,
        "interval" : 0,
        "target" : [
          {
            "pattern" : {
              "url" : "http://fess.codelibs.org/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "title"
              },
              "body" : {
                "text" : "body",
                "trimSpaces" : true
              }
            }
          }
        ]
    }
}'

@Choumy
Author

Choumy commented Oct 16, 2014

I don't think it's about the mapping, but about the URL strings. Could it be because of the length of the filter? I have about 90 URLs to exclude, and they contain a lot of special characters (-, _, spaces).

I tested with a simple URL pattern on the same site with the same mapping, and it worked just fine.

Could it be URL encoding?
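
One way to test the encoding question outside the crawler is sketched below. It assumes the filter entries are treated as regular expressions that must match the whole URL, which the `.*` in the `includeFilter` example suggests but the thread does not confirm. Python's `re` module is only a stand-in here; the plugin's own regex engine may differ in details. The helper `to_exclude_pattern` and all URLs are hypothetical. Note that hyphens and underscores are not regex metacharacters outside a character class, so they are unlikely to matter on their own, whereas `.`, `?` and unencoded spaces can.

```python
import re
from urllib.parse import quote


def to_exclude_pattern(url: str) -> str:
    """Hypothetical helper: percent-encode a literal URL, then regex-escape it."""
    # Keep URL delimiters and existing %-escapes; encode spaces and other unsafe characters.
    encoded = quote(url, safe=":/?&=#%")
    # Escape regex metacharacters such as "." and "?" so they match literally.
    return re.escape(encoded)


# Hypothetical URLs, not taken from the thread.
raw_url = "http://example.com/my dir/page_1-a.html"
crawled_url = quote(raw_url, safe=":/?&=#%")  # same URL with %20 instead of a space
query_url = "http://example.com/page?id=1"

pattern = to_exclude_pattern(raw_url)

# The escaped, encoded pattern matches the encoded URL but not the raw one.
matches_encoded = re.fullmatch(pattern, crawled_url) is not None
matches_raw = re.fullmatch(pattern, raw_url) is not None

# An unescaped "?" makes the preceding "e" optional, so the URL does not match itself.
unescaped_self_match = re.fullmatch(query_url, query_url) is not None
escaped_self_match = re.fullmatch(re.escape(query_url), query_url) is not None

print(matches_encoded, matches_raw, unescaped_self_match, escaped_self_match)
# prints: True False False True
```

If the crawler compares filters against percent-encoded URLs, a filter entry containing a literal space would never match, and an entry containing an unescaped `?` would not match its own URL. Whether either applies to the roughly 90 URLs mentioned above would need to be checked against the actual crawl log.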
