
Tolerance for unknown/bad charsets #320

@cybergaukler

This is a fringe case, since the crawler works wonderfully 99.9% of the time.

While checking some results that did not work, I came across a (German) site that I initially got no content from:
http://www.lorei-baustoffe.de/

The problem was that it returned the following in the header:
Content-Type: text/html; charset=none

This in turn threw an error, since iconv does not recognize "none" as an encoding.

I applied a rough patch in my code to change "none" into "utf-8", and was then able to fetch the site.

I am not sure whether this would be a desired feature for the crawler as well.

  • pro: you would not need to re-crawl on charset errors
  • con: you would not be able to see that such an error exists
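
To make the trade-off concrete, here is a minimal sketch of a tolerant charset lookup that also addresses the "con": it falls back to a default encoding but still reports which declared charset was rejected, so the caller can log it. This is not the crawler's code. It is written in Python, using the standard `codecs` registry as a stand-in for iconv, and the names `charset_from_content_type`, `resolve_charset` and `DEFAULT_ENCODING` are hypothetical.

```python
import codecs
import re

# Assumption: fall back to UTF-8, as in the rough patch described above.
DEFAULT_ENCODING = "utf-8"


def charset_from_content_type(content_type):
    """Return the raw charset parameter of a Content-Type value, or None."""
    match = re.search(
        r"charset\s*=\s*[\"']?([^\s;\"']+)", content_type or "", re.IGNORECASE
    )
    return match.group(1) if match else None


def resolve_charset(content_type, default=DEFAULT_ENCODING):
    """Return (encoding, rejected).

    `encoding` is a codec name that is safe to decode with.
    `rejected` is the declared charset if it was not recognized, else None,
    so the caller can still see that a bad charset was sent.
    """
    declared = charset_from_content_type(content_type)
    if declared is None:
        # No charset declared at all: use the default, nothing to report.
        return default, None
    try:
        return codecs.lookup(declared).name, None
    except (LookupError, ValueError):
        # Unknown or malformed charset such as "none": fall back, but report it.
        return default, declared


encoding, rejected = resolve_charset("text/html; charset=none")
if rejected is not None:
    print(f"unknown charset {rejected!r}, decoding as {encoding}")
```

Returning the rejected value instead of silently swapping it keeps the error visible without forcing a re-crawl. Whether the crawler should expose this as an option or as a warning is left open here.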
