
replace fromstring(html) by fromstring(html, parser=newparser) #1575

dalf opened this issue Apr 22, 2019 · 0 comments



commented Apr 22, 2019

This is a possible speed improvement: lxml.html.fromstring without an explicit parser doesn't benefit from multithreading. See:

I can confirm the benchmark, but the result may differ with searx, since the fromstring function is called at different times.

So, for example:

def response(resp):
    results = []
    doc = fromstring(resp.text)

should be replaced by

def response(resp):
    results = []
    parser = HTMLParser()
    doc = fromstring(resp.text, parser=parser)

Most probably it would be convenient to add a utility function that does this to the searx.utils module:

from lxml.html import fromstring, HTMLParser

def htmlfromstring(text, **kwargs):
    # create a dedicated parser for each call and parse the text passed in
    parser = HTMLParser()
    return fromstring(text, parser=parser, **kwargs)
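To illustrate how such a helper might be used from several threads, here is a minimal sketch. The extract_title function and the pages list are hypothetical stand-ins for an engine's response() callback and its input; they are not part of searx. The sketch only shows the calling pattern and does not measure any speedup.

```python
from concurrent.futures import ThreadPoolExecutor

from lxml.html import HTMLParser, fromstring


def htmlfromstring(text, **kwargs):
    # A fresh parser per call, rather than relying on the default parser.
    parser = HTMLParser()
    return fromstring(text, parser=parser, **kwargs)


def extract_title(html):
    # Hypothetical stand-in for an engine's response() callback.
    doc = htmlfromstring(html)
    return doc.findtext('.//title')


# Hypothetical input: four small, complete HTML documents.
pages = [
    '<html><head><title>page %d</title></head><body><p>x</p></body></html>' % i
    for i in range(4)
]

# Each worker thread parses with its own HTMLParser instance.
with ThreadPoolExecutor(max_workers=4) as pool:
    titles = list(pool.map(extract_title, pages))

print(titles)
```

Whether this is actually faster in searx would still need to be benchmarked, since the parsing calls there happen at different times.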

Note: selectolax is faster than lxml.
