replace fromstring(html) by fromstring(html, parser=newparser) #1575

Open
dalf opened this issue Apr 22, 2019 · 0 comments


This is a possible speed improvement: lxml.html.fromstring called without an explicit parser does not benefit from multithreading. See:
https://stackoverflow.com/questions/32285453/why-does-multithreading-do-not-speed-up-parsing-html-with-lxml

I can reproduce the benchmark, but the result may differ in searx, since fromstring is called at different times there.
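A minimal sketch of how such a comparison could be run, in case someone wants to check it on their own machine. The synthetic document, the worker count, and the helper names (`parse_default`, `parse_own_parser`, `timed`) are assumptions made up for illustration; they are not taken from the Stack Overflow benchmark or from searx, and real engine responses would give more meaningful numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

from lxml.html import HTMLParser, fromstring

# Synthetic document (assumption); a real test should use actual engine responses.
HTML = "<html><body>" + "<div><a href='#'>link</a></div>" * 2000 + "</body></html>"


def parse_default(text):
    # Relies on the default parser used by lxml.html.fromstring.
    return len(fromstring(text).xpath("//a"))


def parse_own_parser(text):
    # Creates a dedicated parser for this call, as proposed in this issue.
    return len(fromstring(text, parser=HTMLParser()).xpath("//a"))


def timed(func, workers=4, jobs=32):
    """Run func over `jobs` copies of HTML on a thread pool; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, [HTML] * jobs))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    for func in (parse_default, parse_own_parser):
        seconds, results = timed(func)
        print(f"{func.__name__}: {seconds:.3f}s for {len(results)} documents")
```

The timings depend on the lxml version, the number of cores, and the document size, so no expected numbers are given here.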

So in duckduckgo.py, for example:

def response(resp):
    results = []
    doc = fromstring(resp.text)

should be replaced by

def response(resp):
    results = []
    parser = HTMLParser()
    doc = fromstring(resp.text, parser=parser)

It would probably be convenient to add a utility function that does this to the searx.utils module:

from lxml.html import fromstring, HTMLParser

def htmlfromstring(text, **kwargs):
    # Use a dedicated parser instead of the default one.
    parser = HTMLParser()
    return fromstring(text, parser=parser, **kwargs)
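A possible variant, sketched here only as an idea: instead of creating a new parser on every call, keep one parser per thread with `threading.local`. The names `get_thread_parser` and `htmlfromstring_threadlocal` are hypothetical and do not exist in searx, and whether this is measurably faster than a parser per call has not been tested.

```python
import threading

from lxml.html import HTMLParser, fromstring

# Per-thread storage: each thread sees its own `parser` attribute.
_local = threading.local()


def get_thread_parser():
    """Return an HTMLParser owned by the current thread, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = HTMLParser()
        _local.parser = parser
    return parser


def htmlfromstring_threadlocal(text, **kwargs):
    # Hypothetical variant of htmlfromstring: reuses the current thread's parser.
    return fromstring(text, parser=get_thread_parser(), **kwargs)
```

An engine would call it the same way as the function above, for example `doc = htmlfromstring_threadlocal(resp.text)`.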

Note: selectolax is faster than lxml.
