Engine code: describe which XPath can fail, which must not. #1802

Open
dalf opened this issue Jan 10, 2020 · 1 comment

@dalf dalf commented Jan 10, 2020

When parsing results, engines usually parse the HTML with lxml. If an XPath query returns no result, that may be fine, or it may trigger an error later. In the latter case, it is hard to know exactly what went wrong without looking at the downloaded HTML.

This issue suggests:

  • adding new exception classes.
  • adding two optional parameters to the eval_xpath function to check the result count.

It may help to detect when an engine starts to break if the engine code states which XPath queries must not fail (?).

I'm not sure whether it would be useful, and/or a privacy problem, for searx to collect statistics about broken XPath queries.

Class hierarchy

SearxException
	SearxParameterException
	SearxEngineException
		SearxEngineCaptchaException (instead of RuntimeWarning in google.py)
		SearxEngineXPathException
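
As a rough illustration, the hierarchy above could be written as plain Python classes. This is only a sketch: the class names come from the list above, but the constructor signature of SearxEngineXPathException, its attributes, and the message format are assumptions made for this example, and the other classes are reduced to empty bodies.

```python
class SearxException(Exception):
    """Base class for all searx-specific errors."""


class SearxParameterException(SearxException):
    """Raised for an invalid request parameter (body omitted in this sketch)."""


class SearxEngineException(SearxException):
    """Base class for errors raised while an engine runs."""


class SearxEngineCaptchaException(SearxEngineException):
    """Raised when an engine is served a CAPTCHA."""


class SearxEngineXPathException(SearxEngineException):
    """Raised when an XPath result count does not match what the engine expects."""

    # Hypothetical constructor: keeps the XPath string and the expected count.
    def __init__(self, xpath_str, eq=None, gte=None):
        self.xpath_str = xpath_str
        self.eq = eq
        self.gte = gte
        if eq is not None:
            expected = "exactly {}".format(eq)
        else:
            expected = "at least {}".format(gte)
        super().__init__(
            "XPath {!r}: expected {} result(s)".format(xpath_str, expected)
        )
```

With a common SearxEngineException parent, calling code could catch every engine failure in one place while still telling a CAPTCHA apart from a broken XPath.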

eval_xpath

def eval_xpath(element, xpath_str, eq=None, gte=None):
    xpath = get_xpath(xpath_str)
    result = xpath(element)
    # new code: check result count now
    if eq is not None and len(result) != eq:
        raise SearxEngineXPathException(xpath, eq=eq)
    if gte is not None and len(result) < gte:
        raise SearxEngineXPathException(xpath, gte=gte)
    return result
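
To try the eq and gte checks without searx installed, here is a small self-contained sketch. It is not the searx implementation: the standard library's xml.etree.ElementTree and its findall() stand in for lxml and get_xpath, the exception is a simplified stand-in with no parent classes, and the sample markup is invented for the demonstration.

```python
import xml.etree.ElementTree as ET


class SearxEngineXPathException(Exception):
    """Simplified stand-in for the proposed exception (assumption)."""

    def __init__(self, xpath_str, eq=None, gte=None):
        super().__init__("{!r} (eq={}, gte={})".format(xpath_str, eq, gte))
        self.xpath_str = xpath_str


def eval_xpath(element, xpath_str, eq=None, gte=None):
    # findall() replaces the compiled lxml XPath object in this sketch.
    result = element.findall(xpath_str)
    # Check the result count right away, as proposed in the issue.
    if eq is not None and len(result) != eq:
        raise SearxEngineXPathException(xpath_str, eq=eq)
    if gte is not None and len(result) < gte:
        raise SearxEngineXPathException(xpath_str, gte=gte)
    return result


# Invented sample document: the second item has no link.
dom = ET.fromstring(
    '<ul>'
    '<li class="b_algo"><h2><a href="https://example.org/">ok</a></h2></li>'
    '<li class="b_algo"><p>no link here</p></li>'
    '</ul>'
)

items = eval_xpath(dom, './/li[@class="b_algo"]', gte=1)
first_link = eval_xpath(items[0], './/h2/a', eq=1)[0]

try:
    eval_xpath(items[1], './/h2/a', eq=1)
    failed_xpath = None
except SearxEngineXPathException as e:
    # The exception names the XPath that did not match.
    failed_xpath = e.xpath_str
```

The second item fails at the eval_xpath call itself, so the error points at the XPath instead of surfacing later as an IndexError on [0].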

usage examples

extract_url

https://github.com/asciimoo/searx/blob/master/searx/engines/xpath.py#L53

def extract_url(xpath_results, search_url):
    if xpath_results == []:
        raise Exception('Empty url resultset')	

--> Make the check before calling extract_url

bing engine

    ...
    for result in eval_xpath(dom, '//div[@class="sa_cc"]'):
        link = eval_xpath(result, './/h3/a', eq=1)[0]
    ...
    for result in eval_xpath(dom, '//li[@class="b_algo"]'):
        link = eval_xpath(result, './/h2/a', eq=1)[0]
    ...

google engine

	title = extract_text(eval_xpath(result, title_xpath, eq=1)[0])
	url = parse_url(extract_url(eval_xpath(result, url_xpath, eq=1), google_url), google_hostname)

The large try/except block that currently ignores all parsing errors could then log the XPath that failed.
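
A minimal sketch of that logging idea, assuming a per-result try/except. The parse_results and parse_one helpers, the logger name, and the dictionary-based fake results are all hypothetical and only serve the illustration; the exception is a simplified stand-in.

```python
import logging

logger = logging.getLogger("searx.engines.example")  # hypothetical logger name


class SearxEngineXPathException(Exception):
    """Simplified stand-in for the proposed exception (assumption)."""

    def __init__(self, xpath_str, eq=None, gte=None):
        super().__init__("{!r} (eq={}, gte={})".format(xpath_str, eq, gte))
        self.xpath_str = xpath_str


def parse_results(raw_results, parse_one):
    """Parse each raw result, logging and skipping those whose XPath failed."""
    results = []
    for raw in raw_results:
        try:
            results.append(parse_one(raw))
        except SearxEngineXPathException as e:
            # Only the expected failure is caught, and the XPath ends up in the log.
            logger.warning("skipping result, XPath failed: %s", e)
    return results


def parse_one(raw):
    # Hypothetical per-result parser: a missing title plays the role of a failed XPath.
    if "title" not in raw:
        raise SearxEngineXPathException(".//h3", eq=1)
    return {"title": raw["title"]}


parsed = parse_results([{"title": "a"}, {}, {"title": "b"}], parse_one)
```

Catching only SearxEngineXPathException keeps unrelated bugs visible instead of hiding them behind a blanket handler.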

Another way, without try/except and without changing the eval_xpath function:

    title_xpr = eval_xpath(result, title_xpath)
    url_xpr = eval_xpath(result, url_xpath)
    if len(title_xpr) > 0 and len(url_xpr) > 0:
        title = extract_text(title_xpr[0])
        url = parse_url(extract_url(url_xpr, google_url), google_hostname)
        ...

The eq and gte parameters don't help much when extracting the number of results.

Using eq:

    try:
        results_num = int(eval_xpath(dom, '//div[@id="resultStats"]//text()', eq=1)[0]
                          .split()[1].replace(',', ''))
        results.append({'number_of_results': results_num})
    except (SearxEngineXPathException, IndexError, ValueError):
        pass

Without eq, with more checking:

    results_num_xpath = eval_xpath(dom, '//div[@id="resultStats"]//text()')
    if len(results_num_xpath) > 0:
        results_num_text = results_num_xpath[0]
        try:
            # split()[1] can raise IndexError, int() can raise ValueError
            results_num_text_first = results_num_text.split()[1].replace(',', '')
            results_num = int(results_num_text_first)
            results.append({'number_of_results': results_num})
        except (IndexError, ValueError):
            pass
@dalf dalf changed the title Engine code: explicitly which XPath can fail, which must not. Engine code: describe which XPath can fail, which must not. Jan 10, 2020

@asciimoo asciimoo commented Jan 12, 2020

This is a good idea, I like it a lot.
