[Bug]: Search result with broken HTML for matches inside links #1303

benjaoming · 2023-09-24T11:07:17Z

Discussed in #1302

^{Originally posted by hermannromanek September 24, 2023}

Describe the bug

When searched terms are found within a links slug, the filter get_content_snippet returns invalid HTML.

Problem is the filter cleaning HTML (maybe intentionally to also search through that?) for the keyword. My quick fix was to do a strip_tags at the start of the function:

This probably means it is no longer neccessary in the nested function clean_text.

Steps to reproduce.

Create a page under slug lorem-ipsum
Create a second page linking to this, e.g. with content:
The text [Lorem Ipsum](/lorem-ipsum/) will display incorrectly when searching for "lorem".
On a scope containing the buggy page, search for term "lorem".

Expected behavior

Expected would be correct HTML, highlighting the search term within the a tag.

May require more complex logic if the search should find slugs that are named differently than the link pointing there.

Python Version

3.9+

Django Version

4.0+

Extra context

No response

The text was updated successfully, but these errors were encountered:

oscarmcm · 2023-09-25T22:32:18Z

I'll take a look on this!

oscarmcm · 2023-09-26T17:42:34Z

This has some very tricky edge cases here, what if you store the following:

I like the apples from the [fresh market](https://delimarket.example)

And then you search for delimarket:

First Django will go to the search view, and there we use the most common FTS technique icontains

https://github.com/django-wiki/django-wiki/blob/main/src/wiki/views/article.py#L703-L706

Later on, Django renders the template tied to the search view, in the render process we have inside a loop that then calls an include which finally executes:

<p><small>{{ article.render|get_content_snippet:search_query }}</small></p>

https://github.com/django-wiki/django-wiki/blob/main/src/wiki/templates/wiki/includes/searchresult.html

The problem here is that when get_content_snippet get called, what it receives is HTML content from the render method, something like this:

<p>I like the apples from the <a href="https://delimarket.example">fresh market</a></p>\n

We might have two possible solutions:

Solution A - Do the search inside the markdown content

Gets the full markdown content
Search the term inside the content
Do the content split from the term to the left and to the right
Check the final markdown is valid
Complete it until it's valid

Solution B - Do the search inside the HTML content

Get the content from the render method
Do not strip tags
- Because if we do, then searching for the attr in the link will be "invalid"
Do the content split from the term to the left and to the right
Check the final HTML is valid
Complete it until it's valid

Both solutions share the same complexity level - Cc @benjaoming @hermannromanek

oscarmcm · 2023-10-02T18:52:08Z

@benjaoming what option do you think will be the best here?

benjaoming · 2023-10-02T20:14:35Z

It's possibly Solution B that's best and easiest way? The consideration that I'd give is that there can be additional plugin contents included after calling render().

Rendered HTML is cached so that should be easy.

An optional refinement to Solution B could be to create a search index only from specific HTML attribute values and text nodes in the HTML tree, given that certain HTML contents would be irrelevant to the search. Let's say that the user wants to find the keyword "href", that would be pretty hard if only rendered HTML is parsed.

In Solution A (markdown search), it could similarly be argued that are contents irrelevant to the search index... but stripping out irrelevant custom markdown tags is a much more difficult problem to solve than using a conventional HTML/XML parser.

teury · 2024-01-12T01:29:39Z

I want to add another error that occurs when searching for the title of an h2. Conflicts with the edit plugin

This is the way I solved it

@register.filter
def get_text(html):
  cleaned = bleach.clean(html, tags=[], strip=True)

  # Remove all occurrences of [edit]
  cleaned = re.sub(r'\[edit\]', '', cleaned)

  return cleaned

      {{ article.render|get_text|get_content_snippet:search_query }}

I simply take advantage of the bleach plugin that is already installed and clean the html with all the tags before searching

benjaoming added the bug Confirmed bug label Sep 24, 2023

oscarmcm self-assigned this Sep 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Search result with broken HTML for matches inside links #1303

[Bug]: Search result with broken HTML for matches inside links #1303

benjaoming commented Sep 24, 2023

Describe the bug

Steps to reproduce.

Expected behavior

Python Version

Django Version

Extra context

oscarmcm commented Sep 25, 2023

oscarmcm commented Sep 26, 2023

oscarmcm commented Oct 2, 2023

benjaoming commented Oct 2, 2023

teury commented Jan 12, 2024 •

edited

[Bug]: Search result with broken HTML for matches inside links #1303

[Bug]: Search result with broken HTML for matches inside links #1303

Comments

benjaoming commented Sep 24, 2023

Discussed in #1302

Describe the bug

Steps to reproduce.

Expected behavior

Python Version

Django Version

Extra context

oscarmcm commented Sep 25, 2023

oscarmcm commented Sep 26, 2023

oscarmcm commented Oct 2, 2023

benjaoming commented Oct 2, 2023

teury commented Jan 12, 2024 • edited

teury commented Jan 12, 2024 •

edited