Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Search result with broken HTML for matches inside links #1303

Open
benjaoming opened this issue Sep 24, 2023 Discussed in #1302 · 5 comments
Open

[Bug]: Search result with broken HTML for matches inside links #1303

benjaoming opened this issue Sep 24, 2023 Discussed in #1302 · 5 comments
Assignees
Labels
bug Confirmed bug

Comments

@benjaoming
Copy link
Member

Discussed in #1302

Originally posted by hermannromanek September 24, 2023

Describe the bug

When searched terms are found within a links slug, the filter get_content_snippet returns invalid HTML.

Problem is the filter cleaning HTML (maybe intentionally to also search through that?) for the keyword. My quick fix was to do a strip_tags at the start of the function:

grafik

This probably means it is no longer neccessary in the nested function clean_text.

Steps to reproduce.

  1. Create a page under slug lorem-ipsum
  2. Create a second page linking to this, e.g. with content:
    The text [Lorem Ipsum](/lorem-ipsum/) will display incorrectly when searching for "lorem".
  3. On a scope containing the buggy page, search for term "lorem".

grafik

Expected behavior

Expected would be correct HTML, highlighting the search term within the a tag.

May require more complex logic if the search should find slugs that are named differently than the link pointing there.

Python Version

3.9+

Django Version

4.0+

Extra context

No response

@benjaoming benjaoming added the bug Confirmed bug label Sep 24, 2023
@oscarmcm oscarmcm self-assigned this Sep 25, 2023
@oscarmcm
Copy link
Member

I'll take a look on this!

@oscarmcm
Copy link
Member

This has some very tricky edge cases here, what if you store the following:

I like the apples from the [fresh market](https://delimarket.example)

And then you search for delimarket:

First Django will go to the search view, and there we use the most common FTS technique icontains

https://github.com/django-wiki/django-wiki/blob/main/src/wiki/views/article.py#L703-L706

Later on, Django renders the template tied to the search view, in the render process we have inside a loop that then calls an include which finally executes:

<p><small>{{ article.render|get_content_snippet:search_query }}</small></p>

https://github.com/django-wiki/django-wiki/blob/main/src/wiki/templates/wiki/includes/searchresult.html

The problem here is that when get_content_snippet get called, what it receives is HTML content from the render method, something like this:

<p>I like the apples from the <a href="https://delimarket.example">fresh market</a></p>\n

We might have two possible solutions:

  1. Solution A - Do the search inside the markdown content
  • Gets the full markdown content
  • Search the term inside the content
  • Do the content split from the term to the left and to the right
  • Check the final markdown is valid
  • Complete it until it's valid
  1. Solution B - Do the search inside the HTML content
  • Get the content from the render method
  • Do not strip tags
    • Because if we do, then searching for the attr in the link will be "invalid"
  • Do the content split from the term to the left and to the right
  • Check the final HTML is valid
  • Complete it until it's valid

Both solutions share the same complexity level - Cc @benjaoming @hermannromanek

@oscarmcm
Copy link
Member

oscarmcm commented Oct 2, 2023

@benjaoming what option do you think will be the best here?

@benjaoming
Copy link
Member Author

It's possibly Solution B that's best and easiest way? The consideration that I'd give is that there can be additional plugin contents included after calling render().

Rendered HTML is cached so that should be easy.

An optional refinement to Solution B could be to create a search index only from specific HTML attribute values and text nodes in the HTML tree, given that certain HTML contents would be irrelevant to the search. Let's say that the user wants to find the keyword "href", that would be pretty hard if only rendered HTML is parsed.

In Solution A (markdown search), it could similarly be argued that are contents irrelevant to the search index... but stripping out irrelevant custom markdown tags is a much more difficult problem to solve than using a conventional HTML/XML parser.

@teury
Copy link
Contributor

teury commented Jan 12, 2024

I want to add another error that occurs when searching for the title of an h2. Conflicts with the edit plugin

image

This is the way I solved it

@register.filter
def get_text(html):
  cleaned = bleach.clean(html, tags=[], strip=True)

  # Remove all occurrences of [edit]
  cleaned = re.sub(r'\[edit\]', '', cleaned)

  return cleaned
      {{ article.render|get_text|get_content_snippet:search_query }}

I simply take advantage of the bleach plugin that is already installed and clean the html with all the tags before searching

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed bug
Projects
None yet
Development

No branches or pull requests

3 participants