Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delete() is not escaping special XML characters in its query or ID parameters. #357

Closed
4 tasks done
ramayer opened this issue Jan 22, 2021 · 5 comments
Closed
4 tasks done

Comments

@ramayer
Copy link
Contributor

ramayer commented Jan 22, 2021

I have

Expected behaviour

The statement

    self.solr.delete(id='cats<dogs')

should delete a document with the ID that has a string value of "cats<dogs", and the statement

    self.solr.delete(id='doc_4</id><id>doc_3</id><id>doc_2</id><id>doc_1',commit=True)

should only delete a document with an ID that has the string value of doc_4</id><id>doc_3</id><id>doc_2</id><id>doc_1 and the statement

     self.solr.delete(q='id:*</query><query> id:999 AND id:9999') 

should not delete all documents.

Actual behaviour

  • The first statement above gives an error message of malformed XML.
  • The second statement deletes more documents than it should, and fails to delete the one that it should delete.
  • The third statement deletes all documents.

Steps to reproduce the behaviour

  1.    self.solr.delete(id='cats<dogs')
    

Configuration

  • Operating system version: PopOS, Ubuntu, Microsoft.
  • Search engine version: Solr 7.2, Solr 8.4
  • Python version: 3.8.5
  • pysolr version: v3.9.0 or earlier
@acdha
Copy link
Collaborator

acdha commented Jan 22, 2021

This is an interesting find: pysolr uses ElementTree to construct most of its requests but not the deletes and since it has been predominantly used with projects like django-haystack which don't pass unvalidated user input through that hasn't come up before.

I think it would be more consistent to make https://github.com/django-haystack/pysolr/blob/master/pysolr.py#L1122 use ElementTree like the rest of code, or possibly to switch to using the JSON request format since that's usually faster and it should be supported by all but truly ancient versions of Solr at this point.

@ramayer
Copy link
Contributor Author

ramayer commented Jan 22, 2021

I thought about that but was trying to make minimal changes.

I'm happy to re-submit with either of the other approaches instead.

@ramayer
Copy link
Contributor Author

ramayer commented Jan 22, 2021

I submitted [#358] which has the same functionality and tests as [#356] but uses ElementTree to generate the XML. This feels safer to me than switching to JSON, since in most cases (those with no special characters involved), solr will be receiving the exact same messages that it always has with earlier versions of pysolr. If people did prefer JSON over XML, I'd be tempted to submit an entirely separate pull request that switches all the API calls (add/update, etc) to JSON too - but think that's out of scope for this issue.

Oh and we also don't pass unvalidated user input -- but we did have deletes by query on fields that had organization names in them. Annoyingly there were XML special characters in quite a few of those ("Smith & Wesson", etc.). Investigating why those failed led to noticing that delete by ID had a similar problem.

@ramayer
Copy link
Contributor Author

ramayer commented Jan 22, 2021

I noticed my pull request failed the flake8 test (missing whitespace and code too long) -- I'll submit a cleaner version today.

@acdha
Copy link
Collaborator

acdha commented Jan 23, 2021

I merged #358 on the theory that it's consistent with what we're doing for everything else. I do like the idea of switching it to JSON but that should probably be part of a phased deprecation.

Thanks for this, it's a good improvement!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants