delete() is not escaping special XML characters in its query or ID parameters. #357

ramayer · 2021-01-22T01:12:12Z

I have

Tested with the latest release
Searched for similar existing issues
Tested with the current master branch
... And I submitted a pull request [Escape special characters in the XML generated in the delete() method #356] fixing the issue. [Edit: Based on comments below I also submitted a patch [Escape xml in delete() using ElementTree #358] changing the way delete generates XML to use ElementTree. ]

Expected behaviour

The statement

    self.solr.delete(id='cats<dogs')

should delete a document with the ID that has a string value of "cats<dogs", and the statement

    self.solr.delete(id='doc_4</id><id>doc_3</id><id>doc_2</id><id>doc_1',commit=True)

should only delete a document with an ID that has the string value of doc_4</id><id>doc_3</id><id>doc_2</id><id>doc_1 and the statement

     self.solr.delete(q='id:*</query><query> id:999 AND id:9999')

should not delete all documents.

Actual behaviour

The first statement above gives an error message of malformed XML.
The second statement deletes more documents than it should, and fails to delete the one that it should delete.
The third statement deletes all documents.

Steps to reproduce the behaviour

```
   self.solr.delete(id='cats<dogs')
```

Configuration

Operating system version: PopOS, Ubuntu, Microsoft.
Search engine version: Solr 7.2, Solr 8.4
Python version: 3.8.5
pysolr version: v3.9.0 or earlier

The text was updated successfully, but these errors were encountered:

acdha · 2021-01-22T01:22:03Z

This is an interesting find: pysolr uses ElementTree to construct most of its requests but not the deletes and since it has been predominantly used with projects like django-haystack which don't pass unvalidated user input through that hasn't come up before.

I think it would be more consistent to make https://github.com/django-haystack/pysolr/blob/master/pysolr.py#L1122 use ElementTree like the rest of code, or possibly to switch to using the JSON request format since that's usually faster and it should be supported by all but truly ancient versions of Solr at this point.

ramayer · 2021-01-22T01:28:37Z

I thought about that but was trying to make minimal changes.

I'm happy to re-submit with either of the other approaches instead.

ramayer · 2021-01-22T01:44:05Z

I submitted [#358] which has the same functionality and tests as [#356] but uses ElementTree to generate the XML. This feels safer to me than switching to JSON, since in most cases (those with no special characters involved), solr will be receiving the exact same messages that it always has with earlier versions of pysolr. If people did prefer JSON over XML, I'd be tempted to submit an entirely separate pull request that switches all the API calls (add/update, etc) to JSON too - but think that's out of scope for this issue.

Oh and we also don't pass unvalidated user input -- but we did have deletes by query on fields that had organization names in them. Annoyingly there were XML special characters in quite a few of those ("Smith & Wesson", etc.). Investigating why those failed led to noticing that delete by ID had a similar problem.

ramayer · 2021-01-22T16:02:11Z

I noticed my pull request failed the flake8 test (missing whitespace and code too long) -- I'll submit a cleaner version today.

acdha · 2021-01-23T15:57:58Z

I merged #358 on the theory that it's consistent with what we're doing for everything else. I do like the idea of switching it to JSON but that should probably be part of a phased deprecation.

Thanks for this, it's a good improvement!

ramayer mentioned this issue Jan 22, 2021

Escape special characters in the XML generated in the delete() method #356

Closed

acdha closed this as completed Jan 23, 2021

ramayer mentioned this issue Jan 25, 2021

Escape xml in delete() using ElementTree #358

Merged

rmayer-sst mentioned this issue May 9, 2024

Code Injection #400

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

delete() is not escaping special XML characters in its query or ID parameters. #357

delete() is not escaping special XML characters in its query or ID parameters. #357

ramayer commented Jan 22, 2021 •

edited

acdha commented Jan 22, 2021

ramayer commented Jan 22, 2021

ramayer commented Jan 22, 2021 •

edited

ramayer commented Jan 22, 2021

acdha commented Jan 23, 2021

delete() is not escaping special XML characters in its query or ID parameters. #357

delete() is not escaping special XML characters in its query or ID parameters. #357

Comments

ramayer commented Jan 22, 2021 • edited

I have

Expected behaviour

Actual behaviour

Steps to reproduce the behaviour

Configuration

acdha commented Jan 22, 2021

ramayer commented Jan 22, 2021

ramayer commented Jan 22, 2021 • edited

ramayer commented Jan 22, 2021

acdha commented Jan 23, 2021

ramayer commented Jan 22, 2021 •

edited

ramayer commented Jan 22, 2021 •

edited