Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML should not be parsed with regex #255

Closed
LincolnPuzey opened this issue Mar 26, 2021 · 8 comments
Closed

HTML should not be parsed with regex #255

LincolnPuzey opened this issue Mar 26, 2021 · 8 comments

Comments

@LincolnPuzey
Copy link

The middleware's process_response method uses regex to parse the HTML response and inject content.

This is easily broken, leading to the content being injected in the wrong place or not at all.

I will open a PR with unit tests showing this.

See this stackoverflow post as to why parsing HTML with regex is a bad idea.

@Mogost
Copy link
Member

Mogost commented Mar 26, 2021

In fact, you are right.
In principle, I see two options here.

  1. Go back to the tags, which force the user to explicitly specify where to insert the hijack code.
  2. Use somthing like lxml to parse html. (And this approach has even more problems than the current one. Performance and safety.)

As far as I understand, the current approach was inspired by the django-debug-toolbar.

@matthiask @tim-schilling May I hear your opinions as django-debug-toolbar leaders on this, since a similar situation clearly affects django-debug-toolbar as well.

@tim-schilling
Copy link

This logic predates my involvement with the toolbar. It looks like it was originally a version of str.replace, but evolved into a regex to parse it. Avoiding a dependency to inject the content ranks high on my priority list so that removes using an actual parser. Writing a parser is generally out too. The tag approach would be the next tool I'd reach for if this was causing issues for enough people.

I agree that it isn't the most robust approach, but it's been good enough for the toolbar for a few reasons:

  1. Ease of installation. We already have 3 steps for installation (installed apps, middleware and urls). Adding one more would make the toolbar installation even more problematic than it already is for new devs.
  2. We're parsing the document for a very limited scope of use. If we were manipulating the whole document, a parser would be required.
  3. The tag we're looking for is pretty standard and cases of breakage should be solveable by the developer using INSERT_BEFORE.

Keep in mind those opinions are in regards to the django debug toolbar. django-hijack specifically could consider overriding the base admin templates and providing a template tag in the cases of people not using it or using a drop-in replacement.

@Mogost
Copy link
Member

Mogost commented Mar 26, 2021

Thank you for your participation @tim-schilling.

Actually, I don't know what we should do in this situation.
On the one hand I really like the simplicity of installation, on the other hand I see no way to solve this issue without using tags.
I don't even think we should consider the parser option.
It's too complicated and won't be worth the simplicity. We also have to pay for this simplicity with performance.

@codingjoe what do you think?
Maybe there is some other Django magic that would help us, but I don't know it.

@codingjoe
Copy link
Collaborator

codingjoe commented Mar 27, 2021

Hi @LincolnPuzey,

Thank you for reaching out to us and addressing this. I do agree, this isn't a once size fits all solution and I do see scenarios where this might break. However, as mentioned by @Mogost, we blatantly stole the idea from django-debug-toolbar, probably the most popular Django package out there. I guess what applies to both packages, is that this bit of code is barley every executed on a production system. Again we only inject the notice, if a user was hijacked which by default only superusers can do. That does limit potential exposure to the problem you are describing.

However, yes, especially if you are running a side like stackoverflow, where users share HTML code, you may run into problems. But this is why this is based on a setting that can be adjusted. One option is to set it to something dubious that is unlikely or virtually impossible to guess, something like:

# key being creaded with secrets.token_urlsafe()
HIJACK_INSERT_BEFORE = "<!--// SUPER_SECRET_KEY // -->"

You now have one of two options, you may add this to your base template for the middleware to inject the notification or you don't and add the notification to your template yourself. Something like this:

{% if request.user.is_hijacked %}
<!-- Your custom notification or -->
{% include "hijack/notification" with request=request only %}
{% endif %}

Anyhow, here is what I propose we do:

  1. We improve the documentation to include the solutions I pointed out.
  2. We allow people to set HIJACK_INSERT_BEFORE to None and disable the injection.

Beyond that, I believe RegEx is suitable to provide both performance and comfort. At the end, this is only to make the integration easier for users. However, we could consider using Python's HTMLParser, however I would believe this to be slower.

Thanks @Mogost and @tim-schilling for your remarks. Please feel free to share your thoughts.

Best,
Joe

@LincolnPuzey
Copy link
Author

Thanks for your thoughts on this everyone.

I didn't know the code was copied straight from debug-toolbar. However that only ever runs locally in DEBUG mode. I find it hard to trust a middleware that is totally re-writing the response HTML body in production.

My proposed solution would be to split the HijackUserMiddleware class into two classes:

  • HijackUserMiddleware which just implements process_request to set the user.is_hijacked flag
  • HijackUserWithNotificationMiddleware which subclasses above and also implements process_response to inject the response.
  • bring back a standard template tag like hijack_notification which was previously in the library.

Then beginner users can simply use HijackUserWithNotificationMiddleware middleware for easy installation.

Or more advanced users that want a more robust setup can use HijackUserMiddleware and put the template tag in their base template.


Or, given that the process_request method is not doing much, simply bring back a standard template tag like hijack_notification that looks directly at request.session to determine if to show the notification.


@codingjoe @Mogost Thoughts on this?

@codingjoe
Copy link
Collaborator

Hi @codingjoe,

You are making a valid point. However, we should also consider the security implications. The notification is a vital part, making sure that it is always visible is important, to avoid that users may perform unintended actions.
The template tag required users to add the tag to all templates. Yes, a base.html template pattern is common, but will all too often forget to add the notification to places like the admin or other 3rd party.

I personally do not share your concern about a middleware altering a request or response. That is quite literally what they are for, as described in the very first sentence here:

Middleware is a framework of hooks into Django’s request/response processing. It’s a light, low-level “plugin” system for globally altering Django’s input or output.

That aside, I still understand your concern, and we should address it properly. I don't believe we need a second middleware. Removing the first one, will achieve your goal. Yes, you won't have the user attribute, but following the general logic, we shouldn't fiddle with every request either, if not necessary.

Adding a template tag, as before seems a bit counterintuitive to me. A simple include should do. In most cases, I'd probably like to include a custom template as well.

Lastly, yes, we are running this code in production, not in debug mode. I agree, there is a difference. However, we do not alter all requests. We only do so for hijacked requests.

Anyhow, I really appreciate the discussion. I hope I am not being too defensive. This isn't as much about protecting our design decisions, as about keeping the code base small. I am pleased to see that a curious developer can understand the code base so quickly and discuss concepts. However, I also believe not everyone will take the time, investigate the inner workings of this package. And for those, I want to make sure we provide the best security possible.

I am on vacation right now, but I will try to find some time next week to write and test some documentation that implements you suggestions.

Best,
Joe

@timnyborg
Copy link
Contributor

Leaving aside the future of the HTML-parsing, I'd like to second allowing the content injection to be skipped if HIJACK_INSERT_BEFORE is None. I came here to open a new issue with that suggestion before stumbling on this one.

It'd be a very simple change:

    def process_response(self, request, response):
        """Render hijack notification and inject into HTML response."""
        if not getattr(request.user, "is_hijacked", False) or not settings.HIJACK_INSERT_BEFORE:
            return response
        ...

I've recently added django-hijack to a project where I want the notification built right into my layout, and to do so I've had to override hijack/notification.html with an empty template. Not an elegant solution, but nicer than subclassing the middleware. Nulling that setting would be vastly easier, more obvious, and immune to breakage.

@codingjoe
Copy link
Collaborator

Thank you, @timnyborg, for opening up the PR. It doesn't change the regex implementation, though. In lack of a better solution, both for us nor debug toolbar, I will close the issue for. If someone has a new solution, I'd be happy to revisit this topic. Best Joe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants