Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using special characters in creator name and tags #3793

Closed
wants to merge 9 commits into from

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Feb 13, 2024

Fixes

Fixes #3791 by @AetherUnbound

Description

This PR changes the way the path is parsed to extract the creator name to allow for URL-encoded special symbols like / and ? within it. The creator name is also encoded when creating the app creator page link.
To enable testing this change locally, I added special symbols to the creator names sample data.

Testing Instructions

Run the local API.

Then, run the Nuxt app using the local API by running the app using API_URL=http://localhost:50270 just frontend/run dev:only

Enable additional_search_views on the /preferences page.
Go to localhost:8443/audio/source/wikimedia_audio and navigate to one of the results that have a slash in the creator name. From that single result page, click on the creator link to open the creator view - there should be no errors here.

Search for "Wish You Were Here" in audio. You should see an item with a question mark in the creator name. If you click on it, you should see that the name is encoded in the URL: http://localhost:8443/audio/source/jamendo/creator/The.madpix.project%3F/

If you search for "German term Vulkan" in audio, you should see an item with a / in the creator name. If you click on it, you should see the slash as is in the URL, however, this slash is actually encoded in the app code. You can get to this page using this URL: http://localhost:8443/audio/source/wikimedia_audio/creator/wiki%2FJeuwre/

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat requested a review from a team as a code owner February 13, 2024 08:57
@github-actions github-actions bot added the 🧱 stack: frontend Related to the Nuxt frontend label Feb 13, 2024
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🕹 aspect: interface Concerns end-users' experience with the software labels Feb 13, 2024
@obulat obulat mentioned this pull request Feb 13, 2024
8 tasks
Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really surprised the API handles this so well, but there is a big edge case... ?!

http://0.0.0.0:50280/v1/images/source/flickr/creator////me////?you-and-them/\////

Leads to the following query:

openverse-web-1  | {'from': 0,
openverse-web-1  |  'query': {'bool': {'filter': [{'term': {'source': 'flickr'}},
openverse-web-1  |                                {'term': {'creator.keyword': '///me///'}}],
openverse-web-1  |                     'must_not': [{'term': {'mature': True}}]}},
openverse-web-1  |  'size': 40,
openverse-web-1  |  'sort': [{'created_on': {'order': 'desc'}}]}

I wonder if we should just percent encode the creator name? Or maybe conditionally percent encode, at least ?, and any other significant things we could find?

Not requesting changes because it could be considered a separate issue. Let me know!

@obulat obulat requested a review from a team as a code owner February 14, 2024 17:17
Comment on lines 3 to 7
const removeSlashes = (path: string): string => {
return path.replace(/^\/|\/$/g, "")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only remove leading and trailing slashes from the path? What about trailing or leading slashes in creator names? e.g., image/source/flickr//edgycreator// where the creator name is /edgycreator/?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@obulat have you seen this question? Just want to make sure that path gets passed in with the /s from the path, and not the raw creator name characters on both ends of the string and always.

If that is the case, then can we just do path.substring(1, path.length - 1) and avoid the regex altogether? It would communicate things a bit more clearly, especially if the function is renamed to trimLeadingAndTrailingSlashes (or some less verbose version 😁 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm realising that potentially my question doesn't make sense anymore, because all slashes that are elements of the creator's name should be URL encoded, right? I am still waiting to approve just to make sure, because http://192.168.20.75:8443/audio/source/jamendo/creator/%2FThe.madpix.project%3F/, with a leading slash in the creator name URL encoded does work, but http://192.168.20.75:8443/audio/source/jamendo/creator//The.madpix.project%3F/ does not.

However, the PR description says that we should be able to visit http://localhost:8443/audio/source/wikimedia_audio/creator/wiki/Jeuwre/, but that doesn't work either, it 404s. The URL encoded version does work.

Except, they only work when navigate to client side. Navigating to them SSR (i.e., reloading the page) does not work. Is it an issue with preferences, perhaps? I also don't see the additional search view links on the single result page, so I think it may be an issue outside this PR, but I want to make absolutely sure it is working the right way, because what I am experiencing differs from the PR description.

@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch from 8080a7a to 8be847a Compare February 15, 2024 07:44
@AetherUnbound
Copy link
Contributor

I wonder if we should just percent encode the creator name? Or maybe conditionally percent encode, at least ?, and any other significant things we could find?

This was exactly my thought for approaching this too! I'm not sure how much it matters, but leaving the / unencoded forces us into the position of accepting that we won't have any future sub-paths to /creator/, since those sub-paths might be interpreted as the creator name. FWIW I think it's worth changing this PRs behavior to URL encode things!

@obulat
Copy link
Contributor Author

obulat commented Feb 16, 2024

This was exactly my thought for approaching this too! I'm not sure how much it matters, but leaving the / unencoded forces us into the position of accepting that we won't have any future sub-paths to /creator/, since those sub-paths might be interpreted as the creator name. FWIW I think it's worth changing this PRs behavior to URL encode things!

I added encoding to the link that we create for the creator button in 9a535b9. If you want to access a page for a creator that has special symbols like / or ? in it by entering the URL, you would have to manually encode the creator name.

@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch from 8be847a to fa57822 Compare February 16, 2024 03:58
@AetherUnbound
Copy link
Contributor

I'm not sure if it's just my machine, but I'm having a really hard time testing this 😞 lots of the pages aren't loading, and I can't navigate to any of the links you posted (even after trying j down -v && j api/init and using the API URL variable directly). Some of the pages half load or stay loading forever. It looks like not all the content on the pages is loading because my default font is being used for some of the pages that get half rendered. I tried the exact same steps on main (including pointing to the local API) and the pages load and function as expected 😕

@obulat
Copy link
Contributor Author

obulat commented Feb 19, 2024

I'm not sure if it's just my machine, but I'm having a really hard time testing this 😞 lots of the pages aren't loading, and I can't navigate to any of the links you posted (even after trying j down -v && j api/init and using the API URL variable directly). Some of the pages half load or stay loading forever. It looks like not all the content on the pages is loading because my default font is being used for some of the pages that get half rendered. I tried the exact same steps on main (including pointing to the local API) and the pages load and function as expected 😕

This is weird. I tried checking out the branch again, and it works well for me locally 🙃

Just in case, here are the first steps I used for testing:

  1. Checkout fix/allow-slashes-in-creator-name

  2. Run just install

  3. Run the API locally. Make sure that the updated sample data with special symbols in creator names is used.

  4. Run just up

  5. Delete all indices using Elasticvue

  6. Run just init

  7. API_URL=http://localhost:50270/ just frontend/run dev:only

  8. Open http://localhost:8443/preferences and make sure that the additional_search_views is on.

Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More a request for clarification than a change, but I do have an additional fix to the sample audio data.

Comment on lines 3 to 7
const removeSlashes = (path: string): string => {
return path.replace(/^\/|\/$/g, "")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@obulat have you seen this question? Just want to make sure that path gets passed in with the /s from the path, and not the raw creator name characters on both ends of the string and always.

If that is the case, then can we just do path.substring(1, path.length - 1) and avoid the regex altogether? It would communicate things a bit more clearly, especially if the function is renamed to trimLeadingAndTrailingSlashes (or some less verbose version 😁 )

sample_data/sample_audio.csv Show resolved Hide resolved
Comment on lines 3 to 7
const removeSlashes = (path: string): string => {
return path.replace(/^\/|\/$/g, "")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm realising that potentially my question doesn't make sense anymore, because all slashes that are elements of the creator's name should be URL encoded, right? I am still waiting to approve just to make sure, because http://192.168.20.75:8443/audio/source/jamendo/creator/%2FThe.madpix.project%3F/, with a leading slash in the creator name URL encoded does work, but http://192.168.20.75:8443/audio/source/jamendo/creator//The.madpix.project%3F/ does not.

However, the PR description says that we should be able to visit http://localhost:8443/audio/source/wikimedia_audio/creator/wiki/Jeuwre/, but that doesn't work either, it 404s. The URL encoded version does work.

Except, they only work when navigate to client side. Navigating to them SSR (i.e., reloading the page) does not work. Is it an issue with preferences, perhaps? I also don't see the additional search view links on the single result page, so I think it may be an issue outside this PR, but I want to make absolutely sure it is working the right way, because what I am experiencing differs from the PR description.

@sarayourfriend
Copy link
Contributor

@AetherUnbound I am not having the same issue as you.

We can pair this week to try to discover why your local is behaving differently, if you would like.

@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch 2 times, most recently from 5bd37c8 to 22722b5 Compare February 20, 2024 11:13
@obulat
Copy link
Contributor Author

obulat commented Feb 20, 2024

I'm realising that potentially my question doesn't make sense anymore, because all slashes that are elements of the creator's name should be URL encoded, right? I am still waiting to approve just to make sure, because http://192.168.20.75:8443/audio/source/jamendo/creator/%2FThe.madpix.project%3F/, with a leading slash in the creator name URL encoded does work, but http://192.168.20.75:8443/audio/source/jamendo/creator//The.madpix.project%3F/ does not.

However, the PR description says that we should be able to visit http://localhost:8443/audio/source/wikimedia_audio/creator/wiki/Jeuwre/, but that doesn't work either, it 404s. The URL encoded version does work.

Except, they only work when navigate to client side. Navigating to them SSR (i.e., reloading the page) does not work. Is it an issue with preferences, perhaps? I also don't see the additional search view links on the single result page, so I think it may be an issue outside this PR, but I want to make absolutely sure it is working the right way, because what I am experiencing differs from the PR description.

I was having problems testing the creator names that start with / locally: they would be correctly saved as URL-encoded versions, but would redirect to a creator name without a leading slash. Then, I realized that the pathMatch returned by vue-router is always URL-decoded, so it never shows the URL-encoded version of the creator.

So, I changed the code a little bit to only allow for URL-encoded creator names. Now, if you use a non-encoded slash / in the creator name, it shows a full-page yellow error page. To correctly parse out the encoded creator name, I had to add the fullPath as a parameter, as it shows the exact URL instead of always decoding it. Previously, both %2Fcreator and /creator would be passed as /creator in the parser function, and now these are passed as %2Fcreator and /creator, respectively.

I also moved the path collectionParams validation to the middleware because throwing an error in the setup function shows the Nuxt generic error page instead of Openverse yellow error page.

Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM. I'm unable to test this locally at the moment, but will do so later.

Something I didn't realise about the implementation until now, is the use of _.vue, rather than typical path parameters. Was that originally to allow for / in the creator names? If we are now disallowing it, could we switch to using regular Nuxt path params by changing the file names to audio/source/[source].vue, audio/source/[source]/creator/[creator].vue, and so forth? Maybe it's even possible to use [mediaType]/source/[source]/creator/[creator].vue, if that would simplify the logic altogether (I'm not sure if this is the case).

Would this overall simplify the implementation of these pages or is it more complex for some reason?

Also, do we need to handle URL encoding tags? I don't know how much normalisation we do to them on the presentation layer.

@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch from 22722b5 to af9b908 Compare February 21, 2024 10:51
@obulat obulat changed the title Allow using slashes in creator name Allow using special characters in creator name and tags Feb 21, 2024
@obulat
Copy link
Contributor Author

obulat commented Feb 21, 2024

Something I didn't realise about the implementation until now, is the use of _.vue, rather than typical path parameters. Was that originally to allow for / in the creator names? If we are now disallowing it, could we switch to using regular Nuxt path params by changing the file names to audio/source/[source].vue, audio/source/[source]/creator/[creator].vue, and so forth? Maybe it's even possible to use [mediaType]/source/[source]/creator/[creator].vue, if that would simplify the logic altogether (I'm not sure if this is the case).

I tried this at first, but had a problem where the nested route (creator) was being caught by the parent route. This is why I decided to move to a catch-all route, and handle the parsing of the path manually.

I went back to the code, and made some more changes to make sure that the tags and creators with special characters work both on SSR and on the client-side navigation. I also added e2e tests (with the special characters set in tapes for easier testing), so I'm confident that parsing works correctly now. Note: previously, when I was doing some testing using the local API, I couldn't understand why the pages were not rendered. Then I found that Jamendo results were filtered as dead links...

Parsing of the path was moved to a single middleware. The source and the tags are saved to the media store as-is, but are encoded when creating the path for the frontend links and for the API requests.

@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch 2 times, most recently from e78c0a3 to 789c679 Compare February 21, 2024 12:11
@sarayourfriend
Copy link
Contributor

Unfortunately SSR is still broken for creator names, for me :(

I think that this might be actually an issue with trying to use the path at all, potentially with the browser decoding the percent-encoded slash into a plaintext slash upon page load. Then reloading, or copying the URL, results in an usuable URL, because it gives back the URL with the plain-text slash (http://192.168.20.75:8443/audio/source/wikimedia_audio/creator/wiki/Jeuwre/). You can't navigate to that.

ssrbroken.mp4

This worries me a bit, because I think it invalidates a core supposition that we had about how to approach this from a technical perspective. Specifically, I think it means we should actually use query parameters instead of trying to treat the creator (and tags) as a RESTful resource. Source is (theoretically) safe from this, because we control the slugs that are used for them, and can ensure they are safe for URL path parameters (do not have ? or /, etc). User generated content, on the other hand, cannot be controlled in this way, and it doesn't look like browsers are going to cooperate with us when we try our URL encoded path workaround.

Unfortunately, I think that means taking a big step back and rethinking at least the creator and tag endpoints, on both the frontend and API side, to ensure they're able to handle the full spectrum of possible values.

@AetherUnbound
Copy link
Contributor

I'm having the same difficulties as Sara - even hitting refresh on a page that's client-side rendered will then cause the loading to fail from SSR 😞 And the links in the PR description don't work for accessing media locally. Is moving those potential user-generated values to query parameters the only way? That is disappointing that we seem to be fighting the browser on that front 😕

@sarayourfriend
Copy link
Contributor

sarayourfriend commented Feb 22, 2024

That is disappointing that we seem to be fighting the browser on that front

I think it's more that we're fighting the HTTP standard, rather than the browser 🤷 Trying to keep a / in the path parameter was always going to be iffy, that's one of the few significant characters in a URI.

For what it's worth, I don't think the query parameter is all bad, and the URL can still look nice, especially because we control source:

/audio/source/wikimedia_commons/creator/?creator=urlencoded-creator or probably better, /audio/source/wikimedia_commons/?creator=urlencoded-creator, and /audio/tag/?tag=urlencoded-tag (unless we can disentangle a request to /audio/?tag=urlencoded-tag from a regular search request, which I think we could do on the basis of tag vs the existing parameter being tags (plural). It's more wordy, and potentially causes some routing complexity (though, frankly, not more than we're already dealing with), but at least it works, which is the critical case that the current approach doesn't fulfil.

In fact, it'll be better, because at least in Firefox, you can type whatever cleartext you want in a query parameter, and the browser will URL encode it at request time on your behalf. We were never going to get that with path parameters.

Unfortunately, because we haven't been careful about adding new features to the API before verifying the work for all our cases, we'll need to break the API version and remove the creator and tag routes.

Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@obulat obulat force-pushed the fix/allow-slashes-in-creator-name branch from 6771f31 to e85b6a8 Compare February 22, 2024 10:41
@obulat
Copy link
Contributor Author

obulat commented Feb 22, 2024

I think it's more that we're fighting the HTTP standard, rather than the browser 🤷 Trying to keep a / in the path parameter was always going to be iffy, that's one of the few significant characters in a URI.

Yes, having a slash in the URL is definitely iffy. But I think I found a solution: if we encode the slash in creator and tag twice, we can make it work without having to fight with the browser's behavior with /. I added the changes in e85b6a8. To make this work, I also added the necessary decoding step to handle this in the API.

I added the test that handles reloading of the client-rendered page to make sure this case is not missed.

You can test the cases of ? and / in the creator and the tag using the following search terms:

@obulat
Copy link
Contributor Author

obulat commented Feb 22, 2024

In fact, it'll be better, because at least in Firefox, you can type whatever cleartext you want in a query parameter, and the browser will URL encode it at request time on your behalf. We were never going to get that with path parameters.

I found a way of allowing for typing creator name as cleartext in the path parameter: in the collection middleware, if we're on the server and there's an non-encoded slash, we redirect to the path with creator encoded. Tag cannot use cleartext, though, because it does not use a catchall page name as the source pages do (it's _tag.vue, so it matches only until the following /, and source is _vue, so it matches everything that comes after).

However, this would not allow for a clear text ? question mark in the source name, because the first question mark will be considered the start of the query params. We could also add server-side handling of the question mark: if there are more than 1 question mark, or if there are no = signs in the "query params", redirect the page to the URL with the encoded creator path (i.e., for "/creator/me??license=by", we would redirect on the server to "/creator/me%3F?license=by".

The easiest solution would be not to add clear text handling to the path, i.e, accept only encoded values.
What do you think about this solution, @sarayourfriend ? Is there anything that would make it not feasible and make us refactor the code to use query params instead of the path params?

Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@sarayourfriend
Copy link
Contributor

sarayourfriend commented Feb 22, 2024

The easiest solution would be not to add clear text handling to the path, i.e, accept only encoded values.

I think the issue with this is that browser (or at least Firefox) automatically converts URL encoded values in the path to clear text, so if you copy/paste that value, it ends up as cleartext. It does this with query parameters as well, but browsers know to URL encode query parameters before sending them (because that's standard), whereas they won't know to do that for the path (as it is unusual).

You can see this in my video above. When I navigate to the creator page, despite the link in the HTML actually being url encoded, Firefox even displays it as cleartext in the link preview in the bottom left corner. After navigating to the page, it is cleartext in the URL bar, and when I copy it, it is copied as cleartext.

I think we are doing something fundamentally incompatible with the way browsers expect paths to work. We need to use an approach that will not cause issues when the browser converts the URL to plain text. The only two options I can think of are: using a normalised version of creator names that encodes / and ? is some other format that the browser doesn't know, but we know, and can use to decode to the full name on the API (which seems... bizarre, frankly, especially if we'd need to communicate that to anyone using those API routes) or use query parameters instead of path parameters, which doesn't "look" as nice, but is compatible with every single possible edge case, and works exactly as expected in every single one of those.

Because the first option there, using a custom encoding, is really just another version of fighting the browser and has implications for our API's usability and documentation, my strong preference is for the second, query params. Unless there is some other way we haven't mentioned here yet of handling them that will not fight the browser's behaviour and is inline with standard behaviours of HTTP spec. The bottom line is that no one is going to think that they need to URL encode a path parameter, as it just is not common (I don't know if I've ever seen it). Keep in mind this is not just a frontend issue, the API has the same problems with ?.

@openverse-bot
Copy link
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@AetherUnbound
@dhruvkb
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 7 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend sarayourfriend removed the request for review from dhruvkb February 23, 2024 00:39
@sarayourfriend sarayourfriend marked this pull request as draft February 23, 2024 00:39
@sarayourfriend
Copy link
Contributor

sarayourfriend commented Feb 23, 2024

I've removed Dhruv as a requested reviewer, I stepped in for him instead.

Also drafted this so that we stop getting review pings while we continue to discuss this and find a solution that will work long term and for all cases.

@obulat
Copy link
Contributor Author

obulat commented Feb 29, 2024

Closing this PR as we've decided to change the approach and use query parameters for the additional search views.

@obulat obulat closed this Feb 29, 2024
@obulat obulat deleted the fix/allow-slashes-in-creator-name branch March 19, 2024 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: frontend Related to the Nuxt frontend
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Unable to visit creator page if creator name has / in it
4 participants