Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes https://github.com/dfabulich/sitemapgen4j/issues/25 #26

Merged
merged 5 commits into from
Jul 20, 2016

Conversation

eximius313
Copy link
Contributor

No description provided.

@eximius313 eximius313 mentioned this pull request Jul 20, 2016
@dfabulich
Copy link
Owner

I don't think this is worth taking a new dependency for this (especially since our commons-lang could conflict with our users' commons-lang). XML escaping is like a one-liner, isn't it?

@eximius313
Copy link
Contributor Author

1)No, this is not a "one liner" - there are lots of cases and exceptions. You don't want to patch your code every time someone uses a new improper character in URL, do you?
2)Aren't this the reasons why we use libraries after all? DRY and don't maintain something that somebody is already doing

@dfabulich
Copy link
Owner

This context, there's no such thing as a "new" improper character. The only improper characters are &, <, and >.

(Arguably the control characters could be escaped, too, but those characters aren't valid in URLs, so we don't have to make this library escape them.)

Yes, libraries can help with reinventing the wheel, but when using wheels, it's important for each axle to use the same size of wheel. This library can't know what version of commons-lang the user will use, so we can't pick any particular version of commons-lang without conflicting with some users or other.

@eximius313
Copy link
Contributor Author

So how hundreds of other libraries pick the right version and use commons-lang?
Which other method do you suggest? naive "replaceAll"?

@marcospereira
Copy link

marcospereira commented Jul 20, 2016

This context, there's no such thing as a "new" improper character. The only improper characters are &, <, and >

According to sitemaps.org docs the following chars must be escaped:

Name Char Escape Code
Ampersand & &amp;
Single Quote ' &apos;
Double Quote " &quot;
Greater Than > &gt;
Less Than < &lt;

Also, keep in mind that URLs must be UTF-8 encoded.

@eximius313
Copy link
Contributor Author

@marcospereira do you recommend "replaceAll" or something else?

@dfabulich
Copy link
Owner

Yes, I was imagining a naive replaceAll (which is why I thought this would be a one-liner).

So how hundreds of other libraries pick the right version and use commons-lang?

Many of those libraries just don't care about conflicting with their users' commons-lang; they just let their users decide how to sort it out.

It's actually not clear to me where in my argument you disagree. I claim:

A) This problem can be solved with a replaceAll one-liner.
B) If this problem can be solved with a replaceAll one-liner, then we shouldn't take a dependency on commons-lang
C) And therefore we shouldn't take a dependency on commons-lang (at least, not just for this).

It seems like you might disagree with either A or B, or you might actually disagree with both.

If you disagree with A, I think I'd need to hear more about why. You gestured toward "new" escapable characters, but it's not clear to me that there are any.

If you disagree with B, that we should take a dependency on commons-lang even granting (for the sake of argument) that replaceAll would work, then I think we might just have different philosophies. I claim that a library should avoid taking a dependency until it is "necessary" to do so. Others think that one should always take dependencies rather than copy/rewrite anything. It's not clear that a person can convince anyone else on philosophical topics like this, and especially not in github threads.

@dfabulich
Copy link
Owner

Also, keep in mind that URLs must be UTF-8 encoded.

To be clear, I don't think that this library should enforce that URLs be URL-encoded (with percent signs). The sitemap.org docs say, "all URLs (including the URL of your Sitemap) must be URL-escaped and encoded for readability by the web server on which they are located" which is to say that users need to provide us with URLs that their webservers can actually read, but if your webserver can handle /ümlat.php then I think this library should let it through. It's not our job to guarantee that the user provides us with URLs that their webserver understands (we don't know anything about their webserver).

This reverts commit 069e5da.
@eximius313
Copy link
Contributor Author

eximius313 commented Jul 20, 2016

how about:

static String escapeXml(String string){
        return string.replaceAll("&", "&amp;")
        .replaceAll("'", "&apos;")
        .replaceAll("\"", "&quot;")
        .replaceAll(">", "&gt;")
        .replaceAll(">", "&gt;")
        .replaceAll("<", "&lt;");
    }

in https://github.com/dfabulich/sitemapgen4j/blob/master/src/main/java/com/redfin/sitemapgenerator/UrlUtils.java
?

Damn, I feel so bad writing such ugly code...

.replaceAll("'", "&apos;")
.replaceAll("\"", "&quot;")
.replaceAll(">", "&gt;")
.replaceAll(">", "&gt;")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this line was duplicated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. removed.

@dfabulich
Copy link
Owner

Damn, I feel so bad writing such ugly code...

It's not that bad. ;-) If you want to improve it, you could use just one Matcher with appendReplacement; that way the string would be scanned only once.

@eximius313
Copy link
Contributor Author

Great!
Could you please release a new version, so @marcospereira could use it in play-sitemap-module?

@dfabulich dfabulich merged commit ea0594a into dfabulich:master Jul 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants