Add charset=... for all text/* documents #42

Closed
daapp opened this Issue Sep 10, 2012 · 10 comments

daapp commented Sep 10, 2012

I found a bug in static file management.

If I start hunchentoot with:
(setf acceptor (make-instance 'hunchentoot:easy-acceptor
                              :port port
                              :document-root (truename "./static/")))
(start acceptor)

then hunchentoot returns all .html files from ./static/ with Content-Type: text/html; charset=utf-8,
but other text/* files, like .js, are returned without charset=..., and there is no way to change this.
If such a .js file contains UTF-8 text, the browser displays it incorrectly.

Hans, can you fix this, or maybe advise how to do it, and I will try myself?

Owner

hanshuebner commented Sep 10, 2012

Alexander, I'm going to look into this tomorrow morning and try to come up
with either a fix or an idea for a fix.

-Hans

Owner

hanshuebner commented Sep 10, 2012

Hi Alexander,

I cannot find that bug, and I think your analysis is wrong. Hunchentoot
never adds a charset= specification to static files that it serves, neither
to HTML files nor to JavaScript files. As an immediate workaround, I can
offer this as a hacky solution:

;; Override the MIME types reported for .html and .js files so they carry a charset.
(setf (gethash "html" hunchentoot::*mime-type-hash*) "text/html; charset=utf-8"
      (gethash "js" hunchentoot::*mime-type-hash*) "text/javascript; charset=utf-8")

I certainly do not consider this to be a good, long-term solution. One basic
issue with a long-term solution is that external formats in Common Lisp are
not portable across implementations, i.e. there is no standard way to
determine an HTTP-compatible encoding name for a given external format as
returned by CL:STREAM-EXTERNAL-FORMAT. Furthermore, browsers might
actually send an Accept-Charset header to indicate what encodings would be
accepted, and the server would need to arrange for the file to be properly
converted if it was encoded in an unacceptable charset. I don't currently
have a use case for a standards-conformant implementation of charset
support, so I'll not work on this soon.

A pragmatic approach is to add a special variable that indicates the
character set to report for static files being served, and then be somewhat
smart about how to initialize that variable in an implementation specific
fashion.

This special variable, HUNCHENTOOT:STATIC-TEXT-FILE-CHARSET, should then
be used to add a charset= field in the HANDLE-STATIC-FILE function (around
the call to MIME-TYPE).

If you want to make this change, please remember to add docstrings to
functions that you add. Comments also do not hurt.

-Hans

Owner

hanshuebner commented Sep 11, 2012

Actually, it would be better to add the charset= specification in the MIME-TYPE function, not in HANDLE-STATIC-FILE.
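
The idea discussed above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not Hunchentoot's actual code: the variable *STATIC-TEXT-FILE-CHARSET* follows the name proposed in this thread, while TEXT-MIME-TYPE-P and MAYBE-ADD-CHARSET are hypothetical helpers introduced only for this example. It assumes the charset is appended to whatever string a MIME-TYPE-style lookup returns.

```lisp
;; Hypothetical sketch: none of these definitions are part of Hunchentoot.
(defvar *static-text-file-charset* "utf-8"
  "Charset to report for static text/* files, or NIL to report none.")

(defun text-mime-type-p (mime-type)
  "Return true if MIME-TYPE, a string, names a text/* media type."
  (let ((prefix "text/"))
    (and (>= (length mime-type) (length prefix))
         ;; Case-insensitive comparison against the first five characters.
         (string-equal prefix mime-type :end2 (length prefix)))))

(defun maybe-add-charset (mime-type
                          &optional (charset *static-text-file-charset*))
  "Append a charset parameter to MIME-TYPE if it is a text/* type,
CHARSET is non-NIL and MIME-TYPE carries no charset parameter yet.
Otherwise return MIME-TYPE unchanged."
  (if (and charset
           mime-type
           (text-mime-type-p mime-type)
           (not (search "charset=" mime-type :test #'char-equal)))
      (format nil "~A; charset=~A" mime-type charset)
      mime-type))
```

Under these assumptions, (maybe-add-charset "text/css") yields "text/css; charset=utf-8", while a non-text type such as "application/json" is returned unchanged. Binding *STATIC-TEXT-FILE-CHARSET* to NIL turns the behaviour off, which keeps existing setups working.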

daapp commented Sep 11, 2012

On 11.09.2012 03:56, Hans Hübner wrote:

> Hi Alexander,
>
> I cannot find that bug, and I think your analysis is wrong. Hunchentoot
> never adds a charset= specification to static files that it serves, neither
> to HTML files nor to JavaScript files.

Hi Hans, it seems you are right, HT never adds a charset for static files.

> As an immediate workaround, I can
> offer this as a hacky solution:
>
> (setf (gethash "html" hunchentoot::*mime-type-hash*) "text/html; charset=utf-8"
>       (gethash "js" hunchentoot::*mime-type-hash*) "text/javascript; charset=utf-8")
>
> I certainly do not consider this to be a good, long-term solution.

Well, I can't say the solution is good, but it allows me to fix the problem quickly. Thanks for the advice.

> One basic
> issue with a long-term solution is that external formats in Common Lisp are
> not portable across implementations, i.e. there is no standard way to
> determine an HTTP-compatible encoding name for a given external format as
> returned by CL:STREAM-EXTERNAL-FORMAT. Furthermore, browsers might
> actually send an Accept-Charset header to indicate what encodings would be
> accepted, and the server would need to arrange for the file to be properly
> converted if it was encoded in an unacceptable charset. I don't currently
> have a use case for a standards-conformant implementation of charset
> support, so I'll not work on this soon.

The charset problem has a long history on the web. Many years ago the Russian Apache project
developed a module for the popular Apache web server that tried to detect the client encoding correctly and
convert the data. But as Unicode/UTF-8 grew in popularity, that project slowly died.
I think the approach to the charset problem in Hunchentoot should be simple:

  • all static text files should have one encoding (utf-8 in my case).
  • until HT has a module for client charset detection and text file conversion
    (do we really need this?), the server should return "Content-Type: text/..." for all text data
    (files or the result of dynamic data generation)
  • the server should add charset=HUNCHENTOOT:DEFAULT-CHARSET to every
    "Content-Type: text/..." if default-charset is not null

This approach covers both static files and handler output, because writing
(setf (content-type*) "text/html; charset=utf-8") in each define-easy-handler is annoying.

Such an approach doesn't break current client code (you can set default-charset to nil).

What do you think about this?

Owner

hanshuebner commented Sep 11, 2012

Hi Alexander,

your proposal sounds good. The automatic addition of the charset parameter
for non-file handlers is already in place, though
(MAYBE-ADD-CHARSET-TO-CONTENT-TYPE-HEADER in START-OUTPUT). The change
that I think is required is the addition of the
HUNCHENTOOT:STATIC-TEXT-FILE-CHARSET variable (please do not call it
DEFAULT-CHARSET, as for dynamic handlers, the REPLY-EXTERNAL-FORMAT is
consulted to determine the charset).

Let me know if there is something I have overlooked.

-Hans

daapp commented Sep 11, 2012

Hi Hans,
the idea is to call the variable HUNCHENTOOT:TEXT-CHARSET, so that it is used for
both static and dynamic content by default.

Owner

hanshuebner commented Sep 11, 2012

I've understood that proposal, but I do not want to break backwards
compatibility without a good reason. Is there anything that you think the
current mechanism for dynamic handlers does not do right?

-Hans

daapp commented Sep 11, 2012

Right or wrong is a difficult question. I just don't want to specify the charset for each easy-handler,
so if you know another way, let me know; if there is no other way, let's change the behaviour for static files only.

Owner

hanshuebner commented Sep 11, 2012

You don't need to specify the character set for easy-handlers. If your
handler returns a string and sets a text content type, the character set
should automatically be added. At least that is how it is meant to work.
Does it not?

-Hans

daapp commented Sep 11, 2012

You are right. I was checking JSON handlers, which is why I found no charset in the Content-Type. Sorry.

daapp closed this Sep 18, 2012
