Uniform type of data for cpg.response.body #59

Closed
bb-migration opened this Issue Dec 15, 2004 · 3 comments


bb-migration commented Dec 15, 2004

Originally reported by: Anonymous


Today, CP2 apps usually return strings. Returning a generator is possible, but only if the generator filter is enabled. The problem is that filters can't know beforehand whether the body contains an iterable (a generator, for example) or a single string. If they iterate over a string (which is possible), they end up iterating over its characters, which is unacceptably slow.

The proposed solution is to make cpg.response.body always contain a uniform type of data. The obvious choice is to wrap any kind of returned data inside an iterable. Single strings will be wrapped inside a list with only one element - ["..."]. This will allow filters, and also the CP2 core functions, to always treat cpg.response.body as an iterable, without the need to test for its type (as in: if isinstance(cpg.response.body, GeneratorType): ... else: ...). The check will be contained at a single point, when the called object returns the response body.
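
A minimal sketch of that single check, written in modern Python rather than the Python 2 of CP2. The helper name `normalize_body` is hypothetical and is not part of CherryPy; it only illustrates the wrapping rule described above.

```python
def normalize_body(body):
    """Return `body` as an iterable of chunks (hypothetical helper)."""
    if isinstance(body, (str, bytes)):
        # A single string is wrapped in a one-element list, so that
        # callers never iterate over it character by character.
        return [body]
    # Lists, generators and other iterables pass through untouched.
    return body
```

With this in place, a filter can always write `for chunk in body:` without testing the type first.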

This patch also needs to check the standard filters, to make sure that they always use the iterable. It's not clear whether these changes will have other side effects on generator handling; it's possible that this implementation will make the generator filter unnecessary, but that's left for the implementation to address and test.

Reported by cribeiro


bb-migration Dec 15, 2004

Original comment by Anonymous:


You also need to update the docs to reflect the fact that cpg.response.body is now a list or a generator, not a string (and a few recipes might have to be updated too). Just do a search for "cpg.response.body" on the site ...

bb-migration Dec 23, 2004

Original comment by Anonymous:


I've created a branch called Ticket-59 for this one. The changeset is simple but touches many files. There's still one open issue with GZIP encoding. The test for GZIP needs to be improved too.

bb-migration Aug 27, 2006

Original comment by Anonymous:


''[fumanchu: Moved this here from the Wiki]''

= Notes on Ticket #59 =

Ticket #59 was proposed as a way to better integrate generators in the core CP implementation. Before it, the core simply used a string to contain the body of the response. Filters also relied on cpg.response.body containing a string. The problem with this approach is twofold:

  1. Processing of generator-based output wasn't integrated into the core; it depended on a filter. While easy to set up, it was one more step that could potentially discourage the use of generators.
  2. The potential gains of generators (memory savings and better responsiveness for long pages) would never be realized, as the content would need to be converted to a string as soon as the GeneratorFilter was processed.

Ticket #59 was filed as a proposal to solve these issues by using a uniform representation for cpg.response.body inside the CP core throughout the whole processing chain. The new representation assumes that cpg.response.body '''always contains an iterable'''; it can be a list or a generator.

== Design issues ==

==== Content-length ====

The main advantage of generators for long output strings is negated by the need to calculate the content-length before sending data. The HTTP/1.0 spec expects the content-length unless the server signals the end of the body by closing the connection. HTTP/1.1 makes it optional (chunked transfer encoding is the alternative), but recommended. The reasoning is that it allows the client application to distinguish the real end of the response from content that merely looks like an end-of-file. For non-binary (read: text, including HTML) output, its use is not really necessary. And even for binary streams, I'm pretty sure that any modern HTTP client can handle EOFs in the stream ('''but I haven't tested this assertion''').
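
The trade-off can be sketched in a few lines: computing the length forces the whole iterable to be consumed and held in memory first. The helper name `collect_body` is hypothetical, and the chunks are assumed to be already-encoded byte strings.

```python
def collect_body(chunks):
    """Consume an iterable of byte chunks and return (chunk_list, content_length).

    Hypothetical helper: once this has run, the generator's memory savings
    are gone, because every chunk is held in the list.
    """
    collected = list(chunks)
    content_length = sum(len(chunk) for chunk in collected)
    return collected, content_length


body, length = collect_body(iter([b"<html>", b"hello", b"</html>"]))
```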

==== Removed hooks ====

During the coding, it turned out that some hooks were never used by any existing filter; worse, they posed a problem because of their position in the code. The span of code between the beforeResponse and the afterResponseHeader & beforeResponseBody calls contains the code that calculates the content-length; this means that at this point the content body has to be fully processed, and generators would need to be "collected" there. This made the two final hooks (beforeResponse and the afterResponseHeader) useless at best, and dangerous at worst, because they could modify the content '''after''' the header was sent.

As a result, a decision was made to remove these hooks. Instead, a new hook called afterResponse is being implemented. It's mostly for cleanup & logging purposes. It can be used, for instance, to delete connections or other response-specific data structures. It can't change anything in the cpg.response structure, as all data has already been sent at this point.

==== Filters returning generators ====

One side effect of the internal use of generators is that filters that process the body should also return a generator themselves. The modifications are trivial in most cases:

class EncodingFilter(BaseOutputFilter):
    ...
    def beforeResponse(self):
        if 1: #isinstance(cpg.response.body, unicode):
            # Add "charset=..." to response Content-Type header
            contentType = cpg.response.headerMap.get("Content-Type")
            if contentType and 'charset' not in contentType:
                cpg.response.headerMap["Content-Type"] += ";charset=%s" % self.encoding
            # Return a generator that encodes the sequence
            cpg.response.body = self.encode_body(cpg.response.body)

    def encode_body(self, body):
        # Generator: encode each chunk lazily, as it is consumed
        for line in body:
            yield line.encode(self.encoding)

The encode_body function shown above is a generator that encodes the cpg.response.body line by line (or chunk by chunk).
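
The same pattern can be tried outside CherryPy. The following is a standalone sketch in modern Python; the `page` generator and the module-level `encode_body` function are illustrative stand-ins, not CP2 code. It shows that nothing is encoded until a consumer asks for the next chunk.

```python
def encode_body(body, encoding="utf-8"):
    """Lazily encode each text chunk of `body` (illustrative stand-in)."""
    for chunk in body:
        yield chunk.encode(encoding)


def page():
    """A hypothetical page handler that yields its output piece by piece."""
    yield "<html>"
    yield "caf\u00e9"
    yield "</html>"


encoded = encode_body(page())
first = next(encoded)   # only the first chunk has been produced and encoded
rest = list(encoded)    # the remaining chunks are pulled on demand
```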

==== Caveat: a string is an iterable ====

Because a string is also an iterable, there is potential for hidden bugs: code that iterates over a bare string still works, only slowly (character by character), so the mistake may be difficult to catch. One possibility is to issue a warning if this case ever happens.
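
One way such a warning could look, as a sketch only: the function name `check_body` and the choice of RuntimeWarning are assumptions, not something the branch is known to implement.

```python
import warnings


def check_body(body):
    """Warn about a bare string body and wrap it in a list (hypothetical helper)."""
    if isinstance(body, (str, bytes)):
        warnings.warn(
            "response body is a bare string; wrapping it in a list",
            RuntimeWarning,
            stacklevel=2,
        )
        return [body]
    return body
```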

==== Gzip ====

The original GzipFilter used the gzip library to compress the entire document body at once. The new version uses zlib, a low-level compression library that is used by the gzip module itself. zlib can compress data chunk by chunk, which makes it a reasonable candidate for a generator. For a long stream of data it may save a ''lot of memory''.

The gzip format is specified in RFC 1952 (http://www.faqs.org/rfcs/rfc1952.html). The header, compressed data and trailer are laid out as follows (heavily edited from the original source):

    +---+---+---+---+---+---+---+---+---+---+    
    |ID1|ID2|CM |FLG|     MTIME     |XFL|OS |
    +---+---+---+---+---+---+---+---+---+---+

    +=======================+
    |...compressed blocks...| (more-->)
    +=======================+

      0   1   2   3   4   5   6   7
    +---+---+---+---+---+---+---+---+
    |     CRC32     |     ISIZE     |
    +---+---+---+---+---+---+---+---+

As none of the flags are supposed to be set, the optional members that could follow the header are omitted from this presentation. The necessary fields are:

  • '''ID1 (IDentification 1) & ID2 (IDentification 2)'''. These have the fixed values '''ID1 = 31''' (0x1f, \037), '''ID2 = 139''' (0x8b, \213), to identify the file as being in gzip format.
  • '''CM (Compression Method)'''. CM = 8 is the standard "deflate" compression method.
  • '''FLG (FLaGs)'''. Zero, as no flags are set for this application.
  • '''MTIME (Modification TIME)'''. The modification time in Unix format; it can safely be set to zero (meaning no timestamp is available) when compressing a stream.
  • '''XFL (eXtra FLags)'''. The "deflate" method (CM = 8) sets these flags as follows:
    • XFL = 2 - compressor used maximum compression, slowest algorithm
    • XFL = 4 - compressor used fastest algorithm
  • '''OS (Operating System)'''. The actual value should not matter here, because all files are being treated as binary.
  • '''XLEN (eXtra LENgth)'''. Only present when the FEXTRA flag is set in FLG; since FLG is zero here, it is omitted.

In the trailer part of the stream, the following data should be provided:

  • CRC32 (CRC-32). The zlib library provides a suitable method to calculate it.
  • ISIZE (Input SIZE). This contains the size of the original (uncompressed) input data modulo 2^32. The size therefore has to be accumulated as the compression proceeds.
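
The header, chunked compression and trailer described above can be put together as a generator. This is a sketch in modern Python, not the code from the Ticket-59 branch; the function name `gzip_stream`, the default compression level, XFL = 0 and OS = 255 ("unknown" in RFC 1952) are choices made for the example.

```python
import gzip
import struct
import zlib


def gzip_stream(chunks, level=6):
    """Yield a gzip-framed stream for an iterable of byte chunks (sketch)."""
    # 10-byte header: ID1, ID2, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255
    yield struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, 0, 0, 0, 255)
    crc = 0
    size = 0
    # Negative wbits asks zlib for a raw deflate stream, without the zlib
    # header and checksum, since the gzip framing is written by hand here.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # running CRC-32 of the uncompressed data
        size += len(chunk)             # running uncompressed size
        data = compressor.compress(chunk)
        if data:                       # zlib may buffer and return nothing yet
            yield data
    yield compressor.flush()
    # Trailer: CRC32 and ISIZE (size modulo 2**32), both little-endian
    yield struct.pack("<LL", crc & 0xFFFFFFFF, size & 0xFFFFFFFF)


# Round-trip check using the standard gzip module as the decoder
payload = b"".join(gzip_stream([b"hello ", b"world"]))
restored = gzip.decompress(payload)
```

Only the running CRC, the running size and zlib's internal buffer are kept between chunks, which is where the memory saving comes from.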

===== Assorted remarks =====

The changes in the gzip filter required some research. While reading, some notes were collected that may prove of interest for CherryPy development:

[1] From http://www.15seconds.com/issue/020314.htm:

"Both Internet Explorer 5.5 and Internet Explorer 6.0 have a bug with decompression that affects some users. This bug is documented in: the Microsoft knowledge Base articles, Q312496 is for IE 6.0 … , the Q313712 is for IE 5.5. Basically Internet Explorer doesn't decompress the response before it sends it to plug-ins like Adobe Photoshop."

jaraco added a commit that referenced this issue Apr 30, 2016

Merged in bazsi/cherrypy/cherrypy-3.2.x (pull request #59)
HandlerWrapperTool: handle config arguments properly

--HG--
branch : cherrypy-3.2.x