
Endpoint granularisation for signature calculations #53

@inakleinbottle


At the moment, there is some troublesome behaviour arising from the way the granularisation occurs when computing signatures. The algorithm needs to be very carefully thought through, so I've opened this issue to start a discussion about how this should work and how it should be implemented. This should also serve to document the decisions we make about this.

As described in the documentation, RoughPy internally granularises query intervals by moving each endpoint to the included end of the unique granular dyadic interval containing that endpoint. For instance, if the stream has resolution 3, so granular dyadic intervals have length 1/8, and we query the signature over the interval [0.3, 0.6), then the actual query will take place over the granular interval [0.25, 0.5). To see why this happens, we need the following picture (not quite to scale):

                                   0.3                                   0.6
------------------------------------[-------------------------------------)---------------------     original interval
[--------------)[--------------)[--------------)[--------------)[--------------)[--------------)      granular dyadic dissection
0            0.125            0.25           0.375            0.5            0.625            0.75

The lower bound of the query interval (0.3) lies in the granular dyadic interval [0.25, 0.375) and so is rounded to the included end-point 0.25. The upper bound of the query interval (0.6) lies in the granular dyadic interval [0.5, 0.625) and so is rounded to the included end-point 0.5. This rounding means that we can ensure that the signatures over adjacent intervals always concatenate to give the correct signature over the union of those intervals. However, it also comes with some problems.
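This rounding rule amounts to flooring each endpoint to a multiple of the granule width. A minimal pure-Python sketch of the arithmetic (the function name is hypothetical; this is not RoughPy's internal API):

```python
import math

def granularise(a, b, resolution):
    """Round the clopen interval [a, b) to granular dyadic end-points.

    Each end-point is moved to the included (left) end of the unique
    dyadic interval of length 2**-resolution containing it.
    """
    width = 2.0 ** -resolution
    lower = math.floor(a / width) * width
    upper = math.floor(b / width) * width
    return lower, upper

# With resolution 3, granular intervals have length 1/8:
print(granularise(0.3, 0.6, 3))   # (0.25, 0.5)
```

Note that both endpoints are floored, which is exactly what produces the surprising cases discussed below.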

First problem

The granularised interval will (for clopen intervals) have a lower bound that is less than the lower bound of the query interval, unless this lower bound happens to also be a dyadic number of smaller resolution (i.e. the denominator has a smaller power of 2). This is particularly problematic if the stream resolution is small, in which case the endpoint can be shifted dramatically and may then include data that was far outside the original query interval. This actually leads to a rather troubling scenario as pointed out by @lingyiyang. Consider the following situation (behaviour shown is as of version 0.0.8).

import roughpy as rp
ctx = rp.get_context(width=3, depth=2, coeffs=rp.DPReal)
stream = rp.LieIncrementStream.from_increments([[0., 1., 2.], [3., 4., 5.]], indices=[0., 1.], ctx=ctx)
# stream.resolution == 0, so granular dyadic intervals are [0, 1), [1, 2), etc.
print(stream.signature(rp.RealInterval(0, 0.5)))        # case 1
# { 1() }
print(stream.signature(rp.RealInterval(0, 1)))           # case 2
# { 1() 1(2) 2(3) 0.5(2,2) 1(2,3) 1(3,2) 2(3,3) }
print(stream.signature(rp.RealInterval(0.9, 1)))        # case 3
# { 1() 1(2) 2(3) 0.5(2,2) 1(2,3) 1(3,2) 2(3,3) }

Let me explain what is happening here. As above, the query interval has been granularised in each case as follows: case 1, the empty interval [0, 0); case 2, the interval [0, 1) (it is already a dyadic interval of suitable resolution); case 3, the interval [0, 1).
In case 1 both end-points belong to the granular dyadic interval [0, 1), and so both are shifted to 0, leading to an empty query. Case 2 is as expected. Case 3 is probably the most surprising. Here the lower limit lies in [0, 1) while the upper limit lies in [1, 2), so the granularised end-points are taken as 0 and 1, respectively. This is despite the interval [0.9, 1.0) not containing any data itself. In fact, this implementation (as of version 0.0.8) has the opposite of the desired effect: the data of an interval is effectively concentrated at the excluded end, rather than the included end. (In commit f50a8a4 I have done some work to try and address this, but the problem actually remains.)

Fix

The problem stems from the use of the to_dyadic_intervals function, which is typically called with the stream's resolution. I don't believe that this should be the case. Instead, we should resolve to dyadic intervals at a higher resolution, determined by the length of the query interval (trimmed to a minimum value), and then use the following rule to determine the signature over each fine interval:

If $I$ is a dyadic interval with the stream's resolution $m$ and $J$ a dyadic interval of resolution $n > m$, then the signature of the stream over $J$ is equal to the signature over $I$ if $J$ contains the included end of $I$, and is the trivial signature otherwise.
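The rule can be sketched as a toy model, with granular signatures stored against their interval index and each fine dyadic interval picking up a granular signature only if it contains that interval's included end (all names here are hypothetical illustrations of the proposal, not RoughPy code):

```python
def fine_signature(k_fine, n, m, granular_data):
    """Signature over the fine dyadic interval J = [k_fine/2**n, (k_fine+1)/2**n),
    with n > m, under the proposed rule: J carries the signature of the
    granular interval I = [k/2**m, (k+1)/2**m) iff J contains I's included
    end k/2**m.  granular_data maps granular index k to that interval's
    signature; None stands in for the trivial signature.
    """
    j_lower = k_fine / 2 ** n
    j_upper = (k_fine + 1) / 2 ** n
    for k, sig in granular_data.items():
        included_end = k / 2 ** m
        if j_lower <= included_end < j_upper:
            return sig
    return None  # trivial signature

# Stream resolution m = 0: granular intervals [0, 1) and [1, 2).
data = {0: "S_[0,1)", 1: "S_[1,2)"}
# Fine resolution n = 2: [0, 0.25) contains the included end 0.
print(fine_signature(0, 2, 0, data))   # S_[0,1)
# [0.75, 1.0) contains no included end, so the signature is trivial.
print(fine_signature(3, 2, 0, data))   # None
```

In particular, the fine interval [0.75, 1.0), which plays the role of the query [0.9, 1.0) in case 3 above, now yields the trivial signature, as desired.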

This should give the behaviour as described in the documentation, whereby the signature over the granular dyadic intervals is concentrated at the included end, regardless of how the original data was distributed over the interval. This is illustrated as follows:

       x  x  x     x   x   x    x  x x
-------[---------------------------------)---------------------        Original data
           
       L
-------[---------------------------------)---------------------        RoughPy internals

Here the x's represent measurements that appear within the granular interval when we construct the stream, and L represents the aggregated log-signature over the whole interval which is used internally by RoughPy during calculations.

When a new query interval is provided, say $[a, b)$, we can use a resolution $n$ large enough that $2^{-n} < (b - a) / 4$, or perhaps smaller, to granularise the interval. This way, we can be sure that the end-points of the interval are not displaced by more than 1/4 of the length of the interval, which should go a long way towards obscuring the problem above. (Of course, we cannot fully eliminate the undesirable behaviour, because the end-point will usually shift to the left during granularisation.)
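The smallest such $n$ can be computed directly (a hypothetical helper; the minimum-length trim value is an arbitrary placeholder, not a value chosen by RoughPy):

```python
import math

def query_resolution(a, b, min_length=1e-9):
    """Smallest n with 2**-n < (b - a) / 4, after trimming the interval
    length to a minimum value (placeholder choice of minimum)."""
    length = max(b - a, min_length)
    # Strict inequality: take the floor of log2(4 / length) plus one.
    return math.floor(math.log2(4.0 / length)) + 1

# Query [0.9, 1.0): length 0.1, so we need 2**-n < 0.025, giving n = 6.
n = query_resolution(0.9, 1.0)
print(n, 2.0 ** -n)   # 6 0.015625
```

At resolution 6 the lower bound 0.9 would be rounded to 57/64 = 0.890625, a displacement of less than 0.01, rather than all the way to 0 as in case 3 above.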

A second solution, which I think should also be imposed, is a minimum stream resolution based on the length of the effective support, which should ensure that the resolution is generally large enough that the oddities arising from the granularisation are hidden within the noise.

Second problem

The second problem is somewhat related to the first, and concerns restricting a stream to a given interval. Currently, RoughPy has the ability to restrict a stream to a given interval $[a, b)$. After performing such a restriction, the expected behaviour is that, for a query interval $I$, the signature over $I$ is trivial if $I$ does not meet $[a, b)$, and is equal to the signature over $I \cap [a, b)$ otherwise. However, with the way that granularisation occurs, it is possible, via effectively the same mechanism as above, for this contract to be violated. Unfortunately, the fix described above will not suffice here, because it only mitigates the problem and does not solve it entirely.
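The mechanism is visible with the same rounding arithmetic as before (a pure-Python illustration of the endpoint shift, not RoughPy code; the restriction values are made up for the example):

```python
import math

def granular_lower(x, resolution):
    """Left end of the granular dyadic interval containing x."""
    width = 2.0 ** -resolution
    return math.floor(x / width) * width

# Suppose the stream is restricted to [a, b) = [0.5, 2.0) and has
# stream resolution 0, so granular intervals are [0, 1), [1, 2), etc.
a = 0.5
# Query I = [0.9, 1.0): I ∩ [a, b) = [0.9, 1.0), so the contract says
# the result should involve only data in [0.9, 1.0).  But granularisation
# moves the lower bound to 0.0, outside the restriction interval:
print(granular_lower(0.9, 0))   # 0.0, which is < a = 0.5
```

Because the granularised lower bound falls outside $[a, b)$, data excluded by the restriction can leak into the result, regardless of how finely the query interval itself is resolved.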

Discuss!
