
Endpoint granularisation for signature calculations #53

Open
inakleinbottle opened this issue Nov 29, 2023 · 4 comments
@inakleinbottle
Contributor

At the moment, there is some troublesome behaviour arising from the way the granularisation occurs when computing signatures. The algorithm needs to be very carefully thought through, so I've opened this issue to start a discussion about how this should work and how it should be implemented. This should also serve to document the decisions we make about this.

As described in the documentation, RoughPy internally granularises query intervals by moving each endpoint to the included end of the unique granular dyadic interval containing it. For instance, if the stream has resolution 3, so that granular dyadic intervals have length 1/8, and we query the signature over the interval [0.3, 0.6), then the actual query will take place over the granular interval [0.25, 0.5). To see why this happens, we need the following picture (not quite to scale):

                                   0.3                                   0.6
------------------------------------[-------------------------------------)---------------------     original interval
[--------------)[--------------)[--------------)[--------------)[--------------)[--------------)      granular dyadic dissection
0            0.125            0.25           0.375            0.5            0.625            0.75

The lower bound of the query interval (0.3) lies in the granular dyadic interval [0.25, 0.375), so it is rounded to the contained end-point 0.25. The upper bound of the query interval (0.6) lies in the granular dyadic interval [0.5, 0.625), so it is rounded to the contained end-point 0.5. This rounding ensures that the signatures over adjacent intervals always concatenate to give the correct signature over the union of those intervals. However, it also comes with some problems.
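For clopen intervals, this endpoint rounding amounts to flooring each endpoint to the granular dyadic boundary at or below it. A minimal sketch (the function name is mine, not RoughPy's API):

```python
import math

def granularise_endpoint(x, resolution):
    # Move x to the included (left) end of the granular dyadic
    # interval [k / 2**r, (k + 1) / 2**r) that contains it.
    scale = 2 ** resolution
    return math.floor(x * scale) / scale

# With resolution 3, granular intervals have length 1/8:
print(granularise_endpoint(0.3, 3))  # 0.25
print(granularise_endpoint(0.6, 3))  # 0.5
```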

First problem

The granularised interval will (for clopen intervals) have a lower bound that is less than the lower bound of the query interval, unless that lower bound happens to already be a dyadic number of smaller resolution (i.e. the denominator is a smaller power of 2). This is particularly problematic if the stream resolution is small, in which case the endpoint can be shifted dramatically and may then include data that was far outside the original query interval. This leads to a rather troubling scenario, as pointed out by @lingyiyang. Consider the following situation (behaviour shown is as of version 0.0.8).

import roughpy as rp
ctx = rp.get_context(width=3, depth=2, coeffs=rp.DPReal)
stream = rp.LieIncrementStream.from_increments([[0., 1., 2.], [3., 4., 5.]], indices=[0., 1.], ctx=ctx)
# stream.resolution == 0, so granular dyadics are [0, 1), [1, 2), etc.
print(stream.signature(rp.RealInterval(0, 0.5)))        # case 1
# { 1() }
print(stream.signature(rp.RealInterval(0, 1)))           # case 2
# { 1() 1(2) 2(3) 0.5(2,2) 1(2,3) 1(3,2) 2(3,3) }
print(stream.signature(rp.RealInterval(0.9, 1)))        # case 3
# { 1() 1(2) 2(3) 0.5(2,2) 1(2,3) 1(3,2) 2(3,3) }

Let me explain what is happening here. As above, the query interval has been granularised in each case as follows: in case 1, to the empty interval [0, 0); in case 2, to the interval [0, 1) (it is already a dyadic interval of suitable resolution); in case 3, to the interval [0, 1).
In case 1, both end points belong to the granular dyadic interval [0, 1), so both are shifted to 0, leading to an empty query. Case 2 is as expected. Case 3 is probably the most surprising. Here the lower limit lies in [0, 1) while the upper limit lies in [1, 2), so the granularised end points are taken as 0 and 1, respectively. This is despite the interval [0.9, 1.0) not containing any data itself. In fact, this implementation (as of version 0.0.8) has the opposite of the desired effect: the data of an interval is effectively concentrated at the excluded end, rather than the included end. (In commit f50a8a4 I have done some work to try to address this, but the problem actually remains.)

Fix

The problem stems from the use of the to_dyadic_intervals function, which is typically called with the stream's resolution. I don't believe that this should be the case. Instead, we should resolve to dyadic intervals at a higher resolution, determined by the length of the interval (trimmed to a minimum value), and then use the following rule to determine the value:

If $I$ is a dyadic interval with the stream's resolution $m$ and $J$ is a dyadic interval of resolution $n > m$, then the signature of the stream over $J$ is equal to the signature over $I$ if $J$ contains the included end of $I$, and is the trivial signature otherwise.
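A minimal sketch of this rule (helper names are mine, not RoughPy's API), with the stream's granular signatures stored in a dict keyed by the interval index at resolution $m$:

```python
def signature_over_fine_interval(j, n, m, granular_sigs, trivial):
    # J = [j / 2**n, (j + 1) / 2**n) at resolution n > m.  The granular
    # interval containing J is I = [i / 2**m, (i + 1) / 2**m) with
    # i = j >> (n - m).  J contains I's included end exactly when J is
    # the leftmost of the 2**(n - m) subintervals of I.
    shift = n - m
    i = j >> shift
    if j % (1 << shift) == 0:
        return granular_sigs.get(i, trivial)
    return trivial
```

For example, at stream resolution m = 0 and query resolution n = 3, only the subinterval [0, 1/8) picks up the signature over [0, 1); the other seven subintervals return the trivial signature.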

This should give the behaviour as described in the documentation, whereby the signature over the granular dyadic intervals is concentrated at the included end, regardless of how the original data was distributed over the interval. This is illustrated as follows:

       x  x  x     x   x   x    x  x x
-------[---------------------------------)---------------------        Original data
           
       L
-------[---------------------------------)---------------------        RoughPy internals

Here the x's represent measurements that appear within the granular interval when we construct the stream, and L represents the aggregated log-signature over the whole interval which is used internally by RoughPy during calculations.

When a new query interval is provided, say $[a, b)$, we can use a resolution $n$ large enough that $2^{-n} < (b - a) / 4$, or perhaps smaller, to granularise the interval. This way, we can be sure that the end-points of the interval are not displaced by more than 1/4 of the length of the interval, which should go a long way towards obscuring the problem above. (Of course, we cannot fully eliminate the undesirable behaviour, because the end point will usually shift to the left during granularisation.)
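Concretely, the smallest such $n$ follows from the (clamped) interval length. This is a sketch under the stated rule; the minimum-length clamp value is an arbitrary choice of mine:

```python
import math

def query_resolution(a, b, min_length=2.0 ** -20):
    # Smallest n with 2**-n < (b - a) / 4, so neither endpoint is
    # displaced by more than a quarter of the interval's length.
    length = max(b - a, min_length)
    return math.floor(math.log2(4.0 / length)) + 1

print(query_resolution(0.3, 0.6))  # 4: 2**-4 = 0.0625 < 0.3 / 4 = 0.075
```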

A second solution, which I think should also be imposed, is a minimum stream resolution based on the length of the effective support, which should ensure that the resolution is generally large enough that the oddities arising from the granularisation are hidden within the noise.

Second problem

The second problem is somewhat related to the first, and concerns restricting a stream to a given interval. Currently, RoughPy has the ability to restrict a stream to a given interval $[a, b)$. After performing such an action, the expected behaviour is that, for a query interval $I$, the signature over $I$ is trivial if $I$ does not meet $[a, b)$, and is equal to the signature over $I \cap [a, b)$ otherwise. However, with the way that granularisation occurs, it is possible, by effectively the same mechanism as above, for this contract to be violated. Unfortunately, the fix described above will not suffice here, because it only mitigates the problem and does not solve it entirely.

Discuss!

@terrylyons

I agree with the first fix.

Resolution implies that all incremental data in a dyadic interval of that resolution is compressed to the included end of that interval, while correctly preserving order. We should call this the granulated path.

The signature of the stream over any interval is the signature of the granulated path over that interval.

That is consistent with your proposal. It also guarantees that signatures over long intervals are always exactly multiplicative across concatenations.

We do not round intervals!! We granulate data arrival times. Note that, once done, one can coarsen granularity unambiguously, but not conversely. Equally, one cannot perfectly merge data, as the Lie increments do not commute. There should be left and right merges.
I think that is clear.

I can see that clopen and opencl need to coexist. To switch between them, one needs to know the jump at the included end. Keep that separate.

@terrylyons

These decisions have consequences for the way an interval is broken up into dyadic intervals. The dyadic interval nearest the contained point should be strictly included in the interval. If it doesn't contain the included point, then it must be less than the resolution away from it, so that the next dyadic interval of resolution length has all its information at its included end, which lies outside the interval; this justifies that it does not contribute to the signature of the interval and should not be included.

I believe this is opposite to what we do at the moment!!

@inakleinbottle
Contributor Author

So there are actually two conflated concepts here: granularising the data for the stream during construction, and granularising query intervals when computing a signature or log signature.

Granularising the stream data

The first concept to tackle is the way that we granularise the data when constructing the stream object itself, which is done to a specified (or heuristically determined) resolution. This is where the diagram I posted in the initial post comes in:

       x  x  x     x   x   x    x  x x
-------[---------------------------------)---------------------        Original data
           
       L
-------[---------------------------------)---------------------        RoughPy internals

Here we collect all the data that occurs in the granular dyadic interval and aggregate it as a single Lie object that is concentrated (for the purposes of calculations) at the included end of the interval. At the moment, all intervals are clopen, but in the future we might want to allow both left-aggregated and right-aggregated values to accommodate opencl intervals too.
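A sketch of this bucketing (order-preserving only; a real implementation would combine each bucket's increments into a single Lie element, e.g. via the Campbell-Baker-Hausdorff formula, rather than keep a list):

```python
import math
from collections import defaultdict

def granularise_data(increments, times, resolution):
    # Group each increment into the granular dyadic interval containing
    # its parameter value; each bucket is conceptually one aggregated Lie
    # object sitting at the interval's included (left) end k / 2**r.
    scale = 2 ** resolution
    buckets = defaultdict(list)
    for inc, t in zip(increments, times):
        buckets[math.floor(t * scale)].append(inc)
    return {k / scale: v for k, v in sorted(buckets.items())}

print(granularise_data([[0., 1., 2.], [3., 4., 5.]], [0., 1.], 0))
# {0.0: [[0.0, 1.0, 2.0]], 1.0: [[3.0, 4.0, 5.0]]}
```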

At the moment, the resolution is set to zero unless the user specifies a value, which is clearly not a good solution. As of commit f50a8a4, we've switched to using a heuristic based on the parameter difference between adjacent data points, which is probably a better approach. We do need to be careful here, because the resolution that we use has consequences for the second part below.

Granularising query intervals

The second concept is the way that we dissect a query interval into dyadic intervals on which we should actually do computations.
Currently this amounts to granularising the two end points to the same resolution as the stream itself, but this is clearly unsatisfactory. The solution I proposed for this is to dissect the query interval to a much greater accuracy (I use "accuracy" now, rather than "resolution", to logically separate the two concepts).

The critical invariant that we must respect is the ability to partition an interval into two parts and have the dissections "match up" correctly, so that the concatenation of signatures computed over two adjacent intervals equals the signature computed over the union of the two intervals.

@terrylyons

Granularising intervals

I think this is probably preferable to granularising data. It is actually very easy for dyadic intervals. If an interval is dyadic and of finer resolution than the desired granularisation, it gets promoted to the dyadic interval of the granularisation's resolution if it shares its included end; otherwise it gets promoted to the empty interval (running from and to the non-included end of the granular dyadic interval). This works well with general intervals as well.

Fixing a granularisation: any interval $I$ is replaced by the union of the granular resolution intervals whose included end is in the interval $I$ (typically aggregating adjacent intervals to make coarser dyadic intervals where possible).

In particular, for clopen intervals, the granularisation runs from the first included resolution-level dyadic end point until, but not including, the included end of the first resolution-level dyadic interval to the right of the interval.

I think it is simpler to granularise intervals than data. However, real intervals are better granularised to a list of adjacent dyadic intervals of maximal coarseness compatible with the granularisation resolution. The end values can be converted back to a real interval where this is desired. Maybe such lists should be objects. Concatenation should be defined, and so should splitting at a real number.
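For clopen intervals, the rule above can be sketched as follows (hypothetical helper, not RoughPy's API): the granularised interval runs from the first included resolution-level end point in [a, b) up to, but not including, the included end of the first granular interval to the right.

```python
import math

def granularise_clopen(a, b, resolution):
    # Replace [a, b) by the union of the resolution-level dyadic
    # intervals [k / 2**r, (k + 1) / 2**r) whose included end lies
    # in [a, b); returns None when no included end does.
    scale = 2 ** resolution
    k_lo = math.ceil(a * scale)   # first included end >= a
    k_hi = math.ceil(b * scale)   # first included end >= b (excluded)
    if k_lo >= k_hi:
        return None               # empty: no granular data in [a, b)
    return (k_lo / scale, k_hi / scale)

print(granularise_clopen(0.3, 0.6, 3))  # (0.375, 0.625)
print(granularise_clopen(0.9, 1.0, 0))  # None
```

Note that under this rule, the troublesome query over [0.9, 1) from the first post correctly becomes the empty query, since [0.9, 1) contains no resolution-0 included end point.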
