Series and Frames for real-time streaming data #51

Closed
buybackoff opened this Issue Oct 27, 2013 · 6 comments

4 participants
@buybackoff
Contributor

buybackoff commented Oct 27, 2013

What would be the right way to use Series in a real-time environment where new data arrive asynchronously?

I found a Stack Overflow question (and probably part of an answer) that describes exactly this idea: http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

The answers on SO suggest using the FSharpx.Collections.Vector&lt;T&gt; data structure instead of arrays. Another SO answer (http://stackoverflow.com/a/19520214/801189) by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is on working with fixed, existing data series and frames, a workflow much like R's. But if the data length is fixed, then performance is less important than in a real-time environment.
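To make the array-vs-persistent-vector trade-off concrete, here is a back-of-envelope cost model in Python (Python only for brevity; this is a sketch under simplifying assumptions, not a benchmark, though the 32-way branching factor is the one FSharpx.Collections.Vector actually uses):

```python
import math

def copies_immutable_array(n):
    # Appending to an immutable array of size k copies all k existing
    # elements, so building n points costs 0 + 1 + ... + (n - 1) copies.
    return n * (n - 1) // 2                      # O(n^2) total

def copies_persistent_vector(n, branching=32):
    # A 32-way trie (the layout FSharpx.Collections.Vector uses) rewrites
    # only the root-to-leaf path: ~log32(n) nodes of 32 slots per append.
    per_append = branching * max(1, math.ceil(math.log(n, branching)))
    return n * per_append                        # O(n log n) total

n = 5_000_000  # one instrument-day, as in the SO question
print(f"array:  {copies_immutable_array(n):.2e}")    # ~1.25e13 element copies
print(f"vector: {copies_persistent_vector(n):.2e}")  # ~8.00e8 element copies
```

Under this model the persistent vector does roughly four orders of magnitude less copying over an instrument-day, which is why the SO answers point at it for append-heavy workloads.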

For streaming data we need to append new value(s) to an existing series and use the resulting series. With the current array implementation, that requires copying the whole old array into a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus DateTime's 8 bytes), or around 80 MB per instrument. With, say, 100 instruments, copying all those arrays many times per second is probably not the best option.
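The 80 MB figure follows directly from the assumed sizes; a quick sanity check in Python (the per-point sizes are the assumptions stated above, not measured .NET object sizes):

```python
bytes_per_point = 8 + 8          # 8-byte double price + 8-byte DateTime
points_per_day = 5_000_000       # per instrument, from the SO question
per_instrument = bytes_per_point * points_per_day
print(per_instrument)            # 80000000 bytes ≈ 80 MB per instrument
print(per_instrument * 100)      # ≈ 8 GB live data across 100 instruments
```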

Simplest use case
For a stock price A sampled at 1-second intervals we calculate a 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.

  1. For a new price point we create a new series object by appending the new value to the old object (with a very large data set, copying the array is slow).
  2. Then we take the last 60 values from the new series object and calculate the new MA value (the crucial point is to avoid recalculating all MA values, and instead use only the last 60-point window of A).
  3. We append the new MA value to MA_A_60.
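Step 2 is the part that can be made cheap: a fixed-window mean needs only a running sum, not a re-scan of the window on every tick. A minimal sketch in Python (Python for brevity; `IncrementalMA` is a hypothetical name for illustration, not a Deedle API):

```python
from collections import deque

class IncrementalMA:
    """O(1)-per-tick moving average over a fixed window.

    Keeps a running sum and subtracts the value that falls out of the
    window, instead of re-averaging the last `window` points each time.
    """
    def __init__(self, window=60):
        self.window = window
        self.buf = deque(maxlen=window)   # deque evicts oldest when full
        self.total = 0.0

    def update(self, price):
        if len(self.buf) == self.window:
            self.total -= self.buf[0]     # value about to be evicted
        self.buf.append(price)
        self.total += price
        return self.total / len(self.buf)

ma = IncrementalMA(window=3)
print([ma.update(p) for p in [1.0, 2.0, 3.0, 4.0]])
# → [1.0, 1.5, 2.0, 3.0]
```

The same running-sum trick applies per instrument and per derived series, so the per-tick work stays constant regardless of history length.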

Will the current implementation be suitable for such a workflow with hundreds of instruments, multiple calculated values for each one, and sub-second frequency?

Would an implementation of Deedle's IVector based on FSharpx.Collections.Vector be more suitable for such a use case? (I know one should run tests in a similar situation, but there is no second implementation to compare against.)

I would love to have Deedle's abstraction and API for such a use case!

P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A, and add them to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA via the transformation function f. In turn, there could be some seriesC = f2 seriesB, and so on. I am not sure this functionality should live inside the library, but that is what I hope to achieve.
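The projection object described in the P.S. can be sketched as a push-based chain of series, where each append propagates only the new point downstream. A minimal illustration in Python (`LinkedSeries` and its methods are entirely hypothetical, not part of Deedle):

```python
class LinkedSeries:
    """Sketch of the projection idea: child = parent.map(f) stays in
    sync with the parent, applying f only to incremental data."""

    def __init__(self):
        self.points = []        # (key, value) observations, in order
        self.subscribers = []   # downstream (f, LinkedSeries) pairs

    def map(self, f):
        child = LinkedSeries()
        self.subscribers.append((f, child))
        # Replay existing data once so the child starts in sync.
        for k, v in self.points:
            child.append(k, f(v))
        return child

    def append(self, key, value):
        self.points.append((key, value))
        # Propagate only this new point; each child does the same,
        # so updates flow down arbitrary chains (A -> B -> C -> ...).
        for f, child in self.subscribers:
            child.append(key, f(value))

a = LinkedSeries()
b = a.map(lambda x: x * 2)      # seriesB = f seriesA
c = b.map(lambda x: x + 1)      # seriesC = f2 seriesB
a.append("t0", 10.0)
a.append("t1", 20.0)
print(c.points)                 # → [('t0', 21.0), ('t1', 41.0)]
```

A production version would also need unsubscription, error handling, and backpressure, which is roughly the territory of Rx-style observables.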

@hmansell

Contributor

hmansell commented Nov 2, 2013

My guess is that the current implementation would not be suitable - but please feel free to try it!

As you say, copying everything doesn't make sense for this application. Ideally, you would want changes to propagate down the chain of operations and do calculations incrementally, which would require different abstractions.

We're working on some real-time stuff at BlueMountain and going about it quite a different way.

@sirinath

sirinath commented Nov 14, 2014

Is it possible to consider open-sourcing the real-time stuff you guys have done internally?

@hmansell

Contributor

hmansell commented Nov 14, 2014

The implementation is too coupled to our internal infrastructure to allow us to do that, unfortunately.

@buybackoff

Contributor

buybackoff commented Dec 18, 2015

Done.

@buybackoff buybackoff closed this Dec 18, 2015

@buybackoff

Contributor

buybackoff commented Dec 20, 2015

@sirinath

sirinath commented Dec 21, 2015

WOW. Good stuff.

Also, can you make the license more permissive?
