Series and Frames for real-time streaming data #51
Comments
My guess is that the current implementation would not be suitable - but please feel free to try it! As you say, copying everything doesn't make sense for this application. Ideally, you would want changes to propagate down the chain of operations and do calculations incrementally, which would require different abstractions. We're working on some real-time stuff at BlueMountain and going about it in quite a different way.
Is it possible to consider open-sourcing the real-time stuff you guys have done internally?
The implementation is too coupled to our internal infrastructure to allow us to do that, unfortunately.
Done.
Here it is: https://github.com/Spreads/Spreads |
WOW. Good stuff. Also, could you make the license more permissive?
What would be the right way to use Series in a real-time environment where new data arrive asynchronously?
I have found a question (and probably part of an answer) that describes exactly this idea: http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data
The answers on SO suggest using the `FSharpx.Collections.Vector<T>` data structure instead of arrays. Another answer (http://stackoverflow.com/a/19520214/801189) on SO by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of `Vector` as `ArrayVector` in Deedle. I think the current focus of Deedle is dealing with fixed, existing series and frames - a workflow much like in R. But when the data length is fixed, performance matters less than in a real-time environment.

For streaming data we need to append new value(s) to an existing series and then use the new series. With the current array implementation, that requires copying the whole old array into a new, resized one. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus DateTime's 8 bytes), or around 80 MB per instrument. With, say, 100 instruments, copying all arrays many times per second is probably not the best option.
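To make the copying cost concrete, here is a minimal F# sketch of what a naive append to an array-backed series has to do (an illustration only, not Deedle's actual internals):

```fsharp
// Illustration only (not Deedle's internals): appending one point to an
// array-backed series allocates a larger array and copies every existing
// value, i.e. O(n) work per append.
let appendPoint (data: float[]) (x: float) : float[] =
    let next = Array.zeroCreate (data.Length + 1)  // allocate n + 1 slots
    System.Array.Copy(data, next, data.Length)     // copy all n old values
    next.[data.Length] <- x                        // write the new point
    next
```

At 16 bytes per point, a full day's 5 million points per instrument is about 80 MB that gets recopied on every such append.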
**Simplest use case**
For a stock price A arriving at 1-second intervals, we calculate a 60-second moving average and store it in a series `MA_A_60`. We update all vectors as new data points arrive.
Will the current implementation be suitable for such a workflow, with hundreds of instruments, multiple calculated values for each, and sub-second frequency?
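For illustration, here is a minimal F# sketch of that incremental workflow written against plain .NET collections rather than Deedle's API (the `MovingAverage` type and its members are hypothetical names): a queue keeps the last 60 seconds of ticks and a running sum is maintained, so each update costs O(1) amortized instead of rescanning the window.

```fsharp
open System
open System.Collections.Generic

/// Incremental time-windowed moving average (hypothetical helper, not Deedle API).
type MovingAverage(window: TimeSpan) =
    let points = Queue<DateTime * float>()
    let mutable sum = 0.0
    /// Record a new tick and return the current windowed average.
    member _.Add (time: DateTime) (price: float) =
        points.Enqueue(time, price)
        sum <- sum + price
        // Evict ticks that fell out of the window; the tick just added is
        // always inside it, so the queue never becomes empty here.
        while fst (points.Peek()) <= time - window do
            let _, old = points.Dequeue()
            sum <- sum - old
        sum / float points.Count

// Usage: feed ticks for stock A; each result is the next value of MA_A_60.
let maA60 = MovingAverage(TimeSpan.FromSeconds 60.0)
let latest = maA60.Add DateTime.UtcNow 101.25
```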
Will an implementation of Deedle's `IVector` backed by `FSharpx.Collections.Vector` be more suitable for such a use case? (I know one should run some tests in a similar situation, but there is no second implementation to compare with.) I would love to have Deedle's abstraction and API for such a use case!
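For comparison, here is a rough sketch of the trade-off, assuming the `conj` append of FSharpx.Collections' persistent vector (the exact module name and signature should be checked against the installed version):

```fsharp
open FSharpx.Collections

// A persistent vector shares structure between versions, so an append is
// roughly O(log32 n) rather than an O(n) array copy, and older versions
// remain valid for concurrent readers.
let v0 = Vector.empty
let v1 = Vector.conj 100.0 v0   // new version containing 100.0
let v2 = Vector.conj 100.5 v1   // v0 and v1 are still usable

// An IVector<float> backed by such a structure could make appends cheap,
// at the cost of somewhat slower random access than a raw array.
```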
P.S. An abstraction of the workflow: if `seriesB = f seriesA`, then we could somehow link series B to series A, watch for new values in A, and add the new values to B (applying the `f` function only to the incremental data). For this we would need some projection object that keeps `seriesB` always synchronized with `seriesA` using the transformation function `f`. In turn, there could be some `seriesC = f2 seriesB`, and so on. I am not sure this functionality should be inside the library, but that is what I hope to achieve.
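To sketch what such a projection object might look like, here is a small F# example using plain events; `StreamingSeries` and `linkTo` are hypothetical names, not Deedle types:

```fsharp
open System

/// A hypothetical append-only series that notifies subscribers of new points.
type StreamingSeries<'K, 'V>() =
    let added = Event<'K * 'V>()
    member _.Added = added.Publish
    member _.Add (key: 'K) (value: 'V) = added.Trigger(key, value)

/// seriesB = f seriesA: apply f only to each new point as it arrives,
/// keeping the downstream series synchronized incrementally.
let linkTo (f: 'V -> 'U) (a: StreamingSeries<'K, 'V>) =
    let b = StreamingSeries<'K, 'U>()
    a.Added.Add(fun (k, v) -> b.Add k (f v))
    b

// Chains compose: seriesC = f2 seriesB, and so on.
let seriesA = StreamingSeries<DateTime, float>()
let seriesB = seriesA |> linkTo (fun p -> p * 2.0)
let seriesC = seriesB |> linkTo sqrt
seriesA.Add DateTime.UtcNow 100.0   // propagates through B and then C
```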