I'm not 100% sure how this should work, but basically I think we need something that will let you collapse similar activities into one. For example, if one of my activities is "someone liked my post", that isn't a problem, but what happens if 100 people like my post? Then my stream is going to be spammed and it won't be very useful. It would be better if I got:
John Doe and 99 others liked your post
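A minimal sketch of that collapsing step in Python (the function and field names are made up for illustration, not anything in Streama):

```python
def summarize(actors, verb, obj):
    """Collapse many similar activities into one line, e.g.
    'John Doe and 99 others liked your post'."""
    if not actors:
        return None
    if len(actors) == 1:
        return f"{actors[0]} {verb} your {obj}"
    return f"{actors[0]} and {len(actors) - 1} others {verb} your {obj}"
```

So 100 like-activities on the same post render as a single line instead of 100 entries in the stream.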
I was thinking this might be a good problem for map-reduce, although that would probably require Streama to pull activities from a separate collection and run the map-reduce periodically to update it.
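To make the map-reduce idea concrete, here's a rough batch-job emulation in plain Python of what a MongoDB map/reduce pair over the activities collection might do (the `object_id`/`verb`/`actor` field names are assumptions, not Streama's actual schema):

```python
from collections import defaultdict

def map_phase(activity):
    # emit a (key, value) pair, as a MongoDB map function would
    return (activity["object_id"], activity["verb"]), activity["actor"]

def reduce_phase(key, actors):
    # fold all emitted values for one key into a single aggregated entry
    return {"object_id": key[0], "verb": key[1],
            "actors": actors, "count": len(actors)}

def run_batch(activities):
    """Run periodically to rebuild the aggregated feed collection."""
    grouped = defaultdict(list)
    for a in activities:
        key, value = map_phase(a)
        grouped[key].append(value)
    return [reduce_phase(k, v) for k, v in grouped.items()]
```

The downside, as noted below, is that the output is only as fresh as the last batch run.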
Map-reduce could work, but then it wouldn't be real-time. Maybe creating a new "Feed" or "Stream" collection that is unique to each user is the way to go. Whenever an activity is posted, each receiver's feed is updated accordingly. It'll look for duplicates and also solve issue #20 and issue #3. The only thing is it'll probably be best to process the feed generation in a background job, and I don't want to add a dependency for that.
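A sketch of that fan-out-on-write update, assuming an in-memory stand-in for the per-user "Feed" collection (all names here are hypothetical):

```python
feeds = {}  # user_id -> list of feed entries (stand-in for the "Feed" collection)

def fan_out(activity, receivers):
    """On each posted activity, update every receiver's feed,
    merging into an existing entry instead of duplicating."""
    for user in receivers:
        feed = feeds.setdefault(user, [])
        for entry in feed:
            if (entry["verb"] == activity["verb"]
                    and entry["object_id"] == activity["object_id"]):
                entry["actors"].append(activity["actor"])
                break
        else:
            feed.append({"verb": activity["verb"],
                         "object_id": activity["object_id"],
                         "actors": [activity["actor"]]})
```

In Mongo this would likely be an upsert keyed on (user, verb, object); doing it at write time is what keeps the feed real-time, at the cost of wanting a background job for the heavier processing.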
I'm kinda leaning towards an implementation similar to @joe1chen's fork, except that instead of just duplicating activities for each user, there would be some sort of processing step beforehand. The feed collection would hold a preprocessed copy of each user's activity stream, which will include info specific to the user as well as cached output for the feed.
I think that is the way to go. If you fan out on write but do some sort of filtering and consolidation, it'll solve some of the issues I saw you had with @joe1chen's fork, because in theory the data should be reduced. I don't even think you need a separate collection for each user: just a collection for all incoming activities and a collection that contains what the user will actually see.
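Putting the two-collection idea together, a rough sketch (again in-memory stand-ins, with made-up field names and a simplified render that assumes the object is a post): one append-only store of raw activities, and one consolidated view holding the cached output the user actually sees.

```python
activities = []   # raw "Activity" collection: every incoming event, append-only
feed_view = {}    # consolidated "Feed" collection: what each user actually sees

def ingest(activity, receivers):
    activities.append(activity)  # keep the full raw record
    for user in receivers:
        key = (user, activity["verb"], activity["object_id"])
        entry = feed_view.setdefault(key, {"actors": [], "count": 0})
        entry["actors"].append(activity["actor"])
        entry["count"] += 1
        # cache the rendered line at write time so reads are cheap
        first, others = entry["actors"][0], entry["count"] - 1
        if others == 0:
            entry["display"] = f"{first} {activity['verb']} your post"
        else:
            entry["display"] = f"{first} and {others} others {activity['verb']} your post"
```

This shows the reduction: 100 raw like-activities collapse into a single consolidated feed entry per receiver.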
I'd be happy to help with something like that since it's in line with what I'm doing. It'll let me add a "viewed" flag to the activities.
Sounds good... it would be great to collaborate with you on it. I'm a little busy over the next 2 months with work/travel, but if you have started on this, let me know and I'll try to make some time to look over your changes.
It seems to me that each document representing an activity is so small that by the point the collection even starts approaching a decent number of MBs or GBs, operations work to scale the DB is probably way overdue anyway. At that point the simplicity of @joe1chen's sharding approach might make the scaling work much easier, but I'm not great at ops, so I don't know.
The idea is that activity feeds determine relevance based on time: more recently created activities are deemed more relevant. Therefore, activity feed entries could be sharded by actor as well as a second archival field. A background job can then mark old documents for archival. Archived documents can be put onto a slow shard with lots of TBs in case you want to query for those documents. Then, in the off chance the user is interested in the less-relevant old activities, you can still present them without unbounded growth of the more relevant activity data set.
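A sketch of that background job, under the assumption that entries carry a `created` timestamp and an `archived` flag (both hypothetical field names; in Mongo the shard key would then be something like `{actor, archived}` so archived entries can live on the slow, high-capacity shard):

```python
from datetime import datetime, timedelta, timezone

def mark_for_archival(entries, max_age_days=90):
    """Flag feed entries older than the cutoff so they can be
    migrated to the archival shard; returns how many were marked."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    marked = 0
    for e in entries:
        if not e.get("archived") and e["created"] < cutoff:
            e["archived"] = True
            marked += 1
    return marked
```

Run periodically, this bounds the hot data set while keeping old activities queryable.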
I don't think there is a problem with @joe1chen's schema, but since you are already creating the display version for the end user, you might as well also aggregate it so you end up with what is essentially the end product you show to the user.
I know one of @christospappas's issues had to do with duplicate data getting out of hand, which I think aggregation also solves.
Ultimately, though, for my use case, which was more of a notification feed, I ended up just coding my own custom solution. Eventually I may circle back and add in Streama for a more traditional activity stream, but I don't know when that'll be.