Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map-reduce support #61

Closed
stas opened this issue Jan 15, 2013 · 44 comments
Closed

Map-reduce support #61

stas opened this issue Jan 15, 2013 · 44 comments
Labels

Comments

@stas
Copy link
Contributor

stas commented Jan 15, 2013

This is more like an open question.

I need a way to reduce the amount of repetitive activities so that if I get same keys for the same object, I could develop a strategy to map/merge those into a single entry.

Ideally I would patch public_activity to handle that for me. Anyone interested in getting something like this into the core or I should not even bother pitching this.

Thanks

@farnoy
Copy link
Contributor

farnoy commented Jan 15, 2013

Could you elaborate more on how would you use it? Since many activities of the same key will have different params/owners and such, how can you benefit from that?

@stas
Copy link
Contributor Author

stas commented Jan 15, 2013

Suppose we have the key status, which means for an activity entry that an actor changed the status. It's not relevant what was the old value, all that is relevant is that the actor did it 3 times already.

Same for, lets say Article, if an actor updates it.

@hadees
Copy link

hadees commented Jan 15, 2013

I have a similar issue, for example a User's post is liked 100 times by other users. Currently their Activity Feed is going to be spammed with 100 activities instead of just something like Johnny and 99 others liked your post

@stas
Copy link
Contributor Author

stas commented Jan 15, 2013

A simple approach is to update last activity if some conditions are met. And that would be the easiest way to handle it for basic usecases.

@farnoy
Copy link
Contributor

farnoy commented Jan 15, 2013

I think I need to study map-reduce more in depth to answer this question. I thought that it was a simple operation for grouping records by fields and getting count of each group. If it's really more complex than an ORM+DB problem, I will learn it shortly and post some meaningful comments here.

@farnoy
Copy link
Contributor

farnoy commented Jan 15, 2013

I think associating a post with an activity could solve the problem, as every new comment would set the unread status (or something similar) on the notification. Also a habtm relation with join table for read/unread for each user concerned in that post (author and those who commented, for example). I think there's a gem for that, but I can't recall the name.

@pokonski
Copy link
Member

That is one hard topic, on which I contemplated from the very beginning. We can either merge them on creation or in views. In one of the apps I developed using public_activity I did the grouping in JavaScript, client-side.

  1. You render all activities in a view.
  2. Then you group them by their key+trackable_id, and merge into one activity.

Looked and behaved properly, but requires fetching many activities (though it was still fast, as the activities were very simple to render). This is the closest I got to Facebook implementation, but on client-side (facebook does this somewhere between database and displaying).

@michelson
Copy link

Maybe we can use the params or a field with a parent_id activity to know how to group activities and then query only the parents activities, and on each parent activity make the subqueries, grouping , logic to display what you want.
The benefit of this approach is that you are going to display the number of activities you want , suppose that we want to display 100 activities, if you map / reduce those 100, in client or server, you may end with a different quantity of activities - grouping 100 likes end with one activity displayed -.
but quering "parent" activities you always display 100 activities with inner mapped/reduced/subqueried content.

now the question is how to set up some logic for parent / child creation , like:
time - activities created with x mins/secs/hours of distance will be child of x
repetition - if there is no other activity (different trackable & different owner) then it will be a child of x
custom logic in controller
...

makes any sense ?

@pokonski
Copy link
Member

My take was to on Activity creation find the most recent activity with the same key/trackable combination and "attach" the newly created activity to the one already existing.

Merging activities is trivial on MongoDB, thanks to its schema-less design. Everything relational requires additional columns, and probably associations.

Facebook example, when two people comment on my post Facebook merges those notifications into one:

  1. Mark comments...
  2. The activity says: "Mark commented on your post: 'thanks for sharing!'"
  3. Jenny comments, too.
  4. And now we have a merged activity "Jenny and Mark commented on your post: 'that's cool'"

So we need to merge owners into a list, which could be stored as a serialized array (but we lose querying possibilities in ActiveRecord), or as an additional association, and update :parameters with the ones from newly created activity.

This is what I did, but in views. So no merging was done in the database. This was far from perfect, as @michelson mentioned, because we have no way of knowing how many activities we are actually going to get.

@stas
Copy link
Contributor Author

stas commented Feb 15, 2013

I'm not a fan of any proposed approaches. What I would do is add a callback on before_save that does the merging for me. This operation will be incremental and does not require any schema modifications.

In ex. Mark triggers an activity on something related to a trackable with ID X and the event key is key1. When Jenny triggers another activity on the same trackable with ID X and the same event key1, the hook fetches previous keys (which will include Mark's activity) and since trackable and keys are the same, we can apply a merging algorithm (probably just update parameters with a new value, say, actors that include Mark and Jane.

This makes sense for trackable object but it's not relevant if you care about recipient. To preserve the recipient we gonna need schema changes.

@pokonski
Copy link
Member

@stas good thinking. I like the actors approach. To preserve recipient we could group them by trio of key, trackable and recipient. This way we only group activities directed at the same recipient.

@stas
Copy link
Contributor Author

stas commented Feb 15, 2013

Also true.

În data de Fri, 15 Feb 2013 13:47:01 +0200, Piotrek Okoński
notifications@github.com a scris:

@stas good thinking. I like the actors approach. To preserve
recipient we could group them by trio of key, trackable and recipient.
This way we only group activities directed at the same recipient.


Reply to this email directly or view it on GitHub:
#61 (comment)

@michelson
Copy link

im still think that add a parent_id association is the simplest solution,
because we are going to have all the activities (not merged) so we can
display it individually if we want.

but i agree that the storage of data in the serialized params fields is a
good idea

is obvious that the check of the keys have to be in a before_save , so we
could cache the data in the serialized params field of the parent, and also
build the child/parent assoc, that way we are going to have cached data but
also we are going to be able to query down the child activities if we want.

Because, the cached data could be an array of names ? or ids ? but what
happens if we need to display the user avatars or something else ? ...
well, we could change the params data structure when before_save (hard to
maintain), or query the array of ids stored in the serialized params one by
one (not optimal), or just query down the child activities. - many
approaches.

the main problem with ONLY cache serialized params is that we loose
querying abilities, so in the future if we need to show only the activities
of one users maybe we can´t. That´s why im a fan of a parent/child approach
in addition to cache serialized data in the parent object.

also I'll be happy to help in this task

regards

Atte.
Miguel Michelson Martinez
www.artenlinea.com

On Fri, Feb 15, 2013 at 9:10 AM, Стас Сушков notifications@github.comwrote:

Also true.

În data de Fri, 15 Feb 2013 13:47:01 +0200, Piotrek Okoński
notifications@github.com a scris:

@stas good thinking. I like the actors approach. To preserve
recipient we could group them by trio of key, trackable and recipient.
This way we only group activities directed at the same recipient.


Reply to this email directly or view it on GitHub:

#61 (comment)


Reply to this email directly or view it on GitHubhttps://github.com//issues/61#issuecomment-13603851.

@mewdriller
Copy link
Contributor

@stas, @pokonski: I took a shot at this in the app I'm using public_activity for and found that including the time in the aggregation key was relevant too. For example, if the following actions happen:

  • Jim creates a Post on 1/1/13 at 1:00pm
  • Bob comments on that Post on 1/2/13 at 1:00pm
  • Sally comments on that Post on 1/2/13 at 2:00pm
  • Mark comments on that Post on 1/2/13 at 3:00pm
  • Jill comments on that Post on 1/2/13 at 4:00pm
  • Susan comments on that Post on 1/3/13 at 1:00pm

If the user is viewing the list of notifications on 1/4/13, you might want to group the notifications like:

  • Jim created a new Post three days ago.
  • Bob and 3 others commented on a Post two days ago.
  • Susan commented on a Post yesterday.

What it seems like to me is that there should be a hook when creating activity that defines how to create the "aggregation key" which could then be used by the group_by method. By default, that aggregation key could be generated by combining: trackable, key, recipient, and created_at (with a time granularity of days).

@pokonski
Copy link
Member

@mewdriller yes, exactly. All that makes sense. Since it's a gem it needs to provide reasonable defaults with possibility of changing them, as every one needs something slightly different.

@rubish
Copy link

rubish commented Mar 5, 2013

+1

Just came across public activity today. we have implemented a custom notifications system, basically doing what public_activity does with a few additions. Public activity does it much more elegantly than our implementation.

We only create new aggregates for user when he becomes active after few hours of inactivity, otherwise we keep aggregating the activities into older latest activity. This way user has a clear separation of things which happened while he was away and things happening now. Custom proc can generate the aggregation key as required, other option can be to return activity to aggregate current activity into.

There are a few additional issues/features which we tackled and would make sense for public activity too. Have started a new thread #82 to discuss them further.

@rubish
Copy link

rubish commented Mar 6, 2013

One more important aspect to consider would be translation keys for aggregated activities. Different projects will have different requirements. Notifications can look like:

  1. John and Bob liked your Post about 3 hours ago.
  2. Jim created a new Post three days ago.
  3. Bob commented on a Post two days ago.
  4. John, Bob and 5 others commented on your Post 5 days ago.
  5. John, Bob and 1 other liked on your Post 5 days ago.

Fallback to i18n solely based on activity key might not be enough in such a scenario, with different number of actors visible upfront and pluralization. Developers should have some method to define the translation key and interpolation parameters for each activity.

@pokonski
Copy link
Member

pokonski commented Mar 6, 2013

All this comes down to joining actors, but keeping in mind that different languages have different syntax. English is the easiest one, I think. Implementation in p_a should be vague enough for people to implement this by themselves.

Rails already has i18n support for joining arrays in human-readable format: https://github.com/svenfuchs/rails-i18n/blob/master/rails/locale/en.yml#L188

@stas
Copy link
Contributor Author

stas commented Mar 6, 2013

Public Activity already offers flexibility to implement an i18n or a view-based rendering for entries, I'm using views, and it works great. Don't think this is a blocker.

@rubish
Copy link

rubish commented Mar 6, 2013

Yes, definitely not a blocker by any means.

@bryanrite
Copy link

👍 @mewdriller

@mieko
Copy link

mieko commented Nov 26, 2013

I've extracted the code I've been using for this purpose into a library at mieko/coalesce, which I'm trying to turn into a general-purpose "object aggregation library". Contributions expanding its scope to cover all of the use-cases explored here would most certainly be welcome. It'd be nice to finally nail down this functionality, in a way which doesn't invade PA, and would be useful for other purposes.

Coalesce works something like:

g = Coalesce::Grouper.new do
  rule :create_and_comment do
    same       :id, :owner
    attr_in    :key, 'ticket.create', 'ticket.comment'
    time_delta 5.minutes, from: :first

    combine :key,  with: :smart_key
    combine :text, with: :array
  end
end

activities = [
  Activity.new(id: 5, key: 'ticket.create', owner: 10, text: 'Data 1'),
  Activity.new(id: 5, key: 'ticket.comment', owner: 10, text: 'Data 2')
]

results = g.each(activities).to_a
# => [Activity(id: 5, key: 'ticket.comment_create', owner: 10, text: ['Data 1', 'Data 2'])]

The PA-specific niceties like the :smart_key combiner allow Coalesce to massage collections into PA's view structure. Coalesce currently only works in a stream-processing manner, so non-adjacent items aren't considered for grouping, I'd like to extend this to random-access, without ballooning queries or memory usage.

The API is still up in the air, and the functionality only implements enough to cover my use-cases. The gem is pre-release, without proper versioning and packaging yet. I'm working on documentation as I have time, so anyone looking to check it out must be willing to code-dive at the moment.

With a bit of help, I think this could be turned into a system powerful enough to reasonably express all the complexities of Facebook-style aggregation, while staying maintainable. Regardless, it may be a good starting point for anyone looking for functionality like this.

@seifsallam
Copy link

@mieko That's awesome. Thanks 👍

@farnoy
Copy link
Contributor

farnoy commented Nov 26, 2013

But does your library reduce on the server side? If not, the performance will be penalised when done in Ruby.

@mieko
Copy link

mieko commented Nov 26, 2013

@farnoy Coalesce works at the Ruby object level.

I guess performance is a matter of perspective. My target is mobile and widely-compatible mobile, (think Android 2.3 on spotty 3G). When we moved to server-side aggregation, with standard Rails caching techniques, both bandwidth requirements and page-load responsiveness improved.

As far as our users are concerned, it was a definite performance win. Our servers take a little performance ding on a cache miss.

Where an application lives on the SPA-vs-traditional web app gradient will determine if this approach is appropriate.

@pokonski
Copy link
Member

@mieko nice idea! Looking forward to hearing more :)

@farnoy
Copy link
Contributor

farnoy commented Nov 26, 2013

While caching can excuse any bottlenecks temporarily, this is not a true solution for map reduce.

@mieko
Copy link

mieko commented Nov 26, 2013

While caching can excuse any bottlenecks temporarily,

Caching is actually the cornerstone performance strategy of large web deployments, but I'm digressing.

this is not a true solution for map reduce.

While the issue title contains the phrase "map-reduce", the problems described are more generalized aggregation. One forcing MapReduce (the paper) strategy to get the results discussed here, would probably be disappointed in how little is gained from not necessarily independent data. A lot of rules we use are outright sequential and heavily inter-dependent for the sake of "humanized" grouping.

Regardless, Coalesce is neither Hadoop nor a Javascript library, depending on what is expected of a true solution. Deep ORM query generation is even out of its scope. It's a still-incubating library for aggregating Ruby objects which meet certain conditions in prescribed ways.

@pokonski
Copy link
Member

@farnoy disagree. Caching is an absolute necessity in web-dev. No matter how brilliantly awesome your code is. Serving static content will beat it every time.

Like @mieko pointed out, there is no single serve-all-purposes solution, and definitely not one I can include in p_a.

I did aggregation with p_a in JS (client-side), and server-side for different projects, depending on the use-case. Doing it in the database would bind p_a implementation very closely to the postgres/mysql/oracle or some other crappy database (MongoDB).

@farnoy
Copy link
Contributor

farnoy commented Nov 26, 2013

We agreed that we can't ship a solution for everyone in p_a months ago on the mailing list.

@mieko, people doing large scale webdev will have serious problems as soon as they use your gem for aggregation.

I'm not saying MapReduce specifically. The use case you've described as an example would be infinitely better performing if developers were to store data the way your results look like.

# => [Activity(id: 5, key: 'ticket.comment_create', owner: 10, text: ['Data 1', 'Data 2'])]

text could be a parameter. text could be it's own column if it would have to be queried.

The algorithm for this bundling would be really simple using the find_by method.

This does introduce another query when inserting, but it won't crash your servers when you purge caches.

By the way, I think one PostgreSQL query can replace your gem. Using array_agg(activities.parameters -> text) with proper GROUP BY clauses should return one row with all parameters[:text] concatenated into one array.

Refs:

@mieko
Copy link

mieko commented Nov 26, 2013

I've updated my in-database format to resemble my views, and replaced the gem with the proposed query. My servers have stopped crashing. I'll leave Coalesce (now in maintenance-only mode) up as a warning to others intending to scale.

@farnoy
Copy link
Contributor

farnoy commented Nov 26, 2013

This requires a proper migration, which we'll document in the wiki, but for testing purposes, when parameters is still a serialized string, array_agg(activities.parameters::hstore -> text) should work too (not optimal!)

@xguox
Copy link

xguox commented Jan 5, 2014

I am now doing remove_duplicate like this in the controller

  def remove_duplicate
    group = Activity.all.group_by{|model| [model.trackable_id, model.trackable_type, model.owner_id, model.owner_type, model.key, model.recipient_id,model.recipient_type] }
    group.values.each do |duplicates|
      newest_one = duplicates.pop
      duplicates.each{|duplicate| duplicate.destroy}
    end
  end

How's it? and

I attempt to do this in the model PublicActivity::Activity by using callback

but fail

@farnoy
Copy link
Contributor

farnoy commented Jan 5, 2014

The algorithm is really bad because you process all activities every time
even when you create only one. There is too much redundant computing and it
does not scale at all. What you could do is implement either before or
after save callback and search for that one other activity (should always
be at most one duplicate when the hook strategy is employed) that has the
same parameters and destroy/merge them.
On 5 Jan 2014 03:41, "XGUOX" notifications@github.com wrote:

I am now doing remove_duplicate like this in the controller

def remove_duplicate
group = Activity.all.group_by{|model| [model.trackable_id, model.trackable_type, model.owner_id, model.owner_type, model.key, model.recipient_id,model.recipient_type] }
group.values.each do |duplicates|
newest_one = duplicates.pop
duplicates.each{|duplicate| duplicate.destroy}
end
end

How's it? and

I attempt to do this in the model PublicActivity::Activity by using
callback

but fail


Reply to this email directly or view it on GitHubhttps://github.com//issues/61#issuecomment-31594733
.

@farnoy
Copy link
Contributor

farnoy commented Jan 6, 2014

I successfully implemented a sample application where there are two models + activities.

One is called Author and stores only a name, the second one is called Update and has a string content and belongs to some author.

I then implemented simple functionality to share these updates (facebook style) by authors (fully done on activities, just a simple concept).

Finally, I've written a query that finds all activities with key update.shared, groups them by trackable, and then groups them by the read parameter. From there, I can select array_agg(activities.owner_id) AS owner_ids and use it to make a simple map/reduce structure:

[{["Test", "false"]=>[3, 1, 2]}, {["Test", "true"]=>[1]}]

The structure looks like this: {[update_content, read_status] => [owner_ids..]}.

In this simple form, it does not achieve anything substantial, but I think it can solve most of the aggregation problems that our users encounter.

The short version is:

@activities = PublicActivity::Activity.where(key: 'update.shared')
.group(:trackable_id, :trackable_type)
.group("activities.parameters -> 'read'")
.select("activities.parameters -> 'read' AS read,
  activities.trackable_id, activities.trackable_type, 
  array_agg(activities.owner_id) AS owner_ids")

# this produces above mentioned structure
@activities.map {|o| {[o.trackable.content, o.read] => o.owner_ids} }

My solution uses PostgreSQL and requires one change in p_a codebase plus a migration (to store parameters in native HStore).

I'd like to hear what do you think about this solution and if it makes sense for your use cases.

Ideally, we'd like a RDBMS agnostic solution, which probably could be achieved (with the exception of referencing into parameters which only Postgres has I think).

@stas

@stas
Copy link
Contributor Author

stas commented Jan 7, 2014

@farnoy I was expecting a more DB schema oriented approach (as I mentioned, I would see the reduce process happening at the write level rather when activities are queried/read). Reading above, It is not clear how activity generation is handling the entries which look same. Maybe I'm missing something, could you explain more?

Thanks.

P.S.: By the time I opened this ticket, I was working on @Courseware which is now open source. So if you are looking for a project where to benchmark/test things, feel free to take a look at it (It has a dozen models with activities).

@farnoy
Copy link
Contributor

farnoy commented Jan 7, 2014

The way I understand map reduce is that mapping and reducing happens when
querying, the data is not reduced when inserting because that would not be
appropriate for every select query later on. This is what happens in my
example where similar activities are stored in the database and the reduce
step happens during queries.
On 7 Jan 2014 10:57, "Stas Sușcov" notifications@github.com wrote:

@farnoy https://github.com/farnoy I was expecting a more DB schema
oriented approach (as I mentioned, I would see the reduce process happening
at the write level rather when activities are queried/read). Reading above,
It is not clear how activity generation is handling the entries which look
same. Maybe I'm missing something, could you explain more?

Thanks.

P.S.: By the time I opened this ticket, I was working on @coursewarehttps://github.com/coursewarewhich is now open source. So if you are looking for a project where to
benchmark/test things, feel free to take a look at it (It has a dozen
models with activities).


Reply to this email directly or view it on GitHubhttps://github.com//issues/61#issuecomment-31725133
.

@pokonski
Copy link
Member

pokonski commented Aug 5, 2014

Closing this for now as we have no specific way of achieving this on multiple ORMs.

@thomasgallagher
Copy link

Is there somewhere else we can discuss this as it's certainly logical enchancement to p_a - even if it doesn't necessarily make sense for it to be a part of the gem itself.

@farnoy - perhaps you could share your sample Authors app please?

I'm also struggling to get to grips with a good way of doing this.

@farnoy
Copy link
Contributor

farnoy commented Sep 14, 2015

If you make your parameters a hstore in postgres (and you should be able to do that easily) you can use aggregate functions for this. Probably grouping on the [trackable_type, trackable_id, key] triple and using something like array_agg(parameters -> 'some_thing') or count(...).

I don't have that app anymore. It should be easy in general, just limited because of SQL.

@thomasgallagher
Copy link

Great, thanks for that starting point.

I'm going to have a go at it and report back here.

I know it's not all inclusive from an ORM perspective but hopefully it will help other Postgres users - for others there is the JS solution above.

@oleander
Copy link

oleander commented Jan 5, 2016

I solved the problem for postgresql like this. It groups events that happens in sequence and counts the number of occurrences.

SELECT * , row_number - coalesce(lag(row_number) OVER(), 0) as count
FROM(
  SELECT 
    key, 
    created_at, 
    row_number() OVER(ORDER BY created_at), 
    lead(key) OVER(ORDER BY created_at) 
  FROM activities
  WHERE owner_id = <user-id>
) t
WHERE key IS DISTINCT 
FROM lead
ORDER BY created_at DESC;

Here's the output.

key           |         created_at         | row_number |          lead           | count
-------------------------+----------------------------+------------+-------------------------+-------
answer.create           | 2016-01-04 23:54:45.55378  |          9 |                         |     2
taken_exam.update       | 2016-01-04 23:54:35.966606 |          7 | answer.create           |     2
taken_exam.create       | 2016-01-04 23:46:11.351729 |          5 | taken_exam.update       |     1
challenge_answer.create | 2016-01-04 23:45:50.546626 |          4 | taken_exam.create       |     1
quick_answer.create     | 2016-01-04 23:36:21.15633  |          3 | challenge_answer.create |     3

Here taken_exam.update has occurred 2 times and happened after answer.create (as shown in the lead column). I've some corresponding Ruby code if someone is interested. I just have to clean it up first.

@thomasgallagher
Copy link

Hey @oleander - definitely interested :)

@cantonyselim
Copy link

@oleander I too am interested! Thanks in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests