
EZP-20461: Finishing the implementation using Signal/Slot by enabling all Service integration tests #436

Closed
wants to merge 23 commits

Conversation

patrickallaert
Contributor

@ezrobot
Contributor

ezrobot commented Jun 25, 2013

This Pull Request does not respect our Coding Standards, please, see the report below:

FILE: ...Z/Publish/Core/Persistence/Legacy/Tests/Content/LocationHandlerTest.php
--------------------------------------------------------------------------------
FOUND 1 ERROR(S) AFFECTING 1 LINE(S)
--------------------------------------------------------------------------------
 109 | ERROR | Space found before comma in function call
--------------------------------------------------------------------------------
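For reference, this is the kind of change the sniff asks for; the call below is an illustrative line, not the actual code at the reported location:

// Before: the space preceding the comma trips the coding standard
$handler->load( $locationId , $rootId );

// After
$handler->load( $locationId, $rootId );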

@patrickallaert
Contributor Author

Unused method hasTransactionStarted() removed (thanks @andrerom) and CS issue fixed (thanks @ezrobot).

@andrerom
Contributor

Looks really good.
The only thing I could find was missing cache clearing in places where locations are manipulated elsewhere; contentHandler->create() comes to mind.

*/
public function deleteLocation( $locationId )
{
throw new \Exception( "Not implemented yet." );
Contributor


As this is a proxy function, it should probably be implemented, since it's such low-hanging fruit.
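For illustration, a sketch of what the delegating implementation could look like; the inner handler access and method name ($this->persistenceHandler->locationHandler()->delete()) are assumptions here, not the project's actual API:

public function deleteLocation( $locationId )
{
    // Delegate to the wrapped persistence handler...
    $return = $this->persistenceHandler->locationHandler()->delete( $locationId );
    // ...and invalidate the cached subtree data the removal affects
    $this->cache->clear( 'location', 'subtree' );
    return $return;
}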

@patrickallaert
Contributor Author

@andrerom All your remarks have been taken into account.

@andrerom
Contributor

Hmm, I can't see any cache clearing in Cache/ContentHandler->create. As the struct can contain locations, it looks like it creates locations that should affect the subtree cache ids?

@patrickallaert
Contributor Author

oops, indeed, forgot cache clearing...

@patrickallaert
Contributor Author

@andrerom:
Calls to

$this->cache->clear( 'location', 'subtree' );

have been added to:

  • eZ\Publish\Core\Persistence\Cache\ContentHandler::deleteContent()
  • eZ\Publish\Core\Persistence\Cache\ContentHandler::publish()

in commit: 427b32add2b3c05f53d5db2b2e87ef2861bf9f15.

I'm wondering if it needs to be added to the following ones as well:

  • eZ\Publish\Core\Persistence\Cache\ContentHandler::create()
  • eZ\Publish\Core\Persistence\Cache\ContentHandler::copy()
  • eZ\Publish\Core\Persistence\Cache\ContentHandler::deleteVersion()
  • eZ\Publish\Core\Persistence\Cache\UserHandler::create()
  • eZ\Publish\Core\Persistence\Cache\UserHandler::delete()

But I guess it's not needed, as those don't perform changes in terms of Location, right?

@andrerom
Contributor

  • ContentHandler::create(): Not needed unless you can create as published directly
  • ContentHandler::copy(): Probably needed
  • ContentHandler::deleteVersion(): Needed if this deletes the content on last version, or deletes the published version
  • UserHandler::create(): probably needed
  • UserHandler::delete(): probably needed

@patrickallaert
Contributor Author

ContentHandler::create(): Not needed unless you can create as published directly

It doesn't publish, so I would say "not needed" as well.

ContentHandler::copy(): Probably needed

AFAIK, it creates a copy of the content but does not publish it either, so, as for ContentHandler::create(), I would say "not needed".

ContentHandler::deleteVersion(): Needed if this deletes the content on last version, or deletes the published version

Cache clearing added.

UserHandler::create(): probably needed
UserHandler::delete(): probably needed

Apparently not needed: it only takes care of creating/removing data specific to the user ("ezuser") and nothing at the Content level.

@andrerom
Contributor

andrerom commented Jul 4, 2013

Apparently not needed: it only takes care of creating/removing data specific to the user ("ezuser") and nothing at the Content level.

Ah yes, I was thinking of userService.

+1

review ping @pspanja / @bdunogier

@pspanja
Contributor

pspanja commented Jul 5, 2013

I do not see the reason for TextLine to use MultipleStringField; instead, StringField should be enough.

Aside from that question, this looks really good and has my +1.

@patrickallaert
Contributor Author

I do not see the reason for TextLine to use MultipleStringField; instead, StringField should be enough.

Because a field may exist in multiple languages for the same Content / Version.
For example, in the test eZ\Publish\API\Repository\Tests::testRemoveFieldDefinitionRemovesFieldFromContent, the blog-post/title field contains "My marvellous blog post" in eng-GB and "My awesome blog post" in eng-US, which will generate the following index creation request:

<add>
  <doc>
    <field name="id">60</field>
    <field name="type_id">36</field>
    <field name="version_id">1</field>
    <field name="status_id">1</field>
    <field name="language_code_ms">eng-GB</field>
    <field name="language_code_ms">eng-US</field>
    <!-- ... 8< ... -->
    <field name="blog-post/title/value_s">My marvellous blog post</field>
    <field name="blog-post/body/value_s">Body is not done yet but it is going to be jolly good...</field>
    <field name="blog-post/title/value_s">My awesome blog post</field>
    <field name="blog-post/body/value_s">Body is not done yet but it is going to be awesome...</field>
  </doc>
</add>

@pspanja
Contributor

pspanja commented Jul 8, 2013

Ok this clears it up, +1 it is.

@kore
Contributor

kore commented Jul 10, 2013

Value Objects in Signals

The change introduces passing value objects with signals. We explicitly focused on just passing scalar types with signals because of potential asynchronous signal processing.

If we want to implement asynchronous processing of signals, those signals must be stored in some queue together with their data. This ought to be trivial as long as we stay with scalar types and IDs referencing the objects. When passing around full value objects, two problems occur:

  1. (De-)Serializing complex data structures

    A full serializing and de-serializing step must be introduced for all value objects passed around together with signals. This does not exist yet and might be hard to implement, especially if we want to do this in a backend-agnostic way (not using PHP's serialize()).

    It might be possible to re-use the serialization logic from the REST interface, but the use case is different.

  2. Outdated aggregated value objects

    A User value object, for example, aggregates a ContentInfo value object, which then may aggregate other value objects. If a signal is processed asynchronously, some of the aggregated value objects might be out of date. This could lead to inconsistencies. If we only pass the ID around, we might not retrieve the exact object state that emitted the signal from storage. But in that case it would be obvious that we retrieve the most current version, and the signal processor will be aware of that.

    If we want to retrieve all updates / changes on the data, we should explicitly design something like Event Sourcing for our persistence backend, with proper revisions etc. But this is a different application approach and requires additional thought.

I suggest reverting to passing around IDs identifying the value objects and reloading them from storage. This might seem like overhead, but with caching in the persistence layer it should not hurt too much. From my point of view, the potential problems with value objects embedded in signals outweigh the possible performance drawback.

If we decide to pass value objects around, we should do so in all places consistently and clearly document the potential issues. We could also convert the value objects to IDs in the asynchronous processing handler, but then the APIs would differ for asynchronously processed signals and synchronously processed ones.
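To make the contrast concrete, a hedged sketch of the two signal shapes under discussion (class and property names are illustrative, not the actual signal API):

// Scalar-only signal: a name plus IDs, trivial to put in any queue format.
class PublishVersionSignal
{
    public $contentId;
    public $versionNo;
}

// Value-object-carrying signal: needs a full (de-)serialization step for
// queueing, and the aggregated objects may already be stale by the time an
// asynchronous slot finally processes it.
class PublishVersionSignalWithObjects
{
    /** @var \eZ\Publish\API\Repository\Values\Content\Content */
    public $content;
}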

Content\Search\Handler::deleteLocation()

What is the reason to implement this in the search backend?

I would think that the logic for resolving a location / subtree to the affected content objects could happen in the Public API implementation. Implementing logic like this in every persistence layer will introduce bugs and makes the implementation of persistence layers far more complex. (Something similar already makes implementing a UrlAlias-Handler far too complex.)

Even if this reduces the potential runtime optimizations, I would consider it better to move the involved logic to the business layer and just use the deleteContent() method for cache purging operations. It might be beneficial to allow clearing the cache for multiple content objects at once, if this helps.

Cache Purging Logic

I did not review the cache purging logic in Persistence\Cache\*Handler in depth. As mentioned in the original cache document, I would prefer a native logic which allows us to automatically verify which caches require purging and which can be left intact. Doing this "manually" might ignore hidden side effects.

For example: I do not see that the cached ContentInfo of (reverse) related content objects is cleared properly. And there are several more issues with aggregated value objects which might cause inconsistencies when applying manual cache purging. I might have missed the correct implementation during review, though. It might also be a non-issue depending on other constraints.

Naming

Please rename "ez_mid" and "ez_mstring" to "ez_multi_id" and "ez_multi_string" in eZ/Publish/Core/Persistence/Solr/Content/Search/FieldNameGenerator.php and the respective search field types. Maybe "ez_id_array" / "ez_string_array" would be even more readable.

Those identifiers are supposed to be readable by implementors of the Persistence API, and they should provide a basic insight into their meaning. Abbreviations should always be avoided. The internal identifier in Solr is something different, since it is not part of the external API; "ms" is fine there.

Method Complexity

Solr\Content\Search\Handler::mapContent()

This method should be extracted into its own mapper class. The logic is getting too complex and the method body too long to sanely maintain in the future. Also, making the Mapper changeable / overridable could be a sensible extension point.

@patrickallaert
Contributor Author

Value Objects in Signals

I suggest reverting to passing around IDs identifying the value objects and reloading them from storage. This might seem like overhead, but with caching in the persistence layer it should not hurt too much. From my point of view, the potential problems with value objects embedded in signals outweigh the possible performance drawback.

It has not been made for performance reasons, even if that is one of the benefits.
The problem with IDs is that we can't always reconstruct the information back. In the case of Delete*Signal, how can a remoteId, a parentId, a pathString or... be retrieved if the information has been removed (including from the cache)? One possibility would be to integrate as much information as possible in the form of scalars, but that would mimic the job of a serializer, and we never know in advance what information the slots will need, e.g. the parent location in the case of Solr.

If we want to implement asynchronous processing of signals[...]

I agree with your concerns, but to me this is highly hypothetical, and nothing prevents creating asynchronous signals in the future, starting from a synchronous one and taking, at that moment, the design decision of what needs to be stored in a queue and how.
I would therefore rather go with the current approach, which also plays more nicely in terms of performance, since in a lot of cases full objects are required and the cache will often not contain the info, because we have invalidated it as we are dealing with changes regarding the concerned object.

If we decide to pass value objects around, we should do so in all places consistently and clearly document the potential issues.

I very much agree with that. In fact, I planned to create a separate refactoring issue to consistently pass the Value Objects in Signals. But to make this PR lighter to review and to keep it progressing, I decided it should rather be treated separately.

Content\Search\Handler::deleteLocation()

What is the reason to implement this in the search backend?

At first glance, it doesn't look difficult to implement in current and future backends, while it is a bit more difficult in the Public API implementation. This is also a very common operation that should run as fast as possible. Putting the O(n) loop higher in the hierarchy would make it close to impossible to have a final O(1) operation in the backend.
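For illustration, the O(1) backend operation could be a single Solr delete-by-query over the indexed path; the field name, escaping, and client call below are assumptions, not the actual implementation:

// Hypothetical sketch: one delete-by-query removes every indexed document
// whose location path starts with the deleted location's path string.
$this->client->request(
    'POST',
    '/update',
    '<delete><query>path_string_id:' . $pathString . '*</query></delete>'
);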

Cache Purging Logic

+1, I fully agree with you. But this PR shouldn't change the current cache purging logic, as that is out of the scope of the issue it addresses. This should be refactored separately, I would say.

Naming

Please rename "ez_mid" and "ez_mstring" to "ez_multi_id" and "ez_multi_string" [..]

Agreed, that will indeed be much more readable!

Method Complexity

Solr\Content\Search\Handler::mapContent()

I somewhat agree: externalizing it would decrease the class complexity at the price of increasing the overall complexity, but it would indeed allow an extension point.
This method's complexity isn't changing much here, and extension points should ideally be documented.
For those reasons, I would rather handle this separately, with an additional feature request.

@gggeek
Contributor

gggeek commented Jul 10, 2013

I agree with passing around full objects, as it caters to the more general use case.

E.g. the typical signal emitted by the legacy kernel on node/view is node_id, which forces the user to fetch the node again to make any use of the signal.

Also, having coded ezcontentstaging, I learned a lot about serializing as much as possible of the current-state info being a good thing.
In fact, in all events used there (which we can consider precursors to signals), I serialize all data to avoid fetching it again when later events are treated. Case in point: with async processing, I can make sense of content-move info from a signal emitted a while ago, even if the object has been removed in the meantime. There is only one event which breaks this principle (content/edit), and that is because I cannot properly serialize an ez4 object without risking recursion.

Of course users can get the id of the value from the serialized data and refetch it, if what they want is the opposite.
And we should put a BIG SIGN telling them so.
Case in point: some of the standard workflow events in eZ will crash and burn because they assume that the object/node whose id they received still exists in the db when the event is executed. But if a wait-until event is added to the workflow, this guarantee breaks.

@gggeek
Contributor

gggeek commented Jul 10, 2013

Side note: can we please make content staging the prime use case for SignalSlots and the REST API?
I think if we can implement that, all other use cases are a breeze ;-)

@kore
Contributor

kore commented Jul 11, 2013

@gggeek:

In fact, in all events used there (which we can consider precursors to signals), I serialize all data to avoid fetching it again when later events are treated. Case in point: with async processing, I can make sense of content-move info from a signal emitted a while ago, even if the object has been removed in the meantime.

This will not work as soon as you have multiple frontend nodes which fire events. We then need a global, sortable revision attached to the changes, and we must ensure the events are idempotent. This is not yet the case. Currently the events are very likely to be processed in the correct order, but this is not ensured by anything, and you will get inconsistencies when using them that way.

This is exactly one reason why I am against full value objects embedded in the events – it makes you assume you could implement something like this, while it won't work in environments consisting of more than one node. Eventual consistency in a multi-node environment requires a little more effort – otherwise you will "replicate" inconsistent data, which will be almost impossible to fix later.

We sure can work out a way for content distribution to work, even on top of the existing events, but it is more than just fetching the data and serializing it, if we want to make it resistant to common problems in network architectures.

@kore
Contributor

kore commented Jul 11, 2013

@patrickallaert

Value Objects in Signals

It has not been made for performance reasons, even if this is one of the benefit.
The problem with IDs is that we can't always reconstruct the information back. In the case of Delete*Signal, how can a remoteId, a parentId, a pathString or... be retrieved if the information has been removed (including from the cache). [...]

We were aware that there might be information missing, since we just added the default values as we needed them. I can see that the delete signal makes it hard to fetch additional information later ;-) … I would still suggest just adding the missing information to the event structs – but keeping it scalar values.

If we want to implement asynchronous processing of signals[...]

I agree with your concerns, but to me this is highly hypothetical, and nothing prevents creating asynchronous signals in the future, starting from a synchronous one and taking, at that moment, the design decision of what needs to be stored in a queue and how.
I would therefore rather go with the current approach, which also plays more nicely in terms of performance, since in a lot of cases full objects are required and the cache will often not contain the info, because we have invalidated it as we are dealing with changes regarding the concerned object.

We need this "asynchronous" processing as soon as we have multiple frontend nodes, I guess. Since the events are a public API, we won't be able to change them again later, but we would require two different event APIs (one strictly for localhost only, and one for multi-node environments). And since network architecture is not a trivial issue, and a topic often misunderstood, I guess the APIs might commonly get misused. From an architectural point of view I would still prefer the "safe" way and just use simple scalars.

Additional note: as far as I know, multi-node environments are a strategic goal, so we should optimize the architecture for that, and not for the much simpler single-node use cases.

Content\Search\Handler::deleteLocation()

At first glance, it doesn't look difficult to implement in current and future backends, while it is a bit more difficult in the Public API implementation. This is also a very common operation that should run as fast as possible. Putting the O(n) loop higher in the hierarchy would make it close to impossible to have a final O(1) operation in the backend.

It is just that I would prefer to remove basically all logic from the persistence backend and keep the API as slim as possible, with only the essential storage methods. That is the main point of this layer, after all. Abstraction, of course, always removes potential for optimizations.

In the end it is a matter of taste.

@gggeek
Contributor

gggeek commented Jul 11, 2013

@kore I see your point, but I'm not 100% sure about it.

(sorry for going a bit offtopic and detailed)

  • ez4 workflow events are managed via the db, so they are naturally sorted in a queue (general case)
  • the contentstaging system is supposed to work even when content is edited on multiple frontends - even though the more common use case is all editors connecting to the same editorial node
  • it stores sync events in its own table in the db - which again ensures ordering
  • proper usage of transactions makes us confident that the sync-events order is the same as the content-changes order
  • even when editors connect to one node, their actions are async with respect to one another
  • which, following your line of thinking, means we could get inconsistencies (if I got it right)
  • but this is averted by the events being fired on post-action triggers. E.g. editor A deletes node X, then editor B (tries to) move it immediately afterwards (as he was looking at the page of node X without reloading) => B will get an error msg in the GUI, and the sync event will not be triggered
  • so far the only problems observed are when sync events are not transmitted in the proper order (of course, sending an edit of X after a delete of X will not work)

Or are you thinking about architectures where there is no centralized db/nosqldb/whatever content storage?
Note that the current staging architecture is designed for master-slave scenarios, not for master-master ones (it might work in limited but common cases in cross-replication mode, where editors use node X and users use nodes Y-Z-W but can only edit separate content areas).

As for the "idempotent" part: replaying a "delete node" or "move node" can of course never be idempotent.

@kore
Contributor

kore commented Jul 11, 2013

@gggeek: Thanks for the input. This is definitely something we should keep in mind, but I will take this discussion out of this thread. I will write down a rough draft of how we can make this work, mentioning the involved problems, and I will CC you once I have written it down.

@patrickallaert
Contributor Author

@kore

Value Objects in Signals

If we want to implement asynchronous processing of signals[...]

I agree with your concerns, but to me this is highly hypothetical, and nothing prevents creating asynchronous signals in the future, starting from a synchronous one and taking, at that moment, the design decision of what needs to be stored in a queue and how.
I would therefore rather go with the current approach, which also plays more nicely in terms of performance, since in a lot of cases full objects are required and the cache will often not contain the info, because we have invalidated it as we are dealing with changes regarding the concerned object.

We need this "asynchronous" processing as soon as we have multiple frontend nodes, I guess. Since the events are a public API, we won't be able to change them again later, but we would require two different event APIs (one strictly for localhost only, and one for multi-node environments). And since network architecture is not a trivial issue, and a topic often misunderstood, I guess the APIs might commonly get misused. From an architectural point of view I would still prefer the "safe" way and just use simple scalars.

How would the information be stored in a queue? Would the whole signal object be serialized? In that case I see the benefit of having scalars only. Who would serialize the info? A generic slot catching all signals, or only the signals that need to be treated asynchronously?

I have the feeling that we can take the gentle approach of keeping Value Objects in signals to avoid costly fetch operations (the cache not helping here, and it has the benefit of simplifying the slots as well) and only store in a queue what needs to be stored from the signals, with some possible conversion. That way, we move some complexity from all Slot implementations to a possible conversion mechanism (which can most certainly be generalized).
E.g. a Signal $signal made of a Location $location and a User $user could be stored with:

array( "locationId" => $signal->location->id, "userId" => $signal->user->id );

@kore
Contributor

kore commented Jul 11, 2013

@patrickallaert:

With the old state, signals are just a name and a simple hash map. That's trivial to store in every format which comes to mind.

If we introduce a conversion layer:

  1. We must build this conversion layer, which, indeed, might be possible to implement in a generic way.
  2. We get two different APIs for event processing.
  3. If we want an event processor which can work both locally and remotely, we must build a second conversion layer for the other conversion direction – I'm not sure what our events will be used for.

Additionally, there are potential issues even with locally deferred event processing. Events / signals are asynchronous by nature – I would really dislike using them in another way. We should then, at least, also give them another name. (For the same reason, the Symfony2 kernel events should have been called Pipes & Filters, not events, because that is what they are.)

Also, event data must be considered immutable, but our value objects suggest otherwise. People will modify those value objects in one event handler, which will then affect all following handlers of the same event. This does not happen with scalar values.
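A two-line illustration of that hazard (hypothetical slot code):

// Slot A, executed first, mutates the shared value object:
$signal->location->priority = 0;
// Slot B, executed afterwards, silently observes the mutated object;
// with a scalar payload it would instead reload a fresh copy by ID.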

@gggeek
Contributor

gggeek commented Jul 11, 2013

@kore maybe outlining the different use cases for async events, as opposed to a pipe/filter system, is also something which could be beneficial in the document you mentioned writing. Apart from content staging, I can see generation of image variations or video transcoding (both need an async/deferred unit of work) and custom cache purging. And the holy grail: the no-shared-cache-with-minimally-chatty-intra-node-cache-expiry-messaging architecture.

@patrickallaert
Contributor Author

@kore

The PR has been reworked and Value Objects are no longer part of signals.

However, none of the other remarks have been taken into account here, as they are all out of the scope of the issue targeted by this PR (including s/_mid/_multi_id/, which I thought was introduced here but is already part of master as of 969e992). I guess those remarks should be addressed regardless of whether this PR is merged, as they remain valid anyway.

@andrerom
Contributor

+1

@kore
Contributor

kore commented Jul 16, 2013

Perfect. Thanks @patrickallaert :-)

+1

@@ -632,6 +659,24 @@ public function rollback()
}

--$this->transactionDepth;
unset( $this->commitEventsQueue[$this->transactionCount] );
Contributor


@patrickallaert A bit late maybe, but should this int be decreased here? Or should transactionDepth be used as the key instead?
Context: it looks like this will fail if there are several rollbacks in nested transactions.

Contributor Author


Ouch :-S Sorry @andrerom, I don't have a clue about this; it is way too old for my memory.

Contributor


It's not really relevant anymore :)
Signals are now by default sent only when the transaction they are in is committed, making them all transaction-safe and ready to be async-safe.
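For illustration, a minimal sketch of that transaction-aware dispatching, queuing signals per nesting depth so a nested rollback only discards its own scope (all names here are illustrative, not the actual implementation):

class QueueingSignalDispatcher
{
    private $transactionDepth = 0;

    /** @var array[] queued signals, keyed by transaction depth */
    private $queue = array();

    public function beginTransaction()
    {
        $this->queue[++$this->transactionDepth] = array();
    }

    public function commit()
    {
        $signals = $this->queue[$this->transactionDepth];
        unset( $this->queue[$this->transactionDepth] );

        if ( --$this->transactionDepth > 0 )
        {
            // Nested commit: hand the signals over to the enclosing transaction
            $this->queue[$this->transactionDepth] = array_merge(
                $this->queue[$this->transactionDepth],
                $signals
            );
            return;
        }

        // Outermost commit: now it is safe to deliver everything
        foreach ( $signals as $signal )
            $this->send( $signal );
    }

    public function rollback()
    {
        // Discard only the aborted scope's signals (cf. the nested-rollback
        // concern raised above)
        unset( $this->queue[$this->transactionDepth] );
        --$this->transactionDepth;
    }

    public function emit( $signal )
    {
        if ( $this->transactionDepth > 0 )
        {
            $this->queue[$this->transactionDepth][] = $signal;
            return;
        }
        $this->send( $signal );
    }

    private function send( $signal )
    {
        // Delivery to the attached slots is omitted in this sketch
    }
}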
