Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fet/use indexes to decide habtm join table behaviour #877

Closed

Conversation

mnelson4
Copy link
Contributor

@mnelson4 mnelson4 commented Jan 3, 2019

Problem this Pull Request solves

Restores the behaviour from #798 (which limits there to only being one join table entry for the foreign keys) but only for the join models that expect that behaviour (other models allow duplicates based on foreign keys).

WIP!

Exterior behaviour

This means join tables can either impose a uniqueness constraint based on foreign keys, or not. If a model defines a unique index/combined key (that is, a set of fields that must be globally unique for that model) then the EE_HABTM_Relation will respect that when adding a relation (instead of inserting a new entry, it will update the existing one).
However, if a model doesn't define a unique index, the default is to to allow duplicate join table entries. (We could have the opposite be the default, but this is most backward compatible IMO).

Implementation details

This primarily means join models should define a EE_Unique_Index that lists the fields that make the combined key. Those that don't get the old behaviour (which allowed duplicates).

Up for debate

The main disadvantage is that there's no way for REST API clients to know which behaviour to expect. We should expose that either in the /resources endpoint, or somehow in the schema.

How has this been tested

Not yet tested.

Checklist

@mnelson4 mnelson4 self-assigned this Jan 4, 2019
@nerrad
Copy link
Contributor

nerrad commented Jan 4, 2019

However, if a model doesn't define a unique index, the default is to to allow duplicate join table entries. (We could have the opposite be the default, but this is most backward compatible IMO).

I see why you think this is the most backward compatible because currently all HABTM relationships allow duplicate entries. However, I'm not sure that's something we want and is more of an accident. Allowing duplicate entries:

  • encourages bad db schema design
  • creates complications for REST API (which surfaced this)
  • creates potential for harder to follow php code implementing things (In the case of event question group the developer needs to understand the way EQG behaves.

If we were to implement this, I'd prefer to become more opinionated about the join table behaviour and this is added simply to accommodate the exception of EQG. So EQG would receive similar treatment as what you are suggesting here for index (maybe instead of UniqueIndex it'd be DuplicateIndex or something like that?).

However...

I applaud the effort to "fix" this in the code as opposed to make a change in the database level (and I think what you are suggesting would technically work) but I'm wondering if we should bite the bullet and fix the schema for EQG instead. There's two ways we could do this:

  1. Add specific columns for covering if the relation applies to primary or attendees (instead of one column which is why we have "duplicate" entries) OR
  2. store the primary/attendee information in the esp_extra_meta table.

With both options we'd have to tweak the queries involving question groups, although we probably could put in some back compat shims in the Event model (currently that appears to be the only place that includes EQG_primary in the query) to translate things like Event_Question_Group.EQG_primary to whatever join is needed.

With both options there'd be some sort of db migration needed to migrate the existing entries and delete the "duplicates".

The primary downside to option 1 is that the resource hit will be larger on migration I think because it involves altering the tables (so impact on ES could be greater).

The primary downside to option 2 is that it means an additional join in current queries.

My preference is option 2 because I think that provides a future proof way to add additional meta data to this relation down the road (eg purchaser).

Why?

You've identified the primary problem with fixing this in the code only Mike:

The main disadvantage is that there's no way for REST API clients to know which behaviour to expect. We should expose that either in the /resources endpoint, or somehow in the schema.

I can't overstate how significant a disadvantage this is and greatly complicates understanding not only how to interpret model data but also how to dynamically create CRUD actions for model data. It increases the code needed and with that increase in code surface area comes an increase in the potential for bugs. Now, that's okay if we need to accommodate a feature needed or desired. But we're talking about having to deal with the idiosyncrasy of a single join table that results from (in hindsight) a poorly designed schema. I don't think it's a good idea to shift the problem of this design downstream to code to handle if it's something we don't want (duplicate entries in the join table) to begin with.

For example, with EQG left as is, this means the js client I'm building would have to make sure that every time the relation endpoint is used, it includes the EQG_primary field because that is effectively an index field for the "uniqueness" of the relation.

Let's think about how we might communicate this information via the REST API should something like this (allowing for duplicates in join tables) land.

  • What schema would include that information? What endpoint response would include that information?
  • We'd need to communicate not only that the join allows for duplicates, but also what additional column(s) are required in queries (especially for removing relations).

In the end, I think it lends to much simpler and easier to reason with code if we make the fix at the db table level (which I realize means we have to tweak a few places in our existing code) so that we enforce the opinion that join tables should only have unique records for the relations.

@mnelson4
Copy link
Contributor Author

mnelson4 commented Jan 4, 2019

Allowing duplicate entries: encourages bad db schema design

I didn't totally follow your rationale for that, so I did some reading to try to figure out "the proper way" to have join tables. I see one stack overflow article saying that, at least in Doctrine 2, join tables should ONLY contain foreign keys.. That definition came up again elsewhere.

But I found at least one tutorial talking about "join tables" with "extra columns".

So I don't think it's been a hard-and-fast rule that join tables should only contain foreign keys, but I get the idea that's the most common definition.

So ya, having other columns on "join tables" is a no-no to start with. Having duplicates seems to be more of a no-no. I wasn't aware of that before.

this is added simply to accommodate the exception of EQG

Event-Question-Group is the only table that is treated as a a join table with an extra column which "allows duplicates" (as defined solely by the foreign keys). Registration_Payment and Event_Venue are other join-models with an extra fields, but they are meant to only have one entry per foreign-key combo.

But Event_Question_Group is very similar to several other models that are just two foreign keys with one or two extra columns: Answer, Question_Mailchimp_Field, and Promotion_Object and possibly others. The only difference between it and those other models is that it's treated as a join table in a HABTM relation between events and question groups.

So if we're talking about backward incompatible changes, and we don't like having some join tables "allow duplicates" while others don't (what this branch does), we could remove this inconsistency by just not using Event_Question_Group as a join-model. Same goes for another other models we might find that are used for EE_HABTM_Relations but also "allow duplicates" (although I don't know of any).

I'd really like to avoid changing the data structure as that seems like it will take an order of magnitude more effort, and the only problem I really see with it is that we're calling Event_Question_Group as join table although it breaks two of the conventions of join-tables (it has extra columns and "allows duplicates"). If we no longer call it a join table (by removing the relations entry in Event for 'Question_Group'=> new EE_HABTM_Relation('Event_Question_Group'), and the corresponding entry in Question_Group) then that problem is gone. Plus this data structure has been ok for the past several years, this is the first time we've had a complaint with it, so refactoring it based solely on this feels like a knee-jerk reaction.

What do you think of this proposed alternative? (Just don't use "Event_Question_Group" as a join table, and the necessary refactoring in the code for that).

@nerrad
Copy link
Contributor

nerrad commented Jan 5, 2019

So I don't think it's been a hard-and-fast rule that join tables should only contain foreign keys, but I get the idea that's the most common definition.

I want to try to be a bit clearer here. I'm not against join tables having extra columns when it makes sense and the extra data needs are static and very clear (to represent order etc). I'm specifically addressing the flaw I think exists because we have duplicate rows for the joins solely because of the poor choice to have a single column represent a binary value and have it possible to have multiples of that binary value. I gave reasons for why (among the suggestions I gave) I prefer the option of storing the data in extra meta as opposed to adding additional columns in the table but among those reasons I'm not saying that join tables should never have extra data in them.

So if we're talking about backward incompatible changes, and we don't like having some join tables "allow duplicates" while others don't (what this branch does), we could remove this inconsistency by just not using Event_Question_Group as a join-model.

I don't think this is a valid solution as it'll involve more code churn than either of the solutions I'm proposing. Effectively, changing the "label" we give a table doesn't change its purpose. It'd still be the only way we could represent the Question Group to Event relation - only we'd be complicating the process to add that relation compared to the rest of the relations in our model system.

Same goes for another other models we might find that are used for EE_HABTM_Relations but also "allow duplicates" (although I don't know of any).

I think before we talk about other EE_HABTM_Relations that "allow duplicates", I think we need to put that hypothetical to bed. Either we have other existing models with that kind of scenario that need fixed or we don't. Either we decide we're going to support it or we don't. I think we need a decision here to help improve the stability and usability of our code and I'm arguing that we need to take the necessary steps to be clear we don't support it and we won't support it (for all the reasons I've already given).

I'd really like to avoid changing the data structure as that seems like it will take an order of magnitude more effort,

I do agree there will be slightly more short term pain but I'm arguing for this because I'm confident it'll save us even more headaches down the road (and yes I'm aware of the impact of db schema changes on larger db's like EventSmart).

and the only problem I really see with it is that we're calling Event_Question_Group as join table although it breaks two of the conventions of join-tables (it has extra columns and "allows duplicates"). If we no longer call it a join table (by removing the relations entry in Event for 'Question_Group'=> new EE_HABTM_Relation('Event_Question_Group'), and the corresponding entry in Question_Group) then that problem is gone.

I disagree with this. Changing the label of a table does not change how its used. By removing its status as a HABTM Relation table, we're just effectively increasing the difficulty of representing the relation between Question Group and Event (of which the need has not disappeared). So while the problem prompting this discussion may "disappear" I think we'll introduce a new set of problems needing fixed. I'm aware that any fix has the potential for introducing a new set of problems but I'm confident that what I'm proposing will help make the codebase more solid and improve the stability of the db schema as opposed to simply trying to accommodate the current flaws.

Plus this data structure has been ok for the past several years, this is the first time we've had a complaint with it, so refactoring it based solely on this feels like a knee-jerk reaction.

Do we really know its been okay for the past several years? Is it possible there are some bugs and/or weird behaviour that have happened here and there over the years that are because of this table setup that we've just never identified yet? I think it's a pretty significant claim. Appears to have been okay might be slightly more accurate (and likely what you meant).

Also the nature of the "complaint" that is surfaced matters imo. There's real complications involved in any client code consuming and interacting with HABTM_relations in the EE schema if there's potential that there may be multiples of those relations with arbitrary indexes that define their "uniqueness". There's a very clear issue with that and an opportunity (especially with this affecting only ONE table) to simplify things and make them more in line with convention (which exists for good reason). In my developer career, I've read about and experienced the kind of pain that happens when db schema issues are covered over with clever code solutions instead of dealing with them "properly" (we're still working out what the "properly" part is). I want to avoid that future pain for us (especially the real pain being felt currently in building our js client for the REST API).

I feel confident I've stated my opinions on this in what direction we should take. I don't feel I'm the person that should make the decision on this so I don't think I have anything more to say other than to respond to additional questions for clarity etc. At this point I think @tn3rb should probably weigh in and make the decision on what path we take.

@tn3rb
Copy link
Member

tn3rb commented Jan 7, 2019

I just want to clarify a few things first before discussing any possible solutions...

having other columns on "join tables" is a no-no to start with

No that's incorrect. The links you posted are not saying that an intersection table can not have extra columns, they are merely pointing out that both Doctrine 2 and Hibernate need to map the join table as an entity table in order to access the data in those extra columns. So that's a constraint of those ORMs and not a constraint of SQL.

Here is a more pure explanation of Intersection tables and Intersection Data in SQL: https://docs.oracle.com/cd/B31104_02/books/ConfigApps/ConfigApps_TablesColumns7.html

Now that said, if we want to get technical about things, our tables like esp_event_question_group are not pure intersection tables at all because they possess their own auto-increment primary keys and do not utilize a composite primary key made up of the two foreign keys that define the relation. So in this sense, our tables behave like the entity tables that Doctrine 2 and Hibernate require for intersections with data columns.

It is THIS property that allows us to have duplicate entries in the first place. Had we chosen to build proper intersection tables that utilized a composite primary key, then duplicate entries would be absolutely impossible because there would be no way to have two records that both have a composite key like 123-123. My understanding was that initially, the model system was not capable of handling tables that did not have their own PK and/or could not handle only having a composite key, and so the decision was made to add auto-inc PKs to our intersection tables.

Now with regards to our specific tables, here are a couple of examples of how this is implemented:

Event Question Groups

Consists of an intersection table with intersection data columns.
It uses EE_HABTM_Relation in the foreign tables and EE_Belongs_To_Relation in the intersection.

  • EEM_Event->_model_relations->Event_Question_Group => EE_HABTM_Relation
  • EEM_Event_Question_Group->_model_relations->Event => EE_Belongs_To_Relation
  • EEM_Event_Question_Group->_model_relations->Question_Group => EE_Belongs_To_Relation
  • EEM_Question_Group->_model_relations->Event_Question_Group => EE_HABTM_Relation

Registration Payments

Consists of an intersection table with intersection data columns,
but uses EE_Has_Many_Relation in the foreign tables and EE_Belongs_To_Relation in the intersection.

  • EEM_Registration->_model_relations->Registration_Payment => EE_Has_Many_Relation
  • EEM_Registration_Payment->_model_relations->Registration => EE_Belongs_To_Relation
  • EEM_Registration_Payment->_model_relations->Payment => EE_Belongs_To_Relation
  • EEM_Payment->_model_relations->Registration_Payment => EE_Has_Many_Relation

So just to clarify, is it the usage of EE_HABTM_Relation instead of EE_Has_Many_Relation that allows for the duplicate entries?

@mnelson4
Copy link
Contributor Author

mnelson4 commented Jan 7, 2019

Summary (so you don't have to read all the above):

We discussed in slack https://eventespresso.slack.com/archives/C080DDVLM/p1546881500247100
From my perspective, this is the gist:

  • the original problem was that we'd like models being used as "join tables" (eg event_question_group and others) to not have any "duplicates" in them (based on their foreign keys; eg only allow one entry in event_question_group per EVT_ID and QSG_ID combo). This will simplify REST API (and model) code that was wants to add or remove relations
  • The problematic model that "allow duplicates" is Event_Question_Group. All other models used as join tables don't allow duplicates. This is a gotcha to REST API clients and model clients. We'd rather be consistent: any model/table used as a join table won't allow for duplicates.
  • we could do the above and stick with this PR, but EQG_primary needs to be considered part of event_question_group's combined primary key. (So two rows with the same EVT_ID and QSG_ID, but different EQG_primary, would NOT be considered duplicates.) We'd rather not do that because it's another gotcha for REST API and model clients.
  • so, we'll be adding a new column event_question_group.EQG_additional which will indicate if a question group applies to additional attendees or not.
  • with that, we'll also need to have a data migration script to update the event_question_group table
  • we'll then need to adjust client code
  • we'd like to also add event_question_group.EQG_purchaser to indicating purchasing agent question groups. We're a little on the fence on whether ot add a column we won't immediately be using, but are confident it will be useful for supporting the purchasing agent functionality.

How long will the above changes take? Initial development probably a few days, testing a few days, roll-out of DB schema changes to ES would be another day or two, and then probably extra support in helping clients run the data migration script (but theoretically it will just work).

This PR is an alternative, but it won't make the DB change for purchasing agent, and will introduce the gotcha that EQG_primary is part of the "combined key" for the event_question_group table.

@joshfeck as PM, you should probably ok the DB change and migration script before we get too far along in this. For the record, Brent and Darren are on board with it, I can go either way.

@joshfeck
Copy link
Contributor

joshfeck commented Jan 7, 2019

I think these are needed since 1) it's exposed a problem while building out the underpinnings needed for REM, and 2) This was a blocker when the first crack at the Purchasing Agent was made.

While we're adding to the esp_event_question_group table, could we also add one more field that's been discussed for another requested feature? We want to be able to add a context field to allow for different registration form contexts (default context would be the registration form). There's some backstory in codebase 4426

@mnelson4
Copy link
Contributor Author

mnelson4 commented Jan 7, 2019

KK I'll embark on another PR asap.
Yes we could add another column, but it is preferable to only add it if we're really confident it will have the right structure and we have actual plans to implement it. Otherwise we risk having another half-built feature which makes other work more complicated. (Granted, just adding a DB column isn't huge).
Another consideration: just adding a database column doesn't require a DMS; so adding a column is easy (except a bit of a headache on ES). We just need a DMS this time to modify existing data. (So I'd actually also be fine not adding EQG_purchaser because adding that when we actually implement the feature will be quite easy.)

@mnelson4 mnelson4 closed this Jan 7, 2019
@mnelson4 mnelson4 deleted the FET/use-indexes-to-decide-habtm-join-table-behaviour branch January 7, 2019 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants