Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pinot upsert features to pinot common #5175

Merged
merged 17 commits into from
Apr 24, 2020
Merged

add pinot upsert features to pinot common #5175

merged 17 commits into from
Apr 24, 2020

Conversation

jamesyfshao
Copy link
Contributor

@jamesyfshao jamesyfshao commented Mar 24, 2020

This diff will create necessary pinot schema/constant value changes to support pinot upsert features.

The following will be changed:

  1. add IngestionModeConfig field to schema and table definition:
    This field will contains several sub-fields:
    a. ingestionMode:
    ingestion mode could be either upsert/append.
    b. primaryKeys:
    a list of string that indicate which columns would be use as primary key of the current upsert table. However we are going to enforce primary key to be one only only one column. In the future we can explore the support for multiple primary keys
    c. offsetKey:
    a string that indicate which column would be use as offset of the current pinot upsert table.

design doc for pinot upsert feature:

https://cwiki.apache.org/confluence/display/PINOT/%5BProposal%5D+Pinot+upsert+design+doc
this change focus on the changes in pinot common repo

@codecov-io
Copy link

codecov-io commented Mar 24, 2020

Codecov Report

Merging #5175 into master will decrease coverage by 9.25%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5175      +/-   ##
==========================================
- Coverage   66.20%   56.95%   -9.26%     
==========================================
  Files        1067     1068       +1     
  Lines       54532    54570      +38     
  Branches     8128     8133       +5     
==========================================
- Hits        36104    31080    -5024     
- Misses      15786    21070    +5284     
+ Partials     2642     2420     -222     
Impacted Files Coverage Δ
...ain/java/org/apache/pinot/spi/utils/JsonUtils.java 30.00% <ø> (ø)
...org/apache/pinot/spi/config/table/TableConfig.java 64.91% <60.00%> (-1.13%) ⬇️
...he/pinot/common/utils/config/TableConfigUtils.java 86.31% <100.00%> (-3.46%) ⬇️
...java/org/apache/pinot/spi/config/UpsertConfig.java 100.00% <100.00%> (ø)
...he/pinot/spi/utils/builder/TableConfigBuilder.java 73.39% <100.00%> (+0.49%) ⬆️
...a/org/apache/pinot/minion/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/minion/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/core/query/reduce/ComparisonFunction.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
... and 316 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8a2705...a5fd4b9. Read the comment docs.

Copy link
Member

@kishoreg kishoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see the PR.

@@ -75,6 +79,9 @@
private Map<InstancePartitionsType, InstanceAssignmentConfig> _instanceAssignmentConfigMap;
private List<FieldConfig> _fieldConfigList;

@JsonPropertyDescription(value = "The update semantic of the table, either append or upsert, default as append")
private UpdateSemantic _updateSemantic;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to IngestMode or just have a boolean enableUpsert?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for real-time? We already have push type of either APPEND or REFRESH for offline; aggregateMetrics for real-time. What does UPSERT stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kishoreg I am thinking if using enum here will be more flexible in the case of future features changes or whatnot. Boolean seems to be only limiting two options and it might limit what other changes we might want to do with pinot. Please let me what you think. ingestionMode definitely sounds more explicit than what I am using here and I will definite consider about this.

@Jackie-Jiang yes. So basically it is previous old design about we want to support update in pinot realtime ingestion by adding override by primary key (eg, a second message in Kafka with the same primary key represents the update of existing key instead of two records with the same primary key). Upsert stands for this new ingestion semantic (insert if no duplicated primary key, update if duplicate primary key). This feature will be applying to realtime table only

* @param type the type of the table for it to fill in if the type info is missing
* @return the table type name with the type info
*/
public static String ensureTableNameWithType(String tableName, TableType type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use TableNameBuilder.forType(type).tableNameWithType(tableName)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this method, @kishoreg this method is used in other components that I will submit PR later

@@ -26,7 +26,8 @@
*
*/
public enum BrokerGauge implements AbstractMetrics.Gauge {
QUERY_QUOTA_CAPACITY_UTILIZATION_RATE("tables", false), NETTY_CONNECTION_CONNECT_TIME_MS("nettyConnection", true);
QUERY_QUOTA_CAPACITY_UTILIZATION_RATE("tables", false), NETTY_CONNECTION_CONNECT_TIME_MS("nettyConnection", true),
TABLE_MIN_LOW_WATER_MARK("tables", false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this represent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It represents how many table pinot broker is collecting the metrics for. The exact usage of the metrics will show up in later diffs


// upsert related config
// primary key refers to the column name that is the primary key of this upsert table
private String _primaryKey;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the primary key is made up of two columns?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or multiple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now we don't have support for multiple primary keys. Do you think it will be an important feature to add multiple primary key support right now or you think it is fine to postpone it to later diff?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to design it for multiple, even if the implementation can handle only one primary key

Copy link
Contributor

@chethanuk chethanuk Mar 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't have support for multiple primary keys

There will be some cases where multiple keys make sense right? Like in clickstream: userId and timestamp
Will support for multiple keys be added later on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kishoreg @chethanuk I have changed the primary key in schema config to be a list instead of a single string. For now we are still performing check to ensure that primary key will be one and only one. However we should be able to change that once we have mulitple-primary-key support in place

pom.xml Outdated Show resolved Hide resolved
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please link the design doc in the pr description so that it is easier to understand the concept of upsert?
Since this pr is touching the core configs, we need to spend some time discussing how to design the config keys because they are very hard to change in the future

@@ -30,17 +30,21 @@
import java.util.List;
import java.util.Map;
import javax.annotation.Nullable;
import javax.validation.constraints.Null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue with merging, fixed. Thanks for pointing it out

@@ -75,6 +79,9 @@
private Map<InstancePartitionsType, InstanceAssignmentConfig> _instanceAssignmentConfigMap;
private List<FieldConfig> _fieldConfigList;

@JsonPropertyDescription(value = "The update semantic of the table, either append or upsert, default as append")
private UpdateSemantic _updateSemantic;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for real-time? We already have push type of either APPEND or REFRESH for offline; aggregateMetrics for real-time. What does UPSERT stand for?

* @param type the type of the table for it to fill in if the type info is missing
* @return the table type name with the type info
*/
public static String ensureTableNameWithType(String tableName, TableType type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use TableNameBuilder.forType(type).tableNameWithType(tableName)

@jamesyfshao
Copy link
Contributor Author

Can you please link the design doc in the pr description so that it is easier to understand the concept of upsert?
Since this pr is touching the core configs, we need to spend some time discussing how to design the config keys because they are very hard to change in the future

Linked the design doc in the description. also google doc design doc with comments. We have discussed this a long time back but I definitely open to go through it again as things have already changed. Do you think we should do some remote meeting sync or something else?

@kishoreg
Copy link
Member

One concern here would be that the update related configuration appears in multiple places. One thing that can simplify this PR is to add a UpdateConfig under TableConfig and move everything under that. That allows you to mark that config as beta/unstable and iterate quickly

  • updatesemantic (or the new name)
  • primarykey
  • offsetkey

@chethanuk
Copy link
Contributor

chethanuk commented Mar 25, 2020

also google doc design doc with comments.

It's asking uber login. Can you change share settings?

remote meeting sync or something else

@kishoreg At least monthly meeting and sync up on project status will be helpful to all...

@jamesyfshao
Copy link
Contributor Author

also google doc design doc with comments.

It's asking uber login. Can you change share settings?

remote meeting sync or something else

@kishoreg At least monthly meeting and sync up on project status will be helpful to all...

Yeah our google doc account sharing setting is kind of wanky that it doesn't allow blanket share with everyone outside our organization. I hate it but it is really out of my control. However, just submit a request to view/comment and I can add you as soon as I see it.

We originally plan to have a meeting earlier this month but the pandemic kind of mess up with all of our plans. Maybe we can have a work-group specific remote meeting (eg, upsert meeting with kishore/jackie/chethan etc) in the meantime and do the pinot committee meeting later

@Jackie-Jiang @kishoreg @chethanuk let me know if you feel meeting is necessary or you prefer me to update the schema first and see if a remote meeting is necessary

Copy link
Member

@kishoreg kishoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking better. I still feel that it might be easier to iterate and less invasive if we move everything related to upsertfeature into TableConfig.UpsertConfig and move all the changed in Schema and few things in TableConfig like UpdateSemantic etc to TableConfig.UpsertConfig.

That will allow you to iterate on UpsertConfig very fast, you can mark that config as unstable for now.

_updateSemantic = updateSemantic;
}

public boolean isTableForUpsert() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid adding additional utility methods to config classes if possible

@jamesyfshao
Copy link
Contributor Author

@kishoreg @mcvsubbu I have updated the diff per suggestion you guys gave in the last meeting, please feel free to take a look

@Jackie-Jiang I have updated the diff to make sure #1: it is only related table config changes required by pinot upsert feature (see doc for details) and #2 push other pinot-common related changes for pinot upsert features to later PRs

@jamesyfshao
Copy link
Contributor Author

@mcvsubbu @Jackie-Jiang updated per your feedback and make the config to be used only if the table is for upsert and clean up on naming, please take a look when you have time

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Please make sure the test can pass

@jamesyfshao
Copy link
Contributor Author

hi @mcvsubbu we have updated the pr following the comments. Appreciate if you can take a look to see if anything else you feel we need to update further

Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, other than that, lgtm thanks

jamesyfshao and others added 4 commits April 24, 2020 13:22
…ig.java

Co-Authored-By: Subbu Subramaniam <mcvsubbu@users.noreply.github.com>
…ig.java

Co-Authored-By: Subbu Subramaniam <mcvsubbu@users.noreply.github.com>
…ig.java

Co-Authored-By: Subbu Subramaniam <mcvsubbu@users.noreply.github.com>
…ig.java

Co-Authored-By: Subbu Subramaniam <mcvsubbu@users.noreply.github.com>
@jamesyfshao
Copy link
Contributor Author

minor comments, other than that, lgtm thanks

merged the comments diff, thanks for checking on those details. Do you think any further changes you want? It would be great if you can change your review status from requesting-changes so I can start merging it in

@mcvsubbu
Copy link
Contributor

@jamesyfshao I am good with this PR, thanks.

@jamesyfshao jamesyfshao merged commit 22865e6 into master Apr 24, 2020
@Jackie-Jiang Jackie-Jiang deleted the upsert-pr-land branch August 25, 2020 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants