Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-25900: Materialized view registry does not clean non existing views at refresh #2984

Closed

Conversation

kasakrisz
Copy link
Contributor

@kasakrisz kasakrisz commented Jan 27, 2022

What changes were proposed in this pull request?

  1. Split implementation of HiveMaterializedViewsRegistry to a dummy DummyMaterializedViewsRegistry and the default InMemoryMaterializedViewsRegistry
  2. Add functionality to remove dropped MVs from cache at refresh
  3. Add unit tests for Refresh algorithm

Why are the changes needed?

Dropped MVs stayed forver in the cache if the Drop operation was done by a different HS2 instance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestMiniLlapLocalCliDriver -Dqfile=materialized_view_create_rewrite_rebuild_dummy.q,materialized_view_create_rewrite_dummy.q -pl itests/qtest -Pitests
mvn test -Dtest=TestHiveMaterializedViewsRegistryLoader -pl ql

Copy link
Contributor

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @kasakrisz . I went over the PR and left some comments.

While reviewing this part of code I get the impression that the registry/cache is not thread safe. It also seems that multiple threads do perform operations on this registry/cache concurrently which makes me think that we should invest in making it thread safe.

The changes in this PR stem from the fact that we may end up with an OOM cause we do not invalidate the entries in the cache. Cache invalidation though could come almost for free if we used some existing cache implementation (such as those provided by Guava). The additional benefit is that most existing cache implementations are thread-safe by design.

Comment on lines 101 to 106
/**
* Registry for materialized views. The goal of this cache is to avoid parsing and creating
* logical plans for the materialized views at query runtime. When a query arrives, we will
* just need to consult this cache and extract the logical plans for the views (which had
* already been parsed) from it. This cache lives in HS2.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit a confusing that we have the same javadoc twice (here and in the top level class).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed one of them

* just need to consult this cache and extract the logical plans for the views (which had
* already been parsed) from it. This cache lives in HS2.
*/
public interface MaterializedViewsRegistry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to have both a class (HiveMaterializedViewsRegistry) and an inner interface (MaterializedViewsRegistry)? This pattern is not very easy to follow and it is not clear what are the responsibilities of each.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to extract the Dummy and Default implementations to separate classes. The MaterializedViewsRegistry interface is the common interface of the two implementations. Moved it to a separate file.


private final MaterializedViewsCache materializedViewsCache = new MaterializedViewsCache();
/**
* Adds a newly created materialized view to the cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Use "cache" or "registry" but not both. Probably we should follow the name of the class/interface.

It seems that some methods mention registry while others mention cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed cache to registry, since this is more like a registry.

Comment on lines 145 to 149
/**
* Returns the materialized views in the cache for the given database.
*
* @return the collection of materialized views, or the empty collection if none
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc does not correspond to what this method does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 155 to 161
default List<HiveRelOptMaterialization> getRewritingMaterializedViews(String querySql) {
return Collections.emptyList();
}

default boolean isEmpty() {
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing javadoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment on lines 229 to 163
public interface MaterializedViewObjects {
List<Table> getAllMaterializedViewObjectsForRewriting() throws HiveException;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new interface gives the impression that we are adding complexity instead of simplifying things. Keeping a List of tables looks simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was to make the Loader testable. The MaterializedViewObjects represents the HMS client. The interface enables us to mock it in tests. Please see TestHiveMaterializedViewsRegistryLoader.

By default the getAllMaterializedViewObjectsForRewriting method is called periodically by the registry.

Could you please show an example with List of tables?

protected final HiveConf hiveConf;
protected final MaterializedViewsRegistry materializedViewsRegistry;
protected final MaterializedViewObjects materializedViewObjects;
/* Whether the cache has been initialized or not. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bogus comment?

Suggested change
/* Whether the cache has been initialized or not. */
/* Whether the cache has been initialized or not. */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


private Loader(Hive db) {
this.db = db;
public static class Loader implements Runnable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the class need to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to package private.


for (HiveRelOptMaterialization materialization : materializedViewsRegistry.getRewritingMaterializedViews()) {
Table mvTableInCache = HiveMaterializedViewUtils.extractTable(materialization);
Table mvTableInHMS = materializedViewObjects.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we keep materializedViewObjects up-to-date with the HMS? It seems that we obtain the views once when the Loader is created but then I don't see it changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update is scheduled at a fixed rate:

pool.scheduleAtFixedRate(new Loader(db.getConf(), SINGLETON, objects), 0, period, TimeUnit.SECONDS);

Comment on lines +294 to +230
private PerfLogger getPerfLogger() {
SessionState ss = new SessionState(hiveConf);
ss.setIsHiveServerQuery(true); // All is served from HS2, we do not need e.g. Tez sessions
SessionState.start(ss);
return SessionState.getPerfLogger();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the perf logger? Using a session just to obtain a logger seems a bit of an overkill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added by https://issues.apache.org/jira/browse/HIVE-21344 and I think the intention was to monitor the refresh rate.

@github-actions
Copy link

github-actions bot commented Apr 4, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Apr 4, 2022
@github-actions github-actions bot closed this Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants