Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move metadata cache one layer up #6153

Merged

Conversation

youngyjd
Copy link
Contributor

@youngyjd youngyjd commented Oct 19, 2018

  1. pyhive is calling use default when a connection is initiated, https://github.com/dropbox/PyHive/blob/2c2446bf905ea321aac9dcdd3fa033909ff0b0b5/pyhive/hive.py#L205. This statement took 3 seconds in my experiment, so even metadata cache is enabled for Hive, user experience is still not very good.

Moving the cache one layer up can avoid running use default statement.

  1. Refactored the cache logic for multi schema metadata fetch. Every time when multi schema metadata fetch is updated, it refreshes table list cache for every schema as well.

@youngyjd youngyjd changed the title Update wording Move metadata cache one layer up Oct 24, 2018
@youngyjd youngyjd force-pushed the update-database-configuration-instruction branch 2 times, most recently from 994f50b to 199c7b5 Compare October 24, 2018 05:49
@codecov-io
Copy link

codecov-io commented Oct 24, 2018

Codecov Report

Merging #6153 into master will increase coverage by 0.03%.
The diff coverage is 40.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6153      +/-   ##
==========================================
+ Coverage   76.92%   76.95%   +0.03%     
==========================================
  Files          47       47              
  Lines        9388     9375      -13     
==========================================
- Hits         7222     7215       -7     
+ Misses       2166     2160       -6
Impacted Files Coverage Δ
superset/cli.py 35.85% <0%> (-0.15%) ⬇️
superset/config.py 92.36% <100%> (ø) ⬆️
superset/db_engine_specs.py 55.45% <15.38%> (-0.72%) ⬇️
superset/views/core.py 73.86% <27.27%> (-0.27%) ⬇️
superset/models/core.py 84.35% <48.64%> (+0.47%) ⬆️
superset/utils/cache.py 62.5% <64.28%> (+14.35%) ⬆️
superset/connectors/sqla/models.py 80.71% <0%> (-0.24%) ⬇️
superset/security.py 75.67% <0%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e21e6a7...b42a404. Read the comment docs.

@youngyjd
Copy link
Contributor Author

@betodealmeida @mistercrunch Please take a look

@youngyjd youngyjd force-pushed the update-database-configuration-instruction branch 3 times, most recently from fe747fb to 214eaa1 Compare October 25, 2018 20:12
@youngyjd youngyjd force-pushed the update-database-configuration-instruction branch from 1fd8623 to 06b3a25 Compare October 25, 2018 20:48
"""
if not self.allow_multi_schema_metadata_fetch:
return []
tables_dict = self.db_engine_spec.fetch_result_sets(self, 'table')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result from the method fetch_results_sets is super confusing... took me a while to understand line 879.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was feeling confused as well. let me see whether I can refactor it as well.

superset/cli.py Outdated
print('Fetching {} datasources ...'.format(database.name))
try:
database.all_table_names_in_database(
db_id=database.id, force=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to pass database.id here, if this is a method of database? Looking at other places where this method is called, looks like db_id is always the same as database.id, no?

cache_timeout=None, force=False):
"""
Parameters need to be passed as keyword arguments.
If cache=True, db_id must be passed in for setting cache key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you want the database id to be part of the key. In that case, I think it's better to create a new decorator reusing cache_util.memoized_func, that extracts the id from self. Here's a simple example showing the technique:

def instance_memo(attribute, cache={}):
    def decorator(func):
        def decorated(self, *args, **kwargs):
            # in this example, key is just `Database.id`
            key = getattr(self, attribute)
            if key not in cache:
                print('Writing to cache')
                cache[key] = func(self, *args, **kwargs)
            else:
                print('Reading from cache')
            return cache[key]
        return decorated
    return decorator


class Database:
    def __init__(self, id):
        self.id = id

    @instance_memo('id')
    def method(self):
        pass

Calling it we get:

>>> db1 = Database(1)
>>> db1.method()
Writing to cache
>>> db1.method()
Reading from cache
>>> db1.method()
Reading from cache
>>> db2 = Database(2)
>>> db2.method()
Writing to cache
>>> db2.method()
Reading from cache
>>> db2.method()
Reading from cache

inspector=self.inspector, schema=schema)
except Exception as e:
logging.exception(e)
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need pass here, you only need it in empty blocks (eg, if line 927 didn't have the logging).

except Exception:
inspector=self.inspector, schema=schema)
except Exception as e:
logging.exception(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, you can remove pass below.

@youngyjd
Copy link
Contributor Author

@betodealmeida Comments addressed. Please take another look.

@betodealmeida
Copy link
Member

Awesome! Thanks, @youngyjd!

@betodealmeida betodealmeida merged commit c552c12 into apache:master Oct 31, 2018
CACHE_CONFIG = {'CACHE_TYPE': 'null'}
TABLE_NAMES_CACHE_CONFIG = {'CACHE_TYPE': 'null'}
CACHE_CONFIG = {'CACHE_TYPE': 'simple'}
TABLE_NAMES_CACHE_CONFIG = {'CACHE_TYPE': 'simple'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you changed this for testing — can you revert the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right! My bad

youngyjd added a commit to lyft/incubator-superset that referenced this pull request Nov 2, 2018
* Update wording

* nit update for api endpoint url

* move metadata cache one layer up

* refactor cache

* fix flake8 and DatabaseTablesAsync

* nit

* remove logging for cache

* only fetch for all tables that allows cross schema fetch

* default allow_multi_schema_metadata_fetch to False

* address comments

* remove unused defaultdict

* flake 8

(cherry picked from commit c552c12)
youngyjd added a commit to lyft/incubator-superset that referenced this pull request Nov 2, 2018
* Update wording

* nit update for api endpoint url

* move metadata cache one layer up

* refactor cache

* fix flake8 and DatabaseTablesAsync

* nit

* remove logging for cache

* only fetch for all tables that allows cross schema fetch

* default allow_multi_schema_metadata_fetch to False

* address comments

* remove unused defaultdict

* flake 8

(cherry picked from commit c552c12)
bipinsoniguavus pushed a commit to ThalesGroup/incubator-superset that referenced this pull request Dec 26, 2018
* Update wording

* nit update for api endpoint url

* move metadata cache one layer up

* refactor cache

* fix flake8 and DatabaseTablesAsync

* nit

* remove logging for cache

* only fetch for all tables that allows cross schema fetch

* default allow_multi_schema_metadata_fetch to False

* address comments

* remove unused defaultdict

* flake 8
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants