…into retrieve_subprovider
I think I will later need to move the mapping outside the API script so that the database update script can access it.
The main substantive changes I'd ask for are:
- remove the step copying the provider over to the source column.
- remove the SpaceX user from the NASA subprovider.
Other than that, please double-check the new changes with `pycodestyle`; there's some extra whitespace hanging around here and there.
We'll need to test the performance of the table update at scale.
UPDATE {image_table}
SET {col.SOURCE} = public.{temp_table}.{col.PROVIDER}
FROM public.{temp_table}
WHERE
{image_table}.{col.CREATOR_URL} = public.{temp_table}.{
col.CREATOR_URL};
We will have to test this at scale to see whether we need an index to make this workable. However, we could always add the index within this function, use it, then drop it (to avoid slowing down other things). If the index is added concurrently, that wouldn't block too much. I think it might be worth it, since we're looping through a number of creator URLs (and that number is expected to grow); we'd get to reuse the index.
Added a request that I forgot before.
source = next((s for s in SUB_PROVIDERS if owner in SUB_PROVIDERS[s]),
              PROVIDER)
Something I forgot in my earlier review:
Please pass `SUB_PROVIDERS` and `PROVIDER` in as parameters. It will make it easier to experiment with other sub-provider sets down the road, and makes testing more robust, since you can pass in precisely the subprovider list you want to test against.
You mean pass them in as parameters to `_process_image_data`? Then we need to decide how far up the parameter passing should go. Should it start where the `_process_interval` method is called within the main method, since that's the starting point of the flow?
I was thinking of making them defaults in `_process_image_data`. That is, the signature would be:
def _process_image_data(image_data, sub_providers=SUB_PROVIDERS, provider=PROVIDER):
Then the functions further up don't need to know about them. The point is to allow passing different values for testing, or for using the function in some not-yet-thought-of way, without requiring the functions that already call it to supply more information than necessary.
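To illustrate the default-parameter idea, here is a minimal sketch. The mapping contents, the `"flickr"` provider string, and the simplified function body are placeholders assumed for illustration; only the signature and the `next(...)` lookup come from this thread.

```python
# Placeholder values for illustration; the real mapping lives in the API script.
PROVIDER = "flickr"
SUB_PROVIDERS = {
    "nasa": {"example_owner_1", "example_owner_2"},
    "bio_diversity": {"example_owner_3"},
}


def _process_image_data(image_data, sub_providers=SUB_PROVIDERS, provider=PROVIDER):
    """Return the source for one image record (body simplified for illustration)."""
    owner = image_data.get("owner")
    # Same lookup as in the PR: the first sub-provider whose owner set
    # contains this owner, falling back to the default provider.
    return next((s for s in sub_providers if owner in sub_providers[s]), provider)


# Existing callers stay unchanged and pick up the module-level defaults.
default_source = _process_image_data({"owner": "example_owner_1"})

# A test can pass exactly the mapping it wants to check against.
test_source = _process_image_data(
    {"owner": "tester"},
    sub_providers={"test_sub": {"tester"}},
    provider="test_provider",
)
fallback_source = _process_image_data(
    {"owner": "nobody"}, sub_providers={}, provider="test_provider"
)
```

Callers such as `_process_interval` would not need any change under this approach, since they never pass the two extra arguments.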
This is awesome.
Have you tested to make sure the variant methods work? You could parameterize the test you already have to do so.
I'd like to be able to change the method used via environment variable in the near term.
I made a couple of notes about switching some SQL statements around to use the indexes more efficiently (`AND` isn't commutative in this situation).
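One possible way to switch the update method via an environment variable is a small dispatch table. This is only a sketch: the variable name `SUB_PROVIDER_UPDATE_METHOD`, the method names, and the stub functions are all hypothetical, since the variants are not named in this thread.

```python
import os


# Hypothetical stand-ins for the update variants; real ones would run SQL.
def _update_with_temp_table():
    return "temp_table"


def _update_row_by_row():
    return "row_by_row"


UPDATE_METHODS = {
    "temp_table": _update_with_temp_table,
    "row_by_row": _update_row_by_row,
}
DEFAULT_METHOD = "temp_table"


def get_update_method(environ=os.environ):
    """Return the update function named by the (hypothetical) env variable."""
    name = environ.get("SUB_PROVIDER_UPDATE_METHOD", DEFAULT_METHOD)
    try:
        return UPDATE_METHODS[name]
    except KeyError:
        # Fail loudly on a typo instead of silently using the default.
        raise ValueError(f"Unknown update method: {name!r}") from None
```

Accepting the environment mapping as a parameter keeps the selection logic testable without touching the real process environment.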
postgres.run(
    dedent(
        f'''
        CREATE INDEX IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx
This will need to be concurrent to avoid locking:
- CREATE INDEX IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx
+ CREATE INDEX CONCURRENTLY IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx
Looks like this is not supported. I get the following error: `psycopg2.errors.ActiveSqlTransaction: CREATE INDEX CONCURRENTLY cannot run inside a transaction block`.
The suggestion I see for this issue on forums is to create the index on the empty table, which is not possible in our case.
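For context, psycopg2 opens a transaction implicitly on the first statement unless the connection is in autocommit mode, which is why the statement is rejected. One possible workaround is to run it with `autocommit` enabled. The sketch below assumes direct access to a psycopg2 connection; the helper name is hypothetical, and how this would fit the `postgres.run` wrapper used in the PR is not confirmed here.

```python
def create_index_concurrently(conn, table, column):
    """Create an index outside of a transaction block.

    `conn` is assumed to be a psycopg2 connection with no transaction in
    progress (psycopg2 refuses to change autocommit mid-transaction).
    `table` and `column` are interpolated directly, so they must be
    trusted identifiers, never user input.
    """
    index_name = f"{table}_{column}_idx"
    statement = (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} ({column});"
    )
    previous = conn.autocommit
    conn.autocommit = True  # statements now run outside a transaction block
    try:
        with conn.cursor() as cur:
            cur.execute(statement)
    finally:
        conn.autocommit = previous  # restore the caller's setting
    return statement
```

Whether a concurrent build is fast enough at scale would still need to be measured, as noted earlier in the review.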
{image_table}.{col.FOREIGN_ID} = '{foreign_id}'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
The way the index is set up means this won't use it, but my suggestion will:
- {image_table}.{col.FOREIGN_ID} = '{foreign_id}'
- AND
- {image_table}.{col.PROVIDER} = '{default_provider}';
+ {image_table}.{col.PROVIDER} = '{default_provider}'
+ AND
+ MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}');
The switch in order and adding of md5s aligns with the precise index so that the planner will set up a complete index scan, which will be as fast as possible.
{image_table}.{col.FOREIGN_ID} = 'foreign_id'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
- {image_table}.{col.FOREIGN_ID} = 'foreign_id'
- AND
- {image_table}.{col.PROVIDER} = '{default_provider}';
+ {image_table}.{col.PROVIDER} = '{default_provider}'
+ AND
+ MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}');
assert check_result


def test_update_sub_providers(postgres_with_load_and_image_table):
You could parameterize this to check all three methods
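A rough sketch of what that parameterization could look like with `pytest.mark.parametrize`. The three `update_variant_*` functions are hypothetical in-memory stand-ins; the real test would pass the actual update functions and keep using the `postgres_with_load_and_image_table` fixture.

```python
import pytest


# Hypothetical stand-ins for the three update variants under test.
def update_variant_a(rows):
    return [dict(r, source="nasa") for r in rows]


def update_variant_b(rows):
    return [{**r, "source": "nasa"} for r in rows]


def update_variant_c(rows):
    out = []
    for r in rows:
        updated = r.copy()
        updated["source"] = "nasa"
        out.append(updated)
    return out


@pytest.mark.parametrize(
    "update_fn", [update_variant_a, update_variant_b, update_variant_c]
)
def test_update_sub_providers(update_fn):
    # Every variant must produce the same final state from the same input.
    rows = [{"id": 1, "provider": "flickr", "source": "flickr"}]
    assert update_fn(rows) == [{"id": 1, "provider": "flickr", "source": "nasa"}]
```

pytest then reports each variant as its own test case, so a failure points at the specific method.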
I think this looks ready to go.
I took the liberty of adding a little logging so that we can see how many rows we're changing.
Fixes
Fixes #419 by @ChariniNana, Related to #392
Fixed #414 by @kgodey
Description
This addresses the requirement of retrieving sub providers within Flickr. For the time being, it only considers the nasa and bio diversity sub providers. Seven users are currently considered under nasa; this set may need to be extended or modified later on. The list of sub providers considered may also be expanded in the future. There are two aspects to this requirement, which are as follows:
Technical details
We maintain a mapping of the sub providers and the IDs of the users (what is contained in the owner field of the API response) that come under each sub provider.
Tests
- `test_process_image_data_with_sub_provider` within the `test_flickr` test suite checks whether the source is properly set when a sub provider from our mapping is encountered.
- `test_create_temp_sub_provider_table` and `test_update_sub_providers` within `test_sql` check the creation of the temporary table (to help with the DB update) and the successful updating of the image table, respectively.
- `test_sub_provider_update_workflow`
I locally tested that the update of the table happens successfully via the `sub_provider_update_workflow`. The ID, PROVIDER and SOURCE fields of the table look as follows before and after the update.
Before:
After:
Checklist
- … `Update index.md`).
- … `master` branch of the repository.
- I added or updated documentation (if applicable).
- … visible errors.
Developer Certificate of Origin