Skip to content
This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Retrieve sub providers within Flickr #420

Merged
merged 27 commits into from
Jun 22, 2020
Merged

Conversation

ChariniNana
Copy link
Contributor

@ChariniNana ChariniNana commented Jun 3, 2020

Fixes

Fixes #419 by @ChariniNana, Related to #392
Fixed #414 by @kgodey

Description

This addresses the requirement of retrieving sub providers within Flickr. For the time being, it only considers the nasa and bio diversity sub providers. There are seven users currently considered under nasa which may need to be extended/modified later on. The list of sub-providers considered too may be expanded in the future. There are two aspects to this requirement which are as follows:

  1. Retrieve sub-providers at the API level, as and when pulling data from Flickr API.
  2. Update the existing Flickr related information present in the database to reflect the sub-provider information.

Technical details

We maintain a mapping of the sub providers and the IDs of the users (what is contained in the owner field of the API response) that come under each sub provider.

  1. At the API script level, when an image is processed, we check whether the user ID (owner) has a corresponding sub provider in the mapping, and set the source field in the Image Store to the relevant sub provider. Otherwise, the default sub provider Flickr is used.
  2. At the DB level, we initially create a temporary table with the creator URLs (which is the field containing the user ID) and the corresponding sub provider value (such as 'nasa'). Then a join with the image table (where the provider image data is stored) is performed on the creator URL field. Subsequently, the source field is set to the sub provider value from the temporary table, where the creator URLs match, or value 'flickr' otherwise.

Tests

  1. API script level sub provider retrieval: The function test_process_image_data_with_sub_provider within test_flickr test suite checks whether the source is properly set when a sub provider from our mapping is encountered.
  2. DB level sub provider update: The functions test_create_temp_sub_provider_table and test_update_sub_providers within test_sql checks the creation of the temporary table (to help with the DB update) and the successful updating of the image table respectively.
  3. Test for the workflow created for DB sub-provider update is test_sub_provider_update_workflow

I locally tested that the update of the table happens successfully via the sub_provider_update_workflow. The ID, PROVIDER and SOURCE fields of the table look as follows before and after the update.
Before:

  id   | provider | source 
-------+----------+--------
 14335 | flickr   | flickr
 14341 | flickr   | flickr
 14344 | flickr   | flickr
 14351 | flickr   | flickr
 14355 | flickr   | flickr
 14357 | flickr   | flickr
 14361 | flickr   | flickr
 14364 | flickr   | flickr
 14366 | flickr   | flickr
 14369 | flickr   | flickr
 14372 | flickr   | flickr
 14375 | flickr   | flickr
 14378 | flickr   | flickr
 14382 | flickr   | flickr
 40784 | flickr   | flickr
 47237 | flickr   | flickr
 47242 | flickr   | flickr
 47244 | flickr   | flickr
 47245 | flickr   | flickr

After:

  id   | provider |    source     
-------+----------+---------------
 14335 | flickr   | bio_diversity
 14341 | flickr   | bio_diversity
 14344 | flickr   | bio_diversity
 14351 | flickr   | bio_diversity
 14355 | flickr   | bio_diversity
 14357 | flickr   | bio_diversity
 14361 | flickr   | bio_diversity
 14364 | flickr   | bio_diversity
 14366 | flickr   | bio_diversity
 14369 | flickr   | bio_diversity
 14372 | flickr   | bio_diversity
 14375 | flickr   | bio_diversity
 14378 | flickr   | bio_diversity
 14382 | flickr   | bio_diversity
 40784 | flickr   | nasa
 47237 | flickr   | nasa
 47242 | flickr   | nasa
 47244 | flickr   | nasa
 47245 | flickr   | nasa

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@ChariniNana ChariniNana requested review from a team and mathemancer and removed request for a team June 3, 2020 21:33
@ChariniNana
Copy link
Contributor Author

I think I will later need to have the mapping external to the API script such that it's accessible by the database updating script

@ChariniNana ChariniNana self-assigned this Jun 8, 2020
@ChariniNana ChariniNana marked this pull request as draft June 8, 2020 08:42
@ChariniNana ChariniNana marked this pull request as ready for review June 9, 2020 21:54
@ChariniNana ChariniNana requested review from a team and kss682 and removed request for a team June 9, 2020 21:54
Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main substantive changes I'd ask for are:

  • remove the step copying the provider over to the source column.
  • remove the SpaceX user from the NASA subprovider.

Other than that, please double-check the new changes with pycodestyle; there's some extra whitespace hanging around here and there.

We'll need to test the performance of the table update at scale.

src/cc_catalog_airflow/dags/util/loader/sql.py Outdated Show resolved Hide resolved
Comment on lines 371 to 376
UPDATE {image_table}
SET {col.SOURCE} = public.{temp_table}.{col.PROVIDER}
FROM public.{temp_table}
WHERE
{image_table}.{col.CREATOR_URL} = public.{temp_table}.{
col.CREATOR_URL};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to test this at scale to see whether we need an index to make this workable. However, we could always add the index within this function, use it, then drop it (to avoid slowing down other things. If the index is added concurrently, that wouldn't block too much. I think it might be worth it, since we're looping through a number of creator URLs (and that number is expected to grow); we'd get to reuse the index.

Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a request that I forgot before.

Comment on lines 266 to 267
source = next((s for s in SUB_PROVIDERS if owner in SUB_PROVIDERS[s]),
PROVIDER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I forgot in my earlier review:

Please pass SUB_PROVIDERS and PROVIDER in as parameters. It will make it easier to experiment with other sub-provider sets down the road, and makes testing more robust, since you can pass in precisely the subprovider list you want to test against.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean pass them in as parameters to _process_image_data? Then we need to decide how far up the parameter passing should go. Should it be from where the _process_interval method is called from within the main method because that's the starting point of the flow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of making them defaults in `_process_image_data. That is, the signature would be:

def _process_image_data(image_data, sub_providers=SUB_PROVIDERS, provider=PROVIDER):

Then the further up functions don't need to know about them. The point would be to enable passing different values for testing, and if someone wants to use the function in a not-yet-thought-of manner, but avoid having functions that are already using it needing more info than necessary to call the function.

Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome.

Have you tested to make sure the variant methods work? You could parameterize the test you already have to do so.

I'd like to be able to change the method used via environment variable in the near term.

I made a couple of notes about switching some SQL statements around to use the indexes more efficiently (AND isn't commutative in this situation).

postgres.run(
dedent(
f'''
CREATE INDEX IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be concurrent to avoid locking

Suggested change
CREATE INDEX IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx
CREATE INDEX CONCURRENTLY IF NOT EXISTS {image_table}_{col.CREATOR_URL}_idx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not supported. I get the following error: psycopg2.errors.ActiveSqlTransaction: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
The suggestion I see for this issue on forums is to create the index on the empty table which is not possible in our case

Comment on lines 444 to 446
{image_table}.{col.FOREIGN_ID} = '{foreign_id}'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the index is set up means this won't use it, but my suggestion will:

Suggested change
{image_table}.{col.FOREIGN_ID} = '{foreign_id}'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
{image_table}.{col.PROVIDER} = '{default_provider}'
AND
MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}');

The switch in order and adding of md5s aligns with the precise index so that the planner will set up a complete index scan, which will be as fast as possible.

Comment on lines 484 to 486
{image_table}.{col.FOREIGN_ID} = 'foreign_id'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{image_table}.{col.FOREIGN_ID} = 'foreign_id'
AND
{image_table}.{col.PROVIDER} = '{default_provider}';
{image_table}.{col.PROVIDER} = '{default_provider}'
AND
MD5({image_table}.{col.FOREIGN_ID}) = MD5('{foreign_id}');

assert check_result


def test_update_sub_providers(postgres_with_load_and_image_table):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could parameterize this to check all three methods

Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks ready to go.

I took the liberty of adding a little logging so that we can see how many rows we're changing.

@mathemancer mathemancer merged commit 7eb308b into master Jun 22, 2020
@ChariniNana ChariniNana deleted the retrieve_subprovider branch June 23, 2020 09:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
2 participants