This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

🔄 synced file(s) with WordPress/openverse #963

Merged: 5 commits, Feb 2, 2023

20 changes: 18 additions & 2 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@
# here will be overwritten. Please make any necessary edits to these files:
# - https://github.com/WordPress/openverse/blob/main/.pre-commit-config.yaml.jinja

exclude: Pipfile\.lock|migrations|\.idea|node_modules|archive
exclude: Pipfile\.lock|migrations|\.idea|node_modules|archive|retired

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
@@ -30,7 +30,7 @@ repos:

# Use the `.isort.cfg` file to configure additional project-specific requirements.
- repo: https://github.com/PyCQA/isort
rev: 5.9.1
rev: 5.12.0
hooks:
- id: isort
files: \.py$
@@ -62,6 +62,22 @@ repos:
args:
- --safe

- repo: https://github.com/pycqa/pydocstyle
rev: 6.2.2 # 6.2.3 is slightly broken
hooks:
- id: pydocstyle
args:
- --convention=pep257
# Additional ignore reasons:
# D1xx: we do not want to force contributors to write redundant or useless docstrings
# D202: additional whitespace helps with readability
# D205: we don't want to always require a single line summary
# D211: same as D202
# D400: first line doesn't need to end in a period
# See the following documentation for what each rule does:
# https://www.pydocstyle.org/en/6.2.3/error_codes.html#error-codes
- --add-ignore=D1,D202,D205,D211,D400

# Use the `.prettierignore` and `.prettier.config.js` files to configure project-specific requirements.
- repo: https://github.com/pre-commit/mirrors-prettier
rev: v2.6.0
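
The pydocstyle hook added above applies the pep257 convention minus the ignored codes listed in the comments, so checks such as D401 (first line should be in imperative mood) still run; that is presumably why docstrings throughout this PR change from "Returns …" to "Return …". A minimal sketch of the effect on a hypothetical module (not part of this repository):

```python
def row_count(table: str) -> int:
    """Returns the number of rows in the table."""  # flagged by D401: not imperative mood
    ...


def row_count_fixed(table: str) -> int:
    """Return the number of rows in the table"""  # passes: D400 (no trailing period) is ignored above
    ...
```
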
22 changes: 11 additions & 11 deletions DAGs.md
@@ -127,6 +127,8 @@ The following is documentation associated with each DAG (where available):

## `airflow_log_cleanup`

### Clean up airflow logs

A maintenance workflow that you can deploy into Airflow to periodically clean
out the task logs to avoid those getting too big. By default, this will also
clean child process logs from the 'scheduler' directory.
@@ -183,7 +185,9 @@ and related PRs:

## `check_silenced_dags`

Checks for DAGs that have silenced Slack alerts which may need to be turned back
### Silenced DAGs check

Check for DAGs that have silenced Slack alerts which may need to be turned back
on.

When a DAG has known failures, it can be omitted from Slack error reporting by
@@ -393,10 +397,6 @@ or restrictions. https://nappy.co/

### OAuth Provider Authorization

**Author**: Madison Swain-Bowden

**Created**: 2021-10-13

Iterates through all the OAuth2 providers and attempts to authorize them using
tokens found in the `OAUTH2_AUTH_KEYS` Variable. Once authorization has
been completed successfully, the auth token is removed from that Variable. The
@@ -411,10 +411,6 @@ authorization will create an access/refresh token pair in the

### OAuth Provider Token Refresh

**Author**: Madison Swain-Bowden

**Created**: 2021-10-13

Iterates through all OAuth2 providers and attempts to refresh the access token
using the refresh token stored in the `OAUTH2_ACCESS_TOKENS` Variable. This DAG
will update the tokens stored in the Variable upon successful refresh.
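
A rough sketch of that refresh loop, assuming a standard OAuth2 refresh-token grant; the `OAUTH2_ACCESS_TOKENS` Variable name comes from the text above, while the provider registry, token endpoint, and payload shape are illustrative assumptions rather than the DAG's actual code:

```python
import requests
from airflow.models import Variable

# Hypothetical provider registry; the real DAG builds this elsewhere.
PROVIDERS = {
    "provider_a": {
        "token_url": "https://example.com/oauth/token",
        "client_id": "…",
        "client_secret": "…",
    },
}


def refresh_all_tokens() -> None:
    # Tokens are stored as JSON in an Airflow Variable, keyed by provider name.
    tokens = Variable.get("OAUTH2_ACCESS_TOKENS", deserialize_json=True)
    for name, conf in PROVIDERS.items():
        if name not in tokens:
            continue
        response = requests.post(
            conf["token_url"],
            data={
                "grant_type": "refresh_token",
                "refresh_token": tokens[name]["refresh_token"],
                "client_id": conf["client_id"],
                "client_secret": conf["client_secret"],
            },
            timeout=30,
        )
        response.raise_for_status()
        # Store the new access/refresh token pair back into the Variable.
        tokens[name] = response.json()
    Variable.set("OAUTH2_ACCESS_TOKENS", tokens, serialize_json=True)
```
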
@@ -445,6 +441,8 @@ Notes: http://phylopic.org/api/ No rate limit specified.

## `pr_review_reminders`

### PR Review Reminders

Iterates through open PRs in our repositories and pings assigned reviewers who
have not yet approved the PR or explicitly requested changes.
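
A hedged sketch of that check against the GitHub REST API; the endpoints are standard, but the token handling, repository selection, and the DAG's actual reminder/pinging logic are not shown here:

```python
import requests

session = requests.Session()
session.headers["Authorization"] = "token <personal-access-token>"


def reviewers_to_ping(owner: str, repo: str) -> dict[int, list[str]]:
    """Map PR number -> requested reviewers who have neither approved nor requested changes."""
    pings = {}
    prs = session.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "open"},
        timeout=30,
    ).json()
    for pr in prs:
        requested = {r["login"] for r in pr["requested_reviewers"]}
        reviews = session.get(pr["url"] + "/reviews", timeout=30).json()
        responded = {
            rv["user"]["login"]
            for rv in reviews
            if rv["state"] in ("APPROVED", "CHANGES_REQUESTED")
        }
        outstanding = sorted(requested - responded)
        if outstanding:
            pings[pr["number"]] = outstanding
    return pings
```
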

@@ -515,8 +513,10 @@ so is only counted once in the reporting by this DAG.

## `rotate_db_snapshots`

Manages weekly database snapshots. RDS does not support weekly snapshots
schedules on its own, so we need a DAG to manage this for us.
Manages weekly database snapshots.

RDS does not support weekly snapshot schedules on its own, so we need a DAG to
manage this for us.

It runs on Saturdays at 00:00 UTC in order to happen before the data refresh.
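
A minimal sketch of the kind of create-and-prune rotation such a DAG performs with boto3, assuming manual snapshots and a simple retention count; identifiers and parameters here are illustrative, not the DAG's real configuration:

```python
from datetime import datetime, timezone

import boto3


def rotate_snapshots(db_instance: str, keep: int = 7) -> None:
    rds = boto3.client("rds")
    # Take a new manual snapshot; RDS automates daily snapshots but not weekly ones.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    rds.create_db_snapshot(
        DBInstanceIdentifier=db_instance,
        DBSnapshotIdentifier=f"{db_instance}-weekly-{stamp}",
    )
    # Delete the oldest manual snapshots beyond the retention count.
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier=db_instance, SnapshotType="manual"
    )["DBSnapshots"]
    snapshots.sort(
        key=lambda s: s.get("SnapshotCreateTime", datetime.min.replace(tzinfo=timezone.utc))
    )
    for old in snapshots[:-keep]:
        rds.delete_db_snapshot(DBSnapshotIdentifier=old["DBSnapshotIdentifier"])
```
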

4 changes: 1 addition & 3 deletions openverse_catalog/dags/common/extensions.py
@@ -5,9 +5,7 @@


def extract_filetype(url: str, media_type: str) -> str | None:
"""
Extracts the filetype from a media url extension.
"""
"""Extract the filetype from a media url extension."""
possible_filetype = url.split(".")[-1]
if possible_filetype in EXTENSIONS.get(media_type, {}):
return possible_filetype
4 changes: 1 addition & 3 deletions openverse_catalog/dags/common/github.py
@@ -3,9 +3,7 @@

class GitHubAPI:
def __init__(self, pat: str):
"""
:param pat: GitHub Personal Access Token to use to authenticate requests
"""
""":param pat: GitHub Personal Access Token to use to authenticate requests"""
self.session = requests.Session()
self.session.headers["Authorization"] = f"token {pat}"

2 changes: 1 addition & 1 deletion openverse_catalog/dags/common/helpers.py
@@ -8,7 +8,7 @@ class IngestionInput(NamedTuple):

def get_partitioned_reingestion_days(inputs: list[IngestionInput]):
"""
This method calculates day-shift lists for Provider API workflows.
Calculate day-shift lists for Provider API workflows.

The input should be a list of pairs of integers:

4 changes: 2 additions & 2 deletions openverse_catalog/dags/common/licenses/licenses.py
@@ -27,7 +27,7 @@ class InvalidLicenseURLException(Exception):
@lru_cache(maxsize=1024)
def get_license_info(license_url=None, license_=None, license_version=None):
"""
Returns a valid license, version, license URL tuple if possible.
Return a valid license, version, license URL tuple if possible.

Three optional arguments:
license_url: String URL to a CC license page.
@@ -182,7 +182,7 @@ def get_license_info_from_license_pair(
license_, license_version, pair_map=REVERSE_LICENSE_PATH_MAP
) -> tuple[str | None, str | None, str | None]:
"""
Validates a given license pair, and derives a license URL from it.
Validate a given license pair, and derive a license URL from it.

Returns both the validated pair and the derived license URL.
"""
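
For the pair-to-URL derivation mentioned in `get_license_info_from_license_pair`, the idea can be illustrated with a simplified helper (hypothetical; the real code validates the pair against `REVERSE_LICENSE_PATH_MAP` rather than formatting a URL directly, and licenses such as CC0 or the Public Domain Mark live under a different path):

```python
def derive_cc_license_url(license_: str, license_version: str) -> str:
    # e.g. ("by-nc", "4.0") -> "https://creativecommons.org/licenses/by-nc/4.0/"
    return f"https://creativecommons.org/licenses/{license_.lower()}/{license_version}/"


assert derive_cc_license_url("BY", "4.0") == "https://creativecommons.org/licenses/by/4.0/"
```
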
2 changes: 1 addition & 1 deletion openverse_catalog/dags/common/loader/loader.py
@@ -43,7 +43,7 @@ def upsert_data(
duplicates_count: tuple[int, int],
) -> RecordMetrics:
"""
Upserts data into the catalog DB from the loading table, and calculates
Upsert data into the catalog DB from the loading table, and calculate
final record metrics.
"""
missing_columns, foreign_id_dup = duplicates_count
7 changes: 6 additions & 1 deletion openverse_catalog/dags/common/loader/paths.py
@@ -8,6 +8,8 @@

def _extract_media_type(tsv_file_name: str | None) -> str:
"""
Extract the media type from a filename.

By default, the filename will be:
`folder/provider_timestamp.tsv` for older versions
`folder/provider_<media_type>_timestamp.tsv` for newer versions
@@ -27,7 +29,10 @@ def _extract_media_type(tsv_file_name: str | None) -> str:


def get_tsv_version(tsv_file_name: str) -> str:
"""TSV file version can be deducted from the filename
"""
Extract the TSV version from a filename.

TSV file version can be deduced from the filename using:
v0: without _vN_ in the filename
v1+: has a _vN in the filename

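
A sketch of how the media type and TSV version can be read off such filenames; the regexes and the image/v0 defaults are assumptions based on the naming scheme described above, not the module's actual parsing code:

```python
import re


def parse_tsv_name(tsv_file_name: str) -> tuple[str, str]:
    """Return (media_type, version) for names like 'folder/provider_audio_v001_20230118.tsv'."""
    name = tsv_file_name.rsplit("/", 1)[-1]
    # Older files have no explicit media type or version: provider_timestamp.tsv -> image, v0.
    media_match = re.search(r"_(image|audio)_", name)
    media_type = media_match.group(1) if media_match else "image"
    version_match = re.search(r"_v(\d+)", name)
    version = version_match.group(1) if version_match else "0"
    return media_type, version


assert parse_tsv_name("folder/provider_audio_v001_20230118.tsv") == ("audio", "001")
assert parse_tsv_name("folder/provider_20200310.tsv") == ("image", "0")
```
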
34 changes: 21 additions & 13 deletions openverse_catalog/dags/common/loader/sql.py
@@ -51,8 +51,12 @@


def create_column_definitions(table_columns: list[Column], is_loading=True):
"""Loading table should not have 'NOT NULL' constraints: all TSV values
are copied, and then the items without required columns are dropped"""
"""
Create column definitions for a table.

Loading table should not have 'NOT NULL' constraints: all TSV values
are copied, and then the items without required columns are dropped.
"""
definitions = [column.create_definition(is_loading) for column in table_columns]
return ",\n ".join(definitions)

@@ -62,9 +66,7 @@ def create_loading_table(
identifier: str,
media_type: str = IMAGE,
):
"""
Create intermediary table and indices if they do not exist
"""
"""Create intermediary table and indices if they do not exist."""
load_table = _get_load_table_name(identifier, media_type=media_type)
postgres = PostgresHook(postgres_conn_id=postgres_conn_id)
loading_table_columns = TSV_COLUMNS[media_type]
@@ -129,8 +131,10 @@ def load_local_data_to_intermediate_table(

def _handle_s3_load_result(cursor) -> int:
"""
Handle the results of the aws_s3.table_import_from_s3 function. Locally this will
return an integer, but on AWS infrastructure it will return a string similar to:
Handle the results of the aws_s3.table_import_from_s3 function.

Locally this will return an integer, but on AWS infrastructure it will return a
string similar to:

500 rows imported into relation "..." from file ... of ... bytes
"""
@@ -176,8 +180,9 @@ def clean_intermediate_table_data(
media_type: MediaType = IMAGE,
) -> tuple[int, int]:
"""
Necessary for old TSV files that have not been cleaned up,
using `MediaStore` class:
Clean the data in the intermediate table.

Necessary for old TSV files that have not been cleaned up using the `MediaStore` class:
Removes any rows without any of the required fields:
`url`, `license`, `license_version`, `foreign_id`.
Also removes any duplicate rows that have the same `provider`
@@ -216,9 +221,11 @@ def _is_tsv_column_from_different_version(
column: Column, media_type: str, tsv_version: str
) -> bool:
"""
Checks that column is a column that exists in TSV files (unlike the db-only
Check that a column appears in the available columns for a TSV version.

Check that the column is one that exists in TSV files (unlike the db-only
columns like IDENTIFIER or CREATED_ON), but is not available for `tsv_version`.
For example, Category column was added to Image TSV in version 001
For example, Category column was added to Image TSV in version 001:
>>> from common.storage import CATEGORY, DIRECT_URL
>>> _is_tsv_column_from_different_version(CATEGORY, IMAGE, '000')
True
@@ -227,7 +234,6 @@ def _is_tsv_column_from_different_version(
>>> from common.storage import IDENTIFIER
>>> _is_tsv_column_from_different_version(IDENTIFIER, IMAGE, '000')
False

"""
return (
column not in COLUMNS[media_type][tsv_version]
@@ -243,9 +249,11 @@ def upsert_records_to_db_table(
tsv_version: str = CURRENT_TSV_VERSION,
):
"""
Upserts newly ingested records from loading table into the main db table.
Upsert newly ingested records from loading table into the main db table.

For tsv columns that do not exist in the `tsv_version` for `media_type`,
a NULL value is used.

:param postgres_conn_id
:param identifier
:param db_table
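
For `_handle_s3_load_result` above, normalizing the two result shapes could look roughly like this (an assumed sketch; the real function reads the value from the cursor first):

```python
import re


def normalize_s3_load_result(result) -> int:
    # Locally the aws_s3 shim returns an int; on RDS it returns a status string
    # such as: 500 rows imported into relation "..." from file ... of ... bytes
    if isinstance(result, int):
        return result
    match = re.match(r"(\d+) rows imported", str(result))
    if match is None:
        raise ValueError(f"Unexpected aws_s3.table_import_from_s3 result: {result!r}")
    return int(match.group(1))


assert normalize_s3_load_result(500) == 500
assert normalize_s3_load_result('500 rows imported into relation "img" from file x of 123 bytes') == 500
```
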
13 changes: 8 additions & 5 deletions openverse_catalog/dags/common/log_cleanup.py
@@ -29,7 +29,9 @@ def dir_size_in_mb(dir_paths: list[Path] | Path):


def get_folders_to_delete(dag_log_folder: Path, max_log_age_in_days: int) -> list[Path]:
"""Returns a list of log folders that are older `than max_log_age_in_days`
"""
Return a list of log folders that are older than `max_log_age_in_days`.

The folder structure is as follows:
`{dag_id}/{task_id}/{timestamp}/{try}.log`
This function iterates over all `{timestamp}` folders, detects the
@@ -90,10 +92,11 @@ def clean_up(
should_delete: bool | str,
**kwargs,
) -> list[Path]:
"""Finds all log folders that were modified more than
`max_log_age_in_days` days ago, and
deletes them, if `should_delete` is True, or
logs them, if `should_delete` is False.
"""
Find and delete all log folders that were modified more than `max_log_age_in_days` days ago.

Deletion only happens if `should_delete` is True; otherwise the folders are
only logged.

:param base_log_folder: the folder in which dag log folders
are located.
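
A condensed sketch of the age check and delete-or-log behaviour described for `clean_up`; the folder layout comes from the docstring above, while the globbing details and size reporting of the real module are omitted:

```python
import logging
import shutil
import time
from pathlib import Path

logger = logging.getLogger(__name__)


def clean_up_logs(base_log_folder: Path, max_log_age_in_days: int, should_delete: bool) -> list[Path]:
    cutoff = time.time() - max_log_age_in_days * 24 * 60 * 60
    old_folders = [
        folder
        for folder in base_log_folder.glob("*/*/*")  # {dag_id}/{task_id}/{timestamp}/
        if folder.is_dir() and folder.stat().st_mtime < cutoff
    ]
    for folder in old_folders:
        if should_delete:
            shutil.rmtree(folder)
        else:
            logger.info("Would delete %s", folder)
    return old_folders
```
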
4 changes: 1 addition & 3 deletions openverse_catalog/dags/common/popularity/sql.py
@@ -303,9 +303,7 @@ def create_standardized_media_popularity_function(


def create_audioset_view_query():
"""
Returns SQL to create the audioset_view.
"""
"""Return SQL to create the audioset_view."""
return dedent(
f"""
CREATE VIEW public.{AUDIOSET_VIEW_NAME}
6 changes: 3 additions & 3 deletions openverse_catalog/dags/common/requester.py
@@ -22,15 +22,15 @@ class SocketConnectBlockedError(Exception):


class RetriesExceeded(Exception):
"""
Custom exception for when the number of allowed retries has been exceeded.
"""
"""Custom exception for when the number of allowed retries has been exceeded."""

pass


class DelayedRequester:
"""
Requester class with a built-in delay.

Provides methods `get` and `head` that are wrappers around the `requests`
module methods with the same name (i.e., it simply passes along whatever
arguments it receives). The difference is that when this class is initialized
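
The built-in delay can be pictured with a small sketch along these lines; this is a simplification, since the real `DelayedRequester` also provides `head` and more robust error handling:

```python
import time

import requests


class DelayedRequesterSketch:
    """Call `requests.get` but wait at least `delay` seconds between requests."""

    def __init__(self, delay: float = 0.0):
        self._delay = delay
        self._last_request = 0.0

    def get(self, url, **kwargs):
        wait = self._delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return requests.get(url, **kwargs)
```
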
@@ -10,6 +10,8 @@

class SingleRunExternalDAGsSensor(BaseSensorOperator):
"""
Sensor for ensuring only one DAG runs at a time.

Waits for a list of related DAGs, each assumed to have a similar Sensor,
to not be running. A related DAG is considered to be 'running' if it is
itself in the running state, and its corresponding wait task completed
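
A rough sketch of the poke logic implied by that description; the real sensor additionally checks each related DAG's own wait-task state, which is not reproduced here:

```python
from airflow.models import DagRun
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.state import State


class SingleRunExternalDAGsSensorSketch(BaseSensorOperator):
    """Succeed only when none of the given external DAGs has a running DagRun."""

    def __init__(self, *, external_dag_ids: list[str], **kwargs):
        super().__init__(**kwargs)
        self.external_dag_ids = external_dag_ids

    def poke(self, context) -> bool:
        running = [
            dag_id
            for dag_id in self.external_dag_ids
            if DagRun.find(dag_id=dag_id, state=State.RUNNING)
        ]
        if running:
            self.log.info("Waiting on running DAGs: %s", running)
        return not running
```
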