Create maintenance activity ETL workflow (#939)
* added the maintenance activity etl model in model.py
* added the MaintenanceActivity class in model.py
* refactored naming of classes to GitHub in model.py
* updated table name in model.py
* added logic to update github activity to handler.py
* added the update_github_activity method to handle updates of github data with respect to DynamoDB
* added _write_activity_to_dynamo method for github etl
* refactored comment in update_github_activity
* added a new method in the GitHubActivity class
* deleted handler.py since it will exist after Manasa's PR is merged in
* added more details to class and method
* update class GitHubActivityType values for variables: LATEST, MONTH, TOTAL
* added methods to github_activity_model.py
* added update_github_activity method to processor.py
* added code related to github data on handler.py and processor.py
* modified format for LATEST data store
* import github_activity_model to be used in snowflake_adapter.py
* added get_plugins_with_commits_in_window method and called the method in update_github_activity method
* modified class GitHubActivityType to reflect the schema of the maintenance activity ETL workflow
* update the get_query_timestamp_projection method
* pushed code for debugging
* added more sql code for fetching github commits and updated the model
* added comment to address testing code changes
* Testing etl workflow
* Tested the GitHubActivityType.TOTAL workflow and verified that all attributes of all items in github_activity_model.py have the correct data prior to writing data to table
* Verified that the attributes saved in each item for LATEST and TOTAL are correct prior to saving data to DynamoDB
* Verified data stored in each item for all github_activity_model types are correct prior to storing in DynamoDB
* refactored table_name and column name commit_count
* moved shared methods between install_activity and github_activity to utils.py and modified install_activity_model.py and github_activity_model.py to reflect the changes
* removed testing changes
* code cleanup
* added comment to clarify the need to convert plugin name to case sensitive form
* refactored import statements and variables
* addressed feedback with respect to refactoring variables
* addressed code review feedback
* fixed test errors
* fixed test errors and refactored code for readability
* fixed test errors
* added more details to comments in accumulator methods in snowflake_adapter.py
* testing if changing timestamp to utc fixes the errors
* revert the previous commit since the errors persist
* modified _format_timestamp to return timestamp in utc
* testing changes to see if errors are fixed
* testing changes to see if errors are fixed
* reverted changes since tests are still failing due to time difference in tests
* reverted changes due to test errors persisting
* removed unused import statements
* addressed all code review feedback
* refactored get_query to improve code readability
* updated install activity tests to reflect changes in variable renaming
* refactored handler.py to separate updates for install and github activities
* added changes such that commit data for hidden plugins is included in DynamoDB
* added changes to include commit data to excluded plugins
* added import statement
* refactored code to capture hidden plugins since there is no data for non-hidden plugins in the exclusion list
* revert changes adding hidden plugins
* adding terraform changes for accessing s3 in the data-workflows lambda
* refactored terraform code
* change if statement to elif statement since only one would execute
* addressed code review feedback and a few more optimizations
* refactored test_snowflake_adapter.py
* more code cleanup
* converted plugin_name to lower case to be stored in DynamoDB to maintain parity with the implementation of install activity
* refactored helpers.py and github_activity_model.py such that the dictionary contains lower case value of plugin name, and therefore there is no need for plugin_name.lower()
* addressed feedback and refactored github_activity_model.py even further
* added docstring to transform_and_write_to_dynamo in github_activity_model.py
* added docstring to snowflake_adapter method to get plugins commit count
* addressed docstring nits
Showing 11 changed files with 303 additions and 35 deletions.
    @@ -0,0 +1,130 @@
    import logging
    import time
    from datetime import datetime
    from enum import Enum, auto
    from typing import List, Union
    import os

    from pynamodb.models import Model
    from pynamodb.attributes import UnicodeAttribute, NumberAttribute

    from utils.utils import get_current_timestamp, date_to_utc_timestamp_in_millis, datetime_to_utc_timestamp_in_millis
    from plugin.helpers import _get_cache, _get_repo_to_plugin_dict


    LOGGER = logging.getLogger()
    TIMESTAMP_FORMAT = "TO_TIMESTAMP('{0:%Y-%m-%d %H:%M:%S}')"


    class GitHubActivityType(Enum):
        def __new__(cls, timestamp_formatter, type_identifier_formatter, query_projection, query_sorting):
            github_activity_type = object.__new__(cls)
            github_activity_type._value = auto()
            github_activity_type.timestamp_formatter = timestamp_formatter
            github_activity_type.type_identifier_formatter = type_identifier_formatter
            github_activity_type.query_projection = query_projection
            github_activity_type.query_sorting = query_sorting
            return github_activity_type

        LATEST = (datetime_to_utc_timestamp_in_millis, 'LATEST:{0}',
                  'repo AS name, to_timestamp(max(commit_author_date)) as latest_commit', 'name')
        MONTH = (date_to_utc_timestamp_in_millis, 'MONTH:{1:%Y%m}:{0}',
                 'repo AS name, date_trunc("month", to_date(commit_author_date)) as month, count(*) as commit_count',
                 'name, month')
        TOTAL = (lambda timestamp: None, 'TOTAL:{0}', 'repo AS name, count(*) as commit_count', 'name')

        def format_to_timestamp(self, timestamp: datetime) -> Union[int, None]:
            return self.timestamp_formatter(timestamp)

        def format_to_type_identifier(self, repo_name: str, identifier_timestamp: str) -> str:
            return self.type_identifier_formatter.format(repo_name, identifier_timestamp)

        def _create_subquery(self, plugins_by_earliest_ts: dict[str, datetime]) -> str:
            if self is GitHubActivityType.MONTH:
                return " OR ".join(
                    [
                        f"repo = '{name}' AND to_timestamp(commit_author_date) >= "
                        f"{TIMESTAMP_FORMAT.format(ts.replace(day=1))}"
                        for name, ts in plugins_by_earliest_ts.items()
                    ]
                )
            return f"""repo IN ({','.join([f"'{plugin}'" for plugin in plugins_by_earliest_ts.keys()])})"""

        def get_query(self, plugins_by_earliest_ts: dict[str, datetime]) -> str:
            return f"""
                    SELECT
                        {self.query_projection}
                    FROM
                        imaging.github.commits
                    WHERE
                        repo_type = 'plugin'
                        AND ({self._create_subquery(plugins_by_earliest_ts)})
                    GROUP BY {self.query_sorting}
                    ORDER BY {self.query_sorting}
                    """
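To make the format strings in `GitHubActivityType` concrete, here is a minimal sketch of how the type identifiers and the `MONTH` predicates come out. `ActivityKind`, `month_predicates`, and the repo name `org/repo` are hypothetical stand-ins introduced for illustration; only the template strings and `TIMESTAMP_FORMAT` are taken from the diff.

```python
from datetime import datetime
from enum import Enum

TIMESTAMP_FORMAT = "TO_TIMESTAMP('{0:%Y-%m-%d %H:%M:%S}')"


class ActivityKind(Enum):
    """Hypothetical, trimmed-down stand-in that keeps only the identifier template."""

    def __new__(cls, type_identifier_formatter):
        member = object.__new__(cls)
        member._value_ = type_identifier_formatter
        member.type_identifier_formatter = type_identifier_formatter
        return member

    LATEST = 'LATEST:{0}'
    MONTH = 'MONTH:{1:%Y%m}:{0}'
    TOTAL = 'TOTAL:{0}'

    def format_to_type_identifier(self, repo_name, identifier_timestamp=''):
        # str.format ignores unused positional arguments, so LATEST and TOTAL
        # can receive a timestamp (or '') without using it.
        return self.type_identifier_formatter.format(repo_name, identifier_timestamp)


def month_predicates(plugins_by_earliest_ts):
    # One predicate per repo, joined with OR. replace(day=1) moves the date to
    # the first of the month but keeps the time of day unchanged.
    return " OR ".join(
        f"repo = '{name}' AND to_timestamp(commit_author_date) >= "
        f"{TIMESTAMP_FORMAT.format(ts.replace(day=1))}"
        for name, ts in plugins_by_earliest_ts.items()
    )


latest_id = ActivityKind.LATEST.format_to_type_identifier('org/repo')
month_id = ActivityKind.MONTH.format_to_type_identifier('org/repo', datetime(2023, 1, 15))
where = month_predicates({'org/repo': datetime(2023, 1, 15, 10, 30)})

print(latest_id)  # LATEST:org/repo
print(month_id)   # MONTH:202301:org/repo
print(where)
```

Because the repo names are interpolated directly into the SQL text, this style presumably relies on the names coming from a trusted source rather than from user input.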
    class GitHubActivity(Model):
        class Meta:
            host = os.getenv('LOCAL_DYNAMO_HOST')
            region = os.getenv('AWS_REGION', 'us-west-2')
            table_name = f"{os.getenv('STACK_NAME', 'local')}-github-activity"

        plugin_name = UnicodeAttribute(hash_key=True)
        type_identifier = UnicodeAttribute(range_key=True)
        granularity = UnicodeAttribute(attr_name='type')
        timestamp = NumberAttribute(null=True)
        commit_count = NumberAttribute(null=True)
        repo = UnicodeAttribute()
        last_updated_timestamp = NumberAttribute(default_for_new=get_current_timestamp)

        def __eq__(self, other):
            if isinstance(other, GitHubActivity):
                return (
                    self.plugin_name == other.plugin_name and
                    self.type_identifier == other.type_identifier and
                    self.granularity == other.granularity and
                    self.timestamp == other.timestamp and
                    self.commit_count == other.commit_count and
                    self.repo == other.repo
                )
            return False
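The `__eq__` on `GitHubActivity` compares every attribute except `last_updated_timestamp`, which presumably lets two items written at different times still count as the same record. The sketch below shows the same comparison rule with a plain dataclass instead of a PynamoDB model, so it runs without DynamoDB. `ActivityRecord` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActivityRecord:
    """Hypothetical stand-in for the DynamoDB item, not the PynamoDB model itself."""
    plugin_name: str
    type_identifier: str
    granularity: str
    timestamp: Optional[int]
    commit_count: Optional[int]
    repo: str
    # compare=False excludes the write time from the generated __eq__,
    # mirroring the fields the hand-written __eq__ leaves out.
    last_updated_timestamp: int = field(default=0, compare=False)


first = ActivityRecord('demo-plugin', 'TOTAL:org/repo', 'TOTAL', None, 12, 'org/repo',
                       last_updated_timestamp=1)
second = ActivityRecord('demo-plugin', 'TOTAL:org/repo', 'TOTAL', None, 12, 'org/repo',
                        last_updated_timestamp=2)
changed = ActivityRecord('demo-plugin', 'TOTAL:org/repo', 'TOTAL', None, 13, 'org/repo')

print(first == second)   # True: only the write time differs
print(first == changed)  # False: commit_count differs
```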
    def transform_and_write_to_dynamo(data: dict[str, List], activity_type: GitHubActivityType) -> None:
        """Transforms plugin commit data generated by get_plugins_commit_count_since_timestamp to the expected format
        and then writes the formatted data to the corresponding github-activity dynamo table in each environment
        :param dict[str, list] data: plugin commit data of type dictionary in which the key is plugin name
        of type str and the value is Github activities of type list
        :param GitHubActivityType activity_type:
        """
        LOGGER.info(f'Starting item creation for github-activity type={activity_type.name}')

        batch = GitHubActivity.batch_write()

        start = time.perf_counter()
        count = 0
        repo_to_plugin_dict = _get_repo_to_plugin_dict()
        for repo, github_activities in data.items():
            plugin_name = repo_to_plugin_dict.get(repo)
            if plugin_name is None:
                continue
            for activity in github_activities:
                identifier_timestamp = activity.get('timestamp', '')
                timestamp = activity.get('timestamp')
                commit_count = activity.get('count')
                item = GitHubActivity(
                    plugin_name,
                    activity_type.format_to_type_identifier(repo, identifier_timestamp),
                    granularity=activity_type.name,
                    timestamp=activity_type.format_to_timestamp(timestamp),
                    commit_count=commit_count,
                    repo=repo)
                batch.save(item)
                count += 1

        batch.commit()
        duration = (time.perf_counter() - start) * 1000

        LOGGER.info(f'Items github-activity type={activity_type.name} count={count}')
        LOGGER.info(f'Transform and write to github-activity type={activity_type.name} timeTaken={duration}ms')
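The transform step in `transform_and_write_to_dynamo` can be tried without DynamoDB by collecting plain dictionaries instead of saving to a batch. The following is a rough sketch under several assumptions: `build_items` and `to_utc_millis` are hypothetical helpers, the input shape (`timestamp` and `count` keys per activity) is inferred from the loop in the diff, and `to_utc_millis` only guesses at what `date_to_utc_timestamp_in_millis` in `utils.utils` does.

```python
from datetime import datetime, timezone


def to_utc_millis(ts):
    # Assumption: naive datetimes are interpreted as UTC. The real helper in
    # utils.utils is not shown in the diff and may behave differently.
    return int(ts.replace(tzinfo=timezone.utc).timestamp() * 1000)


def build_items(data, repo_to_plugin, granularity, id_template):
    """Hypothetical in-memory version of the transform loop (no DynamoDB writes)."""
    items = []
    for repo, activities in data.items():
        plugin_name = repo_to_plugin.get(repo)
        if plugin_name is None:
            # Repos without a known plugin are skipped, as in the diff.
            continue
        for activity in activities:
            ts = activity.get('timestamp')
            items.append({
                'plugin_name': plugin_name,
                'type_identifier': id_template.format(repo, ts if ts is not None else ''),
                'type': granularity,
                'timestamp': to_utc_millis(ts) if ts is not None else None,
                'commit_count': activity.get('count'),
                'repo': repo,
            })
    return items


data = {
    'org/known-repo': [{'timestamp': datetime(2023, 1, 1), 'count': 4}],
    'org/unmapped-repo': [{'timestamp': datetime(2023, 1, 1), 'count': 9}],
}
items = build_items(data, {'org/known-repo': 'known-plugin'}, 'MONTH', 'MONTH:{1:%Y%m}:{0}')

for item in items:
    print(item)
```

Only the mapped repo produces an item here; the unmapped one is dropped silently, which matches the `continue` in the original loop.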