Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Figures pipeline performance improvement #429

Merged
merged 6 commits into from Mar 1, 2022

Conversation

johnbaldwin
Copy link
Contributor

@johnbaldwin johnbaldwin commented Feb 25, 2022

This PR contains code to improve Figures pipeline performance for collecting daily metrics

This PR follows on #427

What is in this PR?

  • figures.course - A new module to simplify retrieving data about a specific course. See commit, 839b322
  • figures.pipeline.enrollment_metrics_next - A new module to support an alternate workflow to collect per-enrollment metrics and provide aggregate progress metrics. See commit, 4dcbd75
  • Update to figures.pipeline.course_daily_metrics to allow the deployment to chose which enrollment data collection and reporting to use. See commit, 76aea12
  • Update to figures.tasks to provide new tasks to run this alternate workflow. See commit, 2f08796

See also #428

What needs to follow on as a new PR

  • Update the Django management commands to backfill data

@johnbaldwin johnbaldwin force-pushed the john/pipeline-workflow-next branch 2 times, most recently from 903b064 to 4cd78a7 Compare February 25, 2022 06:05
This PR adds `figures.course`, a new module that has the `Course` class

The goal of this class is to simplify Figures code

This class provides data specific to and associated with a course. Our
initial version provides only the essential data needed for our pipeline
performance improvement
@johnbaldwin johnbaldwin force-pushed the john/pipeline-workflow-next branch 3 times, most recently from 6142aa2 to 439063a Compare February 25, 2022 19:59
This PR provides alternate execution to create the same data as before

* See the module docstring in `figures.pipeline.enrollment_metrics_next` for details
* This PR contains a new module and tests to support the new module
* Default workflow is the old progress calculator, enrollment data
collector (in `figures.pipeline.enrollment_metrics`)
* New workflow calls `figures.pipeline.enrollment_metrics_next'`
* Updates tests to handle both conditions and default when `ed_next` is
not set
* Updated daily metrics tests to exercise new code
* Added new helper function, `fake_course_key` so we don't have to create
  a CourseOverview to get a couse id when we just need a valid course key
* Also lint fixed figures/management/commands/backfill_figures_daily_metrics.py
@johnbaldwin johnbaldwin changed the base branch from john/pipeline-perf-improvement-prep to main February 28, 2022 19:46
Copy link
Contributor

@OmarIthawi OmarIthawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnbaldwin I've added two optional nitpicks and one question for celery task scheudling.

The pull request looks good, thanks!

def extract(self, course_id, date_for, **_kwargs):
"""
defaults = dict(
def extract(self, course_id, date_for, ed_next=False, **_kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: ed_next -> use_next_enrollment_data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will add the ed_next=False default. But want to keep the name short. It is temporary until it replaces the existing system, I made sure to have all of these checks use the same variable name to make it simple to grep the code for this and the longer use_next_enrollment_data made the source look messier. I was spending too much time making code pretty to fix the quite long variable name. This felt like a waste of time and the short name, being short lived was a reasonable trade-off.

edit: Hopefully sooner rather than later, I'm going to overhaul the pipeline code and task scheduling to be more plugable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looks like I already set ed_next=False as default. I agree, the name isn't great, but I did that abbreviation on purpose. Having you have a nit about it is a bonus. Once we accept "Workflow Next" as the new workflow for Tahoe, we can clean house on the old workflow, reduce a ton of code and this will go away too

@@ -75,11 +75,6 @@ def handle(self, *args, **options):
metrics_func(**kwargs)
else:
metrics_func.delay(**kwargs) # pragma: no cover
# except Exception as e: # pylint: disable=bare-except
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: also remove # try: please:

Comment on lines +307 to +310
populate_daily_metrics_for_site(site_id=site.id,
date_for=date_for,
ed_next=True,
force_update=force_update)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Celery queue question: This would run populate_daily_metrics_for_site synchronously. Do we want this behavior?

populate_daily_metrics_for_site.delay() should be easier to work with in terms of deployments and queue impact.

We have the group() and delay() pattern used in figures:

figures/figures/tasks.py

Lines 456 to 468 in 2f08796

@shared_task
def run_figures_monthly_metrics():
"""
Populate monthly metrics for all sites.
"""
if waffle.switch_is_active(WAFFLE_DISABLE_PIPELINE):
logger.info('Figures pipeline is disabled due to %s being active.',
WAFFLE_DISABLE_PIPELINE)
return
logger.info('Starting figures.tasks.run_figures_monthly_metrics...')
all_sites_jobs = group(populate_monthly_metrics_for_site.s(site.id) for site in get_sites())
all_sites_jobs.delay()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OmarIthawi I would prefer to run the tasks synchronously. We have not historically because when I originally implemented the celery tasks, I tried it synchronously with very simple vanilla tasks on standalone edx-platform and they would just "poof" disappear. No logs, no exceptions, nothing. Just drop out of existence. So we carried them on as serial executions since.

Soon in my queue is to drastically improve Figures backfill / data repair toolkit and health checking, and improve the Figures Celery docker devsite environment.

Once we have those plus documentation and training the platform team on how to use it. THEN we can optimize Figures tasks to run in parallel as you mention

* Removed dead code
@johnbaldwin johnbaldwin merged commit deaa5fd into main Mar 1, 2022
@johnbaldwin johnbaldwin deleted the john/pipeline-workflow-next branch March 1, 2022 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants