Figures pipeline performance improvement #429

johnbaldwin · 2022-02-25T05:12:04Z

This PR contains code to improve Figures pipeline performance when collecting daily metrics.

This PR follows on from #427.

What is in this PR?

* `figures.course` - A new module to simplify retrieving data about a specific course. See commit 839b322
* `figures.pipeline.enrollment_metrics_next` - A new module to support an alternate workflow that collects per-enrollment metrics and provides aggregate progress metrics. See commit 4dcbd75
* Update to `figures.pipeline.course_daily_metrics` to allow the deployment to choose which enrollment data collection and reporting to use. See commit 76aea12
* Update to `figures.tasks` to provide new tasks to run this alternate workflow. See commit 2f08796

What needs to follow on as a new PR

Update the Django management commands to backfill data

This PR adds `figures.course`, a new module that has the `Course` class. The goal of this class is to simplify Figures code. This class provides data specific to and associated with a course. Our initial version provides only the essential data needed for our pipeline performance improvement.

This PR provides alternate execution to create the same data as before

* See the module docstring in `figures.pipeline.enrollment_metrics_next` for details
* This PR contains a new module and tests to support the new module

* Default workflow is the old progress calculator and enrollment data collector (in `figures.pipeline.enrollment_metrics`)
* New workflow calls `figures.pipeline.enrollment_metrics_next`
* Updates tests to handle both conditions and the default when `ed_next` is not set
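The switch between the two workflows can be pictured roughly as follows. This is a minimal sketch, not the code in `figures.pipeline.course_daily_metrics`: `collect_enrollment_metrics`, `collect_enrollment_metrics_next`, and the returned dict are hypothetical stand-ins. Only the `ed_next` flag name and its `False` default come from this PR.

```python
def collect_enrollment_metrics(course_id, date_for):
    """Hypothetical stand-in for the existing (default) collector."""
    return {"workflow": "current", "course_id": course_id, "date_for": date_for}


def collect_enrollment_metrics_next(course_id, date_for):
    """Hypothetical stand-in for the 'next' collector."""
    return {"workflow": "next", "course_id": course_id, "date_for": date_for}


def extract(course_id, date_for, ed_next=False, **_kwargs):
    """Pick the collector based on ed_next; any extra kwargs are ignored."""
    if ed_next:
        collector = collect_enrollment_metrics_next
    else:
        collector = collect_enrollment_metrics
    return collector(course_id, date_for)
```

Because the flag defaults to `False`, callers that never pass `ed_next` keep the existing behavior, which matches the "default when `ed_next` is not set" case the tests cover.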

* Updated daily metrics tests to exercise new code
* Added new helper function, `fake_course_key`, so we don't have to create a CourseOverview to get a course id when we just need a valid course key
* Also lint fixed `figures/management/commands/backfill_figures_daily_metrics.py`
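For illustration, a helper along these lines might look like the sketch below. It is an assumption, not the helper added in this PR: it returns a plain string in the `course-v1:<org>+<course>+<run>` form, whereas the real `fake_course_key` presumably returns a course key object. The function name, the `num` parameter, and the org and run values are made up.

```python
def fake_course_key_string(num, org="FakeOrg", run="2022"):
    """Build a syntactically valid course id string without touching the database.

    Hypothetical helper: only the 'course-v1:org+course+run' layout is standard.
    """
    return "course-v1:{org}+Course{num}+{run}".format(org=org, num=num, run=run)
```

A test that only needs an identifier can call `fake_course_key_string(1)` instead of creating a CourseOverview record first.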

OmarIthawi

@johnbaldwin I've added two optional nitpicks and one question about Celery task scheduling.

The pull request looks good, thanks!

OmarIthawi · 2022-03-01T07:57:53Z

figures/pipeline/course_daily_metrics.py

-    def extract(self, course_id, date_for, **_kwargs):
-        """
-            defaults = dict(
+    def extract(self, course_id, date_for, ed_next=False, **_kwargs):


nitpick: ed_next -> use_next_enrollment_data

johnbaldwin · Mar 1, 2022 · edited

Good point, I will add the `ed_next=False` default, but I want to keep the name short. It is temporary until the new workflow replaces the existing system. I made sure all of these checks use the same variable name so it is simple to grep the code for it, and the longer `use_next_enrollment_data` made the source look messier. I was spending too much time making the code pretty to accommodate the quite long variable name. That felt like a waste of time, and a short name that is also short-lived seemed a reasonable trade-off.

edit: Hopefully sooner rather than later, I'm going to overhaul the pipeline code and task scheduling to be more pluggable.

johnbaldwin · Mar 1, 2022

Hmm, looks like I already set `ed_next=False` as the default. I agree, the name isn't great, but I chose that abbreviation on purpose. Your having a nit about it is a bonus. Once we accept "Workflow Next" as the new workflow for Tahoe, we can clean house on the old workflow and remove a ton of code, and this will go away too.

OmarIthawi · 2022-03-01T07:59:23Z

figures/management/commands/backfill_figures_daily_metrics.py

@@ -75,11 +75,6 @@ def handle(self, *args, **options):
                metrics_func(**kwargs)
            else:
                metrics_func.delay(**kwargs)  # pragma: no cover
-            # except Exception as e:  # pylint: disable=bare-except


nitpick: please also remove the `# try:` line:

figures/figures/management/commands/backfill_figures_daily_metrics.py

Line 73 in 2f08796

# try:

OmarIthawi · 2022-03-01T08:04:23Z

figures/tasks.py

+            populate_daily_metrics_for_site(site_id=site.id,
+                                            date_for=date_for,
+                                            ed_next=True,
+                                            force_update=force_update)


Celery queue question: This would run populate_daily_metrics_for_site synchronously. Do we want this behavior?

populate_daily_metrics_for_site.delay() should be easier to work with in terms of deployments and queue impact.

We have the group() and delay() pattern used in figures:

figures/figures/tasks.py

Lines 456 to 468 in 2f08796

@shared_task
def run_figures_monthly_metrics():
    """
    Populate monthly metrics for all sites.
    """
    if waffle.switch_is_active(WAFFLE_DISABLE_PIPELINE):
        logger.info('Figures pipeline is disabled due to %s being active.',
                    WAFFLE_DISABLE_PIPELINE)
        return

    logger.info('Starting figures.tasks.run_figures_monthly_metrics...')
    all_sites_jobs = group(populate_monthly_metrics_for_site.s(site.id) for site in get_sites())
    all_sites_jobs.delay()

johnbaldwin · Mar 1, 2022

@OmarIthawi I would prefer to run the tasks asynchronously. Historically we have not, because when I originally implemented the Celery tasks, I tried running them asynchronously with very simple vanilla tasks on a standalone edx-platform and they would just "poof" disappear. No logs, no exceptions, nothing. They just dropped out of existence. So we have carried on with serial execution since.

Next in my queue is to drastically improve the Figures backfill / data repair toolkit and health checking, and to improve the Figures Celery Docker devsite environment.

Once we have those, plus documentation and training for the platform team on how to use them, then we can optimize Figures tasks to run in parallel as you mention.
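To make the trade-off concrete, here is a minimal sketch of the serial style, written with the standard library only. `run_serially`, `populate_site`, and the site IDs are hypothetical placeholders, not Figures functions. The point is that an in-process loop can log and collect every failure, which is the visibility that was missing when the tasks vanished without a trace.

```python
import logging

logger = logging.getLogger(__name__)


def run_serially(site_ids, populate_site):
    """Run populate_site for each site in order.

    Returns a dict mapping site_id to the exception raised for that site.
    """
    failures = {}
    for site_id in site_ids:
        try:
            populate_site(site_id)
        except Exception as exc:  # broad on purpose: one bad site must not stop the rest
            logger.exception("Daily metrics failed for site %s", site_id)
            failures[site_id] = exc
    return failures
```

A caller could then report or retry only the sites listed in the returned dict, for example `failures = run_serially([1, 2, 3], populate_site)`.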


johnbaldwin force-pushed the john/pipeline-workflow-next branch 2 times, most recently from 903b064 to 4cd78a7 Compare February 25, 2022 06:05

johnbaldwin force-pushed the john/pipeline-workflow-next branch 3 times, most recently from 6142aa2 to 439063a Compare February 25, 2022 19:59

Add next iteration to collect enrollment data

f6a177d

This PR provides alternate execution to create the same data as before

* See the module docstring in `figures.pipeline.enrollment_metrics_next` for details
* This PR contains a new module and tests to support the new module

johnbaldwin force-pushed the john/pipeline-workflow-next branch from 439063a to f6a177d Compare February 25, 2022 21:06

johnbaldwin added 2 commits February 25, 2022 17:56

johnbaldwin marked this pull request as ready for review February 26, 2022 06:25

johnbaldwin requested review from OmarIthawi, melvinsoft, shadinaif, estherjsuh and bryanlandia as code owners February 26, 2022 06:25

Merge branch 'main' into john/pipeline-workflow-next

3a98078

johnbaldwin changed the base branch from john/pipeline-perf-improvement-prep to main February 28, 2022 19:46

OmarIthawi approved these changes Mar 1, 2022


Address PR #429 comment

6c4e043

* Removed dead code

johnbaldwin merged commit deaa5fd into main Mar 1, 2022

johnbaldwin deleted the john/pipeline-workflow-next branch March 1, 2022 21:48
