
Move HoC activity and project count cronjobs to reader #33611

Merged: 4 commits into staging from resuscitate-analyze-hoc-activity on Mar 18, 2020

Conversation

@bencodeorg (Contributor) commented Mar 13, 2020

[Update: I ended up running the script on our production clone to generate data for the past four months. I can move those JSON files (generated daily) to production-daemon if that seems sensible to others? In any case, the contents of this PR still stand.]

High-level approach proposed:

  1. Copy four months of backfilled data from adhoc to production-daemon.
  2. Manually execute analyze_hoc_activity for single day on production-daemon.
  3. If no problems with step 2, manually execute the remaining two weeks (March 3 to present) of analyze_hoc_activity on production-daemon.
  4. Confirm that the total # of hours served is sensible, and turn off manual override if so.
  5. Uncomment cron tasks that run these scripts daily.

Details:

The analyze_hoc_activity and update_project_count scripts, which produce the high-level stats that appear on our website (e.g., the number of projects created and the number of Hours of Code started), have been broken for several months.

After much research, I've pointed the analyze_hoc_activity and update_project_count scripts at our reader connection (which, as I understand it, may actually be routed back to the writer on production-daemon).
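
For reference, a minimal sketch of how a Sequel reader connection can fall back to the writer when no replica endpoint is configured (the real definitions live in lib/cdo/db.rb; the environment variable names below are placeholders, not the actual config keys):

    require 'sequel'

    # Placeholder env var names; in the real config these come from the
    # db_reader / db_writer entries. If no reader is configured, the
    # "reader" connection simply points back at the writer.
    reader_url = ENV['PEGASUS_DB_READER_URL'] || ENV['PEGASUS_DB_WRITER_URL']
    PEGASUS_DB_READER = Sequel.connect(reader_url)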

The whole analyze_hoc_activity script (10 or so fairly large SQL queries) took between 10 and 60 seconds to run on a production clone, depending on whether the day covered was a weekday or a weekend (and as long as 5 minutes for days during CSEdWeek). I'm fairly confident we could run these queries safely in production, at least for a single day of data, even if it's the writer connection.

I think the speediness is a function of the comprehensive indexes on the tables/columns used in this script, particularly hoc_activity and forms in pegasus. Try SHOW INDEX FROM hoc_activity, for example, to see all the indexes on that table.
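
The same check from Ruby, as a hedged sketch using Sequel's index metadata API (assumes the PEGASUS_DB_READER connection name used later in this PR):

    # Sequel exposes index metadata directly; this prints each index on
    # hoc_activity along with the columns it covers.
    PEGASUS_DB_READER.indexes(:hoc_activity).each do |name, info|
      puts "#{name}: #{info[:columns].join(', ')}"
    end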

I've commented out the cron entries that run these scripts so that we can run them manually for a single day (currently set up to run just for February 1, 2020, for testing purposes) before running them for more days (we'll need to backfill several months of data; I have a couple of ideas on how we could do that).

Other notes:

Testing story

Tested this manually on an adhoc connected to a production clone. Note that I overrode the db_reader config entry so that the queries pointed at the reader endpoint, although I'm not sure this will be possible by default on production-daemon currently.
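
One hedged way to sanity-check where an overridden db_reader actually lands (SELECT @@hostname is standard MySQL; the connection constant follows the naming used in this PR):

    # Print the hostname the reader connection resolves to, to confirm the
    # override took effect and queries hit the intended endpoint.
    puts PEGASUS_DB_READER.fetch('SELECT @@hostname AS host').first[:host]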

@bencodeorg (Author) commented:

Update: the test day I picked (Feb 1, 2020) was a weekend, so there was much less data. Weekdays take between 30 and 45 seconds. Running on a production clone now to see if we can just move the daily JSON files from the adhoc to production-daemon.

@bencodeorg (Author) commented:

Ran the last several months on the adhoc; CSEdWeek days can take 5 minutes or so:

2019-12-01 analyzed in 9.96792265 seconds
2019-12-02 analyzed in 70.652760377 seconds
2019-12-03 analyzed in 87.93380079 seconds
2019-12-04 analyzed in 105.384783584 seconds
2019-12-05 analyzed in 107.52120741 seconds
2019-12-06 analyzed in 122.852147989 seconds
2019-12-07 analyzed in 23.076201008 seconds
2019-12-08 analyzed in 22.159987951 seconds
2019-12-09 analyzed in 248.853900645 seconds
2019-12-10 analyzed in 301.444195127 seconds
2019-12-11 analyzed in 304.008889511 seconds
2019-12-12 analyzed in 330.687762912 seconds
2019-12-13 analyzed in 335.317417808 seconds
2019-12-14 analyzed in 43.055903091 seconds
2019-12-15 analyzed in 28.982055843 seconds
2019-12-16 analyzed in 152.126159487 seconds
2019-12-17 analyzed in 161.083998451 seconds
2019-12-18 analyzed in 169.7574601 seconds
2019-12-19 analyzed in 159.981020923 seconds
2019-12-20 analyzed in 107.839640934 seconds
2019-12-21 analyzed in 18.972263545 seconds
2019-12-22 analyzed in 13.468384268 seconds
2019-12-23 analyzed in 19.42988667 seconds
2019-12-24 analyzed in 17.144985815 seconds
2019-12-25 analyzed in 12.958418824 seconds
2019-12-26 analyzed in 13.788052331 seconds
2019-12-27 analyzed in 12.013978134 seconds
2019-12-28 analyzed in 8.211618392 seconds
2019-12-29 analyzed in 7.337637396 seconds
2019-12-30 analyzed in 10.139073447 seconds
2019-12-31 analyzed in 9.440766457 seconds
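
For context, a minimal sketch of how per-day timings like these might be produced, assuming a hypothetical analyze_day(date) helper that wraps the script's queries for a single day:

    require 'benchmark'
    require 'date'

    day = Date.new(2019, 12, 1)
    while day <= Date.new(2019, 12, 31)
      # Benchmark.realtime returns wall-clock seconds as a Float.
      seconds = Benchmark.realtime { analyze_day(day) }  # hypothetical helper
      puts "#{day} analyzed in #{seconds} seconds"
      day += 1
    end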

@@ -51,10 +51,10 @@ def main
 # https://docs.google.com/document/d/1RTTCpkDYZjqZxfVehkZRkk1HckYMvFdFGs6SEZnK1I8
 total_started += 14_861_327

-today = DateTime.now.to_date
 day = Date.strptime('2014/12/06', '%Y/%m/%d')
+today = Date.strptime('2020/02/01', '%Y/%m/%d')
Contributor:

Could we use the Date constructor, which takes integer year, month, and day arguments? And same for the next line?

Suggested change:
-today = Date.strptime('2020/02/01', '%Y/%m/%d')
+today = Date.new(2020, 2, 1)
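
Both forms produce the same Date; the constructor just skips the string round-trip:

    require 'date'
    Date.strptime('2020/02/01', '%Y/%m/%d') == Date.new(2020, 2, 1)  # => true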

@bencodeorg (Author):

I can do that :)


-while day <= today
+while day < today
Contributor:

day and today are equal, right (Feb 1, 2020)? Will this entire while loop be skipped?

@bencodeorg (Author):

True -- I changed this because (please confirm my logic) the analyze_hoc_activity cronjob was scheduled to run every hour at :35. So what actually happened was that the counts for a given day were recomputed (and overwritten) every hour until 11:35 PM, at which point they finally represented the "entire day" of data. With this change, I think the script only runs for a day once we have that day's full data (and if we turn the cron back on, we should switch the schedule to run daily instead of hourly). See the sketch below.
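
A minimal sketch of the loop boundary in question (analyze_day is a hypothetical stand-in for the script's per-day work):

    day = Date.new(2014, 12, 6)
    today = Date.today
    # With <, the loop stops before today, so a day is only analyzed once it
    # has fully elapsed; <= would also analyze the partial current day.
    while day < today
      analyze_day(day)
      day += 1
    end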

@bencodeorg (Author):

So, like you said, this actually won't run as-is, but my thought was that for running this manually, we can just modify the dates to whatever we'd like once it's on production-daemon.

Contributor:

OK - now I understand. Looks great! Nice that you were able to run this on the clone. :)

@bencodeorg (Author) commented:

One more thing to get your eyes on, @sureshc -- there was an increase in select latency on the clone when I ran this, to about half a second:

[screenshot: select latency metric on the production clone, rising to about 0.5 seconds]

There aren't any other reads happening on this database (that I know of) at the moment -- does latency equal query execution time (which I'd expect to be higher, given these are expensive queries), or does it represent something like the time between the query request being made and the query actually beginning to execute?

@wjordan (Contributor) commented Mar 17, 2020

> does latency = query execution time (which I'd expect to be higher given these are expensive queries), or does it represent something like time between query request made and query actually beginning to execute?

Pretty sure this metric is average query execution time, so it's expected this would fluctuate based on the types of queries being sent to the instance.

@wjordan (Contributor) left a review:

Given the query times posted, I'm not too concerned with running this query wherever it needs to be run; overall, using the db_reader endpoints for them is fine by me.

We should probably eventually set db_reader to the read-replica endpoint on production-daemon / production-console, but I think this script update is fine even before that change.

# Note that these queries use the "DB" connection, which is set in lib/cdo/properties.rb to PEGASUS_DB
# PEGASUS_DB is defined in lib/cdo/db.rb using sequel with connections to the pegasus writer and reader.
# So, even though most of the script uses the reader connections (eg, DASHBOARD_DB_READER),
# the "DB" connection used here does include a connection to the writer.
Contributor:

If we're just doing SELECT queries with the DB connection here, should we just change DB[:forms].where to PEGASUS_DB_READER[:forms].where so this script is consistently reading from the same place?
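
Concretely, the suggested change would look something like this (the where filter is a placeholder; only the connection constant changes):

    # Before: reads go through DB, which includes the pegasus writer.
    count = DB[:forms].where(kind: 'some_kind').count

    # After: reads consistently from the pegasus reader.
    count = PEGASUS_DB_READER[:forms].where(kind: 'some_kind').count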

@bencodeorg (Author):

That makes sense to me -- let me try that switch and confirm everything works smoothly on the clone.

@bencodeorg merged commit f93a3e3 into staging on Mar 18, 2020
@bencodeorg deleted the resuscitate-analyze-hoc-activity branch on March 18, 2020 at 17:45