
Move HoC activity and project count cronjobs to reader #33611

Merged: 4 commits into staging from resuscitate-analyze-hoc-activity on Mar 18, 2020

Conversation

@bencodeorg (Contributor) commented Mar 13, 2020

[Update: I ended up running the script on our production clone to generate data for the past four months. I can move those JSON files (generated daily) to production-daemon if that seems sensible to others? In any case, the contents of this PR still stand.]

High-level approach proposed:

  1. Copy four months of backfilled data from adhoc to production-daemon.
  2. Manually execute analyze_hoc_activity for single day on production-daemon.
  3. If no problems with step 2, manually execute the remaining two weeks (March 3 to present) of analyze_hoc_activity on production-daemon.
  4. Confirm that the total # of hours served is sensible, and turn off manual override if so.
  5. Uncomment cron tasks that run these scripts daily.

Details:

The analyze_hoc_activity and update_project_count scripts, which produce the high-level stats that appear on our website (e.g., the number of projects created and the number of Hours of Code started), have been broken for several months.

After much research, I've pointed the analyze_hoc_activity and update_project_count scripts at our reader connection (which, as I understand it, may actually be routed back to the writer on production-daemon).
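
For reference, a minimal sketch of how a Sequel reader connection can fall back to the writer when no replica endpoint is configured (the real definitions live in lib/cdo/db.rb; the environment variable names below are placeholders, not the actual config keys):

    require 'sequel'

    # Placeholder env var names; in the real config these come from the
    # db_reader / db_writer entries. If no reader is configured, the
    # "reader" connection simply points back at the writer.
    reader_url = ENV['PEGASUS_DB_READER_URL'] || ENV['PEGASUS_DB_WRITER_URL']
    PEGASUS_DB_READER = Sequel.connect(reader_url)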

The whole analyze_hoc_activity script (10 or so fairly large SQL queries) took between 10 and 60 seconds to run on a production clone, depending on whether the day covered was a weekday or a weekend (and as long as 5 minutes for days during CSEdWeek). I'm fairly confident we could run these queries safely in production, at least for a single day of data, even if it's the writer connection.

I think the speediness is a function of the comprehensive indexes on the tables/columns used in this script, particularly hoc_activity and forms in pegasus. Try SHOW INDEX FROM hoc_activity, for example, to see all the indexes on that table.
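
The same check from Ruby, as a hedged sketch using Sequel's index metadata API (assumes the PEGASUS_DB_READER connection name used later in this PR):

    # Sequel exposes index metadata directly; this prints each index on
    # hoc_activity along with the columns it covers.
    PEGASUS_DB_READER.indexes(:hoc_activity).each do |name, info|
      puts "#{name}: #{info[:columns].join(', ')}"
    end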

I've commented out the cron entries that run these scripts so that we can run them manually for a single day (currently set up to run just for February 1, 2020, for testing purposes) before running them for more days (we'll need to backfill several months of data; I have a couple of ideas on how we could do that).

Other notes:

Testing story

Tested this manually on an adhoc connected to a production clone. Note that I overrode the db_reader config entry so that the queries pointed at the reader endpoint, although I'm not sure this will be possible by default on production-daemon currently.
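
One hedged way to sanity-check where an overridden db_reader actually lands (SELECT @@hostname is standard MySQL; the connection constant follows the naming used in this PR):

    # Print the hostname the reader connection resolves to, to confirm the
    # override took effect and queries hit the intended endpoint.
    puts PEGASUS_DB_READER.fetch('SELECT @@hostname AS host').first[:host]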

@bencodeorg (Author) commented:

Update: the test day I picked (Feb 1, 2020) was a weekend, so there was much less data. Weekdays take between 30 and 45 seconds. Running on a production clone now to see if we can just move the daily JSON files from the adhoc to production-daemon.

@bencodeorg (Author) commented:

Ran the last several months on the adhoc; CSEdWeek days can take 5 minutes or so:

2019-12-01 analyzed in 9.96792265 seconds
2019-12-02 analyzed in 70.652760377 seconds
2019-12-03 analyzed in 87.93380079 seconds
2019-12-04 analyzed in 105.384783584 seconds
2019-12-05 analyzed in 107.52120741 seconds
2019-12-06 analyzed in 122.852147989 seconds
2019-12-07 analyzed in 23.076201008 seconds
2019-12-08 analyzed in 22.159987951 seconds
2019-12-09 analyzed in 248.853900645 seconds
2019-12-10 analyzed in 301.444195127 seconds
2019-12-11 analyzed in 304.008889511 seconds
2019-12-12 analyzed in 330.687762912 seconds
2019-12-13 analyzed in 335.317417808 seconds
2019-12-14 analyzed in 43.055903091 seconds
2019-12-15 analyzed in 28.982055843 seconds
2019-12-16 analyzed in 152.126159487 seconds
2019-12-17 analyzed in 161.083998451 seconds
2019-12-18 analyzed in 169.7574601 seconds
2019-12-19 analyzed in 159.981020923 seconds
2019-12-20 analyzed in 107.839640934 seconds
2019-12-21 analyzed in 18.972263545 seconds
2019-12-22 analyzed in 13.468384268 seconds
2019-12-23 analyzed in 19.42988667 seconds
2019-12-24 analyzed in 17.144985815 seconds
2019-12-25 analyzed in 12.958418824 seconds
2019-12-26 analyzed in 13.788052331 seconds
2019-12-27 analyzed in 12.013978134 seconds
2019-12-28 analyzed in 8.211618392 seconds
2019-12-29 analyzed in 7.337637396 seconds
2019-12-30 analyzed in 10.139073447 seconds
2019-12-31 analyzed in 9.440766457 seconds
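
For context, a minimal sketch of how per-day timings like these might be produced, assuming a hypothetical analyze_day(date) helper that wraps the script's queries for a single day:

    require 'benchmark'
    require 'date'

    day = Date.new(2019, 12, 1)
    while day <= Date.new(2019, 12, 31)
      # Benchmark.realtime returns wall-clock seconds as a Float.
      seconds = Benchmark.realtime { analyze_day(day) }  # hypothetical helper
      puts "#{day} analyzed in #{seconds} seconds"
      day += 1
    end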

@@ -51,10 +51,10 @@ def main
 # https://docs.google.com/document/d/1RTTCpkDYZjqZxfVehkZRkk1HckYMvFdFGs6SEZnK1I8
 total_started += 14_861_327

-today = DateTime.now.to_date
 day = Date.strptime('2014/12/06', '%Y/%m/%d')
+today = Date.strptime('2020/02/01', '%Y/%m/%d')
Contributor:

Could we use the Date constructor, which takes integer year, month, and day arguments? And same for the next line?

Suggested change:
-today = Date.strptime('2020/02/01', '%Y/%m/%d')
+today = Date.new(2020, 2, 1)
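
Both forms produce the same Date; the constructor just skips the string round-trip:

    require 'date'
    Date.strptime('2020/02/01', '%Y/%m/%d') == Date.new(2020, 2, 1)  # => true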

@bencodeorg (Author):

I can do that :)


-while day <= today
+while day < today
Contributor:

day and today are equal, right (Feb 1, 2020)? Will this entire while loop be skipped?

@bencodeorg (Author):

True -- I changed this because (please confirm my logic) the analyze_hoc_activity cronjob was scheduled to run every hour at :35. So what actually happened was that the counts for a given day were recomputed (and overwritten) every hour until 11:35 PM, at which point they finally represented the "entire day" of data. With this change, I think the script only runs for a day once we have that day's full data (and if we turn the cron back on, we should switch the schedule to run daily instead of hourly). See the sketch below.
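
A minimal sketch of the loop boundary in question (analyze_day is a hypothetical stand-in for the script's per-day work):

    day = Date.new(2014, 12, 6)
    today = Date.today
    # With <, the loop stops before today, so a day is only analyzed once it
    # has fully elapsed; <= would also analyze the partial current day.
    while day < today
      analyze_day(day)
      day += 1
    end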

@bencodeorg (Author):

So, like you said, this actually won't run as-is, but my thought was that for running this manually, we can just modify the dates to whatever we'd like once it's on production-daemon.

Contributor:

OK - now I understand. Looks great! Nice that you were able to run this on the clone. :)

@bencodeorg (Author) commented:

One more thing to get your eyes on, @sureshc -- there was an increase in select latency on the clone when I ran this, to about half a second:

[screenshot: select latency metric on the production clone, rising to about 0.5 seconds]

There aren't any other reads happening on this database (that I know of) at the moment -- does latency equal query execution time (which I'd expect to be higher, given these are expensive queries), or does it represent something like the time between the query request being made and the query actually beginning to execute?

@wjordan (Contributor) commented Mar 17, 2020

> does latency = query execution time (which I'd expect to be higher given these are expensive queries), or does it represent something like time between query request made and query actually beginning to execute?

Pretty sure this metric is average query execution time, so it's expected this would fluctuate based on the types of queries being sent to the instance.

@wjordan (Contributor) left a review:

Given the query times posted, I'm not too concerned with running this query wherever it needs to be run; overall, using the db_reader endpoints for them is fine by me.

We should probably eventually set db_reader to the read-replica endpoint on production-daemon / production-console, but I think this script update is fine even before that change.

# Note that these queries use the "DB" connection, which is set in lib/cdo/properties.rb to PEGASUS_DB
# PEGASUS_DB is defined in lib/cdo/db.rb using sequel with connections to the pegasus writer and reader.
# So, even though most of the script uses the reader connections (eg, DASHBOARD_DB_READER),
# the "DB" connection used here does include a connection to the writer.
Contributor:

If we're just doing SELECT queries with the DB connection here, should we just change DB[:forms].where to PEGASUS_DB_READER[:forms].where so this script is consistently reading from the same place?
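
Concretely, the suggested change would look something like this (the where filter is a placeholder; only the connection constant changes):

    # Before: reads go through DB, which includes the pegasus writer.
    count = DB[:forms].where(kind: 'some_kind').count

    # After: reads consistently from the pegasus reader.
    count = PEGASUS_DB_READER[:forms].where(kind: 'some_kind').count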

@bencodeorg (Author):

That makes sense to me -- let me try that switch and confirm everything works smoothly on the clone.

@bencodeorg merged commit f93a3e3 into staging on Mar 18, 2020
@bencodeorg deleted the resuscitate-analyze-hoc-activity branch on March 18, 2020 at 17:45