Move HoC activity and project count cronjobs to reader #33611
Conversation
Update: the test day I picked (Feb 1 2020) was a weekend, so there was much less data. Weekdays take between 30-45 seconds. Running on a production clone now to see if we can just move the daily JSON files from the adhoc to production-daemon.

Ran the last several months on an ad hoc; CSEdWeek days can take 5 minutes or so.
bin/cron/analyze_hoc_activity
Outdated
```
@@ -51,10 +51,10 @@ def main
# https://docs.google.com/document/d/1RTTCpkDYZjqZxfVehkZRkk1HckYMvFdFGs6SEZnK1I8
total_started += 14_861_327

today = DateTime.now.to_date
day = Date.strptime('2014/12/06', '%Y/%m/%d')
today = Date.strptime('2020/02/01', '%Y/%m/%d')
```
Could we use the Date constructor, which takes integer year, month, and day arguments? And same for the next line?
```diff
-today = Date.strptime('2020/02/01', '%Y/%m/%d')
+today = Date.new(2020, 2, 1)
```
I can do that :)
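For reference, a quick sketch (using only the Ruby standard library) showing that the suggested `Date.new` constructor produces the same date as the `Date.strptime` parse it replaces:

```ruby
require 'date'

# The original hard-coded test date, parsed from a format string...
parsed = Date.strptime('2020/02/01', '%Y/%m/%d')

# ...and the reviewer's suggested constructor form, which takes integer
# year, month, and day arguments and needs no format string.
constructed = Date.new(2020, 2, 1)

puts parsed == constructed  # true
puts constructed.saturday?  # true -- Feb 1 2020 was a weekend, hence the smaller data set
```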
```diff
-while day <= today
+while day < today
```
`day` and `today` are equal, right (Feb 1, 2020)? Will this entire `while` loop be skipped?
True -- I changed this because (please confirm my logic) the `analyze_hoc_activity` cronjob was scheduled to run every hour on the :35, so what actually ended up happening is that the counts for a given day would be run (and overwritten every hour) until 11:35 PM, at which point they finally represented the "entire day" of data. This change (I think) means the script only runs for a day once we have that day's full data (I'm imagining that if we turn it back on, we should switch the cron schedule to run daily instead of hourly).
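A minimal sketch of that boundary change, assuming the script's day-iteration loop looks roughly like the diff above (the start date and the collected list here are illustrative, not from the script):

```ruby
require 'date'

processed = []
day = Date.new(2020, 1, 30)   # hypothetical start date for the sketch
today = Date.new(2020, 2, 1)  # the hard-coded test date from the diff

# With `day < today` (rather than `<=`), the loop stops before processing
# `today` itself, so a day is only counted once its full 24 hours of data exist.
while day < today
  processed << day
  day += 1
end

puts processed.last             # 2020-01-31 -- the most recent *complete* day
puts processed.include?(today)  # false -- the partial current day is skipped
```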
So, like you said, this actually won't run as-is, but my thought was that for running this manually, we can just modify the dates to whatever we'd like once it's on `production-daemon`.
OK - now I understand. Looks great! Nice that you were able to run this on the clone. :)
One more thing to get your eyes on @sureshc -- there was an increase in select latency on the clone when I ran this, to about half a second. There aren't any other reads happening to this database (that I know of) at this time. Does latency here mean query execution time (which I'd expect to be higher, given these are expensive queries), or does it represent something like the time between a query request being made and the query actually beginning to execute?

Pretty sure this metric is average query execution time, so it's expected that this would fluctuate based on the types of queries being sent to the instance.
Given the query times posted, I'm not too concerned with running this query wherever it needs to be run; overall, using the `db_reader` endpoints for them is fine by me.

We should probably eventually set `db_reader` to the read-replica endpoint on `production-daemon` / `production-console`, but I think this script update is fine even before that change.
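If we do repoint `db_reader` later, the change might look something like the following. To be clear, this is a hypothetical sketch: the key names are extrapolated from the `db_reader` config entry mentioned elsewhere in this PR, and the hostnames are placeholders, not real endpoints.

```yaml
# Hypothetical environment-config override on production-daemon / production-console.
# The writer stays pointed at the primary; reads go to the replica.
db_writer: mysql://user:password@primary.db.example.internal:3306/
db_reader: mysql://user:password@read-replica.db.example.internal:3306/
```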
bin/cron/analyze_hoc_activity
Outdated
```ruby
# Note that these queries use the "DB" connection, which is set in lib/cdo/properties.rb to PEGASUS_DB.
# PEGASUS_DB is defined in lib/cdo/db.rb using Sequel with connections to the pegasus writer and reader.
# So, even though most of the script uses the reader connections (eg, DASHBOARD_DB_READER),
# the "DB" connection used here does include a connection to the writer.
```
If we're just doing SELECT queries with the `DB` connection here, should we just change `DB[:forms].where` to `PEGASUS_DB_READER[:forms].where` so this script is consistently reading from the same place?
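To illustrate the proposed rename, here is a toy sketch of the Sequel-style call chain. Note this uses a hand-rolled stand-in for a dataset rather than Sequel itself, and the row contents are made up; the real script would go through the connections defined in lib/cdo/db.rb.

```ruby
# Minimal stand-in mimicking the Sequel dataset interface used in the script.
class FakeDataset
  def initialize(rows)
    @rows = rows
  end

  # Like Sequel's Dataset#where with a conditions hash: returns a filtered dataset.
  def where(conds)
    FakeDataset.new(@rows.select { |row| conds.all? { |k, v| row[k] == v } })
  end

  def count
    @rows.size
  end
end

rows = [{kind: 'hoc_signup'}, {kind: 'hoc_signup'}, {kind: 'petition'}]

# Before: DB[:forms].where(...)   After: PEGASUS_DB_READER[:forms].where(...)
PEGASUS_DB_READER = {forms: FakeDataset.new(rows)}

puts PEGASUS_DB_READER[:forms].where(kind: 'hoc_signup').count  # 2
```

The call shape is unchanged by the rename, which is what makes it a safe, read-only substitution.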
That makes sense to me -- let me try the switch and confirm everything works smoothly on the clone.

[update: I ended up running the script on our production clone to generate data for the past 4 months. I can move those JSON files (generated daily) to `production-daemon` if that seems sensible to others? In any case, the contents of this PR still stand.]

High level approach proposed:
- `analyze_hoc_activity` for a single day on production-daemon.
- `analyze_hoc_activity` on production-daemon.

Details:
The `analyze_hoc_activity` and `update_project_count` scripts, which produce high-level stats that appear on our website (eg, # of projects created and number of hours of code started), have been broken for several months.

After much research, I've pointed the `analyze_hoc_activity` and `update_project_count` scripts to our reader connection (which, in my understanding, may actually be routed back to the writer on `production-daemon`).

The whole `analyze_hoc_activity` script (which consists of 10 or so pretty large SQL queries) took between 10 and 60 seconds to run on a production clone, depending on whether it covered a weekday or a weekend -- and could be as high as 5 minutes for days during CSEdWeek. I am pretty confident we could run these queries safely in production, at least for a single day of data (even if it's on the writer connection).

I think the speediness is a function of the comprehensive indexes on the tables / columns used in this script, particularly `hoc_activities` and `forms` in pegasus. Try out `select index from hoc_activity`, for example, to check out all the indexes on that table.

I've commented out the cron entries that run these scripts so that we can run this manually for a single day (currently set up to run just for February 1 2020 for testing purposes) before running it for more days (we'll need to backfill several months of data; I have a couple ideas on how we could do that).
Other notes:
Testing story
Tested this manually on an adhoc connected to a production clone. Note that I overrode the `db_reader` config entry so that the queries were pointed at the reader endpoint, although I am not sure if this will be possible by default on production-daemon currently.