
Adding script to update time_spent to seconds from milliseconds #36398

Merged
merged 7 commits into staging from fix-time-spent-seconds on Aug 26, 2020

Conversation

jmkulwik
Contributor

@jmkulwik jmkulwik commented Aug 21, 2020

The time_spent field in the user_levels table has been recording time in milliseconds. This will cause time_spent to max out at ~25 days due to MySQL's integer size. Instead, we want to record time in seconds, which will cause time_spent to max out at ~68 years.
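
For reference, a rough sanity check of those limits, assuming time_spent is a default signed 32-bit MySQL INT column:

max_int = 2**31 - 1                 # 2_147_483_647, max value of a signed 32-bit INT
max_int / 1000.0 / 60 / 60 / 24     # => ~24.9 days when stored as milliseconds
max_int / 60.0 / 60 / 24 / 365      # => ~68.1 years when stored as seconds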

Recording time_spent was temporarily disabled by this PR so we can make this change on all existing UserLevel records.

Links

Testing story

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@jmkulwik jmkulwik requested review from sureshc, a team, clareconstantine, cforkish and mvkski and removed request for a team August 21, 2020 19:34
# time_spent should be transformed as such
# (time_spent.to_f/1000).ceil.to_i

UserLevel.where("time_spent > ?", 0).where.not(time_spent: nil).each do |user_level|


One way to mitigate the impact of running this query on such a big table would be to use a method like find_each (info), which can be chained to a where and fetches the records in batches rather than trying to instantiate them all at once.
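
A minimal sketch of that suggestion, reusing the scope and save options from the snippet above:

# Sketch only: stream matching rows in batches (find_each defaults to 1,000 per
# batch) instead of loading every UserLevel into memory at once.
UserLevel.where("time_spent > ?", 0).where.not(time_spent: nil).find_each do |user_level|
  user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
  user_level.save(touch: false, validate: false)
end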


Additionally, it might make sense to do this in a transaction so that if something goes wrong, you won't end up in a state where some records are in seconds and others are in milliseconds. A good example is probably this script - although you would want all of your updates to happen in one transaction. That might negate the benefits of using find_each, though, depending on how transactions work.


@clareconstantine clareconstantine left a comment


I like your approach! This seems like a great way to combine the transaction + batches :)


UserLevel.find_in_batches(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level_slice|
  puts "PROCESSING: slice #{slice} with starting id #{user_level_slice.first.id}."
  ActiveRecord::Base.transaction do


If UserLevels could have been deleted, do we also need the ending id of each slice? And is there anything you want to capture if it fails, like which row it failed on? (Maybe this is captured in whatever error message you get by default.)

Contributor Author


Ending id: I don't think we need it. It'll just be starting_id + 5000.
Row it failed on: Hmm. I thought about this for a bit. I'm super curious if you have additional thoughts on the matter. This is my first migration, so I don't have much experience to go on. My gut feeling: It might be useful as a learning experience for me, but for the migration itself, I don't think it would end up saving any time. Since the migration logic is so short and the slices are relatively small, the time spent writing error logging would outweigh the benefits of having that logging.


UserLevel.find_in_batches(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level_slice|
  puts "PROCESSING: slice #{slice} with starting id #{user_level_slice.first.id}."
  ActiveRecord::Base.transaction do
Contributor


I don't think we need to wrap each slice of 5000 rows in a single transaction. Can we log any row that we fail to update (rescue any error and log the id and the reason) and then move on to the next row?
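
A rough sketch of that shape (no transaction; just rescue, log the id and reason, and keep going - the log format here is assumed):

# Sketch only: update each row individually; log failures instead of rolling
# back a whole batch.
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level|
  begin
    next unless user_level.time_spent && user_level.time_spent > 0
    user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
    user_level.save(touch: false, validate: false)
  rescue StandardError => e
    puts "FAILED: user_level #{user_level.id}: #{e.message}"
  end
end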

user_level.save(touch: false, validate: false)
end
end
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level|
Contributor


Could we use find_each.with_index to track where we are and occasionally (every 5000 rows?) output our position, so we can monitor this long-running process similar to the way we were previously logging the start of each new slice?
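
Something along those lines, as a sketch (the log wording is just illustrative):

# Sketch only: log progress every 5,000 rows while streaming with find_each.
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000).with_index do |user_level, index|
  puts "PROCESSING: row #{index} with id #{user_level.id}." if (index % 5000).zero?
  # per-row update goes here
end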

Contributor Author


👍

@jmkulwik jmkulwik requested a review from sureshc August 26, 2020 16:58
end
if user_level.time_spent && user_level.time_spent > 0
  user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
  user_level.save(touch: false, validate: false)
Contributor

@sureshc sureshc Aug 26, 2020


The ActiveRecord update_column method might be appropriate to use here. We have callbacks on this model, and skipping them might improve performance and avoid unintended side effects.
https://apidock.com/rails/ActiveRecord/Persistence/update_column
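
A sketch of what that might look like in the loop body (update_column writes the value straight to the database, skipping validations, callbacks, and the updated_at touch):

# Sketch only: bypass callbacks and validations for this one column.
user_level.update_column(:time_spent, (user_level.time_spent.to_f / 1000).ceil)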

@jmkulwik jmkulwik requested a review from sureshc August 26, 2020 18:02
@jmkulwik
Contributor Author

From running this on the adhoc, I'm seeing that only the column we care about is being updated.

Before:
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| level_id | attempts | created_at          | updated_at          | best_result | script_id | level_source_id | submitted | readonly_answers | unlocked_at | time_spent |
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| 16073    | 1        | 2019-10-14 15:30:12 | 2020-08-20 15:31:20 | 100         | 375       | 852310263       | 0         | NULL             | NULL        | 0          |
| 11878    | 5        | 2019-10-14 16:37:44 | 2020-08-20 13:49:50 | 100         | 302       | 613711192       | 0         | NULL             | NULL        | 6618       |
| 11879    | 38       | 2019-10-14 16:40:23 | 2020-08-20 13:50:07 | 20          | 302       | 951400461       | 0         | NULL             | NULL        | 10165      |
| 11986    | 1        | 2019-10-14 16:50:10 | 2020-08-20 13:50:10 | 100         | 302       | 613341647       | 0         | NULL             | NULL        | 0          |

After:
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| level_id | attempts | created_at          | updated_at          | best_result | script_id | level_source_id | submitted | readonly_answers | unlocked_at | time_spent |
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| 16073    | 1        | 2019-10-14 15:30:12 | 2020-08-20 15:31:20 | 100         | 375       | 852310263       | 0         | NULL             | NULL        | 0          |
| 11878    | 5        | 2019-10-14 16:37:44 | 2020-08-20 13:49:50 | 100         | 302       | 613711192       | 0         | NULL             | NULL        | 7          |
| 11879    | 38       | 2019-10-14 16:40:23 | 2020-08-20 13:50:07 | 20          | 302       | 951400461       | 0         | NULL             | NULL        | 11         |
| 11986    | 1        | 2019-10-14 16:50:10 | 2020-08-20 13:50:10 | 100         | 302       | 613341647       | 0         | NULL             | NULL        | 0          |

Similarly, entries that should not be updated have not been changed.

@jmkulwik jmkulwik merged commit 1a47c07 into staging Aug 26, 2020
@jmkulwik jmkulwik deleted the fix-time-spent-seconds branch August 26, 2020 20:24
@sureshc
Contributor

sureshc commented Aug 26, 2020

I know we're pretty committed to this solution, but wanted to note here a couple of other possible solutions ...

A single SQL statement that updates all 600K rows?

UPDATE user_levels
SET time_spent = time_spent / 1000
WHERE time_spent IS NOT NULL;

Or SELECT & export the id and time_spent columns for the 600K rows to a CSV, and then implement a script that issues a raw SQL UPDATE for each row, one at a time, as it reads through the CSV. This would eliminate the overhead of instantiating a UserLevel model for each row.
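
A rough sketch of that CSV-driven approach (the file name and CSV headers here are assumed):

# Sketch only: read id/time_spent pairs from a previously exported CSV and
# issue one raw UPDATE per row, without instantiating UserLevel models.
require 'csv'

CSV.foreach('user_level_time_spent.csv', headers: true) do |row|
  seconds = (row['time_spent'].to_f / 1000).ceil
  ActiveRecord::Base.connection.execute(
    "UPDATE user_levels SET time_spent = #{seconds} WHERE id = #{row['id'].to_i}"
  )
end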
