
Adding script to update time_spent to seconds from milliseconds #36398

Merged
merged 7 commits into staging from fix-time-spent-seconds on Aug 26, 2020

Conversation

jmkulwik
Contributor

@jmkulwik jmkulwik commented Aug 21, 2020

The time_spent field in the user_levels table has been recording time in milliseconds. This will cause time_spent to max out at ~25 days due to MySQL's integer size. Instead, we want to record time in seconds, which will cause time_spent to max out at ~68 years.
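
For reference, a rough sanity check of those limits, assuming time_spent is a default signed 32-bit MySQL INT column:

max_int = 2**31 - 1                 # 2_147_483_647, max value of a signed 32-bit INT
max_int / 1000.0 / 60 / 60 / 24     # => ~24.9 days when stored as milliseconds
max_int / 60.0 / 60 / 24 / 365      # => ~68.1 years when stored as seconds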

Recording time_spent was temporarily disabled by this PR so we can make this change on all existing UserLevel records.

Links

Testing story

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

@jmkulwik jmkulwik requested review from sureshc, a team, clareconstantine, cforkish and mvkski and removed request for a team August 21, 2020 19:34
# time_spent should be transformed as such
# (time_spent.to_f/1000).ceil.to_i

UserLevel.where("time_spent > ?", 0).where.not(time_spent: nil).each do |user_level|


One way to mitigate the impact of running this query on such a big table would be to use a method like find_each (info), which can be chained to a where and fetches the records in batches rather than trying to instantiate them all at once.
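
A minimal sketch of that suggestion, reusing the scope and save options from the snippet above:

# Sketch only: stream matching rows in batches (find_each defaults to 1,000 per
# batch) instead of loading every UserLevel into memory at once.
UserLevel.where("time_spent > ?", 0).where.not(time_spent: nil).find_each do |user_level|
  user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
  user_level.save(touch: false, validate: false)
end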


Additionally, it might make sense to do this in a transaction so that if something goes wrong, you won't end up in a state where some records are in seconds and others are in milliseconds. A good example is probably this script - although you would want all of your updates to happen in one transaction. That might negate the benefits of using find_each, though, depending on how transactions work.


@clareconstantine clareconstantine left a comment


I like your approach! This seems like a great way to combine the transaction + batches :)


UserLevel.find_in_batches(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level_slice|
  puts "PROCESSING: slice #{slice} with starting id #{user_level_slice.first.id}."
  ActiveRecord::Base.transaction do


If UserLevels could have been deleted, do we also need the ending id of each slice? And is there anything you want to capture if it fails, like which row it failed on? (Maybe this is captured in whatever error message you get by default.)

Contributor Author


Ending id: I don't think we need it. It'll just be starting_id + 5000.
Row it failed on: Hmm. I thought about this for a bit. I'm super curious if you have additional thoughts on the matter. This is my first migration, so I don't have much experience to go on. My gut feeling: It might be useful as a learning experience for me, but for the migration itself, I don't think it would end up saving any time. Since the migration logic is so short and the slices are relatively small, the time spent writing error logging would outweigh the benefits of having that logging.


UserLevel.find_in_batches(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level_slice|
  puts "PROCESSING: slice #{slice} with starting id #{user_level_slice.first.id}."
  ActiveRecord::Base.transaction do
Contributor


I don't think we need to wrap each slice of 5000 rows in a single transaction. Can we log any row that we fail to update (rescue any error and log the id and the reason) and then move on to the next row?
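
A rough sketch of that shape (no transaction; just rescue, log the id and reason, and keep going - the log format here is assumed):

# Sketch only: update each row individually; log failures instead of rolling
# back a whole batch.
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level|
  begin
    next unless user_level.time_spent && user_level.time_spent > 0
    user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
    user_level.save(touch: false, validate: false)
  rescue StandardError => e
    puts "FAILED: user_level #{user_level.id}: #{e.message}"
  end
end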

user_level.save(touch: false, validate: false)
end
end
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000) do |user_level|
Contributor


Could we use find_each.with_index to track where we are and occasionally (every 5000 rows?) output our position, so we can monitor this long-running process similar to the way we were previously logging the start of each new slice?
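
Something along those lines, as a sketch (the log wording is just illustrative):

# Sketch only: log progress every 5,000 rows while streaming with find_each.
UserLevel.find_each(start: 93_221_000, finish: 2_811_329_000, batch_size: 5000).with_index do |user_level, index|
  puts "PROCESSING: row #{index} with id #{user_level.id}." if (index % 5000).zero?
  # per-row update goes here
end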

Contributor Author


👍

@jmkulwik jmkulwik requested a review from sureshc August 26, 2020 16:58
end
if user_level.time_spent && user_level.time_spent > 0
  user_level.time_spent = (user_level.time_spent.to_f / 1000).ceil.to_i
  user_level.save(touch: false, validate: false)
Contributor

@sureshc sureshc Aug 26, 2020


The ActiveRecord update_column method might be appropriate to use here. We have callbacks on this model, and skipping them might improve performance and avoid unintended side effects.
https://apidock.com/rails/ActiveRecord/Persistence/update_column
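
A sketch of what that might look like in the loop body (update_column writes the value straight to the database, skipping validations, callbacks, and the updated_at touch):

# Sketch only: bypass callbacks and validations for this one column.
user_level.update_column(:time_spent, (user_level.time_spent.to_f / 1000).ceil)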

@jmkulwik jmkulwik requested a review from sureshc August 26, 2020 18:02
@jmkulwik
Contributor Author

From running this on the adhoc, I'm seeing that only the column we care about is being updated.

Before:
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| level_id | attempts | created_at          | updated_at          | best_result | script_id | level_source_id | submitted | readonly_answers | unlocked_at | time_spent |
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| 16073    | 1        | 2019-10-14 15:30:12 | 2020-08-20 15:31:20 | 100         | 375       | 852310263       | 0         | NULL             | NULL        | 0          |
| 11878    | 5        | 2019-10-14 16:37:44 | 2020-08-20 13:49:50 | 100         | 302       | 613711192       | 0         | NULL             | NULL        | 6618       |
| 11879    | 38       | 2019-10-14 16:40:23 | 2020-08-20 13:50:07 | 20          | 302       | 951400461       | 0         | NULL             | NULL        | 10165      |
| 11986    | 1        | 2019-10-14 16:50:10 | 2020-08-20 13:50:10 | 100         | 302       | 613341647       | 0         | NULL             | NULL        | 0          |

After:
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| level_id | attempts | created_at          | updated_at          | best_result | script_id | level_source_id | submitted | readonly_answers | unlocked_at | time_spent |
+----------+----------+---------------------+---------------------+-------------+-----------+-----------------+-----------+------------------+-------------+------------+
| 16073    | 1        | 2019-10-14 15:30:12 | 2020-08-20 15:31:20 | 100         | 375       | 852310263       | 0         | NULL             | NULL        | 0          |
| 11878    | 5        | 2019-10-14 16:37:44 | 2020-08-20 13:49:50 | 100         | 302       | 613711192       | 0         | NULL             | NULL        | 7          |
| 11879    | 38       | 2019-10-14 16:40:23 | 2020-08-20 13:50:07 | 20          | 302       | 951400461       | 0         | NULL             | NULL        | 11         |
| 11986    | 1        | 2019-10-14 16:50:10 | 2020-08-20 13:50:10 | 100         | 302       | 613341647       | 0         | NULL             | NULL        | 0          |

Similarly, entries that should not be updated have not been changed.

@jmkulwik jmkulwik merged commit 1a47c07 into staging Aug 26, 2020
@jmkulwik jmkulwik deleted the fix-time-spent-seconds branch August 26, 2020 20:24
@sureshc
Contributor

sureshc commented Aug 26, 2020

I know we're pretty committed to this solution, but wanted to note here a couple of other possible solutions ...

A single SQL statement that updates all 600K rows?

UPDATE user_levels
SET time_spent = time_spent / 1000
WHERE time_spent IS NOT NULL;

Or SELECT & export the id and time_spent columns for the 600K rows to a CSV, and then implement a script that issues a raw SQL UPDATE for each row, one at a time, as it reads through the CSV. This would eliminate the overhead of instantiating a UserLevel model for each row.
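
A rough sketch of that CSV-driven approach (the file name and CSV headers here are assumed):

# Sketch only: read id/time_spent pairs from a previously exported CSV and
# issue one raw UPDATE per row, without instantiating UserLevel models.
require 'csv'

CSV.foreach('user_level_time_spent.csv', headers: true) do |row|
  seconds = (row['time_spent'].to_f / 1000).ceil
  ActiveRecord::Base.connection.execute(
    "UPDATE user_levels SET time_spent = #{seconds} WHERE id = #{row['id'].to_i}"
  )
end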
