New sharing_disabled backfill script #21681

Merged
merged 2 commits into from
Apr 5, 2018

islemaster
Contributor

@islemaster islemaster commented Apr 4, 2018

Uses activerecord-import to rapidly update a value in the properties JSON blob in our users table for roughly 35 million users, in batches of 10,000. Will ran some tests and found that this approach can update 10k rows in about 1.6 seconds, so we're estimating a little over 90 minutes to run this script (against 97% of our user rows!).
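For reference, the estimate can be reproduced with a quick calculation. This is only a back-of-the-envelope sketch using the figures quoted above; the constant names are made up for illustration, and it assumes every batch takes the same 1.6 seconds.

```ruby
# Rough runtime estimate from the numbers in the PR description.
TOTAL_ROWS        = 35_000_000 # "roughly 35 million users"
BATCH_SIZE        = 10_000     # rows written per import call
SECONDS_PER_BATCH = 1.6        # measured time to update one batch

# Number of import calls needed, rounding up for a partial last batch.
batches = (TOTAL_ROWS.to_f / BATCH_SIZE).ceil

# Assumes constant per-batch cost; real runs will vary with load.
estimated_seconds = batches * SECONDS_PER_BATCH
estimated_minutes = estimated_seconds / 60.0

puts "#{batches} batches, about #{estimated_minutes.round(1)} minutes"
```

That works out to 3,500 batches and a little over 93 minutes, in line with the estimate above.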

Next steps

We need to deploy this change so that the activerecord-import gem gets installed on production-console, then we'll manually run the oneoff script to update user rows.

Why

We're going with this approach (updating so many rows) instead of some other approaches we discussed where application logic is used to minimize the number of rows we need to update, because:

  • The application implementation for this approach is already done.
  • We've got a small number of user rows that have already been updated from a previous attempt (no harm to users, but it'd be nice to be consistent again).
  • There are product implications of other approaches we discussed, and we know this approach meets spec.

Awesomeness

@wjordan wrote:

I'd like to see us start integrating activerecord-import into other parts of our stack, specifically bulk seeding steps.

We can't wait to see what sort of other speedups this permits...

@Erin007
Contributor

Erin007 commented Apr 4, 2018

Do we still have to consider validation, like in @caleybrock's earlier PR #21671?

@islemaster
Contributor Author

I believe the validate: false on this line deals with that.

@ewjordan
Contributor

ewjordan commented Apr 5, 2018

This is awesome, I'm definitely going to use this approach next time I have to do a big data migration. Good find!

Contributor

@balderdash balderdash left a comment

🚀

user.update! sharing_disabled: true
num_students_updated += 1
puts "Updated #{num_students_updated} students so far." if num_students_updated % 100000 == 0
User.where('birthday IS NULL OR birthday > ?', min_birthday).in_batches(of: batch_size) do |where|
Contributor

Seems like the variable should be named something like users instead of where.

Contributor

or batch

@islemaster islemaster merged commit 0ef5807 into staging Apr 5, 2018
@islemaster islemaster deleted the sharing-backfill branch April 5, 2018 18:53
values.each do |_id, properties|
properties['sharing_disabled'] = true
end
User.import([:id, :properties], values, validate: false, on_duplicate_key_update: [:properties])
Contributor

@islemaster what does on_duplicate_key_update do here?

Contributor Author

The default mode for the activerecord-import gem is a bulk-insert, but here we want to do a bulk-update. The on_duplicate_key_update setting turns this into an upsert operation that will update the properties column when primary keys match (in this case id).
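To make the upsert behaviour concrete, here is a small pure-Ruby model of it. This is not activerecord-import and does not touch a database: the table is just a Hash keyed by id, and `upsert_rows` is a hypothetical helper invented for this illustration. It only aims to show which rows and columns change.

```ruby
# Pure-Ruby model of upsert semantics (NOT the activerecord-import gem).
# `upsert_rows` is a hypothetical helper; `table` is a Hash keyed by id.
def upsert_rows(table, columns, rows, on_duplicate_key_update:)
  rows.each do |row|
    attrs = columns.zip(row).to_h
    id = attrs.fetch(:id)
    if table.key?(id)
      # Key already exists: overwrite only the listed columns.
      on_duplicate_key_update.each { |col| table[id][col] = attrs[col] }
    else
      # No existing key: behave like a plain insert.
      table[id] = attrs
    end
  end
  table
end

# Two made-up user rows standing in for the users table.
users = {
  1 => { id: 1, name: 'a', properties: { 'locale' => 'en' } },
  2 => { id: 2, name: 'b', properties: {} }
}

# Mirror the flow in the diff: collect [id, properties] pairs,
# set the flag, then write them back keyed on id.
values = users.values.map { |u| [u[:id], u[:properties].dup] }
values.each { |_id, properties| properties['sharing_disabled'] = true }

upsert_rows(users, [:id, :properties], values, on_duplicate_key_update: [:properties])
```

In this model every id already exists, so no rows are inserted; only the properties column of each existing row is replaced and other columns such as name are left alone.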

See also: On Duplicate Key Update on the activerecord-import wiki

Contributor

@joshlory See https://github.com/zdennis/activerecord-import/wiki/On-Duplicate-Key-Update

TL;DR: it's an upsert. When the key (id) already exists, it updates properties.

8 participants