
FIX: First pass to improve efficiency of secure uploads rake task #9284

Conversation

@martin-brennan (Contributor) commented Mar 26, 2020

Get rid of the harmful each loop over the uploads to update. Instead, we put all the unique access control posts for the uploads into a map for fast access (vs. the slow .find through an array), then look up each post as needed while looping through the uploads in batches.

On a Discourse instance with ~93k uploads, a simplified version of the old method takes > 1 minute:

unique_access_control_posts = Post.unscoped.select(:id, :topic_id).includes(topic: :category).where(id: uploads_to_update.pluck(:access_control_post_id).uniq)
uploads_to_update.each do |upload|
  upload.access_control_post = unique_access_control_posts.find { |post| post.id == upload.access_control_post_id }
end

A simplified version of the new method takes ~18s and uses a lot less memory:

unique_access_control_posts = {}
Post.unscoped.select(:id, :topic_id).includes(topic: :category)
       .where(id: uploads_to_update.pluck(:access_control_post_id).uniq).find_each do |post|
  unique_access_control_posts[post.id] = post
end

uploads_to_update.find_each do |upload|
  upload.access_control_post = unique_access_control_posts[upload.access_control_post_id]
end
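The difference is easy to demonstrate outside of Rails. A minimal plain-Ruby sketch (Post here is a bare Struct, not the ActiveRecord model, and the sizes are scaled down from the real task): Array#find scans the whole array on every lookup, O(n) each, while a Hash keyed by id is O(1) each.

```ruby
# Plain-Ruby sketch of the lookup difference; Post is a bare Struct,
# not the ActiveRecord model, and sizes are scaled down from the task.
Post = Struct.new(:id, :topic_id)

posts = (1..1_000).map { |i| Post.new(i, i * 10) }
upload_post_ids = Array.new(5_000) { rand(1..1_000) }

# Old approach: a linear scan of `posts` for every upload (O(n) per lookup).
slow = upload_post_ids.map { |pid| posts.find { |p| p.id == pid } }

# New approach: build the id => post map once, then O(1) lookups.
posts_by_id = {}
posts.each { |post| posts_by_id[post.id] = post }
fast = upload_post_ids.map { |pid| posts_by_id[pid] }

slow == fast # same posts either way; the hash just finds them without scanning
```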
This also avoids keeping the map in memory for longer than necessary and passing it around.
Post.unscoped.select(:id, :topic_id)
    .includes(topic: :category)
    .where(id: uploads_to_update.pluck(:access_control_post_id).uniq).find_each do |post|
  unique_access_control_posts[post.id] = post
end
@SamSaffron (Member) commented Mar 26, 2020

Does this not pull a huge number of posts into memory? Why not just keep the ids and update access_control_post_id = id instead?
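For reference, the alternative being suggested here can be sketched in plain Ruby. PostStore#find is a hypothetical stand-in for Post.find, not Discourse code: memory stays flat at roughly one post, but the loop pays one query per upload.

```ruby
# Simulation of the id-only alternative: hold just the ids and fetch each
# post on demand. PostStore#find stands in for Post.find; it is not
# Discourse code.
class PostStore
  attr_reader :queries

  def initialize(rows) # rows: { id => topic_id }
    @rows = rows
    @queries = 0
  end

  def find(id) # each call would be one DB round trip
    @queries += 1
    { id: id, topic_id: @rows.fetch(id) }
  end
end

store = PostStore.new((1..50).to_h { |i| [i, i * 10] })
access_control_post_ids = Array.new(500) { rand(1..50) }

access_control_post_ids.each do |pid|
  post = store.find(pid) # at most one post held at a time...
end
# ...but one query per upload: 500 here, ~93k in the real task
store.queries # => 500
```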

@martin-brennan (Author, Contributor) commented Mar 26, 2020

@SamSaffron because the post's topic and that topic's category are used in UploadSecurity to determine whether the post has secure media. See:

https://github.com/discourse/discourse/blob/master/lib/upload_security.rb#L51-L53
https://github.com/discourse/discourse/blob/master/app/models/post.rb#L500-L504

The uploads themselves already have an access_control_post_id set earlier in the task.

@martin-brennan (Author, Contributor) commented Mar 26, 2020

On the forum I tested, this results in ~3,400 posts in memory for ~93k uploads.

@discoursereviewbot commented Mar 26, 2020

SamSaffron posted:

I see. Then why not just keep a list of ids, and do a Post.find(id) in the loop? That way you don't keep 3,000 posts in memory.

@discoursereviewbot commented Mar 26, 2020

Martin Brennan posted:

I thought it would be more efficient to keep 3000 posts in memory than to go to the DB 93,000 times?

@discoursereviewbot commented Mar 26, 2020

SamSaffron posted:

It's tricky; it depends on how big the list is. If you can comfortably hold it in memory, then I guess it is fine.

I guess this is already much improved, so we go with it, but if you are doing this treatment on a DB with 100k posts that need touching, there is a chance you cannot hold all of that in memory.

I say merge for now; you need to unblock yourself.
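A possible middle ground between the two approaches, sketched in plain Ruby (db simulates the posts table, each_slice stands in for find_in_batches, and in the real task the per-batch fetch would be a single Post.where(id: ids) query): build the id => post map per batch of uploads, so peak memory is bounded by the batch size rather than the total number of distinct posts.

```ruby
# Sketch: resolve posts batch-by-batch so the map never holds more than one
# batch's worth of posts. `db` simulates the posts table; each_slice stands
# in for find_in_batches.
db = (1..1_000).to_h { |i| [i, { id: i, topic_id: i * 10 }] }
upload_post_ids = Array.new(10_000) { rand(1..1_000) }

resolved = 0
max_map_size = 0

upload_post_ids.each_slice(1_000) do |batch|
  batch_map = db.slice(*batch.uniq) # one bulk "query" per batch
  max_map_size = [max_map_size, batch_map.size].max
  batch.each { |pid| resolved += 1 if batch_map[pid] }
end # batch_map is discarded after each batch

resolved # => 10_000; max_map_size stays <= 1_000 regardless of totals
```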

@discoursereviewbot commented Mar 26, 2020

Martin Brennan posted:

Ok, all good. I guess it is a balancing act. I might just be overly cautious about DB performance based on past experience, probably because I have never worked anywhere with a full infrastructure team that is great at handling this side of things :slight_smile:

martin-brennan merged commit 6f978bc into master on Mar 26, 2020

7 checks passed: PLUGINS-BACKEND, CORE-BACKEND, PLUGINS-FRONTEND, CORE-FRONTEND, PLUGINS-LINT, CORE-LINT, license/cla (Contributor License Agreement is signed).
martin-brennan deleted the issue/improve-efficiency-of-secure-media-rake-task-first-pass branch on Mar 26, 2020
3 participants