Skip to content

Alias foreign key and enumerate columns to push condition into subselect#1

Merged
bhserna merged 1 commit into
bhserna:masterfrom
bensheldon:better_query
Jul 31, 2024
Merged

Alias foreign key and enumerate columns to push condition into subselect#1
bhserna merged 1 commit into
bhserna:masterfrom
bensheldon:better_query

Conversation

@bensheldon
Copy link
Copy Markdown
Contributor

Hi! Thank you so much for blogging about N+1 queries and popularizing lateral joins in associations.

So I found a problem with the query: the outermost condition of the association (e.g. WHERE user_id IN ($1, ...)) doesn't get pushed down in the current query. This leads to the lateral join being applied to all records before the condition is applied.

I have a solution though. I only modified one of the queries in the PR to demonstrate it.

(Also, I did do some slight Arel-izing, but I'm happy to rewrite it with strings if that's clearer. Mainly here I just wanted to show you what I found.)

Here's some notes from my own application with EXPLAIN ANALYZE: https://gist.github.com/bensheldon/4054feed291e8ae7f54d40b14ad7ba79

@bhserna
Copy link
Copy Markdown
Owner

bhserna commented Jul 24, 2024

Woo! what an honor to receive a pull request from you @bensheldon 😃 !

Let's see if I understand...

Apparently the thing that is doing the magic is the user_id... am I right 🤔 ?

For example... Do you know if in your gist, something like this could work?...

SELECT users.id AS user_id, latest.*
 ...

@bensheldon
Copy link
Copy Markdown
Contributor Author

💖 I recommend your blog posts about Lateral Joins all the time. It's a fantastic resource! THANK YOU! This was me on Reddit 😊 https://www.reddit.com/r/rails/comments/kmofhp/comment/ghnf0hy/

For example... Do you know if in your gist, something like this could work? SELECT users.id AS user_id, latest.*

Unfortunately that results in the error:

ERROR:  column reference "user_id" is ambiguous
LINE 14: WHERE "newsletter_deliveries"."user_id" IN (1, 2)

The error happens because the query in the sub-select has two user_id columns: one from users.id AS user_id and then the user_id from latest.*. That's why it's necessary to enumerate the columns to omit the latter. It's not great looking 😞

I have yet to find a working strategy that allows for not enumerating the columns. But I'm chatting with some folks about it 🤞🏻

@bhserna
Copy link
Copy Markdown
Owner

bhserna commented Jul 25, 2024

I try to copy what you did in your original pull request https://github.com/bensheldon/open311status/pull/98/files and I think that is working.

example "In a has many association" do
  class User < ActiveRecord::Base
    has_many :posts
    has_many :last_posts, -> { last_n_per_user(3) }, class_name: "Post"
  end

  class Post < ActiveRecord::Base
    scope :last_n_per_user, ->(n) {
      query = select('subquery.*').from('(SELECT *, id AS user_id FROM users) AS posts')

      join_sql = <<~SQL
        JOIN LATERAL (
          SELECT * FROM posts AS sub_posts
          WHERE sub_posts.user_id = posts.user_id
          ORDER BY sub_posts.id DESC LIMIT :limit
        ) AS subquery ON TRUE
      SQL

      query.joins(sanitize_sql_array([join_sql, { limit: n }]))
    }
  end

  users = User.preload(:last_posts).limit(5).explain
  puts users
  #pp users.map { |user| [user.name, user.last_posts.map(&:id)] }
end

This is the output...

Example: In a has many association
----------------------------------
D, [2024-07-25T10:05:21.489839 #7445] DEBUG -- :   User Load (0.1ms)  SELECT "users".* FROM "users" LIMIT $1  [["LIMIT", 5]]
D, [2024-07-25T10:05:21.491662 #7445] DEBUG -- :   Post Load (0.4ms)  SELECT subquery.* FROM (SELECT *, id AS user_id FROM users) AS posts JOIN LATERAL (
  SELECT * FROM posts AS sub_posts
  WHERE sub_posts.user_id = posts.user_id
  ORDER BY sub_posts.id DESC LIMIT 3
) AS subquery ON TRUE WHERE "posts"."user_id" IN ($1, $2, $3, $4, $5)  [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
EXPLAIN SELECT "users".* FROM "users" LIMIT $1 [["LIMIT", 5]]
                           QUERY PLAN
----------------------------------------------------------------
 Limit  (cost=0.00..0.09 rows=5 width=40)
   ->  Seq Scan on users  (cost=0.00..22.00 rows=1200 width=40)
(2 rows)

EXPLAIN SELECT subquery.* FROM (SELECT *, id AS user_id FROM users) AS posts JOIN LATERAL (
  SELECT * FROM posts AS sub_posts
  WHERE sub_posts.user_id = posts.user_id
  ORDER BY sub_posts.id DESC LIMIT 3
) AS subquery ON TRUE WHERE "posts"."user_id" IN ($1, $2, $3, $4, $5) [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5]]
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=22.55..86.21 rows=15 width=44)
   ->  Bitmap Heap Scan on users  (cost=8.80..17.29 rows=5 width=8)
         Recheck Cond: (id = ANY ('{1,2,3,4,5}'::bigint[]))
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..8.80 rows=5 width=0)
               Index Cond: (id = ANY ('{1,2,3,4,5}'::bigint[]))
   ->  Limit  (cost=13.74..13.75 rows=3 width=44)
         ->  Sort  (cost=13.74..13.76 rows=6 width=44)
               Sort Key: sub_posts.id DESC
               ->  Bitmap Heap Scan on posts sub_posts  (cost=4.20..13.67 rows=6 width=44)
                     Recheck Cond: (user_id = users.id)
                     ->  Bitmap Index Scan on index_posts_on_user_id  (cost=0.00..4.20 rows=6 width=0)
                           Index Cond: (user_id = users.id)
(12 rows)

Can you help me see if this is right/better?

@bensheldon
Copy link
Copy Markdown
Contributor Author

bensheldon commented Jul 26, 2024

@bhserna ooh, nice! Yes, that does look equivalent.

Sidenote: I just noticed in this project that there isn't multicolumn index on (post_id, id). That would speed up the lateral join because it's a filter by post_id and order by id.

So yes, I think your query is probably better than mine from a readability perspective, and equivalent in performance (maybe better / more reliable) 🙌🏻 So yes, let's go with that.

Counterpoint: I'm working on turning this into an extension for Active Record (gem here, though I haven't published to Rubygems yet). The one thing about your proposal I need to look into, from an Active Record perspective (rather than raw SQL) is how to rewrite all of the scopes in the lateral to use the table alias.

@bhserna
Copy link
Copy Markdown
Owner

bhserna commented Jul 31, 2024

@bensheldon after seeing what you are trying to do in your gem I think that is better to use the approach that you just shared in the pull request.

I couldn't find a way to be able to pass a scope in a lambda, like in this example...

class User < ActiveRecord::Base
 # like this
  has_one_of_many :last_post, -> { order("created_at DESC") }, class_name: "Post"
end

... because if I understand it right, you need to preserve the table name without alias to allow the user of the gem to build scopes without thinking in the internal representation.

So, I think that I will merge your pull request as is... Thank you!

Also, I will try to update the post or share in some way what you found... jeje but it won't be easy to explain 😅

@bhserna bhserna merged commit e0ef92e into bhserna:master Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants