Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide find_in_batches and find_each #87

Merged
merged 3 commits into from
Nov 18, 2017
Merged

Provide find_in_batches and find_each #87

merged 3 commits into from
Nov 18, 2017

Conversation

robacarp
Copy link
Member

@robacarp robacarp commented Nov 6, 2017

I'm unsure if this is in the spirit of this project or not, so I wanted to put some code out there and start a discussion before investing enough time to complete the algorithm and write specs.

.all is a fine way to deal with a couple hundred or so rows in a table, but above that it becomes a performance risk. ActiveRecord provides a nice way to get around this problem by fetching N rows from the database at once, and iterating over them in step. find_each iterates individual records and find_in_batches iterates batches as arrays of records. I consider this pattern essential for large datasets.

I noticed that there aren't any tests around the existing finders, which might mean that I'd struggle to set an acceptable spec pattern for these finders.

Assuming that these functions are a desireable addition to granite, here is a list of things I think might be needed to merge this in:

  • find_in_batches does not deal at all with the parameterized query components
  • it also should check for limit and offset already present in the sql string? The only precedent here is first which does not check for the presence of LIMIT at all, so maybe not.
  • returning a lot of rows with SQL LIMIT and OFFSET isn't the best pattern for these functions, but I think it is minimally viable for all the drivers. Postgres has cursors, I'm not sure what the rest of the databases have. The weakness of LIMIT/OFFSET is that if the database is modified by the code here or elsewhere, duplicated or missing rows are possible. This should be implemented elsewhere / later, I think. It'll require diving into the database drivers explicitly.
  • specs for this code?

Thoughts?

@drujensen
Copy link
Member

@robacarp I think this would be a great addition. I would also like a last as well since I tend to use that often.

@drujensen
Copy link
Member

@robacarp I wanted to check in and see if you have made any progress on this feature. Looking forward to adding this new functionality! 💯

@robacarp
Copy link
Member Author

@drujensen I'm happy to finish this up, but I'm a little unsure of the direction the specs should take since several of the other methods aren't spec'd at all.

@drujensen
Copy link
Member

Yeah, the tests are more full integration tests instead of unit tests. I found this more useful since we need to test all 3 dbs.

Create a model and populate with test data. Verify find in batches works against all 3 drivers.

Let me know if you need help creating these.

@robacarp
Copy link
Member Author

@drujensen I see, I didn't expect specs for these methods to be in this file: https://github.com/amberframework/granite-orm/blob/master/spec/adapter/pg_spec.cr so I overlooked them completely.

@robacarp
Copy link
Member Author

@drujensen Updated, now with specs.

Copy link
Member

@drujensen drujensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. thanks for adding some tests. is there any edge cases we should test like when find_in_batches doesn't have enough items to create a full batch?

@robacarp
Copy link
Member Author

robacarp commented Nov 14, 2017

@drujensen yes, there are a few more edge cases which still need to be tested:

  • batch: returning less than batch_size results in a smaller batch
  • both: never yielding when query is empty
  • both: starting from some offset other than 0
  • both: passing in a parameterized query and getting the expected results iterated

Any others you think are worth spec-ing?

@drujensen
Copy link
Member

@robacarp that looks like an excellent list. Will you have time to add these? If not, we can look at doing them in another PR down the road.

@robacarp
Copy link
Member Author

@drujensen I will, and should within the week. I can't guarantee that though 😅

This is a little out of scope of this pull, but the repetition between the three driver test suites makes me a little weary. It makes it difficult to see if there actually are any subtle differences between the drivers which need to be maintained. Is something like this out of the question?

{% adapters = ["pg","mysql","sqlite"] %}

{% for adapter in adapters %}

  {%
   parent_class = "Parent#{adapter}".id
   user_class = "User#{adapter}".id
   child_class = "Child#{adapter}".id
   role_class = "Role#{adapter}".id
  %}

class {{ parent_class.id }} < Granite::ORM::Base
  adapter {{ adapter.id }}

  has_many :users
  has_many :childs, through: :users

  field name : String
  timestamps
end

class {{ user_class.id }}< Granite::ORM::Base
  adapter {{ adapter.id }}

  belongs_to :parent
  belongs_to :child

  field name : String
  field pass : String
  field total : Int32
  timestamps
end

class {{ child_class.id }}< Granite::ORM::Base
  adapter {{ adapter.id }}

  has_many :users
  has_many :parents, through: :users

  field name : String
end


class {{ role_class.id }}< Granite::ORM::Base
  adapter {{ adapter.id }}

  primary custom_id : Int32
  field name : String
end

{{ parent_class }}.exec("DROP TABLE IF EXISTS parents;")
{{ parent_class }}.exec("CREATE TABLE parents (
  id BIGSERIAL PRIMARY KEY,
  name VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
")

{{ user_class }}.exec("DROP TABLE IF EXISTS users;")
{{ user_class }}.exec("CREATE TABLE users (
  id BIGSERIAL PRIMARY KEY,
  name VARCHAR,
  pass VARCHAR,
  total INT,
  parent_id BIGINT,
  child_id BIGINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
")

{{ child_class }}.exec("DROP TABLE IF EXISTS childs;")
{{ child_class }}.exec("CREATE TABLE childs (
  id BIGSERIAL PRIMARY KEY,
  name VARCHAR,
  user_id BIGINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);
")

{{ role_class }}.exec("DROP TABLE IF EXISTS roles;")
{{ role_class }}.exec("CREATE TABLE roles (
  custom_id SERIAL PRIMARY KEY,
  name VARCHAR
);
")


describe {{ adapter }} do
  Spec.before_each do
    {{ parent_class }}.clear
    {{ child_class }}.clear
    {{ role_class }}.clear
  end

  describe "#all" do
    it "finds all the users" do
      user = {{ user_class }}.new
      user.name = "Test {{ user_class }}"
      user.total = 10
      user.save
      user = {{ user_class }}.new
      user.name = "Test {{ user_class }} 2"
      user.save
      users = {{ user_class }}.all
      users.size.should eq 2
    end

    it "finds users matching clause using named substitution" do
      user = {{ user_class }}.new
      user.name = "Bob"
      user.total = 1000
      user.save
      user = {{ user_class }}.new
      user.name = "Joe"
      user.total = 2000
      user.save
      user = {{ user_class }}.new
      user.name = "Joline"
      user.total = 3000
      user.save

      users = {{ user_class }}.all("WHERE name LIKE $1", ["Jo%"])
      users.size.should eq 2
    end

    it "finds users matching clause using multiple named substitutions" do
      user = {{ user_class }}.new
      user.name = "Bob"
      user.total = 1000
      user.save
      user = {{ user_class }}.new
      user.name = "Joe"
      user.total = 2000
      user.save
      user = {{ user_class }}.new
      user.name = "Joline"
      user.total = 3000
      user.save

      users = {{ user_class }}.all("WHERE name LIKE ANY(ARRAY[$1, $2])", ["Joe%", "Joline%"])
      users.size.should eq 2
    end

    it "finds users matching clause using question mark substitution" do
      user = {{ user_class }}.new
      user.name = "Bob"
      user.total = 1000
      user.save
      user = {{ user_class }}.new
      user.name = "Joe"
      user.total = 2000
      user.save
      user = {{ user_class }}.new
      user.name = "Joline"
      user.total = 3000
      user.save

      users = {{ user_class }}.all("WHERE name LIKE ?", ["Jo%"])
      users.size.should eq 2
    end

    it "finds users matching clause using multiple question mark substitutions" do
      user = {{ user_class }}.new
      user.name = "Bob"
      user.total = 1000
      user.save
      user = {{ user_class }}.new
      user.name = "Joe"
      user.total = 2000
      user.save
      user = {{ user_class }}.new
      user.name = "Joline"
      user.total = 3000
      user.save

      users = {{ user_class }}.all("WHERE name LIKE ANY(ARRAY[?, ?])", ["Joe%", "Joline%"])
      users.size.should eq 2
    end
  end

{% end %}

This is just bulk find/replace'd code, for demonstration. But I think this would DRY up the test suite, make it more clear that the bulk of methods across the drivers are identical, and most importantly highlight places where they differ.

@drujensen
Copy link
Member

No, not at all. That is an excellent suggestion! There are some nuances between each driver but the adapters should try to handle those.

What if you create a new test suite in the root folder for your feature and then use this technique for your testing? We can look at moving the other test suites over to this technique if we find it works well.

@robacarp
Copy link
Member Author

@drujensen The most recent commit has a spec which attempts to be polymorphic about testing #find_each and #find_in_batches, and the code isn't bad. Since granite doesn't provide migration functionality, it got messy around primary keys and other places in the table definitions, so I kept the schema for the test table to an absolute minimum. However, I think this is a good thing and will help to enforce testing only one thing at a time.

Let me know your thoughts.

@drujensen
Copy link
Member

@robacarp The code is kinda ugly because of the large amount of macros but I think its better than copying the test 3 times per driver. I like that the feature testing is in one place for all three drivers. I think it will make things easier down the road.

@amberframework/core-team wdyt?

@@ -99,6 +99,7 @@ describe Granite::Adapter::Mysql do
Spec.before_each do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is a dangerous think to do. Spec.before_each will run before every single test in the test suite. It is not scoped by file nor describe block.

Often that leads to unintended consequences down the road.

That said, these statements look like they would do no harm, and if they would for a future test, there would likely be a compile time error.

Copy link
Member

@drujensen drujensen Nov 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marksiemers Right, we are aware if that limitation. Do you have a better recommendation than doing it this way?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this code, I think it is fine - I don't recommend or expect any changes.

I'm more wary of it in generated code - as the end user may not know the implications of putting things in there.

Any solution I have isn't perfect. You can wrap the code you want in a macro, and insert that macro "before_each" test where it applies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specific pattern was copied straight out of the other tests, but I'm aware that it's not ideal. I toyed around with some other patterns, but it ultimately nothing cleaned up the pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought I had is that, should this pattern propagate and be used in a greater portion of the test suite, it may be useful to create a set of models which are used throughout the test suite. Dropped/created in spec_helper, and truncated as here but with the definition of that truncation clearly placed in spec_helper as well.

Hopefully it would be easier to deal with the subtleties of database drivers which have different levels of adherence to the SQL spec in different ways, and that polymorphism would be constrained to the spec_helper as well (or perhaps in a kind of run-at-spec-start migration file for each driver).

As for the heavy use of macros, there may be a better way to write code and I'm open to suggestions. I think it's really important that a project like this have as much parity between drivers as possible. One good way to enforce that is through uniform testing. As features are added, it becomes more and more likely driver tests would diverge in subtle ways.

@drujensen drujensen merged commit f57c7ac into amberframework:master Nov 18, 2017
@robacarp robacarp deleted the find_each branch November 18, 2017 06:03
@faustinoaq faustinoaq added this to Done in Framework 2018 May 5, 2018
@faustinoaq faustinoaq removed this from Done in Framework 2018 Jun 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants