Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicating non-stable field expr. in query - is completely wrong and can produce wrong results #682

Open
mvorisek opened this issue Jul 28, 2020 · 12 comments · May be fixed by #934
Open

Comments

@mvorisek
Copy link
Member

mvorisek commented Jul 28, 2020

Of course, I am thinking about optimal data modelling, not about implementation. Implementation is always possible.

Because of separation of Model and Persistence. Each persistence handles the JOIN in a child class

that is the same as with Union - we need universal nesting support from Model which could result in join, union, wrap.

The current design is actually completely wrong as presented a few days ago:
Imagine field with expr RAND() > 0. This field is currently rendered multiple like:

SELECT RAND() > 0 FROM DUAL WHERE RAND() > 0

it is not only rand, any expression like SELECT name FROM t LIMIT 1 is unstable as it has no explicit ORDER. Even if we (in theory, we normally do no even know them) add all table columns as a fallback order, queries like SELECT name, COUNT(*) FROM t ORDER BY name LIMIT 1 will be still unstable because of grouping first.

This is major bug which must be addressed.

Any non-trivial expression must SELECT under an alias ONCE and then this alias must be used anywhere else - https://stackoverflow.com/questions/35695171/remove-duplicate-sub-query

@DarkSide666
Copy link
Member

DarkSide666 commented Jul 30, 2020

This is important to think about and also this is not that simple to change and fix. That's because there can be anything in expression field. It can be subquery, some fieldname, function call etc.

What resulting select you would like to see instead of this ?

-- foo = field expression: RAND()
SELECT RAND() as foo FROM DUAL WHERE RAND() > 0

@mvorisek
Copy link
Member Author

-- foo = field expression: RAND()
SELECT RAND() as foo FROM DUAL WHERE RAND() > 0

should ALWAYS be queried like

SELECT RAND() as foo FROM DUAL HAVING foo > 0

non trivial fields (or to be precise fields that are non-constant and not using constant functions) that are used multiple times must be firstly selected and then reused/conditioned using HAVING

sql render using nesting is wrong, we must render to some intermediate form and then build sql once

@DarkSide666
Copy link
Member

DarkSide666 commented Jul 30, 2020

Nested queries should be much more faster that using having which executes on full result-set.
On other hand, in some cases (like mentioned above) using having is a solution.

But again, it will perform much slower. For example, this simple case:

SELECT
    (select name from person where id=e.person_id) AS emp_name
FROM employee e
WHERE 
    (select name from person where id=e.person_id) = 'John'

This is probably faster than using having in case we have millions of employees.

Even more complexity shows up if we use both - where and having in same query. And what if conditions are nested?

@mvorisek
Copy link
Member Author

mvorisek commented Jul 30, 2020

@DarkSide666 this should use join if 1:1, if 1:N, your query will fail (bacause subquery will return more than 1 row)

MySQL and other DBs like Oracle or MSSQL engines can probably optimize some HAVING clauses the same like subqueries

nested conditions - you must query data and then you can use constant functions/junctions/operators like you want

@DarkSide666
Copy link
Member

DarkSide666 commented Jul 30, 2020

Yes, this example is for simple hasOne case and standard title_field.

@mvorisek mvorisek changed the title Current design - duplicating non-stable field expr. in query - is completely wrong and can produce wrong results Duplicating non-stable field expr. in query - is completely wrong and can produce wrong results Nov 12, 2021
@mvorisek mvorisek linked a pull request Nov 26, 2021 that will close this issue
@mkrecek234
Copy link
Contributor

Can confirm that - if you have more complex hasOne/hasMany table structures, Atk4/Data gets very slow in case you add fields to child model from parent, grand-parent and grand-grand-parent tables via hasOne->addFields(...). This requires optimization, otherwise for performance reason it is much faster to load parent/grand-parent tables only where needed for single records, rather than adding those (grand-)parent fields straight into the model. If we render SQL correctly, this should not slow down SQL performance though.

@mvorisek
Copy link
Member Author

@mkrecek234 can you please post here a repro code like in #1078 (comment) for your nested usecase?

@mkrecek234
Copy link
Contributor

@mvorisek Here you go - sorry not designed as a test but normal code:

<?php

use Atk4\Data\Persistence;
use \Atk4\Data\Model;
use \Atk4\Data\Model\AggregateModel;

include '../vendor/autoload.php';

$db = new \Atk4\Data\Persistence\Sql('mysql:host=localhost;dbname=atk4;port=3306;user=root;password=root');

/* Create database 'atk4' in localhost MySql
 *

CREATE TABLE `account_group` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  `sign` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `account_group` (`id`, `name`, `sign`)
VALUES
	(1, 'Acount Group', -1);

CREATE TABLE `account` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  `account_group_id` int unsigned DEFAULT NULL,
  `fx_rate` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `account` (`id`, `name`, `account_group_id`, `fx_rate`)
VALUES
	(1, 'Test Account', 1, 1.2);

CREATE TABLE `booking` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `account_id` int unsigned DEFAULT NULL,
  `booking_group` varchar(32) DEFAULT NULL,
  `value` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `booking` (`id`, `account_id`, `booking_group`, `value`)
VALUES
	(1, 1, 'Group1', 123),
	(2, 1, 'Group1', 456),
	(3, 1, 'Group2', 789);

 */
class AccountGroup extends \Atk4\Data\Model {
    public $table = 'account_group';

    protected function init(): void
    {
        parent::init();
        $this->addFields(['name', 'sign' => ['type' => 'float']]);;
    }
}
class Account extends \Atk4\Data\Model {
	public $table = 'account';
    
    protected function init(): void
    {
		parent::init();
		$this->addFields(['name', 'fx_rate' => ['type' => 'float']]);
        $this->hasOne('account_group_id', ['model' => [AccountGroup::class]])->addField('sign');
	}
}

class Booking extends \Atk4\Data\Model {
	public $table = 'booking';

    protected function init(): void
    {
		parent::init();
		$this->addFields(['booking_group', 'value' => ['type' => 'float']]);
		$this->hasOne('account_id', ['model' => [Account::class]])->addFields(['fx_rate', 'sign']);
		
	}
}

    $accountModel = new Account($db);
    $bookingModel = new Booking($db);

    $bookingAggregate = new AggregateModel($bookingModel);
    $bookingAggregate->setGroupBy(['booking_group'], [
        'total' => ['expr' => 'sum([value])'], ]);

    foreach ($bookingAggregate as $booking) {
        echo 'Result: ' . $booking->get('booking_group') . ' | ' . $booking->get('total');
    }

    echo 'That worked..';

    $bookingAggregate = new AggregateModel($bookingModel);
    $bookingAggregate->setGroupBy(['booking_group'], [
        'total' => ['expr' => 'sum([sign] * [value])'], ]);

    foreach ($bookingAggregate as $booking) {
        echo 'Result: ' . $booking->get('total');
    }
    echo 'That did not work..';


@mvorisek
Copy link
Member Author

it seems you somewhat copied the AggregateModel testcase from aggregate issue

subquery like:

      (
        select
          `fx_rate`
        from
          `account` `_a_8a089c2a7e6c`
        where
          `id` = `booking`.`account_id`
      ) `fx_rate`

should be quite fast as long as the selection is done on unique index, any reasonable SQL optimizer should detect it is a 1:1 join query

In chat you have written:

Just sonething I noticed: When you have multi-layered hasOne relations (so parents tables, grand-parent tables and grand-grandparents tables) and if you always addFields of the respective higher levels, so in your child table you have a lot of fields from grand/grand/grand-tables, Atk4/data gets terribly slow. I stopped doing this and even load fields later, instead of adding them theough hasOne relations. In SQL this would not be so slow IMHO, because you do: SELECT child.field, parent.field, grand-parent.field from child join parent on (…) join grand-parent on (…) join grand-grand-parent on (…)
Atk4/data seems to create separate sub-query selects for each added field.

It would be good if you can provide the actual slow query and table metrics

@mkrecek234
Copy link
Contributor

@mvorisek The example above shows a nesting, as we cascade the field down from AccountGroup via Account to Booking. If this was not what you meant, please specify what you need. On the performance I do not have performance metrics, but I can tell that complex models with down-cascaded fields slow down simple load and save model commands to a good extent.

@mvorisek
Copy link
Member Author

so no AggregateModel above is needed and (new Booking($db))->executeCountQuery() has the same performance problem?

@mvorisek
Copy link
Member Author

mvorisek commented Nov 6, 2023

⚠ even more conterintuitive: https://sqlite.org/forum/forumpost/8202db3916

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants