
Fix postgresql functionality in queries.py. #56

Closed
wants to merge 4 commits

Conversation

changeling (Contributor)

Add check for database engine (see the sketch below).
Add conditional `CHUNK_SIZE` for `_sql_split()` for database backend.
Add `c.name` to `GROUP BY` clause in `_query_get_sex()` as per postgres requirement.
Add conditional `initial` and `group_concat` declarations (`string_agg` for postgres) for postgres functionality in `get_ancestors()` and `get_descendants()`.
Minor change to query execution in `get_descendants()` to match `get_ancestors()` structure.

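(For illustration, a hedged sketch of the engine check and conditional `CHUNK_SIZE` described above. The 996/8900 values appear in the diff below and the `database_in_use` name in a later commit message; the `endswith()` test and everything else here is an assumption, not necessarily the PR's exact code.)

```python
from django.conf import settings

# Sketch only: pick a chunk size based on the configured database backend.
database_in_use = settings.DATABASES['default']['ENGINE']
if database_in_use.endswith('postgresql'):
    CHUNK_SIZE = 8900
else:
    CHUNK_SIZE = 996  # SQLite caps bound variables (999 by default)
```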
@briot (Owner) commented Mar 28, 2019

Excellent and timely PR!
I was wondering whether you would have time to look at PostgreSQL, since SQLite is no match given the size of your database. I wonder how much of an improvement you saw by moving to PostgreSQL?

I am doing some work to optimize the Source View page by doing similar recursive queries and also using group_concat, so I'll take your patch into account for that. Will review soon.

@changeling (Author)

I'm not able to import yet, though I suspect that has more to do with the mess these big files are in and postgres being more rigid than sqlite. It does make a huge difference, though! To get the data in, I used the amazing pgloader, which can take a sqlite db and transfer schema, data and all into a postgres database. I haven't spent any time on benchmarking, but you might appreciate that all operations are generally between 80 and 190 seconds, with the following dataset:

[Screenshot: dataset statistics]

Next, I'll load in the ManyMany generation-deep data and take a look.

The debounce and paged list views are really great! I have some thoughts regarding the interface that I'll add to an issue after I sleep for a bit. Cheers!

@briot (Owner) commented Mar 28, 2019 via email

@changeling (Author)

I suspect that had to do with a slightly out-of-date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.

```python
    CHUNK_SIZE = 8900
else:
    CHUNK_SIZE = 996
```

briot (Owner)

I think it would be nicer to directly add a key in the settings file, and use it from here. That way it is possible to change it more easily, and it avoids hard-coding this check for databases here.
https://docs.djangoproject.com/en/2.1/topics/settings/

What do you think?

changeling (Author)

I should have done that to begin with. :}

How about this, in the DATABASES setting (I'm using variables at the top of settings for the particulars):

```python
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 10,
        'NAME': PG_DATABASE_NAME,
        'USER': PG_DATABASE_USER,
        'PASSWORD': PG_DATABASE_PASSWORD,
        'HOST': 'localhost',
        'PORT': PG_DATABASE_PORT,
        'CHUNK_SIZE': 8900
    }
}
```

With this at the top of queries.py:

```python
from django.conf import settings  # if not already imported in queries.py

# Add max query size for other database backends
CHUNK_SIZE = settings.DATABASES['default']['CHUNK_SIZE']
```

And then:

```python
    def _sql_split(self, ids, chunk_size=CHUNK_SIZE):
        """
        Generate multiple tuples to split a long list of ids into more
        manageable chunks for Sqlite
        """
        if ids is None:
            yield None
        else:
            ids = list(ids)  # need a list to extract parts of it
            for i in range(0, len(ids), chunk_size):
                yield ids[i:i + chunk_size]
```
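(For illustration, a hedged sketch of how the generated chunks might be consumed inside one of the query methods when building `IN (...)` clauses; the `persona` table and column names are placeholders, not the project's actual schema.)

```python
from django.db import connection

# Sketch only: run one query per chunk and merge the rows.
rows = []
for chunk in self._sql_split(ids):
    if chunk is None:
        break  # no ids were given
    placeholders = ",".join(["%s"] * len(chunk))
    with connection.cursor() as cur:
        cur.execute(
            f"SELECT id, name FROM persona WHERE id IN ({placeholders})",
            chunk)
        rows.extend(cur.fetchall())
```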

briot (Owner)

I think this is the right approach (defining `CHUNK_SIZE` in settings.py).
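(A small refinement worth considering, shown as a hedged sketch: read the key with a fallback so a settings file without `CHUNK_SIZE` still works. The 996 default is an assumption taken from the SQLite branch above.)

```python
from django.conf import settings

# Fall back to the SQLite-safe value when DATABASES has no CHUNK_SIZE key.
CHUNK_SIZE = settings.DATABASES['default'].get('CHUNK_SIZE', 996)
```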

```python
    group_concat = "string_agg(parents.parent::text, ',') AS parents "
else:
    initial = f"VALUES({person_id},0)"
    group_concat = "group_concat(parents.parent) AS parents "
```
briot (Owner)

Can we instead make a migration and declare the group_concat aggregate in PostgreSQL (that had been my initial plan, at least)? `CREATE AGGREGATE group_concat()`. I must say I never quite remember the syntax.
Something along the lines of the following (untested):

```sql
CREATE OR REPLACE FUNCTION comma_concat(text, text) RETURNS text
    LANGUAGE plpgsql
    AS $_$
begin
  if $1 = '' then
    return $2;
  else
    return $1 || ',' || $2;
  end if;
end;
$_$;

CREATE AGGREGATE group_concat(text) (
    SFUNC = textcat, STYPE = text, INITCOND = '');
CREATE AGGREGATE br_concat(text) (
    SFUNC = comma_concat, STYPE = text, INITCOND = '');
```

Then we can avoid the test for PostgreSQL in the code.
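(A hedged sketch of what such a migration could look like using Django's `RunSQL`. The app label, dependency, and the comma-separating aggregate definition are assumptions, not the project's actual migration, and it would need to be guarded or skipped on non-PostgreSQL backends.)

```python
from django.db import migrations

CREATE_SQL = """
CREATE OR REPLACE FUNCTION comma_concat(text, text) RETURNS text
    LANGUAGE sql
    AS $$ SELECT CASE WHEN $1 = '' THEN $2 ELSE $1 || ',' || $2 END $$;

CREATE AGGREGATE group_concat(text) (
    SFUNC = comma_concat, STYPE = text, INITCOND = '');
"""

DROP_SQL = """
DROP AGGREGATE IF EXISTS group_concat(text);
DROP FUNCTION IF EXISTS comma_concat(text, text);
"""


class Migration(migrations.Migration):
    # Placeholder app label and dependency; adjust to the real migration history.
    dependencies = [('geneaprove', '0001_initial')]
    operations = [migrations.RunSQL(CREATE_SQL, reverse_sql=DROP_SQL)]
```

With an aggregate like this in place, the raw SQL could call `group_concat(...)` on both backends and the Python-side conditional would no longer be needed.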

changeling (Author)

I'm generally averse to migrations unless models etc. change in code, as they can eventually grow into a management nightmare. That said, perhaps that's the way to go. There's some discussion of the problem here: https://stackoverflow.com/questions/47637652/can-we-define-a-group-concat-function-in-postgresql

Ideally there would be a portable equivalent approach that is database agnostic, but for now this might be the way to go.

briot (Owner)

I was thinking about it some more, and perhaps I should remove the aggregate altogether and do it in Python.
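(For illustration, a hedged sketch of that Python-side aggregation; the `(main_id, generation, parent)` row shape is an assumption inferred from the query fragments in this PR, not the actual result set.)

```python
from collections import defaultdict

# Stand-in for the rows of the recursive ancestors query without any SQL
# aggregate: one (main_id, generation, parent) tuple per parent link, with
# parent possibly None when no parent is recorded.
rows = [(1, 0, 10), (1, 0, 11), (10, 1, 20), (11, 1, None)]

generation_of = {}
parents_of = defaultdict(list)
for main_id, generation, parent in rows:
    generation_of[main_id] = generation
    if parent is not None:
        parents_of[main_id].append(parent)

# parents_of[1] == [10, 11], the same grouping group_concat produced in SQL
```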

"FROM ancestors LEFT JOIN parents "
"ON parents.main_id=ancestors.main_id "
f"{sk}"
"GROUP BY ancestors.main_id, ancestors.generation"
)
logger.debug(f'get_ancestors() query: {q}')
briot (Owner)

Let's remove that, since we can see all queries by enabling DEBUG in the settings.
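(For anyone following along, a minimal sketch of surfacing those queries with the standard Django logging setup; the `django.db.backends` logger only emits SQL when `DEBUG` is true.)

```python
# settings.py sketch: log every SQL statement Django executes to the console.
DEBUG = True

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}
```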

changeling (Author)

Yep. I used that while testing and intended to remove it before I pushed. :}

```python
    group_concat = "string_agg(children.child::text, ',') AS children "
else:
    initial = f"VALUES({person_id},0)"
    group_concat = "group_concat(children.child) AS children "
```
briot (Owner)

Same as above

"FROM descendants LEFT JOIN children "
"ON children.main_id=descendants.main_id "
f"{sk}"
"GROUP BY descendants.main_id, descendants.generation"
)
logger.debug(f'get_descendants() query: {q}')
briot (Owner)

Remove the log

changeling (Author)

Yep.

@briot (Owner) commented Mar 28, 2019 via email

@changeling (Author)

> Still a lot. Could you identify which query is the slowest, and perhaps run `explain analyze` on it in psql? Thanks

I'll look at those, yep! Another thing, with the file that goes back many, many generations, both quilts and stats chew gigabyte upon gigabyte of disk space (>15GB) and consume all of my available RAM on a 16GB MacBook Pro, until the machine dies. I'll put that in an issue.

@changeling (Author)

(To be clear, this is with postgres, which, I believe, is the RAM culprit. I'll look to identify where my disk space is going, but first I'm currently testing stats and quilts with sqlite.)

@briot (Owner) commented Mar 29, 2019 via email

@changeling (Author)

I'll look at doing VACUUM ANALYZE when I get back to postgres, though the automatic VACUUM worker processes are very active as the RAM dwindles. The sqlite stats test is still proceeding from the time of my last comment, currently having used about 20 GB of disk space. The memory management seems drastically better, with little impact on free RAM: I started with around 80% free, and am hovering around 70% throughout the run.

@changeling (Author)

I had to stop the sqlite stats test after 3 hours (and 34GB of disk space used) as my available disk space was down to 162MB. This is with the ManyMany...ged file I believe I've shared with you.

@briot (Owner) commented Mar 29, 2019 via email

@changeling (Author)

The ManyMany file has around 220,000 people, but many generations, and no descendants, only ancestors. I'm actually no longer sure how many generations (this would be a neat feature for the dashboard: deepest generation in the tree). It might also be a good idea to set the bounds for zooming in the different graphics to the selected generation level. When I have extreme charts, the zoom level is insufficient to get the entire image onscreen.

The GrandSchemeTree, with about a million people, doesn't seem to have any problems, though it's only about 8 generations deep, with 8 generations of descendants.

Pedigree is no problem with the preset themes (one of the ~30-second queries), as are all the others set to their max. I've even tweaked the various Side.tsx files to go way, way out, and they run fine. I'll try them with custom themes (very cool feature, by the way).

Radial can get weird at the later generations, as do the Fan charts, likely due to problems in the data or descendants that overlap.

Here are examples of deep Radial and Fan charts:

Radial: [screenshot of a deep Radial chart]

Fan: [screenshot of a deep Fan chart]

Move `CHUNK_SIZE` to `settings.py`.
Remove extraneous `logger` statements from `queries.py`.
Put `database_in_use` declaration back until `group_concat` is resolved.
@briot (Owner) commented Mar 30, 2019 via email

@changeling (Author)

Sorry I had that ongoing thread in here. I moved most of it into #59.

@briot (Owner) commented Apr 2, 2019

I just committed significant patches that create conflicts here. Mostly this is because queries.py was moved into a new subdirectory, as sql/sqlsets.py. I had been waiting to see whether you could submit a revised PR first, but after a few days I felt it was better to push my changes... Sorry for the extra work; let me know if you need help.

@briot (Owner) commented Apr 4, 2019

You provided a different patch in another PR.

briot closed this Apr 4, 2019