Fix postgresql functionality in queries.py
#56
Conversation
Add check for database engine. Add conditional `CHUNK_SIZE` in `_sql_split()` depending on the database backend. Add `c.name` to the `GROUP BY` clause in `_query_get_sex()`, as required by postgres. Add conditional `initial` and `group_concat` declarations (`string_agg` for postgres) for postgres support in `get_ancestors()` and `get_descendants()`. Minor change to query execution in `get_descendants()` to match the structure of `get_ancestors()`.
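A hedged sketch of the kind of backend check described above. The use of `connection.vendor` is an assumption (the patch may test `settings.DATABASES['default']['ENGINE']` instead); the SQL fragments mirror the diff excerpts quoted further down in this thread.

```python
from django.db import connection

# Illustrative only: branch on the active database backend before building SQL.
if connection.vendor == 'postgresql':
    # PostgreSQL has no group_concat(); string_agg() also needs a text argument.
    group_concat = "string_agg(parents.parent::text, ',') AS parents "
else:
    # sqlite provides group_concat() natively.
    group_concat = "group_concat(parents.parent) AS parents "
```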
Excellent and timely PR! I am doing some work to optimize the Source View page by doing similar recursive queries and also using group_concat, so I'll take your patch into account for that. Will review soon.
I'm not able to import yet, though I suspect this has more to do with the mess these big files are in and postgres being more rigid than sqlite, but it does make a huge difference! To get the data in, I used the amazing pgloader, which can take a sqlite db and transfer schema, data and all into a postgres database.

I haven't spent any time on benchmarking, but you might appreciate that all operations are generally between 80 and 190 seconds, with the following dataset:

Next, I'll load in the ManyMany generation-deep data and take a look. The debounce and paged list views are really great! I have some thoughts regarding interface that I'll add to an issue after I sleep for a bit. Cheers!
80 to 190 seconds seems like a huge amount of time to display a page.
We still need to do quite a lot of tweaking if those are the numbers you have.
I was hoping for much much faster times.
I suspect that had to do with a slightly out-of-date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.
    CHUNK_SIZE = 8900
else:
    CHUNK_SIZE = 996
I think it would be nicer to directly add a key in the settings file, and use it from here. That way it is possible to change it more easily, and avoids hard-coding this check for databases here.
https://docs.djangoproject.com/en/2.1/topics/settings/
What do you think ?
I should have done that to begin with. :}
How about this, in the DATABASES setting (I'm using variables at the top of settings for the particulars):
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql',
'CONN_MAX_AGE': 10,
'NAME': PG_DATABASE_NAME,
'USER': PG_DATABASE_USER,
'PASSWORD': PG_DATABASE_PASSWORD,
'HOST': 'localhost',
'PORT': PG_DATABASE_PORT,
'CHUNK_SIZE': 8900
}
}
With this at the top of queries.py:
# Add max query size for other database backends
CHUNK_SIZE = settings.DATABASES['default']['CHUNK_SIZE']
And then:
def _sql_split(self, ids, chunk_size=CHUNK_SIZE):
"""
Generate multiple tuples to split a long list of ids into more
manageable chunks for Sqlite
"""
if ids is None:
yield None
else:
ids = list(ids) # need a list to extract parts of it
for i in range(0, len(ids), chunk_size):
yield ids[i:i + chunk_size]
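For context, a small usage sketch of `_sql_split()`: each chunk feeds a parameterized IN clause, which keeps a long id list under the backend's bound-parameter limit (sqlite's default cap of 999 variables is presumably why the fallback is 996). The id list, table name and cursor below are placeholders, not part of the patch.

```python
# Hypothetical caller inside the same class.
ids = list(range(25000))          # made-up id list, larger than one chunk
for chunk in self._sql_split(ids):
    placeholders = ','.join(['%s'] * len(chunk))
    cursor.execute(
        f"SELECT id, name FROM some_table WHERE id IN ({placeholders})",
        chunk,
    )
```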
I think this is the right approach (defining CHUNK_SIZE in settings.py)
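A minimal sketch of reading that key defensively, so settings files that do not define it keep working; the `.get()` fallback and the 996 default are assumptions rather than part of the PR.

```python
from django.conf import settings

# Fall back to the sqlite-safe chunk size when the custom key is absent.
CHUNK_SIZE = settings.DATABASES['default'].get('CHUNK_SIZE', 996)
```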
group_concat = "string_agg(parents.parent::text, ',') AS parents " | ||
else: | ||
initial = f"VALUES({person_id},0)" | ||
group_concat = "group_concat(parents.parent) AS parents " |
Can we instead make a migration and declare the group_concat aggregate in postgresql (that had been my initial plan, at least)? "CREATE AGGREGATE group_concat()". I must say I never quite remember the syntax.
Something along these lines (untested):
CREATE OR REPLACE FUNCTION comma_concat(text, text) RETURNS text
LANGUAGE plpgsql
AS $_$
begin
    if $1 = '' then
        return $2;
    else
        return $1 || ',' || $2;
    end if;
end;
$_$;

CREATE AGGREGATE group_concat(text) (
    SFUNC = textcat, STYPE = text, INITCOND = '');

CREATE AGGREGATE br_concat(text) (
    SFUNC = comma_concat, STYPE = text, INITCOND = '');
Then we can avoid the test for postgresql in the code
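A hedged sketch of what such a migration could look like, assuming a RunPython guard so it only runs on PostgreSQL; the app label and dependency are hypothetical. It uses a comma-separating transition function for `group_concat` so the result matches sqlite's comma-joined output rather than plain textcat.

```python
from django.db import migrations

CREATE_FUNCTION = """
CREATE OR REPLACE FUNCTION comma_concat(acc text, val text) RETURNS text
    LANGUAGE sql IMMUTABLE AS $$
        SELECT CASE
            WHEN acc IS NULL THEN val
            WHEN val IS NULL THEN acc
            ELSE acc || ',' || val
        END
    $$;
"""

CREATE_AGGREGATE = """
CREATE AGGREGATE group_concat(text) (
    SFUNC = comma_concat,
    STYPE = text
);
"""

DROP_SQL = """
DROP AGGREGATE IF EXISTS group_concat(text);
DROP FUNCTION IF EXISTS comma_concat(text, text);
"""


def create_aggregate(apps, schema_editor):
    # Only PostgreSQL needs this; sqlite ships group_concat() natively.
    if schema_editor.connection.vendor == 'postgresql':
        schema_editor.execute(CREATE_FUNCTION)
        schema_editor.execute(CREATE_AGGREGATE)


def drop_aggregate(apps, schema_editor):
    if schema_editor.connection.vendor == 'postgresql':
        schema_editor.execute(DROP_SQL)


class Migration(migrations.Migration):

    # Hypothetical dependency; point this at the app's latest migration.
    dependencies = [
        ('geneaprove', '0001_initial'),
    ]

    operations = [
        migrations.RunPython(create_aggregate, drop_aggregate),
    ]
```

The reverse function drops the aggregate again, so the migration stays reversible and leaves sqlite databases untouched.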
I'm generally averse to migrations unless models etc change in code, as they can eventually grow into a management nightmare. That said, perhaps that's the way to go. There's some discussion of the problem here: https://stackoverflow.com/questions/47637652/can-we-define-a-group-concat-function-in-postgresql
Ideally there would be a portable equivalent approach that is database agnostic, but for now this might be the way to go.
I was thinking about it some more, and perhaps I should remove the aggregate altogether and do it in Python.
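A rough sketch of that Python-side grouping, under the assumption that the recursive query would then return plain (main_id, generation, parent) rows with no aggregate at all; the names and row shape are illustrative, not the actual code.

```python
from collections import defaultdict


def group_parents(rows):
    """
    Group (main_id, generation, parent) tuples in Python instead of using a
    SQL aggregate, so the query stays identical on sqlite and PostgreSQL.
    """
    generations = {}
    parents = defaultdict(list)
    for main_id, generation, parent in rows:
        generations.setdefault(main_id, generation)
        if parent is not None:
            parents[main_id].append(parent)
    return [
        (main_id, generation, parents[main_id])
        for main_id, generation in generations.items()
    ]
```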
backend/geneaprove/views/queries.py (outdated)
"FROM ancestors LEFT JOIN parents " | ||
"ON parents.main_id=ancestors.main_id " | ||
f"{sk}" | ||
"GROUP BY ancestors.main_id, ancestors.generation" | ||
) | ||
logger.debug(f'get_ancestors() query: {q}') |
Let's remove that, since we can see all queries by enabling DEBUG in the settings
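For reference, a sketch of the settings that make those queries visible without the extra logger calls: with DEBUG on, Django routes every executed statement through the django.db.backends logger.

```python
# settings.py sketch: log every SQL statement to the console.
DEBUG = True

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}
```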
Yep. I used that while testing and intended to remove it before I pushed. :}
group_concat = "string_agg(children.child::text, ',') AS children " | ||
else: | ||
initial = f"VALUES({person_id},0)" | ||
group_concat = "group_concat(children.child) AS children " |
Same as above
backend/geneaprove/views/queries.py (outdated)
"FROM descendants LEFT JOIN children " | ||
"ON children.main_id=descendants.main_id " | ||
f"{sk}" | ||
"GROUP BY descendants.main_id, descendants.generation" | ||
) | ||
logger.debug(f'get_descendants() query: {q}') |
Remove the log
Yep.
> I suspect that had to do with a slightly out-of-date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.
Still a lot. Could you identify which query is the slowest, and perhaps run “explain analyze” on it in psql?
Thanks
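If psql is inconvenient, roughly the same check can be run from a Django shell; the query below is a placeholder to be replaced with the slow statement captured from the logs.

```python
from django.db import connection

# Placeholder statement; paste the slow query logged by django.db.backends.
slow_query = "SELECT 1"

with connection.cursor() as cur:
    cur.execute("EXPLAIN ANALYZE " + slow_query)
    for (line,) in cur.fetchall():
        print(line)
```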
I'll look at those, yep! Another thing, with the file that goes back many, many generations, both …
(To be clear, this is with postgres, which, I believe, is the RAM culprit.) I'll look to identify where my disk space is going, but first, I'm at this moment testing stats and quilts with sqlite.
BTW, can you also try running VACUUM ANALYZE in postgresql? In my experience, this results in significant improvements when you have just loaded a large table.
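For completeness, a sketch of running that one-off from a Django shell instead of psql. VACUUM cannot run inside a transaction block, so this assumes Django's default autocommit mode (i.e. not wrapped in atomic()).

```python
from django.db import connection

# One-off maintenance command; runs in autocommit outside of atomic().
with connection.cursor() as cur:
    cur.execute("VACUUM ANALYZE")
```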
I'll look at doing VACUUM ANALYZE when I get back to postgres, though the automatic VACUUM helpers are very active processes as the RAM dwindles. The sqlite stats test is still proceeding from the time of my last comment, currently having used about 20 GB of disk space. The memory management seems drastically better, without much impact: I started with around 80% free, and am hovering around 70% throughout the run.
I had to stop the sqlite stats test after 3 hours (and 34 GB of disk space used) as my available disk space was down to 162 MB remaining. This is with the ManyMany...ged file I believe I've shared with you.
For the quilts view, I need to improve the code a bit: even though you might have requested only 10 generations in the GUI, the backend is still going to compute the full depth, so that's no good.

You likely also have a similar issue with the pedigree if you use the “Custom Theme Demo” colors. They need to compute which persons are in your tree, and that has to be the slow query here (it is also needed by the Stats view). I am not sure how much it could be improved, but 1 million persons should be no problem for Postgres.

I do have the ManyMany files, though I don’t think I had 1 million persons there. I’ll do some testing in a few days.
The ManyMany file is around 220,000 people, but many generations, and no descendants, only ancestors. I'm actually no longer sure how many generations (this would be a neat feature for the dashboard: deepest generation in the tree). It might also be a good idea to set the bounds for zooming in the different graphics to the selected generation level; when I have extreme charts, the zoom level is insufficient to get the entire image onscreen.

The GrandSchemeTree, with around a million people, doesn't seem to have any problems, though it's only about 8 generations deep, with 8 generations of descendants. Pedigree is no problem with the preset themes, one of the ~30 second queries, as are all the others set to their max. I've even tweaked the various Side.tsx files to go way, way out, and they run fine. I'll try them with custom themes (very cool feature, by the way).

Radial can get weird at the later generations, likely due to problems in the data, or descendants that overlap, as do the Fan charts. Here are examples of deep Radial and Fan charts:
Move `CHUNK_SIZE` to `settings.py`. Remove extraneous `logger` statements from `queries.py`.
Put database_in_use declaration back until group_concat resolved.
> I'm generally averse to migrations unless models etc change in code, as they can eventually grow into a management nightmare. That said, perhaps that's the way to go. There's some discussion of the problem here: https://stackoverflow.com/questions/47637652/can-we-define-a-group-concat-function-in-postgresql
> Ideally there would be a portable equivalent approach that is database agnostic, but for now this might be the way to go.
I was thinking about it, and perhaps that group_by can be done in python instead. I’ll have a look.
Sorry I had that ongoing thread in here. I moved most of it into #59.
I just committed significant patches that create conflicts here. Mostly this is because queries.py was moved to a new subdirectory, sql/sqlsets.py. I had been waiting to check whether you could submit a revised PR first, but after a few days I feel better pushing my changes... Sorry for the extra work; let me know if you need help.
You provided a different patch in another PR.