
Fix postgresql functionality in queries.py. #56

Closed
wants to merge 4 commits

Conversation

changeling (Contributor)

Add check for database engine (see the sketch below).
Add conditional `CHUNK_SIZE` for `_sql_split()` for database backend.
Add `c.name` to `GROUP BY` clause in `_query_get_sex()` as per postgres requirement.
Add conditional `initial` and `group_concat` declarations (`string_agg` for postgres) for postgres functionality in `get_ancestors()` and `get_descendants()`.
Minor change to query execution in `get_descendants()` to match `get_ancestors()` structure.

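(For illustration, a hedged sketch of the engine check and conditional `CHUNK_SIZE` described above. The 996/8900 values appear in the diff below and the `database_in_use` name in a later commit message; the `endswith()` test and everything else here is an assumption, not necessarily the PR's exact code.)

```python
from django.conf import settings

# Sketch only: pick a chunk size based on the configured database backend.
database_in_use = settings.DATABASES['default']['ENGINE']
if database_in_use.endswith('postgresql'):
    CHUNK_SIZE = 8900
else:
    CHUNK_SIZE = 996  # SQLite caps bound variables (999 by default)
```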
@briot (Owner) commented Mar 28, 2019

Excellent and timely PR!
I was wondering whether you would have time to look at PostgreSQL, since SQLite is no match given the size of your database. I wonder how much of an improvement you saw by moving to PostgreSQL?

I am doing some work to optimize the Source View page by doing similar recursive queries and also using group_concat, so I'll take your patch into account for that. Will review soon.

@changeling (Author)

I'm not able to import yet, though I suspect that has more to do with the mess these big files are in and postgres being more rigid than sqlite. It does make a huge difference, though! To get the data in, I used the amazing pgloader, which can take a sqlite db and transfer schema, data and all into a postgres database. I haven't spent any time on benchmarking, but you might appreciate that all operations are generally between 80 and 190 seconds, with the following dataset:

[Screenshot: dataset statistics]

Next, I'll load in the ManyMany generation-deep data and take a look.

The debounce and paged list views are really great! I have some thoughts regarding the interface that I'll add to an issue after I sleep for a bit. Cheers!

@briot (Owner) commented Mar 28, 2019 via email

@changeling (Author)

I suspect that had to do with a slightly out-of-date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.

```python
    CHUNK_SIZE = 8900
else:
    CHUNK_SIZE = 996
```

briot (Owner)

I think it would be nicer to directly add a key in the settings file, and use it from here. That way it is possible to change it more easily, and it avoids hard-coding this check for databases here.
https://docs.djangoproject.com/en/2.1/topics/settings/

What do you think?

changeling (Author)

I should have done that to begin with. :}

How about this, in the DATABASES setting (I'm using variables at the top of settings for the particulars):

```python
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 10,
        'NAME': PG_DATABASE_NAME,
        'USER': PG_DATABASE_USER,
        'PASSWORD': PG_DATABASE_PASSWORD,
        'HOST': 'localhost',
        'PORT': PG_DATABASE_PORT,
        'CHUNK_SIZE': 8900
    }
}
```

With this at the top of queries.py:

```python
from django.conf import settings  # if not already imported in queries.py

# Add max query size for other database backends
CHUNK_SIZE = settings.DATABASES['default']['CHUNK_SIZE']
```

And then:

```python
    def _sql_split(self, ids, chunk_size=CHUNK_SIZE):
        """
        Generate multiple tuples to split a long list of ids into more
        manageable chunks for Sqlite
        """
        if ids is None:
            yield None
        else:
            ids = list(ids)  # need a list to extract parts of it
            for i in range(0, len(ids), chunk_size):
                yield ids[i:i + chunk_size]
```
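(For illustration, a hedged sketch of how the generated chunks might be consumed inside one of the query methods when building `IN (...)` clauses; the `persona` table and column names are placeholders, not the project's actual schema.)

```python
from django.db import connection

# Sketch only: run one query per chunk and merge the rows.
rows = []
for chunk in self._sql_split(ids):
    if chunk is None:
        break  # no ids were given
    placeholders = ",".join(["%s"] * len(chunk))
    with connection.cursor() as cur:
        cur.execute(
            f"SELECT id, name FROM persona WHERE id IN ({placeholders})",
            chunk)
        rows.extend(cur.fetchall())
```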

briot (Owner)

I think this is the right approach (defining `CHUNK_SIZE` in settings.py).
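(A small refinement worth considering, shown as a hedged sketch: read the key with a fallback so a settings file without `CHUNK_SIZE` still works. The 996 default is an assumption taken from the SQLite branch above.)

```python
from django.conf import settings

# Fall back to the SQLite-safe value when DATABASES has no CHUNK_SIZE key.
CHUNK_SIZE = settings.DATABASES['default'].get('CHUNK_SIZE', 996)
```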

```python
    group_concat = "string_agg(parents.parent::text, ',') AS parents "
else:
    initial = f"VALUES({person_id},0)"
    group_concat = "group_concat(parents.parent) AS parents "
```
briot (Owner)

Can we instead make a migration and declare the group_concat aggregate in PostgreSQL (that had been my initial plan, at least)? `CREATE AGGREGATE group_concat()`. I must say I never quite remember the syntax.
Something along the lines of the following (untested):

```sql
CREATE OR REPLACE FUNCTION comma_concat(text, text) RETURNS text
    LANGUAGE plpgsql
    AS $_$
begin
  if $1 = '' then
    return $2;
  else
    return $1 || ',' || $2;
  end if;
end;
$_$;

CREATE AGGREGATE group_concat(text) (
    SFUNC = textcat, STYPE = text, INITCOND = '');
CREATE AGGREGATE br_concat(text) (
    SFUNC = comma_concat, STYPE = text, INITCOND = '');
```

Then we can avoid the test for PostgreSQL in the code.
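(A hedged sketch of what such a migration could look like using Django's `RunSQL`. The app label, dependency, and the comma-separating aggregate definition are assumptions, not the project's actual migration, and it would need to be guarded or skipped on non-PostgreSQL backends.)

```python
from django.db import migrations

CREATE_SQL = """
CREATE OR REPLACE FUNCTION comma_concat(text, text) RETURNS text
    LANGUAGE sql
    AS $$ SELECT CASE WHEN $1 = '' THEN $2 ELSE $1 || ',' || $2 END $$;

CREATE AGGREGATE group_concat(text) (
    SFUNC = comma_concat, STYPE = text, INITCOND = '');
"""

DROP_SQL = """
DROP AGGREGATE IF EXISTS group_concat(text);
DROP FUNCTION IF EXISTS comma_concat(text, text);
"""


class Migration(migrations.Migration):
    # Placeholder app label and dependency; adjust to the real migration history.
    dependencies = [('geneaprove', '0001_initial')]
    operations = [migrations.RunSQL(CREATE_SQL, reverse_sql=DROP_SQL)]
```

With an aggregate like this in place, the raw SQL could call `group_concat(...)` on both backends and the Python-side conditional would no longer be needed.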

changeling (Author)

I'm generally averse to migrations unless models etc. change in code, as they can eventually grow into a management nightmare. That said, perhaps that's the way to go. There's some discussion of the problem here: https://stackoverflow.com/questions/47637652/can-we-define-a-group-concat-function-in-postgresql

Ideally there would be a portable equivalent approach that is database agnostic, but for now this might be the way to go.

briot (Owner)

I was thinking about it some more, and perhaps I should remove the aggregate altogether and do it in Python.
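(For illustration, a hedged sketch of that Python-side aggregation; the `(main_id, generation, parent)` row shape is an assumption inferred from the query fragments in this PR, not the actual result set.)

```python
from collections import defaultdict

# Stand-in for the rows of the recursive ancestors query without any SQL
# aggregate: one (main_id, generation, parent) tuple per parent link, with
# parent possibly None when no parent is recorded.
rows = [(1, 0, 10), (1, 0, 11), (10, 1, 20), (11, 1, None)]

generation_of = {}
parents_of = defaultdict(list)
for main_id, generation, parent in rows:
    generation_of[main_id] = generation
    if parent is not None:
        parents_of[main_id].append(parent)

# parents_of[1] == [10, 11], the same grouping group_concat produced in SQL
```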

"FROM ancestors LEFT JOIN parents "
"ON parents.main_id=ancestors.main_id "
f"{sk}"
"GROUP BY ancestors.main_id, ancestors.generation"
)
logger.debug(f'get_ancestors() query: {q}')
briot (Owner)

Let's remove that, since we can see all queries by enabling DEBUG in the settings.
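(For anyone following along, a minimal sketch of surfacing those queries with the standard Django logging setup; the `django.db.backends` logger only emits SQL when `DEBUG` is true.)

```python
# settings.py sketch: log every SQL statement Django executes to the console.
DEBUG = True

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}
```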

changeling (Author)

Yep. I used that while testing and intended to remove it before I pushed. :}

```python
    group_concat = "string_agg(children.child::text, ',') AS children "
else:
    initial = f"VALUES({person_id},0)"
    group_concat = "group_concat(children.child) AS children "
```
briot (Owner)

Same as above

"FROM descendants LEFT JOIN children "
"ON children.main_id=descendants.main_id "
f"{sk}"
"GROUP BY descendants.main_id, descendants.generation"
)
logger.debug(f'get_descendants() query: {q}')
briot (Owner)

Remove the log

changeling (Author)

Yep.

@briot (Owner) commented Mar 28, 2019 via email

@changeling (Author)

> Still a lot. Could you identify which query is the slowest, and perhaps run `explain analyze` on it in psql? Thanks

I'll look at those, yep! Another thing, with the file that goes back many, many generations, both quilts and stats chew gigabyte upon gigabyte of disk space (>15GB) and consume all of my available RAM on a 16GB MacBook Pro, until the machine dies. I'll put that in an issue.

@changeling (Author)

(To be clear, this is with postgres, which, I believe, is the RAM culprit. I'll look to identify where my disk space is going, but first I'm currently testing stats and quilts with sqlite.)

@briot (Owner) commented Mar 29, 2019 via email

@changeling (Author)

I'll look at doing VACUUM ANALYZE when I get back to postgres, though the automatic VACUUM worker processes are very active as the RAM dwindles. The sqlite stats test is still proceeding from the time of my last comment, currently having used about 20 GB of disk space. The memory management seems drastically better, with little impact on free RAM: I started with around 80% free, and am hovering around 70% throughout the run.

@changeling (Author)

I had to stop the sqlite stats test after 3 hours (and 34GB of disk space used) as my available disk space was down to 162MB. This is with the ManyMany...ged file I believe I've shared with you.

@briot (Owner) commented Mar 29, 2019 via email

@changeling (Author)

The ManyMany file has around 220,000 people, but many generations, and no descendants, only ancestors. I'm actually no longer sure how many generations (this would be a neat feature for the dashboard: deepest generation in the tree). It might also be a good idea to set the bounds for zooming in the different graphics to the selected generation level. When I have extreme charts, the zoom level is insufficient to get the entire image onscreen.

The GrandSchemeTree, with about a million people, doesn't seem to have any problems, though it's only about 8 generations deep, with 8 generations of descendants.

Pedigree is no problem with the preset themes (one of the ~30-second queries), as are all the others set to their max. I've even tweaked the various Side.tsx files to go way, way out, and they run fine. I'll try them with custom themes (very cool feature, by the way).

Radial can get weird at the later generations, as do the Fan charts, likely due to problems in the data or descendants that overlap.

Here are examples of deep Radial and Fan charts:

Radial: [screenshot of a deep Radial chart]

Fan: [screenshot of a deep Fan chart]

Move `CHUNK_SIZE` to `settings.py`.
Remove extraneous `logger` statements from `queries.py`.
Put `database_in_use` declaration back until `group_concat` is resolved.
@briot (Owner) commented Mar 30, 2019 via email

@changeling (Author)

Sorry I had that ongoing thread in here. I moved most of it into #59.

@briot (Owner) commented Apr 2, 2019

I just committed significant patches that create conflicts here. Mostly this is because queries.py was moved into a new subdirectory, as sql/sqlsets.py. I had been waiting to see whether you could submit a revised PR first, but after a few days I felt it was better to push my changes... Sorry for the extra work; let me know if you need help.

@briot (Owner) commented Apr 4, 2019

You provided a different patch in another PR.

briot closed this Apr 4, 2019