
Upgrade the staging environment to use PostgreSQL 9.6 #3046

Closed · jeancochrane opened this issue Dec 11, 2018 · 8 comments

@jeancochrane (Contributor) commented Dec 11, 2018

  • Execute an in-place upgrade via the console
  • Update the PostgreSQL version strings in /deployment/cfn/data_plane.py to maintain consistency
  • Upgrade PostGIS using:
ALTER EXTENSION POSTGIS UPDATE TO '...'

For all of the specific steps, see: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html

See: #3044
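
For reference, the same in-place upgrade can be kicked off from the AWS CLI instead of the console. A rough sketch only, with placeholder resource names (staging-database and the parameter group name are assumptions, not the real identifiers):

# each major-version hop (9.4 -> 9.5 -> 9.6) is a separate modify call
aws rds modify-db-instance \
    --db-instance-identifier staging-database \
    --engine-version 9.5.18 \
    --db-parameter-group-name postgres95-staging \
    --allow-major-version-upgrade \
    --apply-immediately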

@hectcastro hectcastro added queue and removed tech-debt labels Jun 6, 2019
@hectcastro hectcastro changed the title from "Upgrade the staging environment to use PostgreSQL 9.5" to "Upgrade the staging environment to use PostgreSQL 9.6" Jun 6, 2019
@hectcastro hectcastro added tech-debt and removed queue labels Jun 6, 2019
@rajadain (Member) commented Aug 15, 2019

One of the things we may need to watch out for when doing the Postgres upgrade is how long it takes to re-index some of the tables. By far the largest table is core_job, which has multiple entries for every single Analyze and Model run; these can happen multiple times per user, per project, even per page load. On staging this table currently has 74,098 rows, while on production it has 2,031,829 rows. We may need to add simulated rows (or copy values from production) so that staging reflects the re-indexing cost we will see in production.
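
One way to rehearse that without copying real data would be to bulk-insert synthetic rows with generate_series. A rough sketch only; the column list is an assumption (the real core_job schema, including NOT NULL columns and defaults, would need to be checked first):

-- pad core_job up to roughly the production row count (~2M rows);
-- md5(...)::uuid avoids depending on the pgcrypto extension on 9.x
INSERT INTO core_job (uuid, status)
SELECT md5(random()::text || clock_timestamp()::text)::uuid, 'complete'
FROM generate_series(1, 2000000);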

@rbreslow rbreslow self-assigned this Aug 20, 2019
@rbreslow rbreslow added in progress and removed queue labels Aug 20, 2019
@rbreslow (Member) commented Aug 20, 2019

I opened a shell on the Bastion and added the PostgreSQL APT repository to install psql:

sudo apt-get install curl ca-certificates gnupg
curl https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update
sudo apt-get install postgresql-client-9.4

Next, I opened a psql shell:

PGPASSWORD= \
    psql -h database.service.mmw.internal -U modelmywatershed -d modelmywatershed

I identified how many rows were in the core_job table:

SELECT
   COUNT(*) 
FROM
   core_job;

 count
-------
 74098
(1 row)

I identified rows that are referenced from modeling_project (mapshed and subbasin mapshed jobs):

SELECT COUNT(*)
FROM core_job
WHERE uuid IN (
    SELECT mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
);

 count
-------
   147
(1 row)

I deleted the remainder:

DELETE
FROM core_job
WHERE uuid NOT IN (
    SELECT mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
);

DELETE 73687

This took approximately 5 minutes and 30 seconds.

There is a discrepancy in the number of rows deleted: I would have expected the query to delete 73,951 rows (74,098 - 147), but it deleted 73,687. It's possible miscellaneous interaction with the app added rows to this table between the COUNT() and DELETE operations.
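
One way to rule out concurrent writes next time would be to run the count and the delete against the same snapshot, inside a single repeatable-read transaction. A sketch only, not what was actually run:

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- both statements now see the same snapshot, so concurrent inserts can't skew the comparison
SELECT COUNT(*) FROM core_job;
DELETE FROM core_job
WHERE uuid NOT IN (
    SELECT mapshed_job_uuid_id FROM modeling_project WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id FROM modeling_project WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
);
COMMIT;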

I increased resource allocation for maintenance operations and ensured that all tables were appropriately vacuumed:

modelmywatershed=> SET maintenance_work_mem='256MB';
modelmywatershed=> VACUUM (FREEZE, ANALYZE, VERBOSE);

This took approximately 25 minutes.

After completing these prerequisite tasks, I followed the steps from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html to complete the upgrade.

I created a postgresql9.6-compatible parameter group in the console and compared its values to the values in dataplane-3b0b2ccb6182d7efd9217b-dbpgdatabaseserver-1kc33ao2csrgm. log_min_duration_statement had to be overridden to 500ms; this parameter is disabled by default.
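
For the production run, the equivalent parameter group and override could also be created from the CLI. A sketch with placeholder names (the group name and description are assumptions):

aws rds create-db-parameter-group \
    --db-parameter-group-name postgres96-production \
    --db-parameter-group-family postgres9.6 \
    --description "PostgreSQL 9.6 parameter group"

aws rds modify-db-parameter-group \
    --db-parameter-group-name postgres96-production \
    --parameters "ParameterName=log_min_duration_statement,ParameterValue=500,ApplyMethod=immediate"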

Stepping through the docs, I verified there was no unsupported usage.
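
One of the pre-flight checks the docs call out is leftover prepared transactions, which block the upgrade. A sketch of the kind of query involved:

-- any rows here must be committed or rolled back before upgrading
SELECT count(*) FROM pg_catalog.pg_prepared_xacts;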

At this point, if we were in production, I would take a snapshot of the database. Since an automated snapshot was taken around 7:00 AM, and RDS takes a pre-upgrade snapshot, I decided the stakes were low enough to begin the upgrade.
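
If we do want a manual snapshot before the production upgrade, it's a one-liner from the CLI (identifiers are placeholders):

aws rds create-db-snapshot \
    --db-instance-identifier production-database \
    --db-snapshot-identifier pre-postgres96-upgrade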

I ran this script to check for connectivity:

while true;
do
    date
    PGPASSWORD= \
        psql -h database.service.mmw.internal -U modelmywatershed -d modelmywatershed -c "SELECT version();"
    sleep 5
done

I bumped the engine version to 9.5.18 via the AWS Console and switched to a temporary parameter group I created for the first hop. The database shut down 2 minutes and 21 seconds after I hit "Apply immediately" to kick off the upgrade. The database was unavailable for 6 minutes and 34 seconds.

While waiting for RDS to finish post-upgrade operations, I upgraded the Bastion to postgresql-client-9.5, verified we weren't using any extensions, and upgraded PostGIS:

modelmywatershed=> select * from pg_available_extensions where name like 'postgis%';
          name          | default_version | installed_version |                               comment
------------------------+-----------------+-------------------+---------------------------------------------------------------------
 postgis                | 2.2.5           | 2.1.5             | PostGIS geometry, geography, and raster spatial types and functions
 postgis_tiger_geocoder | 2.2.5           |                   | PostGIS tiger geocoder and reverse geocoder
 postgis_topology       | 2.2.5           |                   | PostGIS topology spatial types and functions
(3 rows)

modelmywatershed=> ALTER EXTENSION postgis UPDATE;
WARNING:  'postgis.backend' is already set and cannot be changed until you reconnect
ALTER EXTENSION

modelmywatershed=> select postgis_full_version();
                                                                        postgis_full_version
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 POSTGIS="2.2.5 r15298" GEOS="3.5.1-CAPI-1.9.1 r4246" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.0.3, released 2016/07/01" LIBXML="2.9.1" LIBJSON="0.12" RASTER
(1 row)

I was unable to make modifications to the database for 28 minutes and 53 seconds while RDS finished post-upgrade operations.

I bumped the engine version to 9.6.14 via the AWS Console and switched to the postgresql9.6-compatible parameter group. The database shut down 2 minutes and 24 seconds after I hit "Apply immediately" to kick off the upgrade. The database was unavailable for 6 minutes and 35 seconds.

While waiting for RDS to finish post-upgrade operations, I upgraded the Bastion to postgresql-client-9.6, verified we weren't using any extensions, and upgraded PostGIS:

modelmywatershed=> select * from pg_available_extensions where name like 'postgis%';
          name          | default_version | installed_version |                               comment
------------------------+-----------------+-------------------+---------------------------------------------------------------------
 postgis                | 2.3.7           | 2.2.5             | PostGIS geometry, geography, and raster spatial types and functions
 postgis_tiger_geocoder | 2.3.7           |                   | PostGIS tiger geocoder and reverse geocoder
 postgis_topology       | 2.3.7           |                   | PostGIS topology spatial types and functions
(3 rows)

modelmywatershed=> ALTER EXTENSION postgis UPDATE;
WARNING:  'postgis.backend' is already set and cannot be changed until you reconnect
ALTER EXTENSION

modelmywatershed=> select postgis_full_version();
                                                                               postgis_full_version
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 POSTGIS="2.3.7 r16523" PGSQL="96" GEOS="3.6.2-CAPI-1.10.2 4d2925d6" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.1.4, released 2017/06/23" LIBXML="2.9.1" LIBJSON="0.12" RASTER
(1 row)

I was unable to make modifications to the database for 2 minutes and 24 seconds while RDS finished post-upgrade operations.

I rebooted the instance to apply the new parameter group. The connectivity check never lost connection to the database, and RDS claims the reboot took 10 seconds.
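
For the checklist: the reboot can also be issued from the CLI (placeholder identifier):

aws rds reboot-db-instance --db-instance-identifier staging-database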

I created a parameter group within the production AWS account in preparation for that upgrade. Also, after things are set in production, we want to remember to update the PostgreSQL version strings in CloudFormation.

The database upgrade, minus the prerequisites of pruning core_job and vacuuming, took 49 minutes and 21 seconds. @hectcastro what do you think of requesting a 1 hour maintenance window? I'm assuming that with more maintenance_work_mem available in production, the upgrade would take less time.


I think there were a couple of areas where this process could have been better:

  • I forgot to include the use of \timing to get a better sense of how long maintenance operations would take.
  • I forgot to vacuum during the 9.5 to 9.6 hop, and I think this is reflected in the duration of the post-upgrade operations after the 9.5 hop. I think the maintenance window estimate isn't compromised, because we would have swallowed those ~20 minutes in a separate vacuum operation anyway. A sketch of how both could be encoded into the checklist is below.
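
For the production run, something like this between hops would cover both points (a sketch, reusing the values from staging):

\timing on
SET maintenance_work_mem = '256MB';
VACUUM (FREEZE, ANALYZE, VERBOSE);
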
@rbreslow (Member) commented Aug 20, 2019

I was able to confirm that the app is working while still using 9.4 client libraries:

[screenshot of the app working on staging]

This means we should be able to upgrade the database in production independently of cutting a release. I would appreciate it if someone with more familiarity with the app could verify things are a-OK.

cc: @mmcfarland (since Terrence is on vacation)

@rbreslow rbreslow added in review and removed in progress labels Aug 20, 2019
@mmcfarland (Member) commented Aug 21, 2019

I did a comprehensive test of https://staging.modelmywatershed.org and found it all to be working, including new and previously created projects.

I investigated the source of the discrepancy in the result counts you noted. Experimenting in my local dev database, I found a similar difference between the counts reported by the IN and NOT IN queries. I speculated that rows with null uuids might not be captured by either query, and, in fact, the count of null-uuid records in my core_job table did match the discrepancy between the two queries.

select count(*) from core_job where uuid is null;

I logged onto the production bastion to see if the counts lined up the same way, but was confused to see that the number of rows in core_job was substantially lower than reported in the comment above: nearly 5k instead of 2MM+.

modelmywatershed=> SELECT COUNT(*) FROM core_job;
 count 
-------
  4489
(1 row)

Is it possible that this table was already reduced with the method described above for staging?

@rbreslow (Member) commented Aug 21, 2019

Thank you for taking a look at this.

Is it possible that this table was already reduced with the method described above for staging?

I believe Hector and Terrence already ran through this: https://azavea.slack.com/archives/C044CG212/p1566235838075400

@mmcfarland (Member) commented Aug 21, 2019

I believe Hector and Terrence already ran through this

Great. At any rate, I don't believe the count discrepancy resulted in any valuable data getting lost.

@hectcastro (Contributor) commented Aug 21, 2019

Nice job going through all of the steps in this process and outlining your findings.

Some comments:

  • I think that we can omit the steps to prune the core_job table. That work was already done and cut the record count down significantly. Doing it again during the release will only provide a minor benefit.
  • We should ensure that we vacuum the database prior to the first (and each subsequent) upgrade step, though.
  • I don't think that we need to take a snapshot prior to the database upgrades because RDS will do that for us. We can depend on their snapshot.
  • We should encode the steps necessary to upgrade the client libraries into the checklist (9.4 -> 9.5 -> 9.6); a sketch follows this list. I think that we're safe in applying the 9.5 client library upgrade in advance of the release.
  • We need to remember to vacuum the database after all of the upgrades are complete.
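
A sketch of the client library hops, mirroring the repository setup above (assumes the PGDG APT repository is already configured on the Bastion):

# first hop: 9.4 -> 9.5 client (safe to apply ahead of the release)
sudo apt-get update
sudo apt-get install postgresql-client-9.5

# second hop: once the server is on 9.6
sudo apt-get install postgresql-client-9.6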

Some clarifications:

I was unable to make modifications to the database for 28 minutes and 53 seconds while RDS finished post-upgrade operations.

Does this imply you couldn't make modifications to the RDS instance parameters, not the PostgreSQL database or its associated tables?

What do you think of requesting a 1 hour maintenance window?

I think that we should communicate 3 hours for the entire release and plan for the database upgrade step to consume roughly 1/3 to 1/2 of it.

@rbreslow (Member) commented Aug 21, 2019

Does this imply you couldn't make modifications to the RDS instance parameters, not the PostgreSQL database or its associated tables?

Right, I couldn't make modifications to the RDS instance parameters. The "Modify now" button was disabled. This prevented me from beginning the next hop from 9.5 to 9.6, and prevented me from rebooting to apply the parameter group after the 9.6 upgrade.

I was able to interact with the database over psql, and tested that the app was working between each hop.

I think that we should communicate 3 hours for the entire release and plan for the database upgrade step to consume roughly 1/3 to 1/2 of it.

👍

@hectcastro hectcastro closed this Aug 22, 2019
@hectcastro hectcastro removed the in review label Aug 22, 2019