
Reduce Size of core_job for the Postgres Upgrade #3146

Closed
rajadain opened this issue Aug 15, 2019 · 8 comments
rajadain (Member) commented Aug 15, 2019

Postgres upgrades can be sped up if the data being migrated is small. Currently there are ~2M rows taking up about 58GB in core_job on production. Not all of these rows are useful: many of them are simply cached values for Celery geoprocessing tasks. Some rows are important, though: those referenced from modeling_project for subbasin results.

Make a migration that removes extraneous rows from core_job ahead of the Postgres Upgrade.
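For reference, the size and row-count figures above can be checked with standard Postgres catalog functions (nothing project-specific here):

SELECT pg_size_pretty(pg_total_relation_size('core_job'));  -- includes indexes and TOAST

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'core_job';  -- planner's estimate; avoids a full scan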

@rajadain rajadain added the NSF label Aug 15, 2019
@rajadain rajadain self-assigned this Aug 15, 2019
@rajadain rajadain added the + label Aug 15, 2019
rajadain (Member, Author) commented Aug 16, 2019

On production:

SELECT COUNT(*)
FROM core_job
WHERE uuid IN (
    SELECT mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
);

 count
-------
  1961
(1 row)

So almost all the rest (about 2M rows) can be safely deleted.

To delete those rows, the following query should be used:

DELETE
FROM core_job
WHERE uuid NOT IN (
    SELECT mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
);
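One safeguard worth considering when running this by hand (a suggestion, not something prescribed in the thread): wrap the DELETE in a transaction so the reported row count can be sanity-checked before committing:

BEGIN;
-- run the DELETE above; psql reports "DELETE <n>"
-- if n is far from the expected ~2M, issue ROLLBACK; instead
COMMIT;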
rajadain (Member, Author) commented Aug 16, 2019

Tagging @hectcastro, who may use the above query in the Postgres Upgrade.

rajadain (Member, Author) commented Aug 16, 2019

To further explain the above, these are the only outside references to core_job:

\d core_job;
                                     Table "public.core_job"
    Column    |           Type           |                       Modifiers
--------------+--------------------------+-------------------------------------------------------
 id           | integer                  | not null default nextval('core_job_id_seq'::regclass)
 uuid         | uuid                     |
 model_input  | text                     | not null
 created_at   | timestamp with time zone | not null
 result       | text                     | not null
 delivered_at | timestamp with time zone |
 error        | text                     | not null
 traceback    | text                     | not null
 status       | character varying(255)   | not null
 user_id      | integer                  |
Indexes:
    "core_job_pkey" PRIMARY KEY, btree (id)
    "core_job_uuid_5f077d4579939327_uniq" UNIQUE CONSTRAINT, btree (uuid)
    "core_job_e8701ad4" btree (user_id)
Foreign-key constraints:
    "core_job_user_id_65faa30363ca6a76_fk_auth_user_id" FOREIGN KEY (user_id) REFERENCES auth_user(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "modeling_project" CONSTRAINT "modeling__mapshed_job_uuid_id_754f17390f2966e1_fk_core_job_uuid" FOREIGN KEY (mapshed_job_uuid_id) REFERENCES core_job(uuid) DEFERRABLE INITIALLY DEFERRED
    TABLE "modeling_project" CONSTRAINT "subbasin_mapshed_job_uuid_id_7c80a82e7ee70dca_fk_core_job_uuid" FOREIGN KEY (subbasin_mapshed_job_uuid_id) REFERENCES core_job(uuid) DEFERRABLE INITIALLY DEFERRED
rajadain (Member, Author) commented Aug 16, 2019

Testing Instructions

  • Go to :8000/ and log in
  • Create a TR-55 project, a MapShed project, and a Subbasin project
  • Open a database shell: ./scripts/manage.sh dbshell
  • Run the above DELETE query (a post-delete check is sketched after this list)
  • Make sure that those projects still load correctly
  • Make sure that new analyses work correctly
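For the post-delete check mentioned above (an illustration, not part of the original instructions), every remaining job should be one referenced by a project:

SELECT COUNT(*)
FROM core_job;
-- should match the "kept" count from the SELECT COUNT(*) ... WHERE uuid IN (...)
-- query above (any rows with a NULL uuid would also survive the NOT IN delete)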
@rajadain rajadain assigned caseycesari and rajadain and unassigned rajadain Aug 16, 2019
caseycesari (Member) commented Aug 16, 2019

I don't have a full enough understanding of the MMW DB to know whether or not running the DELETE query is "safe", but I followed the steps above, and everything worked as described. Here are the results of the queries when I ran them locally. Note the discrepancy in the count.

mmw=# SELECT COUNT(*)
mmw-# FROM core_job
mmw-# WHERE uuid IN (
mmw(#     SELECT mapshed_job_uuid_id AS uuid
mmw(#     FROM modeling_project
mmw(#     WHERE mapshed_job_uuid_id IS NOT NULL
mmw(#     UNION
mmw(#     SELECT subbasin_mapshed_job_uuid_id AS uuid
mmw(#     FROM modeling_project
mmw(#     WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
mmw(# );
 count
-------
     3
(1 row)

mmw=#
mmw=# DELETE
mmw-# FROM core_job
mmw-# WHERE uuid NOT IN (
mmw(#     SELECT mapshed_job_uuid_id AS uuid
mmw(#     FROM modeling_project
mmw(#     WHERE mapshed_job_uuid_id IS NOT NULL
mmw(#     UNION
mmw(#     SELECT subbasin_mapshed_job_uuid_id AS uuid
mmw(#     FROM modeling_project
mmw(#     WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
mmw(# );
DELETE 90
rajadain (Member, Author) commented Aug 16, 2019

The discrepancy comes from the WHERE clause. The first query counts only the rows that are IN the set referenced by modeling_project; the DELETE removes those that are NOT IN it. The two numbers are complements, not the same measurement: locally, 3 rows were kept and 90 were deleted.
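To make the partition explicit (an illustration, not a query from the thread; FILTER requires Postgres 9.4+), the kept and deletable counts can be computed side by side:

WITH referenced AS (
    SELECT mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE mapshed_job_uuid_id IS NOT NULL
    UNION
    SELECT subbasin_mapshed_job_uuid_id AS uuid
    FROM modeling_project
    WHERE subbasin_mapshed_job_uuid_id IS NOT NULL
)
SELECT
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE uuid IN (SELECT uuid FROM referenced)) AS kept,
    COUNT(*) FILTER (WHERE uuid NOT IN (SELECT uuid FROM referenced)) AS deletable
FROM core_job;
-- locally: total = 93, kept = 3, deletable = 90
-- (rows with a NULL uuid would fall into neither bucket)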

hectcastro (Contributor) commented Aug 19, 2019

A table dump of core_job and modeling_project is ~34GB compressed and takes over two hours to create. I practiced capturing the production tables over the weekend, and plan to kick off another capture on Monday morning. Shortly after that completes, I'll execute the SQL statements above to reduce the size of the core_job table in production.

@caseycesari caseycesari removed their assignment Aug 19, 2019
hectcastro (Contributor) commented Aug 20, 2019

Looks like the deletion and vacuum process is now complete. I also ran a manual vacuum last night to hit all of the other tables.

The table itself won't report a smaller on-disk size, but the deleted records are now marked as dead tuples, so new writes to the table will reuse that space.
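For anyone reading later: a plain VACUUM marks dead tuples as reusable but does not shrink the file, while VACUUM FULL rewrites the table and returns the space to the OS at the cost of an ACCESS EXCLUSIVE lock. A sketch of both (standard Postgres commands, not necessarily the exact ones run here):

VACUUM (VERBOSE, ANALYZE) core_job;  -- reclaims dead tuples for reuse; table stays online

-- VACUUM FULL core_job;  -- shrinks the file on disk, but locks the table (needs downtime)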

@jwalgran jwalgran closed this Aug 22, 2019