Arches v7: Relational Views and SQL ETL Methods Bug #9049

Open

ekansa opened this issue Oct 24, 2022 · 24 comments
Labels
Type: Bug Something isn't working

Comments

ekansa commented Oct 24, 2022

I’m running into trouble with Arches v7 using the SQL import methods described here: Resource Import/Export — 7.1.0

I’ve got a test dataset of 46,000 archaeological sites with some simple flat attributes. I’ve been testing a SQL-based import of this table into Historic England’s resource models / graphs / branches as provided here: GitHub - archesproject/arches-her

I’ve made a staging table with my site data in the Arches Postgres database. In this staging table, my ID columns are of type UUID and the string literals are stored in JSONB columns with JSON that looks like:

{
    "en": {
        "value": "44FX1000 from https://opencontext.org/subjects/daac6033-d4ef-4720-514e-3c49f82938e9", 
        "direction": "ltr"
    }
}
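For reference, here is a minimal sketch of what such a staging table might look like, using column names that appear later in this thread (the actual schema is an assumption, abbreviated for illustration):

-- hypothetical staging table layout; column list abbreviated and names illustrative
CREATE TABLE staging.oc_sites (
    resourceinstanceid uuid,
    transactionid uuid,
    col_52_smithsonian_trinomial_identifier jsonb
    -- ... additional flat attribute columns (GeoJSON, etc.) ...
);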

I seem to have structured the JSON properly, since SQL INSERT queries on small numbers of rows run slowly but work, and I can see my data in the Arches user interface (especially after I force a reindex).

What works:

  1. I can import resource instances, all 46K just fine.
  2. I can import data to populate descriptive data (associated with these resource instances) for different branches (a string attribute and a GeoJSON attribute that gets converted into a geometry), but ONLY if I limit the insert to something like a hundred records or so.

What doesn’t work:

  • I cannot populate the descriptions of the resource instances for more than 100 or so records without getting a database connection error after something like 10 minutes. 10 minutes is practically forever anyway, so I must be doing something fundamentally wrong.

I’ve tried multiple times, even with variations where I included my own minted UUIDs for the tileid column in the SQL INSERT.

ekansa added the Type: Bug label Oct 24, 2022
ekansa changed the title from "Arches v7: Relational Views and SQL ETL Methods" to "Arches v7: Relational Views and SQL ETL Methods Bug" Oct 24, 2022
ekansa (Author) commented Oct 24, 2022

PS. Here's a copy of a non-sensitive dataset that I'm experimenting with. It originates here: Open Context: Virginia Site Files. I used the Python Pandas library to dump the data into a CSV; Pandas exported the Python dictionaries in many columns as their string representations. Let me know if this works; I can do some pre-processing to convert these dictionaries into JSON-formatted strings prior to export as CSV if that helps.

https://drive.google.com/file/d/1a_eAOqK9G8NqiJJlZD7FJO6sWDRKGe64/view?usp=sharing

robgaston (Member) commented Nov 4, 2022

@ekansa I was able to load a resource instance and a single tile for each row from staged data using the following SQL (where I loaded that CSV to a table called staging.sites) in about 3 minutes:

INSERT INTO heritage_asset.instances(
        resourceinstanceid,
        transactionid
    )
select distinct resourceinstanceid,
    transactionid
from staging.sites;

INSERT INTO heritage_asset.heritage_asset_names (
        asset_name_use_type,
        asset_name_currency,
        asset_name_type,
        asset_name,
        asset_name_metatype,
        asset_name_use_metatype,
        asset_name_currency_metatype,
        resourceinstanceid,
        nodegroupid,
        transactionid
    )
SELECT '2df285fa-9cf2-45e7-bc05-a67b7d7ddc2f',
    'c2051d53-40e7-4a2d-a4b4-02a31da37fd1',
    '8f985e91-6cd2-4d70-a03c-b49a50d09a3b',
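    -- REPLACE converts the single-quoted Python dict strings from the CSV into JSON-style double quotes before casting to jsonb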
    REPLACE(col_52_smithsonian_trinomial_identifier, '''', '"')::jsonb,
    'a0e096e2-f5ae-4579-950d-3040714713b4',
    '04a4c4d5-5a5e-4018-93aa-65abaa53fb53',
    '5a88136a-bf3a-4b48-a830-a7f42000dd24',
    resourceinstanceid,
    '676d47f9-9c1c-11ea-9aa0-f875a44e0e11',
    transactionid
from staging.sites;

I did this via VSCode SQL Tools; it was a little faster with psql at a command line.

This leads me to believe that something else is at play here. I was using Arches 7.2.0 on an Apple M1 Pro with 16 GB of memory. I wonder how this compares to your system and whether something about your local environment may be the issue...

bferguso (Contributor) commented Nov 4, 2022

@ekansa - sorry, jumping in late here so I may be missing something.

Just wondering if you've turned off the validation triggers when you're doing the inserts (I'm not sure if this has made it into the latest docs). I've found that with the validation triggers on it can slow things down significantly.

If you call __arches_prepare_bulk_load(); before running the insert, it disables the cardinality validation triggers that fire every time a row is inserted into the tiles table.

When you've done your insert(s) and/or updates, call __arches_complete_bulk_load();, which re-enables the triggers and checks for cardinality violations.

For context - I have a resource model with ~200k entries and it only takes a couple of minutes to run.
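A minimal sketch of that pattern, reusing the staging table from earlier in this thread (and assuming the relational views already exist for the model):

call __arches_prepare_bulk_load();   -- disable the per-row cardinality validation triggers on tiles

INSERT INTO heritage_asset.instances (resourceinstanceid, transactionid)
SELECT DISTINCT resourceinstanceid, transactionid
FROM staging.sites;

-- ... bulk INSERTs into the other heritage_asset views ...

call __arches_complete_bulk_load();  -- re-enable the triggers and check for cardinality violations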

ekansa (Author) commented Nov 7, 2022

Thanks! @robgaston @bferguso.

This works very fast, as expected.


SELECT __arches_create_resource_model_views('076f9381-7b00-11e9-8d6b-80000b44d1d9');
INSERT INTO heritage_asset.instances(
        resourceinstanceid,
        transactionid
    )
select distinct resourceinstanceid,
    transactionid
from oc_sites;

I have my data loaded appropriately with JSONB columns for the columns with (potentially) multilingual text, although I'm just working with English now. I can do queries against the staged data, and the JSON seems to work fine. Here's an example:

SELECT oc_sites.resourceinstanceid, oc_sites.col_52_smithsonian_trinomial_identifier,
oc_sites.col_52_smithsonian_trinomial_identifier::json->'en'->'value' as en_value
FROM oc_sites
LIMIT 10;

Here are the results (illustrating how the data in the JSONB field col_52_smithsonian_trinomial_identifier seems to behave as expected):

10 rows returned:

 #  | resourceinstanceid (uuid)            | col_52_smithsonian_trinomial_identifier (jsonb) | en_value (json)
----|--------------------------------------|--------------------------------------------------|----------------
 1  | 6d44220f-3d99-4241-c5ac-e46a4f3a57a1 | {"en":{"value":"44FX322","direction":"ltr"}}      | "44FX322"
 2  | 2535590d-3935-4727-18f9-79b5c9b9a3d2 | {"en":{"value":"44FX320","direction":"ltr"}}      | "44FX320"
 3  | 902ab72a-b033-4070-287b-5df49c258fc3 | {"en":{"value":"44FX325","direction":"ltr"}}      | "44FX325"
 4  | e013fcf9-ae89-4469-bfd1-54e9f8de33e5 | {"en":{"value":"44FX324","direction":"ltr"}}      | "44FX324"
 5  | 0e979294-aa54-4593-0378-ed4370d85717 | {"en":{"value":"44FX323","direction":"ltr"}}      | "44FX323"
 6  | 90132ea9-00cc-4bc8-45e5-c5243b405b95 | {"en":{"value":"44FX329","direction":"ltr"}}      | "44FX329"
 7  | afe61dd1-08da-4c91-896b-a222df2a3e4d | {"en":{"value":"44FX335","direction":"ltr"}}      | "44FX335"
 8  | d1b06158-da2e-491f-161e-a0a701e19a89 | {"en":{"value":"44FX337","direction":"ltr"}}      | "44FX337"
 9  | 22c7c5f3-627c-46f6-baa2-047ddedad3ea | {"en":{"value":"44FX336","direction":"ltr"}}      | "44FX336"
 10 | fb5f24f1-c78f-45ea-5bc1-ba37416b2393 | {"en":{"value":"44FX333","direction":"ltr"}}      | "44FX333"

Now when I try this, the database just hangs.

call __arches_prepare_bulk_load();

INSERT INTO heritage_asset.heritage_asset_names (
        asset_name_use_type,
        asset_name_currency,
        asset_name_type,
        asset_name,
        asset_name_metatype,
        asset_name_use_metatype,
        asset_name_currency_metatype,
        resourceinstanceid,
        nodegroupid,
        transactionid
    )
SELECT '2df285fa-9cf2-45e7-bc05-a67b7d7ddc2f',
    'c2051d53-40e7-4a2d-a4b4-02a31da37fd1',
    '8f985e91-6cd2-4d70-a03c-b49a50d09a3b',
    oc_sites.col_52_smithsonian_trinomial_identifier,
    'a0e096e2-f5ae-4579-950d-3040714713b4',
    '04a4c4d5-5a5e-4018-93aa-65abaa53fb53',
    '5a88136a-bf3a-4b48-a830-a7f42000dd24',
    oc_sites.resourceinstanceid,
    '676d47f9-9c1c-11ea-9aa0-f875a44e0e11',
    oc_sites.transactionid
from public.oc_sites;

call __arches_complete_bulk_load();

ekansa (Author) commented Nov 7, 2022

@robgaston @bferguso I'm just spitballing here, but I wonder if there's something about the JSONB fields that makes this go so slowly. I wish I had an error message or some other view into what is happening internally. Strangely, while my long insert query runs, I can query the heritage_asset.heritage_asset_names table to check on it, and it always comes up empty.

SELECT * FROM heritage_asset.heritage_asset_names LIMIT 1;

And I get an empty table.

ekansa (Author) commented Nov 8, 2022

So I'm doing an experiment to try to break this up into smaller chunks.

 call __arches_prepare_bulk_load();

        INSERT INTO heritage_asset.heritage_asset_names (
                asset_name_use_type,
                asset_name_currency,
                asset_name_type,
                asset_name,
                asset_name_metatype,
                asset_name_use_metatype,
                asset_name_currency_metatype,
                resourceinstanceid,
                nodegroupid,
                transactionid
            )
        SELECT '2df285fa-9cf2-45e7-bc05-a67b7d7ddc2f',
            'c2051d53-40e7-4a2d-a4b4-02a31da37fd1',
            '8f985e91-6cd2-4d70-a03c-b49a50d09a3b',
            oc_sites.col_52_smithsonian_trinomial_identifier,
            'a0e096e2-f5ae-4579-950d-3040714713b4',
            '04a4c4d5-5a5e-4018-93aa-65abaa53fb53',
            '5a88136a-bf3a-4b48-a830-a7f42000dd24',
            oc_sites.resourceinstanceid,
            '676d47f9-9c1c-11ea-9aa0-f875a44e0e11',
            oc_sites.transactionid
        from public.oc_sites
        ORDER BY oc_sites.resourceinstanceid
        LIMIT 100
        OFFSET 0;

        call __arches_complete_bulk_load();

I'm now getting a more informative error message, but I don't know how to handle it:

(psycopg2.errors.ObjectInUse) cannot ALTER TABLE "tiles" because it has pending trigger events
CONTEXT:  SQL statement "alter table tiles enable trigger __arches_check_excess_tiles_trigger"
PL/pgSQL function __arches_complete_bulk_load() line 5 at SQL statement
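One possibility (an assumption on my part, not verified): since the error comes from psycopg2, the whole script may be running in a single transaction, and trigger events queued by the INSERT could still be pending when __arches_complete_bulk_load() tries to re-enable the trigger. A sketch of committing the load before the cleanup call:

BEGIN;
call __arches_prepare_bulk_load();
-- ... the INSERT into heritage_asset.heritage_asset_names shown above ...
COMMIT;  -- commit so no trigger events are pending in this session

call __arches_complete_bulk_load();  -- runs in its own transaction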

ekansa (Author) commented Nov 9, 2022

For context, here is my setup for playing with Arches (I got this working to test internationalization features back in August):

(1) I'm running Ubuntu 20.04 in the Windows Subsystem for Linux (32 GB of RAM)
(2) I'm running the arches-dependency-containers Arches 7 Docker containers, main branch (up to date)
(3) Arches is set to branch stable/7.2.0 (also up to date)
(4) I'm using arches-her, branch dev 7.1.x

ekansa (Author) commented Nov 9, 2022

I’ve isolated part of the problem. It seems that the functions

call __arches_prepare_bulk_load();
and

call __arches_complete_bulk_load();
are causing problems. If I run the SQL insert without calling those functions, the insert runs successfully, if somewhat slowly. It takes around 15 seconds for the SQL to execute (100 records!). But at least it works?

robgaston (Member) commented Feb 2, 2023

@ekansa revisiting this for another project I was able to write some SQL scripts (which in turn do use those __arches_prepare_bulk_load and __arches_complete_bulk_load functions) to execute a modest bulk load that inserts 500 resource instances and inserts/updates around 2000 tiles. This particular system has around 200k resource instances and 1.5M tiles loaded. I've executed this on multiple environments and it always completes in a few seconds. What I haven't tried is running it using the docker containers that you described...

I wonder if the issue here might somehow lie with those docker containers and how they are running PostgreSQL... what do you think? have you tried your scripts in any environment other than the docker one you mentioned?

ekansa (Author) commented Feb 2, 2023

@robgaston Thanks for looking into this!

Yes, I made a dump of the database and ran the queries on Postgres running on "bare metal" (not Docker). I coordinated with @ryan86 and sent him a dump of the Postgres database itself (it is here by the way: https://drive.google.com/file/d/1jP1xwdpP4WFZKsIkkKqFxprxn5tmzvS1/view). He also verified unworkably slow performance (but I think his Postgres instance also ran in Docker).

I suspect the database itself may be the problem. I'm wondering whether the Docker setup that I described (I first used that setup to help test internationalization features of v7 way back in September) led to the creation of a somehow broken Postgres database? Or perhaps there is some subtle problem with the HER package in v7 (I'm trying to load data into the HER resource models/graphs)? I wonder if @aarongundel has run into similar issues?

I should note that Arches ran perfectly well on my localhost; the only problem I encountered was with the ETL.

robgaston (Member)

@ekansa I restored from your db backup and was able to load into the views as per your script above from your staged Virginia sites CSV (the entire file, no limit applied) in about 3 minutes. So it would appear that something environmental (rather than native to the DB itself) is at play here.

ekansa (Author) commented Feb 3, 2023

@robgaston Just to clarify, you ran the query to insert into the heritage_asset.heritage_asset_names view also, not just the heritage_asset.instances view? The heritage_asset.instances inserts have always been very fast for me, but no matter what environment I use (Ubuntu 20 in WSL, Ubuntu 20 on a remote server), with or without Docker, the heritage_asset.heritage_asset_names query runs too slowly.

robgaston (Member) commented Feb 3, 2023

@ekansa Yes. I inserted one instance (via heritage_asset.instances) and one tile (via heritage_asset.heritage_asset_names) per row in the staged CSV file. All told, these two queries ran in under 4 minutes (not fast, IMHO, but acceptable and not as egregious as what you appear to be experiencing).

ekansa (Author) commented Feb 3, 2023

OK, thanks @robgaston! That sounds like a pretty convincing case that something is very wrong with my environment(s). I just spun up another Ubuntu 20.04 server, with PostgreSQL 14.6 (and all the PostGIS extensions), and am having the same query trouble.

I guess it's back to the drawing board for me.

robgaston (Member)

@ekansa Sounds frustrating. I'm happy to schedule a time to show you what I'm working with here if that would be helpful. You know where to find me 😄

ekansa (Author) commented Feb 3, 2023

@robgaston Just spitballing here... Are there external (non-Postgres) dependencies for the SQL insert ETL? Does Elasticsearch or RabbitMQ need to be up and connected to Postgres? I'm asking because it might explain the behavior in both Docker and stand-alone, bare-metal (outside of Docker) installs of Postgres.

robgaston (Member)

@ekansa no, there are no dependencies for this functionality outside of PostgreSQL, so I don't think that would be a factor here.

ekansa (Author) commented Feb 3, 2023

@robgaston Thank you for clarifying that!

The only other thing I can think of is that somehow I'm creating and restoring a database that is subtly wrong, and that this has nothing to do with Docker or my environment, since I've now tried this with Postgres 14, 13, and 12 on different cloud virtual machines and run into exactly the same problem.

The database was originally created with the Arches setup_db command. In my experiments with Postgres in different environments, I follow the model from the Arches setup_db command and I create the DB with SQL as:

CREATE DATABASE arches_her
    WITH OWNER = postgres
    ENCODING = 'UTF8'
    CONNECTION LIMIT=-1
    TEMPLATE = template_postgis;

Then use pg_restore to load my dump data:
pg_restore -v -d arches_her -U postgres -h localhost -Fc arches_her_ekansa.dump

Go into the restored database:
psql -U postgres -d arches_her

Happily make the relational views for the resource model:
SELECT __arches_create_resource_model_views('076f9381-7b00-11e9-8d6b-80000b44d1d9');

Quickly make resource instances:

INSERT INTO heritage_asset.instances(
        resourceinstanceid,
        transactionid
    )
select distinct resourceinstanceid,
    transactionid
from staging.oc_sites;

Call the function to prep a bulk load:
call __arches_prepare_bulk_load();

Then the next query, inserting into heritage_asset.heritage_asset_names, times out. It did complete once, but it required 1.5 hours.

robgaston (Member)

I'm starting to suspect that this is OS related...

@bferguso What operating system are you using? I have seen this work on macOS, on Windows, and with PostgreSQL running in AWS RDS, but I have seen performance issues like @ekansa describes above on Alpine and Debian (running in Docker containers), so any additional information from users about their operating system and the success or failure of such SQL scripts would likely be useful. Thanks!

ekansa (Author) commented Feb 17, 2023

As @robgaston correctly noted, all my attempts to run these SQL scripts were ultimately on Ubuntu and Debian (either in Docker or "bare metal" on a cloud computing instance). It seems that Alpine is also a slow environment for this.

@bferguso, we will all be grateful for any new piece of the puzzle that you can bring (!). I suspect lots of Arches deployments will use databases running in these common environments (esp. with Docker).

robgaston (Member) commented Feb 17, 2023

As a data point, the insert trigger for names took about 50x longer on Docker/Alpine than it did on my laptop running macOS. Here's an analysis of a query to insert one name on my laptop...

QUERY PLAN
Insert on heritage_asset_names  (cost=0.00..0.22 rows=0 width=0) (actual time=35.655..35.656 rows=0 loops=1)
  ->  Subquery Scan on "*SELECT*"  (cost=0.00..0.22 rows=1 width=192) (actual time=0.034..0.037 rows=1 loops=1)
        ->  Limit  (cost=0.00..0.21 rows=1 width=288) (actual time=0.032..0.033 rows=1 loops=1)
              ->  Seq Scan on sites  (cost=0.00..8959.30 rows=42474 width=288) (actual time=0.031..0.031 rows=1 loops=1)
Planning Time: 0.363 ms
Trigger heritage_asset_names_insert: time=33.799 calls=1
Execution Time: 35.708 ms

...and here's the analysis of the same query on Docker/Alpine:

QUERY PLAN
Insert on heritage_asset_names  (cost=0.00..0.22 rows=0 width=0) (actual time=1694.121..1694.123 rows=0 loops=1)
  ->  Subquery Scan on "*SELECT*"  (cost=0.00..0.22 rows=1 width=192) (actual time=2.208..2.360 rows=1 loops=1)
        ->  Limit  (cost=0.00..0.21 rows=1 width=288) (actual time=2.135..2.285 rows=1 loops=1)
              ->  Seq Scan on sites  (cost=0.00..8812.30 rows=42474 width=288) (actual time=2.008..2.008 rows=1 loops=1)
Planning Time: 11.129 ms
Trigger heritage_asset_names_insert: time=1674.805 calls=1
Execution Time: 1696.505 ms

I think that auto_explain needs to be used to see an analysis of the trigger itself, so that should be a next step...
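A minimal sketch of enabling auto_explain for a session so that the statements executed inside the trigger function are analyzed as well (these are standard auto_explain settings; LOAD and the SETs require superuser, and the plans are written to the PostgreSQL log):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log a plan for every statement
SET auto_explain.log_analyze = on;            -- include actual timings
SET auto_explain.log_nested_statements = on;  -- include statements run inside triggers/functions
-- then re-run the single-row INSERT and read the trigger's plans from the server log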

ekansa (Author) commented Feb 17, 2023

@robgaston Wow! What a difference an OS makes!

bferguso (Contributor) commented Jun 21, 2023

@ekansa, @robgaston - I'm seeing the same behaviour in my Docker install of our apps.
OK - I've tracked it down to this line in the trigger:
https://github.com/archesproject/arches/blob/e888e44ccce9edcaaf926b95be91ca70e3f47a8d/arches/app/models/migrations/8403_relational_data_model.py#LL100C1-L100C1
If that line is commented out, the insert performs much better, more in line with a non-Docker install. I'll have a look at the refresh_tile_geojson_geometries function to see if there is something we can do about it; if not, perhaps we should look at disabling that call for each row insert in call __arches_prepare_bulk_load(); and then running refresh_geojson_geometries() as part of the __arches_complete_bulk_load() call.

@robgaston - Please weigh in on the above strategy - if you're OK with it I'll see about putting a PR together.
I'm heading for the airport and may be out of communication for a couple of weeks, but I may get it done in my off-time.

robgaston (Member) commented Jun 21, 2023

@bferguso Sorry, I think we should have linked in this ticket to the Discourse thread where @ekansa and I arrived at this same workaround.

This is also documented in the warning under "Example Usage" in the docs.

We do, of course, want to rework these functions in a future version so that this workaround is not necessary (hence this ticket remaining open at this time). A PR on this would be welcome!
