Arches v7: Relational Views and SQL ETL Methods Bug #9049
PS. Here's a copy of a non-sensitive dataset that I'm experimenting with. It originates here: Open Context: Virginia Site Files. I used the Python pandas library to dump the data into a CSV. Pandas handled the export of the Python dictionaries in many columns into their string representations. Let me know if this works; I can do some pre-processing to make sure these dictionaries are first converted into JSON-formatted strings prior to export as a CSV, if that helps. https://drive.google.com/file/d/1a_eAOqK9G8NqiJJlZD7FJO6sWDRKGe64/view?usp=sharing
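A minimal sketch of the pre-processing idea above (the `label` column name and sample value are hypothetical): parse pandas' stringified Python dicts with `ast.literal_eval` and re-serialize them as JSON strings before writing the CSV.

```python
import ast
import json

import pandas as pd

def dict_str_to_json(value):
    """Parse a stringified Python dict (single-quoted) and re-serialize it as JSON."""
    if pd.isna(value):
        return value
    return json.dumps(ast.literal_eval(value))

# "label" is a hypothetical column holding values like "{'en': 'Site 44AX0001'}"
df = pd.DataFrame({"label": ["{'en': 'Site 44AX0001'}"]})
df["label"] = df["label"].apply(dict_str_to_json)
print(df["label"].iloc[0])  # {"en": "Site 44AX0001"}
```

The resulting column values are then valid JSON and can be cast straight to `jsonb` on the Postgres side without the quote-replacement workaround.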
@ekansa I was able to load a resource instance and a single tile for each row from staged data using the following SQL (where I loaded that CSV to a table called `staging.sites`):

```sql
INSERT INTO heritage_asset.instances (
    resourceinstanceid,
    transactionid
)
SELECT DISTINCT resourceinstanceid,
       transactionid
FROM staging.sites;
```
```sql
INSERT INTO heritage_asset.heritage_asset_names (
    asset_name_use_type,
    asset_name_currency,
    asset_name_type,
    asset_name,
    asset_name_metatype,
    asset_name_use_metatype,
    asset_name_currency_metatype,
    resourceinstanceid,
    nodegroupid,
    transactionid
)
SELECT '2df285fa-9cf2-45e7-bc05-a67b7d7ddc2f',
       'c2051d53-40e7-4a2d-a4b4-02a31da37fd1',
       '8f985e91-6cd2-4d70-a03c-b49a50d09a3b',
       REPLACE(col_52_smithsonian_trinomial_identifier, '''', '"')::jsonb,
       'a0e096e2-f5ae-4579-950d-3040714713b4',
       '04a4c4d5-5a5e-4018-93aa-65abaa53fb53',
       '5a88136a-bf3a-4b48-a830-a7f42000dd24',
       resourceinstanceid,
       '676d47f9-9c1c-11ea-9aa0-f875a44e0e11',
       transactionid
FROM staging.sites;
```

I did this via the VSCode SQLTools extension; it was a little faster with psql at a command line. This leads me to believe that something else is at play here. I was using Arches 7.2.0 on an Apple M1 Pro with 16 GB of memory. I wonder how this compares to your system and whether something about your local environment may be the issue...
@ekansa - sorry, jumping in late here so I may be missing something. Just wondering if you've turned off the validation triggers when you're doing the inserts (I'm not sure if this has made it into the latest docs). I've found that with the validation triggers on, things can slow down significantly. There is a call to make before running the insert, and a matching call to make when your insert(s) and/or updates are done. For context: I have a resource model with ~200k entries and it only takes a couple of minutes to run.
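The two calls referenced above were dropped from this copy of the thread; per the Arches v7 relational-views documentation, the pattern looks like this (treat the function names as coming from the docs, not from this comment):

```sql
-- Before the bulk insert: disable the per-row validation/refresh triggers
SELECT __arches_prepare_bulk_load();

-- ... run the INSERT / UPDATE statements here ...

-- After the load: re-enable the triggers and refresh derived data
SELECT __arches_complete_bulk_load();
```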
Thanks! @robgaston @bferguso. This works very fast, as expected.
I have my data loaded appropriately with JSONB columns for the columns with (potentially) multilingual text, although I'm just working with English now. I can do queries against the staged data, and the JSON seems to work fine. Here's an example:
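The example itself did not survive in this copy of the thread; a hedged sketch of the kind of query meant here, reusing the column name from the insert script elsewhere in the thread and assuming the staged JSONB has a shape like `{"en": "..."}`:

```sql
-- Inspect a few staged rows; assumes the column is already JSONB
SELECT resourceinstanceid,
       col_52_smithsonian_trinomial_identifier          AS name_json,
       col_52_smithsonian_trinomial_identifier ->> 'en' AS name_en
FROM staging.sites
LIMIT 5;
```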
Here are the results (illustrating how the data in the JSONB field is structured):
Now when I try this, the database just hangs.
@robgaston @bferguso I'm just spitballing here, but I wonder if there's something going on that makes the JSONB fields so slow. I wish I had some error message or other view as to what is happening internally. Strangely, while my long insert query runs, I can run a query to check on the target table, and I get an empty table.
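For "some other view as to what is happening internally", a second psql session can watch the running insert with standard PostgreSQL introspection (a generic sketch, not from the original thread):

```sql
-- From a second connection: what is the insert backend doing, and is it waiting?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle';
```

An empty result from the target table while the insert runs is expected in any case: the `INSERT ... SELECT` is a single uncommitted transaction, so its rows are not visible to other sessions until it commits.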
So I'm doing an experiment to try to break this up into smaller chunks.
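One way to sketch that chunking (batch size is arbitrary; `OFFSET` paging assumes a stable `ORDER BY`):

```sql
-- Hypothetical batching: insert 1000 rows at a time,
-- repeating with OFFSET 1000, 2000, ... until exhausted
INSERT INTO heritage_asset.instances (resourceinstanceid, transactionid)
SELECT resourceinstanceid, transactionid
FROM (
    SELECT DISTINCT resourceinstanceid, transactionid
    FROM staging.sites
    ORDER BY resourceinstanceid
    LIMIT 1000 OFFSET 0
) AS batch;
```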
I'm now getting a more informative error message, but I don't know how to handle it:
For context, here is my setup for playing with Arches (I got this working to test internationalization features back in August): (1) I'm running Ubuntu 20.04 in the Windows Subsystem for Linux (32 GB of RAM)
I’ve isolated part of the problem. It seems that the functions
@ekansa revisiting this for another project, I was able to write some SQL scripts (which in turn use those functions). I wonder if the issue here might somehow lie with those Docker containers and how they are running PostgreSQL... what do you think? Have you tried your scripts in any environment other than the Docker one you mentioned?
@robgaston Thanks for looking into this! Yes, I made a dump of the database and ran the queries on Postgres running on "bare metal" (not Docker). I coordinated with @ryan86 and sent him a dump of the Postgres database itself (it is here, by the way: https://drive.google.com/file/d/1jP1xwdpP4WFZKsIkkKqFxprxn5tmzvS1/view). He also verified unworkably slow performance (but I think his Postgres instance also ran in Docker). I suspect the database itself may be the problem. I'm wondering if the Docker setup that I described (I first used that setup to help test internationalization features of v7 way back in September) led to the creation of a somehow broken database in Postgres? Or perhaps there is some subtle problem with the HER package in v7 (I'm trying to load data into the HER resource models/graphs)? I wonder if @aarongundel has run into similar issues? I should note that Arches ran perfectly well on my localhost; the only problem I encountered was with the ETL.
@ekansa I restored from your db backup and was able to load into the views as per your script above from your staged Virginia sites CSV (the entire file, no limit applied) in about 3 minutes. So it would appear that something environmental (rather than native to the DB itself) is at play here.
@robgaston Just to clarify, you ran the query to insert into the `heritage_asset_names` view?
@ekansa yes. I inserted one instance (via the script above).
OK, thanks @robgaston! That sounds like a pretty convincing case that I have something very wrong with my environment(s). I just spun up another Ubuntu 20.04 server with PostgreSQL 14.6 (and all the PostGIS stuff), and am having the same query trouble. I guess it's back to the drawing board for me.
@ekansa sounds frustrating. I'm happy to schedule a time to show you what I'm working with here if that would be helpful. You know where to find me 😄
@robgaston Just spitballing here... Are there external (non-Postgres) dependencies for the SQL insert ETL? Does Elasticsearch or RabbitMQ need to be up and connected to Postgres? I'm asking because it might explain the behavior both in Docker and on stand-alone "bare metal" (outside of Docker) installs of Postgres.
@ekansa no, there are no dependencies for this functionality outside of PostgreSQL, so I don't think that would be a factor here.
@robgaston Thank you for clarifying that! The only other thing I can think of is that somehow I'm creating and restoring a database that is in some way subtly wrong, and that this has nothing to do with Docker or my environment, since I've now tried running this with Postgres 14, 13, and 12 on different cloud virtual machines and I run into exactly the same problem. The database was originally created with the Arches
Then:

1. Use `pg_restore` to load my dump data.
2. Go into the restored database.
3. Happily make the relational view for the resource model.
4. Quickly make resource instances.
5. Call the function to prep a bulk load.

Then the next query, to insert into the names view, is where it hangs.
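The commands for the restore steps did not survive in this copy of the thread; a generic sketch with placeholder database and file names:

```shell
# Placeholder names; adjust to your environment
createdb -U postgres arches_her
pg_restore -U postgres -d arches_her --no-owner arches_her.dump
# Open a session in the restored database; the view-creation,
# instance-insert, and bulk-load SQL would then run from here
psql -U postgres -d arches_her
```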
I'm starting to suspect that this is OS-related... @bferguso what operating system are you using? I have seen this work on macOS, Windows, and with PostgreSQL running in AWS RDS, but I have seen the performance issues @ekansa mentions above on Alpine and Debian (running in Docker containers), so any additional information from users about their operating system and the success/failure of such SQL scripts would likely be useful. Thanks!
As @robgaston correctly noted, all my attempts to run these SQL scripts were ultimately on Ubuntu and Debian (either in Docker or "bare metal" on a cloud computing instance). It seems that Alpine is also a slow environment for this. @bferguso, we will all be grateful for any new piece of the puzzle that you can bring (!). I suspect lots of Arches deployments will use databases running in these common environments (esp. with Docker).
As a data point, the insert trigger for names took about 50x longer on Docker/Alpine than it did on my laptop with macOS. Here's an analysis of a query to insert one name on my laptop...
...and here's the analysis of the same query on Docker/Alpine:
I think that auto_explain needs to be used to see an analysis of the trigger itself, so that should be a next step... |
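A sketch of that next step, using standard PostgreSQL `auto_explain` settings (generic PostgreSQL usage, not something from the original thread):

```sql
-- Session-level auto_explain: logs the plans of the nested statements
-- executed inside the trigger functions to the PostgreSQL server log.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = on;
SET auto_explain.log_nested_statements = on;
-- Now run the slow INSERT from this session and inspect the server log.
```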
@robgaston Wow! What a difference an OS makes! |
@ekansa, @robgaston - I'm seeing the same behaviour in my Docker install of our apps. @robgaston - please weigh in on the above strategy; if you're OK with it, I'll see about putting a PR together.
@bferguso sorry, I think we should have linked from this ticket to the Discourse thread where @ekansa and I arrived at this same workaround. This is also documented in the warning under "Example Usage" in the docs. We, of course, want to rework these functions in a future version so that this workaround is not necessary (thus this ticket remaining open at this time). A PR on this would be welcome!
I’m running into trouble with Arches v7 using the SQL import methods described here: Resource Import/Export — 7.1.0
I’ve got a test dataset of 46,000 archaeological sites with some simple flat attributes. I’ve been testing a SQL based import of this table into Historic England’s resource models / graphs / branches as provided here: GitHub - archesproject/arches-her
I’ve made a staging table with my site data in the Arches Postgress database. In this staging table, my ID columns are of type
UUID
and the string literals are in the JSONB datatype with JSON that looks like:And, I seem to have structured the JSON properly, since I can do
SQL INSERT
queries on small numbers of rows that go slowly but work, letting me see my data in the Arches user interface (esp. after I force a reindex).What works:
What doesn’t work:
I’ve tried multiple times, even with variations where I included my own minted UUIDs for the tileid column in the
SQL INSERT