How to import production (larger than toy size) data #4120
-
The import process uses an in-memory file uploaded by the client. The largest such CSV file in the case I am working on is 100 MB. That is easily handled in the JVM after increasing the max memory option. However, the web client is unable to do such large uploads over HTTP. My immediate workaround is to patch the import process to use a stream from a fixed file local to the ADempiere server; I can rsync the data being worked on. As a simpler, more general solution, it should work like the EDI import, where you can upload an EDI document, fetch it from an FTP URL (and eventually HTTP), or enter the full path to a copy of the document on the ADempiere server.
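A minimal sketch of that streaming workaround, assuming a fixed server-side path; the path and class name are illustrative, not existing ADempiere API. Reading line by line keeps heap use constant no matter how large the file is.

```java
// Sketch: stream a server-local CSV instead of buffering an uploaded
// copy in the JVM heap. Path and class name are hypothetical.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalFileImportSource {
    public static void main(String[] args) throws IOException {
        // File on the ADempiere server, e.g. synced there with rsync.
        Path csv = Path.of("/var/lib/adempiere/import/i_product.csv");
        long rows = 0;
        // BufferedReader yields one line at a time, so memory use does not
        // grow with file size the way an in-memory upload does.
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows++; // hand each line to the import loader here
            }
        }
        System.out.println("streamed " + rows + " rows");
    }
}
```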
-
@sdgathman You need to adjust the ZK configuration: https://forum.zkoss.org/question/39230/how-to-set-the-max-upload-size-for-fileupload/ Or use the Swing interface. Kind regards,
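For reference, the setting from that ZK forum answer lives in WEB-INF/zk.xml; the value is in kilobytes (ZK's default is 5120) and -1 removes the limit. The 200000 here is only an example.

```xml
<zk>
    <system-config>
        <!-- Maximum upload size in KB; default is 5120 (5 MB). Use -1 for unlimited. -->
        <max-upload-size>200000</max-upload-size>
    </system-config>
</zk>
```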
-
Hello @sdgathman, the main problem is that the file should be split and each part processed separately. That would be a nice change for the import loader. The other problem is that if you have many records to import, the import process is very slow, because it uses the Persistence Object and all the business logic of ADempiere. I personally think the way to resolve that is to process in batches instead of all at once, but changing the whole import process is a big job. Best regards
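To illustrate the batching idea, a hedged JDBC sketch: committing once per batch avoids the per-record round trips that one-PO-at-a-time saves incur. The connection details are placeholders, and the i_product column list is abbreviated (real rows also need i_product_id and the standard AD audit columns), so this is a shape, not a drop-in loader.

```java
// Sketch: insert staged rows in JDBC batches instead of saving one
// persistence object at a time. Connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchedStagingLoader {
    public static void main(String[] args) throws SQLException {
        final int BATCH_SIZE = 1000;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/adempiere", "adempiere", "adempiere")) {
            conn.setAutoCommit(false); // commit once per batch, not per row
            // Column list abbreviated for illustration: real i_product rows
            // also need i_product_id and the standard AD audit columns.
            String sql = "INSERT INTO i_product (value, name, ad_client_id) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 100_000; i++) {
                    ps.setString(1, "SKU-" + i);
                    ps.setString(2, "Product " + i);
                    ps.setInt(3, 11); // e.g. GardenWorld
                    ps.addBatch();
                    if ((i + 1) % BATCH_SIZE == 0) {
                        ps.executeBatch(); // one round trip per 1000 rows
                        conn.commit();
                    }
                }
                ps.executeBatch(); // flush the final partial batch
                conn.commit();
            }
        }
    }
}
```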
-
@sdgathman Hi, if you want to improve performance and can use multiple CPUs, you can refactor the product import to work in parallel, partitioning the records by their order. Kind regards
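A rough sketch of that parallel shape, assuming the staged records can be split into independent partitions; importPartition is a hypothetical stand-in for the per-chunk import logic, not an ADempiere method.

```java
// Sketch: spread an import across CPUs with a fixed thread pool.
// importPartition is a placeholder for the real per-chunk import logic.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelImport {
    static void importPartition(List<Integer> recordIds) {
        // real code would load, validate, and save these staged records
        System.out.println(Thread.currentThread().getName()
                + " importing " + recordIds.size() + " records");
    }

    public static void main(String[] args) throws InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus);

        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) ids.add(i); // staged record ids

        // Slice the ordered id list into one contiguous chunk per CPU.
        int chunk = Math.max(1, ids.size() / cpus);
        for (int start = 0; start < ids.size(); start += chunk) {
            List<Integer> part = ids.subList(start, Math.min(ids.size(), start + chunk));
            pool.submit(() -> importPartition(part));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```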
-
Can I just use PostgreSQL queries to populate the I_Product table? Isn't that a staging table that is then transferred to M_Product and others? I guess there is some verification logic for I_Product as well.
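If direct SQL turns out to be acceptable, the staging insert would look roughly like this; the column list is trimmed for illustration, and a real row also needs the AD audit columns and an i_product_id from your installation's id scheme (stock ADempiere draws ids from AD_Sequence).

```sql
-- Sketch: stage a row straight into i_product with SQL. Column list
-- abbreviated; real rows also need ad_org_id, isactive, and the
-- created/createdby/updated/updatedby audit columns.
INSERT INTO i_product (i_product_id, ad_client_id, value, name, i_isimported)
VALUES (1000000, 11, 'SKU-0001', 'Example product', 'N');

-- The import process then validates staged rows and moves them to
-- m_product; i_isimported and i_errormsg record the outcome per row.
```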
-
I did not play with the imports of ADempiere; I did some migration index improvements.
I am not a functional user, so if someone is able to replicate the issue you are having, it would be great to take a look and fix the missing indexes and/or the logic of the imports.
On 7/08/2023, at 9:52 PM, Stuart D. Gathman wrote:
@piracio I set ad_client_id to 0 for all but 46 records in i_product (so that only 46 products are imported instead of 2 million).
It has been 45 minutes. I can see which step it is on with "select pid,query from pg_stat_activity;"
I intended to work up to larger batches and time the imports. But this is ridiculous for just 46 records. I suspect part of the problem is that i_product needs some indexes. I think an index on ad_client_id may be important when using ad_client_id to select records to be imported. Or else, delete records from i_product and add them back in batches.
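If the missing-index theory holds, a quick test would be something like the following; the index names are arbitrary, and the partial variant assumes most staged rows are still pending import.

```sql
-- Hypothesis from the thread: selecting import candidates by client
-- scans i_product without an index. The index name is arbitrary.
CREATE INDEX i_product_client ON i_product (ad_client_id);

-- Partial variant, assuming most staged rows are still flagged 'N':
CREATE INDEX i_product_pending ON i_product (ad_client_id)
    WHERE i_isimported = 'N';
```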
-
Very bad query. I think it is better if you change the …