Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a SQL query or function to change text_lang to 'en' #737

Closed
alawvt opened this issue Nov 9, 2020 · 9 comments
Closed

Create a SQL query or function to change text_lang to 'en' #737

alawvt opened this issue Nov 9, 2020 · 9 comments

Comments

@alawvt
Copy link
Contributor

alawvt commented Nov 9, 2020

Continues #476

In the metadata table, create a SQL query or function to change text_lang from NULL OR '' OR 'en_US' to 'en'.

@alawvt
Copy link
Contributor Author

alawvt commented Nov 9, 2020

Command to add to crontabs:

update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;

Test:

  • install 2020-10-10 db dump on latest commit in vt_6_x_dev branch
  • install Adminer 4.7.7
  • run Adminer, get text_lang value counts before
  • run SQL command
  • run Adminer, get text_lang value counts after

@alawvt
Copy link
Contributor Author

alawvt commented Nov 9, 2020

Before SQL command

Check count of items in metadatavalue table

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id

Result: count: 2,343,963

Check counts for values of text_lang, ex. 'en'

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id
where mv.text_lang='en';

Check for unique values of text_lang

SELECT distinct text_lang
FROM "metadatavalue"

Result: text_lang [count]

NULL    [310,527]
'' (empty string) [421,582]
en    [371927]
es    [56]
*    [9,053]
en_US    [1,229,990]
en US    [812]
de    [1]
fr    [5]

(There were a few other oddball text_lang values that I fixed manually.)

Run the SQL update (took about 10 minutes in LDE)

update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;

UPDATE 3362880

After SQL command

in metadatavalue table:

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id

[2,343,963 rows]
SELECT distinct text_lang
FROM "metadatavalue"

text_lang [count]

en    [2,343,891 rows]

Check for other values of text_lang

NULL    [0 rows]
'' (empty string) [0 rows]
(And the few other oddball text_lang values that I fixed manually.)

@alawvt
Copy link
Contributor Author

alawvt commented Nov 9, 2020

@keithgee, this seems to work ok in the LDE. I'd appreciate it if you and @soumikgh could develop a way to add it to the daily crontab. I've submitted this as Request RITM0051332.

@keithgee
Copy link
Member

keithgee commented Nov 11, 2020

Anne, thanks. I have a few questions.

  • It looks like the command will update the language code to "en" even for metadata that's in other languages if the language code wasn't previously set in the metadata. I don't know if we have any such items, but if we do, is that an acceptable side effect?

  • I think the command will also set language codes for values of metadata fields where the language isn't relevant, for example, fields that contain dates. Is this intended?

  • It looks like when you tested it, it took about ten minutes to run. Are you sure this something that you want to run every day?

@alawvt
Copy link
Contributor Author

alawvt commented Nov 11, 2020

@keithgee, than you very much for reviewing this. Yes, this will change the metadata language code to en for everything not explicitly another language code, even dates and URLs where language is not relevant. We have decided on this to simply our metadata and enable the DSpace REST Query Client.

As we discussed, please time the query for the first run. Then change the metadata for some items and time how long the query takes susequently.

@keithgee
Copy link
Member

keithgee commented Nov 23, 2020

@alawvt, you're correct that it runs much faster on subsequent runs. Here is a timing for the first run from a VM on my new-ish desktop with SSD:

UPDATE 3362880

real 2m8.454s
user 0m0.021s
sys 0m0.005s

Here is the timing on the second run:

UPDATE 2

real 0m1.381s
user 0m0.024s
sys 0m0.001s

I really thought it would still be slow after the first run!

@alawvt
Copy link
Contributor Author

alawvt commented Nov 23, 2020

@keithgee, thank you for testing this and the good news.

@keithgee
Copy link
Member

I've created a merge request in the vtlibans_dspace project with changes to deploy a script that gets the database configuration from DSpace and executes the SQL, and put instructions for testing in the merge details.
An alternative approach is the (pg_cron tool)[https://github.com/citusdata/pg_cron], but I think the script approach I used may work better for our current deployment practices.

@alawvt Can you or @kdweeks test, and can we close this issue and track from vtlibans_dspace from here on?

@alawvt
Copy link
Contributor Author

alawvt commented Nov 24, 2020

Continued in Add metadata language code updater.

@alawvt alawvt closed this as completed Nov 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants