ddrindex not indexing av/visual history correctly #186

Closed
sarabeckman opened this issue Jun 26, 2020 · 9 comments
Labels: bug, fixed (Issue addressed and awaiting closure), important

Comments

@sarabeckman
Collaborator

sarabeckman commented Jun 26, 2020

Indexed ddr-chi-1 and ddr-densho-400 to ddrstage. Neither is displaying correctly.

For ddr-chi-1, the video download links are not working and the player is the old player.

For ddr-densho-400, the type is an AV object that is audio only. It should appear like this CSUJAD interview on production: http://ddr.densho.org/ddr-csujad-9-1/. I indexed ddr-csujad-9 to ddrstage to compare, and it is not displaying.

For comparison, here is a good doc from the production ES cluster (indexed at 2019-03-11T11:55:53):

GOODddr-csujad-9-1-ESdoc.json.txt

And here is a bad doc from the stage ES cluster that was indexed with ddr-cmdln v5.0.4 on master:

BADddr-csujad-9-1-ESdoc.json.txt

This behavior is also causing ddr-public to use the incorrect version of the av templates (i.e., the old segment template that uses the deprecated embedded IA player). Here is a bad doc from the production ES index for ddr-chi-1-1:

BADddr-chi-1-1-1-ESdoc.json.txt

@GeoffFroh
Member

GeoffFroh commented Jun 26, 2020

This is the same behavior reported in #155.

Note that we confirmed that the problem entities are present at IA and have the proper metadata (see: #123).

@GeoffFroh
Member

GeoffFroh commented Jun 27, 2020

Confirmed that archivedotorg.py is apparently functioning:

>>> from DDR import archivedotorg, models, identifier
>>> e = models.Entity.from_identifier(identifier.Identifier("ddr-chi-1-1-1", "/media/qnfs/kinkura/gold/"))
>>> iameta = archivedotorg.get_ia_meta(e)
>>> iameta
{'id': 'ddr-chi-1-1-1',
 'xml_url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1_files.xml',
 'http_status': 200,
 'original': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
 'mimetype': 'video/mpeg',
 'files': {
  'mp3': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp3', 'format': 'mp3',
          'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp3',
          'mimetype': 'audio/mpeg', 'encoding': None,
          'sha1': 'd91cb8611509815d52db60381b7375db2260620d',
          'size': '1744963', 'length': '145.15', 'height': '0', 'width': '0', 'title': ''},
  'mp4': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mp4', 'format': 'mp4',
          'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mp4',
          'mimetype': 'video/mp4', 'encoding': None,
          'sha1': '21a31c0e996d9da39c7c275504262ab433533a9d',
          'size': '15178544', 'length': '145.15', 'height': '480', 'width': '853', 'title': ''},
  'mpg': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.mpg', 'format': 'mpg',
          'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.mpg',
          'mimetype': 'video/mpeg', 'encoding': None,
          'sha1': '87bc67df89c2f45a0e1555c52afab8bf1fa433f8',
          'size': '653512708', 'length': '145.13', 'height': '1080', 'width': '1920', 'title': ''},
  'png': {'name': 'ddr-chi-1-1-1-mezzanine-87bc67df89.png', 'format': 'png',
          'url': 'https://archive.org/download/ddr-chi-1-1-1/ddr-chi-1-1-1-mezzanine-87bc67df89.png',
          'mimetype': 'image/png', 'encoding': None,
          'sha1': '58b81cc20f4414c2336235cb0e64aed2e0703cf2',
          'size': '29827', 'length': '', 'height': '', 'width': '', 'title': ''}}}

@GeoffFroh
Member

GeoffFroh commented Jun 28, 2020

This issue is the result of changes to Python 3's configparser class:

"Config parsers do not guess datatypes of values in configuration files, always storing them internally as strings." (https://docs.python.org/3/library/configparser.html#supported-datatypes)

The problem code is in the config module (https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/config.py#L44):

OFFLINE = CONFIG.get('debug', 'offline')

It should use getboolean() instead:

OFFLINE = CONFIG.getboolean('debug', 'offline')

The template attribute is generated by processing data from IA. This data is retrieved by the archivedotorg.get_ia_meta() function which is invoked by DDRObject.to_esobject() at:

https://github.com/denshoproject/ddr-cmdln/blob/master/ddr/DDR/models/common.py#L414

The logic checks the value of the config var OFFLINE (set in the [debug] section of the app configs). The default value in ddrlocal.cfg is offline=False, and because configparser.get() is used instead of .getboolean(), the resulting value of config.OFFLINE is the string 'False'. A non-empty string is truthy in Python, so config.OFFLINE behaves as if it were True. Therefore, the expression:

if not config.OFFLINE:

always evaluates to boolean False, archivedotorg.get_ia_meta() is never invoked, the template attribute is not set, and the resulting ES doc is invalid.
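The difference between get() and getboolean() can be reproduced in isolation. The following is a minimal sketch, not the actual DDR config module: the CFG_TEXT string and the should_fetch_ia_meta() helper are hypothetical stand-ins for ddrlocal.cfg and the `if not config.OFFLINE:` guard.

```python
import configparser

# Hypothetical stand-in for the [debug] section of ddrlocal.cfg.
CFG_TEXT = """
[debug]
offline=False
"""

config = configparser.ConfigParser()
config.read_string(CFG_TEXT)

# get() always returns the raw string; getboolean() parses it into a bool.
offline_str = config.get('debug', 'offline')
offline_bool = config.getboolean('debug', 'offline')


def should_fetch_ia_meta(offline):
    """Hypothetical helper mirroring the `if not config.OFFLINE:` guard."""
    return not offline


# The string 'False' is non-empty and therefore truthy, so the guard fails.
print(repr(offline_str), should_fetch_ia_meta(offline_str))
# The parsed boolean False lets the guard pass.
print(repr(offline_bool), should_fetch_ia_meta(offline_bool))
```

With the string value the guard never passes, which matches the skipped get_ia_meta() call described above; with the parsed boolean it does.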

Note that this configparser behavior in Python 3 may affect other Django projects in our portfolio.

@gjost
Member

gjost commented Jun 29, 2020

configparser boolean behavior updated in ddr-local commit f8d9c1c and ddr-cmdln commit 8e0bdeb for package ddrlocal-master_5.0.5~deb10.

@GeoffFroh
Member

> configparser boolean behavior updated in ddr-local commit f8d9c1c and ddr-cmdln commit 8e0bdeb for package ddrlocal-master_5.0.5~deb10.

This fixes the issue with the archivedotorg.get_ia_meta() code being skipped (see: #186 (comment)), but the underlying issue still exists with some entities.

Indexing ddr-csujad-9-1 with the patched code did work (see: ddrstage.densho.org/ddr-csujad-9-1)

Indexing ddr-densho-400-20 with the patched code now hits the get_ia_meta() function, but throws this error:

(cmdln) ddr@kinkura:/home/densho$ ddrindex publish --hosts 192.168.0.20:9200 -r /media/qnfs/kinkura/gold/ddr-densho-400/files/ddr-densho-400-1
2020-06-29 12:08:14.549164-07:00 | 1/4 POST ddr-densho-400-1-transcript-bb74aa023d 
2020-06-29 12:08:14.602200-07:00 | 2/4 SKIP ddr-densho-400-1-master-70dda47d00 unpublishable
2020-06-29 12:08:14.611803-07:00 | 3/4 POST ddr-densho-400-1-mezzanine-70dda47d00 
2020-06-29 12:08:14.643226-07:00 | 4/4 POST ddr-densho-400-1 
Traceback (most recent call last):
  File "/opt/ddr-cmdln/venv/cmdln/bin/ddrindex", line 33, in <module>
    sys.exit(load_entry_point('ddr-cmdln==3.0.0.post1', 'console_scripts', 'ddrindex')())
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/cli/ddrindex.py", line 311, in publish
    path, recursive=recurse, force=force
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 649, in post_multi
    created = self.post(document, parents=parents, force=True)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/docstore.py", line 565, in post
    d = document.to_esobject(public_fields=public_fields, public=public)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/models/common.py", line 415, in to_esobject
    d.ia_meta = archivedotorg.get_ia_meta(self)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 40, in get_ia_meta
    iaobject = IAObject(o.identifier.id)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 99, in __init__
    self._gather_files_meta()
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 134, in _gather_files_meta
    self.files[format_] = IAFile(self.id, format_, tag)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.7/site-packages/ddr_cmdln-3.0.0.post1-py3.7.egg/DDR/archivedotorg.py", line 177, in __init__
    setattr(self, field, tag.find(field).contents[0])
IndexError: list index out of range

Here's the IA meta for ddr-csujad-9-1 (which works):

https://ia803004.us.archive.org/23/items/ddr-csujad-9-1/ddr-csujad-9-1_files.xml

And for ddr-densho-400-20 (which does not):
https://ia802806.us.archive.org/6/items/ddr-densho-400-20/ddr-densho-400-20_files.xml

The only difference between the two sets of files appears to be the presence of an ogg file in the working entity (ddr-csujad-9-1).

Both of the underlying Entity files (i.e., entity.json) have genre set to interview and format set to av as per spec, and both have an mp3 file as the mezzanine and master file in the file_groups attribute.

@gjost
Member

gjost commented Jun 29, 2020

I dropped and recreated my local Elasticsearch index and now I'm seeing the IndexError.

@gjost
Member

gjost commented Jun 29, 2020

In this particular case the error is because the title field for the mp3 item in https://ia802804.us.archive.org/9/items/ddr-densho-400-4/ddr-densho-400-4_files.xml is blank. Is this something we care about?

Update: Looks like the original MP3 has empty title, album, and creator tags.

@GeoffFroh
Member

> In this particular case the error is because the title field for the mp3 item in https://ia802804.us.archive.org/9/items/ddr-densho-400-4/ddr-densho-400-4_files.xml is blank. Is this something we care about?
>
> Update: Looks like the original MP3 has empty title, album, and creator tags.

Looks like those are just the embedded ID3 tags, which we don't use in the interface at all, so they're not important. The function should ignore them if they're not present.
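One way to tolerate empty or missing tags is to fall back to an empty string instead of indexing into the tag contents. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than the parser used in archivedotorg.py, the FILES_XML excerpt is a hypothetical fragment shaped like an IA _files.xml entry, and read_fields() is an illustrative helper, not the code from the actual fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt shaped like one entry of an IA *_files.xml document.
# The <title> tag is present but empty, and <album> is absent entirely.
FILES_XML = """
<files>
  <file name="example.mp3" source="original">
    <format>VBR MP3</format>
    <title></title>
    <size>1744963</size>
  </file>
</files>
"""

FIELDS = ['format', 'title', 'size', 'album']


def read_fields(file_elem, fields):
    """Collect text for each field, using '' when a tag is missing or empty."""
    values = {}
    for field in fields:
        child = file_elem.find(field)
        # child is None when the tag is absent; child.text is None when empty.
        text = child.text if child is not None else None
        values[field] = text.strip() if text else ''
    return values


root = ET.fromstring(FILES_XML)
meta = read_fields(root.find('file'), FIELDS)
print(meta)
```

Populated fields keep their values, while empty and absent fields both become empty strings, so no exception is raised for files whose embedded ID3 tags are blank.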

@gjost
Member

gjost commented Jun 30, 2020

Empty tags coming from IA are now ignored.

Fixed in ddr-cmdln commit 711e303 for package ddrcmdln-master_5.0.5~deb10 / ddrlocal-master_5.0.5~deb10.
