syncutil - window size for reading datastream information can be too small #17

Closed
ghukill opened this issue Mar 16, 2016 · 3 comments

ghukill commented Mar 16, 2016

Creating new issue based on conversations from Issue #15.

The problem arises when the datastream information is particularly long (e.g. long labels), causing it to be longer than the moving window used for reading datastream information.

Bumping the window size on line 206 and lines 252-255 from 200 / 250 to something like 750 worked for a particular set of objects with long datastream labels, but might not be a permanent solution.
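
To illustrate the failure mode described above, here is a minimal sketch. The regex, the `find_dsinfo` helper, and the sample FOXML fragment are all hypothetical stand-ins and not the actual code in `syncutil.py`; only the window sizes of 250 and 750 come from this issue. It shows how a search limited to a fixed trailing window can miss the datastream information when a label pushes the opening of the tag outside that window.

```python
import re

# Hypothetical pattern for illustration only; NOT the regex used in syncutil.
DSINFO_RE = re.compile(
    r'<foxml:datastreamVersion[^>]*\bLABEL="(?P<label>[^"]*)"[^>]*>'
)


def find_dsinfo(section, window):
    """Search only the trailing `window` characters of the section."""
    return DSINFO_RE.search(section[-window:])


# A label long enough that the whole tag no longer fits in 250 characters.
long_label = "A very long datastream label " * 20

# Made-up FOXML fragment ending just before the datastream content.
section = (
    '<foxml:datastream ID="DS1">'
    '<foxml:datastreamVersion ID="DS1.0" LABEL="%s" MIMETYPE="text/xml">'
    '<foxml:binaryContent>' % long_label
)

small = find_dsinfo(section, 250)  # window starts in the middle of the label
large = find_dsinfo(section, 750)  # window covers the whole tag

print("250-character window matched:", small is not None)
print("750-character window matched:", large is not None)
```

Any fixed size can be exceeded by a long enough label, which is why a larger constant may only postpone the problem.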

@rlskoeser
Contributor

@ghukill thanks for opening this; I think we may be adding some notes here soon with some other related issues and/or edge cases we've been running into.

@jayvarner
Contributor

I'm just adding some errors I encountered:

Error importing emory:d743q to dev: 400 <?xml version="1.0" encoding="UTF-8"?><management:validation  xmlns:management="http://www.fedora.info/definitions/1/0/management/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.fedora.info/definitions/1/0/management/ http://www.fedora.info/definitions/1/0/validation.xsd" pid="unknown"  valid="true">
  <management:contentModels>
  </management:contentModels>
  <management:problems>
    <management:problem>Schematron validation failed:org.xml.sax.SAXParseException; lineNumber: 921; columnNumber: 2; The value of attribute "REF" associated with an element type "foxml:contentLocation" must not contain the '<' character.</management:problem>
  </management:problems>
  <management:datastreamProblems>
  </management:datastreamProblems>
</management:validation>

ChecksumMismatch even with --archive-xml and --requires-auth, e.g.:

repo-cp --archive-xml --requires-auth prod dev emory:pg3k9

Traceback (most recent call last):
  File "/home/jsvarn/eulf/bin/repo-cp", line 137, in <module>
    repo_copy()
  File "/home/jsvarn/eulf/bin/repo-cp", line 121, in repo_copy
    requires_auth=args.requires_auth)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 104, in sync_object
    export_data = export.object_data().getvalue()
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 298, in object_data
    dsinfo = self.get_datastream_info(previous_section)
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/syncutil.py", line 258, in get_datastream_info
    infomatch = self.dsinfo_regex.search(force_text(dsinfo))
  File "/home/jsvarn/eulf/lib/python2.7/site-packages/eulfedora/util.py", line 44, in force_text
    s = six.text_type(bytes(s), encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

When I actually catch the error for the one above, I get:

Unexpected error on emory:bcd79: <type 'exceptions.ValueError'> __len__() should return >= 0
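
One plausible cause of the `UnicodeDecodeError` in the traceback above, not confirmed in this thread, is a window of bytes that begins in the middle of a multi-byte UTF-8 character: `0xa9` is a continuation byte (for example, the second byte of the copyright sign), so it cannot start a character. The sketch below reproduces that error with made-up sample text and shows one tolerant way to decode such a window; whether replacing bytes is acceptable for `syncutil` is an assumption, not something the maintainers have stated.

```python
# Sample text is invented for illustration.
text = u"Copyright \u00a9 Emory University"
data = text.encode("utf-8")   # the copyright sign is encoded as b"\xc2\xa9"

cut = data.index(b"\xa9")     # offset of the continuation byte
window = data[cut:]           # a window that begins mid-character

try:
    window.decode("utf-8")
    failed = False
    reason = None
except UnicodeDecodeError as err:
    failed = True
    reason = err.reason       # kept because `err` is cleared after the block

# Tolerant alternative: substitute U+FFFD for undecodable bytes
# instead of raising.
tolerant = window.decode("utf-8", errors="replace")

print(failed, reason)
print(tolerant)
```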

rlskoeser added the bug label Apr 15, 2016
rlskoeser added a commit that referenced this issue Jul 28, 2016
Also fixes omit-checksum filter for case when data is a generator
@rlskoeser
Contributor

I think setting a larger size for the chunk used for datastream info should be fine, and it shouldn't cause an issue with the regex since we're splitting on datastream start and end - that chunk shouldn't ever include datastream info for a previous datastream. My testing indicated that it worked fine for objects that can be successfully synced (excepting the problem record mentioned above, which seems to have other issues).
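
The reasoning above can be sketched roughly as follows. The patterns and the sample FOXML are hypothetical and are not taken from `syncutil.py`; the point is only that once the data is split on datastream start and end, each chunk holds a single datastream, so searching a larger portion of it cannot pick up information from a previous one.

```python
import re

# Hypothetical patterns for illustration; the real code may split differently.
DS_CHUNK_RE = re.compile(
    r'<foxml:datastream\s.*?</foxml:datastream>', re.DOTALL
)
DS_ID_RE = re.compile(r'<foxml:datastream\s[^>]*\bID="(?P<id>[^"]+)"')

# Made-up object with two datastreams, the second with a very long label.
foxml = (
    '<foxml:digitalObject>'
    '<foxml:datastream ID="DC" STATE="A">'
    '<foxml:datastreamVersion ID="DC.0" LABEL="Dublin Core"/>'
    '</foxml:datastream>'
    '<foxml:datastream ID="MODS" STATE="A">'
    '<foxml:datastreamVersion ID="MODS.0" LABEL="%s"/>'
    '</foxml:datastream>'
    '</foxml:digitalObject>' % ("long label " * 100)
)

# Each chunk runs from one datastream start to its matching end.
chunks = DS_CHUNK_RE.findall(foxml)

# Searching a whole chunk, however large, finds only that datastream's ID.
ids = [DS_ID_RE.search(chunk).group("id") for chunk in chunks]
print(ids)
```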


3 participants