Trouble reading Harvard Open Metadata MARC files (UTF-8 related?) #89
Can you isolate a single record that's displaying this problem? From the traceback it appears that there's a subfield code that's not ASCII, which is forbidden by the MARC21 spec. (If such records exist in the wild, though, pymarc should probably have a way to deal with them. This is one area where there's currently no workaround as far as I know...)
Thank you! Working on isolating the record; it's unfortunately a massive binary file. :( But from the Python debugger, it does look like you are correct that there are Unicode characters in a subfield code.
Here's what the record looks like dumped to a text file by MarcEdit. It does indeed look like the 040 field has Unicode in the subfield code. If the MARC21 spec indeed forbids this, then this issue should probably be closed, although more tolerant error handling would still be helpful.
It looks like the record is coded as containing Unicode (leader position 9). I forget, why are you using reader = MARCReader(fh, utf8_handling='ignore')?
I was using 'ignore' because otherwise the processing would stop when it encountered encoding difficulties. (Aside: I wonder if better error handling would be to skip offending records and keep going?) Right now when it throws an exception the script halts. (Caveat: I am a Python newbie and likely doing something wrong.)
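The skip-and-continue idea can be sketched without touching pymarc internals, by wrapping any iterator whose next() may raise but still advances past the bad item. A minimal sketch; the FlakyReader stand-in and its error are invented for illustration (with pymarc you would pass a MARCReader instead):

```python
def read_skipping_errors(reader):
    """Yield items from an iterator, skipping items whose retrieval raises.

    Only works with iterators (pymarc's MARCReader is one) that advance
    past a bad record even when decoding it raises, so that next() can
    be called again afterwards.
    """
    while True:
        try:
            yield next(reader)
        except StopIteration:
            return
        except Exception as exc:
            print('skipping unreadable record: %r' % (exc,))

# Stand-in for a reader positioned over one undecodable record.
class FlakyReader:
    def __init__(self, items):
        self._items = iter(items)
    def __iter__(self):
        return self
    def __next__(self):
        item = next(self._items)
        if item is None:  # simulate a record with a bad subfield code
            raise ValueError('non-ASCII subfield code')
        return item

good = list(read_skipping_errors(FlakyReader(['rec1', None, 'rec3'])))
```

Whether resuming is safe depends on the reader leaving the stream aligned on the next record after a failure, which is worth verifying against the pymarc version in use.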
@viking2917: right, sadly the ignore parameter doesn't ignore everywhere in the MARC field, and also doesn't ignore in areas where UTF-8 isn't permitted. ALEPH (the ILS Harvard is on) will let you save a Unicode character to a subfield. I have a branch somewhere that does some of this error handling for another project: https://github.com/gugek/pymarc/blob/leader-handling/pymarc/record.py
@gugek Thank you! Will give that a go.
That sounds like a reasonable patch to me (of course, it wouldn't have helped here, with utf8_handling='ignore'; I can't think of a good way to use ignore here since it implies ending up with the subfield code completely blank). Decomposing the "ā" and throwing away diacritics would probably do the right thing in this particular case, since I can't see how that's possibly supposed to be anything but $a... But I don't know that's a good general solution :)
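The decompose-and-drop-diacritics repair can be sketched with the standard library; this is an illustration of the idea, not code from pymarc or any branch mentioned here:

```python
import unicodedata

def ascii_fold(code):
    """Fold a subfield code like 'ā' back to plain 'a'.

    NFD decomposition splits 'ā' into 'a' plus a combining macron;
    dropping the combining marks leaves the base ASCII letter.
    """
    decomposed = unicodedata.normalize('NFD', code)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold('\u0101'))  # the 'ā' seen in the 040 field → a
```

As noted above, this happens to do the right thing for $ā but is lossy in general, so it is better used as a last-resort repair than a default.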
Yes, 'replace' is probably the right option. I was happy to simply discard records with errors so was using 'ignore', but with 'replace' and this patch I seem to be able to get pretty much everything. Thanks everyone!
I am not sure how aggressively the Harvard Open Metadata project is being maintained, but thanks. Mark
I know this is an old issue, but I'm having the same problem as @viking2917... I'm trying to parse the Harvard Open Metadata db, and I'm running into exceptions. Is there a patch I could apply? My code breaks and I cannot catch the exception to skip the record. (I do not understand what @gugek suggested...)
Can you isolate a record that displays the issue and attach it as MARC21? (I'd prefer to avoid trying to manually create a bad record by hand for testing, and I don't think my ILS can create one nor can I do it programmatically in pymarc :) )
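For what it's worth, a structurally valid ISO 2709 record with a deliberately bad subfield code can be built by hand in a few lines of Python. This is a hypothetical test-data generator, not pymarc API; the leader values are plausible defaults, not taken from the files in question:

```python
FT, RT = b'\x1e', b'\x1d'  # field and record terminators

def make_record(fields, leader_tail=b'cam a2200000 a 4500'):
    """Build a raw MARC21/ISO 2709 record from (tag, data) byte pairs.

    data holds indicators and subfields but not the field terminator.
    Useful for generating test records, including deliberately invalid
    ones such as a non-ASCII subfield code.
    """
    directory = b''
    body = b''
    for tag, data in fields:
        field = data + FT
        directory += tag + b'%04d' % len(field) + b'%05d' % len(body)
        body += field
    base = 24 + len(directory) + 1      # leader + directory + terminator
    total = base + len(body) + 1        # + body + record terminator
    # leader layout: 00-04 record length, 05-11 status/type/etc.,
    # 12-16 base address of data, 17-23 remaining fixed values
    leader = b'%05d' % total + leader_tail[:7] + b'%05d' % base + leader_tail[12:]
    return leader + directory + FT + body + RT

# An 040 with 'ā' where the subfield code should be a plain 'a'
bad = make_record([(b'040', b'  \x1f' + '\u0101'.encode('utf-8') + b'DLC')])
```

The directory lengths and base address are computed from the actual byte lengths, so the record parses structurally even though the subfield code is illegal.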
@josepablog I did finally get around this problem. Here's how: first I altered record.py (new file attached) to add some error handling, then changed the driver from the version on the GitHub page to one that imports codecs and sys. On my Mac I also needed to install a few Python packages: sudo pip install unidecode (I am not 100% sure whether I needed to install six as well; your mileage may vary). I'm a Python n00b and not sure this code is really production-ready, so I did not create a pull request, but it's been working for me. I only half-understand what I did, as I really don't know Python. Good luck!
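The actual driver snippet didn't survive into this thread. The "importing codecs and sys" remark suggests the Python 2-era pattern of wrapping the output stream so that printing non-ASCII field data doesn't raise; a minimal illustration of that pattern (the sample string is invented, and an in-memory byte stream stands in for sys.stdout):

```python
import codecs
import io

# In Python 2 one would typically wrap the real stdout like this:
#   sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
# Demonstrated here against an in-memory byte stream instead.
buf = io.BytesIO()
out = codecs.getwriter('utf-8')(buf)
out.write(u'040 $\u0101 DLC\n')  # non-ASCII subfield code survives the write
assert buf.getvalue() == b'040 $\xc4\x81 DLC\n'
```

In Python 3 this wrapping is largely unnecessary since stdout is already a text stream, which may be why the original fix reads as Python 2 era.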
@Wooble the file is huge! And my understanding of MARC21 is extremely limited. I'll give @viking2917's solution a try, and hopefully it will work... Thank you both!
(Aside: I traded some emails with the good folks at Harvard and they said something to the effect that there are in fact occasional invalid records due to the large, distributed nature of their libraries and data. They did correct the issues I brought to their attention, but I think it's a good idea to protect against invalid data where possible.)
@viking2917 I downloaded the new Harvard db, and I don't have those problems if I use the utf8_handling='ignore' flag...
@josepablog Interesting. That flag helped me get further but didn't solve all my issues. But glad it's working for you! Perhaps something has changed in the meantime...
I think I declared victory too early. pymarc still breaks on these files: ab.bib.11.20160805.full.mrc. I wish I knew how to isolate the offending record, to get some help from @edsu :)
So it sounds like we might need a way to catch all exceptions when reading a record, and keep moving through the records? |
@edsu Yes, I think so. This is an example of the exception I get (I'm using utf8_handling='ignore', which I don't know makes sense, but it reduces the number of errors):
Should I just wrap the whole thing in a try/except? Or is there anything smarter to do? Thank you again for your help, Ed!
It's fairly annoying to have to do it yourself. I don't think changing utf8_handling is likely to help if the problem is in the indicators or the leader. pymarc itself should probably have a way to recover from these better; personally I'm running my fork's "leader_encoding" branch in production because our database itself has broken records. It's probably a good start but not really something I'd want to merge into master at the moment since it's a bit sledgehammery: it just fixes the leaders we have problems with in the obvious way and prints a warning, with no way to select strict behavior.
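The "sledgehammery" leader repair described here might look roughly like the following. This is a guess at the approach, not code from the leader_encoding branch, and it only handles the encoding byte:

```python
def fix_leader_encoding(leader):
    """Coerce leader position 9 (character coding scheme) to a legal value.

    Per MARC21, ' ' means MARC-8 and 'a' means UCS/Unicode. Anything
    else is forced to 'a' with a printed warning, matching the
    "fix the obvious way and print a warning" behaviour described above.
    """
    leader = leader.ljust(24)
    if leader[9] not in (' ', 'a'):
        print('warning: invalid leader/09 %r, forcing to "a"' % leader[9])
        leader = leader[:9] + 'a' + leader[10:]
    return leader
```

The lack of a strict mode is exactly the drawback mentioned: callers who would rather reject such records get no say.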
Also having this issue with the USDA National Agricultural Library's MARC21 downloads. pymarc 2.9.2 handled these files fine on an old system. I can share a ~60 MB file from this distribution if that helps test the issue.
@pbnjay Yes, sharing the data that can be used to demonstrate the problem is key. On another note, I myself work with MARC records only rarely now. Does it make sense to move pymarc over to the code4lib organization account here on GitHub so it can be maintained/developed without me being a bottleneck?
(If you can isolate a single problem record instead of a 60MB file that would probably be better, though) |
In case it's useful, I previously had bodged together a permissive reader version of |
I'm not sure how Python does character reading, but for large sets like this I'd always recommend taking them out of MARC and putting them into XML. You can't trust the encoding bits, and most ILS systems will do wonky things when exporting large sets (that you won't see with small sets). Additionally, they violate MARC21 rules (but not the general ISO 2709 record structure), so you cannot code rules based on expected field values. When I process directly to XML (at least in how I do it in MarcEdit), I ignore character encoding completely, processing via a binary memory stream and sanitizing characters for XML processing. This way I avoid these kinds of character issues. The other option (MarcEdit does this as well) for MARC processing is to have your stream do encoding swapping based on the record read -- but that requires an algorithm that actually determines the character encoding of each record, so you can conform the reader to the individual record's encoding block and then convert the data into the encoding expected by the writer.
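A first cut at the per-record encoding check described here just inspects leader byte 9 of the raw record before decoding anything. A naive sketch; as the comment notes, real detection can't fully trust this byte and would validate the decode and fall back:

```python
def guess_marc_encoding(raw):
    """Guess a raw MARC record's encoding from leader position 9.

    Returns 'utf-8' when leader/09 is b'a' (UCS/Unicode), otherwise
    'marc-8' (a label for the MARC-8 scheme, not a Python codec name).
    Leaders can lie, so a robust reader would try the decode and
    fall back rather than trust this value blindly.
    """
    return 'utf-8' if raw[9:10] == b'a' else 'marc-8'
```

With the guess in hand, a reader could decode UTF-8 records directly and route MARC-8 records through a transcoder before handing them to a UTF-8 writer.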
I would still like to have a test record to play with that demonstrates the particular problem we are dealing with here. If we can't reproduce the problem it's really impossible to get enough traction to fix it. I do like @anarchivist's idea of adding an option to pymarc.Reader. I'm going to open a new issue for that. |
I essentially just commented out all the offending instances. I uploaded two problem files here: https://www.dropbox.com/sh/f4w7nv6e5ghnpmr/AACXD4L-GGqPhbc1YexBc6iea?dl=0 Since they're from the USDA they should be public domain, but just in case I'll unshare them once you have a copy to debug with. I'm not producing these files, just converting them to XML, so it'll probably be easier for someone else who knows what they're doing to isolate them.
[I think this is the problem in edsu#89 and not really specific to Harvard Open Metadata]
I am trying to use pymarc to read the Harvard Open Metadata MARC files.
Most of the files process OK, but some (for example ab.bib.14.20160401.full.mrc) produce errors when processing. The error I am getting is:
The driver code I am using is:
Other MARC processing tools (e.g. MarcEdit) seem to process the file with no issues, so I think the file is legitimate.
Am I doing something wrong? Is there an issue with pymarc, possibly UTF-8 processing related?