More fields in catalog.rdf than I realized #20

Closed
sethwoodworth opened this Issue Sep 28, 2012 · 21 comments

Contributor

sethwoodworth commented Sep 28, 2012

TODO: integrate the following into readme and metadata.json as applicable

        self.rights = 
        self.toc = 
        self.alttitle = alttitle    
        self.friendlytitle = friendlytitle
        self.contribs = contribs
        self.pgcat = pgcat

According to @mpevner 's research

dc:publisher - generally &pg
dc:title* - warning, can be multiple! (eg: 3439)
dc:tableOfContents
dc:creator* - author
pgterms:friendlytitle - eg 2144
dc:language*
dcterms:LCSH* - subjects
dcterms:LCC* - LOC code
dc:created
dc:rights - IMPORTANT: some works /are/ copyrighted. eg 2144, generally &lic
dc:contributor eg 2077
dc:alternative -- this is an alternative title eg 3439

dc:rights alone makes it worth the effort to parse them all.
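Since lxml/ElementTree comes up later in this thread, here is a minimal sketch (not the project's actual parser) of pulling a few of these dc:* fields out of one record. The namespace URIs are the standard DC and RDF ones; the sample record is a simplified stand-in for a real catalog.rdf entry, not copied from it:

```python
# Sketch: extract a few dc:* fields from one (simplified, made-up) record.
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

sample = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title>Some Title</dc:title>
    <dc:creator>Doe, Jane</dc:creator>
    <dc:rights>Public domain in the USA.</dc:rights>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(sample)
record = root.find("rdf:Description", NS)
fields = {tag: record.findtext("dc:" + tag, namespaces=NS)
          for tag in ("title", "creator", "rights")}
print(fields)
```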

mpevner was assigned Sep 28, 2012


mpevner commented Sep 28, 2012

OK, with some research, I can work this out.
Using this as a reference:
http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#
dc:* are the DC 1.1 elements,
dcterms:* are the /terms/ namespace,
and pgterms:* I reverse engineered, finding only five:
etext -- PG's book ID
file -- points to a file; eg: <pgterms:file rdf:about="&f;dirs/1/15-text.zip">
friendlytitle -- as it sounds, probably what their site displays instead of title/alternativetitle when available
downloads -- as it sounds
category -- as it sounds, but weird. This needs more analysis, as some are such things as:
<pgterms:category><rdf:value>Audio Book, human-read</rdf:value></pgterms:category>


mpevner commented Sep 28, 2012

code used:

import re

# Collect every distinct pgterms:* tag name used in the catalog.
# A name is terminated either by a space (attributes follow) or by ">".
reg1 = re.compile(r'pgterms:[A-Za-z]*? |pgterms:[A-Za-z]*?>')

types = set()
with open("catalog.rdf") as catalog:
    for line in catalog:
        for word in reg1.findall(line):
            types.add(word)

for word in sorted(types):
    print(word)

sigmavirus24 commented Sep 29, 2012

Is there a reason lxml cannot be used to parse this? It seems like xml to me.


sethwoodworth commented Oct 1, 2012

No reason for it. I had found the rdfparse module in someone else's project. At first it meant I could get the content up faster, even if I didn't understand everything going on in the parser code. But an lxml parser makes the most sense for maintainability and for updating the catalog with new Gutenberg data (metadata updates and new releases).

I'll open a ticket for a new parser.


sigmavirus24 commented Oct 1, 2012

Ah yeah. I just saw him making changes and didn't check the file for copyright. I could be wrong about the lxml parser; it seems like W3C was just using xml as an example(?). I need to look a lot more into RDF files, but I don't exactly have the time right now. Since the rdfparse module is GPL, as long as we keep our changes to it open, we're fine.


mpevner commented Oct 5, 2012

NB: books can/do have multiple LCC codes.


mpevner commented Oct 5, 2012

NB: arrays of items can instead be in a 'bag', making separation harder. eg: 25930


sethwoodworth commented Oct 5, 2012

I wouldn't mind seeing LCC codes separated by commas. Subjects I think
should be separated by new lines, as they are pretty long.

Anyone hoping to parse the data should be looking at the metadata.json
anyway.

Actually, to that end, should we not include empty fields in the README.rst
file?



mpevner commented Oct 5, 2012

Regardless of how it's formatted, it means the field will have to change from a string to an array object. As such, I'm not touching it until I deal with other fields. (Or: I'll deal with multi-LCC when I deal with multi-title.)


sethwoodworth commented Oct 5, 2012

Sounds reasonable. Grab the things that are strings before you deal with
lists.



mpevner commented Oct 5, 2012

Problem elements:
alternative
contributor

example:
from ebook 7215

<dc:alternative>
    <rdf:Bag>
      <rdf:li rdf:parseType="Literal">Deng Xi Zi (Thought of Deng Xi Zi)</rdf:li>
      <rdf:li rdf:parseType="Literal">Deng Xizi</rdf:li>
    </rdf:Bag>
  </dc:alternative>

from ebook 14749

<dc:contributor>
    <rdf:Bag>
      <rdf:li rdf:parseType="Literal">Brooke, Stopford Augustus, 1832-1916 [Author of introduction, etc.]</rdf:li>
      <rdf:li rdf:parseType="Literal">Reid, Stephen, 1873-1948 [Illustrator]</rdf:li>
    </rdf:Bag>
  </dc:contributor>

These are problematic because, unlike LCSH and LCC, where each element is preceded by its unique definer, these elements are only preceded by the generic rdf:li.
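For what it's worth, once the namespaces are mapped, the rdf:li items inside a Bag can still be pulled out fairly directly. A sketch against the ebook 7215 example above (standard DC/RDF namespace URIs assumed):

```python
# Sketch: extract the rdf:li values inside an rdf:Bag, using the
# dc:alternative element from ebook 7215 as the sample input.
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

sample = """\
<dc:alternative xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li rdf:parseType="Literal">Deng Xi Zi (Thought of Deng Xi Zi)</rdf:li>
    <rdf:li rdf:parseType="Literal">Deng Xizi</rdf:li>
  </rdf:Bag>
</dc:alternative>
"""

elem = ET.fromstring(sample)
titles = [li.text for li in elem.findall("rdf:Bag/rdf:li", NS)]
print(titles)
```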


sigmavirus24 commented Oct 6, 2012

I tested the theory that it's all xml. The following works, but took up well over a gig of RAM (closer to 2) in the interpreter, so use with care.

>>> from lxml import etree
>>> tree = etree.parse(open('index/catalog.rdf'))
>>> dir(tree)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_setroot', 'docinfo', 'find', 'findall', 'findtext', 'getiterator', 'getpath', 'getroot', 'iter', 'iterfind', 'parse', 'parser', 'relaxng', 'write', 'write_c14n', 'xinclude', 'xmlschema', 'xpath', 'xslt']
>>> tree.docinfo.doctype
u'<!DOCTYPE RDF>'
>>> tree.docinfo.root_name
'RDF'
>>> for p in tree.iter():
...    print p  # This will vomit everywhere
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 0x7a00...>
# etc
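One way around the memory cost (a sketch, not tested against the full catalog): stream the file with iterparse and clear each record once it has been handled, so the whole tree never sits in RAM at once. Shown here on an in-memory sample; for the real file you would pass the path instead:

```python
# Sketch: stream-parse with iterparse, clearing each finished record.
import io
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Stand-in for open('index/catalog.rdf', 'rb'); a toy two-record file.
sample = io.BytesIO(b"""\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:ID="etext1"/>
  <rdf:Description rdf:ID="etext2"/>
</rdf:RDF>
""")

count = 0
for event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag == RDF_NS + "Description":
        count += 1    # ...process the record here...
        elem.clear()  # free the subtree we just finished with
print(count)
```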

mpevner commented Oct 6, 2012

So, my XML knowledge is Le Suck. It does seem, however, that folks mention graphs can be produced from this, so I'm wondering: instead of the current crazy parser, can we make something that takes each etext block, makes a chunk of it, and gives us the data sensibly?


sigmavirus24 commented Oct 6, 2012

You and I are both on the same page @mpevner. But it shouldn't be too terribly difficult. You can access paths like so: tree.xpath('/top-level-block-indicator/next-level/next-level/next-level/@item'), I think. I've used this only a little bit. You can also do: for next_block in tree.xpath('/top-level-block'): next_block.get('foo'). I've only ever used this with xml in a different setting. I haven't played with the catalog.rdf enough to have a good understanding of how it works or how to traverse it. Sorry.


mpevner commented Oct 6, 2012

No worries. As it stands, the current parser is reasonable enough for current work. Long term (i.e., we have plenty of time) we can figure out how to parse it better.


sethwoodworth commented Oct 19, 2012

Max, what is our status on this? Do we have all of the fields? Or what are we still missing?


mpevner commented Oct 19, 2012

well, we have a shiteton of them as it stands, prolly everything Wildly important. Lemme enumerate what we have vs. The Spec.


mpevner commented Feb 19, 2013

I began some work on rebuilding the XML parser to properly walk the RDF file. The code here takes ~1 GB of RAM to operate, but it moves rather quickly and gives us much better RDF access. On the upside, it does not have to be continuously active once everything is moved into a pickle, JSON files, or a db of our choice.
https://gist.github.com/mpevner/4980825


mpevner commented Feb 19, 2013

definitive etext tags are:
creator -> author
publisher -> ignored (things like Project Gutenberg)
friendlytitle
created -> ignored
type -> indicates whether audio book or not EDIT: sod me, this is pgcat info
description
subject -> LCC and LCSH which are the LOC and SUBJ respectively
downloads -> ignored
contributor
title
rights
tableOfContents
alternative -> as in alternative titles
language
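The list above could be captured as a small lookup table. A sketch only: the right-hand field names (author, toc, contribs, etc.) are borrowed from the self.* attributes in the issue description, not from any actual parser code:

```python
# Sketch of the tag handling above as a lookup table. Target field names are
# taken from the self.* attributes in the issue TODO (assumptions, not code).
FIELD_MAP = {
    "creator": "author",
    "friendlytitle": "friendlytitle",
    "type": "pgcat",            # audio book or not
    "description": "description",
    "subject": "subjects",      # LCC and LCSH
    "contributor": "contribs",
    "title": "title",
    "rights": "rights",
    "tableOfContents": "toc",
    "alternative": "alttitle",  # alternative titles
    "language": "language",
}
IGNORED = {"publisher", "created", "downloads"}

def target_field(tag):
    """Return the metadata field for an etext tag, or None if ignored/unknown."""
    return None if tag in IGNORED else FIELD_MAP.get(tag)

print(target_field("creator"), target_field("downloads"))
```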

mpevner referenced this issue Feb 19, 2013

Closed

Parser Update #55


mpevner commented Feb 20, 2013

So, problem: while we can now grab ALL the elements, and with surprising ease, I am realizing that many elements we thought of, or treated, as singular are in fact multiple in some cases. Do we want to treat everything as a list of elements?
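One possible answer (a sketch, not a decision): normalize every field at parse time, so scalars and rdf:Bag collections look identical downstream:

```python
# Sketch: a helper (hypothetical, not in the repo) that makes every parsed
# field a list, whether the RDF gave us one value, a Bag, or nothing.
def as_list(value):
    """Normalize a parsed field: None -> [], scalar -> [scalar], list -> list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

print(as_list("A Title"), as_list(["A", "B"]), as_list(None))
```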


sethwoodworth commented Jul 29, 2014

PG updated their metadata format, much easier to parse now. This is no longer relevant.
