Parser Update #55

mpevner · 2013-02-19T21:21:10Z

This is a near-complete rewrite of the parser. It adds proper multi-element support, is substantially easier to work with, and is more Pythonic.
It does carry run-time overhead: parse(catalog) consumes ~1 GB of RAM on a 250 MB RDF file, but a reasonable computer should process the file quickly and release the memory soon after.
This should hopefully also resolve Issue #20.

sethwoodworth · 2013-02-19T21:25:45Z

gitenberg/rdfparse2.py

+        # get the function out of the lookup_table that matches 'tag'
+        #func = new_book.lookup_table[tag]
+        # call the function on the child element
+        #func(child)


delete my example code that you commented out
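For readers without the diff open, the commented-out lines describe a dispatch-table pattern: look up a handler by tag name, then call it on the child element. A minimal sketch of that idea follows. Only `new_book`, `lookup_table`, `tag`, `func` and `child` come from the snippet; the `Ebook` class body, the handler names and the sample XML are assumptions made up for illustration, not the code in rdfparse2.py.

```python
import xml.etree.ElementTree as ET


class Ebook:
    """Hypothetical stand-in for the parser's book object."""

    def __init__(self):
        self.fields = {}
        # Map a tag name to the bound method that handles it.
        # The tag names and handlers here are illustrative assumptions.
        self.lookup_table = {
            "title": self.handle_title,
            "creator": self.handle_creator,
        }

    def handle_title(self, element):
        self.fields["title"] = (element.text or "").strip()

    def handle_creator(self, element):
        # A book may have several creators, so collect them in a list.
        self.fields.setdefault("creator", []).append((element.text or "").strip())


def parse_book(xml_text):
    new_book = Ebook()
    for child in ET.fromstring(xml_text):
        # Get the function out of the lookup table that matches the tag,
        # then call it on the child element. Unknown tags are skipped.
        func = new_book.lookup_table.get(child.tag)
        if func is not None:
            func(child)
    return new_book


book = parse_book(
    "<etext><title> Moby Dick </title><creator>Melville, Herman</creator>"
    "<rights>ignored</rights></etext>"
)
print(book.fields)
```

Using `.get()` on the table means an unrecognized tag is ignored rather than raising `KeyError`; whether the real parser should skip or fail on unknown tags is a separate design choice.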

statics/fixes is_bag strips comments, etc

forgot to set pgcat data (type elements)

This fixes the issue of bags in two ways. One, I was passing the 'root' element and not the iterated item. Two, I generated a static function, leaf_element, which returns the outermost 'first' element, so be careful if your branch splits on the way and you care.
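A rough sketch of how the two fixes might fit together, under stated assumptions: `leaf_element` is read here as "follow the first child at each level until an element has no children", and the `bag_items` helper, the un-namespaced `Bag`/`li` tag names and the sample XML are hypothetical (the real catalog uses namespaced RDF tags).

```python
import xml.etree.ElementTree as ET


def leaf_element(element):
    """Follow the first child at each level until an element has no children.

    Siblings after the first child are never visited, which is the
    'branch splits on the way' caveat.
    """
    while len(element) > 0:
        element = element[0]
    return element


def bag_items(root, bag_tag="Bag", item_tag="li"):
    """Yield one leaf per bag item, passing the iterated item, not the root."""
    for bag in root.iter(bag_tag):
        for item in bag.findall(item_tag):
            yield leaf_element(item)


doc = ET.fromstring(
    "<etext><creator><Bag>"
    "<li><wrap>Twain, Mark</wrap></li>"
    "<li><wrap>Warner, Charles Dudley</wrap><wrap>ignored sibling</wrap></li>"
    "</Bag></creator></etext>"
)
names = [leaf.text for leaf in bag_items(doc)]
print(names)
```

In the second `li` the branch splits, and only the first `wrap` is returned; the sibling is silently dropped.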

mpevner · 2013-02-20T17:43:12Z

oops, wrong button. traversal issue also fixed.

This implements __setitem__ and __getitem__ for an Ebook item, while also dumping all the set_ methods. It looks a little kludgier, but it makes it much more flexible
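To illustrate the trade-off, here is a minimal sketch of dict-style access replacing per-field `set_` methods. The field names, the `ALLOWED` set and the internal `_fields` dict are assumptions for illustration; the actual Ebook class may store and validate its data differently.

```python
class Ebook:
    """Hypothetical book record with dict-style field access."""

    # Assumed field names, for illustration only.
    ALLOWED = {"id", "title", "creator", "language"}

    def __init__(self):
        self._fields = {}

    def __setitem__(self, key, value):
        # One generic setter replaces set_title(), set_creator(), and so on.
        if key not in self.ALLOWED:
            raise KeyError(f"unknown field: {key!r}")
        self._fields[key] = value

    def __getitem__(self, key):
        return self._fields[key]


book = Ebook()
# Fields can now be set in a loop from parsed (key, value) pairs.
for key, value in [("id", 2701), ("title", "Moby Dick")]:
    book[key] = value
print(book["title"])
```

The flexibility comes from keys being data: a parser loop can assign whatever field it just read without a matching method per field, at the cost of losing the explicit method names.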

mpevner · 2013-02-21T00:19:45Z

Ok, so this parser
a) is not yet a drop-in replacement
b) has some minor(?) differences
Concerning A, making it a drop-in would require replicating the Gutenberg class inasmuch as GITenberg.py cares; this is not an issue.
Concerning B, it ignores file info right now, which is potentially important, given that the original culls books that have no associated filenames. This can be replicated, and even extended: a proper File class could be created, with links to and from files and ebooks, at which point you'd cull any book with no linked files.
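The File idea could look something like the sketch below. This is only an illustration under assumptions: the attribute names (`book_id`, `files`, `filename`, `ebook`) and the `cull_books` helper are hypothetical, not names from the existing code.

```python
class Ebook:
    """Hypothetical book record that tracks its linked files."""

    def __init__(self, book_id, title):
        self.book_id = book_id
        self.title = title
        self.files = []


class File:
    """Hypothetical file record linked to exactly one ebook."""

    def __init__(self, filename, ebook):
        self.filename = filename
        self.ebook = ebook
        # Link in both directions: file -> ebook and ebook -> files.
        ebook.files.append(self)


def cull_books(books):
    """Keep only the books that have at least one linked file."""
    return [book for book in books if book.files]


moby = Ebook(2701, "Moby Dick")
orphan = Ebook(99999, "No files here")
File("2701.txt", moby)

kept = cull_books([moby, orphan])
print([book.title for book in kept])
```

With the links in place, culling becomes a one-line filter, and the same structure could later carry per-file data such as modification dates.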

headed to different workstation

This should be able to drop-in replace RDFparse.py now. It generates the pickle differently, so here be dragons.

mpevner · 2013-02-21T16:52:46Z

Updated it to make it fit into GITenberg.py so it can just drop right in. In theory.

getitem now returns NoneType for items the object doesn't have
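In Python terms, a lookup for a missing field evaluates to `None` (the sole value of `NoneType`) instead of raising `KeyError`. A minimal sketch, assuming the fields live in an internal dict called `_fields`, which is a guess at the storage rather than the actual implementation:

```python
class Ebook:
    """Hypothetical book record whose lookups never raise for missing fields."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # dict.get returns None instead of raising KeyError for a missing key.
        return self._fields.get(key)


book = Ebook(title="Moby Dick")
print(book["title"])
print(book["mdate"])
```

The catch is that a misspelled key also comes back as `None`, so callers cannot tell "field not set" from "field name is wrong".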

This does not yet set mdate/filename for a book though, but it comes close.

Still doesn't set ebook mdate/filename data

Added in book culling, so this should now operate as a true drop-in replacement of rdfparse.py

unsure how necessary this is now, but it's in the original, and is EASILY removed.


mpevner · 2014-08-20T19:20:42Z

Closing due to obsolescence.

sethwoodworth added 30 commits July 11, 2012 12:37

adding GPL'd gutenberg catalog

2905c00

Merge branch 'master' of github.com:sethwoodworth/GITenberg

71d5862

generate a catalog from PG's xml index

0adbbac

adding rdfparse external library from gutenpy project

6a90e29

checking in a pickle, bc PG wont let me fetch the xml file

bd94b1a

unique list of file endings present in the archive

81c986e

adding requirements.txt with GitPython

e6b324a

adding file endings frequency

9c6d15b

add freq howto to readme

e3eb064

update docs

c20b30a

adding TODO file

d13bd9a

drying a dry run of metadata.yaml creation

2593678

fix unicode decode errors and switch to json

994a0cd

Update README.rst

a6208ab

adding issues link

Update README.rst

ad584dc

adding issues link

Update README.rst

3c721d4

Update README.rst

11f62a1

Update README.rst

afb7ddb

Update README.rst

9da66b2

making git subprocesses

9fd0c36

Merge branch 'master' of github.com:sethwoodworth/GITenberg

761f06b

Merge branch 'master' of github.com:sethwoodworth/GITenberg

2042d92

ill advised attempts to do github3 api by hand

709b724

ill advised attempts to do github3 api by hand

9f9a982

now pushing repos to github

acc0acc

now pushing repos to github

a80b21a

update commit string and repo desc/homepage, rename yaml > json

3ca3871

update commit string and repo desc/homepage, rename yaml > json

06b731b

spelling *tenberg correctly in the script

276b457

spelling *tenberg correctly in the script

889276f

sethwoodworth reviewed Feb 19, 2013

mpevner added 3 commits February 19, 2013 17:10

assorted updates

d142ec1

statics/fixes is_bag strips comments, etc

added gutenberg category

084658c

forgot to set pgcat data (type elements)

multi-item fix

831a506

This fixes the issue of bags in two ways. One, I was passing the 'root' element and not the iterated item. Two, I generated a static function, leaf_element, which returns the outermost 'first' element, so be careful if your branch splits on the way and you care.

mpevner closed this Feb 20, 2013

mpevner reopened this Feb 20, 2013

set/get modify

8d8a0bb

This implements __setitem__ and __getitem__ for an Ebook item, while also dumping all the set_ methods. It looks a little kludgier, but it makes it much more flexible

mpevner added 2 commits February 20, 2013 19:57

leaving work

b3115a3

headed to different workstation

implement Gitenberg class

3352023

This should be able to drop-in replace RDFparse.py now. It generates the pickle differently, so here be dragons.

mpevner and others added 14 commits February 21, 2013 12:53

safety on getitem

93e363d

getitem now returns NoneType for items the object doesn't have

added file processing

3e534a9

This does not yet set mdate/filename for a book though, but it comes close.

cleanup

d9aab37

Still doesn't set ebook mdate/filename data

drop-in replacement

5ace365

Added in book culling, so this should now operate as a true drop-in replacement of rdfparse.py

unicode safety

1ef2b79

unsure how necessary this is now, but it's in the original, and is EASILY removed.

Fix link to web site

46a6695

Add more links

743a730

Add navigational support to help orient newcomers

Merge pull request gitenberg-dev#56 from tfmorris/master

3d44707

Fix link to web site in contributing template

Merge pull request gitenberg-dev#59 from tfmorris/patch-1

89b688e

Add more links

s/PAge/Page/

26f7d50

Merge branch 'master' of https://github.com/sethwoodworth/GITenberg

3e1191c

syncing to seth/master

c4fe769

blank values return NoneType. Modified cleanup to return '' instead

85c4f39

semi-properly fixed nonetype issue

f8e01e6

sethwoodworth added the metadata label Apr 3, 2014

mpevner closed this Aug 20, 2014
