
How do I get the id of each book? #105

Closed
iamyihwa opened this issue Jun 20, 2018 · 12 comments

@iamyihwa

Hello,
I have been looking for an intuitive way to get the ID of each book.

Getting the ID from the webpage of each book doesn't seem to work.
When I run 'text = strip_headers(load_etext(17384)).strip()', it says the book doesn't exist.

One way would be to look at catalogs such as http://www.gutenberg.org/dirs/GUTINDEX.1996, but these indices are incomplete and there are too many files.

Ideally, I would like a way to search with some keywords, get a list of books, and then use a title or identifier to retrieve the text.

@c-w
Owner

c-w commented Jun 20, 2018

Hi @iamyihwa and thanks for reaching out. Did you take a look at the get_etexts method to search for the IDs of texts by criteria such as author, title, etc.? That looks like it might fit your use-case. There's more information on the feature in the README: https://github.com/c-w/gutenberg#looking-up-meta-data
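
For example, a minimal sketch along the lines of the README (this assumes the metadata cache has already been populated; see below in this thread):

from gutenberg.query import get_etexts, get_metadata

# Find the IDs of all e-texts whose title matches exactly.
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # frozenset([2701, ...])

# Look up the title(s) recorded for a given e-text ID.
print(get_metadata('title', 2701))  # frozenset(['Moby Dick; Or, The Whale'])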

@iamyihwa
Author

Hi @c-w, thanks for your reply. I have just tried the functions from the link you sent.
However, I receive an invalid-cache error.
I have attached the details below.

from gutenberg.query import get_etexts
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])


AttributeError Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
63 self.graph.open(self.cache_uri, create=False)
---> 64 self._add_namespaces(self.graph)
65 self.is_open = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in _add_namespaces(graph)
131 """
--> 132 graph.bind('pgterms', PGTERMS)
133 graph.bind('dcterms', DCTERMS)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in bind(self, prefix, namespace, override)
917 """
--> 918 return self.namespace_manager.bind(
919 prefix, namespace, override=override)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in _get_namespace_manager(self)
330 if self.__namespace_manager is None:
--> 331 self.__namespace_manager = NamespaceManager(self)
332 return self.__namespace_manager

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in __init__(self, graph)
280 self.__log = None
--> 281 self.bind("xml", "http://www.w3.org/XML/1998/namespace")
282 self.bind("rdf", RDF)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in bind(self, prefix, namespace, override, replace)
361 prefix = ''
--> 362 bound_namespace = self.store.namespace(prefix)
363 # Check if the bound_namespace contains a URI

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/plugins/sleepycat.py in namespace(self, prefix)
443 prefix = prefix.encode("utf-8")
--> 444 ns = self.__namespace.get(prefix, None)
445 if ns is not None:

AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'

During handling of the above exception, another exception occurred:

InvalidCacheException Traceback (most recent call last)
<ipython-input> in <module>()
1 from gutenberg.query import get_metadata
----> 2 print(get_etexts('title', 'Moby Dick; Or, The Whale')) # prints frozenset([2701, ...])

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in get_etexts(feature_name, value)
55
56 """
---> 57 matching_etexts = MetadataExtractor.get(feature_name).get_etexts(value)
58 return frozenset(matching_etexts)
59

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/extractors.py in get_etexts(cls, requested_value)
40 @classmethod
41 def get_etexts(cls, requested_value):
---> 42 query = cls._metadata()[:cls.predicate():cls.contains(requested_value)]
43 results = (cls._uri_to_etext(result) for result in query)
44 return frozenset(result for result in results if result is not None)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in _metadata(cls)
113
114 """
--> 115 return load_metadata()
116
117 @classmethod

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in load_metadata(refresh_cache)
295
296 if not cache.is_open:
--> 297 cache.open()
298
299 return cache.graph

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
65 self.is_open = True
66 except Exception:
---> 67 raise InvalidCacheException('The cache is invalid or not created')
68
69 def close(self):

InvalidCacheException: The cache is invalid or not created

@c-w
Owner

c-w commented Jun 21, 2018

Did you make sure to create the metadata cache before running the query?

from gutenberg.acquire import get_metadata_cache
cache = get_metadata_cache()
cache.populate()

This should only need to be done once since the results are cached on disk. If this doesn't work for you (due to the BerkeleyDB setup on your machine), you can also try using the SQLite cache, which works everywhere but is somewhat slower:

from gutenberg.acquire import set_metadata_cache
from gutenberg.acquire.metadata import SqliteMetadataCache

cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')
cache.populate()
set_metadata_cache(cache)

There's more documentation on this here: https://github.com/c-w/gutenberg#looking-up-meta-data

@iamyihwa
Author

Hi @c-w, thanks, it worked with the cache trick!
However, while it works for the Moby Dick example, it doesn't seem to work for other queries.

I just get back an empty frozenset(). Any clues?

@c-w
Owner

c-w commented Jun 25, 2018

Hi @iamyihwa. The get_etexts function returns an immutable set, which is why you're seeing frozenset() returned for your query: there were no results. That's because get_etexts currently assumes that you're querying for an exact match, e.g. you know the author's name and want to find all the books they wrote, or you know the name of a book and want to find all the copies in the corpus.
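
To illustrate the exact-match behavior (an illustrative sketch, assuming a populated cache):

from gutenberg.query import get_etexts

print(get_etexts('title', 'math'))  # probably frozenset(): no title is exactly 'math'
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # non-empty: exact title match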

In order to do a fuzzy search on the titles, e.g. to find all the books whose title contains "math", you might be able to use or adapt this snippet:

from gutenberg.acquire import get_metadata_cache
from gutenberg.query.api import MetadataExtractor

# define search parameters
search_term = 'math'
search_field = 'title'

# get a reference to the metadata graph
cache = get_metadata_cache()
cache.open()
graph = cache.graph

# execute the search
extractor = MetadataExtractor.get(search_field)
results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if search_term.lower() in value.toPython().lower())

# print the first result of the search: (25387, 'Mathematical Essays and Recreations')
result = next(results)
print(result)
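
If you want more than the first match, you can also materialize the rest of the generator, e.g. (a small sketch; note that next(results) above already consumed the first item):

for etext, title in sorted(results)[:10]:  # sorted by e-text number
    print(etext, title)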

@iamyihwa
Author

Thanks @c-w !!
I do get results! :-)
One thing I notice: is there any way to sort the results, like when I do a search on the Gutenberg website? (There the results are sorted according to popularity.)

What I would like to do eventually is to get some domain-specific texts, train a classifier on them, and later use that classifier to determine the domain of unseen text.
My domains of interest are math, history, etc.
I would like to get, for example, math textbooks rather than titles that merely contain the substring, like 'aftermath ...'.

Sorting by popularity could be one option for this; if you know of any other way, that would be nice!

@iamyihwa
Author

Hi @c-w, I have just tried the function; however, with the index that I get, I cannot use it to retrieve the text.

I want to get the text out of the book 'Four Lectures on Mathematics', which has the index 29788:
(29788, 'Four Lectures on Mathematics, Delivered at Columbia University in 1911')

However, I get an error. What am I doing wrong? Could you have a look?

@c-w
Owner

c-w commented Jun 26, 2018

In order to get meaningful search relevance, I'd suggest doing a rough filtering of the documents using the Gutenberg library and then ingesting each document's full text into a real search engine like Elasticsearch or Azure Search. That way you'll get nice disambiguation and ranking.
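
For example, here's a rough sketch using the official elasticsearch Python client (the index name, the candidate_books iterable, and the local cluster URL are all assumptions for illustration):

from elasticsearch import Elasticsearch
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

es = Elasticsearch(['http://localhost:9200'])

# candidate_books: (etext, title) pairs, e.g. from the title-search snippet above
for etext, title in candidate_books:
    doc = {'title': title, 'text': strip_headers(load_etext(etext)).strip()}
    es.index(index='gutenberg', id=etext, body=doc)

# let the search engine rank the results for a full-text query
hits = es.search(index='gutenberg', body={'query': {'match': {'text': 'mathematics'}}})
print(hits['hits']['hits'][0]['_source']['title'])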

If that approach is too heavy, you can also adjust the query condition in the text-search snippet that I sent earlier (the line if search_term.lower() in value.toPython().lower()) to add some more checks, for example a regex match that excludes titles where 'math' does not start at a word boundary.
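
For instance, a sketch of that adjustment using Python's re module (reusing graph and extractor from the earlier snippet; the exact pattern is an assumption to adapt):

import re

search_term = 'math'
# \b requires the term to start at a word boundary, so 'Mathematics'
# still matches but 'aftermath' does not.
pattern = re.compile(r'\b' + re.escape(search_term), re.IGNORECASE)

results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if pattern.search(value.toPython()))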

Downloading book 29788 fails since it doesn't offer a textual download. I've updated the error message to make this clearer. You can check the available formats for a book like this:

from gutenberg.query import get_metadata

print(get_metadata('formaturi', 29788))
# frozenset({
#    'http://www.gutenberg.org/files/29788/29788-t/29788-t.tex',
#    'http://www.gutenberg.org/files/29788/29788-pdf.pdf',
#    'http://www.gutenberg.org/files/29788/29788-pdf.zip',
#    'http://www.gutenberg.org/ebooks/29788.rdf',
#    'http://www.gutenberg.org/files/29788/29788-t.zip'
# })

In order to download one of these non-textual formats, you can use this snippet:

from gutenberg.acquire.text import _etextno_to_uri_subdirectory
from gutenberg.acquire.text import _GUTENBERG_MIRROR

text = 29788
extension = '-pdf.pdf'

url = '{mirror}/{path}/{text}{extension}'.format(
  mirror=_GUTENBERG_MIRROR,
  path=_etextno_to_uri_subdirectory(text),
  text=text,
  extension=extension)
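
If you then want to fetch the file from Python (using the url built above) rather than an external tool, here's a minimal sketch with the requests library (not part of gutenberg; the output filename is arbitrary):

import requests

response = requests.get(url)
response.raise_for_status()
with open('29788-pdf.pdf', 'wb') as f:
    f.write(response.content)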

I'll also open a pull request to make this functionality available as a single function. Update: the snippet above is now available via the _format_download_uri_for_extension function in the gutenberg.acquire.text module on master.

c-w added a commit that referenced this issue Jun 26, 2018
The `load_etext` function currently throws an exception when there is no
textual download candidate available for a given book. However, some
users might want to use Gutenberg to download non-textual versions of
books. All available formats of a book can already be looked up via the
formaturi metadata extractor, so this change exposes a method to enable
a client to format the download URL for an arbitrary extension.

See #105
@iamyihwa
Author

Thanks @c-w for the quick feedback and for showing me ways forward!
Yes, I could get the URL using the snippet.
I will download it using some external tool after that.
Thanks!

I have also tested the new function; however, _format_download_uri_for_extension didn't work even after the update.

@c-w
Owner

c-w commented Jun 27, 2018

As I mentioned, the new method was just published to master, but we haven't made a new PyPI release yet. This means that you'll have to install the package from GitHub, e.g. via pip install https://github.com/c-w/gutenberg/archive/d3a98dce92daf2c0cac68e142962aef8cd37b9f0.zip.

@c-w
Owner

c-w commented Jul 18, 2018

@iamyihwa Closing this issue since all of your questions seem to have been addressed. Feel free to reopen if you have any additional questions.

@c-w c-w closed this as completed Jul 18, 2018
@iamyihwa
Author

@c-w Thanks for the support. Yes, all my doubts and problems have been resolved for now. Thanks a lot again for all the help! I will surely get back when I have more issues.
