
How do I get the id of each book? #105

Closed
iamyihwa opened this issue Jun 20, 2018 · 12 comments

@iamyihwa

Hello,
I have been looking for an intuitive way to get the ID of each book.

Getting the ID from the webpage of each book doesn't seem to work.
When I run 'text = strip_headers(load_etext(17384)).strip()', it says the book doesn't exist.

One way would be to look at catalogs such as http://www.gutenberg.org/dirs/GUTINDEX.1996, but these indices are incomplete and there are too many files.

Ideally, I would like a way to search with some keywords, get a list of books, and then use a title or identifier to retrieve the text.

@c-w
Owner

c-w commented Jun 20, 2018

Hi @iamyihwa and thanks for reaching out. Did you take a look at the get_etexts method to search for the IDs of texts by criteria such as author, title, etc.? That looks like it might fit your use-case. There's more information on the feature in the README: https://github.com/c-w/gutenberg#looking-up-meta-data
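
For example, a minimal sketch along the lines of the README (this assumes the metadata cache has already been populated; see below in this thread):

from gutenberg.query import get_etexts, get_metadata

# Find the IDs of all e-texts whose title matches exactly.
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # frozenset([2701, ...])

# Look up the title(s) recorded for a given e-text ID.
print(get_metadata('title', 2701))  # frozenset(['Moby Dick; Or, The Whale'])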

@iamyihwa
Author

Hi @c-w, thanks for your reply. I have just tried the functions from the link you sent.
However, I receive an invalid-cache error.
I have attached the details below.

from gutenberg.query import get_etexts
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])


AttributeError Traceback (most recent call last)
~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
63 self.graph.open(self.cache_uri, create=False)
---> 64 self._add_namespaces(self.graph)
65 self.is_open = True

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in _add_namespaces(graph)
131 """
--> 132 graph.bind('pgterms', PGTERMS)
133 graph.bind('dcterms', DCTERMS)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in bind(self, prefix, namespace, override)
917 """
--> 918 return self.namespace_manager.bind(
919 prefix, namespace, override=override)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/graph.py in _get_namespace_manager(self)
330 if self.__namespace_manager is None:
--> 331 self.__namespace_manager = NamespaceManager(self)
332 return self.__namespace_manager

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in __init__(self, graph)
280 self.__log = None
--> 281 self.bind("xml", "http://www.w3.org/XML/1998/namespace")
282 self.bind("rdf", RDF)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/namespace.py in bind(self, prefix, namespace, override, replace)
361 prefix = ''
--> 362 bound_namespace = self.store.namespace(prefix)
363 # Check if the bound_namespace contains a URI

~/anaconda3/envs/fastai/lib/python3.6/site-packages/rdflib/plugins/sleepycat.py in namespace(self, prefix)
443 prefix = prefix.encode("utf-8")
--> 444 ns = self.__namespace.get(prefix, None)
445 if ns is not None:

AttributeError: 'Sleepycat' object has no attribute '_Sleepycat__namespace'

During handling of the above exception, another exception occurred:

InvalidCacheException Traceback (most recent call last)
<ipython-input> in <module>()
1 from gutenberg.query import get_metadata
----> 2 print(get_etexts('title', 'Moby Dick; Or, The Whale')) # prints frozenset([2701, ...])

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in get_etexts(feature_name, value)
55
56 """
---> 57 matching_etexts = MetadataExtractor.get(feature_name).get_etexts(value)
58 return frozenset(matching_etexts)
59

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/extractors.py in get_etexts(cls, requested_value)
40 @classmethod
41 def get_etexts(cls, requested_value):
---> 42 query = cls._metadata()[:cls.predicate():cls.contains(requested_value)]
43 results = (cls._uri_to_etext(result) for result in query)
44 return frozenset(result for result in results if result is not None)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/query/api.py in _metadata(cls)
113
114 """
--> 115 return load_metadata()
116
117 @classmethod

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in load_metadata(refresh_cache)
295
296 if not cache.is_open:
--> 297 cache.open()
298
299 return cache.graph

~/anaconda3/envs/fastai/lib/python3.6/site-packages/gutenberg/acquire/metadata.py in open(self)
65 self.is_open = True
66 except Exception:
---> 67 raise InvalidCacheException('The cache is invalid or not created')
68
69 def close(self):

InvalidCacheException: The cache is invalid or not created

@c-w
Owner

c-w commented Jun 21, 2018

Did you make sure to create the metadata cache before running the query?

from gutenberg.acquire import get_metadata_cache
cache = get_metadata_cache()
cache.populate()

This should only need to be done once since the results are cached on disk. If this doesn't work for you (due to the BerkeleyDB setup on your machine), you can also try using the SQLite cache, which works everywhere but is somewhat slower:

from gutenberg.acquire import set_metadata_cache
from gutenberg.acquire.metadata import SqliteMetadataCache

cache = SqliteMetadataCache('/my/custom/location/cache.sqlite')
cache.populate()
set_metadata_cache(cache)

There's more documentation on this here: https://github.com/c-w/gutenberg#looking-up-meta-data

@iamyihwa
Author

Hi @c-w, thanks, it worked with the cache trick!
However, while it works for the Moby Dick example, it doesn't seem to work for other queries.

I just get back an empty frozenset(). Any clues?

@c-w
Owner

c-w commented Jun 25, 2018

Hi @iamyihwa. The get_etexts function returns an immutable set, which is why you're seeing frozenset() returned for your query: there were no results. That's because get_etexts currently assumes that you're querying for an exact match, e.g. you know the author's name and want to find all the books they wrote, or you know the name of a book and want to find all the copies in the corpus.
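
To illustrate the exact-match behavior (an illustrative sketch, assuming a populated cache):

from gutenberg.query import get_etexts

print(get_etexts('title', 'math'))  # probably frozenset(): no title is exactly 'math'
print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # non-empty: exact title match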

In order to do a fuzzy search on the titles, e.g. to find all the books whose title contains "math", you might be able to use or adapt this snippet:

from gutenberg.acquire import get_metadata_cache
from gutenberg.query.api import MetadataExtractor

# define search parameters
search_term = 'math'
search_field = 'title'

# get a reference to the metadata graph
cache = get_metadata_cache()
cache.open()
graph = cache.graph

# execute the search
extractor = MetadataExtractor.get(search_field)
results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if search_term.lower() in value.toPython().lower())

# print the first result of the search: (25387, 'Mathematical Essays and Recreations')
result = next(results)
print(result)
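
If you want more than the first match, you can also materialize the rest of the generator, e.g. (a small sketch; note that next(results) above already consumed the first item):

for etext, title in sorted(results)[:10]:  # sorted by e-text number
    print(etext, title)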

@iamyihwa
Author

Thanks @c-w !!
I do get results! :-)
One thing I notice: is there any way to sort the results, like when I do a search on the Gutenberg website? (There the results are sorted according to popularity.)

What I would like to do eventually is to get some domain-specific texts, train a classifier on them, and later use that classifier to determine the domain of unseen text.
My domains of interest are math, history, etc.
I would like to get, for example, math textbooks rather than titles that merely contain the substring, like 'aftermath ...'.

Sorting by popularity could be one option for this; if you know of any other way, that would be nice!

@iamyihwa
Author

Hi @c-w, I have just tried the function; however, with the index that I get, I cannot use it to retrieve the text.

I want to get the text out of the book 'Four Lectures on Mathematics', which has the index 29788:
(29788, 'Four Lectures on Mathematics, Delivered at Columbia University in 1911')

However, I get an error. What am I doing wrong? Could you have a look?

@c-w
Owner

c-w commented Jun 26, 2018

In order to get meaningful search relevance, I'd suggest doing a rough filtering of the documents using the Gutenberg library and then ingesting each document's full text into a real search engine like Elasticsearch or Azure Search. That way you'll get nice disambiguation and ranking.
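
For example, here's a rough sketch using the official elasticsearch Python client (the index name, the candidate_books iterable, and the local cluster URL are all assumptions for illustration):

from elasticsearch import Elasticsearch
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

es = Elasticsearch(['http://localhost:9200'])

# candidate_books: (etext, title) pairs, e.g. from the title-search snippet above
for etext, title in candidate_books:
    doc = {'title': title, 'text': strip_headers(load_etext(etext)).strip()}
    es.index(index='gutenberg', id=etext, body=doc)

# let the search engine rank the results for a full-text query
hits = es.search(index='gutenberg', body={'query': {'match': {'text': 'mathematics'}}})
print(hits['hits']['hits'][0]['_source']['title'])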

If that approach is too heavy, you can also adjust the query condition in the text-search snippet that I sent earlier (the line if search_term.lower() in value.toPython().lower()) to add some more checks, for example a regex match that excludes titles where 'math' does not start at a word boundary.
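
For instance, a sketch of that adjustment using Python's re module (reusing graph and extractor from the earlier snippet; the exact pattern is an assumption to adapt):

import re

search_term = 'math'
# \b requires the term to start at a word boundary, so 'Mathematics'
# still matches but 'aftermath' does not.
pattern = re.compile(r'\b' + re.escape(search_term), re.IGNORECASE)

results = ((extractor._uri_to_etext(etext), value.toPython())
           for (etext, value) in graph[:extractor.predicate():]
           if pattern.search(value.toPython()))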

Downloading book 29788 fails since it doesn't offer a textual download. I've updated the error message to make this clearer. You can check the available formats for a book like this:

from gutenberg.query import get_metadata

print(get_metadata('formaturi', 29788))
# frozenset({
#    'http://www.gutenberg.org/files/29788/29788-t/29788-t.tex',
#    'http://www.gutenberg.org/files/29788/29788-pdf.pdf',
#    'http://www.gutenberg.org/files/29788/29788-pdf.zip',
#    'http://www.gutenberg.org/ebooks/29788.rdf',
#    'http://www.gutenberg.org/files/29788/29788-t.zip'
# })

In order to download one of these non-textual formats, you can use this snippet:

from gutenberg.acquire.text import _etextno_to_uri_subdirectory
from gutenberg.acquire.text import _GUTENBERG_MIRROR

text = 29788
extension = '-pdf.pdf'

url = '{mirror}/{path}/{text}{extension}'.format(
  mirror=_GUTENBERG_MIRROR,
  path=_etextno_to_uri_subdirectory(text),
  text=text,
  extension=extension)
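
If you then want to fetch the file from Python (using the url built above) rather than an external tool, here's a minimal sketch with the requests library (not part of gutenberg; the output filename is arbitrary):

import requests

response = requests.get(url)
response.raise_for_status()
with open('29788-pdf.pdf', 'wb') as f:
    f.write(response.content)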

I'll also open a pull request to make this functionality available as a single function. Update: the snippet above is now available via the _format_download_uri_for_extension function in the gutenberg.acquire.text module on master.

c-w added a commit that referenced this issue Jun 26, 2018
The `load_etext` function currently throws an exception when there is no
textual download candidate available for a given book. However, some
users might want to use Gutenberg to download non-textual versions of
books. All available formats of a book can already be looked up via the
formaturi metadata extractor, so this change exposes a method to enable
a client to format the download URL for an arbitrary extension.

See #105
@iamyihwa
Author

Thanks @c-w for the quick feedback and for showing me ways forward!
Yes, I could get the URL using the snippet.
I will download it using some external tool after that.
Thanks!

I have also tested the new function; however, _format_download_uri_for_extension didn't work even after the update.

@c-w
Owner

c-w commented Jun 27, 2018

As I mentioned, the new method was just published to master, but we haven't made a new PyPI release yet. This means that you'll have to install the package from GitHub, e.g. via pip install https://github.com/c-w/gutenberg/archive/d3a98dce92daf2c0cac68e142962aef8cd37b9f0.zip.

@c-w
Owner

c-w commented Jul 18, 2018

@iamyihwa Closing this issue since all of your questions seem to have been addressed. Feel free to reopen if you have any additional questions.

@c-w c-w closed this as completed Jul 18, 2018
@iamyihwa
Author

@c-w Thanks for the support. Yes, all my doubts and problems have been resolved for now. Thanks a lot again for all the help! I will surely get back when I have more issues.
