Doc.save() writes to "metadatas.json" and "spacy_docs.bin" #52

Closed
ikarth opened this issue Nov 21, 2016 · 1 comment

ikarth commented Nov 21, 2016

Expected Behavior

I was expecting it to write to "metadata.json" and "spacy_doc.bin".

Current Behavior

Instead it seems to be writing to files that have an extra "s" on the end: "metadatas.json" and "spacy_docs.bin".

Possible Solution

I can't figure out why; looking at the code certainly suggests that the string should be "metadata.json".

Steps to Reproduce (for bugs)

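import textacy
# Assumption: get_metadata comes from the Gutenberg package; the report does not show its import.
from gutenberg.query import get_metadata

def getGutenbergMetadata(textno):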
    meta = {
        'title': list(get_metadata('title', textno))[0],
        'author': list(get_metadata('author', textno)),
        'rights': list(get_metadata('rights', textno)),
        'subject': list(get_metadata('subject', textno)),
        'language': list(get_metadata('language', textno))[0],
        'guten_no': textno}
    return meta

def getGutenberg(filenumber):
    path = "./data/corpora/gutenberg/strip/{}.txt".format(filenumber)
    # Use a context manager so the file handle is closed after reading.
    with open(path, mode="r", encoding="utf_8") as f:
        return f.read()

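# acttext (the Gutenberg text numbers to process) is defined elsewhere.
count = 0
actdocs = []
actmeta = []
for i in acttext:
    count += 1
    #if count % 10 == 0:
    print(".", end="")
    am = getGutenbergMetadata(i)
    ad = textacy.Doc(getGutenberg(i), None, "en")
    actdocs.append(ad)
    actmeta.append(am)
    print("m", end="")
current_corpus = textacy.corpus.Corpus('en', docs=actdocs, metadatas=actmeta)
# Save as a Corpus ...
current_corpus.save("./data")
# ... then try to load the same directory as a Doc.
current_corpus = textacy.Doc.load("./data")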

Renaming the files to "metadata.json" and "spacy_doc.bin" lets Doc.load() find them, of course, but then I get this (possibly separate) error:

Traceback (most recent call last):

  File "<ipython-input-5-5605bd27e87b>", line 1, in <module>
    loadCorpus()

  File "./excalibur/action_catalog.py", line 125, in loadCorpus
    current_corpus = textacy.Doc.load("./data")

  File "C:\tools\Anaconda3\envs\genmoenv\lib\site-packages\textacy\doc.py", line 219, in load
    metadata = list(fileio.read_json(meta_fname))[0]

  File "C:\tools\Anaconda3\envs\genmoenv\lib\site-packages\textacy\fileio\read.py", line 69, in read_json
    yield json.load(f)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)

  File "C:\tools\Anaconda3\envs\genmoenv\lib\json\decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)

JSONDecodeError: Extra data
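
A note on the "Extra data" message: json.load() raises it when the input holds more than one JSON document. A plausible reading of the traceback is that the renamed file contains one metadata record per document, while Doc.load() expects a single record. The sketch below reproduces only the JSON behaviour with the standard library; the file contents and helper names are hypothetical, and the actual on-disk layout written by textacy is an assumption.

```python
import json

# Hypothetical file contents: two metadata records, one JSON object per line.
# This layout is an assumption; the report does not show the real file.
multi_record = '{"title": "A"}\n{"title": "B"}\n'


def parse_single(text):
    """Parse text as exactly one JSON document, the way json.load() does."""
    return json.loads(text)


def parse_per_line(text):
    """Parse text as one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Parsing the whole text as one document fails after the first object.
try:
    parse_single(multi_record)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg

# Parsing line by line recovers both records.
records = parse_per_line(multi_record)
```

If this reading is right, the second error is a consequence of the first: the renamed file has the right name for Doc.load() but still has the shape of a multi-document file.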

Context

Your Environment

  • textacy and spacy versions: textacy: 0.3.2, spacy: 1.2.0
  • Python version: 3.5
  • Operating system and version: Windows 7 x64

bdewilde (Collaborator) commented

Hi @ikarth, in your example code you save a textacy.Corpus to disk via current_corpus.save("./data"), then try to load it as a textacy.Doc via textacy.Doc.load("./data"). As you've noticed, that doesn't work! The immediate cause of the failure is that Corpus and Doc instances are saved to disk under different filenames: the Corpus files are pluralized ("metadatas.json", "spacy_docs.bin") and the Doc files are not ("metadata.json", "spacy_doc.bin"). But even with matching filenames, you shouldn't save a Corpus and expect to be able to load it back as a Doc.
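
To make the filename difference concrete, here is a minimal sketch of a check that reports which kind of object a directory appears to hold. It uses only the four filenames mentioned in this thread; the helper name detect_saved_kind is hypothetical and is not part of textacy.

```python
import os

# Filenames taken from this thread: pluralized for a Corpus, singular for a Doc.
CORPUS_FILES = ("metadatas.json", "spacy_docs.bin")
DOC_FILES = ("metadata.json", "spacy_doc.bin")


def detect_saved_kind(path):
    """Return "corpus", "doc", or None based on which files exist in path.

    Hypothetical helper for illustration; it only checks filenames and does
    not validate the file contents.
    """
    def has_all(names):
        return all(os.path.isfile(os.path.join(path, name)) for name in names)

    if has_all(CORPUS_FILES):
        return "corpus"
    if has_all(DOC_FILES):
        return "doc"
    return None
```

For a directory written by current_corpus.save("./data"), a check like this would report "corpus", which points to loading it back through the Corpus class (presumably textacy.Corpus.load, if your version provides it) rather than through Doc.load().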
