Cannot process a txt file with spacy #1851

Closed
ghost opened this issue Jan 16, 2018 · 4 comments
Labels
docs Documentation and website

Comments

ghost commented Jan 16, 2018

Hi all,

I am trying to load an English text file but failing to do so. I am following the 101 tutorial on the spaCy website:

ff = io.open('moby_dick.txt', 'r', encoding='utf-8')
nlp(ff)
ff.close()

TypeError                                 Traceback (most recent call last)
<ipython-input-20-f474742159a7> in <module>()
      1 ff = io.open(files[7], 'r', encoding='utf-8')
----> 2 nlp(ff)
      3 ff.close()

~/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable)
    327             ('An', 'NN')
    328         """
--> 329         doc = self.make_doc(text)
    330         for name, proc in self.pipeline:
    331             if name in disable:

~/.local/lib/python3.5/site-packages/spacy/language.py in make_doc(self, text)
    355 
    356     def make_doc(self, text):
--> 357         return self.tokenizer(text)
    358 
    359     def update(self, docs, golds, drop=0., sgd=None, losses=None):

TypeError: Argument 'string' has incorrect type (expected str, got _io.TextIOWrapper)

Your Environment

  • spaCy version: 2.0.5
  • Platform: Linux-4.13.0-26-generic-x86_64-with-LinuxMint-18.3-sylvia
  • Models: en
  • Python version: 3.5.2

bwj-GitHub commented

I wasn't aware that nlp could take a file; you should read the file first: nlp(ff.read()).
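
A minimal sketch of that suggestion, assuming the file fits comfortably in memory. The helper names `read_text` and `parse_file` are hypothetical and only for illustration; `nlp` stands for any callable that accepts a `str`, such as a loaded spaCy pipeline.

```python
import io


def read_text(path, encoding="utf-8"):
    """Return the whole file as one str (hypothetical helper)."""
    # The context manager closes the file even if reading fails.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def parse_file(nlp, path):
    """Pass the file's contents, not the file object, to nlp."""
    return nlp(read_text(path))


# Intended use with spaCy (assumes spaCy and the 'en' model from the
# report above are installed):
#   import spacy
#   nlp = spacy.load('en')
#   doc = parse_file(nlp, 'moby_dick.txt')
```

The point is only that `nlp` receives the result of `.read()`, which is a `str`, rather than the `_io.TextIOWrapper` named in the traceback.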

ghost (Author) commented Jan 16, 2018

Oh. I see. I took the idea from the example in the serialization section.

https://spacy.io/usage/spacy-101#serialization

ines (Member) commented Jan 16, 2018

Sorry, this is a mistake in the example: it's indeed missing the .read(). Fixing! We might also update the example to use a made-up filename instead of Moby Dick (like customer_feedback_627.txt, which we use in the lightning tour example on the homepage). Loading and parsing a large document like this in one go is actually not the best strategy, especially in spaCy v2.0. For better performance, you'd usually want to split up the text and use nlp.pipe, which returns a generator.
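
One way the split-and-pipe approach could look, as a sketch rather than the official recipe. Splitting on blank lines is an assumption made here (the comment does not say how to split), and `iter_paragraphs` and `parse_in_chunks` are hypothetical helper names. `nlp` is expected to be a loaded spaCy pipeline, whose `pipe` method takes an iterable of strings.

```python
def iter_paragraphs(path, encoding="utf-8"):
    """Yield non-empty paragraphs (blocks separated by blank lines)."""
    with open(path, "r", encoding=encoding) as f:
        buf = []
        for line in f:
            if line.strip():
                buf.append(line.strip())
            elif buf:
                # A blank line ends the current paragraph.
                yield " ".join(buf)
                buf = []
        if buf:
            # Flush the last paragraph if the file has no trailing blank line.
            yield " ".join(buf)


def parse_in_chunks(nlp, path):
    """Stream paragraphs through nlp.pipe, which yields results lazily."""
    return nlp.pipe(iter_paragraphs(path))


# Intended use (assumes spaCy and the 'en' model are installed):
#   import spacy
#   nlp = spacy.load('en')
#   for doc in parse_in_chunks(nlp, 'moby_dick.txt'):
#       print(len(doc))
```

Because both the paragraph reader and `nlp.pipe` are generators, the whole book is never held in memory as a single string or a single parsed document.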

ines added the docs (Documentation and website) label Jan 16, 2018
ines closed this as completed in 67ba733 Jan 16, 2018