You must download and parse an article before parsing it #52

Closed
bitliner opened this issue May 30, 2014 · 4 comments

Comments

@bitliner

Here is the stack trace:

[Parse lxml ERR] line 1045: Tag nav invalid
[Article parse ERR] http://www.cnet.com/products/apple-ipad-march-2012/
You must download and parse an article before parsing it!
Traceback (most recent call last):
  File "crawler.py", line 30, in <module>
    a.nlp()
  File "/root/.virtualenvs/cnet-crawler/local/lib/python2.7/site-packages/newspaper/article.py", line 276, in nlp
    raise ArticleException()
newspaper.article.ArticleException

I'm not using the concurrent version, and I'm not building a newspaper from a URL. Instead, I have a list of all the articles and build a new Article from each one.

@codelucas
Owner

I'll test this on my computer and get back to ya.

I've seen this error before. From my personal experience it occurs when the HTML you are trying to parse is much too "deformed" for lxml. (The error is complaining about a <nav> tag).

http://stackoverflow.com/questions/4967103/beautifulsoup-and-lxml-html-what-to-prefer

BeautifulSoup is preferred for "non well-formed" HTML. (You have the option of using either lxml or BeautifulSoup to parse, but lxml is much faster.)
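
Until the page can be parsed, a crawler that builds each Article from a list can skip the pages that fail instead of crashing on them. Below is a minimal sketch, assuming only that the article object exposes download(), parse(), and nlp(), as the traceback and error message above suggest. The names `summarize_or_skip` and `error_types` are hypothetical and not part of newspaper.

```python
def summarize_or_skip(article, error_types=(Exception,)):
    """Run download -> parse -> nlp in order.

    Returns the article on success, or None if any step raises one of
    `error_types`. `article` is any object with download(), parse() and
    nlp() methods (for example a newspaper Article).
    """
    try:
        article.download()
        article.parse()
        # nlp() is the call that raised ArticleException in the traceback
        # above, after the parse step had already logged an error.
        article.nlp()
    except error_types as exc:
        print("skipping article: %r" % (exc,))
        return None
    return article
```

In the crawler from the traceback, this could be called as `summarize_or_skip(Article(url), (ArticleException,))`, with `ArticleException` imported from `newspaper.article` (the module named in the traceback). Articles that return None are then dropped from the results. This only works around the failure; it does not make lxml accept the malformed page.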

@Casyfill

Casyfill commented Sep 2, 2016

But how can I switch to the BeautifulSoup parser? I can't find any documentation on that.

@cesarandreslopez

Same here. I'm not sure how to change the parser to BeautifulSoup.

@go2dmny

go2dmny commented Feb 21, 2017

Any update? I'm also looking for a solution.
