Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly set BeautifulSoup parser to html.parser to avoid warning #4551

Merged
merged 2 commits into from Apr 19, 2016

Conversation

@anchitjain1234
Copy link
Contributor

@anchitjain1234 anchitjain1234 commented Feb 1, 2016

For #4550.

@anchitjain1234
Copy link
Contributor Author

@anchitjain1234 anchitjain1234 commented Feb 8, 2016

@astrofrog please check the changes.

@mwcraig
Copy link
Contributor

@mwcraig mwcraig commented Feb 8, 2016

@anchitjain1234 -- thanks for this contribution. Please be aware that it often takes several days (sometimes even a few weeks) for the maintainers to review pull requests.

@mwcraig
Copy link
Contributor

@mwcraig mwcraig commented Feb 8, 2016

@anchitjain1234 -- I realized after my first comment that I should have added a caveat....certainly if you haven't received any feedback by the beginning of next week, give the pull request a bump. We do sometimes inadvertently take longer than we'd like to get these reviewed and merged (and this is a really short one).

@mwcraig mwcraig added the io.ascii label Feb 25, 2016
@hamogu
Copy link
Member

@hamogu hamogu commented Apr 1, 2016

Looks good to me.
@taldcroft
(Remember to close #4550 when this is merged.)

@eteq
Copy link
Member

@eteq eteq commented Apr 5, 2016

I'm don't think this fix is what we want. The underlying issue in #4550 is that bs4 guesses at the right kind of parser. The trouble is, I think we want it to guess. That is, On my system, for example, it's defaulting to lxml. I want it to do that because lxml is a fair bit faster than html.parser. So I don't think we want to say it must always use 'html.parser'.

I wonder if @taldcroft or @mdmueller can offer some insight into the best way to set this? My tilt would be towards a configuration option (i.e., http://docs.astropy.org/en/stable/config/), as it's sort of site-dependent. The default would be None, which is the current behavior of having beautifulsoup decide. But the configuration option would let you explicitly specify a parser if you definitely care.

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Apr 5, 2016

Agree with @eteq that this is not the right solution. This change in BS4 is kind of annoying and not really consistent with the docs:

"If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it’ll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used. ...

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser. You can override this by specifying one of the following"

But anyway, I think that the config option is perhaps overkill and not ideal because it is global for your system, while in theory library availability is local (e.g. libraries can be available in one conda env but not another), etc etc.

I would suggest just blocking out this silly new warning from BS4. (I take this as a lesson for astropy, that many "informative/helpful warnings" are actually just annoying in practice).

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Apr 5, 2016

In other words, use an appropriate version of below to catch this particular warning:

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    soup = BeautifulSoup('\n'.join(lines))

BTW there is already an option in the HTML reader to explicitly specify the parser, so @astrofrog could avoid the warning by doing that.

@anchitjain1234 anchitjain1234 force-pushed the anchitjain1234:fix-4550 branch from 1838bae to 53ad4cf Apr 8, 2016
@anchitjain1234
Copy link
Contributor Author

@anchitjain1234 anchitjain1234 commented Apr 12, 2016

@taldcroft Could you please check the changes.

soup = BeautifulSoup('\n'.join(lines))
try:
import warnings
except ImportError:

This comment has been minimized.

@pllim

pllim Apr 12, 2016
Member

Is this really necessary? warnings come from standard Python library, so why would it not be there?

This comment has been minimized.

@taldcroft

taldcroft Apr 12, 2016
Member

Agreed, no need to catch an import error for warnings. This import should be at the top of the package (no performance hit since in all likelihood some other module will have imported it already).

soup = BeautifulSoup('\n'.join(lines))
else:
with warnings.catch_warnings():
warnings.simplefilter("ignore")

This comment has been minimized.

@taldcroft

taldcroft Apr 12, 2016
Member

Catch only the warning in question, not every warning. You might need to use filterwarnings instead of simplefilter, but you can figure that out.

@anchitjain1234 anchitjain1234 force-pushed the anchitjain1234:fix-4550 branch 2 times, most recently from fbc6813 to 8bc1590 Apr 14, 2016
@anchitjain1234
Copy link
Contributor Author

@anchitjain1234 anchitjain1234 commented Apr 14, 2016

@taldcroft @pllim Filtered warnings from UserWarning category, as warning log posted by @astrofrog in #4550 mentions it.

@pllim
Copy link
Member

@pllim pllim commented Apr 14, 2016

@astrofrog , does this work for you?

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Apr 15, 2016

@anchitjain1234 - can you filter on just the particular message in question, e.g. filter out only messages containing No parser was explicitly specified?

@anchitjain1234 anchitjain1234 force-pushed the anchitjain1234:fix-4550 branch 2 times, most recently from fa4263a to 11b27cd Apr 15, 2016
@anchitjain1234
Copy link
Contributor Author

@anchitjain1234 anchitjain1234 commented Apr 15, 2016

@taldcroft Please check the changes. Also, the travis failure is unrelated.

@taldcroft taldcroft added this to the v1.2.0 milestone Apr 18, 2016
@taldcroft taldcroft self-assigned this Apr 18, 2016
@taldcroft
Copy link
Member

@taldcroft taldcroft commented Apr 18, 2016

Almost done @anchitjain1234! Just need a change log entry. Put the following in CHANGES.rst, in the section 1.2 => New Features => astropy.io.ascii

  - Updated to filter out the default parser warning of BeautifulSoup. [#4551]

You can do this with a commit message that includes [skip ci] to avoid running CI again.

@anchitjain1234 anchitjain1234 force-pushed the anchitjain1234:fix-4550 branch from eea3c9e to bc616d7 Apr 19, 2016
@anchitjain1234
Copy link
Contributor Author

@anchitjain1234 anchitjain1234 commented Apr 19, 2016

@taldcroft Added changelog entry.

@taldcroft taldcroft merged commit 6fbb534 into astropy:master Apr 19, 2016
@anchitjain1234 anchitjain1234 deleted the anchitjain1234:fix-4550 branch Apr 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.