Cannot Import HTML Table with more complex table data elements #265

Closed
3 tasks done
billdenney opened this issue Jan 27, 2020 · 1 comment

Comments

@billdenney
Contributor

Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

As you can probably tell from all the HTML-related activity, I'm trying to load a complex HTML document (with 589 tables in it) that is stretching rio's abilities.

The newest issue comes down to how you'd like to handle elements with more complex HTML structure inside a <td> element. In the example here, some of the header cells and some of the data cells each contain two paragraphs (<p>).

How would you like to handle that? A simple option could be to paste them together with a user-defined separator. A more complex option could be to allow the user to provide the function. I'm sure that there are more options than that.
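To make the two options concrete, here is a minimal sketch using xml2. The helper name collapse_cell, its sep and fun arguments, and the inline sample table are all hypothetical and only for illustration; none of this is existing rio code or a proposed API.

```r
library(xml2)

# Hypothetical helper (not part of rio): turn one <th>/<td> into a single string.
# By default the <p> children are pasted together with `sep`; a user-supplied
# `fun` can replace that step and must return a length-one character vector.
collapse_cell <- function(cell, sep = "\n", fun = NULL) {
  parts <- xml_find_all(cell, "./p")
  text <- if (length(parts) > 0) {
    xml_text(parts, trim = TRUE)
  } else {
    # No <p> children: fall back to the text of the cell itself
    xml_text(cell, trim = TRUE)
  }
  if (is.null(fun)) paste(text, collapse = sep) else fun(text)
}

# Made-up table with two paragraphs in one header cell and one data cell
html <- read_html(
  "<table>
     <tr><th><p>Dose</p><p>(mg)</p></th><th>Subject</th></tr>
     <tr><td><p>10</p><p>fasted</p></td><td>A</td></tr>
   </table>"
)

# One character vector per row, one element per cell
rows <- xml_find_all(html, ".//tr")
cells <- lapply(rows, function(r) {
  vapply(xml_find_all(r, "./th|./td"), collapse_cell, character(1), sep = " ")
})

# Treat the first row as the header and the rest as data
df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(df) <- cells[[1]]
```

With either option each <td> stays in one cell, so the number of columns no longer depends on how many paragraphs a cell happens to contain. Passing something like fun = function(x) x[1] would keep only the first paragraph instead of pasting.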

Changing this would be slightly backward incompatible: if unlist() previously returned the same number of values for every row, those values would have been spread across columns. Putting them into a single cell instead changes that behavior, but it is more consistent with the underlying table, so the new behavior seems preferable to me.

Here is an example file: multi-elements-under-td.zip

## load package
library("rio")

import("multi-elements-under-td.zip")

## session info for your system
sessionInfo()

The sessionInfo() output is still the same as in the last few issues.

@leeper added the bug label Mar 1, 2020
@chainsawriot
Collaborator

@billdenney Thank you very much for reporting this. I would argue that the HTML functionality is meant more for exporting than importing. There is no standard HTML table, so, as you reported, it is easy to break the HTML import function.

As I explained in #307, breaking changes should be avoided. Also, because there is no standard, any solution to this is prone to break. I think a fair approach is to explain in the documentation that the HTML import functionality is not robust. For complex HTML tables, one should write one's own solution with xml2 or rvest.
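As a rough illustration of the "write one's own solution" route, the sketch below reads every table in a document with rvest. The inline HTML is a made-up stand-in for the real file. How the text of several <p> elements inside one cell gets joined depends on the rvest version, so the result should be inspected rather than assumed.

```r
library(rvest)

# Made-up stand-in for a real document; read_html() also accepts a file path
page <- read_html(
  "<table>
     <tr><th><p>Dose</p><p>(mg)</p></th><th>Subject</th></tr>
     <tr><td><p>10</p><p>fasted</p></td><td>A</td></tr>
   </table>"
)

# html_table() on a whole document returns a list with one entry per <table>
tables <- html_table(page)

# Inspect the first table; each <td> ends up in a single cell
first <- tables[[1]]
print(first)
```

For a document with hundreds of tables, the same call returns them all in one list, which can then be filtered or cleaned table by table.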
