ERROR: Couldn't resolve host name #13

Open
gordchan opened this issue Mar 7, 2016 · 5 comments

Comments

@gordchan

gordchan commented Mar 7, 2016

I have just installed this package and tried to read a local HTML file:

require(htmltab)

file <- file.path("Test", "01.html")
test <- htmltab(file)

However the following error was returned:
'Error in curl::curl_fetch_memory(url, handle = handle) : Couldn't resolve host name'

I'm not sure why curl is involved. Did I miss something?

Thanks!

@crubba
Owner

crubba commented Mar 7, 2016

It certainly shouldn't use curl for that operation.
Does that error appear when reading other html files from the local hard drive? What's the value of file? Have you tried parsing the file first (XML::htmlParse(file)) and then passing it to htmltab?
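
A minimal sketch of those checks, assuming a small throwaway HTML file written to a temporary directory (the file name and table contents are hypothetical, not the reporter's data). It first confirms that the path points at an existing file, then parses it with XML::htmlParse() and hands the parsed document to htmltab(). The parsing step only runs if both packages are installed.

```r
# Hypothetical stand-in for the reporter's file: a tiny HTML table in a temp dir.
path <- file.path(tempdir(), "01.html")
writeLines(c(
  "<html><body><table>",
  "<tr><th>name</th><th>value</th></tr>",
  "<tr><td>a</td><td>1</td></tr>",
  "<tr><td>b</td><td>2</td></tr>",
  "</table></body></html>"
), path)

# What is the value of the path, and does it point at something on disk?
path_exists <- file.exists(path)
abs_path <- normalizePath(path, mustWork = FALSE)
cat("exists:", path_exists, "\n")
cat("absolute path:", abs_path, "\n")

# Parse first, then pass the parsed document (not the path) to htmltab().
tab <- NULL
if (path_exists &&
    requireNamespace("XML", quietly = TRUE) &&
    requireNamespace("htmltab", quietly = TRUE)) {
  tab <- tryCatch({
    parsed <- XML::htmlParse(path)
    htmltab::htmltab(doc = parsed, which = 1)
  }, error = function(e) {
    message("parsing failed: ", conditionMessage(e))
    NULL
  })
  print(tab)
}
```

If the first check reports that the file does not exist, the relative path is probably being resolved against a different working directory than expected; getwd() shows which one.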

@gordchan
Author

gordchan commented Mar 7, 2016

I found out that I could only read directly from a URL.

Even if I download the html from https://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom and read from the local copy I would get the same error message.

I should mention that I am running R on RStudio Server, but I have never experienced this error before, not even with the XML package.

NB: I have never tried htmlParse before; I'll see if it works.

@gordchan
Author

gordchan commented Mar 8, 2016

Thanks for the tip Christian!

I parsed the HTML with htmlParse() before passing it to htmltab().
Now I can read the file without the curl error. However, I keep getting a weird dataset after reading in the HTML table.

If I run this HTML file (html.txt) through the following code:

        library(XML)      # provides htmlParse()
        library(htmltab)
        html2 <- file.path("Test", "html.html")
        parse2 <- htmlParse(html2)
        test2.xls <- htmltab(parse2, which = 4, header = 1:2)

All of the data columns are read as the column names:
[image: column]

@crubba
Owner

crubba commented Mar 9, 2016

So, the problem with this table is that it is not very well constructed. A row tag (tr) that opens at the beginning only closes at the very end, and this makes the job very hard for htmltab. I will be looking into ways to detect and correct such malformed markup in the future. For the moment, you can suppress the construction of a header by setting header = 0; that should help a bit.

htmltab(parse2, which = 4, header = 0)

It still shreds the last column though. Maybe rvest's html_table function does a better job here.
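
The kind of row-tag problem described above can be looked for with a rough scan of the raw markup. The following is only a heuristic sketch in base R, not something htmltab does. The helper name tr_nesting and the sample markup are made up for illustration, since the exact structure of the attached file is not shown here. It records how many tr tags open and close, and how deeply they nest.

```r
# Hypothetical helper: walk the <tr> and </tr> tags in order and track nesting.
tr_nesting <- function(html_lines) {
  text <- paste(html_lines, collapse = "\n")
  # Match <tr>, <tr attr="...">, and </tr>, but not tags such as <track>.
  m <- gregexpr("</?tr(\\s[^>]*)?>", text, ignore.case = TRUE, perl = TRUE)
  tags <- regmatches(text, m)[[1]]
  step <- ifelse(startsWith(tags, "</"), -1L, 1L)  # +1 for open, -1 for close
  depth <- cumsum(step)
  list(
    n_open = sum(step == 1L),
    n_close = sum(step == -1L),
    max_depth = if (length(depth)) max(depth) else 0L
  )
}

# Made-up examples: one ordinary table, one with an outer row wrapping the rest.
well_formed <- c("<table>", "<tr><td>a</td></tr>", "<tr><td>b</td></tr>", "</table>")
suspect <- c("<table>", "<tr>", "<tr><td>a</td></tr>", "<tr><td>b</td></tr>",
             "</tr>", "</table>")

ok <- tr_nesting(well_formed)   # max_depth is 1
bad <- tr_nesting(suspect)      # max_depth is 2: a row opened inside an open row
str(ok)
str(bad)
```

To run it on a real file, something like tr_nesting(readLines(html2, warn = FALSE)) should work. A max_depth above 1 is a hint rather than proof: tables legitimately nested inside cells produce it too, and HTML allows the closing tr tag to be omitted, so the open and close counts can differ in valid markup.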

@gordchan
Author

gordchan commented Mar 9, 2016

I see the problem. Thanks.

After some trial-and-error I've got the best out of the html with this:

test2.xls <- htmltab(parse2, which = "//table[@id='datatable']", header = 1:2, complementary = FALSE)
names(test2.xls) <- c(1:5)

This is workable with some processing:
[image: column2]

Thanks again :)
