htmldf

Overview

The package htmldf contains a single function, html_df(), which accepts a vector of URLs and, for each one, attempts to download the page and then extract and parse the HTML. The result is returned as a tibble in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:

  • page title
  • inferred language
  • RSS feeds
  • tables coerced to tibbles, where possible
  • hyperlinks
  • image links
  • Twitter, GitHub and LinkedIn profiles
  • the inferred programming language of any text within code tags
  • page size, generator and server
  • page accessed date
  • page published or last updated dates

Installation

To install the CRAN version of the package:

install.packages('htmldf')

To install the development version of the package:

remotes::install_github('alastairrushworth/htmldf')

Usage

First define a vector of URLs you want to gather information from. The function html_df() returns a tibble where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:

library(htmldf)
library(dplyr)

# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn", 
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/")

# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)
z
## # A tibble: 3 x 16
##   url   title lang  url2  links rss   tables images social code_lang   size
##   <chr> <chr> <chr> <chr> <lis> <chr> <list> <list> <list>     <dbl>  <int>
## 1 http… Visu… en    http… <tib… http… <lgl … <tibb… <tibb…     1      38445
## 2 http… Conv… en    http… <tib… <NA>  <name… <tibb… <tibb…    -0.936 113305
## 3 http… Robe… en    http… <tib… <NA>  <name… <tibb… <tibb…     1     291099
## # … with 5 more variables: server <chr>, accessed <dttm>, published <dttm>,
## #   generator <chr>, source <chr>

To see the page titles, look at the title column.

z %>% select(title, url2)
## # A tibble: 3 x 2
##   title                              url2                                       
##   <chr>                              <chr>                                      
## 1 Visualising Tour De France Data I… https://alastairrushworth.github.io/Visual…
## 2 Convolutional Neural Network (CNN… https://www.tensorflow.org/tutorials/image…
## 3 Robert Myles McDonnell             https://www.robertmylesmcdonnell.com/conte…

Where a page contains tables in <table> tags, these are gathered into the list column tables. html_df() attempts to coerce each table to a tibble; where that isn’t possible, the raw HTML is returned instead. Pages without any tables get NA.

z$tables
## [[1]]
## [1] NA
## 
## [[2]]
## [[2]]$uncoercible
## [1] "<table class=\"tfo-notebook-buttons\" align=\"left\">\n<td>\n    <a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/images/cnn\">\n    <img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\">\n    View on TensorFlow.org</a>\n  </td>\n  <td>\n    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb\">\n    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">\n    Run in Google Colab</a>\n  </td>\n  <td>\n    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb\">\n    <img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">\n    View source on GitHub</a>\n  </td>\n  <td>\n    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/images/cnn.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\">Download notebook</a>\n  </td>\n</table>\n"
## 
## 
## [[3]]
## [[3]]$`no-caption`
## # A tibble: 32 x 2
##    model             car  
##    <chr>             <lgl>
##  1 Mazda RX4         NA   
##  2 Mazda RX4 Wag     NA   
##  3 Datsun 710        NA   
##  4 Hornet 4 Drive    NA   
##  5 Hornet Sportabout NA   
##  6 Valiant           NA   
##  7 Duster 360        NA   
##  8 Merc 240D         NA   
##  9 Merc 230          NA   
## 10 Merc 280          NA   
## # … with 22 more rows
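
Because the tables column mixes NA, raw HTML strings and coerced tables, it can help to keep only the elements that were coerced successfully. The following is a minimal sketch in base R, assuming the structure printed above; the mock data and the helper keep_coerced() are hypothetical and not part of htmldf.

```r
# Mock of the structure printed above for z$tables (stand-in data, not real output):
# NA for a page without tables, otherwise a named list of tables per page.
tables <- list(
  NA,
  list(uncoercible = "<table><td>View on TensorFlow.org</td></table>"),
  list(`no-caption` = data.frame(model = c("Mazda RX4", "Datsun 710"), car = NA))
)

# Hypothetical helper: keep only the elements that are data frames.
# Tibbles are data frames too, so is.data.frame() also matches them.
keep_coerced <- function(page_tables) {
  if (!is.list(page_tables)) return(list())  # page had no tables (NA)
  Filter(is.data.frame, page_tables)
}

coerced <- lapply(tables, keep_coerced)

# Number of usable tables found on each page
lengths(coerced)
```

With real output, the same helper could be applied as lapply(z$tables, keep_coerced), provided the column has the shape shown above.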

html_df() does its best to find RSS feeds embedded in the page; where none is found, the rss column contains NA:

z$rss
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA                                            
## [3] NA

Social profiles embedded on the page are gathered into the list column social. At present, Twitter, LinkedIn and GitHub profiles are extracted.

z$social
## [[1]]
## # A tibble: 3 x 3
##   site     handle                    profile                                    
##   <chr>    <chr>                     <chr>                                      
## 1 twitter  @rushworth_a              https://twitter.com/rushworth_a            
## 2 linkedin @alastair-rushworth-2531… https://linkedin.com/in/alastair-rushworth…
## 3 github   @alastairrushworth        https://github.com/alastairrushworth       
## 
## [[2]]
## # A tibble: 1 x 3
##   site    handle      profile                       
##   <chr>   <chr>       <chr>                         
## 1 twitter @tensorflow https://twitter.com/tensorflow
## 
## [[3]]
## # A tibble: 4 x 3
##   site     handle                   profile                                     
##   <chr>    <chr>                    <chr>                                       
## 1 twitter  @robertmylesmc           https://twitter.com/robertmylesmc           
## 2 linkedin @robert-mcdonnell-7475b… https://linkedin.com/in/robert-mcdonnell-74…
## 3 github   @coolbutuseless          https://github.com/coolbutuseless           
## 4 github   @robertmyles             https://github.com/robertmyles
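
Since social is a list column with one table per page, a single long table is often easier to filter. The sketch below stacks the per-page tables in base R and tags each row with its page URL. The URLs, handles and variable names are hypothetical placeholders, and the column names site, handle and profile are assumed to match the output above.

```r
# Stand-ins for z$url2 and z$social, using hypothetical placeholder values
urls <- c("https://example.org/a", "https://example.org/b")
social <- list(
  data.frame(site    = c("twitter", "github"),
             handle  = c("@user_a", "@user_a"),
             profile = c("https://twitter.com/user_a", "https://github.com/user_a"),
             stringsAsFactors = FALSE),
  data.frame(site    = "twitter",
             handle  = "@user_b",
             profile = "https://twitter.com/user_b",
             stringsAsFactors = FALSE)
)

# Tag each per-page table with its page URL; rep() keeps this safe
# for a table with zero rows
tagged <- Map(
  function(tbl, u) data.frame(url2 = rep(u, nrow(tbl)), tbl, stringsAsFactors = FALSE),
  social, urls
)

# Stack the tagged tables into one long table
all_profiles <- do.call(rbind, tagged)
rownames(all_profiles) <- NULL

# Example query: the Twitter handle found on each page
all_profiles[all_profiles$site == "twitter", c("url2", "handle")]
```

This assumes every element of social is a table with the same columns; pages where no profiles were found may need to be dropped first.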

Code language is inferred from <code> chunks using simple machine learning. The code_lang column contains a score, where values near 1 indicate mostly R code and values near -1 indicate mostly Python code:

z %>% select(code_lang, url2)
## # A tibble: 3 x 2
##   code_lang url2                                                                
##       <dbl> <chr>                                                               
## 1     1     https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2    -0.936 https://www.tensorflow.org/tutorials/images/cnn                     
## 3     1     https://www.robertmylesmcdonnell.com/content/posts/mtcars/
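
If a label is more convenient than a score, the scores can be thresholded. This is only a sketch: the cutoff at zero and the helper label_code_lang() are assumptions made for illustration, not something htmldf provides, and scores close to zero are probably best treated as ambiguous.

```r
# Scores copied from the code_lang column above, plus an NA standing in for
# a hypothetical page where no score is available
code_lang <- c(1, -0.936, 1, NA)

# Hypothetical rule: scores above the cutoff are labelled "R", the rest "Python".
# Missing scores stay missing.
label_code_lang <- function(score, cutoff = 0) {
  ifelse(is.na(score), NA_character_,
         ifelse(score > cutoff, "R", "Python"))
}

label_code_lang(code_lang)
# "R" "Python" "R" NA
```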

Comments? Suggestions? Issues?

Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.
