Minor improvements to docs / error messages #13

Merged: 1 commit, Dec 18, 2018
18 changes: 10 additions & 8 deletions R/bow.R
@@ -1,11 +1,11 @@
#' Introduce yourself to the host
#'
-#' @param url url
-#' @param user_agent character value passed to user_agent string
-#' @param delay desired delay between scraping attempts. Final value will be the maximum of desired and mandated delay, as stipulated by robots.txt for relevant user agent
-#' @param force refresh all memoised functions. Clears up all robotstxt and scrape cache. Default is FALSE.
+#' @param url URL
+#' @param user_agent character value passed to user agent string
+#' @param delay desired delay between scraping attempts. Final value will be the maximum of desired and mandated delay, as stipulated by `robots.txt` for relevant user agent
+#' @param force refresh all memoised functions. Clears up `robotstxt` and `scrape` caches. Default is `FALSE`
#' @param verbose TRUE/FALSE
-#' @param ... other curl parameters wrapped into httr::config function
+#' @param ... other curl parameters wrapped into `httr::config` function
#'
#' @return object of class `polite`, `session`
#'
@@ -100,15 +100,17 @@ bow <- function(url,

  if(self$delay<5)
    if(grepl("polite|dmi3kno", self$user_agent)){
-     stop(red("You can not scrape this fast. Please, reconsider delay period."), call. = FALSE)
+     stop(red("You cannot scrape this fast. Please reconsider delay period."), call. = FALSE)
    } else{
      warning("This is a little too fast. Are you sure you want to risk being banned?", call. = FALSE)
    }

self
}

-#' @param x object of class `polite session`
+#' Print host introduction object
+#'
+#' @param x object of class `polite`, `session`
#' @param ... other parameters passed to methods
#' @importFrom crayon yellow bold blue green red
#' @export
@@ -124,7 +126,7 @@ print.polite <- function(x, ...) {
}
}

-#' @param x object of class `polite session`
+#' @param x object of class `polite`, `session`
#' @rdname bow
#' @export
is.polite <- function(x) inherits(x, "polite")
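
For orientation, here is a minimal usage sketch of the `bow()` interface documented in this file. The target site is the one used in the package README; the user agent value is illustrative, not from the diff:

```r
library(polite)

# Introduce ourselves to the host; robots.txt is fetched and cached, and
# the final delay is the maximum of the desired and mandated delays
session <- bow("https://www.cheese.com",
               user_agent = "polite-example-bot",
               delay = 5,
               force = FALSE,
               verbose = FALSE)

is.polite(session)  # TRUE: the object inherits from class `polite`
print(session)      # dispatches to print.polite() shown above
```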
14 changes: 7 additions & 7 deletions R/html.R
@@ -1,12 +1,12 @@
-#' Convert collection of html nodes into dataframe
+#' Convert collection of html nodes into data frame
#'
-#' @param x xml_nodeset object, containing text and attributes of interest
-#' @param attrs character vector of attribute names. If missing, all attributes will be used.
-#' @param trim if TRUE will trim leading and trailing spaces.
-#' @param defaults character vector of default values to be passed to rvest::html_attr(). Recycled to match length of attrs
-#' @param add_text if TRUE node content will be added as .text column (using rvest::html_text)
+#' @param x `xml_nodeset` object, containing text and attributes of interest
+#' @param attrs character vector of attribute names. If missing, all attributes will be used
+#' @param trim if `TRUE`, will trim leading and trailing spaces
+#' @param defaults character vector of default values to be passed to `rvest::html_attr()`. Recycled to match length of `attrs`
+#' @param add_text if `TRUE`, node content will be added as `.text` column (using `rvest::html_text`)
#'
-#' @return data frame one row per xml node, consisting of html_text column with text and additional columns with attributes
+#' @return data frame with one row per xml node, consisting of an html_text column with text and additional columns with attributes
#' @export
#'
#' @examples
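
A sketch of how `html_attrs_dfr()` might be used with the parameters documented above; the CSS selector and attribute names are assumptions for illustration:

```r
library(polite)
library(rvest)

# Politely fetch a page, then flatten a collection of nodes into a data
# frame: one row per node, one column per attribute, plus a .text column
session <- bow("https://www.cheese.com")
links <- html_nodes(scrape(session), "a")

df <- html_attrs_dfr(links,
                     attrs = c("href", "class"),  # assumed attributes of interest
                     trim = TRUE,
                     defaults = NA_character_,    # recycled to match attrs
                     add_text = TRUE)
head(df)
```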
10 changes: 5 additions & 5 deletions R/nod.R
@@ -1,9 +1,9 @@
#' Agree modification of session path with the host
#'
#' @param bow object of class `polite`, `session` created by `polite::bow()`
-#' @param path string value of path/url to follow. The function accepts both path (string part of url followin domain name) or a full url.
-#' @param verbose TRUE/FALSE
-#' @return object of class `polite`, `session` with modified url
+#' @param path string value of path/URL to follow. The function accepts either a path (string part of URL following domain name) or a full URL
+#' @param verbose `TRUE`/`FALSE`
+#' @return object of class `polite`, `session` with modified URL
#'
#' @examples
#' \dontrun{
@@ -19,9 +19,9 @@
nod <- function(bow, path, verbose=FALSE){

  if(!inherits(bow, "polite"))
-   stop("Please, bow before you nod")
+   stop("Please bow before you nod")

- # if user supplied url instead of path
+ # if user supplied URL instead of path
  if(grepl("://|www\\.", path)){
    if(urltools::domain(path)!=bow$domain)
      bow <- bow(url = path, user_agent = bow$user_agent, delay=bow$delay, bow$config)
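
A short sketch of `nod()` under the semantics documented above; the path is illustrative:

```r
library(polite)

# Re-point an existing session to a new path on the same host without
# bowing again; nod() re-checks the new location against robots.txt.
# A full URL on a different domain would instead trigger a fresh bow()
# with the same settings.
session <- bow("https://www.cheese.com")
session <- nod(session, path = "alphabetical/", verbose = FALSE)
```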
14 changes: 7 additions & 7 deletions R/rip.R
@@ -1,15 +1,15 @@
#' Polite file download
#'
-#' @param bow host introduction object of class polite, session created by bow() or nod
+#' @param bow host introduction object of class `polite`, `session` created by `bow()` or `nod()`
#' @param new_filename optional new file name to use when saving the file
#' @param suffix optional characters added to file name
-#' @param sep separator between file name and suffix. Default "__"
-#' @param path path where file should be saved. Defults to folder named "downloads" created in the working directory
-#' @param overwrite if TRUE will overwrite file on disk
-#' @param mode character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab". Not used for methods "wget" and "curl".
-#' @param ... other parameters passed to download.file
+#' @param sep separator between file name and suffix. Default `__`
+#' @param path path where file should be saved. Defaults to folder named `downloads` created in the working directory
+#' @param overwrite if `TRUE` will overwrite file on disk
+#' @param mode character. The mode with which to write the file. Useful values are `w`, `wb` (binary), `a` (append) and `ab`. Not used for methods `wget` and `curl`.
+#' @param ... other parameters passed to `download.file`
#'
-#' @return Full path to file indicated by url saved on disk
+#' @return Full path to file indicated by URL saved on disk
#' @export
#'
#' @examples
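
A hedged sketch of `rip()` with the parameters documented above; the file URL and file names are hypothetical:

```r
library(polite)

# Politely download a file: bow() to the file's URL, then rip() it.
session <- bow("https://www.cheese.com/media/example.jpg")  # hypothetical file URL
local_file <- rip(session,
                  new_filename = "cheese.jpg",
                  path = "downloads",  # the default folder per the docs above
                  overwrite = FALSE,
                  mode = "wb")
local_file  # full path to the file saved on disk
```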
20 changes: 10 additions & 10 deletions R/scrape.R
@@ -2,7 +2,7 @@
m_scrape <- function(bow, params=NULL, accept="html", content=NULL, verbose=FALSE) { # nolint

  if(!inherits(bow, "polite"))
-   stop("Please, be polite: bow then scrape!")
+   stop("Please be polite: bow then scrape!")


if(!is.null(params))
@@ -51,12 +51,12 @@ m_scrape <- function(bow, params=NULL, accept="html", content=NULL, verbose=FALS
httr::content(response, type = content)
},
error=function(cond){
-     cat(yellow$bold("<polite session> Encountered an error, while parsing content.\n"), sep="")
-     cat(blue(" ","There seems to be mismatch of content type or encoding or both.\n"), sep="")
+     cat(yellow$bold("<polite session> Encountered an error while parsing content.\n"), sep="")
+     cat(blue(" ","There seems to be a mismatch of content type or encoding or both.\n"), sep="")
      cat(blue(" ","The server says it is serving: '"), response$headers$`content-type`, blue("' \n"), sep="")
-     cat(blue(" ","Here's the text of the error generated by the httr::content(): \n"), sep="")
+     cat(blue(" ","Here's the text of the error generated by httr::content(): \n"), sep="")
      message(cond); cat("\n")
-     cat(green(" ","But, please, do not despair! I will return a raw vector to you now,\n"), sep="")
+     cat(green(" ","But please do not despair! I will return a raw vector to you now,\n"), sep="")
cat(green(" ","which you can parse with rawToChar(). Good luck!\n"), sep="")
return(httr::content(response, as = "raw"))
}
@@ -68,13 +68,13 @@ m_scrape <- function(bow, params=NULL, accept="html", content=NULL, verbose=FALS
#' Scrape the content of authorized page/API
#'
#' @param bow host introduction object of class `polite`, `session` created by `bow()` or `nod()`
-#' @param params character vector of parameters to be appended to url in the format "parameter=value"
-#' @param accept character value of expected data type to be returned by host (e.g. "html", "json", "xml", "csv", "txt", etc)
+#' @param params character vector of parameters to be appended to URL in the format `parameter=value`
+#' @param accept character value of expected data type to be returned by host (e.g. `html`, `json`, `xml`, `csv`, `txt`, etc.)
#' @param content MIME type (aka internet media type) used to override the content type returned by the server.
-#' See http://en.wikipedia.org/wiki/Internet_media_type for a list of common types. You can add `charset` parameter to override server's default encoding.
-#' @param verbose extra feedback from the function. Defaults to FALSE
+#' See http://en.wikipedia.org/wiki/Internet_media_type for a list of common types. You can add the `charset` parameter to override the server's default encoding
+#' @param verbose extra feedback from the function. Defaults to `FALSE`
#'
-#' @return Onbject of class `httr::response` which can be further processed by functions in `rvest` package
+#' @return Object of class `httr::response` which can be further processed by functions in `rvest` package
#'
#' @examples
#' \dontrun{
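
To tie the documented parameters together, a sketch of a `scrape()` call; `per_page=100` follows the README example below, while the page value is illustrative:

```r
library(polite)

session <- bow("https://www.cheese.com/alphabetical/")

# params are appended to the URL as "parameter=value" strings;
# accept declares the expected data type returned by the host
page <- scrape(session,
               params = c("per_page=100", "page=1"),
               accept = "html",
               verbose = FALSE)

# If content parsing fails, scrape() falls back to returning a raw
# vector (see the error handler above), recoverable with rawToChar()
```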
14 changes: 7 additions & 7 deletions README.Rmd
@@ -24,11 +24,11 @@ The goal of `polite` is to promote responsible web etiquette.
> Source: _Wiktionary, The free dictionary_
>

-The package's two main functions `bow` and `scrape` define and realize web harvesting session. `bow` is used to introduce the client to the host and ask for permission to scrape (by inquiring against host's robots.txt file), while `scrape` is the main function for retrieving data from the remote server. Once the connection is established, there's no need to `bow` again. Rather, in order to adjust a scraping url the user can simply `nod` to the new path, which updates the session's url, making sure that the new location can be negotiated against robots.txt
+The package's two main functions `bow` and `scrape` define and realize a web harvesting session. `bow` is used to introduce the client to the host and ask for permission to scrape (by inquiring against the host's `robots.txt` file), while `scrape` is the main function for retrieving data from the remote server. Once the connection is established, there's no need to `bow` again. Rather, in order to adjust a scraping URL the user can simply `nod` to the new path, which updates the session's URL, making sure that the new location can be negotiated against `robots.txt`.

-The three pillars of `polite session` are **seeking permission, taking slowly and never asking twice**.
+The three pillars of a `polite session` are **seeking permission, taking slowly and never asking twice**.

-The package builds on awesome toolkit for defining and managing http session (`httr` and `rvest`), declaring useragent string and investigating site policies (`robotstxt`), utilizing rate-limiting and reponse caching (`ratelimitr` amd `memoise`).
+The package builds on awesome toolkits for defining and managing http sessions (`httr` and `rvest`), declaring the user agent string and investigating site policies (`robotstxt`), and utilizing rate-limiting and response caching (`ratelimitr` and `memoise`).

## Installation

@@ -42,7 +42,7 @@ devtools::install_github("dmi3kno/polite")
## Basic Example


-This is a basic example which shows how to retrive the list of semi-soft cheeses from www.cheese.com. Here, we authenticate a session and then scrape the page with specified parameters. Behind the scenes `polite` retrieves `robots.txt`, checks the url and useragent string against it, caches the call to robots.txt and to the web page and enforces rate limiting.
+This is a basic example which shows how to retrieve the list of semi-soft cheeses from www.cheese.com. Here, we authenticate a session and then scrape the page with specified parameters. Behind the scenes `polite` retrieves `robots.txt`, checks the URL and user agent string against it, caches the call to `robots.txt` and to the web page and enforces rate limiting.

```{r example}
library(polite)
@@ -58,7 +58,7 @@ head(result)

## Extended Example

-You can build your own functions that incorporate `bow`, `scrape` (and, if required, `nod`). Here we will extend our inquiry into cheeses and will download all cheese names and url's to their information pages. Lets retrieve number of pages per letter in the alphabetical list, keeping the number of results per page to 100 to minimize number of web requests.
+You can build your own functions that incorporate `bow`, `scrape` (and, if required, `nod`). Here we will extend our inquiry into cheeses and will download all cheese names and URLs to their information pages. Let's retrieve the number of pages per letter in the alphabetical list, keeping the number of results per page to 100 to minimize number of web requests.

```{r}
library(polite)
Expand All @@ -79,7 +79,7 @@ pages_df <- tibble(letter = rep.int(letters, times=unlist(results)),
pages_df
```

-Now that we know how many pages to retrieve from each letter page, lets rotate over letter pages and retrieve cheese names and underlying links to cheese details. We will need to write a helper function. Our session is still valid and we dont need to `nod` again, because we will not be modifying a page url, only its parameters (note that the field `url` is missing from `scrape` function).
+Now that we know how many pages to retrieve from each letter page, let's rotate over letter pages and retrieve cheese names and underlying links to cheese details. We will need to write a helper function. Our session is still valid and we don't need to `nod` again, because we will not be modifying a page URL, only its parameters (note that the field `url` is missing from `scrape` function).

```{r}
get_cheese_page <- function(letter, pages){
@@ -94,6 +94,6 @@ df
```


-Package logo is uses elements of free image by [pngtree.com](https://pngtree.com)
+Package logo uses elements of a free image by [pngtree.com](https://pngtree.com)

[1] Wiktionary (2018), The free dictionary, retrieved from https://en.wiktionary.org/wiki/bow_and_scrape
16 changes: 8 additions & 8 deletions man/bow.Rd
16 changes: 8 additions & 8 deletions man/html_attrs_dfr.Rd
6 changes: 3 additions & 3 deletions man/nod.Rd
16 changes: 16 additions & 0 deletions man/print.polite.Rd
14 changes: 7 additions & 7 deletions man/rip.Rd

Some generated files are not rendered by default.