# Webscraping

**Webscraping**: Programmatically extract data from the HTML code of websites.

- Always read a website's terms of use before scraping it.
- Requesting too many pages too quickly may get your IP address blocked.
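
One simple way to avoid hammering a site is to pause between requests. The following is a minimal sketch, not a complete solution: `fetch_politely`, `default_fetch`, and the 2-second default delay are hypothetical names and values chosen for illustration, and the demo uses a stand-in fetch function so that it runs without network access.

```r
# Hypothetical default: read one page with base R, as in the examples below
default_fetch <- function(u) {
  con <- url(u)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Hypothetical helper: fetch each URL in turn, pausing `delay` seconds between requests
fetch_politely <- function(urls, delay = 2, fetch = default_fetch) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # wait before every request except the first
    pages[[i]] <- fetch(urls[i])
  }
  names(pages) <- urls
  pages
}

# Offline demo with a stand-in fetch function (no real requests are made)
fake_fetch <- function(u) paste("page for", u)
demo_pages <- fetch_politely(c("a", "b"), delay = 0, fetch = fake_fetch)
names(demo_pages)
```

With real URLs you would drop the `fetch` argument and keep a non-zero `delay`; an appropriate delay depends on the site's own rules.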

### Reading data from Google Scholar

In [1]:
con <- url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
htmlCode <- readLines(con)
close(con)

In readLines(con): incomplete final line found on 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'

In [2]:
htmlCode

In [3]:
# Use the XML library
library(XML)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes = TRUE)

In [4]:
xpathSApply(html, "//title", xmlValue)

In [5]:
# Get all classes in <td class=...>
xpathSApply(html, "//td", xmlGetAttr, "class")

In [7]:
# Number of citations
xpathSApply(html, "//td[@class='gsc_a_c']", xmlValue)

In [36]:
# Read all titles and store them in a list
titles <- as.list(xpathSApply(html, "//td[@class='gsc_a_t']", xmlValue))
titles[1:2] # Print the first two

In [41]:
# Each entry has two div elements with class gs_gray: the first holds the authors, the second the publication venue
xpathApply(html, "//div[@class='gs_gray']", xmlValue)[c(2, 4, 6)] # Venues of the first three publications
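
The separate queries above can be combined into one table. A minimal sketch, run on a small made-up HTML snippet that only imitates the class names used above (`gsc_a_t`, `gsc_a_c`, `gs_gray`); the titles, authors, and counts are invented, and the real page's markup may differ or change:

```r
library(XML)

# Made-up HTML imitating the structure queried above (not real Scholar data)
snippet <- '
<table>
  <tr>
    <td class="gsc_a_t"><a>Paper A</a>
      <div class="gs_gray">Author One</div><div class="gs_gray">Journal X</div></td>
    <td class="gsc_a_c">10</td>
  </tr>
  <tr>
    <td class="gsc_a_t"><a>Paper B</a>
      <div class="gs_gray">Author Two</div><div class="gs_gray">Journal Y</div></td>
    <td class="gsc_a_c">3</td>
  </tr>
</table>'

doc <- htmlParse(snippet, asText = TRUE)

paper_titles <- xpathSApply(doc, "//td[@class='gsc_a_t']/a", xmlValue)
citations    <- as.integer(xpathSApply(doc, "//td[@class='gsc_a_c']", xmlValue))
gray         <- xpathSApply(doc, "//div[@class='gs_gray']", xmlValue)

# gs_gray divs alternate: odd positions are authors, even positions are venues
papers <- data.frame(
  title     = paper_titles,
  authors   = gray[c(TRUE, FALSE)],
  venue     = gray[c(FALSE, TRUE)],
  citations = citations,
  stringsAsFactors = FALSE
)
papers
```

The logical index `c(TRUE, FALSE)` is recycled along the vector, which is a compact alternative to writing out `c(1, 3, 5, ...)`.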

### GET from the httr package

In [48]:
library(httr)
url # Show the URL defined earlier

In [49]:
html2 <- GET(url)
content2 <- content(html2, as="text")
parsedHtml <- htmlParse(content2, asText = TRUE)

In [51]:
class(parsedHtml)

In [54]:
parsedHtml # A prettier HTML to look at

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="referrer" content="always">
<meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
<style>@viewport{width:device-width;min-zoom:1;max-zoom:2;}</style>
<meta name="format-detection" content="telephone=no">
<style>html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_top{position:relative;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:300px;}#gs_top.gs_nscl{position:fixed;width:100%;}body,td,input{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{background:#fff;color:#222;-webkit-text-size-adjust:100%;-moz-text-size-adjust:none;}.gs_g

In [55]:
xpathSApply(parsedHtml, "//title", xmlValue)

### Accessing websites with passwords

This is just an example, using the httpbin.org test endpoint:
```r
# Request a page protected by basic authentication
pg2 <- GET("http://httpbin.org/basic-auth/user/passwd",
           authenticate("user", "passwd"))
```
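
By default `authenticate()` uses HTTP basic authentication, which sends the user name and password joined by a colon and base64-encoded in an `Authorization` header. A small sketch of that encoding step, using `jsonlite::base64_enc()` (jsonlite is assumed to be installed; it is a dependency of httr):

```r
library(jsonlite)

user <- "user"
password <- "passwd"

# Basic auth: base64-encode "user:password" and prefix it with "Basic "
token <- base64_enc(charToRaw(paste0(user, ":", password)))
auth_header <- paste("Basic", token)
auth_header
```

Base64 is an encoding, not encryption, so basic authentication should only be used over HTTPS.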

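### Using handles

A handle keeps settings and cookies across requests to the same site, so authentication done once through the handle carries over to later requests.
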
```r
google <- handle("http://google.com")
pg1 <- GET(handle=google, path="/")
pg2 <- GET(handle=google, path="search")
```

# Application programming interfaces (APIs)

In general, start with the API's documentation.

- httr supports GET, POST, PUT, and DELETE requests, provided you are authorized.
- You can authenticate with a user name and password.
- Most modern APIs use something like **OAuth**.
- httr works well with Facebook, Google, Twitter, GitHub, etc.
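
Most API calls boil down to a base URL, a path, and query parameters. As a rough sketch, httr's `modify_url()` and `parse_url()` can build and inspect such a URL without sending anything. The GitHub search endpoint and its `q` and `sort` parameters are used here purely as an illustration, so check the API's documentation before relying on them:

```r
library(httr)

# Build the request URL from its parts (nothing is sent yet)
api_url <- modify_url(
  "https://api.github.com",
  path  = "search/repositories",
  query = list(q = "language:r", sort = "stars")
)

# Inspect the pieces to confirm the URL was assembled as intended
parts <- parse_url(api_url)
parts$hostname     # the API host
parts$query$sort   # the value passed for `sort`

# To actually call the API (requires network access):
# resp <- GET(api_url)
# stop_for_status(resp)      # raise an R error on a 4xx/5xx response
# results <- content(resp)   # parsed response body
```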

In [None]:
# Will try to get Twitter data using Python :)