
# **Web-scraping in R**

We will use the rvest library for web-scraping in R. It is part of the tidyverse and is similar to the BeautifulSoup library for web-scraping in Python. We first install and load rvest in the chunk below.


In [None]:
install.packages("rvest")
library(rvest)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# **Webpage Structure**

Webpages are generally built from three languages: Hypertext Markup Language (HTML), which defines a webpage's structure and content; Cascading Style Sheets (CSS), which control its style and appearance; and JavaScript, which adds functionality and interactivity to elements of the page. Web-scraping focuses on reading the HTML code of a webpage.

HTML is structured into elements, which consist of a start tag and an end tag, optional attributes, and contents. There are over 100 types of HTML elements, which you can read more about on the [MDN Web Docs reference sheet for HTML elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
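
As a small illustration of the parts of an element, the sketch below parses a single hypothetical `<a>` element (the URL, class name, and text are made up for this example) and pulls out its tag name, attributes, and contents. It uses rvest's *minimal_html()* helper so that it runs without a network connection.

```r
library(rvest)

# Hypothetical one-element document; minimal_html() wraps the fragment in a full HTML page
doc <- minimal_html('<a href="https://example.com" class="external">Example site</a>')

# Select the single <a> element
link <- doc %>% html_element("a")

html_name(link)    # the tag name
html_attrs(link)   # a named character vector with the href and class attributes
html_text2(link)   # the contents between the start and end tags
```

Here the start tag is `<a ...>`, the end tag is `</a>`, `href` and `class` are attributes, and "Example site" is the content.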

You can also view the HTML document of a webpage by right-clicking on the page and selecting "Inspect", which opens the browser's developer tools, or "View page source", which opens the document in a new window.

To start the scraping process, pass the URL of the webpage to the read_html() function. In this example, we will use the [NYPD Neighborhood Policing site](https://www1.nyc.gov/site/nypd/bureaus/patrol/neighborhood-coordination-officers.page). The following code returns an xml_document object that we will use in later steps.

In [74]:
url <- "https://www1.nyc.gov/site/nypd/bureaus/patrol/neighborhood-coordination-officers.page"
html <- read_html(url)
class(html)

To identify the elements of our XML document that contain our target information, rvest provides two options: CSS selectors and XPath expressions. This tutorial covers CSS selectors since they are simpler to work with. A CSS selector defines a pattern that matches the HTML elements you want to scrape. Some common selectors are:
  * *p*: selects all `<p>` elements, which typically contain paragraphs of text
  * *.title*: selects all elements with the class "title"
  * *p.special*: selects all `<p>` elements with the class "special"
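
To see how these three selectors differ, here is a minimal sketch on a made-up HTML fragment (the headings, paragraphs, and class names are hypothetical and chosen only to exercise each pattern).

```r
library(rvest)

# Hypothetical fragment with a mix of tags and classes
doc <- minimal_html('
  <h1 class="title">Page heading</h1>
  <p>An ordinary paragraph.</p>
  <p class="special">A special paragraph.</p>
  <p class="title">A paragraph that shares the title class.</p>
')

all_p     <- doc %>% html_elements("p")          # every <p> element
titled    <- doc %>% html_elements(".title")     # any element with class "title"
special_p <- doc %>% html_elements("p.special")  # only <p> elements with class "special"

length(all_p)          # 3
length(titled)         # 2 (the <h1> and one <p>)
html_text2(special_p)  # "A special paragraph."
```

Note that *.title* matches on the class regardless of tag, while *p.special* requires both the tag and the class to match.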

Let's look at the *title* and *p* CSS selectors on our NYPD page.

In [75]:
#Webpage Title
html %>% html_elements("title")

#Paragraphs
html %>% html_elements("p")

{xml_nodeset (1)}
[1] <title>Neighborhood Policing - NYPD</title>\n

{xml_nodeset (9)}
[1] <p>The cornerstone of today's NYPD is Neighborhood Policing, a comprehens ...
[2] <p>The NYPD has long encouraged officers to strengthen bonds with the com ...
[3] <p>Neighborhood Policing divides precincts into four or five fully-staffe ...
[4] <p>Neighborhood Policing is sufficiently staffed to permit off-radio time ...
[5] <p>Supporting the sector officers and filling out each sector's team are  ...
[6] <p>NCOs are adding a new dimension to the NYPD's crime-fighting capabilit ...
[7] <p>City of New York. 2022 All Rights Reserved,</p>
[8] <p>NYC is a trademark and service mark of the City of New York</p>
[9] <p style="display: block; width: 100%;"><a title="Privacy Ploicy " href=" ...

The `{xml_nodeset (n)}` line shows how many matching elements were found: in the example above, the page has one title element and nine paragraph elements. Each line below it previews one of the matched elements.
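
A node set behaves much like a list, so you can count and index its members. The sketch below uses a made-up three-paragraph document (not the NYPD page) to show this, and also what happens when a selector matches nothing.

```r
library(rvest)

# Hypothetical document with three paragraphs
doc <- minimal_html("<p>First</p><p>Second</p><p>Third</p>")

paragraphs <- doc %>% html_elements("p")
length(paragraphs)                    # 3 matches
second <- paragraphs[[2]] %>% html_text2()
second                                # "Second"

# html_element() (singular) returns only the first match
first <- doc %>% html_element("p") %>% html_text2()

# A selector with no matches returns an empty node set rather than an error
no_match <- doc %>% html_elements("table")
length(no_match)                      # 0
```

Checking `length()` before extracting data is a simple way to confirm that a selector actually matched something.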

# **Extracting Data**

## **Text**

Once you have identified which elements are important, you can retrieve their data either as text contents or as an attribute. Let's retrieve the contents of the `<p>` elements on this page. We first select the elements we want (*p*) and then retrieve the text using an html_text function. There are two such functions: *html_text()* and *html_text2()*. *html_text()* returns the raw underlying text, while *html_text2()* simulates how the text looks on the webpage. We typically want to use *html_text2()*, but it can be slower than *html_text()* in some cases.

In [76]:
text <- html %>%
  html_elements("p") %>%
  html_text2()
text
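
The difference between the two functions is easiest to see on a small example. This sketch uses a made-up paragraph containing repeated spaces and a `<br>` line break; the content is hypothetical.

```r
library(rvest)

# Hypothetical paragraph with extra spaces and a line break
doc <- minimal_html("<p>First   line<br>Second line</p>")
p <- doc %>% html_element("p")

raw_text   <- html_text(p)    # raw text: spacing kept as written, <br> ignored
shown_text <- html_text2(p)   # browser-like text: whitespace collapsed, <br> becomes a newline

writeLines(raw_text)
writeLines(shown_text)
```

With *html_text()* the two lines run together, whereas *html_text2()* separates them the way a browser would display them.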

## **Attributes**

Attributes record things like hyperlinks, image sources, and alternative text. The two most common are the *href* attribute of *a* elements, which holds the URL a link points to, and the *src* attribute of *img* elements, which holds the URL of an image.

In [77]:
#Extract Links
links <- html %>% 
  html_elements("a") %>% 
  html_attr("href")
links

In [79]:
image <- html %>%
  html_elements("img") %>%
  html_attr("src")
image
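
Scraped links are often relative (for example `/about`) and some `<a>` elements have no *href* at all. The sketch below, built on a made-up fragment and an assumed base URL of `https://example.com/`, shows one way to drop the missing values and resolve the rest with `url_absolute()` from the xml2 package (installed alongside rvest).

```r
library(rvest)

# Hypothetical fragment: one relative link, one absolute link, one <a> without an href
doc <- minimal_html('
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <a>No link here</a>
')

hrefs <- doc %>% html_elements("a") %>% html_attr("href")
hrefs  # the third entry is NA because that element has no href attribute

# Drop missing values, then resolve relative paths against the assumed base URL
found    <- hrefs[!is.na(hrefs)]
absolute <- xml2::url_absolute(found, base = "https://example.com/")
absolute
```

When scraping a real page, the base would be the URL that was passed to read_html().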

## **Tables**

Another common way to present information in HTML is the table format, which is composed of *table*, *tr* (row), *th* (header cell), and *td* (data cell) elements. We can retrieve data from tables using the *html_table()* function, which converts the element into a data frame in R. We will use this [New York Police Disciplinary Records dataset](https://data.democratandchronicle.com/new-york-police-disciplinary-records/) for this example.

In [81]:
url <- "https://data.democratandchronicle.com/new-york-police-disciplinary-records/"
html2 <- read_html(url)

table <- html2 %>%
  html_element("table") %>%
  html_table()
table

| Officer(s) | Incident | Incident Date | Final Outcome (Date Resolved) | Employer | City | County |
|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Bradstreet, Adam | Other - Reckless driving | Aug. 17, 2020 | Written reprimand (Oct. 23, 2020) | Rochester City PD | Rochester | Monroe County |
| Schneeberger, Hannah | Other - Reckless driving | April 27, 2020 | The officer received a one-day suspension without pay. (Aug. 6, 2020) | Rochester City PD | Rochester | Monroe County |
| Tantalo, Mary | Other - Reckless driving | March 15, 2020 | The officer received a one-day suspension without pay. (July 7, 2020) | Rochester City PD | Rochester | Monroe County |
| Iacuzzi, Christopher | Other - Reckless driving | Feb. 6, 2020 | Written reprimand (June 29, 2020) | Rochester City PD | Rochester | Monroe County |
| Coughlin, James | Discourtesy - Insubordination | Jan. 13, 2020 | Other (Feb. 6, 2020) | Gates Town PD | Gates | Monroe County |
| Callin, Ryan | Other - Failure to perform duty | Dec. 25, 2019 | Memorandum of record (Jan. 7, 2020) | Fairport Village PD | Fairport | Monroe County |
| Woicyk, John | Discourtesy - Unprofessional | Dec. 21, 2019 | The officer received a five-day suspension without pay. (Feb. 6, 2020) | Rochester City PD | Rochester | Monroe County |
| Mortillaro, Michael | Other - Reckless driving | Dec. 7, 2019 | The officer was suspended without pay for one day. (April 22, 2020) | Rochester City PD | Rochester | Monroe County |
| Mortillaro, Michael | Force - Behavior unbecoming a police officer | Dec. 3, 2019 | The officer was suspended without pay for 15 days. (May 28, 2020) | Rochester City PD | Rochester | Monroe County |
| Blodgett, Evan | Other - Failure to perform duty | Nov. 9, 2019 | Indefinite suspension (Nov. 14, 2019) | Brockport Village PD | Brockport | Monroe County |
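
To make the mapping from *tr*, *th*, and *td* elements to rows and columns concrete, here is a minimal sketch on a tiny made-up table (the officers and values are placeholders, not real records). It also shows that selecting with *html_elements()* returns a list with one data frame per table.

```r
library(rvest)

# Hypothetical table: one header row (<th>) and two data rows (<td>)
doc <- minimal_html('
  <table>
    <tr><th>Officer</th><th>County</th><th>Suspension days</th></tr>
    <tr><td>Officer A</td><td>Monroe County</td><td>1</td></tr>
    <tr><td>Officer B</td><td>Monroe County</td><td>5</td></tr>
  </table>
')

# A single table element gives a single data frame
one_table <- doc %>% html_element("table") %>% html_table()
one_table

# A node set of tables gives a list of data frames, one per table
all_tables <- doc %>% html_elements("table") %>% html_table()
length(all_tables)
```

The *th* cells become the column names, and each following *tr* becomes one row.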


All web-pages are structured differently, so it isn't always apparent which elements contain what information across sites. As you scrape more sites, you will become familiar with the various HTML elements, but we also recommend installing the [SelectorGadget](https://selectorgadget.com/) extension in your web browser, which can help you quickly identify the elements in a web-page.

## **Web-scraping multiple web-pages**

To scrape multiple web-pages at once, we will install the *purrr* package, which is part of the tidyverse and is also installed as a dependency of the commonly used devtools package.

In [21]:
install.packages("purrr")
library(purrr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Let's say we want to pull the news headlines from a few sites. We first create a character vector of the web-pages we want to scrape using the *c()* function.

By briefly inspecting the sites, we see that the headlines are in *h3* elements on both of them. Using the *map()* function, we can apply the same *html_elements()* and *html_text2()* functions that we used before to the *h3* elements of each site. The result is a list with one character vector of headlines per site.

In [None]:
url <- c("https://www.bbc.com/", "https://www.npr.org/")
page <- map(url, ~ read_html(.x) %>% html_elements("h3") %>% html_text2())
page
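
When scraping several pages, one failed request would normally stop the whole *map()* call. The sketch below shows one possible way around this, using purrr's *possibly()* to return an empty result for a source that cannot be read. To keep it runnable offline, the inputs are hypothetical: two literal HTML strings stand in for live pages, and a file name that does not exist stands in for a broken URL. The helper name `get_headlines` is also made up for this example.

```r
library(rvest)
library(purrr)

# Hypothetical sources: literal HTML instead of live URLs, plus one that cannot be read
sources <- c(
  site_a = "<html><body><h3>Headline one</h3><h3>Headline two</h3></body></html>",
  site_b = "<html><body><h3>Headline three</h3></body></html>",
  broken = "no-such-file.html"
)

# Read one source and return the text of its <h3> elements
get_headlines <- function(src) {
  read_html(src) %>% html_elements("h3") %>% html_text2()
}

# possibly() wraps the function so an error yields an empty character vector instead
safe_headlines <- possibly(get_headlines, otherwise = character(0))

# map() keeps the names of the input vector, so each result is labelled by source
headlines <- map(sources, safe_headlines)
lengths(headlines)
```

With real URLs, it is also considerate to pause between requests, for example by calling `Sys.sleep(1)` inside the helper.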


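To display the headlines more neatly, we can print each one on its own line. Because *print()* ignores the *sep* argument for lists, we flatten the list with *unlist()* and write it out with *cat()*.

In [None]:
# Flatten the list of headlines and print one headline per line
cat(unlist(page), sep = "\n")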
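
If you want to keep track of which site each headline came from, one option is to reshape the list into a data frame. This is a minimal sketch in base R; the `page` list here is a hypothetical stand-in with placeholder headlines, shaped like the output of *map()* above.

```r
# Hypothetical scraped result: one character vector of headlines per site
sites <- c("https://www.bbc.com/", "https://www.npr.org/")
page <- list(
  c("Headline A1", "Headline A2"),
  c("Headline B1")
)

# Repeat each site once per headline so the two columns line up
headlines <- data.frame(
  site = rep(sites, times = lengths(page)),
  headline = unlist(page)
)
headlines
```

A data frame like this can then be filtered, joined, or written to a CSV file with the usual R tools.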