
![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We can see what this page looks like [here](http://dataquestio.github.io/web-scraping-pages/simple.html).

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We won't cover tags comprehensively here, but [the Mozilla Developer Network's (MDN) article on HTML basics](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) is a good resource for learning more HTML. (Check out [MDN's guide to the HTML element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of all possible HTML tags.) To scrape web pages effectively, we need to understand the various tags and how they work.

To see the structure of a web page, we can send a GET request using the `GET()` and `content()` functions from `httr`, which we used in previous files.

**Task**

![image.png](attachment:image.png)

**Answer**

`library(httr)
response  <-  GET("http://dataquestio.github.io/web-scraping-pages/simple.html")
content  <- content(response)
print(content)`

Downloading the page is the easy part. Above, we combined the `GET()` and `content()` functions from the `httr` package.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's use the same technique to get the text inside the `title` tag.

**Task**

1. Get the text inside the `title` tag. 
    * Assign the result to `title_text`.
    
**Answer**

`library(rvest)
new_content <- read_html("http://dataquestio.github.io/web-scraping-pages/simple.html")`

`library(dplyr)
title_text <- new_content %>% 
    html_nodes("title") %>%
    html_text()`

Let's consider a [new example](http://dataquestio.github.io/web-scraping-pages/simple_classes.html). The `b` tag creates bold text, and the `div` tag creates a division (a divider) that splits the page into units. We can think of a divider as a "box" that contains content. For example, a web page's footer, sidebar, and horizontal menu are typically each held in their own divider.

![image.png](attachment:image.png)

Assuming we've read the content of this example into the variable `content_2`, we can write the following code snippet to extract all the `p` elements.

![image.png](attachment:image.png)

Note that this is the same code we used above.
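
As a minimal, self-contained sketch of this tag-based selection, the snippet below parses a small inline HTML string instead of downloading the page. The markup and the names `sample_html`, `sample_page`, and `p_text` are assumptions made up for illustration; they are not the example page's actual source.

```r
library(rvest)

# Hypothetical inline HTML standing in for the example page
# (an assumption for illustration, not the page's real source).
sample_html <- '<html><body>
  <div><p class="inner-text">Inner paragraph.</p></div>
  <p class="outer-text"><b>Outer paragraph.</b></p>
</body></html>'

# read_html() accepts literal HTML as well as a URL.
sample_page <- read_html(sample_html)

# Select every <p> element, then extract the text inside each one.
p_text <- sample_page %>%
  html_nodes("p") %>%
  html_text()

print(p_text)
```

Because the selector is just the tag name, both paragraphs are returned in document order, including the one nested inside the `div`.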

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`content_2 <- read_html("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")`


`b_text <- content_2 %>% 
    html_nodes("b") %>%
    html_text()`


`first_outer_paragraph <- b_text[1]`

Let's consider a [new example](http://dataquestio.github.io/web-scraping-pages/simple_ids.html).

![image.png](attachment:image.png)

Notice the new attributes `id="first"` and `id="second"` in the opening `p` tags. These represent IDs.
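
To make the idea concrete, here is a small sketch that selects an element by its ID. It runs on hypothetical inline markup (an assumption, not the example page itself), and the names `id_page` and `first_text` are invented for this illustration.

```r
library(rvest)

# Hypothetical markup with two IDs (an assumption for illustration).
id_page <- read_html('<html><body>
  <p id="first">First paragraph.</p>
  <p id="second">Second paragraph.</p>
</body></html>')

# In a CSS selector, "#" targets an element by its id attribute.
first_text <- id_page %>%
  html_nodes("#first") %>%
  html_text()

print(first_text)
```

An ID is meant to be unique within a page, so a `#` selector normally matches a single element.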

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`content_3 <- read_html("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")`

`first_paragraph_text <- content_3 %>% 
    html_nodes("#first") %>%
    html_text()`

`second_paragraph_text <- content_3 %>% 
    html_nodes("#second") %>%
    html_text()`

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Look at [this page](http://dataquestio.github.io/web-scraping-pages/simple_classes.html) to see how we've used classes to style paragraphs.
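
The sketch below shows one way class-based selection can look in `rvest`. It uses made-up inline HTML rather than the linked page, and the names `class_page` and `outer_text` are hypothetical.

```r
library(rvest)

# Hypothetical markup in which two paragraphs share a class
# (an assumption for illustration).
class_page <- read_html('<html><body>
  <p class="inner-text">Inner one.</p>
  <p class="outer-text">Outer one.</p>
  <p class="outer-text">Outer two.</p>
</body></html>')

# In a CSS selector, "." targets elements by class. Unlike an ID,
# a class can be shared, so several elements may match.
outer_text <- class_page %>%
  html_nodes(".outer-text") %>%
  html_text()

print(outer_text)
```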

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`content_4 <- read_html("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")`


`outer_paragraph_text <- content_4 %>% 
    html_nodes(".outer-text") %>%
    html_text()`

Let's consider this [new example](http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html). The `table` tag represents tabular data: a structured set of rows and columns. It lets a page display a dataset.



![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Our example is an excerpt from the box score of the 2014 season's Super Bowl ([Super Bowl XLIX](https://en.wikipedia.org/wiki/Super_Bowl_XLIX)), a [National Football League (NFL)](https://en.wikipedia.org/wiki/National_Football_League) game in which the New England Patriots played the Seattle Seahawks. The box score contains information on how many yards each team gained, how many turnovers each team had, and other statistics that pertain to the sport.

Check out the [web page](http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html) this HTML renders. The page renders as a table with column and row names. The first column is for the Seattle Seahawks (**SEA**), and the second column is for the New England Patriots (**NWE**). Each row represents a different statistic.

Suppose we want to get this tabular data. Assume that we've read the content of this example into the variable `content_5`. We can use the same technique as above to get this content.
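
As a rough sketch of what extracting a table can look like, the snippet below converts a tiny inline HTML table into a data frame with `html_table()`. The table contents are invented placeholder values, not the real box score, and the names `table_page` and `stats_df` are hypothetical.

```r
library(rvest)

# Hypothetical table with made-up numbers (NOT the real box score).
table_page <- read_html('<html><body>
  <table>
    <tr><th>Stat</th><th>TeamA</th><th>TeamB</th></tr>
    <tr><td>First downs</td><td>20</td><td>25</td></tr>
    <tr><td>Turnovers</td><td>1</td><td>2</td></tr>
  </table>
</body></html>')

# Select the <table> element, then convert it to a data frame.
# The row of <th> cells is used for the column names.
stats_df <- table_page %>%
  html_node("table") %>%
  html_table()

print(stats_df)
```

Using `html_node()` (singular) returns the first matching table, so `html_table()` yields one data frame rather than a list of them.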

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

`content_5 <- read_html("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")`


`library(dplyr)
super_bowl_df <- content_5 %>% 
    html_node("table") %>%
    html_table()`

`super_bowl_df`

Let's consider another [new example](http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html).

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
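
Besides their text, elements carry attributes such as `class` and `id`, and `html_attr()` reads the value of one attribute from each selected element. The following is a minimal sketch on hypothetical inline markup (an assumption, not the example page's source); the names `attr_page` and `p_ids` are invented for illustration.

```r
library(rvest)

# Hypothetical markup: two paragraphs have an id, the third has none
# (an assumption for illustration).
attr_page <- read_html('<html><body>
  <p class="inner-text" id="first">One.</p>
  <p class="outer-text" id="second">Two.</p>
  <p>Three.</p>
</body></html>')

# Read the "id" attribute of every <p> element.
# Elements without that attribute yield NA.
p_ids <- attr_page %>%
  html_nodes("p") %>%
  html_attr("id")

print(p_ids)
```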

**Task**

![image.png](attachment:image.png)

**Answer**

`content_6 <- read_html("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")`

`library(dplyr)
p_class_values <- content_6 %>% 
    html_nodes("p") %>%
    html_attr("class")`

`p_class_values`

We began scraping simple web pages, and we developed new skills for extracting web data.

We've covered the basics of HTML and how to select elements, which are key foundational blocks.

Web scraping is most useful when we need to gather a lot of information from many web pages quickly.

For example, if we wanted to find the job positions published on a website and keep only the interesting ones, we could use web scraping. Doing this manually might take days. Instead, we could write a script to automate it in a couple of hours and have a lot more fun doing it.