# Web Scraping

![](https://dq-content.s3.amazonaws.com/6ne0anS.png)

In [1]:
import requests

In [2]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
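The `b'...'` prefix in the output shows that `response.content` is a `bytes` object, not a string. A minimal sketch of turning it into text, assuming the page is UTF-8 encoded (the bytes below are copied from the output above so the snippet runs without a network request):

```python
# Bytes copied from the cell output above, standing in for response.content.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n'
    b'        <title>A simple example page</title>\n    </head>\n'
    b'    <body>\n        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Assumption: the page is UTF-8 encoded.
html_text = content.decode("utf-8")

print(type(content).__name__, "->", type(html_text).__name__)
print(html_text)
```

With a live response, `response.text` gives a decoded string directly; BeautifulSoup accepts either form.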

## Retrieving Elements from a Page
![](https://dq-content.s3.amazonaws.com/C7qmC17.png)

In [3]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.

print(p.text)
title_text = parser.title.text
title_text


Here is some simple content for this page.


'A simple example page'
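Attribute access such as `parser.body.p` returns only the first matching tag, and returns `None` when no such tag exists. A small sketch of that behavior, using an inline copy of the page's HTML; the `<table>` lookup is just an illustrative tag that is absent from the page:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
parser = BeautifulSoup(html, "html.parser")

# The first <p> inside <body>.
first_p = parser.body.p
print(first_p.text)

# There is no <table> in this page, so attribute access gives None.
missing = parser.body.table
print(missing)

# Guard against None before reading .text to avoid an AttributeError.
table_text = missing.text if missing is not None else ""
```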

## Using find_all
* Apply the find_all method to get the text inside the title tag, and assign the result to title_text.

In [5]:
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

title_text = parser.find_all("title")[0].text
title_text

Here is some simple content for this page.


'A simple example page'
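Because `find_all` returns a list of every match, it can be looped over, and it is simply empty when nothing matches. The sketch below uses a made-up two-paragraph page (an assumption, not one of the pages above) and also shows `find`, which returns the first match or `None`:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two paragraphs, for illustration only.
html = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""
parser = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
paragraphs = parser.find_all("p")
texts = [p.text for p in paragraphs]
print(texts)

# No <table> tags exist, so the result is empty and [0] would raise IndexError.
no_tables = parser.find_all("table")
print(len(no_tables))

# find returns only the first match, or None if there is none.
first = parser.find("p")
print(first.text)
```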

## Element IDs
![](https://dq-content.s3.amazonaws.com/WBG4aCQ.png)

In [6]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)

second_paragraph = parser.find_all("p", id="second")[0]
second_paragraph_text = second_paragraph.text
print(second_paragraph_text)


                First paragraph.
            


                Second paragraph.
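The printed text above carries the newlines and indentation from the page source. A sketch of cleaning that up, assuming markup that mimics the whitespace seen in the output (the real page's source is not shown here):

```python
from bs4 import BeautifulSoup

# Assumed markup: paragraphs whose text is surrounded by newlines and spaces.
html = """
<div>
    <p id="first">
        First paragraph.
    </p>
    <p id="second">
        Second paragraph.
    </p>
</div>
"""
parser = BeautifulSoup(html, "html.parser")
second = parser.find("p", id="second")

raw = second.text                    # keeps surrounding whitespace
clean = second.get_text(strip=True)  # strips it

print(repr(raw))
print(repr(clean))
```

Calling `raw.strip()` gives the same result for a tag that holds a single piece of text.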
            



## Element Classes

![](https://dq-content.s3.amazonaws.com/T2TguLL.png)

* Get the text in the second inner paragraph, and assign the result to second_inner_paragraph_text.
* Get the text of the first outer paragraph, and assign the result to first_outer_paragraph_text.

In [7]:
# Get the website that contains classes.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then, take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)

second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text
first_outer_paragraph_text = parser.find_all("p", class_="outer-text")[0].text


                First paragraph.
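The keyword is spelled `class_` because `class` is a reserved word in Python. A sketch on made-up markup (an assumption, not the actual page) showing that `class_` matches a tag even when it carries additional classes, and that an `attrs` dictionary is an equivalent way to filter:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first paragraph has two classes.
html = """
<div>
  <p class="inner-text first-item">First inner paragraph.</p>
  <p class="inner-text">Second inner paragraph.</p>
</div>
<p class="outer-text">First outer paragraph.</p>
"""
parser = BeautifulSoup(html, "html.parser")

# Matches any <p> whose class list contains "inner-text".
inner = parser.find_all("p", class_="inner-text")
inner_texts = [p.text for p in inner]
print(inner_texts)

# The class attribute is exposed as a list of class names.
print(inner[0]["class"])

# Equivalent filter using an attrs dictionary.
outer = parser.find_all("p", attrs={"class": "outer-text"})
print(outer[0].text)
```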
            


## Cascading Style Sheets (CSS)

This CSS will make all of the text inside all paragraphs red:
```
p{
    color: red
}
```

This CSS will change the text color to red for any paragraphs that have the class inner-text. We select classes with the period or dot symbol (.):
```
p.inner-text{
    color: red
}
```

This CSS will change the text color to red for any paragraphs that have the ID first. We select IDs with the pound or hash symbol (#):
```
p#first{
    color: red
}
```

You can also style IDs and classes without using any specific tags. For example, this CSS will make the element with the ID first red (not just paragraphs):

```
#first{
    color: red
}
```
This CSS will make any element with the class inner-text red:
```
.inner-text{
    color: red
}
```
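The same selectors can be passed to BeautifulSoup's `.select` method to pick out elements instead of styling them. A sketch that counts the matches for each of the five selectors above, on a small made-up page (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page: two inner paragraphs, one span, one outer paragraph.
html = """
<html><body>
<div>
  <p id="first" class="inner-text">First inner paragraph.</p>
  <p class="inner-text">Second inner paragraph.</p>
</div>
<span class="inner-text">A span, not a paragraph.</span>
<p class="outer-text">Outer paragraph.</p>
</body></html>
"""
parser = BeautifulSoup(html, "html.parser")

selectors = ["p", "p.inner-text", "p#first", "#first", ".inner-text"]

# Number of elements each selector matches.
counts = {s: len(parser.select(s)) for s in selectors}
for selector, count in counts.items():
    print(f"{selector!r} matches {count} element(s)")
```

Note that `.inner-text` also matches the span, while `p.inner-text` is limited to paragraphs.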

![](https://dq-content.s3.amazonaws.com/uOaCMeY.png)
* Select all of the elements that have the class outer-text.
  * Assign the text of the first paragraph that has the class outer-text to first_outer_text.
* Select all of the elements that have the ID second.
  * Assign the text of the first paragraph that has the ID second to the variable second_text.

In [9]:
# Get the website that contains classes and IDs.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Select all of the elements that have the first-item class.
first_items = parser.select(".first-item")

# Print the text of the first paragraph (the first element with the first-item class).
print(first_items[0].text)
first_outer_text = parser.select(".outer-text")[0].text
second_text = parser.select("#second")[0].text



                First paragraph.
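Indexing with `[0]` raises an `IndexError` when a selector matches nothing. One possible way to avoid that is `select_one`, which returns the first match or `None`. The helper `first_text` and the markup below are hypothetical, written only for this sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<p class="outer-text first-item" id="first">First paragraph.</p>
<p class="outer-text" id="second">Second paragraph.</p>
"""
parser = BeautifulSoup(html, "html.parser")


def first_text(soup, selector, default=""):
    """Return the stripped text of the first match, or default if none."""
    match = soup.select_one(selector)
    return match.get_text(strip=True) if match is not None else default


print(first_text(parser, ".outer-text"))
print(first_text(parser, "#second"))
print(first_text(parser, "#third", default="n/a"))
```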
            


## Nesting CSS Selectors

This selector will target any paragraph inside a div tag:

```div p```

This selector will target any item inside a div tag that has the class first-item:

```div .first-item```

This one is even more specific. It selects any item that's inside a div tag inside a body tag, but only if it also has the ID first:

```body div #first```

This selector zeroes in on any items with the ID first that are inside any items with the class first-item:

```.first-item #first```

As you can see, CSS selectors can be nested to any depth, which allows us to extract data from websites with complex layouts. Because it's easy to write a selector that doesn't match what you expect, we highly recommend testing each selector with the .select method as you write it.
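A sketch of testing the four nested selectors above with `.select`, on a small made-up page (the markup is an assumption, not one of the tutorial pages). The plain `.first-item` selector is included only for contrast:

```python
from bs4 import BeautifulSoup

# Hypothetical page: a div with two paragraphs, plus one paragraph outside it.
html = """
<html><body>
  <div class="first-item">
    <p id="first" class="first-item">First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <p class="first-item">Paragraph outside the div.</p>
</body></html>
"""
parser = BeautifulSoup(html, "html.parser")

selectors = [
    "div p",               # paragraphs inside a div
    "div .first-item",     # first-item elements inside a div
    "body div #first",     # ID first, inside a div, inside body
    ".first-item #first",  # ID first inside any first-item element
    ".first-item",         # contrast: every first-item element
]

match_counts = {s: len(parser.select(s)) for s in selectors}
for selector, count in match_counts.items():
    print(f"{selector!r}: {count}")
```

The div itself has the class first-item, but `div .first-item` does not match it, since a nested selector only matches descendants of the outer element.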

## Using Nested CSS Selectors
![](https://dq-content.s3.amazonaws.com/H34hK8I.png)
* Find the Total Plays for the New England Patriots, and assign the result to patriots_total_plays_count.
* Find the Total Yards for the Seahawks, and assign the result to seahawks_total_yards_count.

In [10]:
# Get the Super Bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Find the number of turnovers the Seahawks committed.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)

patriots_total_plays_count = parser.select("tr#total-plays td")[2].text
seahawks_total_yards_count = parser.select("tr#total-yards td")[1].text

1
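The cell text comes back as a string (`'1'` above), so it needs converting before any arithmetic. A sketch of reading a whole row at once, under loud assumptions: the table markup is a simplified guess at the structure implied by the selectors above (a label cell, then Seahawks, then Patriots), the numbers are placeholders rather than real box score data, and `row_values` is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Assumed structure and placeholder numbers, NOT the real box score.
html = """
<table>
  <tr id="total-plays"><td>Total Plays</td><td>11</td><td>22</td></tr>
  <tr id="turnovers"><td>Turnovers</td><td>1</td><td>2</td></tr>
</table>
"""
parser = BeautifulSoup(html, "html.parser")


def row_values(soup, row_id):
    """Return (label, [int, ...]) for the row with the given ID."""
    cells = soup.select(f"tr#{row_id} td")
    # First cell is the label; the rest are numeric values.
    label, *values = [cell.get_text(strip=True) for cell in cells]
    return label, [int(v) for v in values]


label, values = row_values(parser, "total-plays")
by_team = dict(zip(("seahawks", "patriots"), values))
print(label, by_team)
```

If the row ID does not exist, the unpacking raises a `ValueError`, which makes a bad selector fail early instead of silently returning nothing.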
