##### About BeautifulSoup
Often you’ll find the perfect website that has all the data you need, but there’s no way to download it.<br>
This is where BeautifulSoup comes in handy to scrape the HTML. <br>
If we find the data we want to analyze online, we can use BeautifulSoup to grab it and turn it into a structure we can understand. <br>
This Python library, which takes its name from a song in Alice in Wonderland, lets us quickly extract information from a website so that we can put it into a DataFrame.<br>

##### 1. Rules of Scraping
    A. Always check a website’s Terms and Conditions before scraping. Read the statement on the legal use of data. Usually, the data you scrape should not be used for commercial purposes.
    B. Do not spam the website with a ton of requests. A large number of requests can break a website that is unprepared for that level of traffic. As a general rule of good practice, make no more than one request per second.
    C. If the layout of the website changes, you will have to change your scraping code to follow the new structure of the site.
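
Rule B can be sketched as a small helper that pauses between requests. This is only an illustration: `fetch_politely`, `fake_fetch`, and the `delay_seconds` parameter are hypothetical names, and the stand-in fetcher exists so the sketch runs without touching the network. In practice you would pass `requests.get` as the fetcher and keep the delay at one second or more.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request after the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Hypothetical stand-in fetcher so the sketch runs offline.
def fake_fetch(url):
    return f"fetched {url}"

pages = fetch_politely(["a.html", "b.html"], fake_fetch, delay_seconds=0.1)
print(pages)
```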

##### 2. Requests
To get the HTML of a website, we need to make a request for the content of the webpage.

In [1]:
import requests

In [2]:
webpage_response = requests.get("https://content.codecademy.com/courses/beautifulsoup/shellter.html")
webpage_response

<Response [200]>

In [3]:
webpage = webpage_response.content

In [4]:
webpage

b'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <title>Turtle Shellter</title>\n      <link href="https://fonts.googleapis.com/css?family=Poppins" rel="stylesheet">\n      <link rel=\'stylesheet\' type=\'text/css\' href=\'style.css\'>\n  <script async src=\'/cdn-cgi/bm/cv/669835187/api.js\'></script></head>\n\n  <body>\n      <div class="banner">\n        <h1>The Shellter</h1>\n        <span class="brag">The #1 Turtle Adoption website!</span>\n      </div>\n\n      <div class="about">\n        <p class="text">Click to learn more about each turtle</p>\n      </div>\n\n    <div class="grid">\n      <div class="box adopt">\n          <a href="aesop.html" class="more-info"><img src="aesop.png" class="headshot"></a>\n          <p>Aesop</p>\n      </div>\n\n      <div class="box adopt">\n          <a href="caesar.html" class="more-info"><img src="caesar.png" class="headshot"></a>\n          <p>Caesar</p>\n      </div>\n\n      <div class="box adopt">\n          <a href="

##### 3. The BeautifulSoup Object
When we printed out all of that HTML from our request, it seemed pretty long and messy.<br>
BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in.<br>
To use it, all we have to do is convert the HTML document to a BeautifulSoup object!<br>
"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib", that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

In [5]:
from bs4 import BeautifulSoup

In [6]:
soup = BeautifulSoup(webpage, "html.parser")

In [7]:
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Turtle Shellter</title>
<link href="https://fonts.googleapis.com/css?family=Poppins" rel="stylesheet"/>
<link href="style.css" rel="stylesheet" type="text/css"/>
<script async="" src="/cdn-cgi/bm/cv/669835187/api.js"></script></head>
<body>
<div class="banner">
<h1>The Shellter</h1>
<span class="brag">The #1 Turtle Adoption website!</span>
</div>
<div class="about">
<p class="text">Click to learn more about each turtle</p>
</div>
<div class="grid">
<div class="box adopt">
<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>
<p>Aesop</p>
</div>
<div class="box adopt">
<a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>
<p>Caesar</p>
</div>
<div class="box adopt">
<a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>
<p>Sulla</p>
</div>
<div class="box adopt">
<a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/

##### 4. Object Types
BeautifulSoup breaks the HTML page into several types of objects.
1. Tags<br>
- A Tag corresponds to an HTML Tag in the original document.<br>
- You can get the name of the tag using `.name` and a dictionary representing the attributes of the tag using `.attrs`
- We can get the children of a tag by accessing the `.children` attribute
- We can also navigate up the tree by accessing the `.parent` attribute (the direct parent) or by iterating over `.parents` (all ancestors)
2. NavigableStrings<br>
- NavigableStrings are the pieces of text that are in the HTML tags on the page.
- You can get the string inside of the tag by accessing `.string`; it is `None` when the tag contains more than one child
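
The attributes listed above can be tried on a tiny document. This is a minimal sketch on a made-up HTML snippet (the `id="top"` attribute and the variable name `demo` are assumptions for illustration, not part of the Shellter page).

```python
from bs4 import BeautifulSoup

html = '<div class="banner" id="top"><h1>The Shellter</h1></div>'
demo = BeautifulSoup(html, "html.parser")

tag = demo.div
print(tag.name)    # div
print(tag.attrs)   # {'class': ['banner'], 'id': 'top'}

# .string gives the NavigableString inside a tag that has a single text child.
print(tag.h1.string)                  # The Shellter
print(type(tag.h1.string).__name__)   # NavigableString

# .parent is the direct parent; .parents walks every ancestor up to the document.
print(tag.h1.parent.name)                   # div
print([p.name for p in tag.h1.parents])     # ['div', '[document]']
```

Note that `class` comes back as a list, because an element can carry several classes at once.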


In [8]:
# Let's print out the first p tag on the shellter.html page.
soup.p

<p class="text">Click to learn more about each turtle</p>

In [9]:
soup.p.string

'Click to learn more about each turtle'

In [10]:
soup.h1

<h1>The Shellter</h1>

In [11]:
soup.span

<span class="brag">The #1 Turtle Adoption website!</span>

In [12]:
soup.span.string

'The #1 Turtle Adoption website!'

In [13]:
# Let's loop through all of the children of the first div and print out each one.
soup.div

<div class="banner">
<h1>The Shellter</h1>
<span class="brag">The #1 Turtle Adoption website!</span>
</div>

In [14]:
for child in soup.div.children:
    print(child)



<h1>The Shellter</h1>


<span class="brag">The #1 Turtle Adoption website!</span>
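
The blank lines in this output are children too: the newlines between the tags are whitespace-only NavigableStrings. A minimal sketch, rebuilding the banner div as a string, shows one way to keep only the Tag children (the variable names are illustrative).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<div class="banner">\n'
    '<h1>The Shellter</h1>\n'
    '<span class="brag">The #1 Turtle Adoption website!</span>\n'
    '</div>'
)
demo = BeautifulSoup(html, "html.parser")

children = list(demo.div.children)
print(len(children))  # 5: three "\n" strings and two tags

# Keep only the children that are actual tags.
tag_children = [child for child in children if isinstance(child, Tag)]
print([child.name for child in tag_children])  # ['h1', 'span']
```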




##### 5.  Website Structure
- When we’re telling our Python script what HTML tags to grab, we need to know the structure of the website and what we’re looking for.
- Many browsers, including Chrome, Firefox, and Safari, have Dev Tools that help you inspect a webpage and see what HTML elements it is composed of.
- When you’re preparing to scrape a website, first inspect the HTML to see where the info you are looking for is located on the page.

##### 6. Find
- Beautiful Soup offers two methods for searching the HTML tags on a webpage, .find() and .find_all(). Both methods can take just a tag name as a parameter but will return slightly different information.
- .find() returns the first tag that matches the parameter or None if there are no tags that match.
- .find_all() returns a list of all the tags that match — if no tags match, it returns an empty list.
- .find() and .find_all() are far more flexible than just accessing elements directly through the soup object. With these methods, we can use regexes, attributes, or even functions to select HTML elements more intelligently.
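
Two of the points above are easy to trip over, so here is a small sketch on a made-up snippet: guarding against the `None` that .find() returns when nothing matches, and passing a function as the filter. The helper name `has_no_class` is hypothetical.

```python
from bs4 import BeautifulSoup

html = '<div><p class="text">Intro</p><p>Aesop</p><a href="aesop.html">more</a></div>'
demo = BeautifulSoup(html, "html.parser")

# There is no <table> in this snippet, so .find() returns None.
table = demo.find("table")
table_text = table.get_text() if table is not None else ""

# A function filter receives each tag and returns True for the ones to keep.
def has_no_class(tag):
    return tag.name == "p" and not tag.has_attr("class")

plain_paragraphs = demo.find_all(has_no_class)
print(plain_paragraphs)  # [<p>Aesop</p>]
```

Calling a method on the result of .find() without the `None` check raises an `AttributeError` when the tag is missing.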

In [15]:
soup.find("p")

<p class="text">Click to learn more about each turtle</p>

In [16]:
soup.find_all("p")

[<p class="text">Click to learn more about each turtle</p>,
 <p>Aesop</p>,
 <p>Caesar</p>,
 <p>Sulla</p>,
 <p>Spyro</p>,
 <p>Zelda</p>,
 <p>Bandicoot</p>,
 <p>Hal</p>,
 <p>Mock</p>,
 <p>Captain Sparrow</p>]

In [17]:
# Find all of the a elements on the page
soup.find_all("a")

[<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>]

##### 7. Using Regex
- We can pass a regular expression, built with the .compile() function from the re module, to .find_all() to match tag names against a pattern.

In [18]:
import re

In [19]:
soup.find_all(re.compile("h[1-9]"))
# The expression "h[1-9]" matches tag names containing h followed by a digit from 1 to 9 (HTML only defines h1 through h6).

[<h1>The Shellter</h1>]

In [20]:
soup.find_all(re.compile("[ou]l"))
# We want every <ol> and every <ul> on the page, but this page contains no lists.

[]
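
One detail worth knowing: Beautiful Soup tests a compiled pattern against tag names with the pattern's `search()` method, so an unanchored pattern matches anywhere in the name. The sketch below, on a made-up document, contrasts a loose pattern with one anchored to the heading tags h1 through h6.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><h1>A</h1><h2>B</h2></body></html>"
demo = BeautifulSoup(html, "html.parser")

# Unanchored: any tag whose name contains an "h".
loose = [tag.name for tag in demo.find_all(re.compile("h"))]
print(loose)   # ['html', 'head', 'h1', 'h2']

# Anchored: only the heading tags h1 through h6.
strict = [tag.name for tag in demo.find_all(re.compile(r"^h[1-6]$"))]
print(strict)  # ['h1', 'h2']
```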

##### 8. Using Lists
- We can also specify all of the elements we want to find by passing .find_all() a list of the tag names we are looking for.

In [21]:
soup.find_all(["a", "p"])

[<p class="text">Click to learn more about each turtle</p>,
 <a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <p>Aesop</p>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <p>Caesar</p>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <p>Sulla</p>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <p>Spyro</p>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <p>Zelda</p>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <p>Bandicoot</p>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <p>Hal</p>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <p>Mock</p>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>,
 <p>Captain Sparrow</p>]

##### 9. Using Attributes
- We can also try to match the elements with relevant attributes. 
- We can pass a dictionary to the attrs parameter of find_all with the desired attributes of the elements we’re looking for.

In [22]:
soup.find_all(attrs={'class':'banner'})

[<div class="banner">
 <h1>The Shellter</h1>
 <span class="brag">The #1 Turtle Adoption website!</span>
 </div>]

In [23]:
soup.find_all(attrs={'class':'banner', 'id':'jumbotron'})

[]

In [24]:
soup.find_all(attrs={'class':'more-info'})

[<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>]

In [25]:
soup.find_all(attrs={'class':'more-info', 'href':'aesop.html'})

[<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>]
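
Class searches behave a little differently from other attributes because an element can have several classes, like the `box adopt` divs on this page. A minimal sketch on a made-up snippet, using the `class_` keyword (spelled with an underscore because `class` is reserved in Python):

```python
from bs4 import BeautifulSoup

html = '<div class="box adopt"><p>Aesop</p></div><div class="box"><p>Info</p></div>'
demo = BeautifulSoup(html, "html.parser")

# Matching a single class finds every element that carries it.
adoptable = demo.find_all("div", class_="adopt")
boxes = demo.find_all("div", class_="box")

# Matching the full attribute string only finds exact matches.
exact = demo.find_all("div", attrs={"class": "box adopt"})

print(len(adoptable), len(boxes), len(exact))  # 1 2 1
```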

##### 10. Select for CSS Selectors
- Another way to capture your desired elements with the soup object is to use CSS selectors. 
- The .select() method accepts the CSS selectors you would normally use in a .css file!

In [26]:
soup.select(".more-info")

[<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>]

In [27]:
soup.select(".headshot")
# for class, use .

[<img class="headshot" src="aesop.png"/>,
 <img class="headshot" src="caesar.png"/>,
 <img class="headshot" src="sulla.png"/>,
 <img class="headshot" src="spyro.png"/>,
 <img class="headshot" src="zelda.png"/>,
 <img class="headshot" src="bandicoot.png"/>,
 <img class="headshot" src="hal.png"/>,
 <img class="headshot" src="mock.png"/>,
 <img class="headshot" src="sparrow.png"/>]

In [28]:
soup.select("#id")
# for id, use #

[]
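
Selectors can also be combined. The sketch below uses a cut-down, hand-written version of the grid markup to show a child combinator, an attribute selector, and `.select_one()`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="grid">'
    '<div class="box adopt"><a class="more-info" href="aesop.html"></a><p>Aesop</p></div>'
    '<div class="box adopt"><a class="more-info" href="caesar.html"></a><p>Caesar</p></div>'
    '</div>'
)
demo = BeautifulSoup(html, "html.parser")

# <p> elements that are direct children of a div with class "adopt".
names = [p.get_text() for p in demo.select("div.adopt > p")]
print(names)  # ['Aesop', 'Caesar']

# First match only, or None when nothing matches.
first_link = demo.select_one("a.more-info")
missing = demo.select_one("#jumbotron")

# Attribute selector: links pointing at one specific page.
caesar = demo.select('a[href="caesar.html"]')
```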

##### Let’s say we wanted to loop through all of the links to these turtle types that we found from our search.

In [29]:
prefix = "https://content.codecademy.com/courses/beautifulsoup/"

In [30]:
turtle_links = soup.find_all("a")

In [31]:
links = []
for a in turtle_links:
    links.append(prefix+a["href"])

In [32]:
links

['https://content.codecademy.com/courses/beautifulsoup/aesop.html',
 'https://content.codecademy.com/courses/beautifulsoup/caesar.html',
 'https://content.codecademy.com/courses/beautifulsoup/sulla.html',
 'https://content.codecademy.com/courses/beautifulsoup/spyro.html',
 'https://content.codecademy.com/courses/beautifulsoup/zelda.html',
 'https://content.codecademy.com/courses/beautifulsoup/bandicoot.html',
 'https://content.codecademy.com/courses/beautifulsoup/hal.html',
 'https://content.codecademy.com/courses/beautifulsoup/mock.html',
 'https://content.codecademy.com/courses/beautifulsoup/sparrow.html']
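
String concatenation works here because every href is a bare file name. As a hedged alternative, the standard library's `urljoin` resolves links against the page URL and also handles root-relative paths. The second href below is a hypothetical example, not one found on the page.

```python
from urllib.parse import urljoin

page_url = "https://content.codecademy.com/courses/beautifulsoup/shellter.html"

# The first href appears on the page; the second is a made-up root-relative link.
hrefs = ["aesop.html", "/courses/beautifulsoup/caesar.html"]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```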

In [33]:
turtle_data = {}

In [34]:
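for link in links:
    # Download and parse each turtle's page
    webpage = requests.get(link)
    turtle = BeautifulSoup(webpage.content, "html.parser")
    # Take the first element with class "name"; note that this is a Tag, not a string
    turtle_name = turtle.select(".name")[0]
    turtle_data[turtle_name] = []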

In [35]:
turtle_data

{<p class="name">Aesop</p>: [],
 <p class="name">Caesar</p>: [],
 <p class="name">Sulla</p>: [],
 <p class="name">Spyro</p>: [],
 <p class="name">Zelda</p>: [],
 <p class="name">Bandicoot</p>: [],
 <p class="name">Hal</p>: [],
 <p class="name">Mock</p>: [],
 <p class="name">Sparrow</p>: []}

##### 11. Reading Text
- When we use BeautifulSoup to select HTML elements, we often want to grab the text inside of the element, so that we can analyze it. We can use .get_text() to retrieve the text inside of whatever tag we want to call it on.
- If we want to separate out the texts from different tags, we can specify a separator character. For example, soup.get_text("|") uses a | character to separate them.

In [50]:
soup.find("p")

<p class="text">Click to learn more about each turtle</p>

In [51]:
soup.find("p").get_text()

'Click to learn more about each turtle'

In [37]:
soup.get_text("|")

'\n|\n|\n|\n|Turtle Shellter|\n|\n|\n|\n|\n|\n|The Shellter|\n|The #1 Turtle Adoption website!|\n|\n|\n|Click to learn more about each turtle|\n|\n|\n|\n|\n|Aesop|\n|\n|\n|\n|Caesar|\n|\n|\n|\n|Sulla|\n|\n|\n|\n|Spyro|\n|\n|\n|\n|Zelda|\n|\n|\n|\n|Bandicoot|\n|\n|\n|\n|Hal|\n|\n|\n|\n|Mock|\n|\n|\n|\n|Captain Sparrow|\n|\n|\n|\n|\n'

In [44]:
soup.get_text("|").split("|")

['\n',
 '\n',
 '\n',
 '\n',
 'Turtle Shellter',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'The Shellter',
 '\n',
 'The #1 Turtle Adoption website!',
 '\n',
 '\n',
 '\n',
 'Click to learn more about each turtle',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 'Aesop',
 '\n',
 '\n',
 '\n',
 '\n',
 'Caesar',
 '\n',
 '\n',
 '\n',
 '\n',
 'Sulla',
 '\n',
 '\n',
 '\n',
 '\n',
 'Spyro',
 '\n',
 '\n',
 '\n',
 '\n',
 'Zelda',
 '\n',
 '\n',
 '\n',
 '\n',
 'Bandicoot',
 '\n',
 '\n',
 '\n',
 '\n',
 'Hal',
 '\n',
 '\n',
 '\n',
 '\n',
 'Mock',
 '\n',
 '\n',
 '\n',
 '\n',
 'Captain Sparrow',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']
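
Most of the entries above are whitespace. A minimal sketch on a made-up banner snippet shows two ways to drop them: the `strip=True` argument of .get_text() and the `.stripped_strings` generator.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="banner">\n'
    '<h1>The Shellter</h1>\n'
    '<span>The #1 Turtle Adoption website!</span>\n'
    '</div>'
)
demo = BeautifulSoup(html, "html.parser")

# strip=True trims each piece of text and skips the empty ones.
joined = demo.get_text("|", strip=True)
print(joined)  # The Shellter|The #1 Turtle Adoption website!

# stripped_strings yields the same pieces one at a time.
pieces = list(demo.stripped_strings)
print(pieces)  # ['The Shellter', 'The #1 Turtle Adoption website!']
```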

In [41]:
turtle.select(".name")[0]

<p class="name">Sparrow</p>

In [42]:
turtle.select(".name")[0].get_text()

'Sparrow'

In [61]:
turtle_data

{<p class="name">Aesop</p>: [],
 <p class="name">Caesar</p>: [],
 <p class="name">Sulla</p>: [],
 <p class="name">Spyro</p>: [],
 <p class="name">Zelda</p>: [],
 <p class="name">Bandicoot</p>: [],
 <p class="name">Hal</p>: [],
 <p class="name">Mock</p>: [],
 <p class="name">Sparrow</p>: []}
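
The keys of this dictionary are still Tag objects. If plain strings are wanted as keys, one option is to call .get_text() before storing the name. The sketch below uses hand-written fragments as stand-ins for the downloaded pages, and `names_to_data` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the downloaded turtle pages.
pages = [
    '<p class="name">Aesop</p>',
    '<p class="name">Caesar</p>',
]

names_to_data = {}
for html in pages:
    turtle_page = BeautifulSoup(html, "html.parser")
    # Store the text of the name element rather than the Tag itself.
    name = turtle_page.select_one(".name").get_text()
    names_to_data[name] = []

print(names_to_data)  # {'Aesop': [], 'Caesar': []}
```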

##### 12. Create DataFrame
- You can use Pandas’ .from_dict() method to transform a dictionary into a Pandas DataFrame.

In [62]:
import pandas as pd

In [70]:
turtle_df = pd.DataFrame.from_dict(turtle_data, orient="index")

In [71]:
turtle_df

(Output: a DataFrame with no columns, indexed by the nine entries Aesop, Caesar, Sulla, Spyro, Zelda, Bandicoot, Hal, Mock, Sparrow.)
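
With `orient="index"`, each dictionary key becomes a row label and each list becomes that row's values. The sketch below assumes string keys and uses made-up values and column names, since the details of the turtle pages are not reproduced here.

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = {
    "Aesop": ["7 Years Old", "6 lbs"],
    "Caesar": ["2 Years Old", "4 lbs"],
}

# Keys become the index; the columns argument names the list positions.
sample_df = pd.DataFrame.from_dict(sample, orient="index", columns=["age", "weight"])
print(sample_df)
```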


##### Extra Examples

In [72]:
soup = BeautifulSoup("""
<h1>Syllabus</h1>
<div><h3>Unit 1: Variables</h3><p>Learn the basics!</p></div>
<div><h3>Unit 2: Loops</h3> <p>Repeat stuff!</p></div>
<div><h3>Unit 3: Review</h3></div>
""", "html.parser")
 
for child in soup.div.children:
  print(type(child))
 

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


In [73]:
soup = BeautifulSoup("<div class='tweet'><span>New year, new me. </span></div><div id='user'><p>MirandaRights</p></div>", "html.parser")
 
print(soup.div.get_text())

New year, new me. 


In [74]:
soup = BeautifulSoup("<div class='tweet'><span>New year, new me. </span></div><div id='user'><p>MirandaRights</p></div>", "html.parser")
 
print(soup.find(id="user").get_text())
 

MirandaRights
