# Parsing a Page with BeautifulSoup

In [1]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page
page.status_code
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [2]:
from bs4 import BeautifulSoup

In [3]:
soup = BeautifulSoup(page.content,'html.parser')

In [4]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [6]:
list(soup.children)

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [8]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
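
The type list above shows that the whitespace between tags is returned as NavigableString items alongside the tags themselves. A minimal sketch of keeping only the Tag children follows; it assumes an inline copy of the simple page (the names `html_doc` and `demo_soup` are illustrative and not part of the original notebook) so that it runs without a network request.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page (assumption: same markup as the downloaded page).
html_doc = (
    "<!DOCTYPE html>\n<html>\n<head>\n<title>A simple example page</title>\n</head>\n"
    "<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>"
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag objects; the Doctype and the newline strings are filtered out.
tag_children = [child for child in demo_soup.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(tag_names)
```

Filtering this way avoids relying on fixed positions such as index 2, which shift if the whitespace in the page changes.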

The following code shows how to navigate a page:

In [9]:
html = list(soup.children)[2]
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [10]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [11]:
body = list(html.children)[3]
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [12]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [13]:
p = list(body.children)[1]
p

<p>Here is some simple content for this page.</p>

In [14]:
p.get_text()

'Here is some simple content for this page.'
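
The step-by-step descent through children can usually be shortened, because BeautifulSoup lets a tag name be used as an attribute to reach the first descendant with that name. The sketch below assumes an inline HTML string (`html_doc` and `demo_soup` are illustrative names) rather than the downloaded page.

```python
from bs4 import BeautifulSoup

# Inline markup standing in for the simple example page (assumption).
html_doc = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Each attribute access returns the first descendant tag with that name.
first_p = demo_soup.html.body.p
text = first_p.get_text()
print(text)
```

This reaches the same 'p' tag without indexing into lists of children, so it is not affected by newline strings between tags.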

The following code shows how to find all instances of a tag at once:

In [15]:
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

The find_all method returns a list, so we have to loop through it or use list indexing to extract text:

In [16]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
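
The cell above uses list indexing; looping is the other option and is more useful when a page has several matching tags. A small sketch, assuming a hypothetical two-paragraph snippet (`html_doc` is not from the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs, used only for illustration.
html_doc = "<body><p>First.</p><p>Second.</p></body>"
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Loop over every matching tag and collect its text.
paragraph_texts = [p.get_text() for p in demo_soup.find_all("p")]
print(paragraph_texts)
```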

To find only the first instance of a tag, use the find method:

In [17]:
soup.find('p')

<p>Here is some simple content for this page.</p>
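
Unlike find_all, find returns a single tag, and it returns None when nothing matches, so calling get_text on the result directly can fail. The sketch below shows one way to guard against that; the markup and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single paragraph and no table.
demo_soup = BeautifulSoup("<body><p>Only paragraph.</p></body>", "html.parser")

first_p = demo_soup.find("p")      # a Tag
missing = demo_soup.find("table")  # None, because there is no <table>

# Check for None before extracting text.
first_text = first_p.get_text() if first_p is not None else None
missing_text = missing.get_text() if missing is not None else None
print(first_text, missing_text)
```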

# Searching for Tags by Class and ID

In [18]:
pages = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
pages
soup = BeautifulSoup(pages.content,'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Searching for 'p' tags that have the class 'outer-text':

In [19]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

Searching for any tag that has the class 'outer-text':

In [20]:
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

Searching for elements by id:

In [21]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
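
Two details of these searches are worth seeing in isolation: class_ matches a tag if any one of its classes equals the given value, and an id normally identifies a single element, so find is often enough. The sketch below assumes a trimmed inline version of the ids_and_classes page (`html_doc` and `demo_soup` are illustrative names).

```python
from bs4 import BeautifulSoup

# Trimmed inline version of the ids_and_classes page (assumption).
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches one class out of several, so both "first-item" paragraphs are found.
first_items = demo_soup.find_all("p", class_="first-item")
first_item_ids = [p["id"] for p in first_items]

# Look up a single element by id and strip the surrounding whitespace from its text.
second = demo_soup.find(id="second")
second_text = second.get_text(strip=True)
print(first_item_ids, second_text)
```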

# Examples of CSS Selectors
1. p a — finds all a tags inside of a p tag.
2. body p a — finds all a tags inside of a p tag inside of a body tag.
3. html body — finds all body tags inside of an html tag.
4. p.outer-text — finds all p tags with a class of outer-text.
5. p#first — finds all p tags with an id of first.
6. body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

In BeautifulSoup, we can search a page with CSS selectors using the "select" method. The following example finds all 'p' tags in our page that are inside of a div tag:

In [22]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
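
The selector patterns listed earlier can be tried the same way. The sketch below counts the matches for each one against a hypothetical snippet; the markup, including the 'a' tag, is an assumption made for illustration and does not come from the example page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a link, an id, and two outer paragraphs.
html_doc = """
<html><body>
<div><p class="inner-text" id="first">First <a href="#top">link</a></p></div>
<p class="outer-text">Outer one</p>
<p class="outer-text">Outer two</p>
</body></html>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

selectors = ["p a", "body p a", "html body", "p.outer-text", "p#first", "body p.outer-text"]

# Map each selector to the number of tags it matches.
match_counts = {sel: len(demo_soup.select(sel)) for sel in selectors}
for sel, count in match_counts.items():
    print(f"{sel!r}: {count}")
```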

Note: Like "find_all", the "select" method returns a list of matching tags. The "find" method differs: it returns only the first matching tag, or None if there is no match.
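
When only the first match for a CSS selector is needed, BeautifulSoup also provides select_one. A minimal sketch, assuming a small inline snippet (the markup and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two paragraphs inside a div.
demo_soup = BeautifulSoup(
    "<div><p>First paragraph.</p><p>Second paragraph.</p></div>", "html.parser"
)

all_matches = demo_soup.select("div p")      # every match, as a list
first_match = demo_soup.select_one("div p")  # first match only
no_match = demo_soup.select_one("div a")     # None, since the div has no <a>

print(len(all_matches), first_match.get_text(), no_match)
```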