This notebook is a very basic web scraping primer.
(Credits to DataQuest: https://www.dataquest.io/blog/web-scraping-tutorial-python/)

In [1]:
import requests

In [4]:
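# Download the example page; `page` is a requests.Response object
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
# Show the raw HTML of the page as bytes
page.content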

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
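
Before parsing, it can be worth confirming that the download actually succeeded. The sketch below wraps the request in a small helper. The name `fetch_html`, its `get` parameter, and the 10-second timeout are illustrative assumptions, not part of the original notebook; `raise_for_status` is the requests method that raises an exception for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download `url` and return the raw HTML as bytes.

    `get` defaults to requests.get; it is exposed as a parameter only so
    the helper can be tried without network access (an assumption of this
    sketch, not something the notebook requires).
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of quietly
    # passing an error page on to the parser.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")` should return the same bytes as `page.content` above, provided the request succeeds.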

In [7]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [9]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [10]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

All of the items are BeautifulSoup objects. 

The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document (here, the newline after the doctype). The final item is a Tag object, which contains other nested tags.

The Tag object allows us to navigate through an HTML document, and extract other tags and text.

Select the html tag and its children by taking the third item in the list:

In [11]:
html = list(soup.children)[2]
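
Picking the third item by position works for this page, but the index shifts if the whitespace or doctype changes. As a hedged alternative, the sketch below keeps only the children that are Tag objects. The markup is the simple.html content inlined so the example runs offline, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same structure as simple.html, inlined so no network access is needed.
html_doc = (
    "<!DOCTYPE html>\n<html>\n<head>\n<title>A simple example page</title>\n"
    "</head>\n<body>\n<p>Here is some simple content for this page.</p>\n"
    "</body>\n</html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children; the Doctype and the newline string are skipped.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in top_level_tags]
print(tag_names)  # ['html']
```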

We can find the children inside the html tag:

In [12]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

We want to extract the text inside the p tag, so we'll dive into the body:

In [13]:
body = list(html.children)[3]

In [14]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

In [15]:
p = list(body.children)[1]

In [16]:
p.get_text()

'Here is some simple content for this page.'

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. To extract tags by name directly, we can instead use the find_all method, which returns a list of every instance of that tag on the page.

In [17]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]
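
Because find_all returns a list of tags, a common next step is to pull the text out of each one. The sketch below uses a small made-up document with two paragraphs (not one of the tutorial pages) to show the pattern.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, used only for illustration.
html_doc = "<html><body><p>First.</p><p>Second.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag, in document order.
paragraphs = soup.find_all("p")

# Extract the text of each tag with get_text().
texts = [p.get_text() for p in paragraphs]
print(texts)  # ['First.', 'Second.']
```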

If you instead only want to find the first instance of a tag, you can use the find method, which returns a single Tag object (or None if nothing matches):

In [18]:
soup.find('p')

<p>Here is some simple content for this page.</p>
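
Since find returns None when no tag matches, calling get_text on the result directly can raise an AttributeError. The sketch below shows one way to guard against that; `text_or_default` is a hypothetical helper name, and the markup is a condensed copy of the example page.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><body>"
    "<p>Here is some simple content for this page.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")      # a Tag object
missing = soup.find("table")  # None: this markup has no <table>


def text_or_default(tag, default=""):
    """Return the tag's text, or `default` when find() matched nothing."""
    return tag.get_text() if tag is not None else default


print(text_or_default(first_p))
print(text_or_default(missing, default="n/a"))
```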

**Searching for tags by class and id:**

In [19]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

The class_ argument restricts a search to tags that have a given class. Here we look for p tags with the class outer-text:

In [20]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
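
Note that the first result above carries two classes, outer-text and first-item. A class_ search matches a tag if any one of its classes equals the given value, and the class attribute itself is exposed as a list. The sketch below illustrates both points on a condensed, inlined copy of the ids_and_classes.html markup.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes.html markup, inlined to run offline.
html_doc = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Omitting the tag name searches every tag for the class.
first_items = soup.find_all(class_="first-item")
ids = [tag["id"] for tag in first_items]
print(ids)  # ['first', 'second']

# The class attribute is multi-valued, so it comes back as a list.
outer = soup.find_all("p", class_="outer-text")
outer_classes = [tag["class"] for tag in outer]
print(outer_classes)  # [['outer-text', 'first-item'], ['outer-text']]
```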

We can also search for elements by id:

In [21]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
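
An id is meant to be unique within a page, so find is often a natural fit when searching by id. The sketch below, again on an inlined fragment of the example markup, fetches the tag and reads its text and attributes. Passing strip=True to get_text trims the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Fragment of the ids_and_classes.html markup, inlined to run offline.
html_doc = """
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching Tag rather than a list.
first = soup.find(id="first")

first_text = first.get_text(strip=True)  # whitespace around the text removed
first_classes = first["class"]           # attributes are read like dict keys

print(first_text)     # First paragraph.
print(first_classes)  # ['inner-text', 'first-item']
```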

**Using CSS Selectors**

We can also search for items using CSS selectors, the patterns the CSS language uses to specify which HTML tags to style. Here are some examples:

**p a** — finds all a tags inside of a p tag.

**body p a** — finds all a tags inside of a p tag inside of a body tag.

**html body** — finds all body tags inside of an html tag.

**p.outer-text** — finds all p tags with a class of outer-text.

**p#first** — finds all p tags with an id of first.

**body p.outer-text** — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [22]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
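
Some of the other selectors listed above can be tried the same way. The sketch below runs a few of them against a condensed, inlined copy of the example markup. It also uses select_one, which returns only the first match (or None) instead of a list.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes.html markup, inlined to run offline.
html_doc = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# p.outer-text: all p tags with the class outer-text.
outer_texts = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# p#first: all p tags with the id first.
first_ids = [p["id"] for p in soup.select("p#first")]

# body p.outer-text b: b tags inside an outer-text p inside body.
bold_count = len(soup.select("body p.outer-text b"))

# select_one returns the first match only.
first_in_div = soup.select_one("div p")

print(outer_texts)         # ['First outer paragraph.', 'Second outer paragraph.']
print(first_ids)           # ['first']
print(bold_count)          # 2
print(first_in_div["id"])  # first
```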