#  Beautiful Soup, so rich and green, waiting in a hot tureen!

(*The Lobster Quadrille*, Alice in Wonderland)

We are now ready to start scraping web pages. To do so, we are going to use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python package for parsing web pages you have already retrieved. Normally you would use `requests` to GET the page and then `BeautifulSoup` to analyse it.


In [None]:
import bs4

We start by opening the page and converting it to a `soup` object. Then we use the `find` method to find the page's `<title>` tag and print it.

In [None]:
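with open("../../data/classdata/company_search/Amazon_query.html", "r", encoding="utf-8") as infile:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs4.BeautifulSoup(infile.read(), "html.parser")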

#The soup is the entire page
soup
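
If you want to experiment without the saved file, the same conversion works on a string. The following is a minimal sketch; the inline HTML and the name `demo_soup` are made up for illustration and are not part of the class data.

```python
import bs4

# Hypothetical inline HTML standing in for a saved page
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python
demo_soup = bs4.BeautifulSoup(html, "html.parser")

print(type(demo_soup))   # the soup object itself
print(demo_soup.title)   # shortcut for the first <title> tag
```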

In [None]:
#There are a number of different functions of a soup
dir(soup)

In [None]:
#We're going to start with the `find` function. It will find the first tag of the given type.
title = soup.find('title')
print(title)

Note that `title` is the entire HTML tag. If we want only the text within it, we need to ask for its `text` attribute.

In [None]:
title.text

The reason for this is that Beautiful Soup converts HTML tags into its own `Tag` objects. `Tag` objects have many useful attributes.

In [None]:
print(type(title))
print(title.text) # The text gives you the visible part of the tag
print(title.name) # The type of tag

If a tag has any HTML attributes, they can be accessed in a very "pythonic" way. That is, they are organized as a dictionary!

In [None]:
table = soup.find("table")

print(table.attrs)
print(table["class"])
print(table["summary"])
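
Two details of this dictionary-style access are easy to miss, so here is a minimal sketch on a made-up table tag (the HTML and the name `demo_table` are hypothetical). The `class` attribute comes back as a list because a tag may carry several classes, and `get` is a safe way to ask for an attribute that may be absent.

```python
import bs4

# Hypothetical table tag with two classes and a summary attribute
html = '<table class="tableFile2 wide" summary="Results"><tr><td>x</td></tr></table>'
demo_table = bs4.BeautifulSoup(html, "html.parser").find("table")

print(demo_table["class"])    # a list, because class is multi-valued
print(demo_table["summary"])  # a plain string
print(demo_table.get("id"))   # None: get does not raise for a missing attribute
# demo_table["id"] would raise KeyError, just like a regular dictionary
```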

Instead of searching for `Tags` one by one, we can also retrieve them all at once. As an example, let's find all `<div>` tags. To this end, we use the `find_all` method.

In [None]:
divs = soup.find_all('div')

print(divs)

Too much information! To get only the information we need, we restrict the output to the desired attribute.

In [None]:
for div in divs:
    print(div.text[:100])
    print('------------')
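
Another way to keep the output manageable is to cap the number of results. The sketch below uses the `limit` argument of `find_all` on a made-up snippet; the HTML and variable names are assumptions for illustration.

```python
import bs4

# Hypothetical page with three <div> tags
html = "<div>first</div><div>second</div><div>third</div>"
demo = bs4.BeautifulSoup(html, "html.parser")

all_divs = demo.find_all("div")            # every match, in document order
first_two = demo.find_all("div", limit=2)  # stop after two matches

print(len(all_divs), len(first_two))
print([d.text for d in first_two])
```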

Another useful `Tag` is `<a>`, which marks the links to other web pages that our page points to. It also demonstrates some other useful attributes:

In [None]:
links = soup.find_all('a')

for link in links[:10]:  # Showing just the first 10 links for brevity
    # href represents the target of the link
    # Where the link actually goes to!
    print('-----', link.text)
    print(link.get('href'))
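
Links on a page are often relative, so the `href` value alone may not be a complete address. A minimal sketch of resolving them with the standard library's `urljoin` follows; the base URL and the HTML are hypothetical, and you would substitute the address the page was actually downloaded from.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical address of the page and a made-up set of links
base_url = "https://example.com/search/results.html"
html = '<a href="/about">About</a><a href="page2.html">Next</a><a>No target</a>'
demo = bs4.BeautifulSoup(html, "html.parser")

resolved = []
for a in demo.find_all("a"):
    href = a.get("href")  # None when the tag has no href attribute
    if href is None:
        continue
    resolved.append((a.text, urljoin(base_url, href)))

for text, url in resolved:
    print(text, url)
```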
    

### Searching using attribute information

Some `Tag` elements have attributes associated with them. These include `id`, `class`, and `href`. Our search can restrict results to tags whose attribute has a specific value, or to tags where the attribute is simply present.

Note that in a search we must write `class_` instead of `class`, because `class` is a reserved keyword in Python.

In [None]:
# Retrieve the element with the attribute "id" equal to "breadCrumbs"
tag = soup.find(id="breadCrumbs")
print(tag)
print(tag.text)

In [None]:
# Retrieve all elements with an href attribute
all_links = soup.find_all(href=True)
print(len(all_links))

In [None]:
# Retrieve all elements with the class "companyMatch"
soup.find_all(class_="companyMatch")

In [None]:
# Retrieve all tags with class "blueRow" and no id attribute
soup.find_all(attrs={"class": "blueRow", "id": False})
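
To see what each of these searches matches, it can help to run them on a page small enough to read at a glance. The following sketch reuses the same attribute names on made-up HTML; the content and the variable names are assumptions for illustration.

```python
import bs4

# Hypothetical page: one breadcrumb div and three table rows
html = """
<div id="breadCrumbs"><a href="/home">Home</a></div>
<table>
  <tr class="blueRow" id="first"><td>A</td></tr>
  <tr class="blueRow"><td>B</td></tr>
  <tr><td>C</td></tr>
</table>
"""
demo = bs4.BeautifulSoup(html, "html.parser")

crumbs = demo.find(id="breadCrumbs")         # attribute equal to a value
with_href = demo.find_all(href=True)         # attribute present, any value
blue_rows = demo.find_all(class_="blueRow")  # class_ avoids the reserved word
blue_no_id = demo.find_all(attrs={"class": "blueRow", "id": False})  # id absent

print(crumbs.text)
print(len(with_href), len(blue_rows))
print([row.td.text for row in blue_no_id])
```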

# Navigation

### Navigating the HTML tree 


Besides being able to search for elements anywhere in the HTML tree, Beautiful Soup also allows you to navigate the tree in any direction.

Let's navigate to the first result row in the table.



In [None]:
table = soup.find(class_="tableFile2")
print(table)
print(table.text)
table.contents

The `contents` attribute lets us access everything that is inside a given tag, as a list of its direct children. These children can be other `Tag` objects or plain strings, including the whitespace between tags.
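
A minimal sketch of what `contents` holds, using a made-up two-row table (the HTML and names are hypothetical): the line breaks between the rows show up as string children next to the `<tr>` tags.

```python
import bs4

# Hypothetical table with line breaks between the rows
html = "<table>\n<tr><td>A</td></tr>\n<tr><td>B</td></tr>\n</table>"
demo_table = bs4.BeautifulSoup(html, "html.parser").find("table")

# Each direct child is either a Tag or a string
for child in demo_table.contents:
    print(type(child).__name__, repr(child))

# Keep only the children that are real tags
rows = [child for child in demo_table.contents if isinstance(child, bs4.Tag)]
print(len(demo_table.contents), len(rows))
```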

Looking at the webpage snippet, we see that the tag `<tr>` is at the same level as all other `<tr>` tags, and they are all children of the table object.

In [None]:
tr_row = table.find('tr')

In [None]:
dead_row = tr_row.next_sibling
dead_row

In [None]:
data_row = dead_row.next_sibling
print(data_row.name)

In [None]:
data_row

So we had to traverse two siblings to get to the data row. The reason is that some of the siblings in the soup are not actual HTML elements; they can be plain strings, such as the line breaks between tags.

In [None]:
three_steps = data_row.next_sibling
print(three_steps.name)

In [None]:
four_steps = three_steps.next_sibling
print(four_steps.name)

In [None]:
print(four_steps.contents)

Besides the `next_sibling` attribute, there are also `previous_sibling`, `parent`, and `children`, as well as search methods such as `find_next_sibling`, `find_previous_sibling`, and `find_parent`, among many others.

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these methods. There is no need to memorize all of them. It's more important to realize that, as with any programming language, there is more than one way to get to any element of the HTML tree. The trick is to *pick a good starting point* from which to start scraping.
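
As one example of a different route to the same element, the sketch below jumps from one row straight to the next `<tr>` with `find_next_sibling`, skipping the whitespace strings in between. The HTML, the `id` values, and the variable names are made up for illustration.

```python
import bs4

# Hypothetical table: a header row and one data row separated by a line break
html = "<table>\n<tr id='header'><th>Name</th></tr>\n<tr id='row1'><td>Example Corp</td></tr>\n</table>"
demo = bs4.BeautifulSoup(html, "html.parser")
header = demo.find("tr")

print(repr(header.next_sibling))  # the line break, not a tag

next_row = header.find_next_sibling("tr")          # next sibling that is a <tr>
print(next_row["id"])
print(next_row.find_previous_sibling("tr")["id"])  # and back again
print(next_row.parent.name)                        # one level up the tree
```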

# Scraping images

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [None]:
# Some modules that will allow us to display images and other media in the notebook itself
from IPython.display import display, Image

In [None]:
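with open("../../data/erik_durm_wiki.html", "r", encoding="utf-8") as infile:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs4.BeautifulSoup(infile.read(), "html.parser")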

for image in soup.find_all('img'):
    print(image)

We can pinpoint a specific image and get its attributes.

In [None]:
images = soup.find_all('img')
img0 = images[0]
print(img0.attrs)

Then we can display the image using its `src` attribute.

In [None]:
display(Image(url='../../data/' + img0['src']))

display(Image(url='../../data/' + images[1]['src']))
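
If you only need the image sources rather than the rendered pictures, you can collect them into a list. This is a minimal sketch on made-up HTML (the file names and variable names are hypothetical); it uses `get` for `alt` because not every image tag carries that attribute.

```python
import bs4

# Hypothetical snippet with two images, only one of which has alt text
html = '<img src="img/photo.jpg" alt="Portrait" width="220"><img src="img/logo.png">'
demo = bs4.BeautifulSoup(html, "html.parser")

image_info = [
    {"src": img["src"], "alt": img.get("alt", "")}
    for img in demo.find_all("img", src=True)  # skip images without a src
]

for info in image_info:
    print(info)
```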

