# Scraping with Beautiful Soup

### Introduction

In this lesson, we'll see how to use the Beautiful Soup library to scrape a website.  Beautiful Soup does not fetch pages or interact with a website itself, but it makes the HTML we download easy to parse.

### Beautiful Soup Basics

With Beautiful Soup, we first use the Python `requests` library to get some HTML.  Then we pass this HTML into Beautiful Soup to select specific elements.  Let's see a quick example, and then we'll use this in a lab.

Let's practice by trying to get the title of the Wikipedia page for The Beatles.

<img src="./beatles.png" width="60%">

Here's how we do so.  First, we make a request to the appropriate URL, then we get back the HTML, and finally we pass that HTML into Beautiful Soup so that we can search through it.

In [1]:
import requests

In [16]:
response = requests.get('https://en.wikipedia.org/wiki/The_Beatles')

html = response.text
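
The request above assumes the page loads successfully. As a minimal sketch, the same two steps can be wrapped in a helper that fails loudly when the request goes wrong. The name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the lesson.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, or raise if the request failed.

    `fetch_html` and its default timeout are hypothetical choices for this sketch.
    """
    # A timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this in place, `html = fetch_html('https://en.wikipedia.org/wiki/The_Beatles')` would give the same string as `response.text` above.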

Now if we look at what we captured in the `html` variable, it is just one long string.

In [5]:
html[:50]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en'

In [6]:
type(html)

str

Obviously, this would be no fun to search through.  However, we can pass this string into Beautiful Soup, and then search the parsed HTML by tag name and attributes.

In [7]:
!pip install bs4 --quiet

Then we can import the `BeautifulSoup` class as `bs`, and specify `html.parser` as the parser.

In [21]:
from bs4 import BeautifulSoup as bs
soup = bs(response.text, 'html.parser')

Now our `soup` variable is a Beautiful Soup object that can be searched.

So, for example, if we would like to find the `h1` element that has a class of `firstHeading`, we can do so with the following.

In [25]:
soup.findAll('h1', {'class': 'firstHeading'})

[<h1 class="firstHeading" id="firstHeading">The Beatles</h1>]
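
For comparison, here is a small sketch of the same lookup written three ways. It runs on a short hand-written HTML string (an assumption, standing in for the Wikipedia page) so it does not need a network connection. `find_all` is the snake_case spelling of `findAll`, `select` accepts a CSS selector string, and `find` returns only the first match.

```python
from bs4 import BeautifulSoup

# A small stand-in for the downloaded page (sample markup, not the real page).
sample_html = """
<html><body>
  <h1 class="firstHeading" id="firstHeading">The Beatles</h1>
  <h1 class="other">Not the title</h1>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tag name plus a dictionary of attributes, as in the lesson.
by_attrs = sample_soup.find_all('h1', {'class': 'firstHeading'})

# The same element, written as a CSS selector.
by_css = sample_soup.select('h1.firstHeading')

# find returns the first matching element, or None when nothing matches.
first = sample_soup.find('h1', {'class': 'firstHeading'})
missing = sample_soup.find('h2')

print(by_attrs, by_css, first, missing)
```

Both `find_all` and `select` return a list, so we still index into the result (for example `by_attrs[0]`) to get a single element.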

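With the `findAll` method, we pass in two arguments.  The first is the `name` of the tag, which above is `h1`.  The second is a dictionary of attributes to match.  (`findAll` is an older alias of `find_all`, which is the name used in the current Beautiful Soup documentation.)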

HTML attributes are the `name="value"` items that appear inside a tag.
> Notice that they come in key-value pairs, just like a Python dictionary.

Suppose we would like to select the HTML below.

<img src="./td-span.png" width="60%">

We can try to do so through the `colspan` attribute.  Doing so would look like the following:

In [30]:
cols = soup.findAll('td', {'colspan': '2'})

In [31]:
cols[0]

<td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:The_Fabs.JPG" title="A square quartered into four head shots of young men with moptop haircuts. All four wear white shirts and dark coats."><img alt="A square quartered into four head shots of young men with moptop haircuts. All four wear white shirts and dark coats." data-file-height="1110" data-file-width="1110" decoding="async" height="220" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/The_Fabs.JPG/220px-The_Fabs.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/The_Fabs.JPG/330px-The_Fabs.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/df/The_Fabs.JPG/440px-The_Fabs.JPG 2x" width="220"/></a><div class="infobox-caption">The Beatles in 1964; clockwise from top left: <a href="/wiki/John_Lennon" title="John Lennon">John Lennon</a>, <a href="/wiki/Paul_McCartney" title="Paul McCartney">Paul McCartney</a>, <a href="/wiki/Ringo_Starr" title="Ringo Starr">Ringo Starr</a> and <a href

As you can see, we can match on any key-value pair that we see in our HTML.
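
To make the key-value idea concrete, the sketch below parses a single hand-written `td` tag (sample markup, assumed for illustration) and shows that Beautiful Soup stores its attributes in a dictionary, any entry of which can be used as a search filter.

```python
from bs4 import BeautifulSoup

# Sample markup with two attributes; the id value is made up for this sketch.
sample_html = '<table><tr><td colspan="2" id="photo">photo</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

cell = sample_soup.find('td')

# .attrs holds the tag's attributes as a plain dictionary.
print(cell.attrs)

# Any of those pairs can be used to filter a search.
by_colspan = sample_soup.find_all('td', {'colspan': '2'})
by_id = sample_soup.find_all('td', {'id': 'photo'})

# A value that does not appear in the HTML gives an empty list, not an error.
no_match = sample_soup.find_all('td', {'colspan': '3'})
```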

Once we have selected an outer piece of HTML, we can search within it for more detailed information.  For example, we can find the link for John Lennon inside the first cell.

In [36]:
john_lennon_hrefs = cols[0].findAll('a', {'title': 'John Lennon'})
john_lennon_hrefs

[<a href="/wiki/John_Lennon" title="John Lennon">John Lennon</a>]
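
The benefit of searching inside a selection is that matches elsewhere on the page are ignored. Here is a minimal sketch of that behavior, using a hand-written table (sample markup, assumed for illustration) in which the same link appears in two different cells.

```python
from bs4 import BeautifulSoup

# Sample markup: the John Lennon link appears in two separate cells.
sample_html = """
<table>
  <tr><td colspan="2">
    <a href="/wiki/John_Lennon" title="John Lennon">John Lennon</a>
    <a href="/wiki/Paul_McCartney" title="Paul McCartney">Paul McCartney</a>
  </td></tr>
  <tr><td>
    <a href="/wiki/John_Lennon" title="John Lennon">Lennon (elsewhere)</a>
  </td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Step 1: select the outer cell.
outer = sample_soup.find_all('td', {'colspan': '2'})[0]

# Step 2: search only inside that cell.
inner = outer.find_all('a', {'title': 'John Lennon'})

# Searching the whole document finds the link in both cells.
whole_page = sample_soup.find_all('a', {'title': 'John Lennon'})

print(len(inner), len(whole_page))
```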

Once we find the correct element, we can access its text and its attributes directly.

In [37]:
john_lennon_hrefs[0].text

'John Lennon'

In [43]:
john_lennon_hrefs[0]['href']

'/wiki/John_Lennon'

In [44]:
john_lennon_hrefs[0]['title']

'John Lennon'
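
Two details are worth a quick sketch here. Square-bracket access raises a `KeyError` when an attribute is missing, while `.get()` returns `None` instead. Also, the `href` we retrieved is relative, so it has to be joined to the page URL before it can be requested. The example uses a hand-written link (sample markup) and `urljoin` from the standard library, neither of which comes from the lesson itself.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# A one-link stand-in for the element selected above (sample markup).
sample_html = '<a href="/wiki/John_Lennon" title="John Lennon">John Lennon</a>'
link = BeautifulSoup(sample_html, 'html.parser').find('a')

# Square brackets raise KeyError when the attribute is absent...
try:
    link['rel']
    has_rel = True
except KeyError:
    has_rel = False

# ...while .get() returns None instead.
rel = link.get('rel')

# The href is relative, so join it to the page URL to get a full address.
page_url = 'https://en.wikipedia.org/wiki/The_Beatles'
full_url = urljoin(page_url, link['href'])
print(full_url)
```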

### Summary

In this lesson, we saw how we can use the Beautiful Soup library.  We saw that we can do so by first making a request to the relevant URL, and then extracting the text.

In [45]:
response = requests.get('https://en.wikipedia.org/wiki/The_Beatles')

html = response.text

This returns a string, which we can then pass into Beautiful Soup to more easily parse.

In [47]:
from bs4 import BeautifulSoup as bs
soup = bs(html, 'html.parser')

And from there, we can use the `findAll` method, which takes both the tag name and key-value pairs of any attributes in the HTML.

In [49]:
cols = soup.findAll('td', {'colspan': '2'})

And we can extract specific attribute values from the selected HTML by specifying the relevant key.

In [51]:
cols[0]['class']

['infobox-image']
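
Notice that the `class` value came back as a list rather than a string. An HTML element can carry several classes, so Beautiful Soup splits the `class` attribute into a list, while most other attributes stay plain strings. A minimal sketch, using a hand-written tag with a second, made-up class named `wide`:

```python
from bs4 import BeautifulSoup

# Sample markup; the second class name "wide" is invented for this sketch.
sample_html = '<table><tr><td class="infobox-image wide" colspan="2">photo</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')
cell = sample_soup.find('td')

classes = cell['class']      # a list, one entry per class
colspan = cell['colspan']    # a plain string

# Check for a class with `in` rather than comparing against a string.
has_class = 'infobox-image' in classes

# Searching by a single class matches tags that also carry other classes.
matches = sample_soup.find_all('td', {'class': 'wide'})

print(classes, colspan, has_class, len(matches))
```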

Or if we prefer, we can also retrieve the text.

In [52]:
cols[0].text

'The Beatles in 1964; clockwise from top left: John Lennon, Paul McCartney, Ringo Starr and George Harrison'