# Web scraping with BeautifulSoup



### Import BeautifulSoup

First off, you will need to import the BeautifulSoup library. BeautifulSoup is not part of the Python standard library, so it needs to be installed separately (the package name on PyPI is `beautifulsoup4`).

In [1]:
# import beautiful soup library
from bs4 import BeautifulSoup

To work with BeautifulSoup, you first need some HTML. HTML can either be loaded from a locally stored file, or it can be 'requested' from a web server over HTTP.
To use the second approach, we will use another Python library called `requests`, which can make HTTP requests and handle the responses.

In [2]:
# import requests library
import requests

We can use the `get` function in the `requests` library to retrieve an HTTP response object. An HTTP request contains header fields which may give the server some additional information about the request. One of these fields is called `user-agent`, and it tells the server what software is making the request on behalf of the user. It may be a good idea to set this header so that the request looks to the server as though it is coming from a browser.

The response object has a property, `text`, which contains the HTML that was sent in the response.

In the following example, we can grab the Coursera homepage code. For these labs you will only be able to grab Coursera resources - if you want to explore other exercises you will have to run Jupyter on your own machine!

In [3]:
# set a user-agent to be sent with the request (uncomment to use)
#headers = {
#    "user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
#}
# request a resource from a specific URL. Change this for your chosen website.
# To send the headers above, pass them by keyword: requests.get(url, headers=headers)
r = requests.get("https://www.coursera.org/")

# put the text that is returned in the response in a variable
data = r.text

# look...some HTML has been sent in the response!
data
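As a minimal sketch of how the `user-agent` header is attached, the example below builds a request without sending it, so it can be run without network access. The shortened user-agent string is made up for illustration. Note that the headers dictionary must be passed by keyword: the second positional argument of `requests.get` is `params` (the query string), not `headers`.

```python
import requests

# Shortened, made-up user-agent string, used purely for illustration.
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64)"}
url = "https://www.coursera.org/"

# Build the request without sending it, so we can inspect what would go out.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])  # header names are case-insensitive

# To actually send the request, pass the dictionary by keyword:
# r = requests.get(url, headers=headers, timeout=10)
# r.raise_for_status()   # raise an exception on a 4xx/5xx response
# data = r.text
```

Calling `raise_for_status()` before reading `r.text` is optional, but it makes a blocked or failed request obvious instead of silently parsing an error page.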



The raw HTML is not very easy to work with, because it arrives as one long string of markup. It also contains lots of other material, such as CSS!

We need to 'parse' the HTML (i.e. split it into its component parts), which will make working with it much easier. For that we will create an object which is an instance of the BeautifulSoup class. The object is a special kind of data structure: it contains the HTML, but in a format we can work with.

In [4]:
from bs4 import BeautifulSoup
# parse the raw HTML into a 'soup' object
soup = BeautifulSoup(data, "html.parser")
soup

<!DOCTYPE html>
<html dir="ltr" itemtype="http://schema.org" lang="en" xmlns:fb="http://ogp.me/ns/fb#"><head><link crossorigin="" href="https://d3njjcbhbojbot.cloudfront.net" rel="preconnect"/><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta charset="utf-8"/><meta content="Coursera" property="og:site_name"/><meta content="727836538,4807654" property="fb:admins"/><meta content="823425307723964" property="fb:app_id"/><meta content="Coursera" name="twitter:site"/><meta content="Coursera" name="twitter:app:name:iphone"/><meta content="Coursera" name="twitter:app:name:ipad"/><meta content="Coursera" name="twitter:app:name:googleplay"/><meta content="id736535961" name="twitter:app:id:iphone"/><meta content="id736535961" name="twitter:app:id:ipad"/><meta content="org.coursera.android" name="twitter:app:id:googleplay"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="https://d3njjcbhbojbot.cloudfront.net/web/images/favicons/apple-touch-icon

Now that we have parsed the HTML, we can call methods of the BeautifulSoup class to access specific elements in the data.
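To experiment without making a network request, the same parsing step can be tried on a small HTML string. The markup below is made up for illustration and is not the real Coursera page; the variable names `sample_html` and `sample_soup` are ours.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration (not the Coursera page).
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 class="headline">Learn without limits</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The soup object behaves like a tree: tags can be reached by name.
print(sample_soup.title.string)   # the text inside <title>
print(sample_soup.h1["class"])    # class values are returned as a list
```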

### Extract a single element by tag name
For example, the `find` method will return the first available element with a specified tag name:

In [5]:
h1 = soup.find("h1")
print(h1)
# If you can't find the h1 element you might want to right click the page and view source. See if you can identify 
# an element you can scrape e.g. <p> tags <img> tags or links <a href="">

<h1 class="cds-119 css-i3qo6r cds-121">Learn without limits</h1>
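One thing to watch for: `find` returns `None` when no element matches, so calling a method on the result fails if the tag is missing. A small sketch on made-up HTML, showing one way to guard against that:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: it has an <h1> but no <table>.
sample_soup = BeautifulSoup("<h1>Learn without limits</h1><p>Hello</p>", "html.parser")

h1 = sample_soup.find("h1")
table = sample_soup.find("table")   # no match, so this is None

# Check for None before using the result.
heading_text = h1.get_text() if h1 is not None else "(no h1 found)"
table_text = table.get_text() if table is not None else "(no table found)"

print(heading_text)
print(table_text)
```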


### Extract all of a certain element by tag name
The `find_all` method will return all the elements of a certain type:

In [6]:
# get all the p elements
# e.g. for a table we could say
# table = soup.find_all("table")
text = soup.find_all("p")

# Not all web pages will have tables, p tags or classes. Markup is dependent on what the person building the 
# webpage decided should appear.
print(text)

[<p class="cds-119 css-16ln3yv cds-121">Start, switch, or advance your career with more than 5,400 courses, Professional Certificates, and degrees from world-class universities and companies.</p>, <p class="cds-119 css-kxjk3f cds-121">Oversee the planning and execution of projects to ensure they’re successful</p>, <p class="cds-119 css-1bm1tdc cds-121"><span>Job openings: <strong><span>396,314</span><sup>**</sup></strong></span></p>, <p class="cds-119 css-1bm1tdc cds-121"><span>Projected 10 year growth: <strong><span>+<span>10.2</span></span>%<sup>***</sup></strong></span></p>, <p class="cds-119 css-1h8vaqd cds-121">* Employment, Wages, and Projected Change in Employment by Typical Entry-level Education : U.S. Bureau of Labor Statistics. Sept. 2022, www.bls.gov/emp/tables/education-summary.htm 
** Median salary data (median with 0-2 years experience) and job opening data are sourced from United States Lightcast™ Job Postings Report. Data for job roles relevant to featured program
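The result of `find_all` behaves like a Python list of elements, so we can loop over it and pull out just the text. The sketch below uses two shortened paragraphs modelled on the output above rather than the live page.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the output above, for illustration only.
sample_html = """
<p class="intro">Start, switch, or advance your career.</p>
<p><span>Job openings: <strong>396,314</strong></span></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

paragraphs = sample_soup.find_all("p")
print(len(paragraphs))

# get_text() drops the tags; the separator is placed between nested pieces
# of text, and strip=True trims the whitespace around each piece.
texts = [p.get_text(" ", strip=True) for p in paragraphs]
for t in texts:
    print(t)

# When nothing matches, find_all returns an empty list rather than None.
tables = sample_soup.find_all("table")
print(len(tables))
```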

### Filter elements by attribute

HTML elements can have attributes. These are key-value pairs defined inside the opening tag. For example, a hyperlink (anchor) tag has an href attribute specifying the URL to link to:

        <a href="http://www.somewhereoutthere.com">This is not a real URL!</a>
        
We can be more specific about which elements to retrieve with `find_all` by including an attribute value:

In [7]:
# e.g. for a table we could extract all the th elements whose scope attribute has the value 'row':
# rows = table[0].find_all("th", scope="row")
# Let's get the element in our text variable at index 10
elements = text[10]
elements

<p data-e2e="degree-card-name"><a class="card-title-link" href="https://www.coursera.org/degrees/mcit-penn"><div style="overflow:hidden"><div>Master of Computer and Information Technology</div></div></a></p>
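A minimal sketch of filtering by attribute, using markup modelled on the element above. The second `<p>` and its attribute values are made up so that there is something to filter out.

```python
from bs4 import BeautifulSoup

# First <p> is modelled on the output above; the second is made up.
sample_html = """
<p data-e2e="degree-card-name"><a class="card-title-link" href="https://www.coursera.org/degrees/mcit-penn">Master of Computer and Information Technology</a></p>
<p data-e2e="other"><a class="plain" href="https://www.coursera.org/">Home</a></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# attrs takes a dictionary, which also works for attribute names such as
# data-e2e that are not valid Python keyword arguments.
cards = sample_soup.find_all("p", attrs={"data-e2e": "degree-card-name"})

# class is a reserved word in Python, so the keyword argument is class_.
links = sample_soup.find_all("a", class_="card-title-link")

# Attribute values are read with dictionary-style access.
hrefs = [a["href"] for a in links]
print(len(cards), hrefs)
```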

### Filter elements by contents
We may also decide which elements to extract based on their text contents. For example,

In [8]:
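# for a table you might extract all td elements containing the string 'Energy ':
# energy = table[0].find_all("td", string="Energy ")
# there is no table variable in this notebook, so we search the whole soup instead,
# using the text of the element we saw at text[10]
matches = soup.find_all("div", string="Master of Computer and Information Technology")
matches
# Can you do this for text[10] or one of the other elements in our text variable?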
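A point that is easy to miss is that a plain string passed to `string=` must match the element's text exactly, including any whitespace. The sketch below uses a made-up nutrition table to show an exact match, a near miss caused by a trailing space, and a regular expression for partial matches.

```python
import re
from bs4 import BeautifulSoup

# Made-up nutrition table, for illustration only.
sample_html = """
<table>
  <tr><th scope="row">Energy</th><td>250 kcal</td></tr>
  <tr><th scope="row">Salt</th><td>1.2 g</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the element's text.
exact = sample_soup.find_all("th", string="Salt")

# A trailing space means the strings differ, so nothing is found.
with_space = sample_soup.find_all("th", string="Salt ")

# A compiled regular expression matches anywhere in the text.
pattern = sample_soup.find_all("td", string=re.compile(r"kcal"))

print(len(exact), len(with_space), [td.get_text() for td in pattern])
```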

### Extract the next sibling element
We might want to get at the element next to another element.

For example, suppose we want the value contained in the `td` element that follows a particular `th`:

In [None]:
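# get the text from the next element after something
# for a table you might do energy[0].find_next("td").text
# (find_next is the current name for the older findNext)
# or, passing True to match the next tag of any name, we could do this
elements.find_next(True).text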
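Note that `find_next` walks forward through the document, so the 'next' element can be a child of the starting element, whereas `find_next_sibling` only looks at elements on the same level. A small sketch on a made-up one-row table shows the difference:

```python
from bs4 import BeautifulSoup

# Made-up one-row table, for illustration only.
sample_html = '<table><tr><th scope="row">Energy</th><td>250 kcal</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

th = sample_soup.find("th", string="Energy")
row = sample_soup.find("tr")

# The td is a sibling of the th, so both approaches find it.
value_by_sibling = th.find_next_sibling("td").get_text()
value_by_next = th.find_next("td").get_text()

# Starting from the row, find_next descends into the row's children...
first_after_row = row.find_next(True)
# ...whereas the row itself has no siblings at all.
sibling_of_row = row.find_next_sibling()

print(value_by_sibling, value_by_next, first_after_row.name, sibling_of_row)
```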

### Have a look at the rest of the page
Does this look easy to scrape? Are there any elements that look confusing or strange?

Select an element on the page and write a line of code that is able to scrape that element. Post your code on the discussion forums and explain how it works. Comment on a post from one of your colleagues and see if you can replicate their web scraping exercise.

You will notice that the page uses a lot of JavaScript to generate what we describe as 'dynamic content'. Post your thoughts on the discussion forums regarding:
<ul><li>What do you think the JavaScript is doing?</li>
    <li>Why do you think it is used/useful?</li>
    <li>What challenges might you face in web scraping dynamic content?</li>
    </ul>
Reply to a post from one of your colleagues and see if you agree with their findings.
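As a starting point for that discussion, here is a small sketch of why dynamic content is awkward to scrape. BeautifulSoup only parses the HTML it is given and does not run JavaScript, so anything a script would add in the browser is missing from the soup. The page below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page whose visible content would be filled in by JavaScript.
sample_html = """
<div id="app"></div>
<script>
  document.getElementById("app").textContent = "Loaded by JavaScript";
</script>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The div is empty, because the script has never been executed.
app = sample_soup.find("div", id="app")
print(repr(app.get_text()))

# The script's source code is in the soup, but only as unexecuted text.
script = sample_soup.find("script")
print("Loaded by JavaScript" in script.string)
```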

## Further reading
For a more detailed introduction to web scraping, you may find this [Webscraping article](https://blog.hartleybrody.com/web-scraping/) by Hartley Brody interesting.

The BeautifulSoup documentation can be found here: [BS Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)