Source: https://www.oreilly.com/library/view/web-scraping-with/9781491985564/

# 1. We import the libraries

You can read more about Beautiful Soup here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# 2. We open the URL that we want to scrape

In [2]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
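A request like this can fail, for example when the page does not exist or the server cannot be reached. Here is a minimal sketch of how the call could be wrapped, assuming we simply want `None` back on failure. The helper name `fetch_html` and the 10-second timeout are our own choices, not part of `urllib` or Beautiful Soup.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails.

    fetch_html is a hypothetical helper written for this notebook.
    """
    try:
        # The context manager closes the connection once the body is read.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The resource could not be reached (bad hostname, no connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

With this in place, `html = fetch_html('http://www.pythonscraping.com/pages/page1.html')` could replace the cell above, followed by a check such as `if html is not None:` before parsing. `BeautifulSoup` accepts the returned bytes just as it accepts the response object.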

# 3. We webscrape

In [3]:
soup = BeautifulSoup(html, 'html.parser')

In [4]:
print(soup)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



# 4. We parse the data

In [5]:
soup.title

<title>A Useful Page</title>

In [6]:
title = soup.find("title").text

In [7]:
title

'A Useful Page'

In [8]:
soup.title.parent.name

'head'

In [9]:
soup.body

<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>

In [10]:
soup.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

In [11]:
body = soup.find("div").text

In [12]:
body

'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'
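Notice the `\n` at the start and end of the string above. Here is a small sketch of two things that often help at this step: stripping that whitespace with `get_text(strip=True)`, and checking for `None` when a tag is not on the page. The inline `sample_html` string is a shortened stand-in for the real page (an assumption so that the cell runs without a network connection).

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the downloaded page (assumption for this sketch).
sample_html = (
    "<html><body><h1>An Interesting Title</h1>"
    "<div>\nLorem ipsum dolor sit amet.\n</div></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# .text keeps the newlines that surround the text inside the tag.
raw = soup.find("div").text
print(repr(raw))    # '\nLorem ipsum dolor sit amet.\n'

# get_text(strip=True) removes leading and trailing whitespace.
clean = soup.find("div").get_text(strip=True)
print(repr(clean))  # 'Lorem ipsum dolor sit amet.'

# find() returns None when the tag does not exist, so calling .text
# on the result directly would raise an AttributeError.
missing = soup.find("table")
if missing is None:
    print("No <table> on this page")
```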

# A little bit more complicated: let's get hyperlinks

Let's get all the hyperlinks found on the Wikipedia page for Kevin Bacon (https://en.wikipedia.org/wiki/Kevin_Bacon)

In [13]:
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

In [14]:
for link in bs.find_all('a'):
    print(link)

<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>
<a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>
<a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>
<a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>
<a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>
<a href="/wiki/Help:Contents" title="Guidance on how to use and edit Wikipedia"><span>Help</span></a>
<a href="/wiki/Help:Introduction" title="Learn how to edit Wikipedia"><span>Learn to edit</span></a>
<a href="/wiki/Wikipedia:Community_portal" title="The hub for editors"><

Each of those is a full `<a>` tag. Now we need to extract only the `href` attribute from each one

In [15]:
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
/w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
/w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Other_ventures
#Six_Degrees_of_Kevin_B

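In that output we have two types of hyperlinks: a) internal references (links to other Wikipedia articles and in-page anchors such as `#bodyContent`) and **b) full URLs** that start with `http`. We only want to get the **full URLs**. Let's do that!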
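To make the difference concrete, here is a small sketch that labels a few of the `href` values printed above using only the standard library. The helper `classify_href` and its four labels are our own invention for illustration. They split the internal references a bit more finely than the two groups described above.

```python
from urllib.parse import urlparse


def classify_href(href):
    """Label an href value (hypothetical helper for this notebook)."""
    if href.startswith("#"):
        return "in-page anchor"
    parts = urlparse(href)
    if parts.scheme in ("http", "https"):
        return "full URL"
    if parts.netloc:
        # Starts with '//': a host is given but no scheme.
        return "protocol-relative URL"
    return "relative link"


# A few values copied from the output above.
samples = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "https://donate.wikimedia.org/?wmf_source=donate",
]
for href in samples:
    print(f"{classify_href(href):22} {href}")
```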

In [16]:
import re

In [17]:
for link in bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(http)')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

http://baconbros.com
https://web.archive.org/web/20090113222205/http://www.newenglandancestors.org/research/services/articles_gbr78.asp
http://www.newenglandancestors.org/research/services/articles_gbr78.asp
http://www.biography.com/people/kevin-bacon-9542173
https://www.hollywoodreporter.com/tv/tv-news/showtime-cancels-city-on-a-hill-3-seasons-1235250089/
https://www.theguardian.com/film/filmblog/2009/feb/19/best-actors-never-nominated-for-oscars
http://www.walkoffame.com/kevin-bacon
https://www.marketingweek.com/ee-unveils-six-degrees-of-bacon-launch-ads/
https://web.archive.org/web/20190403203113/https://www.biography.com/news/kevin-bacon-biography-facts
http://www.biography.com/news/kevin-bacon-biography-facts
https://philadelphia.cbslocal.com/top-lists/stars-from-philly-to-hollywood/
https://movies.yahoo.com/person/kevin-bacon/biography.html
https://web.archive.org/web/20141016202657/http://www.thebiographychannel.co.uk/biographies/kevin-bacon.html
http://www.thebiographychannel.c
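Why does the pattern `'^(http)'` do the job? The `^` anchors the match to the start of the string, so only values beginning with `http` pass (which includes `https`). The sketch below checks this on a few sample values and compares it with a slightly stricter pattern, `'^https?://'`. The stricter pattern and the last sample value are our own additions, not something the original cell uses.

```python
import re

loose = re.compile('^(http)')       # the pattern used in the cell above
strict = re.compile(r'^https?://')  # assumption: a stricter alternative

samples = [
    "http://baconbros.com",
    "https://www.theguardian.com/film/filmblog/2009/feb/19/best-actors-never-nominated-for-oscars",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "http-status-codes.html",  # made-up relative link that starts with 'http'
]

for href in samples:
    is_loose = bool(loose.search(href))
    is_strict = bool(strict.search(href))
    print(f"loose={is_loose!s:5} strict={is_strict!s:5} {href}")
```

On the Kevin Bacon page both patterns would likely give the same result. The stricter one only matters if a relative link happens to begin with the letters `http`.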

And... success! To get the data inside all those URLs we would need to do something more complicated: write a **web crawler** (https://en.wikipedia.org/wiki/Web_crawler). But this is an introductory course, so let's stop here! ;)
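For the curious, here is a small taste of a step a crawler would typically need: turning the relative links we saw earlier into full URLs and dropping duplicates. This is only a sketch using the standard library. The helper `absolute_links` is hypothetical, and it does not download anything.

```python
from urllib.parse import urljoin, urldefrag

base_url = "https://en.wikipedia.org/wiki/Kevin_Bacon"

# A few href values copied from the outputs above (note the duplicate).
hrefs = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Main_Page",
    "http://baconbros.com",
]


def absolute_links(base, hrefs):
    """Resolve hrefs against base, skipping duplicates and the page itself."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin fills in the missing scheme and host;
        # urldefrag drops the '#...' part, which points inside a page.
        url, _fragment = urldefrag(urljoin(base, href))
        if url == base or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result


for url in absolute_links(base_url, hrefs):
    print(url)
```

A real crawler would then open each of these URLs in turn and repeat the process, which is where things get more complicated.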

# Exercise

Now, in this same notebook, try doing the same thing with a different Wikipedia page that you like (e.g. Taylor Swift, Stephan Lichtsteiner...)