## Import libraries

In [7]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

## Simple test using a scraping playground

In [2]:
html = urlopen('http://toscrape.com/')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

http://books.toscrape.com
http://books.toscrape.com
http://books.toscrape.com
http://quotes.toscrape.com/
http://quotes.toscrape.com
http://quotes.toscrape.com/
http://quotes.toscrape.com/scroll
http://quotes.toscrape.com/js
http://quotes.toscrape.com/tableful
http://quotes.toscrape.com/login
http://quotes.toscrape.com/search.aspx
http://quotes.toscrape.com/random
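
The output above lists several links more than once, and some differ only by a trailing slash. A minimal sketch of one way to normalize and deduplicate hrefs before following them, using only the standard library. The helper name `unique_links` and the rule that a trailing slash does not distinguish two pages are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urljoin, urldefrag

def unique_links(base_url, hrefs):
    """Return absolute URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative paths against the page they were found on
        absolute = urljoin(base_url, href)
        # Drop any #fragment, which only points inside the same page
        absolute, _fragment = urldefrag(absolute)
        # Assumption: 'http://host' and 'http://host/' are the same page
        key = absolute.rstrip('/')
        if key not in seen:
            seen.add(key)
            result.append(absolute)
    return result

hrefs = ['http://books.toscrape.com', 'http://quotes.toscrape.com/',
         'http://quotes.toscrape.com', '/login', '/login#top']
print(unique_links('http://quotes.toscrape.com/', hrefs))
```

In the cell above, the `hrefs` list would come from `link.attrs['href']` for each link returned by `bs.find_all('a')`.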


## Recursively crawling an entire site

In [4]:
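pages = set()  # every /wiki/ path seen so far
def getLinks(pageUrl):
    global pages
    # The /wiki/ paths printed below belong to Wikipedia, not toscrape.com;
    # each path already starts with '/', so append it directly to the host
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # Only follow internal links whose href starts with /wiki/
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')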

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Protection_policy#extended
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Reliable_sources/Perennial_sources
/wiki/Wikipedia:Reliable_sources
/wiki/Wikipedia:RS_(disambiguation)
/wiki/Wikipedia:WikiProject_Radio_Stations
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:WPPP
/wiki/Wikipedia:WikiProject
/wiki/Wikipedia:Wikimedia_sister_projects
/wiki/Help:Interwikimedia_links
/wiki/Help:Interlanguage_links
/wiki/List_of_ISO_639-1_codes
/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Protection_policy#full
/wiki/Wikipedia:Party_and_person
/wiki/File:Essay.svg
/wiki/File:Essay.png
/wiki/

/wiki/Wikipedia_talk:WikiProject_Articles_for_creation/Participants/Old_Requests
/wiki/User:Nafsadh
/wiki/File:Wikimedia_Foundation_RGB_logo_with_text.svg
/wiki/User:Neolux
/wiki/Wikipedia:User_pages
/wiki/Wikipedia:FUW
/wiki/Wikipedia:Copyrights
/wiki/Wikipedia:Consensus
/wiki/Help:Edit_conflict
/wiki/Wikipedia:Edit_warring
/wiki/Wikipedia:Administrators#Wheel_war
/wiki/Wikipedia:Requests_for_administrator_attention
/wiki/Wikipedia:Request_an_account
/wiki/Wikipedia:Arbitration_Committee/Clerks
/wiki/Wikipedia:Arbitration_Committee
/wiki/Wikipedia:Arbitration/Requests
/wiki/Wikipedia:Arbitration
/wiki/Wikipedia:Dispute_resolution
/wiki/Wikipedia:DR_(disambiguation)
/wiki/Wikipedia:DR
/wiki/Wikipedia:Abuse_Reports
/wiki/Wikipedia:Edit_filter
/wiki/Wikipedia:Article_Feedback_Tool
/wiki/Wikipedia:HISPAGES
/wiki/Wikipedia:WikipediaSpace
/wiki/File:WikipediaSpace-WS1-Introduction-v1.pdf
/wiki/Portal:Featured_content
/wiki/File:Darkgreen_flag_waving.svg
/wiki/Special:WhatLinksHere/File:Dark

HTTPError: HTTP Error 404: Not Found
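
The run above stops at the first page that returns a 404, and a recursive crawl of a large site can also hit Python's recursion limit. The following is a sketch of an alternative, not the notebook's method: it swaps the recursion for an explicit queue (breadth-first order), skips pages that fail to load, and caps the number of pages. The names `crawl`, `fetch_links`, `max_pages`, and the in-memory `site` dictionary are hypothetical; `fake_fetch` stands in for network access so the example runs offline.

```python
from collections import deque
from urllib.error import HTTPError, URLError

def crawl(start, fetch_links, max_pages=100):
    """Breadth-first crawl from start; returns paths in the order first seen.

    fetch_links is a hypothetical callable: path -> iterable of link paths.
    """
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        try:
            links = fetch_links(page)
        except (HTTPError, URLError) as err:
            # Skip pages that fail to load instead of aborting the crawl
            print('Skipping {}: {}'.format(page, err))
            continue
        for link in links:
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order

# Stand-in for a real site, so the sketch needs no network access
site = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': ['/wiki/Missing'],
}

def fake_fetch(path):
    if path not in site:
        raise HTTPError(path, 404, 'Not Found', None, None)
    return site[path]

print(crawl('/wiki/A', fake_fetch))
```

For a real crawl, `fetch_links` would wrap the `urlopen` and `BeautifulSoup` calls from the cell above and return the matching `href` values.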

## Collecting data across an entire site

In [5]:
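pages = set()  # every /wiki/ path seen so far
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        # Print the page title, the first body paragraph, and the edit link
        print(bs.h1.get_text())
        # Note: [0] raises IndexError, which is not caught below,
        # when the content div contains no <p> tags
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        # An expected element was not found (a lookup returned None)
        print('This page is missing something! Continuing.')

    # Only follow internal links whose href starts with /wiki/
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')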

Main Page
<p><b><a href="/wiki/Stan_Coveleski" title="Stan Coveleski">Stan Coveleski</a></b> (July 13, 1889 – March 20, 1984) was an American <a href="/wiki/Major_League_Baseball" title="Major League Baseball">Major League Baseball</a> <a href="/wiki/Pitcher" title="Pitcher">pitcher</a>.  In 450 career games from 1912 to 1928, Coveleski posted a <a href="/wiki/Win%E2%80%93loss_record_(pitching)" title="Win–loss record (pitching)">win–loss record</a> of 215–142, with 224 <a href="/wiki/Complete_game" title="Complete game">complete games</a>, 38 <a href="/wiki/Shutouts_in_baseball" title="Shutouts in baseball">shutouts</a>, and a 2.89 <a href="/wiki/Earned_run_average" title="Earned run average">earned run average</a>. He made his major league debut with the <a href="/wiki/History_of_the_Philadelphia_Athletics" title="History of the Philadelphia Athletics">Philadelphia Athletics</a> in 1912. He signed with the <a href="/wiki/Cleveland_Indians" title="Cleveland Indians">Cleveland Indians<

Wikipedia:WikiProject Parliamentary Procedure
<p><b>WikiProject Parliamentary Procedure</b> is devoted to improving the quality and comprehensiveness of articles on topics related to <a href="/wiki/Parliamentary_procedure" title="Parliamentary procedure">parliamentary procedure</a>.
</p>
/w/index.php?title=Wikipedia:WikiProject_Parliamentary_Procedure&action=edit
--------------------
/wiki/Wikipedia:WikiProject
Wikipedia:WikiProject
<p>A <b>WikiProject</b> is a group of contributors who want to work together as a team to improve Wikipedia. These groups often focus on a specific topic area (for example, <a class="mw-redirect" href="/wiki/Wikipedia:WPMATH" title="Wikipedia:WPMATH">mathematics</a> or <a class="mw-redirect" href="/wiki/Wikipedia:INDIA" title="Wikipedia:INDIA">India</a>), a specific part of the encyclopedia (for example, <a class="mw-redirect" href="/wiki/Wikipedia:WPPORT" title="Wikipedia:WPPORT">Portals</a>), or a specific kind of task (for example, <a href="/wiki/Wikiped

Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Child_protection
Wikipedia:Child protection
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Biographies_of_living_persons
Wikipedia:Biographies of living persons
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Biographies_of_living_persons/Noticeboard
Wikipedia:Biographies of living persons/Noticeboard
<p>This page is for reporting issues regarding <b><a href="/wiki/Wikipedia:Biographies_of_living_persons" title="Wikipedia:Biographies of living persons">biographies of living persons</a></b>. Generally this means cases where editors are repeatedly adding defamatory or libelous material to articles about living people over an extended period.
</p>
/w/index.php?title=Wikipedia:Biographies_of_living_persons/Noticeboard&action=edi

Portal:Contents/Reference
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Portal:Contents/Culture_and_the_arts
Portal:Contents/Culture and the arts
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Portal:Contents/Geography_and_places
Portal:Contents/Geography and places
<p><span style="float:right;"><a href="/wiki/Special:Nearby" title="Special:Nearby">Places near you</a></span>
</p>
This page is missing something! Continuing.
--------------------
/wiki/Special:Nearby
Nearby
<p>Try a different browser or enable JavaScript if you've disabled it.</p>
This page is missing something! Continuing.
--------------------
/wiki/Special:MyTalk
User talk:122.171.102.245
<p><b>Other reasons this message may be displayed:</b>
</p>
/w/index.php?title=User_talk:122.171.102.245&action=edit
--------------------
/wiki/Case_sensitivity
Case sensitivity
<p>In computers, the <b>case sensitivity</b> of te

Wikipedia:Good articles
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Good_article_criteria
Wikipedia:Good article criteria
<p>The <b>good article criteria</b> are the six standards or tests by which a <a href="/wiki/Wikipedia:Good_article_nominations" title="Wikipedia:Good article nominations">good article nomination</a> (GAN) may be compared and judged to be a <a href="/wiki/Wikipedia:Good_articles" title="Wikipedia:Good articles">good article</a> (GA). A good article that has met the good article criteria may not have met the criteria for <a href="/wiki/Wikipedia:Featured_articles" title="Wikipedia:Featured articles">featured articles</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Good_article_nominations
Wikipedia:Good article nominations
<p>
<span id="checklinks-generator" title="all"></span><span id

Wikipedia:Wikipedia Signpost/2013-04-01/WikiProject report
<p>Instead of interviewing a WikiProject, this week's Report is dedicated to answering our readers' questions about WikiProjects. The following Frequently Asked Questions came from feedback at the WikiProject Report's talk page, the WikiProject Council's talk page, and from previous lists of FAQs. Included in today's Report are questions and answers that may prove useful to Wikipedia's newest editors as well as seasoned veterans.
</p>
/w/index.php?title=Wikipedia:Wikipedia_Signpost/2013-04-01/WikiProject_report&action=edit
--------------------
/wiki/Wikipedia:Wikipedia_Signpost
Wikipedia:Wikipedia Signpost
<p><span style="font-size:90%;">Could this be a new relationship between the Foundation and ArbCom, and between the Foundation and enwiki?
</span> <span class="autocomment nowrap" style="font-size:90%;"><a href="/wiki/Wikipedia:Wikipedia_Signpost/2019-06-30/Discussion_report" title="Wikipedia:Wikipedia Signpost/2019-06-30/Dis

IndexError: list index out of range
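
The `IndexError` above comes from `find_all('p')[0]` on a page whose content area has no `<p>` tags, which `except AttributeError` does not cover. A sketch of one way to extract the same three fields without relying on exceptions, checking each lookup for `None` instead. The function name `summarize` and the sample HTML strings are assumptions for illustration; `select_one('#ca-edit span a')` is close to, but not strictly identical to, the chained `find` calls in the cell above.

```python
from bs4 import BeautifulSoup

def summarize(bs):
    """Return (title, first_paragraph_text, edit_href); None for anything absent."""
    title = bs.h1.get_text() if bs.h1 is not None else None

    content = bs.find(id='mw-content-text')
    # find() returns None instead of raising when no <p> exists
    first_p = content.find('p') if content is not None else None
    first_text = first_p.get_text().strip() if first_p is not None else None

    # First <a> inside a <span> inside the element with id="ca-edit"
    edit_link = bs.select_one('#ca-edit span a')
    edit_href = edit_link.get('href') if edit_link is not None else None
    return title, first_text, edit_href

# Hypothetical page with a title but no paragraphs and no edit link
sparse = '<html><body><h1>Nearby</h1><div id="mw-content-text"></div></body></html>'
# Hypothetical page with all three elements present
full = ('<html><body><h1>Title</h1>'
        '<div id="mw-content-text"><p> Hello </p></div>'
        '<ul><li id="ca-edit"><span><a href="/edit">Edit</a></span></li></ul>'
        '</body></html>')

print(summarize(BeautifulSoup(sparse, 'html.parser')))
print(summarize(BeautifulSoup(full, 'html.parser')))
```

Returning `None` for a missing field lets the caller decide whether to log, skip, or store the page, rather than printing a single catch-all message.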