#### Real World Applications
Multiple pages and multiple sites.
Web crawlers are so called because they crawl across the web. At their core is an element of recursion: they must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

The scrapers used in the previous examples work great in situations where all the data you need is on a single page. With web crawlers, you must be extremely conscientious of how much bandwidth you are using and make every effort to determine whether there's a way to make the target server's load easier.
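
The retrieve, examine, and follow cycle described above can be sketched without touching the network. This is a minimal illustration, not a real crawler: `FAKE_SITE`, `crawl`, and `fetch_links` are hypothetical names, and the "pages" are just a hand-written dictionary of links. The `visited` list is what keeps the recursion from looping forever when pages link back to each other.

```python
# Hypothetical in-memory "site": each key is a page URL and each value is the
# list of URLs that page links to. A real crawler would download and parse HTML.
FAKE_SITE = {
    '/a': ['/b', '/c'],
    '/b': ['/a', '/d'],
    '/c': [],
    '/d': ['/a'],
}

def crawl(url, fetch_links, visited=None):
    """Recursively visit url and every page reachable from it, once each."""
    if visited is None:
        visited = []
    if url in visited:
        # Already seen: stop here so link cycles do not recurse forever.
        return visited
    visited.append(url)
    for next_url in fetch_links(url):
        crawl(next_url, fetch_links, visited)
    return visited

order = crawl('/a', lambda url: FAKE_SITE.get(url, []))
print(order)
```

Each page is requested only once, which also helps keep the load on the target server down.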

#### Traversing a Single Domain
There are games called Six Degrees of Wikipedia and Six Degrees of Kevin Bacon. The object of both games is to link two unlikely subjects by a chain containing no more than six subjects in total, including the two original subjects.
For example, Eric Idle appeared in Dudley Do-Right with Brendan Fraser, who
appeared in The Air I Breathe with Kevin Bacon. In this case, the chain from Eric Idle
to Kevin Bacon is only three subjects long.


In this section, you’ll begin a project that will become a Six Degrees of Wikipedia
solution finder: you’ll be able to take the Eric Idle page and find the fewest number of
link clicks that will take you to the Kevin Bacon page. (Sounds cool, right?)
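
Finding the "fewest number of link clicks" is a shortest-path problem, and one common way to solve it is breadth-first search. The sketch below assumes that approach; it is not necessarily how the finished project will work. The `LINKS` graph is made up for illustration (it is not real Wikipedia link data), and `shortest_chain` and `get_links` are hypothetical names.

```python
from collections import deque

# Hypothetical link graph for illustration only; not real Wikipedia data.
LINKS = {
    'Eric Idle': ['Monty Python', 'Brendan Fraser'],
    'Monty Python': ['Eric Idle'],
    'Brendan Fraser': ['Eric Idle', 'Kevin Bacon'],
    'Kevin Bacon': [],
}

def shortest_chain(start, target, get_links):
    """Return the shortest list of pages from start to target, or None."""
    queue = deque([[start]])  # each queue entry is a full path so far
    seen = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for neighbor in get_links(page):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # target is not reachable from start

chain = shortest_chain('Eric Idle', 'Kevin Bacon', lambda page: LINKS.get(page, []))
print(chain)
```

Because breadth-first search explores every page one click away before any page two clicks away, the first path that reaches the target is one of the shortest.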


##### But What About Wikipedia’s Server Load?
According to the Wikimedia Foundation (the parent organization behind Wikipedia),
the site’s web properties receive approximately 2,500 hits per second, with more than
99% of them to the Wikipedia domain (see the “Traffic Volume” section of the
“Wikimedia in Figures” page). Because of the sheer volume of traffic, your web scrapers are
unlikely to have any noticeable impact on Wikipedia’s server load. However, if you
run the code samples in this book extensively, or create your own projects that scrape
Wikipedia, I encourage you to make a tax-deductible donation to the Wikimedia
Foundation, not just to offset your server load but also to help make education
resources available for everyone else.
Also keep in mind that if you plan on doing a large project involving data from
Wikipedia, you should check to make sure that data isn’t already available from the
Wikipedia API. Wikipedia is often used as a website to demonstrate scrapers and crawlers
because it has a simple HTML structure and is relatively stable. However, its APIs
often make this same data more efficiently accessible.
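
As a rough idea of what the API route could look like, the sketch below builds a MediaWiki Action API query for the links on a page and pulls the titles out of a response. It makes no network request. The helper names `build_links_query` and `titles_from_response` are hypothetical, and `sample` is a hand-written stand-in shaped like a `prop=links` response (the page ID `'1'` is invented), not real output.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_links_query(title):
    """Build an Action API URL asking for the article links on one page."""
    params = {
        'action': 'query',
        'prop': 'links',
        'titles': title,
        'plnamespace': 0,   # namespace 0 holds ordinary articles
        'pllimit': 'max',
        'format': 'json',
    }
    return '{}?{}'.format(API_ENDPOINT, urlencode(params))

def titles_from_response(data):
    """Collect linked page titles from a decoded prop=links response."""
    titles = []
    for page in data.get('query', {}).get('pages', {}).values():
        for link in page.get('links', []):
            titles.append(link['title'])
    return titles

# Hand-written sample for illustration only; not a real API response.
sample = {'query': {'pages': {'1': {'links': [
    {'ns': 0, 'title': 'Footloose (1984 film)'},
    {'ns': 0, 'title': 'Philadelphia'},
]}}}}

url = build_links_query('Kevin Bacon')
print(url)
print(titles_from_response(sample))
```

Long link lists come back in batches, so a real client would also need to follow the API's continuation values; that part is left out here.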

In [1]:
# Retrieve an arbitrary Wikipedia page and print the list of links found on that page.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html,'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#searchInput
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_TIFF_2015.jpg
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award

If you look at the list of links produced, you’ll notice that all the articles you’d expect
are there: “Apollo 13,” “Philadelphia,” “Primetime Emmy Award,” and so on. However,
there are some things that you don’t want as well:

//wikimediafoundation.org/wiki/Privacy_policy

//en.wikipedia.org/wiki/Wikipedia:Contact_us

In fact, Wikipedia is full of sidebar, footer, and header links that appear on every
page, along with links to the category pages, talk pages, and other pages that do not
contain different articles:
/wiki/Category:Articles_with_unsourced_statements_from_April_2014
/wiki/Talk:Kevin_Bacon


Let's examine the patterns that separate "article links" from "other links." If you examine the links that point to article pages, you'll notice they have three things in common:
- They reside within the div with the id set to bodyContent.
- The URLs don't contain colons.
- The URLs begin with /wiki/

Using these rules and the regular expression ^(/wiki/)((?!:).)*$ we can revise the above code slightly to retrieve only the desired article links:

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find('div',{'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki
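
To see how the pattern ^(/wiki/)((?!:).)*$ separates article links from the rest, it can be run against a handful of the hrefs printed earlier. This is only a quick check; the names `ARTICLE_LINK`, `samples`, and `kept` are chosen here for illustration.

```python
import re

# Same pattern as in the cell above: the href must start with /wiki/ and must
# not contain a colon anywhere after that prefix.
ARTICLE_LINK = re.compile('^(/wiki/)((?!:).)*$')

samples = [
    '/wiki/Philadelphia',
    '/wiki/Frost/Nixon_(film)',
    '/wiki/Talk:Kevin_Bacon',
    '/wiki/Category:Articles_with_unsourced_statements_from_April_2014',
    '//wikimediafoundation.org/wiki/Privacy_policy',
    'http://baconbros.com/',
    '#cite_note-1',
]

kept = [href for href in samples if ARTICLE_LINK.match(href)]
for href in samples:
    print('keep' if href in kept else 'skip', href)
```

Only the first two hrefs are kept. The negative lookahead `(?!:)` is checked before every character after the prefix, so a single colon anywhere rejects the link.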

This list of all article links in one hardcoded Wikipedia article, while interesting, is fairly useless in practice. You need to be able to take this code and transform it into something more like the following:
- A single function, getLinks, that takes in a Wikipedia article URL of the form /wiki/<Article_Name> and returns a list of all linked article URLs in the same form.

- A main function that calls getLinks with a starting article, chooses a random article link from the returned list, and calls getLinks again, until you stop the program or until no article links are found on the new page.

Here is the code:
It prints the URL of each article it moves to. Because it keeps fetching one page after another until you stop it or it reaches a page with no article links, it can run for a long time.


In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a float timestamp; newer Python versions reject a datetime object as a seed.
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleURL):
    # Fetch the article and return the <a> tags for article links in its body.
    html = urlopen('http://en.wikipedia.org{}'.format(articleURL))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    # Pick one article link at random and follow it.
    newArticle = random.choice(links).attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

/wiki/James_Dean
/wiki/1950_Ford
/wiki/Ford_Custom
/wiki/Ford_Escort_(China)
/wiki/Ford_Focus
/wiki/Mid-size_car
/wiki/Rambler_Classic
/wiki/American_Center
/wiki/ISBN_(identifier)
/wiki/ISO_4217
/wiki/Roman_currency
/wiki/List_of_Muslim_states_and_dynasties
/wiki/Germiyanids
/wiki/Mu%C4%9Fla_Province
/wiki/Aksaray_Province
/wiki/Bing%C3%B6l_Province
/wiki/Sakarya_Province
/wiki/%C3%87anakkale_Province
/wiki/%C3%87anakkale_(electoral_district)
/wiki/Republican_People%27s_Party_(Turkey)
/wiki/1965_Turkish_general_election
/wiki/1927_Turkish_general_election
/wiki/1923_Turkish_general_election
/wiki/2017_Turkish_constitutional_referendum
/wiki/Justice_Party_(Turkey)
/wiki/Socialist_Revolution_Party_(Turkey)
/wiki/Nationalist_Democracy_Party
/wiki/2007_Turkish_constitutional_referendum
/wiki/%C3%96mer_%C3%87elik
/wiki/Akif_%C3%87a%C4%9Fatay_K%C4%B1l%C4%B1%C3%A7
/wiki/Numan_Kurtulmu%C5%9F
/wiki/Ordu_Province
/wiki/Central_Anatolia_Region_(statistical)
/wiki/East_Marmara_Region_(statistical

/wiki/ISBN_(identifier)
/wiki/ALGOL_60
/wiki/ISO/IEC_18000
/wiki/ISO/IEC_8859-8
/wiki/CER-GS
/wiki/Character_encodings_in_HTML
/wiki/Font_family_(HTML)
/wiki/XHTML
/wiki/Kirix_Strata
/wiki/Falkon
/wiki/360_Secure_Browser
/wiki/Xombrero
/wiki/Blazer_(web_browser)
/wiki/WebRTC
/wiki/Web_API
/wiki/Mod_python
/wiki/Application_server
/wiki/Comparison_of_application_servers#Java
/wiki/Proprietary_software
/wiki/List_of_vaporware
/wiki/Operating_system
/wiki/Qian_Xuesen
/wiki/Tung_Chee-hwa
/wiki/Chen_Kuiyuan
/wiki/Guo_Moruo
/wiki/Ji_Pengfei
/wiki/Lin_Boqu
/wiki/Buhe_(politician)
/wiki/Wang_Guangying
/wiki/Han_Qide
/wiki/Guo_Moruo
/wiki/Ann_Tse-kai
/wiki/Xie_Juezai
/wiki/Lu_Zhangong
/wiki/Communist_Party_of_China
/wiki/Zhou_Enlai
/wiki/Zhu_De
/wiki/Chinese_language
/wiki/Balangao_language
/wiki/Kankanaey_language
/wiki/Mimaropa
/wiki/Sitio
/wiki/List_of_summer_villages_in_Alberta
/wiki/Chestermere
/wiki/Time_zone
/wiki/Chronozone
/wiki/Pridoli_epoch
/wiki/Permian
/wiki/Permian%E2%80%93Triassi

/wiki/Emacs
/wiki/Organization
/wiki/Personality_clash#In_the_workplace
/wiki/Workplace_revenge
/wiki/Workplace_romance
/wiki/Inheritance
/wiki/Rivalry_(economics)
/wiki/Information_good
/wiki/Damaged_good
/wiki/Adware
/wiki/ABC-CLIO
/wiki/Santa_Barbara,_California
/wiki/Riverside,_California
/wiki/Ventura,_California
/wiki/Catholic_school
/wiki/Separation_of_church_and_state
/wiki/Freedom_of_religion_in_Turkey
/wiki/Freedom_of_religion_in_Malaysia
/wiki/Royal_Malaysian_Customs
/wiki/Royal_Malaysian_Customs_Department_Museum
/wiki/Flor_de_la_Mar
/wiki/Maritime_Museum_(Malaysia)
/wiki/Toy_Museum_(Melaka)
/wiki/Sarawak_State_Museum
/wiki/People%27s_Museum
/wiki/Royal_Malaysian_Customs_Department_Museum
/wiki/Tun_Ghafar_Baba_Museum
/wiki/Istana_Negara,_Jalan_Istana
/wiki/Stadthuys
/wiki/Melaka_Sultanate_Palace_Museum
/wiki/Petrosains
/wiki/Archaeology_museum
/wiki/Cairo
/wiki/GM_Korea
/wiki/General_Motors_ignition_switch_recalls
/wiki/This_Week_(ABC_TV_series)
/wiki/20/20_(American_TV_pro

/wiki/Damascus_steel
/wiki/Cryogenic_deflashing
/wiki/Molding_(process)
/wiki/Molding_(decorative)
/wiki/Dalbergia_melanoxylon
/wiki/Ecocrop
/wiki/Tropicos
/wiki/FishBase
/wiki/Natural_Resources_Conservation_Service#Plants
/wiki/International_Plant_Names_Index
/wiki/State_Herbarium_of_South_Australia
/wiki/Biodiversity_Heritage_Library
/wiki/Natural_Resources_Conservation_Service#Plants
/wiki/Conservation_technical_assistance
/wiki/Conservation_(ethic)
/wiki/Conservation_in_Hong_Kong
/wiki/List_of_buildings_and_structures_in_Hong_Kong
/wiki/Court_of_Final_Appeal_(Hong_Kong)
/wiki/Andrew_Cheung
/wiki/Chief_Executive_of_Hong_Kong
/wiki/Election_Committee
/wiki/Young_Plan_(Hong_Kong)
/wiki/ISBN_(identifier)
/wiki/JPEG_XR
/wiki/WebP
/wiki/Cross-platform
/wiki/Application_framework
/wiki/Cocoa_(API)
/wiki/CEGUI
/wiki/List_of_widget_toolkits
/wiki/Windows_Template_Library
/wiki/Dafny
/wiki/Entity_Framework
/wiki/Barrelfish
/wiki/NuGet
/wiki/Apache_Ivy
/wiki/Bean_Scripting_Framework
/wiki/Etc

/wiki/Le_Vision_Pictures
/wiki/The_Expendables_2
/wiki/The_Expendables_(franchise)
/wiki/Terry_Crews
/wiki/Fox_Broadcasting_Company
/wiki/Letterboxing_(filming)
/wiki/Capacitance_Electronic_Disc
/wiki/Versatile_Multilayer_Disc
/wiki/Toshiba
/wiki/Mazda
/wiki/Hiroshima_Toyo_Carp
/wiki/1984_Nippon_Professional_Baseball_season
/wiki/Yomiuri_Giants
/wiki/Shozo_Saijo
/wiki/World_Boxing_Association
/wiki/Jelena_Mrdjenovich
/wiki/France
/wiki/Argentina
/wiki/Crime_in_Argentina
/wiki/Taxation_in_Argentina
/wiki/Taxation_in_France
/wiki/Negative_income_tax
/wiki/Gabriel_Zucman
/wiki/Stephen_Gill_(political_scientist)
/wiki/Council_on_Foreign_Relations
/wiki/Margaret_Hamburg
/wiki/Center_for_Drug_Evaluation_and_Research
/wiki/International_Conference_on_Harmonisation_of_Technical_Requirements_for_Registration_of_Pharmaceuticals_for_Human_Use
/wiki/Brussels
/wiki/St._Michael_and_St._Gudula_Cathedral
/wiki/Brabantine_Gothic
/wiki/St_Bavo%27s_Cathedral,_Ghent
/wiki/Ghent
/wiki/Habsburgs
/wiki/Etich

/wiki/House_of_Assembly_of_Saint_Vincent_and_the_Grenadines
/wiki/Parliament_of_Bermuda
/wiki/2017_Bermudian_general_election
/wiki/Michael_Dunkley
/wiki/Henry_Tucker_(Bermudian_politician)
/wiki/Premiers_of_Bermuda
/wiki/Prime_Minister_of_Bahrain
/wiki/Prime_Minister_of_Brazil
/wiki/Economy_of_the_Empire_of_Brazil
/wiki/Economy_of_the_Mongolian_People%27s_Republic
/wiki/Consumer_goods
/wiki/Snowshoe
/wiki/Hessian_(boot)
/wiki/Ancient_Chinese_clothing
/wiki/Fashion_accessory
/wiki/Dickey_(garment)
/wiki/D%C3%A9butante_dress
/wiki/Jacket
/wiki/Guards_Coat
/wiki/Robe
/wiki/Magic_(paranormal)
/wiki/Andre_Breton
/wiki/Name_and_Title_Authority_File_of_Catalonia
/wiki/Museum_of_New_Zealand_Te_Papa_Tongarewa
/wiki/Dominion_Museum
/wiki/Trove
/wiki/Library_catalog
/wiki/Index_card
/wiki/Paper_towel
/wiki/Notebook
/wiki/Filter_paper
/wiki/Boston_round_(bottle)
/wiki/Beaker_(glassware)
/wiki/Thermogravimetric_analysis
/wiki/Doi_(identifier)
/wiki/ISO_16750
/wiki/Topic_map
/wiki/ISO/IEC_29119
/wi

/wiki/Polychlorinated_biphenyl
/wiki/Neoprene
/wiki/Thomas_M._Connelly
/wiki/Economics
/wiki/Homo_economicus
/wiki/List_of_game_theorists
/wiki/Bank_of_Sweden_Prize_in_Economic_Sciences_in_Memory_of_Alfred_Nobel
/wiki/Nobel_Committee_for_Chemistry
/wiki/Nobel_Prize
/wiki/Nobel_Foundation
/wiki/List_of_Asian_Nobel_laureates
/wiki/Nobel_Prize_in_Chemistry
/wiki/Enantioselective_synthesis
/wiki/Doi_(identifier)
/wiki/ISO_13490
/wiki/Floppy_disk
/wiki/Magnetic_storage
/wiki/Sequential_access_memory
/wiki/Drum_memory
/wiki/Berkeley_Software_Distribution
/wiki/Apple_SOS
/wiki/Application_programming_interface
/wiki/Preemption_(computing)
/wiki/Novell_DOS
/wiki/CHCP_(DOS_command)
/wiki/HIMEM
/wiki/DOS_(CONFIG.SYS_directive)
/wiki/COM2
/wiki/ArcaOS
/wiki/Operating_system
/wiki/Apple_TV
/wiki/RCA_connector
/wiki/DB13W3
/wiki/RCA_connector
/wiki/Digital_audio
/wiki/Foldback_(sound_engineering)
/wiki/Echo_(phenomenon)
/wiki/Napier_Museum
/wiki/Panieli_Poru_waterfalls
/wiki/Palakkad_Fort
/wiki/Tal

/wiki/Software_categories#Categorization_approaches
/wiki/Health_care
/wiki/Acute_care
/wiki/Self_care
/wiki/Doi_(identifier)
/wiki/ISO/IEC_646
/wiki/Ampersand
/wiki/Bash_(Unix_shell)
/wiki/Bourne_shell
/wiki/Computerworld
/wiki/International_Data_Group
/wiki/InfoWorld
/wiki/American_Society_for_Information_Science
/wiki/Non-governmental_organization
/wiki/Ministry_for_Foreign_Affairs_(Finland)
/wiki/Ministry_of_Foreign_Affairs_(Kenya)
/wiki/Ministry_of_Foreign_Affairs_and_Cooperation_(East_Timor)
/wiki/Ministry_of_Foreign_Affairs_(Sri_Lanka)
/wiki/Ministry_of_Foreign_Affairs_(Uzbekistan)
/wiki/Ministry_of_Foreign_Affairs_(Romania)
/wiki/Avram_Bunaciu
/wiki/Alexandru_Djuvara
/wiki/Nicolae_Titulescu
/wiki/Ion_Antonescu
/wiki/Nicolae_Titulescu
/wiki/German_National_Library_of_Economics
/wiki/Research_library
/wiki/List_of_libraries
/wiki/Deutsche_B%C3%BCcherei_Leipzig
/wiki/National_Library_of_Russia
/wiki/National_Library_of_Sweden
/wiki/VIAF_(identifier)
/wiki/Spanish_language
/wiki/La

KeyboardInterrupt: 
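
The run above only ended when it was interrupted by hand. One possible variation, sketched here under stated assumptions, caps the number of hops and pauses between requests to go easier on the server. `random_walk`, `max_hops`, `delay`, and `FAKE_LINKS` are hypothetical names, and the link getter is passed in so the sketch can run against a small fake table instead of Wikipedia. With the real crawler, `get_links` would need to return href strings rather than tags.

```python
import random
import time

def random_walk(start, get_links, max_hops=10, delay=1.0):
    """Follow random links from start, stopping after max_hops or at a dead end."""
    path = []
    current = start
    for _ in range(max_hops):
        links = get_links(current)
        if not links:
            break  # no article links on this page
        current = random.choice(links)
        path.append(current)
        time.sleep(delay)  # pause between requests to limit server load
    return path

# Hypothetical link table standing in for real pages.
FAKE_LINKS = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A'],
    '/wiki/C': [],
}

path = random_walk('/wiki/A', lambda article: FAKE_LINKS.get(article, []),
                   max_hops=5, delay=0)
print(path)
```

The walk is random, so the printed path differs from run to run, but it never exceeds `max_hops` entries.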


#### The Dark and Deep Webs


