## <center>Web Scraping Tutorial</center>
Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
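
The b'...' prefix in the output above shows that read() returns bytes rather than text. Below is a minimal sketch of decoding such a payload. It uses a shortened copy of the response body so that it runs offline. UTF-8 is an assumption here; for a live response, the charset declared by the server (if any) can be read with html.headers.get_content_charset().

```python
# Shortened copy of the bytes printed above (an illustrative stand-in for html.read())
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded
text = raw.decode("utf-8")

print(type(raw).__name__)    # bytes
print(type(text).__name__)   # str
print(text.splitlines()[2])  # <title>A Useful Page</title>
```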


In [4]:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get("http://pythonscraping.com/pages/page1.html")
soup = bs(r.content)
print(soup.prettify())


<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



In [5]:
print(soup.h1)  # print the first <h1> heading in the document

<h1>An Interesting Title</h1>


In [15]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read())
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title
title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
# title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>
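
getTitle catches HTTPError, which covers error status codes returned by a server. When the server cannot be reached at all, urlopen raises URLError instead, and HTTPError is a subclass of it. The sketch below shows one way to handle both. The helper name fetch_html is hypothetical, and the UTF-8 fallback is an assumption. The two calls at the end use a data: URL and a missing local file so that the example runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url):
    """Return the page body as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            # Assumption: fall back to UTF-8 when no charset is declared
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        print(f"Server returned an error status: {e.code}")
    except URLError as e:
        print(f"Could not reach the resource: {e.reason}")
    return None


# Offline checks: an inline document, then a file that does not exist
print(fetch_html("data:text/html,%3Ch1%3EHi%3C%2Fh1%3E"))
print(fetch_html("file:///no/such/dir/missing.html"))
```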


With a BeautifulSoup object built from the page, we can use the findAll function to extract a
Python list of proper nouns by selecting only the text within
`<span class="green"></span>` tags (findAll is an extremely flexible function that we will
use repeatedly below):

In [16]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


In [17]:
nameList = bsObj.findAll(text="the prince")
print(len(nameList))

7


In [18]:
allText = bsObj.findAll(id="text")
print(allText[0].get_text())


"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite.

All her invitations without exception, written in French, and
delivered by a scarlet-liveri

Similarly, bsObj.div.findAll("img") finds the first div tag in the document,
then retrieves a list of all img tags that are descendants of that div tag.
If you want only the descendants that are direct children, use the .children
attribute:

In [19]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


This code prints the list of product rows in the giftList table. The blank lines in
the output are the whitespace text nodes between rows, which also count as children.
If you were to write it using the .descendants attribute instead of .children,
about two dozen tags would be found within the table and printed, including img
tags, span tags, and individual td tags. It is important to differentiate
between children and descendants!
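
The difference can be seen on a small inline table, without fetching the page. This is a minimal sketch: the HTML is a cut-down stand-in for page3.html, and the "html.parser" argument is a choice made here so that the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down stand-in for the giftList table (illustrative only)
html = (
    '<table id="giftList">'
    "<tr><th>Item Title</th><th>Cost</th></tr>"
    '<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes directly inside the table
child_tags = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree: rows, cells, and the text inside them
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_tags)
print(descendant_tags)
```

Filtering with isinstance(node, Tag) drops the text nodes, which both attributes also yield.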

### Dealing with siblings
The BeautifulSoup next_siblings attribute makes it trivial to collect data from
tables, especially ones with title rows:

In [20]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

This code prints all rows of products from the product table except for the first
title row. Why does the title row get skipped? Two reasons. First, objects cannot be
siblings with themselves: any time you get the siblings of an object, the object
itself is not included in the list. Second, next_siblings returns only the siblings
that come after the object. If we were to select a row in the middle of the list, for
example, and read next_siblings on it, only the subsequent (next) siblings would be
returned. So, by selecting the title row and reading its next_siblings, we can select
all the rows in the table without selecting the title row itself.
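
A minimal offline sketch of the same idea follows, using an inline table as a stand-in for page3.html. next_siblings also yields the newline strings between rows (the blank lines in the output above), so the sketch keeps only Tag objects. The "html.parser" argument is a choice made for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative stand-in for the giftList table, with newlines between rows
html = """<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

title_row = soup.find("table", {"id": "giftList"}).tr
siblings = list(title_row.next_siblings)

# Keep only the row tags; the remaining siblings are whitespace strings
rows = [node for node in siblings if isinstance(node, Tag)]
ids = [row["id"] for row in rows]
print(ids)
```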

### Regular Expression with Beautiful Soup

In [23]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html)
images = bsObj.findAll("img", {"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


A regular expression can be passed to findAll in place of a literal attribute value. In this
case, we look at the file path of the product images: the code above prints only the relative
image paths that start with ../img/gifts/img and end in .jpg.
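
The pattern can be checked on its own, without fetching the page. In this minimal sketch the candidate paths are made up for illustration. search() is used because that is how Beautiful Soup applies a compiled pattern to an attribute value.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Illustrative src values; only the first follows the product-image convention
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/logo.jpg",
    "../img/gifts/img3.png",
    "/img/gifts/img2.jpg",
]

matches = [path for path in candidates if pattern.search(path)]
print(matches)  # ['../img/gifts/img1.jpg']
```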

### Web Crawl

If you examine the links that point to article pages (as opposed to other internal
pages), they all have three things in common:
1. They reside within the div with the id set to bodyContent.
2. The URLs do not contain colons.
3. The URLs begin with /wiki/.

We can use these rules to revise the code slightly to retrieve only the desired article
links:

In [25]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(1990_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Holl
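
The regular expression used above can be tested in isolation. It requires the href to start with /wiki/ and uses a negative lookahead to reject any later character that is a colon, which filters out namespace pages such as Category: or File:. The hrefs in this minimal sketch are illustrative.

```python
import re

article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Illustrative hrefs of the kinds found on a Wikipedia page
hrefs = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/w/index.php?title=Kevin_Bacon",
    "//en.wikipedia.org/wiki/Main_Page",
    "/wiki/Footloose_(1984_film)",
]

articles = [href for href in hrefs if article_link.search(href)]
print(articles)  # ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```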

In [27]:
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# import datetime
# import random
# import re
# random.seed(datetime.datetime.now())
# def getLinks(articleUrl):
#     html = urlopen("http://en.wikipedia.org"+articleUrl)
#     bsObj = BeautifulSoup(html)
#     return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
# links = getLinks("/wiki/Kevin_Bacon")
# while len(links) > 0:
#     newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
#     print(newArticle)
#     links = getLinks(newArticle)

/wiki/Fox_Broadcasting_Company
/wiki/John_Matoian
/wiki/Duke_University
/wiki/Wayback_Machine
/wiki/Reddit
/wiki/Internet_culture
/wiki/Microblogs
/wiki/World_Wide_Web
/wiki/URL
/wiki/IPv6
/wiki/6to4
/wiki/IPv6_rapid_deployment
/wiki/Regional_Internet_registry
/wiki/Information_infrastructure
/wiki/Asia-Pacific_Economic_Cooperation
/wiki/ASEAN
/wiki/Sports_in_Asia
/wiki/Iran_men%27s_national_volleyball_team
/wiki/2009_FIVB_Volleyball_World_League
/wiki/Serbia_men%27s_national_volleyball_team
/wiki/Aleksandar_%C5%A0o%C5%A1tar
/wiki/Novak_Djokovic
/wiki/2018_Shanghai_Rolex_Masters
/wiki/2018_ABN_AMRO_World_Tennis_Tournament_%E2%80%93_Singles
/wiki/Switzerland
/wiki/Economy_of_Sierra_Leone
/wiki/Tourism_in_Sierra_Leone
/wiki/Bank_of_Sierra_Leone
/wiki/Music_of_Sierra_Leone
/wiki/Music_of_Djibouti
/wiki/Songwriter
/wiki/Record_Producer
/wiki/Electronic_music
/wiki/Yamaha_Corporation
/wiki/D%27Angelico_Guitars
/wiki/Gordon-Smith_Guitars
/wiki/Harp_guitar
/wiki/Guitar
/wiki/Paulinho_Nogueira

KeyboardInterrupt: 

After seeding the random number generator, the code above defines the getLinks function,
which takes an article URL of the form /wiki/..., prepends the Wikipedia domain name,
http://en.wikipedia.org, and retrieves the BeautifulSoup object for the HTML at that
address. It then extracts a list of article link tags, based on the parameters discussed
previously, and returns them.
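
The random walk itself can be tried without network access. In the sketch below a small in-memory link graph, LINKS, stands in for Wikipedia. It is hypothetical, as are the names get_links and random_walk. A step limit is added so that the loop ends on its own instead of running until interrupted. Note also that seeding with a datetime object, as in the commented-out cell, raises TypeError on Python 3.11 and later. Calling random.seed() with no argument, or using a random.Random instance as here, avoids that.

```python
import random

# Hypothetical link graph standing in for the pages getLinks would fetch
LINKS = {
    "/wiki/Kevin_Bacon": ["/wiki/Footloose_(1984_film)", "/wiki/Kyra_Sedgwick"],
    "/wiki/Footloose_(1984_film)": ["/wiki/Kevin_Bacon"],
    "/wiki/Kyra_Sedgwick": [],
}


def get_links(article_url):
    """Return the article links found on a page (here: looked up in LINKS)."""
    return LINKS.get(article_url, [])


def random_walk(start, max_steps, rng):
    """Follow randomly chosen links until a dead end or max_steps hops."""
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links(next_article)
    return path


rng = random.Random(42)  # fixed seed so repeated runs give the same walk
path = random_walk("/wiki/Kevin_Bacon", max_steps=5, rng=rng)
print(path)
```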

In [29]:
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# import re
# pages = set()
# def getLinks(pageUrl):
#     global pages
#     html = urlopen("http://en.wikipedia.org"+pageUrl)
#     bsObj = BeautifulSoup(html)
#     for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
#         if 'href' in link.attrs:
#             if link.attrs['href'] not in pages:
# #We have encountered a new page
#                 newPage = link.attrs['href']
#                 print(newPage)
#                 pages.add(newPage)
#                 getLinks(newPage)
# getLinks("")

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Category:Administrative_backlog
/wiki/Help:Category
/wiki/Wikipedia:Categorization
/wiki/Wikipedia:WikiProject_Cats
/wiki/Wikipedia:CATP
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Reliable_sources/Perennial_sources
/wiki/Wikipedia:Reliable_sources
/wiki/Wikipedia:WikiProject_Reliability
/wiki/Wikipedia:WRE
/wiki/Wikipedia:WikiProject
/wiki/WikiProject
/wiki/Wiki
/wiki/File:En-Wiki2.ogg
/wiki/User:Dmcdevit
/wiki/Digital_Public_Library_of_America
/wiki/File:Digital_Public_Library_of_America_-_Logo.png
/wiki/User:Dominic
/wiki/Wikipedia:GLAM-Wiki
/wiki/Wikipedia:GLAM/About
/wiki/Wikipedia:GLAM


KeyboardInterrupt: 
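
The crawler above calls getLinks recursively, once per newly discovered link, so a long enough chain of pages will exceed Python's recursion limit (1000 frames by default in CPython). One alternative, sketched here under the assumption that pages can be visited in any order, replaces the recursion with an explicit queue. The names crawl, get_links and SITE are hypothetical, and the in-memory graph stands in for real pages.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it
SITE = {
    "/wiki/Wikipedia": ["/wiki/Wiki", "/wiki/WikiProject"],
    "/wiki/Wiki": ["/wiki/Wikipedia", "/wiki/WikiProject"],
    "/wiki/WikiProject": ["/wiki/Wiki"],
}


def crawl(start, get_links, max_pages=100):
    """Visit every reachable page once, breadth first, without recursion."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:  # same role as the `pages` set above
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/wiki/Wikipedia", lambda page: SITE.get(page, []))
print(order)
```

The max_pages cap is an addition for this sketch; it gives the crawl a defined end instead of relying on a manual interrupt.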

In [32]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html)
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id ="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    #Follow the links on every page, not only on pages that raised AttributeError
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")

Main Page
<p><b><a href="/wiki/Sagitta" title="Sagitta">Sagitta</a></b> is a dim but distinctive <a href="/wiki/Constellation" title="Constellation">constellation</a> in the northern sky. Its name is <a href="/wiki/Latin" title="Latin">Latin</a> for 'arrow', and it should not be confused with the larger constellation <a href="/wiki/Sagittarius_(constellation)" title="Sagittarius (constellation)">Sagittarius</a>, the archer. It was included among the 48 constellations listed by the 2nd-century astronomer <a href="/wiki/Ptolemy" title="Ptolemy">Ptolemy</a>, and it remains one of the <a href="/wiki/IAU_designated_constellations" title="IAU designated constellations">88 modern constellations</a> defined by the <a href="/wiki/International_Astronomical_Union" title="International Astronomical Union">International Astronomical Union</a>. Although it dates from antiquity, Sagitta has no star brighter than 3rd <a href="/wiki/Apparent_magnitude" title="Apparent magnitude">magnitude</a> and has 

Wikipedia:WikiProject Resource Exchange
<p><b>Shortcuts</b>
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:WikiProject
Wikipedia:WikiProject
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
----------------
/wiki/WikiProject
WikiProject
<p>A <b>WikiProject</b>, or <b>Wikiproject</b>, is the organization of a group of participants in a <a href="/wiki/Wiki" title="Wiki">wiki</a> established in order to achieve specific editing goals, or to achieve goals relating to a specific field of knowledge. WikiProjects are prevalent within the largest wiki, <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>, and exist to varying degrees within <a class="mw-redirect" href="/wiki/Wikimedia_project" title="Wikimedia project">sister projects</a> such as <a href="/wiki/Wiktionary" title="Wiktionary">Wiktionary</a>, <a href="/wiki/Wikiquote" title="Wikiquote">Wikiquote</a>, <a href="/wiki/Wikidata" title="Wikidata">Wikidata<

User talk:103.125.129.226
<p><b>Other reasons this message may be displayed:</b>
</p>
This page is missing something! No worries though!
----------------
/wiki/Special:NewSection/User_talk:103.125.129.226
Creating User talk:103.125.129.226
<p><a class="image" href="/wiki/File:Information.svg"><img alt="Information.svg" data-file-height="256" data-file-width="256" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/2/28/Information.svg/20px-Information.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/2/28/Information.svg/30px-Information.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/2/28/Information.svg/40px-Information.svg.png 2x" width="20"/></a> <i>Content that <a href="/wiki/Wikipedia:Copyright_violations" title="Wikipedia:Copyright violations">violates any copyrights</a> will be deleted. Encyclopedic content must be <a href="/wiki/Help:Introduction_to_referencing_with_Wiki_Markup/1" title="Help:Introduction to referencing with Wiki Markup

Wikipedia:Shortcut
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:Keyboard_shortcuts
Wikipedia:Keyboard shortcuts
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:WikiProject_Kansas
Wikipedia:WikiProject Kansas
<p><span style="font-size:100%;font-weight:bold;border: none; margin: 0; padding:0; padding-bottom:.1em; color:#FFD700;"><a class="image" href="/wiki/File:Seal_of_Kansas.svg"><img alt="Seal of Kansas.svg" data-file-height="600" data-file-width="600" decoding="async" height="48" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/Seal_of_Kansas.svg/48px-Seal_of_Kansas.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/Seal_of_Kansas.svg/72px-Seal_of_Kansas.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/45/Seal_of_Kansas.svg/96px-Seal_of_Kansas.svg.png 2x" width="48"/></a><br/><i>Welcome</i></span>
</p>
This page

KeyboardInterrupt: 

In [33]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
pages = set()
random.seed(datetime.datetime.now())
#Retrieves a list of all Internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

#Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" or "www" that do not contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html)
    externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
    if len(externalLinks) == 0:
        internalLinks = getInternalLinks(startingPage)
        return getNextExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink("http://oreilly.com")
    print("Random external link is: "+externalLink)
    followExternalOnly(externalLink)
    
followExternalOnly("http://oreilly.com")


Random external link is: https://www.linkedin.com/company/oreilly-media
Random external link is: https://www.youtube.com/user/OreillyMedia
Random external link is: https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
Random external link is: https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
Random external link is: https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
Random external link is: https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
Random external link is: https://channelstore.roku.com/details/c8a2d0096693eb9455f6ac165003ee06/oreilly
Random external link is: https://itunes.apple.com/us/app/safari-to-go/id881697395
Random external link is: https://www.facebook.com/OReilly/
Rand

KeyboardInterrupt: 

In [34]:
#Collects a list of all external URLs found on the site
allExtLinks = set()
allIntLinks = set()
def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html)
    internalLinks = getInternalLinks(bsObj,splitAddress(siteUrl)[0])
    externalLinks = getExternalLinks(bsObj,splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            print("About to get link: "+link)
            allIntLinks.add(link)
            getAllExternalLinks(link)

getAllExternalLinks("http://oreilly.com")

https://twitter.com/oreillymedia
https://www.facebook.com/OReilly/
https://www.linkedin.com/company/oreilly-media
https://www.youtube.com/user/OreillyMedia
https://itunes.apple.com/us/app/safari-to-go/id881697395
https://play.google.com/store/apps/details?id=com.safariflow.queue
https://channelstore.roku.com/details/c8a2d0096693eb9455f6ac165003ee06/oreilly
https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
About to get link: https://www.oreilly.com
https://www.oreilly.com
https://learning.oreilly.com/accounts/login-check/
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/feature-certification.html
https://www.oreilly.com/online-learning/intro-interactive-learning.html
https://www.oreilly.com/online-learning/l

ValueError: unknown url type: '/home/'
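
The ValueError above arises because some internal links are relative paths such as /home/, and urlopen only accepts absolute URLs. One way to handle this, sketched below, is to resolve every href against the page it was found on with urljoin before requesting it. The helper names to_absolute and is_internal are hypothetical, and about.html is a made-up example path.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


def is_internal(page_url, href):
    """True if href, once resolved, points at the same host as page_url."""
    return urlparse(to_absolute(page_url, href)).netloc == urlparse(page_url).netloc


page = "http://oreilly.com"
for href in ["/home/", "about.html", "https://twitter.com/oreillymedia"]:
    print(to_absolute(page, href), is_internal(page, href))
```

Comparing hosts exactly treats oreilly.com and www.oreilly.com as different sites. Whether that is the desired behavior depends on the crawl, so it is left as an assumption of this sketch.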