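On a similar Wikipedia scraping project, he said, he wrote a large filtering function, more than 100 lines of code, to decide whether an internal Wikipedia link pointed to an article page. Unfortunately, he had not spent much time up front looking for the pattern that separates "article links" from "other links", though he may have found it later. If you look closely at the links that point to article pages (as opposed to other internal pages), you will see that they have three things in common:
1. They sit inside the div whose id is bodyContent
2. Their URLs do not contain colons
3. Their URLs begin with /wiki/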

We can use these rules to adjust the code slightly so that it retrieves only article links, using the following regular expression:
^(/wiki/)((?!:).)*$


In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import datetime


In [None]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
bs = BeautifulSoup(html,"html.parser")
for link in bs.find("div",{"id":'bodyContent'}).find_all(
    'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs["href"])
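
The URL rules can be checked without touching the network. The sketch below applies the same regular expression to a handful of made-up hrefs (the sample values and the helper name is_article_link are assumptions for illustration, not taken from a real page) to show which ones pass the filter. It covers rules 2 and 3 only; rule 1 depends on where the link sits in the page.

```python
import re

# Same pattern as in the cell above: starts with /wiki/ and contains no colon.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

# Hypothetical hrefs, chosen only to exercise each rule.
sample_hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Footloose_(1984_film)',
    '/wiki/Category:American_film_actors',   # contains a colon
    '/wiki/Wikipedia:Protection_policy',     # contains a colon
    '//wikimediafoundation.org/',            # does not start with /wiki/
    '#cite_note-1',                          # in-page anchor
]


def is_article_link(href):
    """Return True when href matches the article-link pattern."""
    return ARTICLE_LINK.match(href) is not None


article_links = [href for href in sample_hrefs if is_article_link(href)]
print(article_links)
# ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```

The negative lookahead (?!:) is tested before every character, so a single colon anywhere after /wiki/ is enough to reject the link.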

# Random crawl path

In [None]:
# random.seed() accepts numbers, not datetime objects (a TypeError since Python 3.11)
random.seed(datetime.datetime.now().timestamp())
def get_links(articleUrl):
    html = urlopen("https://en.wikipedia.org{}".format(articleUrl))
    bs = BeautifulSoup(html,"html.parser")
    return bs.find("div",{"id":'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = get_links("/wiki/Kevin_Bacon")
while len(links)>0:
    newArticle = links[random.randint(0,len(links)-1)].attrs["href"]
    print(newArticle)
    links = get_links(newArticle)


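# Link deduplication
To avoid crawling the same page twice, deduplicating links is essential. While the code runs, keep every link discovered so far in one place, stored in a set so that lookups are convenient. A set is similar to a list, but its elements have no particular order and it stores only unique elements, which is exactly what we need. Only "new" links should be crawled and searched for further links: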

In [3]:
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("https://en.wikipedia.org{}".format(pageUrl))
    bs = BeautifulSoup(html,"html.parser")
    for link in bs.find_all("a",href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs["href"] not in pages:
                # We have reached a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")

/wiki/Wikipedia
/wiki/Main_Page
/wiki/Free_content
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_page_protection/Administrator_instructions
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy#Semi-protection
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Reliable_sources/Perennial_sources
/wiki/Wikipedia:Reliable_sources
/wiki/Wikipedia:WikiProject_Reliability
/wiki/Wikipedia:WRE
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Policies_and_guidelines
/wiki/Wikipedia:WikiProject_Politics


KeyboardInterrupt: 
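
The recursive version above stops only when interrupted, and on a large site it would likely hit Python's recursion limit (1000 by default) well before covering every page. As a sketch of the same deduplication idea, the code below swaps recursion for an explicit queue and runs on a small in-memory link graph. FAKE_SITE and crawl are hypothetical names introduced for illustration; a real crawler would replace the dictionary lookup with a page fetch.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
FAKE_SITE = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/A', '/wiki/C'],
    '/wiki/C': ['/wiki/A', '/wiki/D'],
    '/wiki/D': [],
}


def crawl(start, get_links):
    """Visit every page reachable from start exactly once, in discovery order."""
    seen = {start}            # the set gives fast membership tests
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:   # only "new" links are queued
                seen.add(link)
                queue.append(link)
    return order


visited = crawl('/wiki/A', lambda page: FAKE_SITE.get(page, []))
print(visited)
# ['/wiki/A', '/wiki/B', '/wiki/C', '/wiki/D']
```

Each page appears once in the result even though /wiki/A and /wiki/C are linked from several places.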

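# Handling redirects
You usually do not need to worry about server-side redirects. If you are using the urllib library in Python 3.x, it handles redirects automatically! If you are using the requests library, make sure redirects are allowed by setting the allow_redirects flag to True (for GET requests this is already the default):
r = requests.get('http://github.com', allow_redirects=True)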
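
To watch urllib follow a redirect without relying on an external site, the sketch below starts a throwaway local server whose /old path answers with a 302 pointing at /new. RedirectHandler, fetch_final_url, and demo are hypothetical names used only for this illustration; geturl() on the response reports the URL that was finally retrieved.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: /old answers with a 302 pointing to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_final_url(url):
    """Open url with urllib and return (final URL, body) after any redirects."""
    with urlopen(url) as response:
        return response.geturl(), response.read()


def demo():
    """Start the local server, request /old, and return (base URL, fetch result)."""
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = "http://127.0.0.1:{}".format(server.server_port)
        return base, fetch_final_url(base + "/old")
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() requests /old, and the returned final URL ends in /new, which shows that urlopen followed the redirect on its own. With requests, a comparable check is to compare r.url with the URL that was requested, or to inspect r.history.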

In [4]:
from urllib.request import urlopen
from urllib.parse import urlparse 
from bs4 import  BeautifulSoup
import re 
import datetime 
import random 

In [5]:
pages = set()
# random.seed() accepts numbers, not datetime objects (a TypeError since Python 3.11)
random.seed(datetime.datetime.now().timestamp())


In [6]:
def getInteralLinks(bs, includeUrl):
    '''Return a list of all internal links found on the page.'''
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    # Find all links that begin with "/" or contain the site's own URL
    for link in bs.find_all('a', href=re.compile('^(/|.*' + includeUrl + ')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith('/'):
                    # Relative link: prepend scheme and host to make it absolute
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])

    return internalLinks

In [7]:
def getExternalLinks(bs, excludeUrl):
    '''Return a list of all external links found on the page.'''
    externalLinks = []
    # Find all links that start with "http" or "www" and do not contain the current URL
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!' + excludeUrl + ').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])

    return externalLinks
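
Patterns built by concatenating a URL into a regular expression can misfire, because characters such as "." have a special meaning in a pattern. As an alternative sketch (not the approach used in the cells above), links can be classified by resolving each href against the page URL and comparing host names with urllib.parse. The function split_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Split hrefs into (internal, external) absolute URLs, without duplicates."""
    site_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)        # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue                              # skip mailto:, javascript:, etc.
        bucket = internal if parsed.netloc == site_host else external
        if absolute not in bucket:
            bucket.append(absolute)
    return internal, external


# Hypothetical hrefs as they might appear on a page at http://example.com/.
hrefs = ['/about', 'http://example.com/about', 'contact.html',
         'https://www.python.org/', 'mailto:someone@example.com']
internal, external = split_links('http://example.com/', hrefs)
print(internal)
# ['http://example.com/about', 'http://example.com/contact.html']
print(external)
# ['https://www.python.org/']
```

This compares host names exactly, so www.example.com and example.com count as different sites; whether that is the desired behavior depends on the crawl.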


In [8]:
def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html,"html.parser")
    externalLinks = getExternalLinks(bs,urlparse(startingPage).netloc)

    if len(externalLinks)==0:
        print("No external links, looking around the site for one")
        domain = '{}://{}'.format(urlparse(startingPage).scheme,urlparse(startingPage).netloc)
        internalLinks = getInteralLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0,
                                         len(internalLinks)-1)])

    else:
        return externalLinks[random.randint(0,len(externalLinks)-1)]

In [9]:
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)

In [10]:
followExternalOnly("http://oreilly.com")

Random external link is: https://www.youtube.com/user/OreillyMedia
Random external link is: https://developers.google.com/youtube
Random external link is: https://blog.youtube
Random external link is: https://www.youtube.com/channel/UCqVDpXKLmKeBU_yyt_QkItQ
Random external link is: https://developers.google.com/youtube
Random external link is: https://research.youtube/
Random external link is: https://support.google.com/youtube/contact/yt_researcher_certification
Random external link is: https://research.youtube/policies/terms
Random external link is: https://developers.google.com/youtube/terms/revision-history
Random external link is: https://support.google.com/youtube/contact/yt_api_audited_developer_requests_form
Random external link is: https://policies.google.com/privacy?hl=en
Random external link is: https://support.google.com/android/answer/9021432?hl=en
Random external link is: https://developer.android.com/reference/android/os/Build#fingerprint
Random external link is: https:/

HTTPError: HTTP Error 403: Forbidden
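
The run above ends with an unhandled HTTPError because one of the sites refused the request. One way to keep a crawl going, sketched here under the assumption that a failed page can simply be skipped, is to catch urllib's exceptions and return None. The function fetch and its opener parameter are hypothetical; the parameter exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the page cannot be retrieved."""
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        print('Skipping {}: HTTP {}'.format(url, e.code))
    except URLError as e:
        # The server could not be reached at all (DNS failure, refused connection).
        print('Skipping {}: {}'.format(url, e.reason))
    return None


# Stand-in opener that always answers 403, to show the failure path offline.
def always_forbidden(url):
    raise HTTPError(url, 403, 'Forbidden', None, None)


result = fetch('http://example.com/blocked', opener=always_forbidden)
print(result)
# Skipping http://example.com/blocked: HTTP 403
# None
```

HTTPError is a subclass of URLError, so it has to be caught first. A caller such as getRandomExternalLink could then test for None and pick another link instead of crashing.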

In [11]:
'''Collect a list of all external links found on the site'''
allExtLinks = set()
allIntLinks = set()
def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = "{}://{}".format(urlparse(siteUrl).scheme,urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html,"html.parser")
    internalLinks = getInteralLinks(bs,domain)
    externalLinks = getExternalLinks(bs,domain)
    
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)

    for link in internalLinks:
        # Recurse into internal pages that have not been visited yet
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add("http://oreilly.com")
getAllExternalLinks("http://oreilly.com")


https://www.oreilly.com
https://www.oreilly.com/member/login/
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/feature-certification.html
https://www.oreilly.com/online-learning/intro-interactive-learning.html
https://www.oreilly.com/online-learning/live-events.html
https://www.oreilly.com/online-learning/feature-answers.html
https://www.oreilly.com/online-learning/insights-dashboard.html
https://www.oreilly.com/radar/
https://www.oreilly.com/content-marketing-solutions.html
https://www.oreilly.com/online-learning/radar-event-data-ai-2022.html
https://learning.oreilly.com/start-trial/
https://www.oreilly.com/ceros/cloud-labs-register.html
https://www.oreilly.com/online-learni

KeyboardInterrupt: 