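* The core of web crawling is recursion: fetch a page from a URL, inspect that page for other URLs, fetch those pages, and repeat indefinitely
* This process can put a heavy load on the target server, so it needs care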
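
One way to act on the load warning above is to enforce a minimum gap between requests. The sketch below is not part of the original notebook: `Throttle`, `min_interval`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration, and the right delay depends on the site being crawled.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative sketch)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `urlopen(...)` call would keep a crawl loop from hitting the server faster than the chosen interval.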

## 3.1 Traversing a Single Domain

* Project: starting from films featuring the actor Eric Idle, find the minimum number of works needed to reach the Kevin Bacon page

In [7]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

In [2]:
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.findAll('a'): # iterate over every <a> tag on the page
    if 'href' in link.attrs: # keep only the links that have an href attribute
        print(link.attrs['href']) # print that link's href value

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#searchInput
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/P

* As the output above shows, most of these links are not the ones we want
* Three criteria identify the article links we are after
    1. The links are inside the div whose id is bodyContent
    2. The URL contains no colon
    3. The URL starts with /wiki/
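
Criteria 2 and 3 can be checked on plain strings, without fetching anything. The sketch below applies the same regular expression the notebook uses to a few hrefs copied from the output above; the names `ARTICLE_LINK`, `sample_hrefs`, and `article_hrefs` are illustrative. Criterion 1 (being inside `div#bodyContent`) is a property of the page structure and is not covered here.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after that prefix.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

sample_hrefs = [
    '/wiki/Kevin_Bacon_(disambiguation)',
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',    # rejected: contains a colon
    '#cite_note-1',                            # rejected: does not start with /wiki/
    'http://baconbros.com/',                   # rejected: external URL
    '/wiki/Footloose_(1984_film)',
    '/wiki/Wikipedia:Protection_policy#semi',  # rejected: contains a colon
]

article_hrefs = [href for href in sample_hrefs if ARTICLE_LINK.match(href)]
print(article_hrefs)
```

Only the two plain article paths survive the filter.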

In [4]:
for link in bs.find('div', {'id':'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia,_Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Balto_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(1982_film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood

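* To move from one meaningful article to the next, we need to
    1. Write a getLinks function that takes a Wikipedia article URL of the form /wiki/<article_name> and returns the full list of linked article URLs
    2. Pick a random article link from the list getLinks returns and call getLinks on it again, repeating until the new page has no article links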

In [None]:
random.seed(datetime.datetime.now().timestamp())  # a datetime object is not an accepted seed type on Python 3.11+

def getLinks(articleUrl) :
    
    html = urlopen(f'http://en.wikipedia.org{articleUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    
    return bs.find('div', {'id':'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))
    # To keep only links that end in "film)", add m\) just before the $

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0 :
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

* The code above only hops randomly from article to article; it does not yet solve the problem we set out to solve
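
Finding the minimum number of hops is a shortest-path question, and breadth-first search is one common way to answer it. The sketch below is an assumption about how the problem could be approached, not the notebook's solution: it runs on a small hand-made link graph (`TOY_GRAPH`, with made-up page names) instead of live Wikipedia pages, and `shortest_link_path` is a hypothetical helper.

```python
from collections import deque


def shortest_link_path(graph, start, target):
    """Return the shortest list of pages from start to target, or None if unreachable."""
    if start == target:
        return [start]
    previous = {start: None}  # page -> page we first reached it from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbor] = page
            if neighbor == target:
                # Walk the predecessor chain back to the start.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Made-up link graph for illustration only.
TOY_GRAPH = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/D'],
    '/wiki/C': ['/wiki/D', '/wiki/E'],
    '/wiki/D': ['/wiki/F'],
    '/wiki/E': ['/wiki/F'],
}

print(shortest_link_path(TOY_GRAPH, '/wiki/A', '/wiki/F'))
```

On real data, `graph.get(page, [])` would be replaced by a call that fetches the page and returns its article links, which is exactly where the server load concern applies.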

## 3.2 Crawling an Entire Site

* Useful for generating site maps and for collecting data

In [None]:
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.findAll('a', href = re.compile('^(/wiki/)')):
        if 'href' in link.attrs :
            if link.attrs['href'] not in pages :
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
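
Because getLinks calls itself once per newly discovered page, a large site can exceed Python's default recursion limit (1000 frames). A queue-based loop avoids that. The sketch below is one possible rewrite under stated assumptions: `crawl_site`, `fetch_links`, and `max_pages` are hypothetical names, and `fetch_links` stands in for the "download the page and return its /wiki/ hrefs" step so the example runs offline.

```python
from collections import deque


def crawl_site(start, fetch_links, max_pages=100):
    """Visit pages breadth-first without recursion; return them in visit order."""
    seen = {start}          # every URL that has been queued so far
    order = []              # URLs in the order they were visited
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for href in fetch_links(page):
            if href not in seen:
                seen.add(href)
                queue.append(href)
    return order


# Stand-in link table so the sketch does not touch the network.
FAKE_SITE = {
    '': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': [''],
    '/wiki/C': [],
}

print(crawl_site('', lambda page: FAKE_SITE.get(page, [])))
```

The `max_pages` cap is a deliberate safety valve: without it, crawling a site the size of Wikipedia would effectively never finish.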

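### 3.2.1 Collecting Data Across an Entire Site

* Build a scraper that collects the page title, the first paragraph, and the link to the edit page
* Wikipedia pages follow this pattern
    1. The title is always inside an h1 tag
    2. The first paragraph's text is the first p tag under div#mw-content-text
    3. The edit link exists only on article pages, at li#ca-edit -> span -> a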
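
The three rules can be tried on a small HTML string before running them against the live site. In the sketch below, `SAMPLE_HTML` is a made-up page that only mimics the structure described above, and `extract_fields` is a hypothetical helper. It also skips empty paragraphs, an addition of this sketch that is useful when a page starts with an empty `<p class="mw-empty-elt">` element.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the Wikipedia structure described in the list above.
SAMPLE_HTML = """
<html><body>
<h1>Sample Article</h1>
<ul><li id="ca-edit"><span><a href="/w/index.php?title=Sample_Article&amp;action=edit">Edit</a></span></li></ul>
<div id="mw-content-text">
  <p class="mw-empty-elt"></p>
  <p>First real paragraph.</p>
</div>
</body></html>
"""


def extract_fields(html):
    """Return (title, first non-empty paragraph text, edit link); None for missing parts."""
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.h1.get_text() if bs.h1 else None

    first_paragraph = None
    content = bs.find(id='mw-content-text')
    if content is not None:
        for p in content.find_all('p'):
            text = p.get_text().strip()
            if text:  # skip empty placeholder paragraphs
                first_paragraph = text
                break

    edit_item = bs.find(id='ca-edit')
    edit_anchor = edit_item.find('a') if edit_item is not None else None
    edit_link = edit_anchor.get('href') if edit_anchor is not None else None
    return title, first_paragraph, edit_link


print(extract_fields(SAMPLE_HTML))
```

Returning `None` for each missing field, instead of relying on a single `AttributeError` handler, makes it possible to see which of the three parts a given page lacks.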

In [20]:
def getLinks(pageUrl) :
    # global pages
    html = urlopen('http://en.wikipedia.org' + pageUrl)
    bs = BeautifulSoup(html, 'html.parser')
    try :
        print('Head: ',bs.h1.get_text())
        print('First Paragraph: ', bs.find(id='mw-content-text').findAll('p')[0])
        print('Edit Link: ', bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError :
        print('This page is missing something! No worries though!')

In [50]:
pages = set()
def getLinks(pageUrl) :
    global pages
    html = urlopen('http://en.wikipedia.org' + pageUrl)
    bs = BeautifulSoup(html, 'html.parser')
    try :
        print('Head: ',bs.h1.get_text())
        print('First Paragraph: ', bs.find(id='mw-content-text').findAll('p')[0])
        print('Edit Link: ', bs.find(id='ca-edit').find('a').attrs['href'])
    except AttributeError :
        print('This page is missing something! No worries though!')
    for link in bs.findAll('a', href = re.compile('^(/wiki/)')):
        if 'href' in link.attrs :
            if link.attrs['href'] not in pages :
                newPage = link.attrs['href']
                print('-----------------------\n'+newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

Head:  Main Page
First Paragraph:  <p><b><a href="/wiki/Landis%27s_Missouri_Battery" title="Landis's Missouri Battery">Landis's Missouri Battery</a></b> was an <a href="/wiki/Artillery_battery" title="Artillery battery">artillery battery</a> that served in the <a href="/wiki/Confederate_States_Army" title="Confederate States Army">Confederate States Army</a> during the early stages of the <a href="/wiki/American_Civil_War" title="American Civil War">American Civil War</a>. The battery was formed in late 1861 and early 1862, and was crewed by a maximum of 62 men. It fielded two <a href="/wiki/Canon_obusier_de_12" title="Canon obusier de 12">12-pounder Napoleon cannons</a> <i>(example pictured)</i> and two <a href="/wiki/M1841_24-pounder_howitzer" title="M1841 24-pounder howitzer">24-pounder howitzers</a>. The battery saw limited action at the <a href="/wiki/Battle_of_Iuka" title="Battle of Iuka">Battle of Iuka</a> before providing artillery support at the <a href="/wiki/Second_Battle_of

Head:  Pages that link to "File:People icon.svg"
First Paragraph:  <p>The following pages link to <b id="specialDeleteTarget"><a href="/wiki/File:People_icon.svg" title="File:People icon.svg">File:People icon.svg</a></b> <span id="specialDeleteLink"></span>
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Help:What_links_here
Head:  Help:What links here
First Paragraph:  <p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
Head:  Wikipedia:Project namespace
First Paragraph:  <p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:Policies_and_guidelines
Head:  Wikipedia:Policies and guidelines
First Paragraph:  <p>Wikipedia's <b>policies and guidelines</b> are developed by the community to describe best practices, clarify principles, resolve conflicts, and otherwise furthe

Head:  Des Moines, Iowa
First Paragraph:  <p class="mw-empty-elt">
</p>
Edit Link:  /w/index.php?title=Des_Moines,_Iowa&action=edit
-----------------------
/wiki/Des_Moines_(disambiguation)
Head:  Des Moines (disambiguation)
First Paragraph:  <p><b><a href="/wiki/Des_Moines,_Iowa" title="Des Moines, Iowa">Des Moines</a></b> is the capital of the state of Iowa in the United States.
</p>
Edit Link:  /w/index.php?title=Des_Moines_(disambiguation)&action=edit
-----------------------
/wiki/Des_Moines_metropolitan_area
Head:  Des Moines metropolitan area
First Paragraph:  <p class="mw-empty-elt">
</p>
Edit Link:  /w/index.php?title=Des_Moines_metropolitan_area&action=edit
-----------------------
/wiki/Metropolitan_statistical_area
Head:  Metropolitan statistical area
First Paragraph:  <p><span style="font-size:110%;"><a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">Population</a></span>
</p>
Edit Link:  /w/index.php?title=Metropoli

Head:  Wikipedia:Graphics Lab/Photography workshop
First Paragraph:  <p>The <b><a href="/wiki/Wikipedia:Graphics_Lab" title="Wikipedia:Graphics Lab">Graphics Lab</a></b> is a project to improve the graphical content of the Wikimedia projects. Requests for image improvements can be added to the workshop pages: <a class="mw-redirect" href="/wiki/Wikipedia:GL/ILL" title="Wikipedia:GL/ILL">Illustrations</a>, <a class="mw-redirect" href="/wiki/Wikipedia:GL/PHOTO" title="Wikipedia:GL/PHOTO">Photographs</a> and <a class="mw-redirect" href="/wiki/Wikipedia:GL/MAP" title="Wikipedia:GL/MAP">Maps</a>. For questions or suggestions one can use the talk pages: <a href="/wiki/Wikipedia_talk:Graphics_Lab" title="Wikipedia talk:Graphics Lab">Talk:Graphics Lab</a>, <a href="/wiki/Wikipedia_talk:Graphics_Lab/Illustration_workshop" title="Wikipedia talk:Graphics Lab/Illustration workshop">Talk:Illustrations</a>, <a href="/wiki/Wikipedia_talk:Graphics_Lab/Photography_workshop" title="Wikipedia talk:Graphic

Head:  File:Digital Public Library of America - Logo.png
First Paragraph:  <p><a class="internal" href="//upload.wikimedia.org/wikipedia/commons/5/52/Digital_Public_Library_of_America_-_Logo.png" title="Digital Public Library of America - Logo.png">Digital_Public_Library_of_America_-_Logo.png</a> ‎<span class="fileInfo">(500 × 180 pixels, file size: 42 KB, MIME type: <span class="mime-type">image/png</span>)</span>
</p>
This page is missing something! No worries though!
-----------------------
/wiki/User:Dominic
Head:  User:Dominic
First Paragraph:  <p><br/>
</p>
Edit Link:  /w/index.php?title=User:Dominic&action=edit
-----------------------
/wiki/Wikipedia:GLAM-Wiki
Head:  Wikipedia:GLAM
First Paragraph:  <p><br/>
<span style="font-size:larger;">The <b>GLAM–Wiki initiative</b> ("galleries, libraries, archives, and museums" with Wikipedia; also including botanic gardens and zoos) helps cultural institutions share their resources with the world through collaborative projects with experi

Head:  Wikipedia:Manual of Style/Abbreviations
First Paragraph:  <p>This guideline covers the use of <a href="/wiki/Abbreviation" title="Abbreviation">abbreviations</a> – including <a href="/wiki/Acronym" title="Acronym">acronyms and initialisms</a>, <a href="/wiki/Contraction_(grammar)" title="Contraction (grammar)">contractions</a>, and other <a class="mw-redirect" href="/wiki/Shortening_(grammar)" title="Shortening (grammar)">shortenings</a> – as used in the <a href="/wiki/English_Wikipedia" title="English Wikipedia">English Wikipedia</a>.
</p>
Edit Link:  /w/index.php?title=Wikipedia:Manual_of_Style/Abbreviations&action=edit
-----------------------
/wiki/Wikipedia:Manual_of_Style
Head:  Wikipedia:Manual of Style
First Paragraph:  <p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:Policies_and_guidelines#guide
Head:  Wikipedia:Policies and guidelines
First Paragraph:  <p>Wikipedia's <b>policies and guidelines</b> 

Head:  Wikipedia:Essays
First Paragraph:  <p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:NOTESSAY
Head:  Wikipedia:What Wikipedia is not
First Paragraph:  <p><a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a> is an online <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia</a> and, as a means to that end, an <a class="extiw" href="https://meta.wikimedia.org/wiki/The_Wikipedia_Community" title="meta:The Wikipedia Community">online community</a> of individuals interested in building and using a high-quality encyclopedia in a spirit of mutual respect. Therefore, <b>there are certain things that Wikipedia is <em>not</em></b>.
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:Notability
Head:  Wikipedia:Notability
First Paragraph:  <p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
-----------------------
/wiki/Wikipedia:NPOV

KeyboardInterrupt: 

## 3.3 Crawling Across the Internet

In [41]:

# Build a list of all internal links found on the page.

def getInternalLinks(bs, includeUrl) :
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    
    # Find all links that start with / or that contain the site's own URL.
    for link in bs.findAll('a', href=re.compile('^(/|.*' + includeUrl + ')')): # the second alternative catches absolute links to the same site
        if link.attrs['href'] is not None : 
            if link.attrs['href'] not in internalLinks :
                if(link.attrs['href'].startswith('/')) :
                    internalLinks.append(includeUrl+link.attrs['href'])
                else :
                    internalLinks.append(link.attrs['href'])
    return internalLinks

def getExternalLinks(bs, excludeUrl) :
    externalLinks = []
    
    # Find all links that start with http or www and do not contain the current URL.
    for link in bs.findAll('a', href = re.compile('^(http|www)((?!' + excludeUrl + ').)*$')) :
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks :
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage) :
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0 :
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        print(internalLinks)
        return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
    
    else :
        return externalLinks[random.randint(0, len(externalLinks)-1)]
    
def followExternalOnly(startingSite) :
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is : {}'.format(externalLink))
    followExternalOnly(externalLink)        
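
The internal/external split above is done with regular expressions on the raw href. An alternative, shown here only as a sketch, is to resolve each href against the page URL and compare hostnames. `split_links` and the sample hrefs are hypothetical; `urljoin` and `urlparse` are standard-library functions. Note one difference in behavior: this version treats `www.oreilly.com` and `oreilly.com` as different hosts.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Resolve hrefs against page_url and sort them into (internal, external) lists."""
    page_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # turns relative hrefs into full URLs
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue                         # skip mailto:, javascript:, and similar
        bucket = internal if parsed.netloc == page_host else external
        if absolute not in bucket:           # keep each URL once, in first-seen order
            bucket.append(absolute)
    return internal, external


internal, external = split_links(
    'http://oreilly.com/index.html',
    ['/about', 'http://oreilly.com/about', 'https://www.example.com/x',
     'mailto:someone@example.com', 'products/books.html'],
)
print(internal)
print(external)
```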

In [None]:
#pages = set()
random.seed(datetime.datetime.now().timestamp())  # a datetime object is not an accepted seed type on Python 3.11+

followExternalOnly('http://oreilly.com')

In [None]:
# Collect every external URL found on the site.

allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl) :
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)
    
    for link in externalLinks :
        if link not in allExtLinks :
            allExtLinks.add(link)
            print(link)
    for link in internalLinks :
        if link not in allIntLinks :
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')

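# 4. Web Crawling Models

* Different sites use different layouts, so a crawler that is not tied to any one layout has to be built for extensibility

## 4.1 Planning and Defining Objects

* Not all data can be collected, so the information to collect must be defined up front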

In [59]:
import requests
import time

s_t = time.time()
class Content:
    def __init__(self, url, title, body) :
        self.url = url
        self.title = title
        self.body = body
        
def getPage(url) :
    # Rough timings observed: 0.1~7 with requests, 1~3 seconds with urlopen
    
    req = requests.get(url)
    #html = urlopen(url)
    return BeautifulSoup(req.text, 'html.parser')
    #return BeautifulSoup(html, 'html.parser')

def scrapeNYTimes(url) :
    bs = getPage(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)

def scrapeBrookings(url) :
    bs = getPage(url)
    title = bs.find('h1').text
    body = bs.find('div', {'class': 'post-body'}).text  # attrs should be a dict; the original {'class', 'post-body'} was a set
    return Content(url, title, body)

b_url = '''https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'''

content = scrapeBrookings(b_url)
print('Title: {}'.format(content.title))
print(f'URL: {content.url}')
print(content.body)


n_url = 'https://www.nytimes.com/2020/11/30/opinion/trump-conspiracy-germany-1918.html'
content = scrapeNYTimes(n_url)
print('Title: {}'.format(content.title))
print(f'URL: {content.url}')
print(content.body)

Title: Delivering inclusive urban access: 3 uncomfortable truths
URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/

The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.	






Jeffrey Gutman
Nonresident Senior Fellow - Global Economy and Development







Adie Tomer
Fellow - Metropolitan Policy Program

 Twitter
AdieTomer






But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel times, rising housi

URL: https://www.nytimes.com/2020/11/30/opinion/trump-conspiracy-germany-1918.html
HAMBURG, Germany — It may well be that Germans have a special inclination to panic at specters from the past, and I admit that this alarmism annoys me at times. Yet watching President Trump’s “Stop the Steal” campaign since Election Day, I can’t help but see a parallel to one of the most dreadful episodes from Germany’s history.
One hundred years ago, amid the implosions of Imperial Germany, powerful conservatives who led the country into war refused to accept that they had lost. Their denial gave birth to arguably the most potent and disastrous political lie of the 20th century — the Dolchstosslegende, or stab-in-the-back myth.
Its core claim was that Imperial Germany never lost World War I. Defeat, its proponents said, was declared but not warranted. It was a conspiracy, a con, a capitulation — a grave betrayal that forever stained the nation. That the claim was palpably false didn’t matter. Among a si
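
With one scraper function per site, something has to pick the right function for a given URL. One possible approach, sketched below as an assumption and not as the notebook's design, is a lookup table keyed by hostname. `make_dispatcher`, `fake_nytimes`, and `fake_brookings` are hypothetical; the two stand-in functions replace scrapeNYTimes and scrapeBrookings so the example runs without network access.

```python
from urllib.parse import urlparse


def make_dispatcher(scrapers):
    """Return a function that routes a URL to the scraper registered for its hostname."""
    def dispatch(url):
        host = urlparse(url).netloc
        try:
            scraper = scrapers[host]
        except KeyError:
            raise ValueError(f'No scraper registered for {host!r}') from None
        return scraper(url)
    return dispatch


# Stand-ins for scrapeNYTimes and scrapeBrookings (they only label the URL).
def fake_nytimes(url):
    return ('nytimes', url)


def fake_brookings(url):
    return ('brookings', url)


scrape = make_dispatcher({
    'www.nytimes.com': fake_nytimes,
    'www.brookings.edu': fake_brookings,
})

print(scrape('https://www.nytimes.com/2020/11/30/opinion/trump-conspiracy-germany-1918.html'))
```

Adding a new site then means writing one scraper function and adding one entry to the table, which is a small step toward the extensibility the chapter asks for.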
