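### Recap of the previous session

- What we are building is a crawler plus a scraper.
- The two have different goals. On the web, both rest on the same pipeline: HTTP (request/response) -> bytes -> string (text/HTML), which we handle with Requests and BeautifulSoup.
- BeautifulSoup => DOM: find~ methods and select (CSS selectors).
- A CSS selector can target an HTML tag, #id, .class, HTML attributes, and : (pseudo-selectors).
- Attribute selectors take the form [key(^$*)=value], where ^=, $= and *= match a prefix, a suffix, and a substring of the value.
- A crawler's main job is to traverse the web by following hyperlinks.
- How do we find hyperlinks? Select a[href], iframe[src], img, ... and read the URL from the attribute.
- Next comes normalization: scheme://netloc(server domain)/path?query#fragment. The fragment is an address inside the page, so collecting it only leads back to the same page.
- Because the web is effectively unbounded, the most basic traversal strategies are BFS (queue) and DFS (stack).
- When the seed is a search results page, a queue visits links in order from the top, so BFS is good at finding information close to the seed.
- When the goal is to use everything on one particular site, DFS is somewhat better suited.
- Whichever strategy is chosen, note that depth-first search may never come back: following one branch too deep makes it hard to return.
- DFS on the web therefore needs care. Control it with a depth limit.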
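
The normalization step in the recap can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `normalize` and `is_crawlable` and the example.com URLs are assumptions made for this sketch. Note that it strips the fragment, whereas the notebook cells below simply skip any URL that carries one.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize(base_url, href):
    """Resolve href against base_url and drop the #fragment part."""
    absolute = urljoin(base_url, href)
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment

def is_crawlable(url):
    """Only http(s) URLs can be fetched; mailto:, javascript:, etc. are skipped."""
    return urlparse(url).scheme in ('http', 'https')

base = 'https://example.com/docs/index.html'
print(normalize(base, 'page2.html#section-3'))   # same page as page2.html
print(normalize(base, '/about'))
print(is_crawlable('mailto:someone@example.com'))
```

Stripping the fragment instead of discarding the URL keeps links such as page2.html#section-3 reachable while still treating them as one page.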

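### Today's class

- Today's topic is focused crawling.
- We make the crawler move according to a strategy we choose, built from domain, path, page area, and depth.
- Doing so collects hyperlink information.
- For analyzing that information we look at link analysis => PageRank, and then start on scraping.

- After that we will try getting past logins, using cookies, and so on.
- We will also introduce Selenium, which helps when JavaScript-rendered content has to be handled separately.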

In [1]:
from urllib.robotparser import RobotFileParser
from requests import get
from requests.compat import urlparse, urljoin
from requests.exceptions import HTTPError
from time import sleep
from bs4 import BeautifulSoup
import re

In [3]:
headers = {
    'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
}

#### Strategy 1: Depth limit

##### Queue approach

In [3]:
URLs = list()
seens = list()
# Structure change: the frontier used to be a list of URL strings; it is now a list of dicts (keys: 'url', 'depth').
URLs.append({'url': 'https://search.naver.com/search.naver?query=%ED%95%9C%EC%86%8C%ED%9D%AC&where=nexearch',
             'depth':0})

# Strategy 1: depth (0 -> 1 -> 2, ... limit)
#      ['url', ...]                  <- the structure used so far
#      [{url:'url', depth:0}, ...]   <- the structure used from now on

# pop(0): BFS(Queue), pop(-1): DFS(Stack)
while URLs:
    seed = URLs.pop(0)
    seens.append(seed['url'])
    
    # Apply strategy 1: fetch pages up to depth 3.
    # With a queue, depths come out in non-decreasing order, so break is safe here.
    if seed['depth'] > 3:
        break
    
    # Each frontier entry is a dict, so the actual address is under the 'url' key.
    resp = get(seed['url'], headers=headers)
    
    # Check for HTTP errors (4xx/5xx) and skip the page if one occurred.
    try:
        resp.raise_for_status()
    except HTTPError as e:
        print(e)
        continue

    # Skip responses that are not text/HTML (a missing header counts as non-HTML).
    if re.search('text|html', resp.headers.get('content-type', '')) is None:
        continue
            
    dom = BeautifulSoup(resp.text, 'html.parser')
    
    for link in dom.select('a[href], iframe[src]'):
        url = urljoin(seed['url'], link.attrs['href'] if link.has_attr('href') else link.attrs['src'])
        
        # Keep only http(s) URLs that carry no fragment.
        if len(urlparse(url).fragment) == 0 and urlparse(url).scheme in ['http', 'https']:
            # URL seen? Flatten the frontier (list of dicts) into a list of addresses for the check.
            if url not in [u['url'] for u in URLs] and url not in seens:
                # Add to the frontier as a dict, one level deeper than the current page.
                URLs.append({'url':url, 'depth':seed['depth']+1})
            
    print(len(URLs))

188
316
323
404 Client Error: Not Found for url: https://search.naver.com/@5@
325
332
331
330
330
331
421
487
598
665
696
741
758
757
804
803
836
835
1589
1597
1668
1668
1687
1717
1748
1776
1801
1822
1843
1843
1846
1845
1892
1984
2073
2133
2133
2977
2976
3054
3132
3155
3157
3157
3193
3193
3193
3320
3352
3505
3718
3846
4033
4165
4341
4532
4622
4775
4867
5017
5145
5339
5473
5736
5852
5971
5995
6018
6033
6037
6042
6075
6084
6085
403 Client Error: Forbidden for url: https://namu.wiki/w/%ED%95%9C%EC%86%8C%ED%9D%AC
6084
6089
6096
6103
6136
6405
6504
6585
6595
6857
6945
6959
7084
7094
7240
7248
7354
7362
7419
7448
7458
7465
7699
7823
7847
7847
7906
7947
7947
7956
8062
8141
8161
8161
8169
8171
8321
8439
8458
8458
8474
8474
8486
8486
8498
8516
8542
8570
8613
8613
8613
8613
8612
8611
8742
8907
8966
8966
8966
9221
9468
9639
9856
10083
10254
10254
10254
10438
10938
10937
10936
10936
10936
10936
11143
11245
11365
11607
11821
12068
12294
12505
12524
12535
12535
12535
12668
12668
12668
12668
12693
12

KeyboardInterrupt: 

In [4]:
len(URLs), len(seens)
# Even though the crawl only got to around depth 2, this many URLs were queued.
# seens holds the pages actually visited.

(18609, 376)
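
The cell above checks `url not in [u['url'] for u in URLs]` for every link, which rescans the whole frontier each time. A possible alternative, sketched here on a made-up in-memory link graph rather than the real web, keeps a `set` of every URL ever queued and uses `collections.deque` as the frontier. The names `LINKS` and `crawl` are hypothetical, and the depth rule differs slightly from the notebook: a page at `max_depth` is still visited but its links are not expanded.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': ['F'],
    'E': [],
    'F': ['A'],
}

def crawl(seed, max_depth, bfs=True):
    frontier = deque([(seed, 0)])
    seen = {seed}              # every URL ever queued; O(1) membership test
    order = []
    while frontier:
        # popleft() treats the deque as a queue (BFS), pop() as a stack (DFS).
        url, depth = frontier.popleft() if bfs else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue           # visit the page but do not expand its links
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return order

print(crawl('A', max_depth=2))             # BFS order
print(crawl('A', max_depth=2, bfs=False))  # DFS order
```

Page F sits at depth 3 and is never visited in either order, which is the depth limit doing its job.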

In [28]:
a = [{'a':1, 'b':2}]
A = a.pop()
A

{'a': 1, 'b': 2}

##### Stack approach

In [5]:
URLs = list()
seens = list()
# Structure change: the frontier used to be a list of URL strings; it is now a list of dicts (keys: 'url', 'depth').
URLs.append({'url': 'https://search.naver.com/search.naver?query=%ED%95%9C%EC%86%8C%ED%9D%AC&where=nexearch',
             'depth':0})

# Strategy 1: depth (0 -> 1 -> 2, ... limit)
#      ['url', ...]                  <- the structure used so far
#      [{url:'url', depth:0}, ...]   <- the structure used from now on

# pop(0): BFS(Queue), pop(-1): DFS(Stack)
while URLs:
    seed = URLs.pop() # DFS
    seens.append(seed['url'])
    
    # Apply strategy 1: fetch pages up to depth 3.
    # With a stack, shallower entries may still be waiting, so skip with continue instead of break.
    if seed['depth'] > 3:
        continue
    
    # Each frontier entry is a dict, so the actual address is under the 'url' key.
    resp = get(seed['url'], headers=headers)
    
    try:
        resp.raise_for_status()
    except HTTPError as e:
        print(e)
        continue

    # Skip responses that are not text/HTML (a missing header counts as non-HTML).
    if re.search('text|html', resp.headers.get('content-type', '')) is None:
        continue
            
    dom = BeautifulSoup(resp.text, 'html.parser')
    
    for link in dom.select('a[href], iframe[src]'):
        url = urljoin(seed['url'], link.attrs['href'] if link.has_attr('href') else link.attrs['src'])
        
        # Keep only http(s) URLs that carry no fragment.
        if len(urlparse(url).fragment) == 0 and urlparse(url).scheme in ['http', 'https']:
            # URL seen? Flatten the frontier (list of dicts) into a list of addresses for the check.
            if url not in [u['url'] for u in URLs] and url not in seens:
                # Add to the frontier as a dict, one level deeper than the current page.
                URLs.append({'url':url, 'depth':seed['depth']+1})
            
    print(len(URLs))

188
266
278
298
305
284
274
273
272
271
270
284
268
267
266
265
326
404 Client Error: Not Found for url: https://apps.apple.com/app/id1530772901?mt=12
356
397
322
331
322
319
350
402
317
328
367
313
374
316
321
313
322
307
319
305
312
309
308
311
302
301
299
298
313
306
301
295
349
291
305
300
304
291
286
300
289
308
288
281
307
283
279
280
278
278
280
277
279
277
279
278
273
283
269
269
268
278
281
280
285
279
278
276
275
273
273
271
275
276
278
282
263
270
277
268
271
317
269
275
263
262
280
330
340
279
279
275
290
273
272
271
278
276
282
270
268
269
316
263
276
268
276
276
328
295
280
275
273
272
271
270
269
268
267
275
265
261
260
264
281
262
262
273
260
259
258
285
284
284
282
289
280
288
285
371
315
287
378
288
280
409
314
312
404 Client Error: Not Found for url: https://play.google.com/store/apps/details?id=jp.naver.linemanga.android&hl=ko&gl=US


ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument'))

In [None]:
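# The ConnectionError (OSError) above was most likely caused by play.google.com blocking the request.
# It is raised by get() itself, outside the try block, so the crawl cannot continue from there.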

In [6]:
len(URLs), len(seens)

(267, 2091)

#### Strategy 2: Domain limit

In [7]:
URLs = list()
seens = list()
# Structure change: the frontier used to be a list of URL strings; it is now a list of dicts (keys: 'url', 'depth').
URLs.append({'url': 'https://search.naver.com/search.naver?query=%ED%95%9C%EC%86%8C%ED%9D%AC&where=nexearch',
             'depth':0})

# Strategy 2: domain
#       Restricting to blog.naver.com collects only pages served from the blog.
allowedDomain = ['blog.naver.com']

# pop(0): BFS(Queue), pop(-1): DFS(Stack)
while URLs:
    seed = URLs.pop(0)
    seens.append(seed['url'])
    
    # Apply strategy 1 as well: fetch pages up to depth 3.
    if seed['depth'] > 3:
        continue
    
    # Each frontier entry is a dict, so the actual address is under the 'url' key.
    resp = get(seed['url'], headers=headers)
    
    try:
        resp.raise_for_status()
    except HTTPError as e:
        print(e)
        continue

    # Skip responses that are not text/HTML (a missing header counts as non-HTML).
    if re.search('text|html', resp.headers.get('content-type', '')) is None:
        continue
            
    dom = BeautifulSoup(resp.text, 'html.parser')
    
    for link in dom.select('a[href], iframe[src]'):
        url = urljoin(seed['url'], link.attrs['href'] if link.has_attr('href') else link.attrs['src'])
        
        # Keep only http(s) URLs that carry no fragment.
        if len(urlparse(url).fragment) == 0 and urlparse(url).scheme in ['http', 'https']:
            # URL seen? Flatten the frontier (list of dicts) into a list of addresses for the check.
            # Apply strategy 2: the host must be in allowedDomain (blog.naver.com).
            if url not in [u['url'] for u in URLs] and url not in seens and urlparse(url).netloc in allowedDomain:
                # Add to the frontier as a dict, one level deeper than the current page.
                URLs.append({'url':url, 'depth':seed['depth']+1})
            
    print(len(URLs))

13
13
13
13
13
13
13
13
13
13
13
13
13
13
26
53
58
59
88
96
129
131
174
177
178
211
215
214
213
213
216
215
228
233
233
232
231
230
232
232
500 Server Error:  for url: https://blog.naver.com/FILEPATH
230
229
239
291
294
308
308
307
306
307
307
306
306
305
306
305
304
305
304
304
303
304
303
302
303
302
301
404 Client Error:  for url: https://blog.naver.com/prologue/FILEPATH
300
299
298
297
322
345
370
395
394
393
418
443
445
466
472
487
488
487
486
485
484
483
482
481
483
482
481
480
479
478
477
476
478
479
478
477
478
477
476
480
479
478
503
503
503
503
528


KeyboardInterrupt: 

In [8]:
len(URLs), len(seens)

(527, 117)
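
The domain check above compares `netloc` to the allow-list by exact string match. A slightly more forgiving variant is sketched below; the function name `in_allowed_domain` and the `include_subdomains` flag are assumptions for illustration, not part of the notebook. It uses `hostname`, which is lowercased and excludes any port, and can optionally accept subdomains.

```python
from urllib.parse import urlparse

def in_allowed_domain(url, allowed, include_subdomains=False):
    """Return True if the URL's host is one of the allowed domains."""
    host = (urlparse(url).hostname or '').lower()
    for domain in allowed:
        if host == domain:
            return True
        # 'm.blog.naver.com' ends with '.blog.naver.com'; 'notblog.naver.com' does not.
        if include_subdomains and host.endswith('.' + domain):
            return True
    return False

allowed = ['blog.naver.com']
print(in_allowed_domain('https://blog.naver.com/jjunnk', allowed))
print(in_allowed_domain('https://search.naver.com/search.naver', allowed))
print(in_allowed_domain('https://m.blog.naver.com/jjunnk', allowed, include_subdomains=True))
```

Checking for the leading dot keeps a look-alike host such as notblog.naver.com from slipping through a plain `endswith` test.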

In [9]:
URLs[3] # depth is 3, so links found on this page (depth 4) would be skipped by the depth check

{'url': 'https://blog.naver.com/PostList.naver?blogId=ryuri666&categoryNo=31&skinType=&skinId=&from=menu',
 'depth': 3}

In [10]:
seens
# Limiting the crawl to the blog domain and capping the depth made it fetch only from within one service or area.

['https://search.naver.com/search.naver?query=%ED%95%9C%EC%86%8C%ED%9D%AC&where=nexearch',
 'https://blog.naver.com/jjunnk',
 'https://blog.naver.com/borareview',
 'https://blog.naver.com/borareview/222930653320',
 'https://blog.naver.com/borareview/222947756397',
 'https://blog.naver.com/chois909',
 'https://blog.naver.com/chois909/222899860624',
 'https://blog.naver.com/ryuri666',
 'https://blog.naver.com/ryuri666/222948588408',
 'https://blog.naver.com/bbekimha',
 'https://blog.naver.com/bbekimha/222992899748',
 'https://blog.naver.com/bbekimha/222953351736',
 'https://blog.naver.com/hongya931',
 'https://blog.naver.com/hongya931/223013613065',
 'https://blog.naver.com/PostList.naver?blogId=jjunnk&widgetTypeCall=true&directAccess=true',
 'https://blog.naver.com/prologue/PrologueList.naver?blogId=borareview&directAccess=true',
 'https://blog.naver.com/PostView.naver?blogId=borareview&logNo=222930653320&redirect=Dlog&widgetTypeCall=true&directAccess=false',
 'https://blog.naver.com/Po

#### iframe

In [11]:
resp = get('https://blog.naver.com/qkrdmsdhr95/222890376163')
resp.text
# The post text (which should mention 한소희) ought to be here, but it is nowhere in the response.

'\n\n\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html lang="ko">\n<head>\n<meta http-equiv="Pragma" content="no-cache"/>\n<meta http-equiv="Expires" content="-1"/>\n<meta name="robots" content="noindex,follow"/>\n<meta name="referrer" content="always"/>\n<meta http-equiv="content-type" content="text/html;charset=UTF-8"/>\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico?3" />\n<link rel="alternate" type="application/rss+xml" href="https://rss.blog.naver.com/qkrdmsdhr95.xml" title="RSS feed for qkrdmsdhr95 Blog"/>\n<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="https://blog.naver.com/NBlogWlwLayout.naver?blogId=qkrdmsdhr95" />\n\n\n\n\n<title>By Jade : 네이버 블로그</title>\n</head>\n<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-347491577_https.js" char

In [None]:
# That is because the content lives inside an iframe.
# <body>
#     <iframe id="mainFrame" name="mainFrame" allowfullscreen="true" src="/PostView.naver?blogId=chois909&logNo=222899860624&redirect=Dlog&widgetTypeCall=true&directAccess=false" scrolling="auto"  onload="oFramesetTitleController.start(self.frames['mainFrame'], self, sTitle);oFramesetTitleController.onLoadFrame();oFramesetUrlController.start(self.frames['mainFrame']);oFramesetUrlController.onLoadFrame()" allowfullscreen></iframe>
# </body>

# The document inside the iframe starts with its own DOCTYPE (DTD), i.e. it is a separate HTML document.
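
To reach the post body, the iframe's `src` has to be resolved against the outer page URL and fetched as a second request. The sketch below shows only the resolution step, on a hard-coded HTML snippet. It uses the standard-library `html.parser` so that it runs without extra packages; the class name `IframeSrcParser`, the function `iframe_urls`, and the blog id `someblog` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

def iframe_urls(page_url, html):
    parser = IframeSrcParser()
    parser.feed(html)
    # A relative src such as /PostView.naver?... is resolved against the outer page.
    return [urljoin(page_url, src) for src in parser.sources]

outer_html = '''
<html><body>
<iframe id="mainFrame" src="/PostView.naver?blogId=someblog&amp;logNo=123"></iframe>
</body></html>
'''
print(iframe_urls('https://blog.naver.com/someblog/123', outer_html))
```

The crawler cells above already get this behavior by including `iframe[src]` in the selector, which is why PostView.naver URLs show up in `seens`.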

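### Scraping in earnest

- We scrape because we need specific pages within a specific area.
- Scraping vs crawling
- Crawling follows links, so it is what you need in order to index web pages as a whole.
- Scraping is different. It needs only the particular region you want inside a page and nothing else, so the goal differs.
- A crawler runs on its own once it has a seed URL and a strategy, which is why it operates at large scale.
- A scraper targets the specific pages you want, possibly a single page, and you decide where it stops.
- A crawler must detect duplicates, so it keeps a copy of each page and analyzes it.
- A scraper keeps only the regions it needs from, say, a search results page (links, titles) and discards the rest.
- So crawling needs the bot we built with the features above; analysis comes later.
- Scraping requires parsing, because you have to know where the data is, point at it, and select it.
- If you collected data, you scraped; if you downloaded pages and followed their links, you crawled.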

In [None]:
# Now we attach scraping to the crawler.
# Collect text and images together
# Collect images only
# Collect text only
# The emphasis is on the data extraction step.

#### Adding scraping to the domain-limited crawler

In [16]:
!mkdir data
# Create a folder; everything we download goes in here.

In [12]:
URLs = list()
seens = list()
# Structure change: the frontier used to be a list of URL strings; it is now a list of dicts (keys: 'url', 'depth').
URLs.append({'url': 'https://search.naver.com/search.naver?query=%ED%95%9C%EC%86%8C%ED%9D%AC&where=nexearch',
             'depth':0})

# Strategy 2: domain
#       blog.naver.com limits the crawl to blog pages;
#       postfiles.pstatic.net is allowed as well so that image URLs on that host pass the filter.
allowedDomain = ['blog.naver.com', 'postfiles.pstatic.net']

# pop(0): BFS(Queue), pop(-1): DFS(Stack)
while URLs:
    seed = URLs.pop(0)
    seens.append(seed['url'])
    
    # Apply strategy 1 as well: fetch pages up to depth 3.
    if seed['depth'] > 3:
        continue    
    
    # Each frontier entry is a dict, so the actual address is under the 'url' key.
    resp = get(seed['url'], headers=headers)
    
    try:
        resp.raise_for_status()
    except HTTPError as e:
        print(e)
        continue
        
    # Strategy 3: accept text/HTML plus image formats (a missing header counts as neither).
    if re.search('text|html|image|jpeg|png|gif|bmp', resp.headers.get('content-type', '')) is None:
        continue
            
    if re.search('image|jpeg|png|gif|bmp', resp.headers['content-type']):
        # Save the image.
        # https://blog.naver.com/path1/path2/image1234?type=w1 -> take the last segment, 'image1234?type=w1'
        filename = resp.url.split('/')[-1]
        # Drop the characters ?, # and !: 'image1234?type=w1' -> 'image1234type=w1'
        filename = re.sub('[?#!]', '', filename)
        # The extension is the subtype in image/(___)
        ext = re.search(r'image/(\w+);?',resp.headers['content-type']).group(1)
        # Resulting path: ./data/<filename>.<ext>
        with open('./data/'+filename+'.'+ext, 'wb') as fp:
            fp.write(resp.content)
        
    else:
        # TODO: build a filename the same way and save resp.text in 'w' mode.
        dom = BeautifulSoup(resp.text, 'html.parser')

        for link in dom.select('a[href], iframe[src], img[src]'):
            url = urljoin(seed['url'], link.attrs['href'] if link.has_attr('href') else link.attrs['src'])

            if len(urlparse(url).fragment) == 0 and urlparse(url).scheme in ['http', 'https']:
                # URL seen? Flatten the frontier (list of dicts) into a list of addresses for the check.
                # Apply strategy 2: the host must be in allowedDomain.
                if url not in [u['url'] for u in URLs] and url not in seens and urlparse(url).netloc in allowedDomain:
                    # Add to the frontier as a dict, one level deeper than the current page.
                    URLs.append({'url':url, 'depth':seed['depth']+1})

        print(len(URLs))

13
13
13
13
13
13
13
13
13
13
13
13
13
13
69
96
130
157
186
215
248
263
306
339
352
385
407
406
405
405
408
407
420
425
395
421
428
443
454
465
500 Server Error:  for url: https://blog.naver.com/FILEPATH
463
462
514
566
569
583
583
582
581
597
624
638
676
702
721
732
768
781
788
809
834
841
858
867
875
880
882
404 Client Error:  for url: https://blog.naver.com/prologue/FILEPATH
881
880
879
883
923
933
974
973
972
971
1012
1053
1055
1076
1082
1097
1098
1124
1150
1176
1201
1220
1243
1261
1281
1298
1329
1354
1376
1414
1439
1473
1515
1541
1572
1599
1625
1638
1637
1660
1681
1697
1722
1702
1702
1702
1726
1725
1724
1858
2013
2199
2261
2417
2418
2451
2454
2454
2454
2454
2454
2453
2453
2477
2494
2507
2554
2575
2608
2622
2640
2658
2670
2682
2703
2724
2738
2750
2761
2779
2787
2789
2884
2881
2880
2976
3008
3044
3085
3126
3127
3126
3129
3128
3128
3127
3126
3152
3178
3217
3238
3264
3282
3301
3326
3341
3369
3398
3428
3455
3487
3494
3523
3530
3556
3561
3584
3602
3625
3637
3651
3661
3692
3712
3735
3740

In [23]:
re.search(r'image/(\w+);?','text/html;image/png;').group(1)

'png'
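
As the cell above shows, `re.search` finds `image/...` anywhere in the header value, even when the media type is actually `text/html`. A stricter way to build the file name is sketched below under a few assumptions: the helper name `image_filename` is hypothetical, the media type is matched only at the start of the header, the query string is excluded by taking the name from the URL path, and any extension already present in the URL is replaced by the one from the header.

```python
import os
import re
from urllib.parse import urlparse

def image_filename(url, content_type):
    """Return '<name>.<subtype>' for image responses, or None for anything else."""
    match = re.match(r'\s*image/(\w+)', content_type or '')
    if match is None:
        return None
    # Last path segment only; the query string and fragment are not part of .path.
    last_segment = urlparse(url).path.rstrip('/').split('/')[-1] or 'index'
    stem, _old_ext = os.path.splitext(last_segment)
    # Replace anything that is not a letter, digit, dot, underscore or hyphen.
    stem = re.sub(r'[^A-Za-z0-9._-]', '_', stem)
    return f'{stem}.{match.group(1)}'

print(image_filename('https://postfiles.pstatic.net/a/b/photo1.jpg?type=w966', 'image/jpeg'))
print(image_filename('https://example.com/img/cover', 'image/png; charset=binary'))
print(image_filename('https://example.com/', 'text/html;charset=utf-8'))
```

Two different URLs can still map to the same name, so a real crawler would probably want to add a counter or a hash of the URL.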

#### Webtoon crawling + scraping (focused crawling)

In [29]:
!mkdir webtoon

In [32]:
# There are many places where a strategy has to be chosen.
# Here: a specific domain, specific page areas, no depth limit.
URLs = ['https://comic.naver.com/webtoon']
visited = list()

while URLs:
    seed = URLs.pop(0) # Queue
    visited.append(seed)
    
    resp = get(seed, headers=headers)
    
    # Error handling (see the cells above); here we keep it to a simple status check.
    if resp.status_code != 200:
        continue
        
    if re.search('image', resp.headers['content-type']):
        filename = resp.url.split('/')[-1]
        filename = re.sub('[?#!]', '', filename)
        ext = re.search(r'image/(\w+);?',resp.headers['content-type']).group(1)
        with open('./webtoon/'+filename+'.'+ext, 'wb') as fp:
            fp.write(resp.content)
        
    if re.search('html', resp.headers['content-type']): 
        dom = BeautifulSoup(resp.text, 'html.parser')
        # Area limit 1 (webtoon home)
        for a in dom.select('ul[class$="R52q0"] a[href^="/webtoon/"]'):
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
                
        # Area limit 2 (episode list of one webtoon)
        for a in dom.select('li[class$="M8zq4"] > a[href^="/webtoon/"]'):
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
                
        # Area limit 3 (image list of one episode of one webtoon)
        for img in dom.select('img[id^=content_image_]'):
            # Read src from the img element itself (the original read it from the stale loop variable a).
            nurl = urljoin(seed, img.attrs['src'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
                
        

In [None]:
# This cell raised an error. We will come back to it when we cover DHTML.

In [None]:
# The site is set up so that webtoons cannot be fetched this way, so let's crawl news instead.

In [33]:
!mkdir news_naver

In [13]:
# There are many places where a strategy has to be chosen.
# Here: a specific domain, specific page areas, no depth limit.
URLs = ['https://news.naver.com/']
visited = list()

while URLs:
    seed = URLs.pop(0) # Queue
    visited.append(seed)
    
    resp = get(seed, headers=headers)
    
    # Error handling (see the cells above); here we keep it to a simple status check.
    if resp.status_code != 200:
        continue
        
    if re.search('image', resp.headers['content-type']):
        filename = resp.url.split('/')[-1]
        filename = re.sub('[?#!= ]', '', filename)
        ext = re.search(r'image/(\w+);?',resp.headers['content-type']).group(1)
        with open('./news_naver/'+filename+'.'+ext, 'wb') as fp:
            fp.write(resp.content)
        
    if re.search('html', resp.headers['content-type']): 
        dom = BeautifulSoup(resp.text, 'html.parser')
        # Area limit 1 (news categories)
        for a in dom.select('[role=menu] a')[1:7]:
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
                
        # Area limit 2 (article list inside one news category)
        for a in dom.select('a.cluster_text_headline'):
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
                
        # Area limit 3 (a single news article)
        if dom.select_one('#contents'):
            # Save the article text to a file.
            filename = resp.url.split('/')[-1]
            filename = re.sub('[?#!= ]', '', filename)
            with open('./news_naver/'+filename+'.txt', 'w', encoding='utf8') as fp:
                fp.write(dom.select_one('#contents').get_text().strip())
            
            # Queue the images inside the article body; the image branch above saves them.
            for img in dom.select('#contents img[src]'):
                nurl = urljoin(seed, img.attrs['src'])
                if nurl not in URLs and nurl not in visited:
                    URLs.append(nurl)
        print(len(URLs))

6
49
82
125
152
179
212
213
214
215
216
217
218
217
218
217
218
219
218
219
220
221
220
221
222
221
222
221
222
223
222
223
222
223
222
223
224
223
224
223
222
223
222
223
224
223
222
223
222
221
220
219
218
217
218
219
218
217
216
215
214
213
214
215
214
215
214
213
212
211
210
209
208
207
206
205
204
205
204
203
202
201
200
199
198
197
196
195
194
193
194
193
192
191
190
189
188
189
188
187
186
185
184
183
182
181
180
179
178
177
176
175
174
175
174
173
172
171
172
171
170
169
168
167
166
165
164
163
162
163
162
161
160
159
160
159
160
159
158
157
156
155
154
153
152
151
150
149
148
149
148
147
146
145
144
143
142
141
142
141
140
141
140
141
140
139
138
137
136
135
134
133
132
131
130
129
128
127
126
125
124
125
124
123
122
121
122
121
120
119
118
117
116
115
114
113
112
111
110
109
108
107
106
105
104
103
102
101
102
101
100
99
98
97
96
97
96
95
94


In [None]:
# Make sure the CSS selectors do not overlap.
# Why?
# Because overlapping selectors can keep finding the same links and repeat the same work.

In [40]:
url = 'https://news.naver.com/'
resp = get(url, headers=headers)
resp.headers['content-type']

'text/html;charset=utf-8'

#### Page Rank

In [None]:
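# Suppose we have attached a database to the crawler.
# Where is the most valuable place to look for information?
# This is the question behind Google's PageRank algorithm.
# It quantifies hyperlinks.
# outbound link -> a link going out of a page
# inbound link -> a link coming into a page
# PageRank -> deals with weight and influence; it assigns a rank according to who provides the more important information.
# There is also a related algorithm called TextRank.
# To compute it we need to know who links to whom.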
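
To make the idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank on a tiny made-up link graph. It is not code from this class: the function name `pagerank`, the damping factor 0.85, the fixed iteration count, and the page names are all assumptions chosen for illustration. Pages with no outbound links spread their rank evenly over every page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (outbound links)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outbound link passes on an equal share of this page's rank.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

graph = {
    'home': ['politics', 'economy'],
    'politics': ['home', 'article1'],
    'economy': ['home', 'article1'],
    'article1': ['home'],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f'{page:10s} {score:.3f}')
```

In this graph `home` has the most inbound links and ends up with the highest score. Feeding the function real data would require the crawler to record, for each visited page, the list of URLs it links to.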

In [42]:
visited[:10]

['https://news.naver.com/',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=100',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=101',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=102',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=103',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=105',
 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=104',
 'https://n.news.naver.com/mnews/article/421/0006689688?sid=100',
 'https://n.news.naver.com/mnews/article/011/0004168186?sid=100',
 'https://n.news.naver.com/mnews/article/018/0005444484?sid=100']

In [43]:
urlparse('https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=100')

ParseResult(scheme='https', netloc='news.naver.com', path='/main/main.naver', params='', query='mode=LSD&mid=shm&sid1=100', fragment='')
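
The `query` component can be broken down further with `urllib.parse.parse_qs`, which may help when a strategy needs to tell pages apart by a parameter such as `sid1`. A small sketch using the URL from the cell above (reading `sid1` as a category id is an assumption based on the URLs in `visited`):

```python
from urllib.parse import urlparse, parse_qs

url = 'https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=100'
parts = urlparse(url)
query = parse_qs(parts.query)   # every value is a list, because a key may repeat

print(query)
print(query['sid1'][0])
```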

#### Daum News example

In [None]:
# Try the same thing on Daum.

In [41]:
!mkdir news_daum

In [10]:
URLs = ['https://news.daum.net/']
visited = list()

while URLs:
    print('-'*20)
    seed = URLs.pop(0)
    print('set seed')
    visited.append(seed)
    print('upload visited')
    
    resp = get(seed, headers=headers)
    
    try:
        resp.raise_for_status()
    except HTTPError as e:
        print(e)
        continue
        
    if re.search('image', resp.headers['content-type']):
        filename = resp.url.split('/')[-1]
        filename = re.sub('[?#!= ]', '', filename)
        ext = re.search(r'image/(\w+);?',resp.headers['content-type']).group(1)
        with open('./news_daum/'+filename+'.'+ext, 'wb') as fp:
            fp.write(resp.content)
        print('image download')
        
    print('next step')
        
    if re.search('html', resp.headers['content-type']):
        dom = BeautifulSoup(resp.text, 'html.parser')
        
        print('start news text')
        
        for a in dom.select('ul[class=gnb_comm] a')[1:7]:
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
        print('news home')
                
        for a in dom.select('a.link_txt'):
            nurl = urljoin(seed, a.attrs['href'])
            if nurl not in URLs and nurl not in visited:
                URLs.append(nurl)
        print('news_cat_one')
                
        if dom.select('#mArticle [data-cloud="article_body"]'):
            filename = resp.url.split('/')[-1]
            filename = re.sub('[?#!= ]', '', filename)
            with open('./news_daum/'+filename+'.txt', 'w', encoding='utf8') as fp:
                fp.write(dom.select_one('#mArticle [data-cloud="article_body"]').get_text().strip())
            
            for img in dom.select('#mArticle [data-cloud="article_body"] img[src]'):
                nurl = urljoin(seed, img.attrs['src'])
                if nurl not in URLs and nurl not in visited:
                    URLs.append(nurl)
        print(len(URLs))
        print('-'*20)

--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
44
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
64
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
83
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
105
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
120
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
141
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
163
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
181
--------------------
--------------------
set seed
upload visited
next s

next step
start news text
news home
news_cat_one
1896
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1897
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1897
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1905
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1914
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1914
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1915
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1915
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

news home
news_cat_one
2110
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2110
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2110
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2116
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2126
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2135
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2139
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2150
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2151
------------

next step
start news text
news home
news_cat_one
2277
--------------------
--------------------
set seed
upload visited
image download
next step
--------------------
set seed
upload visited
image download
next step
--------------------
set seed
upload visited
image download
next step
--------------------
set seed
upload visited
image download
next step
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2272
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2271
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2270
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2269
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2268
--------------------
--------------------
set seed
upload visited


next step
start news text
news home
news_cat_one
2200
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2199
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2198
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2197
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2196
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2195
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2194
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2193
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

start news text
news home
news_cat_one
2127
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2126
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2125
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2124
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2123
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2122
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2121
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2120
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2

next step
start news text
news home
news_cat_one
2052
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2051
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2050
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2049
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2048
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2047
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2046
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2045
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

1983
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1982
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1981
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1980
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1979
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1978
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1977
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1976
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1975
--------------------
--------------

next step
start news text
news home
news_cat_one
2034
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2033
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2032
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2031
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2030
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2029
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2028
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
2027
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

next step
start news text
news home
news_cat_one
1965
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1964
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1963
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1962
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1961
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1960
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1959
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1958
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

next step
start news text
news home
news_cat_one
1896
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1895
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1894
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1893
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1892
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1891
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1890
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news_cat_one
1889
--------------------
--------------------
set seed
upload visited
next step
start news text
news home
news

KeyboardInterrupt: 

In [7]:
URLs[-1]

'https://news.daum.net/photo/'

In [None]:
# Why does URLs shrink as time goes on in the output above?
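
One plausible answer can be seen in a toy simulation. Every visit removes one URL from the frontier, and only URLs never seen before are added. Category pages add many article links, articles add at most a few images, and images add nothing, so once the categories are exhausted the frontier can only shrink. The site layout below is entirely made up for this sketch (`links_of` and its page names are hypothetical).

```python
from collections import deque

def links_of(url):
    """Hypothetical site: home -> 2 categories -> 3 articles each -> 1 image each."""
    if url == 'home':
        return ['cat-1', 'cat-2']
    if url.startswith('cat-'):
        n = url.split('-')[1]
        return [f'art-{n}-{i}' for i in range(3)]
    if url.startswith('art-'):
        return [url.replace('art-', 'img-') + '.jpg']
    return []                        # images link to nothing

frontier = deque(['home'])
seen = {'home'}
sizes = []
while frontier:
    url = frontier.popleft()         # every visit removes one URL ...
    for nxt in links_of(url):        # ... and adds only URLs never seen before
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)
    sizes.append(len(frontier))

print(sizes)
```

The printed sizes rise while category pages are being expanded, stay flat while each article swaps itself for one image, and then fall by one per visit, which matches the shape of the counts in the news crawls above.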