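#### Cautions when crawling
1. Whether the content is protected by intellectual property rights
2. Whether crawling places a burden on the site
3. Whether it violates the site's terms of use
4. Whether it collects users' sensitive information
5. Whether the collected content is used under appropriate usage standards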

#### robots.txt

A convention for controlling access by bots such as crawlers.

https://www.google.co.kr/robots.txt

Disallow: /search

Allow: /search/about
Allow: /search/static
...
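
Rules like the ones above can be checked programmatically. A minimal sketch using the standard library's `urllib.robotparser`, assuming the rules apply to every user agent (the `User-agent: *` line is added here for illustration) and listing the `Allow` line first, since Python's parser applies the first rule that matches:

```python
from urllib.robotparser import RobotFileParser

# Rules modeled on the excerpt above; "User-agent: *" is an assumption for this sketch.
rules = """
User-agent: *
Allow: /search/about
Disallow: /search
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse in-memory lines instead of fetching over the network

def allowed(url, agent="python bot"):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return rp.can_fetch(agent, url)

print(allowed("https://www.google.co.kr/search?q=test"))
print(allowed("https://www.google.co.kr/search/about"))
```

To check a live site instead, `rp.set_url(...)` followed by `rp.read()` downloads the file before `can_fetch` is called.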

#### Checking the web technologies in use

In [12]:
import builtwith

print(builtwith.parse('https://www.google.com'))  # not treated as a normal (browser) request
print(builtwith.parse('https://www.naver.com'))
print(builtwith.parse('https://www.wordpress.com'))

{'web-servers': ['Google Web Server']}
{}
{'web-servers': ['Nginx'], 'font-scripts': ['Google Font API'], 'ecommerce': ['WooCommerce'], 'cms': ['WordPress'], 'programming-languages': ['PHP'], 'blogs': ['PHP', 'WordPress']}


#### Checking the website owner

In [13]:
import whois

print(whois.whois("naver.com"))
print(whois.whois("sogang.ac.kr"))

{
  "domain_name": [
    "NAVER.COM",
    "naver.com"
  ],
  "registrar": "Gabia, Inc.",
  "whois_server": "whois.gabia.com",
  "referral_url": null,
  "updated_date": [
    "2016-08-05 06:37:57",
    "2018-02-28 11:27:15"
  ],
  "creation_date": [
    "1997-09-12 04:00:00",
    "1997-09-12 00:00:00"
  ],
  "expiration_date": [
    "2023-09-11 04:00:00",
    "2023-09-11 00:00:00"
  ],
  "name_servers": [
    "NS1.NAVER.COM",
    "NS2.NAVER.COM",
    "ns1.naver.com",
    "ns2.naver.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "ok https://icann.org/epp#ok"
  ],
  "emails": [
    "white.4818@navercorp.com",
    "dl_ssl@navercorp.com",
    "abuse@gabia.com"
  ],
  "dnssec": "unsigned",
  "name": "NAVER Corp.",
  "org": "NAVER Corp.",
  "address": "6 Buljung-ro, Bundang-gu, Seongnam

#### Downloading a web page

In [35]:
import urllib
from urllib.request import urlopen

![](https://user-images.githubusercontent.com/38183218/42746165-7451e354-8911-11e8-93de-8822f34665ca.PNG)
![](https://user-images.githubusercontent.com/38183218/42746166-7477cb14-8911-11e8-81da-1a53595e2c68.PNG)

In [36]:
resp = urlopen("https://python.org")  # request: open the URL and fetch the page content

In [37]:
html = resp.read()  # read the HTML in the response, as bytes

In [38]:
html.decode('utf-8')
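
Hard-coding `'utf-8'` works for python.org but not for every site. A minimal sketch of choosing the codec from the `Content-Type` header instead; the `decode_body` helper and the UTF-8 fallback are assumptions for illustration. With a real response, `resp.headers` is the same kind of message object as the stand-in used here.

```python
from email.message import Message

def decode_body(raw, headers, fallback="utf-8"):
    """Decode response bytes using the charset declared in the headers."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")

# Stand-in for resp.headers so the sketch runs without network access.
headers = Message()
headers["Content-Type"] = "text/html; charset=euc-kr"

raw = "한글".encode("euc-kr")
print(decode_body(raw, headers))
```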



![](https://user-images.githubusercontent.com/38183218/42746167-749c5eb6-8911-11e8-959b-1ab9cbdc10e1.PNG)
![](https://user-images.githubusercontent.com/38183218/42746168-74d641bc-8911-11e8-8239-711beb7ec91a.PNG)

In [24]:
# URLs that include a query string (URLs sent to a search endpoint)
# Result: HTTP Error 403: Forbidden
# resp = urlopen("https://www.google.co.kr/search?q=%EC%B9%B4%EC%9D%B4%EC%8A%A4%ED%8A%B8+%EB%A7%90%EB%AD%89%EC%B9%98&sa=X&ved=0ahUKEwjSj_Ky4KLcAhWRFogKHZZ0C_kQ1QIIuQEoBg&biw=1536&bih=759")
# resp = urlopen("https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98")

![](https://user-images.githubusercontent.com/38183218/42746171-754a7096-8911-11e8-9c69-92cae978ca05.PNG)

In [39]:
from urllib.error import HTTPError

try:
    urlopen("https://www.google.co.kr/search?q=%EC%B9%B4%EC%9D%B4%EC%8A%A4%ED%8A%B8+%EB%A7%90%EB%AD%89%EC%B9%98&sa=X&ved=0ahUKEwjSj_Ky4KLcAhWRFogKHZZ0C_kQ1QIIuQEoBg&biw=1536&bih=759")
except HTTPError as e:
    print(e.code, e.reason, e.headers)

403 Forbidden Content-Type: text/html; charset=UTF-8
Date: Tue, 17 Jul 2018 04:15:54 GMT
Server: gws
Cache-Control: private
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close




#### Example
![](https://user-images.githubusercontent.com/38183218/42746175-75ca183c-8911-11e8-8042-6d23d0c85304.PNG)

In [40]:
from urllib.request import Request

In [41]:
def download(url, agent="python bot", num_retries=2):
    headers = {'User-agent': agent}
    req = Request(url, headers=headers)

    try:
        resp = urlopen(req)
    except HTTPError as e:
        resp = None
        print(e.code, e.reason, e.headers)

        # Retry only on 5xx server errors, keeping the same User-agent
        if 500 <= e.code < 600 and num_retries > 0:
            return download(url, agent=agent, num_retries=num_retries - 1)  # recursive retry

    return resp
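
The retry logic can also be written as a loop rather than recursion. A minimal sketch, assuming a caller-supplied `fetch` callable that returns a `(status, body)` pair; `fetch`, `fetch_with_retries`, and the fixed `delay` are hypothetical names for illustration, not part of `urllib`:

```python
import time

def fetch_with_retries(fetch, url, num_retries=2, delay=0.0):
    """Call fetch(url) and retry only while the status is a 5xx server error."""
    for attempt in range(num_retries + 1):
        status, body = fetch(url)
        if not 500 <= status < 600:
            break              # success or client error: retrying will not help
        if attempt < num_retries:
            time.sleep(delay)  # pause before the next attempt
    return status, body

# Fake fetcher that fails twice with 503, then succeeds.
calls = []
def flaky(url):
    calls.append(url)
    return (503, None) if len(calls) < 3 else (200, "<html></html>")

print(fetch_with_retries(flaky, "https://example.com"))
```

A 4xx status such as 403 is returned after a single attempt, because repeating the same request is unlikely to change the answer.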
                 

In [42]:
url = "https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98"
download(url)

403 Forbidden Date: Tue, 17 Jul 2018 04:16:00 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Set-Cookie: page_uid=T01WlspVuFKssthPfShssssssbK-025980; path=/; domain=.naver.com
Set-Cookie: _naver_usersession_=GON10NJDhT+gPruCHVUJqA==; path=/; expires=Tue, 17-Jul-18 04:21:00 GMT; domain=.naver.com
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; report=/p/er/post/xss
Cache-Control: no-cache, no-store, must-revalidate, max-age=0
Pragma: no-cache
Vary: Accept-Encoding




Setting the User-agent header to a regular browser's value resolves the 403 error here.
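
To see what the server receives, the request can be built and inspected without sending it. Without an explicit header, urllib's default opener identifies itself as `Python-urllib/<version>`. A minimal sketch; nothing here contacts the network:

```python
from urllib.request import Request

user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")
req = Request("https://search.naver.com/search.naver",
              headers={"User-Agent": user_agent})

# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.header_items())
print(req.get_header("User-agent") == user_agent)
```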

In [43]:
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
resp = download( url, agent = user)
type(resp)

http.client.HTTPResponse

In [44]:
resp.read().decode("utf-8")

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

![](https://user-images.githubusercontent.com/38183218/42746176-75f20716-8911-11e8-8732-ac8ea6479de4.PNG)

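- Using requests
    - Much higher-level and easier to use, with shorter code
    - GET / POST
        - GET: parameters are appended to the URL; convenient, but limited in data types, encoding, and so on
        - POST: parameters are sent in the request body, separately from the URL; used, for example, when sending large amounts of data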
        
http://docs.python-requests.org/en/master/user/quickstart/
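
The difference between the two methods is easiest to see in where the parameters end up. A minimal sketch using the standard library rather than requests, so that it runs without sending anything; `http://httpbin.org/post` is only a placeholder endpoint here:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"where": "nexearch", "query": "말뭉치"}
encoded = urlencode(params)  # percent-encodes values as UTF-8

# GET: parameters travel in the URL itself.
get_req = Request("https://search.naver.com/search.naver?" + encoded)

# POST: parameters travel in the request body, as bytes.
post_req = Request("http://httpbin.org/post", data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```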

In [45]:
import requests

In [46]:
def download2(url, agent="python bot", num_retries=2):
    headers = {'User-agent': agent}
    resp = requests.request("get", url, headers=headers)

    # Retry on 5xx server errors (500 included), keeping the same User-agent
    if 500 <= resp.status_code < 600 and num_retries > 0:
        print(resp.status_code, resp.reason)
        return download2(url, agent=agent, num_retries=num_retries - 1)

    return resp

In [47]:
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
html = download2( url, agent = user)

In [48]:
html.text

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

In [49]:
html.encoding

'UTF-8'

![](https://user-images.githubusercontent.com/38183218/42746173-756e4944-8911-11e8-91e8-81bd0a6920fd.PNG)

In [50]:
from urllib import parse

In [51]:
parse.quote("한글")  # percent-encodes the UTF-8 bytes so Korean text can go into a URL query

'%ED%95%9C%EA%B8%80'

In [52]:
url = "https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8\
&query="
url = url + parse.quote("말뭉치")
url

'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98'
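
The encoding is reversible, and decoding a query string is a quick way to confirm that nothing was lost. A minimal sketch using only `urllib.parse`; `search_url` is a name chosen for this example:

```python
from urllib import parse

base = "https://search.naver.com/search.naver"
query = parse.urlencode({"where": "nexearch", "ie": "utf8", "query": "말뭉치"})
search_url = base + "?" + query

# Decode the query string back into a dict of lists.
decoded = parse.parse_qs(parse.urlparse(search_url).query)
print(search_url)
print(decoded["query"], parse.unquote("%ED%95%9C%EA%B8%80"))
```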

In [53]:
user

'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

In [54]:
html = download2(url, agent = user)
html.text

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

#### Example: POST with JSON

In [55]:
import json

In [56]:
params = {
        "where":"nexearch",
        "sm":"top_hty",
        "fbm":1,
        "ie":"utf8",
        "query":parse.quote("고려대")
         }

In [57]:
params

{'fbm': 1,
 'ie': 'utf8',
 'query': '%EA%B3%A0%EB%A0%A4%EB%8C%80',
 'sm': 'top_hty',
 'where': 'nexearch'}

In [58]:
jsonParams = json.dumps(params)
jsonParams  # serialized to a str

'{"where": "nexearch", "sm": "top_hty", "fbm": 1, "ie": "utf8", "query": "%EA%B3%A0%EB%A0%A4%EB%8C%80"}'

In [59]:
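def downloadJson(url, agent="python bot", num_retries=2):
    headers = {'User-agent': agent}
    # jsonParams is already a JSON string, so json= serializes it a second time;
    # pass the params dict instead to send a plain JSON object
    resp = requests.post(url, json=jsonParams, headers=headers)

    # Retry on 5xx server errors (500 included), keeping the same User-agent
    if 500 <= resp.status_code < 600 and num_retries > 0:
        print(resp.status_code, resp.reason)
        return downloadJson(url, agent=agent, num_retries=num_retries - 1)

    return resp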

In [60]:
html = downloadJson("http://httpbin.org/post",agent=user)
html.text

'{"args":{},"data":"\\"{\\\\\\"where\\\\\\": \\\\\\"nexearch\\\\\\", \\\\\\"sm\\\\\\": \\\\\\"top_hty\\\\\\", \\\\\\"fbm\\\\\\": 1, \\\\\\"ie\\\\\\": \\\\\\"utf8\\\\\\", \\\\\\"query\\\\\\": \\\\\\"%EA%B3%A0%EB%A0%A4%EB%8C%80\\\\\\"}\\"","files":{},"form":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Content-Length":"122","Content-Type":"application/json","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"},"json":"{\\"where\\": \\"nexearch\\", \\"sm\\": \\"top_hty\\", \\"fbm\\": 1, \\"ie\\": \\"utf8\\", \\"query\\": \\"%EA%B3%A0%EB%A0%A4%EB%8C%80\\"}","origin":"163.152.3.137","url":"http://httpbin.org/post"}\n'
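
The heavily escaped `data` field above is what double serialization looks like: `jsonParams` is already a JSON string, and the `json=` argument of `requests.post` serializes its value again. A minimal sketch of the effect using only the `json` module; `payload` is a shortened stand-in for `params`:

```python
import json

payload = {"where": "nexearch", "fbm": 1}

once = json.dumps(payload)  # what a JSON body normally looks like
twice = json.dumps(once)    # serializing the already-serialized string again

print(once)
print(twice)
print(type(json.loads(once)).__name__, type(json.loads(twice)).__name__)
```

Decoding `twice` yields a string rather than an object, which is why the server echoes the `json` field back as a quoted string.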

#### GET method

In [61]:
def downloadGet(url, agent="python bot", num_retries=2):
    headers = {'User-agent': agent}
    resp = requests.get(url, headers=headers)

    # Retry on 5xx server errors (500 included), keeping the same User-agent
    if 500 <= resp.status_code < 600 and num_retries > 0:
        print(resp.status_code, resp.reason)
        return downloadGet(url, agent=agent, num_retries=num_retries - 1)

    return resp

In [62]:
downloadGet(url, agent=user).text[:1000]

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

### Crawling

![](https://user-images.githubusercontent.com/38183218/42746174-759b921e-8911-11e8-9370-972934728d01.PNG)

#### BeautifulSoup

BeautifulSoup (BS) is a parsing library, while Scrapy is a web-spider (web scraping) framework.

In [63]:
from bs4 import BeautifulSoup

![](https://user-images.githubusercontent.com/38183218/42746169-74fdd07e-8911-11e8-9dd1-90f8a57783d6.PNG)

In [64]:
html = """
    <html>
        <head></head>
        <body>
            <div id="wrap">
                <p class="content">
                    <a href="_blank">Click</a>
                </p>
            </div>
        </body>
    </html>
"""

In [65]:
doc = BeautifulSoup(html, "lxml")  # parse into a DOM tree

In [66]:
doc.div

<div id="wrap">
<p class="content">
<a href="_blank">Click</a>
</p>
</div>

In [67]:
doc.a["href"] 

'_blank'

In [68]:
doc.a.attrs

{'href': '_blank'}

![](https://user-images.githubusercontent.com/38183218/42746170-7524bc20-8911-11e8-9b98-1fe3ff01a525.PNG)

In [126]:
html

'\n    <html>\n        <head></head>\n        <body>\n            <div id="wrap">\n                <p class="content">\n                    <a href="_blank">Click</a>\n                </p>\n            </div>\n        </body>\n    </html>\n    \n'

In [16]:
soup = BeautifulSoup(downloadGet(url, agent=user).text, "lxml")

In [129]:
type(soup)

bs4.BeautifulSoup

In [130]:
soup.find('a')

<a href="#lnb"><span>메뉴 영역으로 바로가기</span></a>

In [135]:
a_list = soup.find_all('a')

In [136]:
len(a_list)

384

In [139]:
for row in a_list:
    print(row.attrs["href"])

#lnb
#content
http://www.naver.com
#
#
https://help.naver.com/support/alias/search/word/word_16.naver
#
#
https://nid.naver.com/nidlogin.login?url=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_hty%26fbm%3D1%26ie%3Dutf8%26query%3D%25EB%25A7%2590%25EB%25AD%2589%25EC%25B9%2598
https://help.naver.com/support/alias/search/word/word_16.naver
https://help.naver.com/support/alias/search/word/word_21.naver
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/w
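
Note that `row.attrs["href"]` raises `KeyError` for an `<a>` tag that has no `href`. What `find_all('a')` combined with a safe attribute lookup amounts to can be sketched with the standard library's `html.parser`; `LinkCollector` is a hypothetical helper, not a BeautifulSoup API:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping anchors without href."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")  # None when the attribute is absent
            if href is not None:
                self.links.append(href)

sample = ('<p><a href="#lnb">menu</a><a class="tab">no link</a>'
          '<a href="http://www.naver.com">home</a></p>')
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

With BeautifulSoup itself, `row.get("href")` returns `None` instead of raising.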

In [69]:
params = {
    }

url = "http://www.ppomppu.co.kr/zboard/zboard.php?id=freeboard"

In [70]:
html = downloadGet(url, agent=user)

In [71]:
html.encoding

'euc-kr'

In [72]:
type(html.text)
# decoding needed

str

In [73]:
html.text  # garbled

'\x1f�\x08\x00\x00\x00\x00\x00\x00\x03�}�\x7f\x1b蘭偵惰s���P�]◁��\x1d�\x0b\t��\x0b�Bzzz�|�\x19IckbiF�\x19�q!�\x7fq\t�\x00%홴d�O���\x0c56���\x04Zh�\x108iR89M\x02�鞋k���F�$K~L�\x13�cif玆�^{獐^{蔗g\x0e<p寮�#�|�\t\x12U�1縝�\x1f\x7f孩���;�욹\x1et:\x0f\x1d9D�厚G�y��\x1d.rDb\x05�WxQ`cN�\x13�2��*J▤�\x1c\x1a\x1ar\x0cy\x1d‡�<誌�\x04�rca酵]1�tD�\x08\x13h=�w�\x0f�F�O�SX�熢鼈����9(\n\n\'(�#�\t�!a���Q�\x13�\x13���p��dN�s��}@b��\x1e�\x17�\x12�ⓑ�8;훗遞F<��\x18W�EX�k��w狐\x1f�\x1f\x14�\tV�C1s젝��s�~�1J\tl��3lR����移�\x0e?創s?\'\x07\x0f;*H#��5X�D?暉暈+\x08\x07멜!Q��&*�hm=���^F8�sC\tQRL�C|D��#� \x1f咽徹>�\x0b��l�.��\x18�\x07群G�p-���/%eN㈜Y㉦\x7f���\x11���9�\x1e邵\x19�#\t�`\x0c\x1f₃�v�\x17U)!�\tNR���每쳬�~놉L~�H�u\x13IGXt\x0cHNJ(;y�\x0c��D汲rw�|�.GB鰲"�q\x07m� ����Z\x01�Wb\\\x00/\x10;Q/�oΞ鑛/\x15\x0e8�\x1bh�\x031^\x18 Q�懿�t秘��麟\x17퉜\x18�&x\x19\x14�＃<玟폰莫�YQ\x11\x1fy\x01���H\\�o�綺�(�)6����m�k�\x02�.�\x0c�eJ��FkrC�\x04S�6p�IK:P�A?痲配奄말5藝6��du!頭��}拾\'�m뽁`\\�Q�\x07�rB��;�9�栓��Ja遊�-]�賑��x��\x01�z]]�n�\x0e\x08�?o���\t��\x14qF~\x1d퀭�Y�V�\r\x05�4/x랜�?

In [150]:
type(html.content)

bytes
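
Because `html.content` holds the raw bytes, the codec can be chosen explicitly. A minimal sketch of why that choice matters, using an in-memory EUC-KR byte string instead of the real page:

```python
raw = "한글".encode("euc-kr")   # bytes as an EUC-KR server would send them

wrong = raw.decode("utf-8", errors="replace")  # wrong codec: replacement characters
right = raw.decode("euc-kr")                   # codec matching the page

print(repr(wrong))
print(right)
```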

In [151]:
# let's try the download function built earlier with urllib

In [74]:
html = download(url, agent=user)

In [75]:
type(html)  # HTTPResponse object

http.client.HTTPResponse

In [76]:
html = html.read()
type(html)

bytes

In [77]:
html = html.decode("euc-kr") 

In [78]:
print(type(html))
html

<class 'str'>


'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="author" content="PPOMPPU CO.">\n<meta name="description" content="뽐뿌">\n<meta name="keywords" content="">\n\n\n<!--\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />\n-->\n\n\n<meta property="og:image" content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" />\n\n<meta property="og:site_name" content="뽐뿌" />\n\n<title>뽐뿌 - 자유게시판</title><!--<link href=\'http://fonts.googleapis.com/css?family=Noto+Sans\' rel=\'stylesheet\' type=\'text/css\'>-->\n\n<link rel="stylesheet" type="text/css" hr

In [88]:
html2 = html.encode("utf-8").decode("utf-8")  # encode to UTF-8, then decode again

In [89]:
type(html2)
html2

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="author" content="PPOMPPU CO.">\n<meta name="description" content="뽐뿌">\n<meta name="keywords" content="">\n\n\n<!--\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />\n-->\n\n\n<meta property="og:image" content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" />\n\n<meta property="og:site_name" content="뽐뿌" />\n\n<title>뽐뿌 - 자유게시판</title><!--<link href=\'http://fonts.googleapis.com/css?family=Noto+Sans\' rel=\'stylesheet\' type=\'text/css\'>-->\n\n<link rel="stylesheet" type="text/css" hr

In [84]:
text1 = BeautifulSoup(html, "lxml")
text1.contents  # BeautifulSoup handles encoding/decoding of the previously garbled document automatically

['html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"',
 <html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="text/javascript" http-equiv="Content-Script-Type"/>
 <meta content="text/css" http-equiv="Content-Style-Type"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="PPOMPPU CO." name="author"/>
 <meta content="뽐뿌" name="description"/>
 <meta content="" name="keywords"/>
 <!--
 <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />
 -->
 <meta content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" property="og:image"/>
 <meta content="뽐뿌" property="og:site_name"/>
 <title>뽐뿌 - 자유게시판</title><!--<link href='http://fonts.googleapis.com/css?family=Noto+Sans' rel='stylesheet' type='text/css'>-->
 <link href="//www.ppomppu.co.kr/css/style.css?v=2018070218" rel="styles

In [85]:
a_lst1 = text1.find_all('a')

In [86]:
for row in a_lst1:
    print(row.attrs)

{'href': 'http://s.ppomppu.co.kr/?idno=ad&ppomno=&ad_no=10307&target=aHR0cDovL3R4LnRoZWxpbmUxMy5jb20vY2xpY2s/YWQ9NjE4NjE=&encode=on', 'target': '_blank'}
{'rel': ['#tab1-contents'], 'class': ['tab']}
{'rel': ['#tab2-contents'], 'class': ['tab']}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=event'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=buy'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=help'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=freeboard'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=etc_info'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=free_picture'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=news2'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=review'}
{'href': 'http://www.ppomppu.co.kr/recent_main_article.php?type=market'}
{'href': 'http://www.ppomppu.co.kr/myinfo/env.php?cmd=env', 'target': '_blank'}
{'href': 'h

In [90]:
text2 = BeautifulSoup(html2, "lxml")
text2.contents

['html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"',
 <html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="text/javascript" http-equiv="Content-Script-Type"/>
 <meta content="text/css" http-equiv="Content-Style-Type"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="PPOMPPU CO." name="author"/>
 <meta content="뽐뿌" name="description"/>
 <meta content="" name="keywords"/>
 <!--
 <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />
 -->
 <meta content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" property="og:image"/>
 <meta content="뽐뿌" property="og:site_name"/>
 <title>뽐뿌 - 자유게시판</title><!--<link href='http://fonts.googleapis.com/css?family=Noto+Sans' rel='stylesheet' type='text/css'>-->
 <link href="//www.ppomppu.co.kr/css/style.css?v=2018070218" rel="styles

#### CSS selector

In [96]:
rows = text1.select("li > div")
print(len(rows))

9


In [97]:
for row in rows:
    print(row)

<div class="sub-menu">
<ul>
<li class="strong"><i class="caret"></i><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu">뽐뿌게시판</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu2">휴대폰뽐뿌</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu4">해외뽐뿌</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu3">MD뽐뿌</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu5">오프라인뽐뿌</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu7">뷰티뽐뿌</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu6">업체뽐뿌</a></li>
<li class="divider"><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=pmarket">업체게시판</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=pmarket2">휴대폰업체</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=pmarket3">통신업체</a></li>
<li><a href="http://www.ppomppu.co.kr/zboard/zboard.php?id=card_market">카드업체</a></li>


#### Collecting Links


![46](https://user-images.githubusercontent.com/38183218/42806141-a9c51362-89e8-11e8-8e86-0aaceaa9da69.PNG)

In [98]:
def getUrls(url, params=None, num_retries=2):
    headers = {"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
    resp = requests.get(url, params=params, headers=headers)
    
    if 500<=resp.status_code < 600 and num_retries>0:
        print("error: ", resp.status_code, resp.reason)
        return getUrls(url=url, params=params, num_retries=num_retries-1)
    
    html = BeautifulSoup(resp.content, "lxml")
    links = html.select("h3.r > a")
    
    return [link["href"] for link in links if link.has_attr("href") == True]

In [99]:
seed = "https://www.google.com/search"
params = {
    "q":"한글",
    "ie":"utf-8"
}
urlList = getUrls(seed,params)
print(urlList)

['http://www.hancom.com/downLoad.downPU.do', 'https://www.hancom.com/product/productWindowsMain.do', 'https://namu.wiki/w/%ED%95%9C%EA%B8%80', 'https://namu.wiki/w/%ED%95%9C%EC%BB%B4%EC%98%A4%ED%94%BC%EC%8A%A4%20%ED%95%9C%EA%B8%80', 'https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B8%80', 'https://ko.wikibooks.org/wiki/%ED%95%9C%EA%B5%AD%EC%96%B4_%EC%9E%85%EB%AC%B8/%ED%95%9C%EA%B8%80_%EC%9E%90%EB%AA%A8', 'https://www.korean.go.kr/hangeul/principle/001.html', 'http://www.kocca.kr/cop/bbs/view/B0000137/1833364.do?menuNo=200827&noticevent=Y']


In [101]:
def getUrlsWithSelector(url, params=None, select="a", num_retries=2):
    headers = {"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
    
    resp = requests.get(url, params=params, headers=headers)
    
    if 500<=resp.status_code < 600 and num_retries>0:
        print("error: ", resp.status_code, resp.reason)
        return getUrlsWithSelector(url=url, params=params, select=select, num_retries=num_retries-1)
    
    html = BeautifulSoup(resp.content, "lxml")
    links = html.select(select)
    
    return [link.get("href") for link in links if link.has_attr("href") == True]

In [102]:
def checkUrl(url):
    dest = parse.urlparse(url)
    
    if len(dest.scheme) > 0 and dest.scheme in ['http','https']:
        return True
    else:
        return False
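
`checkUrl` rejects relative links such as `#lnb` or `//www.ppomppu.co.kr/css/style.css`, although many of them point to crawlable pages. A minimal sketch of resolving links against the page URL before filtering by scheme; `normalize` is a hypothetical helper:

```python
from urllib.parse import urljoin, urlparse

def normalize(base, href):
    """Resolve href against base; return None unless the result is http(s)."""
    absolute = urljoin(base, href)
    return absolute if urlparse(absolute).scheme in ("http", "https") else None

base = "http://www.ppomppu.co.kr/zboard/zboard.php?id=freeboard"
for href in ["//www.ppomppu.co.kr/css/style.css", "zboard.php?id=event", "javascript:;"]:
    print(href, "->", normalize(base, href))
```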

In [174]:
seed = "https://www.naver.com"
params = {
    "query":"말뭉치",
    "ie":"utf-8"
}

In [175]:
queue = getUrlsWithSelector(seed,params, "a")
result = []

In [177]:
while queue:
    url = queue.pop()
    while checkUrl(url) is False:
        url = queue.pop()
        
    urlList = getUrls(url)
    
    result.append(url)
    result.extend(urlList)
    
    print(url, len(urlList), end="\n\n")

http://www.navercorp.com/ 0

https://help.naver.com/ 0

https://www.navercorp.com/ko/company/proposalGuide.nhn 0

http://recruit.navercorp.com/naver/recruitMain 0

http://www.navercorp.com/ 0

http://www.naverlabs.com/ 0

http://d2.naver.com/ 0

http://naver.github.io/ 0

https://developers.naver.com/docs/common/openapiguide/#/apilist.md/ 0

http://developers.naver.com 0

https://smartplace.naver.com/ 0

https://sell.storefarm.naver.com/#/home/about 0

http://business.naver.com/service.html 0

http://business.naver.com/guide.html 0

http://www.navercorp.com/ko/service/business.nhn 0

http://www.navercorp.com/ko/service/creators.nhn 0

http://music.naver.com/promotion/clovaspeaker/ticket.nhn 0

http://music.naver.com/promotion/clovaspeaker/ticket.nhn 0

https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%ED%94%84%EB%A1%9C%EC%A0%9D%ED%8A%B8%EA%BD%83 0

https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%ED%94%84%EB%A1%9C%E

http://search.naver.com/search.naver?where=nexearch&query=%ED%97%AC%EA%B8%B0%EC%B6%94%EB%9D%BD&sm=top_lve&ie=utf8 0

http://datalab.naver.com/keyword/realtimeDetail.naver?datetime=2018-07-17T17:44:00&query=%EC%9D%B4%EC%9B%90%ED%9D%AC&where=main 0

http://search.naver.com/search.naver?where=nexearch&query=%EC%9D%B4%EC%9B%90%ED%9D%AC&sm=top_lve&ie=utf8 0

http://datalab.naver.com/keyword/realtimeDetail.naver?datetime=2018-07-17T17:44:00&query=%EC%A4%91%EB%B3%B5&where=main 0

http://search.naver.com/search.naver?where=nexearch&query=%EC%A4%91%EB%B3%B5&sm=top_lve&ie=utf8 0

http://datalab.naver.com/keyword/realtimeDetail.naver?datetime=2018-07-17T17:44:00&query=%EA%B9%80%EB%B3%91%EC%A4%80&where=main 0

http://search.naver.com/search.naver?where=nexearch&query=%EA%B9%80%EB%B3%91%EC%A4%80&sm=top_lve&ie=utf8 0

http://datalab.naver.com/keyword/realtimeDetail.naver?datetime=2018-07-17T17:44:00&query=%EC%A0%9C%ED%97%8C%EC%A0%88&where=main 0

http://search.naver.com/search.naver?where=nexearch&q

IndexError: pop from empty list
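
The `IndexError` comes from the inner `while`, which keeps popping after the queue has run out. A minimal sketch of a loop that avoids this and also skips URLs it has already seen; `crawl` and the injected `get_links` callable are hypothetical (standing in for `getUrls`), and `max_pages` is an added safety limit:

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl that never pops from an empty queue."""
    queue = deque(seeds)
    visited = []
    seen = set(seeds)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if urlparse(url).scheme not in ("http", "https"):
            continue                      # skip "#", "javascript:;", and similar
        visited.append(url)
        for link in get_links(url):
            if link not in seen:          # enqueue each URL only once
                seen.add(link)
                queue.append(link)
    return visited

# Tiny in-memory link graph instead of real HTTP requests.
graph = {"http://a": ["http://b", "#top"], "http://b": ["http://a", "http://c"]}
print(crawl(["http://a"], lambda u: graph.get(u, [])))
```

Checking the queue in the single loop condition and using `continue` for invalid URLs removes the need for a nested `pop`.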

In [178]:
headers = {"User-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
resp = requests.get(seed, params=params, headers=headers)

In [179]:
html = BeautifulSoup(resp.content, "lxml")

In [181]:
html.select("a")

[<a href="#news_cast" onclick="document.getElementById('news_cast2').tabIndex = -1;document.getElementById('news_cast2').focus();return false;"><span>뉴스스탠드 바로가기</span></a>,
 <a href="#themecast" onclick="document.getElementById('themecast').tabIndex = -1;document.getElementById('themecast').focus();return false;"><span>주제별캐스트 바로가기</span></a>,
 <a href="#time_square" onclick="document.getElementById('time_square').tabIndex = -1;document.getElementById('time_square').focus();return false;"><span>타임스퀘어 바로가기</span></a>,
 <a href="#shp_cst" onclick="document.getElementById('shp_cst').tabIndex = -1;document.getElementById('shp_cst').focus();return false;"><span>쇼핑캐스트 바로가기</span></a>,
 <a href="#account" onclick="document.getElementById('account').tabIndex = -1;document.getElementById('account').focus();return false;"><span>로그인 바로가기</span></a>,
 <a class="al_favorite" data-clk="top.mkhome" href="http://help.naver.com/support/alias/contents2/naverhome/naverhome_1.naver">네이버를 시작페이지로<span class=

#### Crawling vs Scraping
![55](https://user-images.githubusercontent.com/38183218/42806142-a9fae76c-89e8-11e8-955c-d60bdd3d58d9.PNG)