- request
- BeautifulSoup

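### HTTP protocol

HTTP works as an exchange of a request and a response.

The web browser sends a request to the server over HTTP, with headers attached.

The server sends back a response, also with headers attached, whose body is the HTML code.

The web browser then renders that HTML to display the page.

HTTP runs on top of TCP/IP.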
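
The request side of this exchange can be inspected without touching the network. The sketch below builds a `urllib.request.Request` and reports the method, URL, and headers it would send. The helper name `describe_request`, the URL `example.com`, and the header value are placeholders chosen for illustration.

```python
from urllib.request import Request

def describe_request(url, headers):
    """Return the method, URL, and headers a Request would send (nothing is sent)."""
    req = Request(url, headers=headers)
    return {
        "method": req.get_method(),           # "GET" when no body is attached
        "url": req.full_url,
        "headers": dict(req.header_items()),  # urllib stores names capitalized, e.g. "User-agent"
    }

info = describe_request("https://example.com/", {"User-Agent": "python bot"})
print(info)
```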


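#### Cautions when crawling
1. Whether the content is protected by intellectual property rights
2. Whether crawling puts a load on the site
3. Whether it violates the site's terms of use
4. Whether it collects users' sensitive information
5. Whether the collected content is used under appropriate usage standards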
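
One common way to limit the load mentioned in item 2 is to space out requests to the same host. The following is a minimal sketch, not part of the original notes: the `Throttle` class, its `delay` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical names introduced here.

```python
import time
from urllib.parse import urlparse

class Throttle:
    """Wait so that requests to the same host are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_seen = {}     # host -> time of the previous request

    def wait(self, url):
        host = urlparse(url).netloc
        previous = self.last_seen.get(host)
        if previous is not None:
            remaining = self.delay - (self.clock() - previous)
            if remaining > 0:
                self.sleep(remaining)
        self.last_seen[host] = self.clock()
```

A crawler would create one `Throttle(delay=1.0)` and call `throttle.wait(url)` right before each download. The right delay depends on the site, so the value here is only an example.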

#### robots.txt

A convention for controlling access by bots such as crawlers

https://www.google.co.kr/robots.txt

Disallow: /search

Allow: /search/about
Allow: /search/static
...
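
The standard library can evaluate rules like these before a crawler requests a page. The sketch below feeds the excerpt to `urllib.robotparser` from memory instead of downloading the file. The `User-agent: *` line is added because the excerpt omits it, and the `/about` path and the bot name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the excerpt above, plus a "User-agent: *" line so the
# parser knows which bots the rules apply to.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /search/about",
    "Allow: /search/static",
]

rp = RobotFileParser()
rp.parse(ROBOTS_LINES)  # in-memory parse; rp.set_url(...) followed by rp.read() would fetch a live file

blocked = rp.can_fetch("python bot", "https://www.google.co.kr/search?q=corpus")
allowed = rp.can_fetch("python bot", "https://www.google.co.kr/about")
print(blocked, allowed)
```

As far as I know, this parser applies rules in file order, so a more specific `Allow` line that comes after a broader `Disallow` may not be honored the way a site's own matcher would honor it. Treat the result as a conservative first check.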

#### Checking which web technologies a site uses

In [12]:
import builtwith

print(builtwith.parse('https://www.google.com')) # not treated as normal access
print(builtwith.parse('https://www.naver.com'))
print(builtwith.parse('https://www.wordpress.com'))

{'web-servers': ['Google Web Server']}
{}
{'web-servers': ['Nginx'], 'font-scripts': ['Google Font API'], 'ecommerce': ['WooCommerce'], 'cms': ['WordPress'], 'programming-languages': ['PHP'], 'blogs': ['PHP', 'WordPress']}


#### Checking who owns a website

In [13]:
import whois

print(whois.whois("naver.com"))
print(whois.whois("sogang.ac.kr"))

{
  "domain_name": [
    "NAVER.COM",
    "naver.com"
  ],
  "registrar": "Gabia, Inc.",
  "whois_server": "whois.gabia.com",
  "referral_url": null,
  "updated_date": [
    "2016-08-05 06:37:57",
    "2018-02-28 11:27:15"
  ],
  "creation_date": [
    "1997-09-12 04:00:00",
    "1997-09-12 00:00:00"
  ],
  "expiration_date": [
    "2023-09-11 04:00:00",
    "2023-09-11 00:00:00"
  ],
  "name_servers": [
    "NS1.NAVER.COM",
    "NS2.NAVER.COM",
    "ns1.naver.com",
    "ns2.naver.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "ok https://icann.org/epp#ok"
  ],
  "emails": [
    "white.4818@navercorp.com",
    "dl_ssl@navercorp.com",
    "abuse@gabia.com"
  ],
  "dnssec": "unsigned",
  "name": "NAVER Corp.",
  "org": "NAVER Corp.",
  "address": "6 Buljung-ro, Bundang-gu, Seongnam

#### Downloading a web page

In [14]:
import urllib
from urllib.request import urlopen

![](https://user-images.githubusercontent.com/38183218/42746165-7451e354-8911-11e8-93de-8822f34665ca.PNG)
![](https://user-images.githubusercontent.com/38183218/42746166-7477cb14-8911-11e8-81da-1a53595e2c68.PNG)

In [15]:
resp = urlopen("https://python.org") # request: open the URL and fetch the page content

In [17]:
html = resp.read() # read the HTML, which arrives as bytes

In [19]:
html.decode('utf-8')



![](https://user-images.githubusercontent.com/38183218/42746167-749c5eb6-8911-11e8-959b-1ab9cbdc10e1.PNG)
![](https://user-images.githubusercontent.com/38183218/42746168-74d641bc-8911-11e8-8239-711beb7ec91a.PNG)

In [24]:
# URL that includes a query string (the URL sent to a search page)
# Both requests below fail with HTTP Error 403: Forbidden
# resp = urlopen("https://www.google.co.kr/search?q=%EC%B9%B4%EC%9D%B4%EC%8A%A4%ED%8A%B8+%EB%A7%90%EB%AD%89%EC%B9%98&sa=X&ved=0ahUKEwjSj_Ky4KLcAhWRFogKHZZ0C_kQ1QIIuQEoBg&biw=1536&bih=759")
# resp = urlopen("https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98")

![](https://user-images.githubusercontent.com/38183218/42746171-754a7096-8911-11e8-9c69-92cae978ca05.PNG)

In [26]:
from urllib.error import HTTPError

try:
    urlopen("https://www.google.co.kr/search?q=%EC%B9%B4%EC%9D%B4%EC%8A%A4%ED%8A%B8+%EB%A7%90%EB%AD%89%EC%B9%98&sa=X&ved=0ahUKEwjSj_Ky4KLcAhWRFogKHZZ0C_kQ1QIIuQEoBg&biw=1536&bih=759")
except HTTPError as e:
    print(e.code, e.reason, e.headers)

403 Forbidden Content-Type: text/html; charset=UTF-8
Date: Mon, 16 Jul 2018 04:28:15 GMT
Server: gws
Cache-Control: private
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close
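
The status code decides what a crawler should do next: a 4xx code such as 403 means the request itself was refused, while a 5xx code points at a server-side problem that may be temporary. A small sketch using the standard `http.HTTPStatus` enum follows; the helper name `describe_status` is an assumption made for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase and whether a retry is worth trying."""
    status = HTTPStatus(code)
    retryable = 500 <= code < 600   # server-side errors may be temporary
    return status.phrase, retryable

for code in (200, 403, 503):
    print(code, *describe_status(code))
```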




#### Example
![](https://user-images.githubusercontent.com/38183218/42746175-75ca183c-8911-11e8-8042-6d23d0c85304.PNG)

In [27]:
from urllib.request import Request

In [74]:
def download(url, agent = "python bot", num_retries=2):
    headers = {'User-agent':agent}
    req = Request(url, headers = headers)
                 
    try:
        resp = urlopen(req)
        
    except HTTPError as e:
        resp = None
        print(e.code,e.reason,e.headers)
        
        if 500<=e.code<600 and num_retries>0:
            return download(url, agent=agent, num_retries=num_retries - 1) # retry recursively, keeping the same agent
   
    
    
    return resp
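
The recursive retry in `download` can also be written as a loop that waits between attempts. This is only a sketch under assumptions: `fetch` is a hypothetical callable returning `(status_code, body)`, and the `backoff` and `sleep` parameters are introduced here so the logic can be exercised without a network or real delays.

```python
import time

def fetch_with_retries(fetch, url, num_retries=2, backoff=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-5xx status or retries run out.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    attempt = 0
    while True:
        status, body = fetch(url)
        if not (500 <= status < 600) or attempt >= num_retries:
            return status, body
        sleep(backoff * 2 ** attempt)   # wait longer after each failed attempt
        attempt += 1
```

Doubling the wait after each failure gives a struggling server more room to recover than retrying immediately.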
                 

In [66]:
url = "https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98"
download(url)

403 Forbidden Date: Mon, 16 Jul 2018 04:51:52 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Set-Cookie: page_uid=T0wCjspySoCsstnTRQ8ssssst/R-183654; path=/; domain=.naver.com
Set-Cookie: _naver_usersession_=fztrpSfhFTSbEsirKobMgQ==; path=/; expires=Mon, 16-Jul-18 04:56:52 GMT; domain=.naver.com
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; report=/p/er/post/xss
Cache-Control: no-cache, no-store, must-revalidate, max-age=0
Pragma: no-cache
Vary: Accept-Encoding




Changing the User-agent value in the header to that of a real browser resolves the 403 error.

In [75]:
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
resp = download( url, agent = user)
type(resp)

http.client.HTTPResponse

In [76]:
resp.read().decode("utf-8")

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

![](https://user-images.githubusercontent.com/38183218/42746176-75f20716-8911-11e8-8732-ac8ea6479de4.PNG)

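- Using requests
    - A much higher-level abstraction that is easier to use and needs less code
    - GET and POST methods
        - GET: parameters are appended to the URL; convenient, but limited in data types, encoding, and so on
        - POST: the URL and the data are sent separately, with the data carried in the request body; used, for example, when sending large amounts of data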
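
The difference between the two methods can be seen by building both kinds of request without sending them. The sketch below uses only the standard library (`urllib`) so that it runs without extra packages. With requests, the analogous calls would be `requests.get(url, params=...)` and `requests.post(url, data=...)`. The parameter values are taken from the search URL used in these notes.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"where": "nexearch", "query": "말뭉치"}
encoded = urlencode(params)   # percent-encodes non-ASCII values as UTF-8

# GET: the parameters become part of the URL.
get_req = Request("https://search.naver.com/search.naver?" + encoded)

# POST: the URL stays clean and the parameters travel in the request body.
post_req = Request("http://httpbin.org/post", data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```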

In [71]:
import requests

In [82]:
def download2(url, agent = "python bot", num_retries=2):
    headers = {'User-agent':agent}
    resp = requests.request("get", url, headers = headers)
    
    if 500<=resp.status_code<600 and num_retries>0:
        print(resp.status_code, resp.reason)
        return download2(url, agent=agent, num_retries=num_retries - 1)

    return resp

In [83]:
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
html = download2( url, agent = user)

In [85]:
html.text

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

In [86]:
html.encoding

'UTF-8'

![](https://user-images.githubusercontent.com/38183218/42746173-756e4944-8911-11e8-91e8-81bd0a6920fd.PNG)

In [87]:
from urllib import parse

In [88]:
parse.quote("한글") # percent-encodes the UTF-8 bytes so that Hangul can go into a URL query

'%ED%95%9C%EA%B8%80'

In [91]:
url = "https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8\
&query="
url = url + parse.quote("말뭉치")
url

'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query=%EB%A7%90%EB%AD%89%EC%B9%98'
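
Percent-encoding is reversible, and spaces are the main case where the helper functions differ. A short sketch of the round trip with `urllib.parse`; the sample string `"web crawler"` is made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote

encoded = quote("한글")
decoded = unquote(encoded)        # back to the original text
print(encoded, decoded)

# quote() writes a space as %20, quote_plus() writes it as "+",
# which is the form normally used inside query strings.
print(quote("web crawler"), quote_plus("web crawler"))
```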

In [96]:
user

'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

In [99]:
html = download2(url, agent = user)
html.text

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

#### Example: POST with JSON

In [100]:
import json

In [101]:
params = {
        "where":"nexearch",
        "sm":"top_hty",
        "fbm":1,
        "ie":"utf8",
        "query":parse.quote("고려대")
         }

In [102]:
params

{'fbm': 1,
 'ie': 'utf8',
 'query': '%EA%B3%A0%EB%A0%A4%EB%8C%80',
 'sm': 'top_hty',
 'where': 'nexearch'}

In [103]:
jsonParams = json.dumps(params)
jsonParams #to str

'{"where": "nexearch", "sm": "top_hty", "fbm": 1, "ie": "utf8", "query": "%EA%B3%A0%EB%A0%A4%EB%8C%80"}'

In [106]:
def downloadJson(url, agent="python bot", num_retries=2):
    headers = {'User-agent':agent}
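    # jsonParams is already a JSON string, so json= serializes it a second time
    # (visible in the escaped output below); pass the params dict to send it once
    resp = requests.post(url, json=jsonParams, headers=headers)
    
    if 500<=resp.status_code<600 and num_retries>0:
        print(resp.status_code, resp.reason)
        return downloadJson(url, agent=agent, num_retries=num_retries - 1)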
    
    return resp

In [114]:
html = downloadJson("http://httpbin.org/post",agent=user)
html.text

'{"args":{},"data":"\\"{\\\\\\"where\\\\\\": \\\\\\"nexearch\\\\\\", \\\\\\"sm\\\\\\": \\\\\\"top_hty\\\\\\", \\\\\\"fbm\\\\\\": 1, \\\\\\"ie\\\\\\": \\\\\\"utf8\\\\\\", \\\\\\"query\\\\\\": \\\\\\"%EA%B3%A0%EB%A0%A4%EB%8C%80\\\\\\"}\\"","files":{},"form":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Content-Length":"122","Content-Type":"application/json","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"},"json":"{\\"where\\": \\"nexearch\\", \\"sm\\": \\"top_hty\\", \\"fbm\\": 1, \\"ie\\": \\"utf8\\", \\"query\\": \\"%EA%B3%A0%EB%A0%A4%EB%8C%80\\"}","origin":"163.152.3.129","url":"http://httpbin.org/post"}\n'
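
The heavily escaped `"data"` field above is what serializing a value twice looks like: the body is a JSON string that itself contains JSON text. The sketch below reproduces the effect with `json` alone. The shortened `params` dict is an illustration, not the exact one sent above.

```python
import json

params = {"where": "nexearch", "fbm": 1}

once = json.dumps(params)    # dict -> JSON object text
twice = json.dumps(once)     # str  -> JSON string that contains the object text

print(once)
print(twice)

# One json.loads() recovers the dict from `once`,
# but `twice` has to be decoded two times.
print(json.loads(once) == params)
print(json.loads(json.loads(twice)) == params)
```

With requests, passing the dict itself (`json=params`) should send the object once, since the `json=` argument does the serialization.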

#### GET method

In [111]:
def downloadGet(url, agent="python bot", num_retries=2):
    headers = {'User-agent':agent}
    resp = requests.get(url, headers=headers)
    
    if 500<=resp.status_code<600 and num_retries>0:
        print(resp.status_code, resp.reason)
        return downloadGet(url, agent=agent, num_retries=num_retries - 1)
    
    return resp

In [123]:
downloadGet(url, agent=user).text[:1000]

'<!doctype html> <html lang="ko"> <head> <meta charset="utf-8"> <meta name="referrer" content="always">  <meta name="format-detection" content="telephone=no,address=no,email=no"> <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=2.0"> <meta property="og:title" content="말뭉치 : 네이버 통합검색"/> <meta property="og:image" content="https://ssl.pstatic.net/sstatic/search/common/og_v3.png"> <meta property="og:description" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <meta name="description" lang="ko" content="\'말뭉치\'의 네이버 통합검색 결과입니다."> <title>말뭉치 : 네이버 통합검색</title> <link rel="shortcut icon" href="https://ssl.pstatic.net/sstatic/search/favicon/favicon_140327.ico">  <link rel="search" type="application/opensearchdescription+xml" href="https://ssl.pstatic.net/sstatic/search/opensearch-description.https.xml" title="Naver" /><link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/sstatic/search/pc/css/search1_180712.css"> <link rel="stylesheet" type="text/css" href="h

### Crawling

![](https://user-images.githubusercontent.com/38183218/42746174-759b921e-8911-11e8-9370-972934728d01.PNG)

#### BeautifulSoup

BeautifulSoup is a parsing library, while Scrapy is a web-crawling (scraping) framework.

In [115]:
from bs4 import BeautifulSoup

![](https://user-images.githubusercontent.com/38183218/42746169-74fdd07e-8911-11e8-9dd1-90f8a57783d6.PNG)

In [116]:
html = """
    <html>
        <head></head>
        <body>
            <div id="wrap">
                <p class="content">
                    <a href="_blank">Click</a>
                </p>
            </div>
        </body>
    </html>
"""

In [117]:
doc = BeautifulSoup(html,"lxml") # parse the HTML into a DOM tree

In [118]:
doc.div

<div id="wrap">
<p class="content">
<a href="_blank">Click</a>
</p>
</div>

In [120]:
doc.a["href"] 

'_blank'

In [121]:
doc.a.attrs

{'href': '_blank'}

![](https://user-images.githubusercontent.com/38183218/42746170-7524bc20-8911-11e8-9b98-1fe3ff01a525.PNG)

In [126]:
html

'\n    <html>\n        <head></head>\n        <body>\n            <div id="wrap">\n                <p class="content">\n                    <a href="_blank">Click</a>\n                </p>\n            </div>\n        </body>\n    </html>\n    \n'

In [128]:
soup = BeautifulSoup(downloadGet(url, agent=user).text, "lxml")

In [129]:
type(soup)

bs4.BeautifulSoup

In [130]:
soup.find('a')

<a href="#lnb"><span>메뉴 영역으로 바로가기</span></a>

In [135]:
a_list = soup.find_all('a')

In [136]:
len(a_list)

384

In [139]:
for row in a_list:
    print(row.attrs["href"])

#lnb
#content
http://www.naver.com
#
#
https://help.naver.com/support/alias/search/word/word_16.naver
#
#
https://nid.naver.com/nidlogin.login?url=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_hty%26fbm%3D1%26ie%3Dutf8%26query%3D%25EB%25A7%2590%25EB%25AD%2589%25EC%25B9%2598
https://help.naver.com/support/alias/search/word/word_16.naver
https://help.naver.com/support/alias/search/word/word_21.naver
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/word_18.naver
javascript:;
javascript:;
https://help.naver.com/support/alias/search/word/word_17.naver
https://help.naver.com/support/alias/search/word/w
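
Many of the values printed above are in-page anchors (`#`) or `javascript:` pseudo-links, and an `<a>` tag without an `href` would raise a `KeyError` on `row.attrs["href"]`. The sketch below shows one way to filter and resolve links. It deliberately swaps in the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. The class name `LinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect links from <a href=...> tags, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")     # None when the tag has no href
        if not href or href.startswith(("#", "javascript:")):
            return                         # skip in-page anchors and script pseudo-links
        self.links.append(urljoin(self.base_url, href))

sample = (
    '<a href="#lnb">menu</a><a href="/help">help</a><a>no href</a>'
    '<a href="javascript:;">x</a><a href="http://www.naver.com">home</a>'
)
collector = LinkCollector("https://search.naver.com/search.naver")
collector.feed(sample)
print(collector.links)
```

With BeautifulSoup, the same safe lookup is available as `row.get("href")`, which returns `None` instead of raising when the attribute is missing.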

In [144]:
params = {
    }

url = "http://www.ppomppu.co.kr/zboard/zboard.php?id=freeboard"

In [145]:
html = downloadGet(url, agent=user)

In [147]:
html.encoding

'euc-kr'

In [148]:
type(html.text)
# needs decoding

str

In [146]:
html.text # garbled

'\x1f�\x08\x00\x00\x00\x00\x00\x00\x03�}�\x7f\x1b蘭偵惰s��辜"�D���v$.\x04h9\x17\x08��uC>����5�hF��痢�好�\x01z�b�Z�%�$�\x0f\x19j\x1c�\x14N�-�4!倜����$α|�Z{fㅡ��\x1f�琁�X�Y{�뒀^互�^{����=z包�_<�\x18�iq�<��G�|�\x10a�n土|�縫G�<J~��#O=IX��\x1cQ8I\x154A�8欄~�i�01MK\x0c멓＃．Q�KV��G�u�@^,\x166>:5KIWD�0���x\x07��\\\x04��y�#훤���\x14F晃!Y�xIs\x1e\x19K�\x0c\t隱����襟Xt��c�����O���\nC���<\x17V�Dmn퓜\x11NⅧ룐���|M\x16aU�S惰��<�<$�\x13�&�Dk�\'\x1e寨�!�1KI\\��3\\R������g\x0e?最3?!�\x0e�*H#석5X�B?台벅+\x08�必QY��\x16*�ho?x��YF8"鼎\tY�,��BD��#��\x10��徹\x01"H����T��힘�3\x0e�8\\�\'殖KI�W�w\x0e悠\x1f植\x03D��!^sFxt��HB0\x19�Gj?���EUJ(r�W�1?#\x0f\r\x08qn�j<��%\x12`�D�\x15�]���\x12�n\x01(�\\"\x11�z�\x1eO�롼�����Z�A\x1b>�&㉣�^@\x134�\x0f�\x05�$�O�o�>\\X^+\x1ct�7��\x07EA\x1a&1���\x1d�nQ西��dyH阿����qt���\\\\\x10형O��徊s�/\x1cD�E�CE�Rc<�9�\x06>�w���\x08P삭2��)�2:��\r�\x12L-邦m7-�B5F�`�>O�핌拖貫k�i��Bⓖ\x1d\x15㏘���者�`\\�S따��RD�\x1e�\x1d鏡}]zⅠ*m�K�bBx8$�(~\x00�>O/藩2\x00즌��賚�!�S"苧\x0bA�o�\x03o辣P�J��w���c�\x040���\x1a져Pb.��*�G�\x11$q�?�<�\x03\x01�ir2\x1csn탬�x��z

In [150]:
type(html.content)

bytes
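
Since `html.content` holds the raw bytes, they can be decoded by hand. The garbled text above starts with `\x1f`, one unreadable byte, and `\x08`, which resembles the gzip header `1f 8b 08`. That is only a guess from the printed output, but it would explain why the text is unreadable even though the reported encoding is `euc-kr`. The sketch below handles both cases. The function name `decode_body` and the sample text are assumptions made for this example.

```python
import gzip

def decode_body(raw, encoding):
    """Decode response bytes to str, gunzipping first if they look compressed."""
    if raw[:2] == b"\x1f\x8b":          # gzip magic number
        raw = gzip.decompress(raw)
    return raw.decode(encoding, errors="replace")

page = "자유게시판".encode("euc-kr")       # stand-in for bytes received from the server

print(decode_body(page, "euc-kr"))                 # right codec: readable text
print(decode_body(page, "utf-8"))                  # wrong codec: replacement characters
print(decode_body(gzip.compress(page), "euc-kr"))  # compressed bytes are unpacked first
```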

In [151]:
# let's try the download function built with urllib

In [169]:
html = download(url, agent=user)

In [170]:
type(html) # HTTPResponse object

http.client.HTTPResponse

In [171]:
html = html.read()
type(html)

bytes

In [172]:
html = html.decode("euc-kr") 

In [173]:
print(type(html))
html

<class 'str'>


'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="author" content="PPOMPPU CO.">\n<meta name="description" content="뽐뿌">\n<meta name="keywords" content="">\n\n\n<!--\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />\n-->\n\n\n<meta property="og:image" content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" />\n\n<meta property="og:site_name" content="뽐뿌" />\n\n<title>뽐뿌 - 자유게시판</title><!--<link href=\'http://fonts.googleapis.com/css?family=Noto+Sans\' rel=\'stylesheet\' type=\'text/css\'>-->\n\n<link rel="stylesheet" type="text/css" hr

In [174]:
html2 = html.encode("utf-8").decode("utf-8") # encode to UTF-8 and decode again; html is already a str, so this round trip leaves it unchanged

In [175]:
type(html2)
html2

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="author" content="PPOMPPU CO.">\n<meta name="description" content="뽐뿌">\n<meta name="keywords" content="">\n\n\n<!--\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />\n-->\n\n\n<meta property="og:image" content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" />\n\n<meta property="og:site_name" content="뽐뿌" />\n\n<title>뽐뿌 - 자유게시판</title><!--<link href=\'http://fonts.googleapis.com/css?family=Noto+Sans\' rel=\'stylesheet\' type=\'text/css\'>-->\n\n<link rel="stylesheet" type="text/css" hr

In [176]:
text1 = BeautifulSoup(html, "lxml")
text1.contents # BeautifulSoup handles encoding and decoding of the document that was garbled earlier; note the meta charset now shows utf-8

['html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"',
 <html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="text/javascript" http-equiv="Content-Script-Type"/>
 <meta content="text/css" http-equiv="Content-Style-Type"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="PPOMPPU CO." name="author"/>
 <meta content="뽐뿌" name="description"/>
 <meta content="" name="keywords"/>
 <!--
 <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />
 -->
 <meta content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" property="og:image"/>
 <meta content="뽐뿌" property="og:site_name"/>
 <title>뽐뿌 - 자유게시판</title><!--<link href='http://fonts.googleapis.com/css?family=Noto+Sans' rel='stylesheet' type='text/css'>-->
 <link href="//www.ppomppu.co.kr/css/style.css?v=2018070218" rel="styles

In [177]:
a_lst1 = text1.find_all('a')

In [179]:
for row in a_lst1:
    print(row.attrs)

{'rel': ['#tab1-contents'], 'class': ['tab']}
{'rel': ['#tab2-contents'], 'class': ['tab']}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=event'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=buy'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=help'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=freeboard'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=etc_info'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=free_picture'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=news2'}
{'href': 'http://www.ppomppu.co.kr/zboard/zboard.php?id=review'}
{'href': 'http://www.ppomppu.co.kr/recent_main_article.php?type=market'}
{'href': 'http://www.ppomppu.co.kr/myinfo/env.php?cmd=env', 'target': '_blank'}
{'href': 'http://www.ppomppu.co.kr/myinfo/member_bookmark.php', 'target': '_blank'}
{'href': 'http://www.ppomppu.co.kr/index.php', 'class': ['logo-sm']}
{'href': '/z

In [180]:
text2 = BeautifulSoup(html2, "lxml")
text2.contents

['html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"',
 <html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="text/javascript" http-equiv="Content-Script-Type"/>
 <meta content="text/css" http-equiv="Content-Style-Type"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="PPOMPPU CO." name="author"/>
 <meta content="뽐뿌" name="description"/>
 <meta content="" name="keywords"/>
 <!--
 <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, user-scalable=yes, target-densitydpi=device-dpi" />
 -->
 <meta content="http://www.ppomppu.co.kr/images/icon_app_20160427.png" property="og:image"/>
 <meta content="뽐뿌" property="og:site_name"/>
 <title>뽐뿌 - 자유게시판</title><!--<link href='http://fonts.googleapis.com/css?family=Noto+Sans' rel='stylesheet' type='text/css'>-->
 <link href="//www.ppomppu.co.kr/css/style.css?v=2018070218" rel="styles