# **Natural Language Analysis Day 1 - Requests**
- Notes from the data crawling class
- **한국IT비지니스진흥협회**

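# **Crawling with the urllib module**
## **1 urlparse**
`urllib` is the crawling toolkit that ships with Python 3; its `parse` submodule splits, joins, and percent-encodes URLs.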

In [3]:
from urllib import parse

url = "https://www.google.com/search?q=%EB%B0%95%EB%B3%B4%EC%98%81"
# Split the URL into its base components and GET params
parse.urlparse(url) 

In [4]:
# Resolve a new path against the base URL (the path is replaced and the query is dropped)
parse.urljoin(url, '/search/about')

'https://www.google.com/search/about'

In [5]:
# Percent-encode Unicode text (as UTF-8 bytes) and build a GET query string
parse.urlencode({'q':"자연어분석"})

'q=%EC%9E%90%EC%97%B0%EC%96%B4%EB%B6%84%EC%84%9D'

In [6]:
# Percent-encode a single string
txt_byte = parse.quote_plus('파이썬')
txt_byte

'%ED%8C%8C%EC%9D%B4%EC%8D%AC'

In [7]:
# Decode the percent-encoded text back into a string
parse.unquote_plus(txt_byte)

'파이썬'
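
The calls above can be combined into one round trip. The following is a minimal sketch that takes a URL apart, reads its query, and rebuilds it with a different search term; `parse_qs` and `urlunparse` are standard `urllib.parse` functions that the notes above do not use, and the variable names are only illustrative.

```python
from urllib import parse

url = "https://www.google.com/search?q=%EB%B0%95%EB%B3%B4%EC%98%81"

# ParseResult(scheme, netloc, path, params, query, fragment)
parts = parse.urlparse(url)

# parse_qs percent-decodes the query (UTF-8 by default) into a dict of lists
query = parse.parse_qs(parts.query)

# Rebuild the same URL with a different, freshly encoded search term
new_query = parse.urlencode({"q": "자연어분석"})
new_url = parse.urlunparse(parts._replace(query=new_query))

print(parts.netloc, parts.path)
print(query)
print(new_url)
```

Because `ParseResult` is a named tuple, `_replace` returns a modified copy and leaves `parts` untouched.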

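## **2 Testing a Naver News crawler**
- Check the allowed crawl scope with **robots.txt** **(it defines the opt-out rules)**
- If the site requires a separate certificate, the settings below (`cafile`, `capath`, `context`) must be added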

```python
Signature:
request.urlopen(
    url,
    data=None,
    timeout=<object object at 0x7f468312d550>,
    *,
    cafile=None,
    capath=None,
    cadefault=False,
    context=None
)
```
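
As a rough sketch of the certificate setting, the `context` parameter in the signature above accepts an `ssl.SSLContext`. The helper name `open_with_context` and its defaults are assumptions made for illustration. The `cafile`, `capath`, and `cadefault` parameters have been deprecated since Python 3.6 and are no longer accepted by recent Python releases, so `context` is the safer choice.

```python
import ssl
from urllib import request

def open_with_context(url, cafile=None, timeout=10):
    """Open `url` with an explicit SSLContext (hypothetical helper)."""
    # With cafile=None the system CA bundle is loaded;
    # pass the path of a PEM file to trust a private CA instead
    context = ssl.create_default_context(cafile=cafile)
    return request.urlopen(url, timeout=timeout, context=context)

# The default context verifies certificates and host names
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```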

In [8]:
# Read robots.txt with urllib, a Python standard-library module
url = "https://news.naver.com/robots.txt"

from urllib import request
resp = request.urlopen(url)
resp.read()

b'User-agent: Yeti\nAllow: /main/imagemontage\nDisallow: /\nUser-agent: *\nDisallow: /'

In [9]:
# Test the crawl scope defined in robots.txt
from urllib import robotparser
robot = robotparser.RobotFileParser()
robot.set_url(url)
robot.read()
robot.can_fetch("text", '/')

False

In [10]:
# Test a path that is allowed for the "Yeti" user agent
robot.can_fetch("Yeti", '/main/imagemontage')

True
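
The same check can be reproduced without a network call, which is handy for testing crawl rules. This sketch feeds the robots.txt body shown in the response above to `RobotFileParser.parse()`; the name `offline_robot` is only illustrative.

```python
from urllib import robotparser

# robots.txt body copied from the news.naver.com response shown above
robots_txt = (
    "User-agent: Yeti\n"
    "Allow: /main/imagemontage\n"
    "Disallow: /\n"
    "User-agent: *\n"
    "Disallow: /"
)

offline_robot = robotparser.RobotFileParser()
offline_robot.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

# Rules are checked in order, and the first matching rule decides
print(offline_robot.can_fetch("Yeti", "/main/imagemontage"))
print(offline_robot.can_fetch("Yeti", "/main/other"))
print(offline_robot.can_fetch("text", "/"))
```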

## **3 Testing a Google crawler**
### **01 Inspecting Google's headers**
- The rules are strict, and crawlers face many restrictions depending on their user agent

In [11]:
# Read the contents of robots.txt
url = "https://www.google.com/robots.txt"
resp = request.urlopen(url)
resp.read()[:500]

b'User-agent: *\nDisallow: /search\nAllow: /search/about\nAllow: /search/static\nAllow: /search/howsearchworks\nDisallow: /sdch\nDisallow: /groups\nDisallow: /index.html?\nDisallow: /?\nAllow: /?hl=\nDisallow: /?hl=*&\nAllow: /?hl=*&gws_rd=ssl$\nDisallow: /?hl=*&*&gws_rd=ssl\nAllow: /?gws_rd=ssl$\nAllow: /?pt1=true$\nDisallow: /imgres\nDisallow: /u/\nDisallow: /preferences\nDisallow: /setprefs\nDisallow: /default\nDisallow: /m?\nDisallow: /m/\nAllow:    /m/finance\nDisallow: /wml?\nDisallow: /wml/?\nDisallow: /wml/search?'

In [12]:
# Create a response object
from urllib import request
url = "https://www.google.com"
resp = request.urlopen(url)

# The response is a file-like stream (not a generator): its body can be read only once
type(resp)

http.client.HTTPResponse

In [13]:
# Inspect the response header values
resp.getheaders()

[('Date', 'Tue, 23 Jul 2019 00:22:06 GMT'),
 ('Expires', '-1'),
 ('Cache-Control', 'private, max-age=0'),
 ('Content-Type', 'text/html; charset=ISO-8859-1'),
 ('P3P', 'CP="This is not a P3P policy! See g.co/p3phelp for more info."'),
 ('Server', 'gws'),
 ('X-XSS-Protection', '0'),
 ('X-Frame-Options', 'SAMEORIGIN'),
 ('Set-Cookie',
  '1P_JAR=2019-07-23-00; expires=Thu, 22-Aug-2019 00:22:06 GMT; path=/; domain=.google.com'),
 ('Set-Cookie',
  'NID=188=jjGQaRy-GF9JfZq9a0jzF3EirCiV-vaZH0f_AtPRosckLXTP6SIsIPhn5lyFtG-9ax-rG3DG1enLcgaUdq57uvZIzP1SwaKVHSyXKghmes75gN3NoeSiPAaMkgE7A-vaUvrp_t2kCFNOuj1KRhMndQKbq_Wfi_5XJXmkfMRmwR0; expires=Wed, 22-Jan-2020 00:22:06 GMT; path=/; domain=.google.com; HttpOnly'),
 ('Alt-Svc', 'quic=":443"; ma=2592000; v="46,43,39"'),
 ('Accept-Ranges', 'none'),
 ('Vary', 'Accept-Encoding'),
 ('Connection', 'close')]

### **02 Inspecting Google search results**
```python
url = """https://www.google.com/search?client=ubuntu&hs=aqE
    &channel=fs&ei=5Zs1XbP9OtyAr7wP5O-q6A4
    &q=%EB%B0%95%EB%B3%B4%EC%98%81
    &oq=%EB%B0%95%EB%B3%B4%EC%98%81
    &gs_l=psy-ab.3..0l10.3038266.3042074..3042290...8.0..1.183.2500.1j
    17......0....1..gws-wiz.....0..0i71j0i273j0i10i67j0i10j0i10i42j0i1
    0i30j0i13i30j0i131.1QqFyiatYic
    &ved=0ahUKEwjzg8eQtMjjAhVcwIsBHeS3Cu0Q4dUDCAo&uact=5"""
```

In [14]:
# Raises HTTPError: HTTP Error 403: Forbidden
url = "https://www.google.com/search?client=ubuntu&q=%EB%B0%95%EB%B3%B4%EC%98%81"
# resp = request.urlopen(url)
# resp.getheaders()

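### **03 Getting past HTTP 403 Forbidden by disguising the bot**
- The crawler must **add a header** so that it looks like a browser
- Note that **non-ASCII characters must be percent-encoded, as in "%EB%B0%95%EB%B3%B4%EC%98%81"**, before they go into the URL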
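
The two points above can be sketched as a small helper that percent-encodes the keyword and attaches a browser-like header. The function name `build_search_request` is an assumption for illustration; it only builds the `Request` object and does not send it.

```python
from urllib import parse, request

def build_search_request(keyword):
    """Build (but do not send) a search request for `keyword` (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes
    query = parse.urlencode({"q": keyword})
    url = "https://www.google.com/search?" + query
    headers = {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    return request.Request(url, headers=headers)

req = build_search_request("박보영")
print(req.full_url)
# Request stores header names capitalized, so the key becomes "User-agent"
print(req.get_header("User-agent"))
```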

In [15]:
# The URL must be percent-encoded before it is passed in
userAgent = {"user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}
req = request.Request(url, headers=userAgent)
resp = request.urlopen(req)
resp.getheaders()

[('Content-Type', 'text/html; charset=UTF-8'),
 ('Date', 'Tue, 23 Jul 2019 00:22:06 GMT'),
 ('Expires', '-1'),
 ('Cache-Control', 'private, max-age=0'),
 ('Strict-Transport-Security', 'max-age=31536000'),
 ('P3P', 'CP="This is not a P3P policy! See g.co/p3phelp for more info."'),
 ('Server', 'gws'),
 ('X-XSS-Protection', '0'),
 ('X-Frame-Options', 'SAMEORIGIN'),
 ('Set-Cookie',
  '1P_JAR=2019-07-23-00; expires=Thu, 22-Aug-2019 00:22:06 GMT; path=/; domain=.google.com'),
 ('Set-Cookie',
  'CGIC=CgZ1YnVudHU; expires=Sun, 19-Jan-2020 00:22:06 GMT; path=/complete/search; domain=.google.com; HttpOnly'),
 ('Set-Cookie',
  'CGIC=CgZ1YnVudHU; expires=Sun, 19-Jan-2020 00:22:06 GMT; path=/search; domain=.google.com; HttpOnly'),
 ('Set-Cookie',
  'NID=188=j3SMQ6lDpWkqocXY8O9RseYy-NPwo4p_7GuvB_Lsn0J7PRXOiHNW1zzQvSaGCmf_wI8Vw05jQxu5G5iaCEflygPmm9a8l16eofhODpZhH6HNrTn7iOn-VEguFsNbycBuR8v-JoNmyfCIuNr5veMpWBdq9k3dDIxPaMPNCbJT9vU; expires=Wed, 22-Jan-2020 00:22:06 GMT; path=/; domain=.google.com; HttpO

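## **4 Examining what urllib collected**
- Just as the URL has to be percent-encoded on the way in,
- the response body comes back as **bytes**
- Converting it with **.decode()** gives a normal **String**
- Note that the response is a **file-like stream** (not a generator): **once read, the body is consumed**, and a second read returns nothing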

In [16]:
# Look at the raw bytes as they are
resp.read()[:1200]

b'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>\xeb\xb0\x95\xeb\xb3\xb4\xec\x98\x81 - Google \xea\xb2\x80\xec\x83\x89</title><script nonce="hMiAoNZ07spDiIW8E/oedA==">(function(){window.google={kEI:\'LlM2XejaIbCwmAXp06XIAg\',kEXPI:\'31\',authuser:0,kscs:\'c9c918f0_LlM2XejaIbCwmAXp06XIAg\',kGL:\'KR\'};google.sn=\'web\';google.kHL=\'ko\';google.jsfs=\'\';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||google.kEI};google.getLEI=function(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b};google.https=function(){return"https:"==window.location.protocol};google.ml=function(){return null};google.time=function(){return(new Date).getTime()};google.log=function(a,

In [17]:
# The stream was already consumed above, so read() returns b'' and .decode() yields an empty string
# Store the body in a variable first if it has to be reused
resp.read().decode('utf-8')

''
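
A minimal offline sketch of the read-once behaviour, using `io.BytesIO` as a stand-in for the response body (an assumption: `HTTPResponse` behaves the same way once its body has been read). Saving the bytes in a variable lets them be decoded as often as needed.

```python
import io

# BytesIO stands in for the HTTPResponse body
stream = io.BytesIO("<title>박보영 - Google 검색</title>".encode("utf-8"))

body = stream.read()         # the first read consumes the whole stream
second = stream.read()       # nothing is left to read

html = body.decode("utf-8")  # the saved bytes can still be decoded
print(second)
print(html)
```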

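# **Using the higher-level requests module**
## **1 Using the requests module**
- **! pip install requests**
- With urllib, the GET query has to be **percent-encoded** by hand,
- a **header** carrying user-agent information has to be added separately,
- and the response has to be **decoded** with the matching encoding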

In [18]:
# requests builds the request and also decodes the response automatically
# It fills in a default User-Agent, percent-encodes the query string, and picks a decoder for the response
url = "https://www.google.com/search?q=박보영"

import requests
resp = requests.request("get", url)
resp.request.headers

{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [19]:
# Pass the query through params instead of embedding it in the URL
url = "https://www.google.com/search"
userAgent = {"user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

resp = requests.request("get", 
                        url,
                        params = {"q":"박보영"}, # for a POST request, pass data= instead
                        headers=userAgent)
# The supplied User-Agent is merged with the default headers, and the query is percent-encoded automatically
print(resp.request.url)
print(resp.request.headers)
list(resp.headers.keys())

https://www.google.com/search?q=%EB%B0%95%EB%B3%B4%EC%98%81
{'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


['Content-Type',
 'Date',
 'Expires',
 'Cache-Control',
 'Strict-Transport-Security',
 'P3P',
 'Content-Encoding',
 'Server',
 'X-XSS-Protection',
 'X-Frame-Options',
 'Set-Cookie',
 'Alt-Svc',
 'Transfer-Encoding']

In [20]:
print(resp.status_code)
print(resp.reason)   # reason phrase for the status_code
print(resp.encoding) # encoding inferred from the response headers
resp.text[:500]      # body decoded with that encoding

200
OK
UTF-8


'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>박보영 - Google 검색</title><script nonce="+TOvL8uyNmDt62AZ3XOiVw==">(function(){window.google={kEI:\'MFM2XZDcBNPGmAWoq4eQCw\',kEXPI:\'31\',authuser:0,kscs:\'c9c918f0_MFM2XZDcBNPGmAWoq4eQCw\',kGL:\'KR\'};google.sn=\'web\';google.kHL=\'ko\';google.jsfs=\'\';})();(function(){google.l'

In [21]:
# Look at the body as raw bytes
resp.content[:500]

b'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>\xeb\xb0\x95\xeb\xb3\xb4\xec\x98\x81 - Google \xea\xb2\x80\xec\x83\x89</title><script nonce="+TOvL8uyNmDt62AZ3XOiVw==">(function(){window.google={kEI:\'MFM2XZDcBNPGmAWoq4eQCw\',kEXPI:\'31\',authuser:0,kscs:\'c9c918f0_MFM2XZDcBNPGmAWoq4eQCw\',kGL:\'KR\'};google.sn=\'web\';google.kHL=\'ko\';google.jsfs=\'\';})();(function('

In [22]:
# Override the encoding (a wrong choice garbles the Korean text)
resp.encoding = "euc-kr"
resp.text[:500]

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>諛�蹂댁�� - Google 寃����</title><script nonce="+TOvL8uyNmDt62AZ3XOiVw==">(function(){window.google={kEI:\'MFM2XZDcBNPGmAWoq4eQCw\',kEXPI:\'31\',authuser:0,kscs:\'c9c918f0_MFM2XZDcBNPGmAWoq4eQCw\',kGL:\'KR\'};google.sn=\'web\';google.kHL=\'ko\';google.jsfs=\'\';})();(function(){go'

In [23]:
resp.encoding = "utf-8"
resp.text[:500]

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><meta content="origin" name="referrer"><title>박보영 - Google 검색</title><script nonce="+TOvL8uyNmDt62AZ3XOiVw==">(function(){window.google={kEI:\'MFM2XZDcBNPGmAWoq4eQCw\',kEXPI:\'31\',authuser:0,kscs:\'c9c918f0_MFM2XZDcBNPGmAWoq4eQCw\',kGL:\'KR\'};google.sn=\'web\';google.kHL=\'ko\';google.jsfs=\'\';})();(function(){google.l'
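
The garbled title in the `euc-kr` output can be reproduced with plain byte strings. This is a small sketch of the idea only: the same UTF-8 bytes decode correctly with the matching codec and turn into unrelated characters with the wrong one.

```python
# The page title as UTF-8 bytes, the form in which the server sent it
raw = "박보영 - Google 검색".encode("utf-8")

right = raw.decode("utf-8")                     # matching codec restores the text
wrong = raw.decode("euc-kr", errors="replace")  # wrong codec yields mojibake

print(raw[:9])
print(right)
print(wrong == right)
```

The ASCII part of the text survives either way, which is why only the Korean characters looked broken in the cell above.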

## **2 HTTP(s)**
- http://www.httpbin.org/ : a site for testing HTTP requests and error responses
- http://www.crawler-test.com/status_codes/status_403
1. 1XX : **Informational** (Hold on)
1. 2XX : **Success** (Here we go)
1. 3XX : **Redirection** (Go away)
1. 4XX : **Client Error** 
1. 5XX : **Server Error**
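
The status classes listed above depend only on the first digit of the code, which the following sketch makes explicit. The function name `classify` is an assumption for illustration; `http.HTTPStatus` is the standard-library enum that supplies the reason phrases.

```python
from http import HTTPStatus

def classify(status_code):
    """Map a status code to its class name by its leading digit."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(status_code // 100, "Unknown")

for code in (200, 301, 403, 503):
    print(code, HTTPStatus(code).phrase, "->", classify(code))
```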

In [24]:
# Example site that returns a 403 error
url = "http://www.crawler-test.com/status_codes/status_403"
requests.request("get", url)

<Response [403]>

In [25]:
# Define a helper function around requests

def download(method, url, data=None):
    userAgent = {"user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

    try:
        resp = requests.request(method, url,  data=data, headers= userAgent)
        # 4XX is a client error, 5XX is a server error that may be worth retrying
        if 400 <= resp.status_code < 500:
            print(resp.status_code, "client error")
        elif 500 <= resp.status_code < 600:
            print(resp.status_code, "server error, retry")
        else:
            return resp

    # requests.request() does not raise for 4XX/5XX by itself;
    # connection-level failures arrive as RequestException subclasses
    except requests.exceptions.RequestException as e:
        print(e)
        
    return None

download("get", url=url)

403 client error


In [26]:
# Retry automatically when a 5XX error such as 503 occurs
import requests, time

def download(method, url, params=None, data=None, retries=3):
    userAgent = {"user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

    try:
        resp = requests.request(method, url, params=params, data=data, headers= userAgent)
        resp.raise_for_status()  # raises HTTPError for 4XX and 5XX responses

    except requests.exceptions.HTTPError as e:
        
        if 500 <= e.response.status_code < 600 and retries > 0 :
            time.sleep(0.5)
            print(retries, "retrying")
            return download(method, url, params, data, retries-1)
        else: 
            print("Errors")
            return None

    return resp

In [27]:
# On a 4XX error, give up immediately
url = "http://www.crawler-test.com/status_codes/status_403"
download("get", url=url)

Errors


In [28]:
# On a 5XX error, repeat the request
url = "http://www.crawler-test.com/status_codes/status_503"
html = download("get", url=url)
if not html:
    print("Errors")

3 retrying
2 retrying
1 retrying
Errors
Errors
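
The retry logic can also be written as a loop and tested without a network. The sketch below is a variation, not the function used above: `retry_on_server_error` and its `fetch` argument are hypothetical names, `fetch` is assumed to be a zero-argument callable returning `(status_code, body)`, and the waiting time doubles after each failure instead of staying at 0.5 seconds.

```python
import time

def retry_on_server_error(fetch, retries=3, delay=0.5):
    """Call fetch() until the status is not 5XX or the retries run out.

    `fetch` is a hypothetical zero-argument callable returning (status_code, body).
    """
    for attempt in range(retries + 1):
        status, body = fetch()
        if status < 500:
            # 2XX/3XX succeed; a 4XX will not improve on retry, so stop either way
            return body if status < 400 else None
        if attempt < retries:
            time.sleep(delay * 2 ** attempt)  # wait longer after each failure
    return None

# Fake server: answers 503 twice, then 200
responses = iter([(503, None), (503, None), (200, "ok")])
result = retry_on_server_error(lambda: next(responses), delay=0.01)
print(result)
```

Injecting `fetch` keeps the retry policy separate from the HTTP library, so the same loop could wrap a `requests` call or a `urllib` call.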
