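### scrapy
- A web data collection framework written in Python
    - Difference between a framework and a library or package
    - A framework ships with pre-built code for a specific purpose, and you write your code by filling in the blanks
    - A package is code written by someone else that you import and use
- scrapy
    - pip install scrapy
- tree
    - sudo apt install tree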

#### index
- xpath: a syntax that plays the same role as a CSS selector
- Structure of scrapy
- Crawling gmarket best-seller product data

In [2]:
import scrapy
import requests
from scrapy.http import TextResponse  # used for xpath practice

#### 1. How to use xpath
- Naver and Daum real-time search keyword data
- xpath for the Naver search keywords

```
//*[@id="NM_RTK_ROLLING_WRAP"]/div/a/span
```

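- `//` : searches from the top of the document through descendants at every depth, like the descendant selector "div .txt" in CSS
- `*` : matches any element, whatever its tag name
- `[@id="NM_RTK_ROLLING_WRAP"]` : condition: the element whose id is NM_RTK_ROLLING_WRAP
- `/` : looks only at the element directly below, like "div > .txt" in CSS
- `div[1]` : selects the first div element
- `.` : selects the current element
- `not` : not(condition)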
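The syntax above can be tried without any network access. This is a minimal sketch that swaps in the standard library's `xml.etree.ElementTree` instead of scrapy, so it only covers the subset of XPath that module supports (`text()` and `not(...)` are not available there, and paths must start with `.`). The markup and the id `wrap` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a real page.
HTML = """
<html>
  <body>
    <div id="wrap">
      <div>
        <a><span>1</span><span>first keyword</span></a>
        <a><span>2</span><span>second keyword</span></a>
      </div>
    </div>
    <div id="other">
      <a><span>ignored</span></a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(HTML)

# './/' searches every depth below the current node, '*' matches any tag,
# and [@id="wrap"] keeps only the element whose id is "wrap".
wrap = root.findall('.//*[@id="wrap"]')

# '/' steps down exactly one level at a time: wrap > div > a > span.
spans = root.findall('.//*[@id="wrap"]/div/a/span')

# [2] keeps the second <span> child of each <a> (XPath positions start at 1).
keywords = [s.text for s in root.findall('.//*[@id="wrap"]/div/a/span[2]')]

# '.' is the current element, so './div' means its direct <div> children.
inner_divs = wrap[0].findall('./div')

print(keywords)
```

With scrapy, the same expressions are written without the leading `.` and passed to `response.xpath(...)`.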


In [24]:
# connect to the web page
req = requests.get("https://www.naver.com/")

# create a response object; think of it as parsing with scrapy's response instead of BeautifulSoup
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [35]:
# fetch the Naver keyword ranking data
# xpath: the XPath expression used for the selection
# data: the element selected by that expression

response.xpath('//*[@id="NM_RTK_ROLLING_WRAP"]/div/div/a/span[2]')


[]

In [28]:
# select the text node so that data holds the text
response.xpath('//*[@id="NM_RTK_ROLLING_WRAP"]/div/a/span/text()')

[]

In [None]:
# extract only the data values from the selectors, as a list of strings
response.xpath('//*[@id="NM_RTK_ROLLING_WRAP"]/div/a/span/text()').extract()

In [None]:
//*[@id="NM_RTK_ROLLING_WRAP"]/div/div


#### 2. Scrapy Project
- Create a scrapy project
- scrapy structure
- Collect gmarket best-seller product links, then the details behind each link

In [36]:
# create the project

In [37]:
!scrapy startproject crawler

New Scrapy project 'crawler', using template directory '/home/ubuntu/.pyenv/versions/3.6.9/envs/python3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/ubuntu/python3/notebook/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com


In [38]:
!ls

 01_mysql_with_python.ipynb	 Package.ipynb
 02_SQLAlchemy.ipynb		 __pycache__
 03_pymongo.ipynb		 calc
 04_slack.ipynb			 crawler
 05_requests_1.ipynb		 mypack
 06_requests_2.ipynb		 naver_main.png
 07_selenium_server.ipynb	 summary_database.ipynb
'08_iterator, generator.ipynb'	 zigbang.py
 09_scrapy_gmarket.ipynb


In [39]:
!tree crawler

crawler
├── crawler
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files


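#### Structure of scrapy
- spiders
    - Code that defines which web service to crawl and how (written as .py files)
- items.py
    - Code corresponding to the model; defines the data structure of the scraped data to be stored
- pipelines.py
    - Code that defines how each scraped item is processed after the spider yields it
- settings.py
    - Configuration values used while scraping
    - robots.txt: whether to obey it or not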
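
As a rough illustration of what settings.py holds, the fragment below shows the robots.txt switch along with two other common Scrapy settings. `ROBOTSTXT_OBEY = True` matches the run logs in this notebook; the delay and concurrency values are assumptions chosen for the example, not values from the original project.

```python
# crawler/crawler/settings.py (fragment)

# Respect each site's robots.txt rules; the run logs in this notebook show True.
ROBOTSTXT_OBEY = True

# Assumed values, not taken from the original project:
DOWNLOAD_DELAY = 1.0      # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 8   # upper bound on simultaneous requests
```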
        

#### Collecting gmarket best-seller products
- Product name, detail page URL, original price, sale price, discount rate
- Check the xpath
- items.py
- spider.py
- Run the crawler

#### 1. Check the xpath

In [46]:
req = requests.get("http://corners.gmarket.co.kr/Bestsellers")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

In [51]:
items = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li')
len(items)

200

In [52]:
links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a')
len(links)

200

In [54]:
links[0]

<Selector xpath='//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a' data='<a href="http://item.gmarket.co.kr/It...'>

In [55]:
links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href')
len(links)

200

In [56]:
links[0]

<Selector xpath='//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href' data='http://item.gmarket.co.kr/Item?goodsc...'>

In [61]:
links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()

In [62]:
links[0]

'http://item.gmarket.co.kr/Item?goodscode=1620413379&ver=637292869981332930'

In [70]:
req = requests.get(links[0])
response = TextResponse(req.url, body=req.text, encoding="utf-8")
title = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract() 
s_price = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract()
o_price = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract()
# discount_rate
title, s_price, o_price

('키위 제스프리 골드 키위 특대과 30과 ', '27,900', '43,350')

In [72]:
req = requests.get(links[0])
response = TextResponse(req.url, body=req.text, encoding="utf-8")
title = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract() 
s_price = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
o_price = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
discount_rate = str(round((1 - int(s_price) / int(o_price))*100, 2)) + "%"
title, s_price, o_price, discount_rate

('키위 제스프리 골드 키위 특대과 30과 ', '27900', '43350', '35.64%')
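
The price cleanup and discount calculation from the cell above can be pulled into small helpers so the same logic is reusable in the spider. A minimal sketch; the function names `to_int` and `discount_rate` are hypothetical, and the guard against a non-positive original price is an added assumption.

```python
def to_int(price_text):
    """Convert a price string such as '27,900' to the integer 27900."""
    return int(price_text.replace(",", "").strip())


def discount_rate(sale_text, original_text):
    """Return the discount as a percentage string, e.g. '35.64%'."""
    sale, original = to_int(sale_text), to_int(original_text)
    if original <= 0:
        raise ValueError("original price must be positive")
    return str(round((1 - sale / original) * 100, 2)) + "%"


print(discount_rate("27,900", "43,350"))  # 35.64%
```

When the sale price equals the original price the result is `0.0%`, which is what the spider produces for products without a separate original price.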

#### 2. Write items.py

In [74]:
! cat crawler/crawler/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


In [75]:
%%writefile crawler/crawler/items.py
import scrapy


class CrawlerItem(scrapy.Item):
    title = scrapy.Field()
    s_price = scrapy.Field()
    o_price = scrapy.Field()
    discount_rate = scrapy.Field()
    link = scrapy.Field()


Overwriting crawler/crawler/items.py


#### 3. Write spider.py

In [86]:
%%writefile crawler/crawler/spiders/spider.py
import scrapy
from crawler.items import CrawlerItem

class Spider(scrapy.Spider):
    name = "GmarketBestsellers"  # spider name, used by `scrapy crawl`
    allowed_domains = ["gmarket.co.kr"]  # only crawl this domain, even if links point elsewhere
    start_urls = ["http://corners.gmarket.co.kr/Bestsellers"]  # URLs requested first; several can be listed
    
    # Called with the response to each request made for start_urls
    def parse(self, response):
        links = response.xpath('//*[@id="gBestWrap"]/div/div[3]/div[2]/ul/li/div[1]/a/@href').extract()
        for link in links:
            # Request each link and pass its response to page_content
            yield scrapy.Request(link, callback=self.page_content)
    
    def page_content(self, response):
        item = CrawlerItem()        
        item["title"] = response.xpath('//*[@id="itemcase_basic"]/h1/text()')[0].extract() 
        item["s_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/strong/text()')[0].extract().replace(",", "")
        try:
            item["o_price"] = response.xpath('//*[@id="itemcase_basic"]/p/span/span/text()')[0].extract().replace(",", "")
        except IndexError:
            # No original price on the page: treat the sale price as the original
            item["o_price"] = item["s_price"]
            
        item["discount_rate"] = str(round((1 - int(item["s_price"]) / int(item["o_price"]))*100, 2)) + "%"
        item["link"] = response.url
        yield item
    

Overwriting crawler/crawler/spiders/spider.py
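
The request/callback flow in the spider can be hard to picture, so here is a toy model of it in plain Python. It is not how Scrapy's engine is actually implemented (the real one is asynchronous and does not guarantee ordering); `FakeRequest`, `PAGES`, and `run` are hypothetical names used only to show how yielded requests are scheduled and yielded items are collected.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# Hypothetical "site": URL -> page content, so that no network access is needed.
PAGES = {
    "list": ["item/1", "item/2"],
    "item/1": {"title": "A"},
    "item/2": {"title": "B"},
}


def parse(url, body):
    # Like Spider.parse: yield one follow-up request per link on the list page.
    for link in body:
        yield FakeRequest(link, callback=page_content)


def page_content(url, body):
    # Like Spider.page_content: yield the finished item for one detail page.
    yield {"title": body["title"], "link": url}


def run(start_url):
    """Toy engine: fetch each queued request and hand the response to its callback."""
    queue = deque([FakeRequest(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, PAGES[request.url]):
            if isinstance(result, FakeRequest):
                queue.append(result)   # new request: schedule it
            else:
                items.append(result)   # item: collect it
    return items


items = run("list")
print(items)
```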


#### 4. Run Scrapy

In [80]:
!ls crawler

crawler  scrapy.cfg


In [79]:
%%writefile run.sh
cd crawler
scrapy crawl GmarketBestsellers

Writing run.sh


In [82]:
# grant execute permission

In [81]:
!chmod +x run.sh

In [87]:
!./run.sh

2020-07-02 03:43:51 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: crawler)
2020-07-02 03:43:51 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 15 2020, 12:56:52) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1023-aws-x86_64-with-debian-buster-sid
2020-07-02 03:43:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-02 03:43:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2020-07-02 03:43:51 [scrapy.extensions.telnet] INFO: Telnet Password: 368e2b1f04c2ce17
2020-07-02 03:43:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scra

In [None]:
%load crawler/crawler/spiders/spider.py



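- Save the results to CSV
- A value that contains ',' can break CSV parsing unless it is quoted, so it is a good idea to strip commas from values such as prices with replace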
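
To see why commas inside values matter, the sketch below writes one row with Python's standard `csv` module and reads it back two ways. The row is made up for illustration. The `csv` module quotes fields that contain the delimiter, so a proper CSV reader recovers them, while a naive split on ',' does not.

```python
import csv
import io

rows = [
    {"title": "Kiwi, gold, 30 pieces", "s_price": "27,900"},  # hypothetical values
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "s_price"])
writer.writeheader()
writer.writerows(rows)

# A naive split on ',' miscounts the columns ...
data_line = buffer.getvalue().splitlines()[1]
naive = data_line.split(",")

# ... while a real CSV parser honors the quotes and recovers both fields.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))

print(len(naive), parsed[0]["s_price"])
```

Stripping the commas from prices is still convenient because the values can then be passed straight to `int()`.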

In [91]:
%%writefile run.sh
cd crawler
scrapy crawl GmarketBestsellers -o GmarketBestsellers.csv

Overwriting run.sh


In [92]:
!./run.sh

2020-07-02 04:47:38 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: crawler)
2020-07-02 04:47:38 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 15 2020, 12:56:52) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1023-aws-x86_64-with-debian-buster-sid
2020-07-02 04:47:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-02 04:47:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2020-07-02 04:47:38 [scrapy.extensions.telnet] INFO: Telnet Password: 93e91ce482f4da34
2020-07-02 04:47:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scra

In [93]:
!ls crawler/

GmarketBestsellers.csv	crawler  scrapy.cfg


In [118]:
!pwd

/home/ubuntu/python3/notebook/crawler


In [109]:
import os

In [116]:
os.getcwd()

'/home/ubuntu/python3/notebook/crawler'

In [136]:
os.chdir('/home/ubuntu/python3/notebook')

In [96]:
files = !ls crawler/
files

['GmarketBestsellers.csv', 'crawler', 'scrapy.cfg']

In [124]:
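import pandas as pd

# files holds names relative to crawler/, and the working directory is now its parent
df = pd.read_csv(os.path.join("crawler", files[0]))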
df.tail(2)

Unnamed: 0,discount_rate,link,o_price,s_price,title
197,0.0%,http://item.gmarket.co.kr/Item?goodscode=17924...,21900,21900,[ahc] AHC 더퓨어 아이크림 30ml 4개 + 3종 키트 증정
198,10.0%,http://item.gmarket.co.kr/Item?goodscode=18362...,25000,22500,[아이더] (현대백화점)아이더 남성 트위스팅 폴로 티셔츠


#### 5. Configure Pipelines
- Define the code that runs on each item before it is output

In [128]:
import requests
import json

def send_slack(msg):
    WEBHOOK_URL = "https://hooks.slack.com/services/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"
    payload = {
        "channel": "#welcome",
        "username": "NYM",
        "text": msg,
    }
    requests.post(WEBHOOK_URL, json.dumps(payload))

In [129]:
send_slack("test")

In [132]:
!cat ~/python3/notebook/crawler/crawler/pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CrawlerPipeline:
    def process_item(self, item, spider):
        return item


In [140]:
%%writefile crawler/crawler/pipelines.py
import requests
import json

class CrawlerPipeline(object):
    
    def __send_slack(self, msg):
        WEBHOOK_URL = "https://hooks.slack.com/services/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"
        payload = {
            "channel": "#welcome",
            "username": "NYM",
            "text": msg,
        }
        requests.post(WEBHOOK_URL, json.dumps(payload))
        
    def process_item(self, item, spider):
        keyword = "세트"  # Korean for "set"; titles containing it trigger a Slack message
        print("="*100)
        print(item["title"], keyword)
        print("="*100)
        if keyword in item["title"]:
            self.__send_slack("{},{},{}".format(
                item["title"], item["s_price"], item["link"]))
        return item

Overwriting crawler/crawler/pipelines.py
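
The filtering and message-building parts of the pipeline can be tested separately from the network call. A minimal sketch under some assumptions: `build_payload` and `should_notify` are hypothetical helper names, `SLACK_WEBHOOK_URL` is a hypothetical environment variable used here so the webhook URL does not have to be hard-coded, and the item is made up.

```python
import json
import os


def build_payload(item, channel="#welcome", username="NYM"):
    """Build the JSON body for a Slack incoming webhook from one scraped item."""
    text = "{},{},{}".format(item["title"], item["s_price"], item["link"])
    return json.dumps({"channel": channel, "username": username, "text": text})


def should_notify(item, keyword):
    """Return True when the item's title contains the keyword."""
    return keyword in item["title"]


# SLACK_WEBHOOK_URL is a hypothetical environment variable name.
webhook_url = os.environ.get("SLACK_WEBHOOK_URL", "")

# Made-up item for illustration.
item = {"title": "sample set", "s_price": "1000", "link": "http://example.com/1"}

body = None
if should_notify(item, "set"):
    body = build_payload(item)
    # With a real URL configured, the body would be sent with:
    #     requests.post(webhook_url, body)

print(body)
```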


- Pipeline configuration: settings.py
```
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
```

In [None]:
# echo prints the string that follows it; a single > overwrites the file, >> appends to the end

In [137]:
!echo "ITEM_PIPELINES = {" >> crawler/crawler/settings.py
!echo "  'crawler.pipelines.CrawlerPipeline': 300," >> crawler/crawler/settings.py
!echo "}" >> crawler/crawler/settings.py

In [138]:
!tail -n 3 crawler/crawler/settings.py

ITEM_PIPELINES = {
  'crawler.pipelines.CrawlerPipeline': 300,
}
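
The difference between `>` and `>>` maps onto Python's file modes. The sketch below uses a throwaway temporary file rather than the real settings.py; the `BOT_NAME` line stands in for the existing content of the file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings.py")  # throwaway file

# Mode "w" behaves like `>`: the file is truncated before writing.
with open(path, "w") as f:
    f.write("BOT_NAME = 'crawler'\n")

# Mode "a" behaves like `>>`: new text is added after the existing content.
with open(path, "a") as f:
    f.write("ITEM_PIPELINES = {\n")
    f.write("    'crawler.pipelines.CrawlerPipeline': 300,\n")
    f.write("}\n")

with open(path) as f:
    lines = f.read().splitlines()

print(lines)
```

Because appending never removes anything, running the append step twice leaves two copies of the `ITEM_PIPELINES` block in the file.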


In [141]:
!./run.sh

2020-07-02 05:58:29 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: crawler)
2020-07-02 05:58:29 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.9 (default, Apr 15 2020, 12:56:52) - [GCC 7.5.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-5.3.0-1023-aws-x86_64-with-debian-buster-sid
2020-07-02 05:58:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-02 05:58:29 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'crawler',
 'NEWSPIDER_MODULE': 'crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['crawler.spiders']}
2020-07-02 05:58:29 [scrapy.extensions.telnet] INFO: Telnet Password: 2c5401cfacb1a5a0
2020-07-02 05:58:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scra