# Ch.06 Scrapy Tutorial 2

<https://sugiaki1989.gitbook.io/scrapy-note/chapter06_tutorial02>

[Books to scrape](http://books.toscrape.com/)

## Creating the project

```bash
scrapy startproject sample_books

cd sample_books

scrapy genspider books_spider books.toscrape.com
```

```bash
tree
.
├── sample_books
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-310.pyc
│   │   └── settings.cpython-310.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-310.pyc
│       └── books_spider.py
└── scrapy.cfg
```

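## Investigating the HTML structure and designing the crawler

- Collect the URL of each book on the listing page
- Scrape each book's detail page
- Move to the next listing page
- Collect the URL of each book on the listing page
- Scrape each book's detail page
- ...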
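
The loop above can be sketched in plain Python before touching Scrapy. This is only an illustration: the `SITE` dictionary and its page and book names are hypothetical stand-ins for the real site, and an actual Scrapy crawl schedules requests asynchronously, so the real visiting order may differ.

```python
from typing import Iterator

# Hypothetical in-memory "site": each listing page maps to the books it links
# to and to the next listing page (None on the last page).
SITE = {
    "page-1": {"books": ["book-a", "book-b"], "next": "page-2"},
    "page-2": {"books": ["book-c"], "next": None},
}


def crawl(start: str) -> Iterator[str]:
    """Yield visited pages in the order the design describes."""
    page = start
    while page is not None:
        yield f"list:{page}"
        # Visit every book linked from the current listing page.
        for book in SITE[page]["books"]:
            yield f"detail:{book}"
        # Then move on to the next listing page, if there is one.
        page = SITE[page]["next"]


visited = list(crawl("page-1"))
print(visited)
```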

## Scraping the required fields

```bash
scrapy shell "http://books.toscrape.com"  
2023-06-30 20:25:22 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: sample_books)
2023-06-30 20:25:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.11 (main, Apr 24 2023, 17:34:58) [Clang 14.0.3 (clang-1403.0.22.14.1)], pyOpenSSL 23.2.0 (OpenSSL 3.1.1 30 May 2023), cryptography 41.0.1, Platform macOS-13.4-x86_64-i386-64bit
2023-06-30 20:25:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'sample_books',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'sample_books.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['sample_books.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-06-30 20:25:22 [asyncio] DEBUG: Using selector: KqueueSelector
2023-06-30 20:25:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-06-30 20:25:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-06-30 20:25:22 [scrapy.extensions.telnet] INFO: Telnet Password: 883ed7f4c4773ad9
2023-06-30 20:25:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2023-06-30 20:25:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-06-30 20:25:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-06-30 20:25:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-06-30 20:25:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-06-30 20:25:22 [scrapy.core.engine] INFO: Spider opened
2023-06-30 20:25:23 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2023-06-30 20:25:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10c5e7520>
[s]   item       {}
[s]   request    <GET http://books.toscrape.com>
[s]   response   <200 http://books.toscrape.com>
[s]   settings   <scrapy.settings.Settings object at 0x10c5e7130>
[s]   spider     <BooksSpiderSpider 'books_spider' at 0x10cac2620>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
```

```bash
>>> fetch("http://books.toscrape.com")
2023-06-30 20:26:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com> (referer: None)
```

### Getting the list of book URLs on the page

```bash
>>> response.xpath('//h3/a/@href').getall()
['catalogue/a-light-in-the-attic_1000/index.html', 'catalogue/tipping-the-velvet_999/index.html', 'catalogue/soumission_998/index.html', 'catalogue/sharp-objects_997/index.html', 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html', 'catalogue/the-requiem-red_995/index.html', 'catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html', 'catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html', 'catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html', 'catalogue/the-black-maria_991/index.html', 'catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html', 'catalogue/shakespeares-sonnets_989/index.html', 'catalogue/set-me-free_988/index.html', 'catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html', 'catalogue/rip-it-up-and-start-again_986/index.html', 'catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html', 'catalogue/olio_984/index.html', 'catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html', 'catalogue/libertarianism-for-beginners_982/index.html', 'catalogue/its-only-the-himalayas_981/index.html']
```
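
To see what `//h3/a/@href` is selecting, here is a small sketch using only the standard library. The markup is a hypothetical, well-formed fragment modelled on the listing page, not the real response, and `xml.etree.ElementTree` supports only a subset of XPath, so the `href` attribute is read from each `<a>` element instead of being selected directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the listing-page markup.
SNIPPET = """
<ol>
  <li><article><h3>
    <a href="catalogue/a-light-in-the-attic_1000/index.html">A Light in the Attic</a>
  </h3></article></li>
  <li><article><h3>
    <a href="catalogue/tipping-the-velvet_999/index.html">Tipping the Velvet</a>
  </h3></article></li>
</ol>
"""

root = ET.fromstring(SNIPPET.strip())

# Find every <a> that is a direct child of an <h3>, then read its href.
hrefs = [a.get("href") for a in root.findall(".//h3/a")]
print(hrefs)
```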

### Looping over the URL list and sending a request to each book page

```python
books = response.xpath('//h3/a/@href').getall()
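# Runs inside the spider's parse() method; assumes `from scrapy import Request`.
for book in books:
    # Resolve the relative href against the current page URL.
    abs_url = response.urljoin(book)
    yield Request(abs_url, callback=self.parse_book)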
```

### Getting the URL of the Next page

```bash
>>> response.xpath('//a[text()="next"]/@href').get()
'catalogue/page-2.html'
```

### Resolving the next-page URL to an absolute URL so it can be requested

```python
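next_page_url = response.xpath('//a[text()="next"]/@href').get()
# .get() returns None on the last page. Check the raw href before joining:
# response.urljoin(None) returns the current page URL, never None.
if next_page_url is not None:
    abs_next_page_url = response.urljoin(next_page_url)
    yield Request(abs_next_page_url, callback=self.parse)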
```
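
The resolution step can be checked with the standard library alone. As far as I know, `response.urljoin` delegates to `urllib.parse.urljoin` with the response URL as the base, so the sketch below should mirror its behaviour. The `page-3.html` href is an assumed value for illustration.

```python
from urllib.parse import urljoin

first_page = "http://books.toscrape.com"

# The href found on the first page resolves against the site root.
second_page = urljoin(first_page, "catalogue/page-2.html")
print(second_page)

# From page 2 onward the href is relative to /catalogue/
# ("page-3.html" is an assumed value for illustration).
third_page = urljoin(second_page, "page-3.html")
print(third_page)

# Joining None does not yield None: urljoin returns the base unchanged.
# A "no next page" test is therefore only meaningful on the raw href.
missing = urljoin(second_page, None)
print(missing)
```

On the last page this means an unchecked join simply points back at the current page, which the duplicate filter then discards.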

#### The parse function

```python
def parse(self, response):
    books = response.xpath('//h3/a/@href').getall()
    for book in books:
        abs_url = response.urljoin(book)
        yield Request(abs_url, callback=self.parse_book)

    # If there is a "next" button on this page, move the crawler to it.
    # Check the raw href before joining: response.urljoin(None) returns
    # the current page URL, never None.
    next_page_url = response.xpath('//a[text()="next"]/@href').get()
    if next_page_url is not None:
        abs_next_page_url = response.urljoin(next_page_url)
        yield Request(abs_next_page_url, callback=self.parse)
```

#### spiders/books_spider.py

### Running the crawler

```bash
scrapy crawl books_spider -o result.json

2023-06-30 21:06:12 [scrapy.core.engine] INFO: Closing spider (finished)
2023-06-30 21:06:12 [scrapy.extensions.feedexport] INFO: Stored json feed (1000 items) in: result.json
2023-06-30 21:06:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 354623,
 'downloader/request_count': 1051,
 'downloader/request_method_count/GET': 1051,
 'downloader/response_bytes': 22195383,
 'downloader/response_count': 1051,
 'downloader/response_status_count/200': 1050,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'elapsed_time_seconds': 47.900103,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 6, 30, 12, 6, 12, 253752),
 'item_scraped_count': 1000,
 'log_count/DEBUG': 2055,
 'log_count/INFO': 11,
 'memusage/max': 57229312,
 'memusage/startup': 57229312,
 'request_depth_max': 50,
 'response_received_count': 1051,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1050,
 'scheduler/dequeued/memory': 1050,
 'scheduler/enqueued': 1050,
 'scheduler/enqueued/memory': 1050,
 'start_time': datetime.datetime(2023, 6, 30, 12, 5, 24, 353649)}
2023-06-30 21:06:12 [scrapy.core.engine] INFO: Spider closed (finished)
```
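
The stats can be sanity-checked against the crawl design with a little arithmetic. The numbers below are copied from the log above and from the 20 hrefs returned for the first listing page. The depth reasoning assumes the start page is depth 0 and each followed link adds 1.

```python
from datetime import datetime

books = 1000      # item_scraped_count
per_page = 20     # hrefs returned for one listing page
list_pages = books // per_page

# Every listing page and every detail page goes through the scheduler.
scheduled = books + list_pages

# robots.txt is fetched by the downloader but not by the scheduler.
downloaded = scheduled + 1

# Listing page n sits at depth n - 1, so its books sit at depth n.
max_depth = list_pages

start = datetime(2023, 6, 30, 12, 5, 24, 353649)    # start_time
finish = datetime(2023, 6, 30, 12, 6, 12, 253752)   # finish_time
elapsed = (finish - start).total_seconds()

print(list_pages, scheduled, downloaded, max_depth, elapsed)
```

These match `scheduler/enqueued` (1050), `downloader/request_count` (1051), `request_depth_max` (50) and `elapsed_time_seconds` (47.900103) in the log.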