# Web scraping - Scrapy

## Define the crawler.

Scrapy is an elegant solution that combines the three steps common to most scraping tasks:
1. **Querying** and crawling the web pages,
2. **Extracting** the data based on XPath or CSS selectors, and
3. **Exporting** the data to a JSON or CSV file.

This combination makes Scrapy a good choice for scraping websites that serve
their content as plain HTML rather than loading it through AJAX.

A Scrapy crawler is defined as a spider class, which is then handed to a crawler
process and run. Scrapy is mostly intended to be run from the CLI, but it can
also run inside IPython, as we do here.

Because Scrapy issues its requests asynchronously, it is noticeably faster than
the other solutions.

**NOTE:** Scrapy is built on the **Twisted** Python framework, which provides
event-driven async behavior. However, the Twisted reactor cannot be restarted
once it has stopped, so in our context the crawl can only run once per session.
If you want to re-run it, you will have to restart the IPython kernel.
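
One common workaround, sketched here rather than used in this notebook, is to run each crawl in a child process so that every run gets its own reactor. The helper names `crawl_once` and `run_in_child` are hypothetical, and the sketch assumes a platform where the `fork` start method is available (Linux or macOS). It only helps if the notebook process itself never starts the reactor, because a forked child inherits the parent's reactor state.

```python
import multiprocessing


def crawl_once(spider_cls, settings=None):
    """Run one crawl to completion; meant to be called inside a child process."""
    # Imported lazily so Scrapy is only needed when a crawl actually runs.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings or {})
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl has finished


def run_in_child(target, *args):
    """Run target(*args) in a forked child process and return its exit code."""
    # Assumption: the 'fork' start method exists (not the case on Windows).
    ctx = multiprocessing.get_context('fork')
    child = ctx.Process(target=target, args=args)
    child.start()
    child.join()
    return child.exitcode
```

With a spider class such as the `BikeSpider` defined below, a call like `run_in_child(crawl_once, BikeSpider)` could then be repeated without restarting the kernel.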

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import datetime

class BikeSpider(scrapy.Spider):
    name = 'bikespider'
    start_urls = [
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2',
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2&page=2',
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2&page=3',
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2&page=4',
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2&page=5',
    ]
    custom_settings = {
        # FEEDS replaces the FEED_FORMAT / FEED_URI pair, deprecated since Scrapy 2.1.
        'FEEDS': {'bikes-scrapy.csv': {'format': 'csv'}},
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        # Each matched link wraps one ad in the listing.
        for ad in response.css('.awrapper .listItemContainer .listItemLink'):
            image = ad.css('img.cover::attr(src)').get()
            yield {
                'title': ad.css('span.title *::text').get(),
                'price': ad.css('span.price *::text').get(),
                # Ads without a cover image yield None instead of raising a TypeError.
                'image': 'https://' + image if image else None,
            }
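
The five hard-coded entries in `start_urls` follow a simple pattern, so they could also be generated. The sketch below uses only the standard library; `build_page_urls` is a hypothetical helper, and it assumes the site keeps using the `condition` and `page` query parameters seen in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://bazar.bg/obiavi/gradski-velosipedi/varna'


def build_page_urls(base_url, pages, condition=2):
    """Return the listing URLs for pages 1..pages, mirroring the list above."""
    urls = []
    for page in range(1, pages + 1):
        params = {'condition': condition}
        # The first page carries no explicit page parameter.
        if page > 1:
            params['page'] = page
        urls.append(f'{base_url}?{urlencode(params)}')
    return urls


start_urls = build_page_urls(BASE_URL, pages=5)
```

Assigning the result to `start_urls` in the spider class would make the number of pages a single parameter instead of five literal strings.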

## Run the crawler.

This is where we initialize and run the crawler. Because the feed path is
relative, the *bikes-scrapy.csv* file is saved in the current working
directory, which is normally the script folder.

In [2]:
# Record the start time so we can report how long the crawl takes.
begin_time = datetime.datetime.now()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(BikeSpider)
process.start()  # blocks until the crawl has finished

diff = datetime.datetime.now() - begin_time
diff

2021-08-09 18:23:03 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-08-09 18:23:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Jun  2 2021, 10:49:15) - [GCC 9.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.11.0-25-generic-x86_64-with-glibc2.29
2021-08-09 18:23:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-08-09 18:23:03 [scrapy.crawler] INFO: Overridden settings:
{'LOG_ENABLED': False,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


datetime.timedelta(seconds=1, microseconds=467534)
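
Once the crawl has finished, the exported file can be checked with the standard `csv` module. This is a minimal sketch: `load_rows` is a hypothetical helper, and it assumes the export is UTF-8 encoded with a header row matching the keys yielded by `parse` (`title`, `price`, `image`).

```python
import csv


def load_rows(path):
    """Read an exported CSV feed into a list of dicts keyed by column name."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, newline='', encoding='utf-8') as fh:
        return list(csv.DictReader(fh))
```

Calling `load_rows('bikes-scrapy.csv')` from the same folder should return one dict per scraped ad, which makes it easy to confirm how many items were collected.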