# Web scraping performance - Requests + BeautifulSoup VS Scrapy

## Our task here.

Subjective observations, followed by some early measurements, indicate that
Scrapy is around 4-5 times faster than a Requests + BeautifulSoup solution
(R+BS). Because Scrapy is built on top of an asynchronous framework (Twisted),
it processes multiple requests concurrently, which greatly increases throughput.
See https://docs.scrapy.org/en/latest/intro/overview.html#what-just-happened
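
To illustrate why overlapping requests raises throughput, here is a minimal simulation. It is only a sketch: `asyncio` stands in for Scrapy's Twisted reactor, `fake_fetch` is a hypothetical placeholder that sleeps instead of touching the network, and `LATENCY`, `N_REQUESTS`, and `CONCURRENCY` are assumed values chosen for the demonstration.

```python
import asyncio
import time

LATENCY = 0.01     # assumed per-request delay in seconds (simulated)
N_REQUESTS = 40    # assumed number of simulated requests
CONCURRENCY = 8    # assumed cap on in-flight requests


async def fake_fetch(i):
    """Hypothetical stand-in for an HTTP request: waits, then returns its index."""
    await asyncio.sleep(LATENCY)
    return i


async def run_sequential(n):
    # One request at a time: total time is roughly n * LATENCY.
    return [await fake_fetch(i) for i in range(n)]


async def run_concurrent(n, limit):
    # Up to `limit` requests wait at once: roughly (n / limit) * LATENCY.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(i):
        async with semaphore:
            return await fake_fetch(i)

    return await asyncio.gather(*(bounded(i) for i in range(n)))


def timed(coro_factory):
    # Run a coroutine to completion and return (result, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coro_factory())
    return result, time.perf_counter() - start


seq_result, seq_time = timed(lambda: run_sequential(N_REQUESTS))
con_result, con_time = timed(lambda: run_concurrent(N_REQUESTS, CONCURRENCY))
print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
```

This is meant to be run as a plain script; `asyncio.run` raises an error when called from an already running event loop, as is the case inside a Jupyter cell.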

Here we measure how long Scrapy takes to process 100 requests, and how this
compares with R+BS in the same test.

## Subject 1: Scrapy.

Scrapy's default setting is CONCURRENT_REQUESTS_PER_DOMAIN = 8.
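
As a rough sketch of where this setting lives, the dictionary below spells out the defaults in the shape of a spider's `custom_settings`. The test in this notebook does not override them; the `effective_limit` name is introduced here purely for illustration, and the reasoning assumes all requests go to a single domain with no download delay configured.

```python
# Sketch of a per-spider settings dict; the concurrency values are Scrapy's defaults.
custom_settings = {
    'LOG_ENABLED': False,                  # as in the spider used in this test
    'CONCURRENT_REQUESTS': 16,             # global cap (Scrapy default)
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,   # per-domain cap (Scrapy default)
}

# Every request in this test targets one domain, so the smaller of the two
# caps is the one that is expected to bind.
effective_limit = min(
    custom_settings['CONCURRENT_REQUESTS'],
    custom_settings['CONCURRENT_REQUESTS_PER_DOMAIN'],
)
print(effective_limit)
```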

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import datetime

# 100 requests to the same listing page.
class BikeSpider(scrapy.Spider):
    name = 'bikespider'
    start_urls = [
        'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2',
    ] * 100
    custom_settings = {
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        for ad in response.css('.awrapper .listItemContainer .listItemLink'):
            pass


# A single process will be reused.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Run all 100 requests and time the whole crawl.
begin_time = datetime.datetime.now()
process.crawl(BikeSpider)
process.start()
diff = datetime.datetime.now() - begin_time
print(f'Scrapy 100 items in {diff.seconds}s')

2021-08-11 09:24:15 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-08-11 09:24:15 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.10 (default, Jun  2 2021, 10:49:15) - [GCC 9.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.11.0-25-generic-x86_64-with-glibc2.29
2021-08-11 09:24:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-08-11 09:24:15 [scrapy.crawler] INFO: Overridden settings:
{'LOG_ENABLED': False,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


Scrapy 100 items in 7s


## Subject 2: R+BS.

R+BS has no concurrency: it processes one request at a time and waits for each
response before sending the next. 100 requests will take close to 2 minutes.

In [2]:
import datetime  # repeated here so the cell also runs on its own

import requests
from bs4 import BeautifulSoup

begin_time = datetime.datetime.now()

base_url = 'https://bazar.bg/obiavi/gradski-velosipedi/varna?condition=2'
count_pages = 100
data = []
for i in range(count_pages):
    cur_page = i + 1
    print(f'\rPage {cur_page} of {count_pages}', end='\r')
    response = requests.get(base_url)
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    css_selector = '.awrapper .listItemContainer .listItemLink'
    for item in soup.select(css_selector):
        pass

diff = datetime.datetime.now() - begin_time
print(f'R+BS 100 items in {diff.seconds}s          ')


R+BS 100 items in 103s
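
For reference, the ratio between the two timings printed above can be worked out as follows. This is plain arithmetic on the single run recorded in this notebook (7s and 103s, both truncated to whole seconds), so it should be read as a rough indication rather than a benchmark result; the variable names are chosen here for illustration.

```python
# Timings copied from the two cell outputs above (whole seconds, single run).
scrapy_seconds = 7
rbs_seconds = 103
requests_count = 100

speedup = rbs_seconds / scrapy_seconds
print(f'Scrapy: {scrapy_seconds / requests_count:.2f}s per request')
print(f'R+BS:   {rbs_seconds / requests_count:.2f}s per request')
print(f'Ratio:  {speedup:.1f}x')
```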
