# CrawlSpider - Following Links Automatically

🕸 ```scrapy.Spider``` is the simplest spider: it visits only the URLs defined in ```start_urls``` or returned by ```start_requests()```, and follows further links only if your callback yields new requests.
For example, ```class CakesToCsv(scrapy.Spider)```

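🕸 Use ```CrawlSpider``` when you need crawling behavior: extracting links from pages and following them.
For example, ```class MyCrawlSpider(CrawlSpider)```, with ```CrawlSpider``` imported from ```scrapy.spiders``` (it is not available as ```scrapy.CrawlSpider```).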

A basic Scrapy spider looks like this:

In [None]:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Your parsing logic here
        pass



```CrawlSpider``` extends this basic spider: it lets you define rules for following links and for applying a callback to the extracted pages.

In [None]:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(allow=('/some-path/',)), callback='parse_page'),
    )

    def parse_page(self, response):
        # Your parsing logic for individual pages
        pass

## Features of a crawler spider

**name**: A unique identifier for the spider.

**allowed_domains**: A list of domains that the spider is allowed to crawl. Requests to domains outside this list will be ignored.

**start_urls**: A list of initial URLs to start the crawl.

**rules**: A tuple of ```Rule``` instances. Each ```Rule``` defines how links are extracted and followed.
In this example, a single rule follows links whose URL matches the regular expression ```/some-path/``` and passes each resulting response to the ```parse_page``` callback. Do not name a rule callback ```parse```, because ```CrawlSpider``` uses that method internally.
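To make the filtering concrete, here is a rough standard-library sketch of the two checks described above. It is not Scrapy's own code: ```is_onsite``` and ```matches_allow``` are hypothetical helpers, and the sample URLs are made up. It assumes that ```allow``` patterns are applied as regular-expression searches against the absolute URL and that ```allowed_domains``` also admits subdomains.

```python
import re
from urllib.parse import urlparse

# Values taken from the example spider above.
ALLOWED_DOMAINS = ["example.com"]
ALLOW_PATTERNS = [re.compile(r"/some-path/")]


def is_onsite(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def matches_allow(url, patterns=ALLOW_PATTERNS):
    """Hypothetical helper: True if any pattern is found anywhere in the URL."""
    return any(p.search(url) for p in patterns)


# Made-up candidate links, as if extracted from a page.
candidate_urls = [
    "http://example.com/some-path/page-1",
    "http://blog.example.com/some-path/page-2",
    "http://example.com/other/page-3",
    "http://elsewhere.org/some-path/page-4",
]

followed = [u for u in candidate_urls if is_onsite(u) and matches_allow(u)]
print(followed)
```

Because the pattern is searched rather than anchored, a loose expression can match more URLs than intended, so it is worth keeping ```allow``` patterns as specific as the site's URL scheme permits.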


In [None]:
def parse_page(self, response):
    # Extract data from the page using XPath or CSS selectors
    title = response.css('h1::text').get()
    content = response.css('div.content::text').get()

    yield {
        'title': title,
        'content': content,
    }

### Settings

Spiders (see the Spiders chapter for reference) can define their own settings, which take precedence over the project-wide ones. One way to do so is to set the ```custom_settings``` class attribute:


In [None]:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        "SOME_SETTING": "some value",
    }
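The precedence rule can be pictured with a small model. This is not Scrapy's settings machinery, only a sketch: ```resolve``` is a hypothetical helper, and the numeric priorities are assumed to mirror the named levels Scrapy documents (defaults below project, project below spider, spider below command line), which is worth checking against the version you run.

```python
# Assumed priority levels; a higher number wins.
PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}


def resolve(layers):
    """Hypothetical helper: merge (priority_name, settings_dict) pairs.

    For each key, the value from the highest-priority layer is kept,
    regardless of the order in which the layers are supplied.
    """
    resolved = {}
    for priority_name, values in layers:
        rank = PRIORITIES[priority_name]
        for key, value in values.items():
            if key not in resolved or rank >= resolved[key][0]:
                resolved[key] = (rank, value)
    return {key: value for key, (rank, value) in resolved.items()}


effective = resolve([
    ("default", {"CONCURRENT_REQUESTS": 16, "SOME_SETTING": "default value"}),
    ("project", {"SOME_SETTING": "project value"}),
    ("spider", {"SOME_SETTING": "some value"}),  # plays the role of custom_settings
])
print(effective)
```

In this model the spider-level value replaces the project-level one, while settings the spider does not mention keep their lower-priority value.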


### Items

Reference: https://doc.scrapy.org/en/latest/topics/items.html

* The main goal of scraping is to extract structured data from unstructured sources, typically web pages.
Spiders may return the extracted data as items: Python objects that define key-value pairs.

* Inside your spider, instead of yielding a dictionary, you create a new ```Item``` populated with the scraped data and yield it.

Example:



In [None]:
import scrapy
from scrapy.item import Item, Field


class CustomItem(Item):
    one_field = Field()
    another_field = Field()


class QuotesSpider(scrapy.Spider):
    ...  # name, start_urls, etc. (elided)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item per quote so yielded items do not share state
            quote_item = CustomItem()
            quote_item['one_field'] = quote.css('span.text::text').get()
            quote_item['another_field'] = quote.css('small.author::text').get()
            yield quote_item

## Crawler

In [1]:
# This CrawlSpider extracts item links from the first page only (no pagination)

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    custom_settings = {
        'FEEDS': {
            'Y_c.csv': { 
                'format': 'csv',  
                'overwrite': True
            }
        }
    }
    
    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        return item


process = CrawlerProcess()  # define the crawler
process.crawl(HackerNewsSpider)  # attach the spider to the crawler
process.start()

2025-01-10 10:39:56 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-10 10:39:56 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-10 10:39:56 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-10 10:39:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-10 10:39:56 [scrapy.extensions.telnet] INFO: Telnet Password: 90257d905bd75049
2025-01-10 10:39:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-10 10:39:56 [scrapy.crawler] INFO: Overridden 

In [1]:
# This plain scrapy.Spider reads all pages by following the "More" pagination link

import scrapy
from scrapy.crawler import CrawlerProcess


class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()


class HackerNewsSpider(scrapy.Spider):
    name = 'hackernews'

    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    custom_settings = {
        'FEEDS': {
            'titles_only.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }

    def parse(self, response):
        # Extract titles from the current page
        for post in response.xpath('//span[@class="titleline"]'):
            item = HackerNewsItem()
            # Extract the title text
            item['title'] = post.xpath('./a/text()').get()
            yield item

        # Follow the pagination link ("More" button)
        next_page = response.xpath('//a[@class="morelink"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


# Execute the crawler
process = CrawlerProcess()
process.crawl(HackerNewsSpider)
process.start()


2025-01-10 11:29:15 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-10 11:29:15 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-10 11:29:15 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-10 11:29:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-10 11:29:15 [scrapy.extensions.telnet] INFO: Telnet Password: e83df2659d050f6d
2025-01-10 11:29:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-10 11:29:15 [scrapy.crawler] INFO: Overridden 
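In the pagination step of the spider above, ```response.follow``` accepts a relative URL and resolves it against the URL of the current response. That resolution can be sketched with the standard library alone. The ```?p=2``` and ```?p=3``` values below are assumed examples of what the "More" link's ```href``` might contain, not captured output.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed.
current_page = "https://news.ycombinator.com/"

# Assumed relative href taken from the "More" link.
next_href = "?p=2"

# Resolve the relative link against the current page.
next_url = urljoin(current_page, next_href)
print(next_url)

# A query-only link keeps the path of the page it was found on.
later_url = urljoin("https://news.ycombinator.com/news?p=2", "?p=3")
print(later_url)
```

With plain ```scrapy.Request``` you would have to build the absolute URL yourself, for example with ```response.urljoin(next_page)```.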

In [1]:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class MySpider(CrawlSpider):
    name = 'my_spider'
    # Note: the setting name is CONCURRENT_REQUESTS (plural)
    custom_settings = {"CLOSESPIDER_PAGECOUNT": 5, "CONCURRENT_REQUESTS": 1}

    allowed_domains = ['geeksforgeeks.org']
    start_urls = ['https://www.geeksforgeeks.org/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info('Visited: %s', response.url)
        # self.logger.info('Title: %s', response.xpath("//h1/text()").get())
        # Extract data here

# Execute the crawler
process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2025-01-10 12:09:24 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-10 12:09:24 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.4 (main, May 25 2024, 00:47:07) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform macOS-15.2-arm64-arm-64bit
2025-01-10 12:09:24 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-10 12:09:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-01-10 12:09:24 [scrapy.extensions.telnet] INFO: Telnet Password: 4b093711bc31ab99
2025-01-10 12:09:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2025-01-10 12:09:24 [scrapy.crawler] INFO: Overridden 

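Q. Why doesn't the spider stop at exactly ```CLOSESPIDER_PAGECOUNT``` pages?

Reaching the limit only triggers a graceful shutdown: requests that are already in the downloader queue (up to ```CONCURRENT_REQUESTS``` of them) are still processed, so the final page count can overshoot the limit. See:

https://stackoverflow.com/questions/34528524/scrapy-closespider-pagecount-setting-dont-work-as-should/34535390#34535390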
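A toy model can illustrate why the page count may exceed ```CLOSESPIDER_PAGECOUNT```. This is not how Scrapy is implemented: ```simulate_crawl``` is a hypothetical function that assumes the downloader keeps up to ```concurrency``` requests in flight, stops accepting new ones once the limit is hit, and still finishes whatever is already in flight.

```python
from collections import deque


def simulate_crawl(page_limit, concurrency, total_pages=1000):
    """Toy model: count pages processed before the crawl fully stops."""
    pending = deque(range(total_pages))  # requests waiting to be scheduled
    in_flight = deque()                  # requests already handed to the downloader
    crawled = 0
    closing = False

    while pending or in_flight:
        # Top up the downloader slots, unless shutdown has been requested.
        while not closing and pending and len(in_flight) < concurrency:
            in_flight.append(pending.popleft())
        if not in_flight:
            break
        # One in-flight request completes and is counted as a crawled page.
        in_flight.popleft()
        crawled += 1
        if crawled >= page_limit:
            closing = True  # graceful shutdown: no new requests, finish the rest

    return crawled


print(simulate_crawl(page_limit=5, concurrency=1))
print(simulate_crawl(page_limit=5, concurrency=16))
```

In this model a concurrency of 1 stops at the limit, while a higher concurrency overshoots by the number of requests that were in flight when the limit was reached. A real crawl will not match these numbers exactly, but the direction of the effect is the same.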


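Set this if you want up to 100 requests in flight at the same time:
```"CONCURRENT_REQUESTS": 100```

Scrapy does not start 100 parallel processes or threads for this. It runs in a single process and issues requests asynchronously (via Twisted), so the practical limits are bandwidth, memory, and the load the target site tolerates, rather than the CPU thread count.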
