# Scrapy documentation

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

---

## Installation

You can install Scrapy and its dependencies from PyPI with:

> pip install Scrapy

For more information, see the [Installation documentation](https://docs.scrapy.org/en/latest/intro/install.html).

----

### Sample spider code


```
# Save this file as quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

To run your Scrapy spider and save the scraped items to `quotes.json`:

> scrapy runspider quotes_spider.py -o quotes.json

## What just happened?

When you ran the command `scrapy runspider quotes_spider.py`, Scrapy looked for a Spider definition inside the file and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the `start_urls` attribute (in this case, only the URL for quotes in the humor category) and called the default callback method `parse`, passing the response object as an argument. In the `parse` callback, we loop through the quote elements using a CSS selector and yield a Python dict with the extracted quote text and author. We then look for a link to the next page and schedule another request with the same `parse` method as its callback.

Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. Scrapy doesn’t need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if one request fails or an error happens while handling it.
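
The effect of asynchronous scheduling can be illustrated with a small standard-library sketch. It is only an analogy: it uses `asyncio` rather than Scrapy's own engine, and `fake_fetch`, the page names and the delays are hypothetical stand-ins for real downloads. Three "requests" that each take 0.3 seconds finish in roughly 0.3 seconds overall, and the one that fails does not stop the other two.

```python
import asyncio
import time


async def fake_fetch(url, delay, fail=False):
    """Hypothetical stand-in for a download: wait `delay` seconds, then return or fail."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"could not fetch {url}")
    return f"body of {url}"


async def crawl():
    requests = [
        fake_fetch("page-1", 0.3),
        fake_fetch("page-2", 0.3, fail=True),
        fake_fetch("page-3", 0.3),
    ]
    # return_exceptions=True collects a failure as a result
    # instead of letting it abort the remaining requests
    return await asyncio.gather(*requests, return_exceptions=True)


start = time.perf_counter()
results = asyncio.run(crawl())
elapsed = time.perf_counter() - start

for result in results:
    status = "failed" if isinstance(result, Exception) else "ok"
    print(f"{status}: {result}")
print(f"elapsed: {elapsed:.2f}s")
```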

---

### Simplest way to dump all my scraped items into a JSON/CSV/XML file?

To dump into a JSON file:

> scrapy crawl myspider -O items.json

To dump into a CSV file:

> scrapy crawl myspider -O items.csv

To dump into an XML file:

> scrapy crawl myspider -O items.xml

For more information, see [Feed exports](https://docs.scrapy.org/en/latest/topics/feed-exports.html).
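
Once a feed file exists, it can be read back with the standard library. The sketch below does not run Scrapy: it writes small stand-ins for `items.json` and `items.csv` into a temporary directory and loads them again. The two items are made up, and the sketch assumes the JSON feed is a single array of objects and the CSV feed starts with a header row of field names.

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up items standing in for what a spider would yield
sample_items = [
    {"author": "Author A", "text": "First sample quote"},
    {"author": "Author B", "text": "Second sample quote"},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "items.json"
    csv_path = Path(tmp) / "items.csv"

    # Stand-ins for the files written by `-O items.json` and `-O items.csv`
    json_path.write_text(json.dumps(sample_items), encoding="utf-8")
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["author", "text"])
        writer.writeheader()
        writer.writerows(sample_items)

    # Read the JSON feed: one list of dicts
    json_items = json.loads(json_path.read_text(encoding="utf-8"))

    # Read the CSV feed: one dict per row, keyed by the header row
    with csv_path.open(newline="", encoding="utf-8") as f:
        csv_items = list(csv.DictReader(f))

print(json_items[0]["author"])
print(len(csv_items))
```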

---

Scrapy project example: [quotesbot](https://github.com/scrapy/quotesbot)

---

### Learn to extract data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell.

Run:

> scrapy shell 'https://quotes.toscrape.com/page/1/'

Using the shell, you can try selecting elements using CSS with the response object:

> `>>> response.css('title')`

> `[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]`

The result of running `response.css('title')` is a list-like object called `SelectorList`, which represents a list of `Selector` objects that wrap around XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do:

> `>>> response.css('title::text').getall()`

> ['Quotes to Scrape']

There are two things to note here: one is that we’ve added `::text` to the CSS query, to mean we want to select only the text elements directly inside the `<title>` element.

The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

> `>>> response.css('title::text').get()`

> 'Quotes to Scrape'

As an alternative, you could’ve written:

> `>>> response.css('title::text')[0].get()`

> 'Quotes to Scrape'

---


## Run Scrapy from a script

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via `scrapy crawl`.

Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is `scrapy.crawler.CrawlerProcess`.

This class starts a Twisted reactor for you, configures logging and sets shutdown handlers. It is the class used by all Scrapy commands.

Another utility, `scrapy.crawler.CrawlerRunner`, gives you more control over the crawling process but does not start or stop the reactor. When you use it, you have to run the reactor explicitly after scheduling your spiders and shut it down yourself once they have finished. This can be achieved by adding callbacks to the deferred returned by the `CrawlerRunner.crawl` method.

Here’s an example of `CrawlerRunner` usage, along with a callback to manually stop the reactor after `MySpider` has finished running.

```
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
```