# Scrapy


![Scrapy architecture](https://datonauts.com/wp-content/uploads/2016/10/scrapy_architecture.png)

-------


## How does Scrapy compare to BeautifulSoup or lxml?

BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them.

Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be imported and used from any Python code.

In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

## Can I use Scrapy with BeautifulSoup?

Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response’s body into a BeautifulSoup object and extract whatever data you need from it.

Here’s an example spider using the BeautifulSoup API, with lxml as the HTML parser:

```python
from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

```

## Starting with Scrapy

`!scrapy startproject tutorial`

`!tree tutorial/`

## Spider!!


Save the following code as *quotes_spider.py* in *tutorial/spiders/*:



```python
# start: quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

# End: quotes_spider.py
```

**Which objects and functions are defined, and what is the purpose of each?**

* name
* start_requests()
* parse()
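The way `parse()` derives the output filename from the URL can be checked in isolation. The sketch below is only an illustration: `page_filename` is a hypothetical helper name, not part of Scrapy or of the spider, and it copies the two lines of string handling from `parse()`.

```python
def page_filename(url):
    """Build the output filename the same way parse() does (hypothetical helper)."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(page_filename('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(page_filename('http://quotes.toscrape.com/page/2/'))  # quotes-2.html

# Without the trailing slash the index shifts by one element,
# so this prints quotes-page.html instead of quotes-2.html.
print(page_filename('http://quotes.toscrape.com/page/2'))
```

The last call shows that the `[-2]` index relies on the URLs ending in a slash, as the ones in `start_requests()` do.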

## Running our spider
```bash

$ cd tutorial
$ scrapy crawl quotes

...
2018-04-06 17:41:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2018-04-06 17:41:34 [quotes] DEBUG: Saved file quotes-2.html
2018-04-06 17:41:34 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-06 17:41:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5976,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 6, 22, 41, 34, 688923),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'memusage/max': 54030336,
 'memusage/startup': 54030336,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 4, 6, 22, 41, 33, 856990)}
2018-04-06 17:41:34 [scrapy.core.engine] INFO: Spider closed (finished)


```
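The stats dump is a plain Python dictionary, so its values can be combined with ordinary arithmetic. The sketch below copies a subset of the values from the output above and derives two numbers from them. The interpretation of the extra request is an assumption: the log lines that would confirm it are elided, but a single 404 that never went through the scheduler is consistent with Scrapy fetching `robots.txt` before the crawl.

```python
from datetime import datetime

# Values copied from the stats dump above; only a subset is reproduced.
stats = {
    'downloader/request_count': 3,
    'downloader/response_status_count/200': 2,
    'downloader/response_status_count/404': 1,
    'scheduler/enqueued': 2,
    'start_time': datetime(2018, 4, 6, 22, 41, 33, 856990),
    'finish_time': datetime(2018, 4, 6, 22, 41, 34, 688923),
}

# Wall-clock duration of the crawl, in seconds.
elapsed = (stats['finish_time'] - stats['start_time']).total_seconds()

# Requests sent by the downloader that were never enqueued in the scheduler.
# Assumption: this is the robots.txt check, which would explain the 404.
unscheduled = stats['downloader/request_count'] - stats['scheduler/enqueued']

print('crawl took %.3f s' % elapsed)
print('requests that bypassed the scheduler: %d' % unscheduled)
```

Note that `start_time` and `finish_time` read 22:41 while the log lines read 17:41. This is most likely because the stats are recorded in UTC and the log uses local time.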

`!ls tutorial/`

`!cat tutorial/quotes-1.html`

## Extracting data

```$ scrapy shell 'http://quotes.toscrape.com/page/1/'```



Examples:


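```python
# All matching <title> elements, as a SelectorList
response.css('title')

# Text of the first <title> element
response.css('title::text')[0].extract()

# Every quote block on the page
response.css("div.quote")

# Work on the first quote block only
quote = response.css("div.quote")[0]

# extract_first() returns the first match, or None if nothing matched
text = quote.css("span.text::text").extract_first()
author = quote.css("small.author::text").extract_first()

# extract() returns a list with every match
tags = quote.css("div.tags a.tag::text").extract()
```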

## Extracting data in our spider


```python

# start: quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
# end: quotes_spider.py
```


## Storing the scraped data


```scrapy crawl quotes -o quotes.json```
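The resulting file is a JSON array with one object per yielded item, so it can be read back with the standard library. The sketch below uses an inline, made-up sample with the same keys the spider yields (`text`, `author`, `tags`) so that it runs without a crawl. With a real crawl you would load `quotes.json` from disk instead.

```python
import json
from collections import Counter

# Hypothetical sample shaped like the items the spider yields.
# The real data would come from `scrapy crawl quotes -o quotes.json`.
SAMPLE_JSON = """
[
  {"text": "First sample quote.", "author": "Jane Doe", "tags": ["sample", "demo"]},
  {"text": "Second sample quote.", "author": "John Roe", "tags": ["demo"]},
  {"text": "Third sample quote.", "author": "Jane Doe", "tags": []}
]
"""

# For a real file: with open('quotes.json') as f: quotes = json.load(f)
quotes = json.loads(SAMPLE_JSON)

# Count quotes per author and occurrences of each tag.
quotes_per_author = Counter(item['author'] for item in quotes)
tag_counts = Counter(tag for item in quotes for tag in item['tags'])

print(quotes_per_author.most_common())
print(tag_counts.most_common(1))
```

One caveat: `-o` appends to an existing file, so running the command twice against the same `quotes.json` leaves a file that is no longer valid JSON. Deleting the file between runs, or exporting to JSON Lines (`-o quotes.jl`), avoids this.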

