# `scrapy` basics

See the [Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html) online.

In this lesson we are going to explore Scrapy (`scrapy`), a Python library for specifying and running web scraping tasks. Programs that systematically retrieve remote resources are often called spiders or crawlers. Early examples of crawlers were those used to populate the indexes of search engines like AltaVista or Google.

What `scrapy` provides that `bs4` does not is a principled way to describe a scraping task from beginning to end. `bs4` focuses on manipulating an HTML (or HTML-like) document at hand; `scrapy` combines the retrieval step (which we did manually last time with a library like `requests`) and the extraction step (which we did with `bs4`) into one artifact. Our objective is to replicate, with `scrapy`, the scraping process sketched out in [`bs4` Further Topics](https://eamonnbell-dur.github.io/webscraping-for-humanities/bs4-further-topics.html). This introduction very closely follows the [Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html).

The first thing to do is to initialise a `scrapy` project. You can do this at the command line or in the cell below (`!` in a notebook cell passes the command to the shell, `bash` or similar).

In [None]:
!scrapy startproject basics

As the message suggests, this has created a new directory (`basics`) in the current working directory of the notebook. In order to find out where this folder is, click `File > Open...` just underneath the Jupyter logo.

Now, enter the `basics/basics/spiders` folder and create a new file using the `New` button near the top right of the screen. Pick `Text File`. By clicking on the title, rename the file to `quotes_spider.py` and copy and paste the following script into the file:

### Source for `quotes_spider.py` - version 1

---

```python

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        Path(filename).write_bytes(response.body)
        self.log(f'Saved file {filename}')
```

Setting aside some of the Python details, this script describes two key tasks. The first is `start_requests`, which prepares a `scrapy.Request` for each URI in the list `urls`. The `callback=` argument indicates that the function `parse()` is called once for each of these requests, after the HTTP request to the URI has been sent and the response has been received. The second is `parse`, in which four things happen, in order, for each HTTP response:

1. A bit of string processing puts the current page into a variable `page`.
2. We construct a filename using this variable.
3. We write the body of the HTTP response to a file on disk with this name.
4. We announce this fact to the world, via the `self.log()` function.
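
The string processing in step 1 is easy to try outside Scrapy. The following is a minimal sketch in plain Python; the helper name `page_filename` is hypothetical and only wraps the two lines from `parse()` so they can be run on their own.

```python
def page_filename(url: str) -> str:
    """Build the output filename from a URL that ends in '/page/<n>/'."""
    # Splitting on "/" leaves an empty string after the trailing slash,
    # so the page number is the second-to-last element of the list.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


parts = "https://quotes.toscrape.com/page/1/".split("/")
print(parts)  # ['https:', '', 'quotes.toscrape.com', 'page', '1', '']
print(page_filename("https://quotes.toscrape.com/page/1/"))  # quotes-1.html
```

Note that this only works because the URLs end with a trailing slash; for a URL without one, `[-2]` would pick out a different segment.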

At this point, we could have done this with the Python standard library, or `requests`, or some combination of both of these. The smart thing about Scrapy is the way it passes information from the response to a request into functions for later processing (or, as we will later see, for firing off further requests).
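
To make the request-and-callback idea concrete, here is a toy sketch in plain Python. It is not how Scrapy is implemented: `ToyRequest`, `ToyResponse`, `FAKE_PAGES` and `run` are hypothetical stand-ins, and the "web" is a dictionary of canned pages rather than real HTTP traffic. It only illustrates the pattern of handing each response to the function named in the request.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-ins for illustration only; these are NOT Scrapy classes.
@dataclass
class ToyRequest:
    url: str
    callback: Callable


@dataclass
class ToyResponse:
    url: str
    body: bytes


# A pretend "web": canned pages instead of real HTTP calls.
FAKE_PAGES = {
    "https://example.invalid/page/1/": b"<html>one</html>",
    "https://example.invalid/page/2/": b"<html>two</html>",
}


def parse(response):
    # A callback receives the response and yields whatever it extracted.
    yield {"uri": response.url, "size": len(response.body)}


def start_requests():
    for url in FAKE_PAGES:
        yield ToyRequest(url=url, callback=parse)


def run(requests):
    """Fetch each queued request, pass the response to its callback, collect results."""
    queue = deque(requests)
    items = []
    while queue:
        request = queue.popleft()
        response = ToyResponse(url=request.url, body=FAKE_PAGES[request.url])
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)  # a callback may schedule further requests
            else:
                items.append(result)  # anything else is treated as a scraped item
    return items


items = run(start_requests())
print(items)
```

The point of the design is that the spider author only writes `start_requests` and `parse`; the loop that connects them belongs to the framework.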

Notice that we've given the spider a name: `"quotes"`. Because of the structure of the file and the directory tree that we created when we created the Scrapy project, we have a convenient way of running this spider, and we get nice logging for free. This is unlike when we work with `requests` alone. 

The command is `scrapy crawl [[spider_name]]`. (Note we have to `cd` into the project directory before we kick anything off).

In [None]:
!cd basics; scrapy crawl quotes

If we read through the logs, we can see the message `Saved file quotes-1.html`. We can also double-check that these files have been downloaded. (`head -n 20` shows the first twenty lines of a file.)

In [None]:
!cd basics; head -n 20 quotes-1.html

To better understand the power of Scrapy, we are going to modify `quotes_spider.py` just a little. This isn't something we'd normally do, but it serves to illustrate a point. Instead of writing the response body to a file, let's bundle a snippet of it (the first 10 bytes) with a little metadata: in this case, the URI of the resource and the filename that the resource would have had. Scrapy uses the `yield` keyword to achieve this: whatever the callback yields is collected as a scraped item.
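
If `yield` is unfamiliar, the following plain-Python sketch may help. It assumes nothing about Scrapy; the function name `parse_like` is hypothetical. A function containing `yield` is a generator: it hands back one value at a time, and the caller decides when to ask for the next one.

```python
def parse_like(urls):
    """Yield one small dictionary per URL, much as a spider callback yields items."""
    for url in urls:
        page = url.split("/")[-2]
        yield {"filename": f"quotes-{page}.html", "uri": url}


gen = parse_like([
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
])

first = next(gen)  # runs the function body up to the first yield
rest = list(gen)   # drains whatever is left

print(first)
print(rest)
```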

### Source for `quotes_spider.py` - version 2

---


```python

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        
        yield {
            'filename': filename,
            'uri': response.url,
            'body_snippet': response.body[:10]
        }
```

In [None]:
!cd basics; scrapy crawl quotes -O quotes.csv

In [None]:
!cd basics; head quotes.csv
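
The `-O quotes.csv` option asks Scrapy to write the yielded items to `quotes.csv`, overwriting the file if it already exists. As a rough illustration of how a list of dictionaries maps onto CSV rows, here is a sketch using only the standard library `csv` module. It is not Scrapy's own exporter, the `items` list is made up to resemble what version 2 yields, and the `body_snippet` field is left out for simplicity.

```python
import csv
import io

# Made-up items resembling what the spider yields (body_snippet omitted).
items = [
    {"filename": "quotes-1.html", "uri": "https://quotes.toscrape.com/page/1/"},
    {"filename": "quotes-2.html", "uri": "https://quotes.toscrape.com/page/2/"},
]

# Write to an in-memory buffer so the example does not touch the disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["filename", "uri"])
writer.writeheader()     # dictionary keys become the column headers
writer.writerows(items)  # each dictionary becomes one row

csv_text = buffer.getvalue()
print(csv_text)
```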

### Source for `quotes_spider.py` - version 3

---

```python
from pathlib import Path

import bs4
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        
        yield {
            'uri': response.url,
            'title': soup.title.text,
        }
```

In [None]:
!cd basics; scrapy crawl quotes -O quotes.csv

In [None]:
!cd basics; head quotes.csv

At this point, you probably have enough to [go back to the final result with `bs4` from the last day's workshop](https://eamonnbell-dur.github.io/webscraping-for-humanities/bs4-further-topics.html#) and use what you know to create a new spider, called `discogs_spider.py`, which, given a URI of a Discogs.com list, will **yield** the album titles and the links to the cover images for each album in the list.

### Source for `discogs_spider.py` 

---

```{toggle}
```python
from pathlib import Path

import bs4
import scrapy


class DiscogsSpider(scrapy.Spider):
    name = "discogs"

    def start_requests(self):
        urls = [
            'https://www.discogs.com/lists/277616',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        
        ol_albums = soup.find('ol', id='listitems')
        li_albums = ol_albums.find_all('li')

        
        for li in li_albums:
            album_title = li.find('a').get_text()
            cover_image_link = li.find('img')['src'] 
            yield {
                'uri': response.url,
                'album_title': album_title,
                'cover_image_link': cover_image_link
            }
```
```

In [None]:
!cd basics; scrapy crawl discogs -O discogs.csv

In [None]:
!cd basics; head discogs.csv