<img src="https://scrapinghub.files.wordpress.com/2016/01/scrapylogo.png" />

# Introduction to Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Scrapy runs from the command line.

[Scrapy Docs](https://doc.scrapy.org/en/latest/index.html)

## Installing Scrapy

Run the following in the command line to install the Scrapy package:
```
$ pip install scrapy
```
<br>

**Linux (Ubuntu 12.04 or above)**
```
$ sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

$ sudo apt-get install python3 python3-dev  #if running python3 only

$ pip install scrapy
```
<br>

**Mac OS X** (if `pip install` fails)  
See the [installation guide](https://doc.scrapy.org/en/latest/intro/install.html).

## Create a project

In the command line, navigate to the directory where you want to save your project and run:
```
$ scrapy startproject tutorial
```
```tutorial``` is the name of this project. This will create a ```tutorial``` directory with all the files and folders Scrapy needs to run. 



## Spiders

A spider is a class that you define and that Scrapy uses to scrape information from a website (or a group of websites).

In the ```tutorial``` directory, you will find a directory called ```spiders```. This is where we will save our Scrapy script.

The 'bones' of a Scrapy spider should look like the following:

In [2]:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3,
        "HTTPCACHE_ENABLED": True
    }

    start_urls = ['http://www.example.com']

    def parse(self, response):
        # script goes here!
        pass

The spider 'ExampleSpider' subclasses scrapy.Spider and defines some attributes and methods:

- name: identifies the Spider. It must be unique within a project.

- custom_settings: Here you can set custom settings for your spider.

- start_urls: The spider will begin to crawl from the URLs in this list.

- parse(): a method that will be called to handle the response downloaded for each of the start_urls.

<div class="alert alert-block alert-info">
**TIP: BE NICE TO WEBSITES**:  
Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and it describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard. Hitting a website too hard might harm it, slow it down for others, and might even get your IP banned!
</div>
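
To make the robots.txt rules concrete, here is a minimal sketch that checks URLs against them using Python's standard-library `urllib.robotparser`. The robots.txt content and the URLs are invented for illustration so the example runs offline; a real site's file will differ. Scrapy also has a `ROBOTSTXT_OBEY` setting that asks the crawler to respect these rules for you.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from a real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether any crawler ("*") may fetch each URL.
allowed = parser.can_fetch("*", "http://www.example.com/festivals/")
blocked = parser.can_fetch("*", "http://www.example.com/private/data")

# The crawl delay (in seconds) requested by the site, if any.
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```

A delay like this is what the `DOWNLOAD_DELAY` entry in `custom_settings` is meant to honor.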

## Selectors: XPath vs. CSS
Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. Scrapy supports both CSS and XPath selectors. With XPath, you can extract data based on text elements’ contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may be your best option and save lots of time. 

For additional help with XPath, check out [this tutorial](https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/).

## Scrapy Shell
The best way to test out your code! The scrapy shell is an interactive environment to test selectors on a specific website and see what they extract. This will help you debug your spiders before you run them. In the command line run:

```
$ scrapy shell 'https://www.example.com'
```

## Write Your First Spider

#### Music Festival Scraper
We are going to scrape information on music festivals listed on Music Festival Wizard. Specifically, we want to scrape information on all music festivals in the [United States](https://www.musicfestivalwizard.com/festival-guide/us-festivals/) and [Canada](https://www.musicfestivalwizard.com/festival-guide/canada-festivals/). Take a look at both the US and Canada festival lists and notice that they have a similar structure. 

We want our spider to crawl through both lists and follow the links to each individual festival listing where we can then scrape out the information we want.  

Let's save the following script in a file called ```festival.py``` in the ```tutorial/tutorial/spiders``` directory:

In [None]:
import scrapy


class FestivalSpider(scrapy.Spider):
    name = 'music_festivals'

    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3,
        "HTTPCACHE_ENABLED": True
    }

    start_urls = [
        'https://www.musicfestivalwizard.com/festival-guide/us-festivals/',
        'https://www.musicfestivalwizard.com/festival-guide/canada-festivals/'
    ]

    def parse(self, response):
        # Extract the links to the individual festival pages


        # Follow pagination links and repeat
        pass

Let's use the Scrapy Shell to write and test the Spider. In the command line run:
```
$ scrapy shell 'https://www.musicfestivalwizard.com/festival-guide/us-festivals/'
```

#### Grabbing Links  
The first thing we want our Spider to do is grab the links for the individual festivals so we can go to each of those pages and extract some data. 

If we inspect the first link, we see that the address for the webpage we want is embedded in an < a > tag. Inspecting the next link shows similar tagging. Notice that the < a > tags are inside < span class = 'festivaltitle' >.

In the Scrapy Shell try:

```
In [1]: response.xpath('//span[@class="festivaltitle"]/a/@href').extract()
```
<br>
*Breakdown*
- response: an object representing the webpage content that is downloaded by Scrapy
- xpath: the selector we are using to select elements from the page. We include the path to the elements we want in the parentheses. 
- extract(): a method to extract elements. There is also an 'extract_first()' method.

<br>
Notice that the above returns all the links to the individual festival pages on the current page. We can now write a for loop to follow each link and scrape that page. 

In [None]:
    def parse(self, response):
        # Extract the links to the individual festival pages
        for href in response.xpath(
                '//span[@class="festivaltitle"]/a/@href'
        ).extract():
            # For each festival link, call 'parse_festival' (defined later)
            yield scrapy.Request(
                url=href,
                callback=self.parse_festival,
                meta={'url':href}
            )

The parse method looks for the individual festival links and then yields a new request for each of those links, with parse_festival (which we will define later) as the callback. The meta parameter is a dictionary attached to the request; Scrapy makes it available on the response that is passed to parse_festival. 

This only gets the links on page 1 of this list. What if we want to copy the links for all the pages?

#### Next page
After we've looped through all the links on the current page, we want to be able to go to the next page and do the same. Let's inspect the arrow to the next page and get the link. Using the Scrapy Shell, try to figure out the XPath to this link.

In [None]:
    def parse(self, response):
        # Extract the links to the individual festival pages
        for href in response.xpath(
                '//span[@class="festivaltitle"]/a/@href'
        ).extract():
            # For each festival link, call 'parse_festival' (which we will define)
            yield scrapy.Request(
                url=href,
                callback=self.parse_festival,
                meta={'url':href}
            )

        # Follow pagination links and repeat
        next_url = response.xpath(
            '//div[@class="pagination"]/ul/li'
            '/a[@class="next page-numbers"]/@href'
        ).extract_first()

        # The last page has no "next" link, so only follow it when present
        if next_url is not None:
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

Now that we have the link to the next page, all we need to do is call parse on that link and the cycle will repeat.

#### Now let's extract some data for each festival!

We are going to define parse_festival to scrape the data we want from each of the festival listings. We want to scrape the following:
- Festival Name
- Location
- Dates
- Ticket price
- Festival website
- Logo url
- Lineup

<div class="alert alert-block alert-info">
**TIP: XPATH IN BROWSERS**:  
When inspecting the HTML of a website, right-click in the HTML code on the item you want and copy its XPath.
</div>

#### Festival Name
The XPath for the festival name looks like this:

//*[@id="post-25879"]/div/header/h1/span

We want to extract the text between the span tags. In the Scrapy shell, let's try the following:

```
In [1]: response.xpath('//h1/span/text()').extract()
```
The output we get is:
```
Out [1]: [u'Panorama Music Festival 2017']
```

In [None]:
# Festival Name
name = response.xpath('//h1/span/text()').extract()[0]

#### Location
Next, let's look at location. The XPath is:
//*[@id="festival-basics"]/text()[1]

Let's look in the Scrapy shell and see what the following returns:
```
In [2]: response.xpath('//div[@id="festival-basics"]/text()').extract()

Out[2]:
[u'\r\n',
 u'\r\n',
 u'\r\n',
 u'New York City, NY',
 u'\r\n',
 u'July 28-July 30, 2017',
 u'\r\n',
 u' ',
 u'\r\n',
 u' No',
 u'\r\n',
 u'\r\n\r\n',
 u'\r\n\r\n\r\n\r\n\r\n',
 u'\r\n',
 u'\r\n\r\n',
 u'\r\n\r\n',
 u'\r\n\r\n\r\n',
 u'\r\n\r\n']
```

We get a list with the location at index 3 and dates at index 5.

In [None]:
location = (
    response.xpath('//div[@id="festival-basics"]/text()').extract()[3])

dates = (
    response.xpath('//div[@id="festival-basics"]/text()').extract()[5])
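
Fixed indexes such as `[3]` and `[5]` depend on the exact whitespace in the page. As an alternative sketch, not the approach this tutorial uses, the list can be cleaned first by stripping each string and dropping the empty ones. The `raw` list below copies the shell output shown above; whether other festival pages follow the same order is an assumption that would need checking.

```python
# Text nodes copied from the Scrapy shell output shown above.
raw = [
    '\r\n', '\r\n', '\r\n',
    'New York City, NY',
    '\r\n',
    'July 28-July 30, 2017',
    '\r\n', ' ', '\r\n',
    ' No',
    '\r\n', '\r\n\r\n', '\r\n\r\n\r\n\r\n\r\n', '\r\n',
    '\r\n\r\n', '\r\n\r\n', '\r\n\r\n\r\n', '\r\n\r\n',
]

# Strip surrounding whitespace and keep only the non-empty strings.
cleaned = [text.strip() for text in raw if text.strip()]

# With the whitespace gone, the first two entries are location and dates.
location, dates = cleaned[0], cleaned[1]

print(cleaned)
```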

#### Ticket Price
Try this one on your own. Make sure to use the Scrapy Shell!

#### Website
Let's look at the XPath for the festival's official website:
//*[@id="festival-basics"]/a

Notice that the web address is inside the 'a' tag and is assigned to href. To extract an href link we use:  
```
In [3]: response.xpath('//div[@id="festival-basics"]/a/@href').extract()[0]

Out[3]: u'http://www.panorama.nyc/'
```

#### Logo and Lineup
Try to do these on your own.

#### Let's put it all together in parse_festival :

In [None]:
    def parse_festival(self, response):
        url = response.request.meta['url']

        name = response.xpath('//h1/span/text()').extract()[0]

        location = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[3])

        dates = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[5])

        tickets = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[7])

        website = (
            response.xpath(
                '//div[@id="festival-basics"]/a/@href').extract()[0]
        )

        logo = (
            response.xpath(
                '//div[@id="festival-basics"]/img/@src').extract()[0]
        )

        lineup = (
            response.xpath(
                '//div[@class="lineupguide"]/ul/li/text()').extract() +
            response.xpath(
                '//div[@class="lineupguide"]/ul/li/a/text()').extract()
        )

        yield {
            'url': url,
            'name': name,
            'location': location,
            'dates': dates,
            'tickets': tickets,
            'website': website,
            'logo': logo,
            'lineup': lineup
        }

Notice we added the 'url' variable from meta, which we defined earlier in the parse method. Putting it all together, our ```festival.py``` script should look like this:

In [1]:
import scrapy


class FestivalSpider(scrapy.Spider):

    name = 'music_festivals'

    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3,
        "HTTPCACHE_ENABLED": True
    }

    start_urls = [
        'https://www.musicfestivalwizard.com/festival-guide/us-festivals/',
        'https://www.musicfestivalwizard.com/festival-guide/canada-festivals/'
    ]

    def parse(self, response):

        for href in response.xpath(
            '//span[@class="festivaltitle"]/a/@href'
        ).extract():

            yield scrapy.Request(
                url=href,
                callback=self.parse_festival,
                meta={'url': href}
            )

        next_url = response.xpath(
            '//div[@class="pagination"]/ul/li'
            '/a[@class="next page-numbers"]/@href'
        ).extract_first()

        # The last page has no "next" link, so only follow it when present
        if next_url is not None:
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    def parse_festival(self, response):

        url = response.request.meta['url']

        name = response.xpath('//h1/span/text()').extract()[0]

        location = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[3])

        dates = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[5])

        tickets = (
            response.xpath('//div[@id="festival-basics"]/text()').extract()[7])

        website = (
            response.xpath(
                '//div[@id="festival-basics"]/a/@href').extract()[0]
        )

        logo = (
            response.xpath(
                '//div[@id="festival-basics"]/img/@src').extract()[0]
        )

        lineup = (
            response.xpath(
                '//div[@class="lineupguide"]/ul/li/text()').extract() +
            response.xpath(
                '//div[@class="lineupguide"]/ul/li/a/text()').extract()
        )

        yield {
            'url': url,
            'name': name,
            'location': location,
            'dates': dates,
            'tickets': tickets,
            'website': website,
            'logo': logo,
            'lineup': lineup}


## Run Your Spider

To run your spider, in the command line, go to the project's top level directory and run:

```
$ scrapy crawl music_festivals -o festival.json
```

This will run our spider named 'music_festivals' and generate a 'festival.json' file containing all the scraped data in JSON. Scrapy can also save data in other formats, such as CSV. 
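
Once the crawl has finished, the JSON file holds a list of items, one per festival. The sketch below shows one way to load it with Python's `json` module. The `load_items` helper and the sample record are hypothetical: the sample is written to a temporary file so the example runs without a real crawl.

```python
import json
import os
import tempfile


def load_items(path):
    """Load the list of scraped items from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Hypothetical sample standing in for real crawl output.
sample = [
    {
        'name': 'Panorama Music Festival 2017',
        'location': 'New York City, NY',
        'dates': 'July 28-July 30, 2017',
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'festival.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    items = load_items(path)

# Each item is a dictionary with the keys yielded by parse_festival.
names = [item['name'] for item in items]
print(names)
```

With a real crawl, you would call `load_items('festival.json')` from the project's top-level directory instead.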