<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Web Scraping and Spiders With Scrapy

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

_Modified for DSI-EAST-2 by Justin Pounders_

---

### Learning Objectives
- Decipher the structure and content of HTML
- Use Beautiful Soup to parse HTML
- Use XPath to select HTML elements
- Practice using Scrapy to get data from Craigslist
- Walk through the construction of a spider built using Scrapy


<a id='introduction'></a>

![What is HTML?](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

One of the largest sources of data in the world is all around us — the web. Most people consume the web in some form every day. One of the most powerful Python tool sets we'll learn allows us to extract and normalize data from unstructured sources such as web pages.  

**If you can see it, it can be scraped, mined, and put into a DataFrame.**

Before we begin the actual process of web scraping with Python, it's important to cover the basic constructs that describe HTML as unstructured data. 

We'll then cover a powerful selection technique called XPath and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org).

<a id='html'></a>

## Hypertext Markup Language (HTML)

---

In the HTML document object model (DOM), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside an HTML element is a text node.
 * Comments are comment nodes.
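
To make these node types concrete, here is a minimal sketch using Python's built-in `html.parser` module (not one of the libraries used later in this lesson). The `NodeLister` class and the sample markup are illustrative assumptions; the parser simply calls a different handler for each kind of node it meets.

```python
from html.parser import HTMLParser

class NodeLister(HTMLParser):
    """Record each node the parser encounters as a (kind, value) pair."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))

lister = NodeLister()
lister.feed('<p id="intro">Hello<!-- a note --></p>')
for kind, value in lister.nodes:
    print(kind, value)
```

Running it lists one element node, one attribute node, one text node, and one comment node, in document order.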

<a id='elements'></a>
### Elements
Elements begin and end with opening and closing "tags," each of which is a tag name enclosed in angle brackets. The name in the opening tag must match the name in the closing tag.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

When you have several different titles or paragraphs on a single page, you can assign ID values to elements to create unique reference points. IDs are also useful for labelling nested elements.
```html
<title id ='title_1'>I am the first title.</title>
<p id ='para_1'>I am the first paragraph.</p>
<title id ='title_2'>I am the second title.</title>
<p id ='para_2'>I am the second paragraph.</p>
```


**Elements can have parents and children.**
It's important to remember that an element can be both a parent and a child — whether an element is referred to as a parent or child depends on the specific element you're referencing.


```html
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent.'
        <div id = 'child_2'>I am the child of 'child_1.'
            <div id = 'child_3'>I am the child of 'child_2.'
                <div id = 'child_4'>I am the child of 'child_3.'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```html
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2.'
        <div id = 'child_2'>I am the parent of 'child_3.'
            <div id = 'child_3'> I am the parent of 'child_4.'
                <div id = 'child_4'>I am not a parent. </div>
            </div>
        </div>
    </div>
</body>
```
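
The parent/child relationships can also be read programmatically. The following is a rough sketch using the standard library's `xml.etree.ElementTree`, which is an XML parser and only works here because this small snippet happens to be well-formed; the `parent_of` name is just for illustration.

```python
import xml.etree.ElementTree as ET

markup = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3">
                <div id="child_4"></div>
            </div>
        </div>
    </div>
</body>
"""

root = ET.fromstring(markup)

# Map each element's id to the id of its direct parent.
parent_of = {
    child.get("id"): parent.get("id")
    for parent in root.iter()
    for child in parent
}

for child_id, parent_id in parent_of.items():
    print(f"{child_id} is a child of {parent_id}")
```

Every `div` except `child_4` appears both as a key (it is someone's child) and as a value (it is someone's parent).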

<a id='attributes'></a>
### Attributes

HTML elements can also have attributes. They describe the properties and characteristics of elements. Some affect how the element behaves or looks in terms of the output rendered by the browser.

One of the most common elements is the anchor element. Anchor elements usually have an "href" attribute, which tells the browser where to go when the link is clicked. Browsers typically render an anchor in a distinct color and underline it as a visual cue to differentiate it from the surrounding text.

**The markup that describes an element with attributes literally looks like this:**

```html
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this:**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
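
As a small illustration of how an element, its attributes, and its text are separate pieces of data, the sketch below parses the same anchor with the standard library's `xml.etree.ElementTree` (an XML parser used here as a stand-in, since this one-line snippet is well-formed).

```python
import xml.etree.ElementTree as ET

anchor = ET.fromstring(
    '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
)

print(anchor.tag)          # element name
print(anchor.attrib)       # all attributes as a dict
print(anchor.get("href"))  # a single attribute value
print(anchor.text)         # the text node inside the element
```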

<a id='element-hierarchy'></a>
### Element Hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally represented as:**

```html
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```

<a id='html-resources'></a>
### Additional HTML Resources

Read all about the different elements supported by modern browsers:
 * [HTML5 cheat sheet](http://websitesetup.org/html5-cheat-sheet/).
 * [Mozilla HTML element reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
 * [HTML5 visual cheat sheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf).
 

<a id='practical'></a>

## Using Requests and Beautiful Soup to Extract Information From a Web Page

---

Beautiful Soup is a Python library that's useful for pulling data out of HTML and XML files. It works with several parsers, such as `lxml` and Python's built-in `html.parser`, and can be run interactively (for example, in a Jupyter notebook), which makes it easier to work with when first extracting information from HTML.

Please make sure that the required packages are installed: 

```bash
# Beautiful Soup:
> conda install beautifulsoup4
> conda install lxml

# Or if conda doesn't work:
> pip install beautifulsoup4
> pip install lxml
```

In [1]:
from bs4 import BeautifulSoup

In [2]:
with open("sample.html") as f:   # close the file once it has been parsed
    soup = BeautifulSoup(f, "lxml")

In [4]:
soup  # display the parsed document

<!DOCTYPE html>
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<h1>Header 1</h1>
<h2>Header 2</h2>
<p>This is a paragraph</p>
<a href="https://www.google.com/">Google it!</a>
<h3>What's in a div?</h3>
<div class="divvy-it-up" id="foobar">
<p id="layer1">I'm in a div.  Yeah!</p>
<div>
<p id="layer2">I'm in a div, too!</p>
</div>
</div>
<div class="todo">
<ul>
<li> Take out trash</li>
<li> Walk dog</li>
</ul>
</div>
<div class="something">
<ol>
<li>One</li>
<li>Two</li>
</ol>
</div>
</body>
</html>

In [6]:
print(soup.title)
print(soup.title.text)

<title>Hello, World!</title>
Hello, World!


In [11]:
ulist = soup.find('ul') # return the _first_ <ul>
for list_item in ulist.find_all('li'):    # return _all_ matches, <li>
    print(list_item.text)

 Take out trash
 Walk dog


In [17]:
# <a href="https://www.google.com/">
a_list = soup.find_all('a')
first_a = a_list[0]    # access the results like a list
link = first_a['href'] # access attributes of an element like a dict
print(first_a)
print(link)

<a href="https://www.google.com/">Google it!</a>
https://www.google.com/


In [27]:
#<div class="divvy-it-up" id="foobar">
#<p id="layer1">I'm in a div.  Yeah!</p>
#<div>
#<p id="layer2">I'm in a div, too!</p>
#</div>
#</div>

div_results = soup.body.find_all('div', 
                       {'id':'foobar', 'class':'divvy-it-up'})
div_results[0].find('p', {'id':'layer2'}).text

"I'm in a div, too!"

### Let's scrape DataTau!

<a id='step1'></a>
### 1) Fetch the content by URL.


In [29]:
# You'll need the requests library in order to fully utilize bs4.
import requests
from bs4 import BeautifulSoup

# Target web page:
url = "https://www.datatau.com/"

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

# Pull the HTML string out of requests and convert it to a Python string.
html = response.text

# Uncomment to print the first 700 characters of the content.
# print(html[:700])

200


More information on [request status codes](http://www.restapitutorial.com/httpstatuscodes.html).
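
If you want to turn a numeric status code into something readable, the standard library's `http.HTTPStatus` enum knows the official phrases. The helper names `describe_status` and `is_success` below are illustrative, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"

def is_success(code):
    """True for 2xx codes, which signal that the request succeeded."""
    return 200 <= code < 300

for code in (200, 400, 403, 404):
    print(describe_status(code), "-> success" if is_success(code) else "-> failure")
```

With `requests` itself, `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a fetch fails.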

<a id='step2'></a>
### 2) Parse the HTML document with Beautiful Soup.

This step allows us to access the elements of the document by tag name, attribute, or CSS selector. (Beautiful Soup itself does not evaluate XPath expressions.)

In [30]:
soup = BeautifulSoup(html, 'lxml')

In [None]:
# This code collects the titles, links and urls for DataTau's homepage

# List to store results
results_list = []

# Get all the <td class="title"... elements
all_td = soup.find_all('td', {'class':'title'})
for element in all_td:
    # start a dictionary to store this item's data
    result = {}
    
    # get the title and full link/url
    a_href = element.find('a')
    if a_href:
        result['title'] = a_href.text   # element text
        result['link'] = a_href['href'] # href link
        
    # get the url domain
    span = element.find('span', {'class':'comhead'})
    if span:
        result['url'] = span.text.strip()[1:-1]
        
    # only store "full" rows of data
    if len(result) == 3:
        results_list.append(result)
        
results_list[0]
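
The `span.text.strip()[1:-1]` step in the loop above trims whitespace and then drops the first and last characters. The sketch below assumes the `comhead` text looks like ` (datalore.io) ` (the exact text on the live page is an assumption), and also shows `urllib.parse.urlparse` as another way to get a host name from the link itself.

```python
from urllib.parse import urlparse

comhead_text = " (datalore.io) "  # assumed shape of the span's text
domain_from_span = comhead_text.strip()[1:-1]  # drop the surrounding parentheses

link = "https://blog.datalore.io/introducing-datalore/"
host_from_link = urlparse(link).netloc  # full host name, including any subdomain

print(domain_from_span)
print(host_from_link)
```

Note that the two values can differ: the link's host keeps the `blog.` subdomain, while the site's own label does not.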

In [49]:
import pandas as pd

pd.DataFrame(results_list)

| | link | title | url |
|---|---|---|---|
| 0 | https://blog.datalore.io/introducing-datalore/ | Introducing Datalore - an intelligent tool for... | datalore.io |
| 1 | https://activewizards.com/blog/a-comparative-a... | A Comparative Analysis of Top 6 BI and Data Vi... | activewizards.com |
| 2 | https://medium.com/scribd-data-science-enginee... | Search Query Parsing | medium.com |
| 3 | https://tinyletter.com/data | Data goes bang: a journey to understand data p... | tinyletter.com |
| 4 | https://statsbot.co/blog/event-analytics-defin... | Event Analytics: How to Define User Sessions w... | statsbot.co |
| 5 | http://harderchoices.com/2018/02/11/finite-mar... | Finite Markov Decision Process a high-level in... | harderchoices.com |
| 6 | https://www.kaggle.com/mrisdal/safely-analyzin... | Tutorial: Analyzing GitHub Projects on Kaggle ... | kaggle.com |
| 7 | https://statsbot.co/blog/calculating-customer-... | Using SQL to Estimate Customer Lifetime Value ... | statsbot.co |
| 8 | https://activewizards.com/blog/top-15-scala-li... | Top 15 Scala Libraries for Data Science in 2018 | activewizards.com |
| 9 | https://dwhsys.com/2017/03/25/apache-zeppelin-... | Apache Zeppelin vs. Jupyter Notebook: comparis... | dwhsys.com |


<a id='xpath'></a>

## What is XPath?

---

![](assets/obama_wiki.png)

Understanding how to identify elements and attributes within HTML documents gives us the ability to write simple expressions that create structured data. We can think of XPath like a query language for HTML.

To simplify this process, we'll be using the Chrome extension XPath Helper. It's not necessary but highly recommended when building XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en).

XPath expressions can select elements, element attributes, and element text. These selections can apply to a single item or multiple items. Generally, if you're not specific enough, you'll end up selecting multiple elements.


<a id='multiple-selections'></a>
### Multiple Selections

***Multiple selections*** are useful for capturing search results or any repeating element. One example is the set of _titles_ from apartment listing search results on Craigslist.


**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**Example HTML Markup**
```html
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="hdrlnk">Welcome home to a sweetly renovated four-bedroom, one-and-a-half bath.</a> 
</span>
...
```

**XPath: Multiple Titles** _Copy this into the XPath Helper Query box_:
```
//a[@class='hdrlnk']
```

**Returns (Ad Titles)**
```
***New Remodeled two bedroom Apartment***
WONDERFUL ONE BR APARTMENT HOME
Beautiful 1bed/1bath Apartment in Russian Hill NO SECURITY DEPOSIT
Knockout SF View|Green Oasis|Private Driveway|Furnished
3BR/3BA Spacious, Beautiful SOMA Loft: 5 month lease
Nob Hill Large Studio - Light, Quiet, Lovely Building
etc...
```
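
To see a multiple selection evaluated in Python without a browser, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup is a hypothetical, simplified stand-in for the search results, kept well-formed so an XML parser can read it; real pages generally need an HTML-aware tool such as Scrapy's `Selector`, covered later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified search-results markup with two listings.
markup = """
<ul>
    <li><span class="pl">
        <a href="/sfc/apa/1.html" class="hdrlnk">WONDERFUL ONE BR APARTMENT HOME</a>
    </span></li>
    <li><span class="pl">
        <a href="/sfc/apa/2.html" class="hdrlnk">Nob Hill Large Studio</a>
    </span></li>
</ul>
"""

root = ET.fromstring(markup)

# ElementTree needs a leading "." for a search through all descendants.
matches = root.findall(".//a[@class='hdrlnk']")
titles = [a.text for a in matches]
links = [a.get("href") for a in matches]

print(titles)
print(links)
```

One expression returns every matching anchor, which is exactly the behavior we want for repeating elements.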

<a id='singular-selections'></a>

### Singular Selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements. Here's an example of a details page on Craigslist:

**URL**

[https://sfbay.craigslist.org/sfc/apa/6161864063.html](https://sfbay.craigslist.org/sfc/apa/6161864063.html)

**HTML Markup**

```html
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
```

**XPath: Single Item**

```
//p[@class='postinginfo'][2]/time
```
**Returns (Time of Posting or Age of Post)**
```
2016-01-12 11:23pm
```
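
A singular selection can be sketched the same way. The example below uses the standard library's `xml.etree.ElementTree` on a trimmed, well-formed copy of the markup above; because the parsed root is the `<div>` itself, the path starts at its `<p>` children rather than with `//`.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the posting-info markup above.
markup = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800">2016-01-12 11:23pm</time></p>
    <p class="postinginfo">email to friend</p>
</div>
"""

root = ET.fromstring(markup)

# find() returns only the first match (or None), mirroring a singular selection.
# XPath positions are 1-based, so p[2] is the second <p> child.
posted = root.find("./p[2]/time")

print(posted.text)
print(posted.get("datetime"))
```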

<a id='scrapy'></a>

## A Simple Example Using Scrapy and XPath

---

Below is an example of how to get information out of fake HTML using the XPath capabilities of the Scrapy package. You'll likely need to install the Scrapy package using `conda install scrapy`.   

**Note:** `conda install` also pulls in the non-Python libraries that Scrapy's dependencies rely on; with `pip install`, you may need to install some of these yourself, depending on your platform.

We'll use the `Selector` class from the Scrapy library to help us construct our query.

A `Selector` takes the HTML target as an argument and can then use several query types to extract information. In this case, we'll call its `.xpath()` method, as our query is written in the XPath language. 

Just like with writing Python scripts, there are several ways you can access the exact same information in HTML. Let's try a few.

In [6]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# HTML structure string:
HTML = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
"""

# Option 1: Use the exact class name to get its associated text.
# //span[@class='bestof-text']/text()
best = Selector(text=HTML).xpath("//span[@class='bestof-text']/text()").extract()
best

['best of']

In [9]:
# Option 2: Grabs the entire HTML post where class='bestof-link'.
best =  Selector(text=HTML).xpath("/html/body/div/p/a[@class='bestof-link']")

# Parse the first grabbed chunk of the text for the specific element with class='bestof-text'.
nested_best =  best.xpath("./span[@class='bestof-text']/text()").extract()
nested_best

['best of']

_Option 2 will probably be the most common because there's a good chance you'll want to grab information from several child elements that exist within one parent element._

## Where's Waldo? — XPath Edition

In this example, we'll find Waldo together. Find Waldo as:

- An element.
- An attribute.
- A text element.

In [10]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo I'm not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill Gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">Parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

**Tip:** We can use the asterisk character `*` as a placeholder for "all possible."

```python
# All elements where class='alpha':
Selector(text=HTML).xpath('//*[@class="alpha"]').extract()

# Returns:

['<div class="alpha">Bill Gates</div>',
 '<div class="alpha">Zuckerberg</div>']
```


**Find the element 'Waldo':**

In [11]:
# Text contents of the element Waldo:
Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

['Waldo']

**Find the attribute(s) 'Waldo':**

In [12]:
# All elements that have any attribute with the value 'waldo':
Selector(text=HTML).xpath('//*[@*="waldo"]').extract()

['<ul id="waldo">\n            <li class="waldo">\n                <span> yo I\'m not here</span>\n            </li>\n            <li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n            <li class="waldo">Last Location:  ???</li>\n            <li class="nerds">\n                <div class="alpha">Bill Gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">Parker</div>\n            </li>\n        </ul>',
 '<li class="waldo">\n                <span> yo I\'m not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']

In [13]:
# All elements whose class attribute is 'waldo':
Selector(text=HTML).xpath('//*[@class="waldo"]').extract()

['<li class="waldo">\n                <span> yo I\'m not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']

**Find the text element Waldo.**

In [14]:
# The whole element whose text is exactly 'Waldo':
Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

['<waldo>Waldo</waldo>']

<a id='scrapy-spiders'></a>
## What is a Scrapy Spider?

---

> *"[Scrapy](http://scrapy.org/) is an application framework for writing web spiders that "crawl" around websites and extract data from them."*

Below we'll walk through the creation of a **spider** using Scrapy. Spiders are automated processes that will crawl through a web page or web pages to collect information.

> **Note:** This code should be written in a script outside of Jupyter.

<a id='scrapy-project'></a>
### 1) Create a new Scrapy project.

In your terminal, `cd` into a directory where you want to create your spider's folder. We recommend the desktop for easy access to the files.
> `scrapy startproject craigslist`

**It should create an output that looks like this:**
<blockquote>
```
New Scrapy project 'craigslist', using template directory '/Users/jmpounders/anaconda3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/jmpounders/dsi-east-2/scrapy/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```
</blockquote>

**That command generates a set of project files:**
<blockquote>
```
├── craigslist
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
```
</blockquote>

Generally, these are our files. We'll go into more detail on these soon.

 * **`scrapy.cfg`:** The project's configuration file.
 * **`craigslist/`:** The project’s Python module — you’ll import your code from here later.
 * **`craigslist/items.py`:** The project’s items file.
 * **`craigslist/pipelines.py`:** The project’s pipelines file.
 * **`craigslist/settings.py`:** The project’s settings file.
 * **`craigslist/spiders/`:** A directory where you’ll store your spiders.
 
Please also add this line to your `craigslist/settings.py` file before continuing:
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>
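
While you have `settings.py` open, a few other built-in Scrapy settings are worth knowing about when crawling a real site. The values below are illustrative assumptions, not requirements of this lesson; adjust them to your situation.

```python
# craigslist/settings.py (additional, optional lines)

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Limit how many requests run at once against a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Slower, more polite settings make it less likely that a site will block your spider while you are still developing it.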



--- 
<a id='define-item'></a>
### 2) Define an "item."

When we define an item, we're telling our new application what it will be collecting. In essence, an item is an entity that has attributes ("title," "description," "price," etc.) that are descriptive and relate to elements on pages we'll be scraping.  

In more precise terms, this is a model (for those who are familiar with object-relational mapping or relational database terms). Don't worry if this is a foreign concept.  The main idea is to understand that a model has attributes that closely resemble or relate to elements on our target web page(s).

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items.
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3) A spider that crawls.

An item is a model that resembles data on a web page. A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are Python classes. Let's write our first file, called `craigslist_spider.py`, and put it in our `/spiders` directory.**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "https://atlanta.craigslist.org/search/cto"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory.**

```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It passed the response for each requested URL, containing that page's HTML markup, to our `parse()` method.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in our base project directory. The file is named after the second-to-last piece of the URL once it is split on `/`. For the start URL above, that gives a file called "search"; for a URL such as `http://sfbay.craigslist.org/search/sfc/apa`, it would be "sfc." This is taken directly from the Scrapy docs and its only point is to illustrate the workflow so far. It's nice to have a reference to our HTML file.  
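
Because the file name comes from `response.url.split("/")[-2]`, it depends on the shape of the URL. A quick check, using the spider's start URL and the San Francisco search URL from the XPath section (and assuming the site does not redirect to a different URL); `filename_for` is an illustrative helper, not part of Scrapy:

```python
def filename_for(url):
    """Mimic the spider's naming rule: the second-to-last '/'-separated piece."""
    return url.split("/")[-2]

print(filename_for("https://atlanta.craigslist.org/search/cto"))
print(filename_for("http://sfbay.craigslist.org/search/sfc/apa"))
```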

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4) XPath and parsing with our spider.

So far, we've defined the fields we'll get, some URLs to fetch, and saved some content to a file. Now, it's about to get interesting.

**We should let our spider know about the item model we created earlier. In the head of the `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import.**

```python
from craigslist.items import CraigslistItem
```

> **Check:** Why won't it work otherwise?

<br><br><br>
**Let's replace our `parse()` method to find some data from our Craigslist spider response and map them to our item model, `CraigslistItem`.**


```python
def parse(self, response):
    # Replaces the earlier parse() method inside CraigslistSpider.
    items = []  # List for storing scraped items.
    # Each search result is a <p> inside an <li class="result-row">.
    for sel in response.xpath("//li[@class='result-row']/p"):
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()  # Title text from the 'a' element.
        item['link'] = sel.xpath("a/@href").extract()    # Href/URL from the 'a' element.
        # Price from the 'result-price' span nested inside another span;
        # extract_first() returns None instead of raising when no price is listed.
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        items.append(item)
    return items  # Scrapy collects the returned items (and logs them to the terminal).
```



---

<a id='save-examine'></a>
### 5) Save and examine our scraped data.

Scrapy's feed exports can save our crawled data in CSV format. To save our data, we just need to pass an output file to our crawl call with the `-o` option; the format is inferred from the file extension:

<blockquote>
```
> scrapy crawl craigslist -o items.csv
```
</blockquote>

It's always good to iteratively check the data when developing a spider to make sure the set is close to what we want. 

> *Pro tip: The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called `items.csv` in the directory from which you ran the `scrapy crawl` command.
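
One quick way to examine the output is to read it back with Python's `csv` module. The rows below are hypothetical sample data standing in for a real `items.csv`; actual titles, links, prices, and column order depend on your crawl.

```python
import csv
import io

# Hypothetical contents of items.csv; real rows depend on the crawl.
sample_csv = """title,link,price
2012 Honda Civic,/cto/d/example-1.html,$6500
2008 Toyota Tacoma,/cto/d/example-2.html,$11000
"""

# With a real file, use: open("items.csv", newline="")
with io.StringIO(sample_csv) as f:
    rows = list(csv.DictReader(f))

print(len(rows), "rows")
for row in rows:
    print(row["title"], row["price"])
```

The same file can also be loaded with `pd.read_csv("items.csv")` if you prefer to inspect it as a DataFrame.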

<a id='addendum'></a>
## Addendum: Leveraging XPath to Get More Results

---

Generally, a workflow that's useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and, from that, derive your own XPath expressions based on the output.

`text()` selects only the text of a given element (between the tags), and `@attribute_name` is used to select attributes.

**Here is an example of `text()`:**
<blockquote>
```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```
</blockquote>

The XPath selector for this:

<blockquote>
```
//h1/text()
```
</blockquote>

**Here is an example of selecting by attribute:**

The description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
```
<h2>Description:</h2>

<div id="description">
Short documentary made for the Plymouth City Museum and Art Gallery regarding the set up of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```
</blockquote>

The XPath selector for this:
<blockquote>
```
//div[@id='description']
```
</blockquote>

---
<a id='follow-links'></a>
### Following Links for More Results

One hundred results is pretty good, but what if we want more? We need to follow the "next" links and find new pages to grab. To do that, the **`parse()`** method of our spider class needs to return another type of object in addition to items: a `scrapy.Request` for each new page to crawl.

See [Stack Overflow](https://stackoverflow.com/questions/30152261/make-scrapy-follow-links-and-collect-data) for details!
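
Whatever mechanism requests the next page (in Scrapy, a `scrapy.Request`, or the `response.follow` shortcut), it needs that page's URL. "Next" links are often relative, so they have to be resolved against the page they came from. The sketch below shows that step with the standard library; the `next_href` value is a hypothetical example, not taken from the live site.

```python
from urllib.parse import urljoin

page_url = "https://atlanta.craigslist.org/search/cto"

# Hypothetical href taken from a "next" link on the results page.
next_href = "/search/cto?s=120"

# Resolve the relative link against the page it was found on.
next_url = urljoin(page_url, next_href)
print(next_url)
```

Inside a spider, `response.urljoin(next_href)` performs the same resolution using the response's own URL.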
