<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Web Scraping and Spiders With Scrapy

---

<a id='introduction'></a>

![What is HTML?](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

```
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is a Heading</h1>
        <p>This is a paragraph.</p>
        <p>This is <b>another</b> paragraph.</p>
    </body>
</html>
```

RESULT -> [html_example.html](file:///Users/edoardo/html_example.html )

<a id='element-hierarchy'></a>
### Element Hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally represented as:**

```
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```


## Hypertext Markup Language (HTML)

---

In the HTML document object model (DOM), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside an HTML element is a text node.
 * Comments are comment nodes.
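
The node types listed above can be inspected from Python. The following is a minimal sketch using the standard library's `xml.dom.minidom`, which only accepts well-formed markup; the `SNIPPET` string and the `node_kind` helper are made up for illustration.

```python
from xml.dom import Node, minidom

# A tiny, well-formed snippet (hypothetical) so the XML parser accepts it.
SNIPPET = '<html><body><!-- a comment --><p class="intro">Hello</p></body></html>'

NODE_KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}


def node_kind(node):
    """Return a readable name for a DOM node's type."""
    return NODE_KINDS.get(node.nodeType, "other")


doc = minidom.parseString(SNIPPET)
body = doc.documentElement.firstChild   # <html> -> <body>
comment, paragraph = body.childNodes    # the comment, then <p>

for node in (doc, paragraph, paragraph.getAttributeNode("class"),
             paragraph.firstChild, comment):
    print(node_kind(node))
```

Each printed line corresponds to one bullet in the list: the document, an element, an attribute, a text node, and a comment.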

<a id='elements'></a>
### Elements

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```


**Elements can have parents and children.**
It's important to remember that an element can be both a parent and a child — whether an element is referred to as a parent or child depends on the specific element you're referencing.


```
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent.'
        <div id = 'child_2'>I am the child of 'child_1.'
            <div id = 'child_3'>I am the child of 'child_2.'
                <div id = 'child_4'>I am the child of 'child_3.'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2.'
        <div id = 'child_2'>I am the parent of 'child_3.'
            <div id = 'child_3'> I am the parent of 'child_4.'
                <div id = 'child_4'>I am not a parent. </div>
            </div>
        </div>
    </div>
</body>
```
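
To make the parent/child relationship concrete, here is a rough sketch that walks the same nesting with the standard library's `xml.etree.ElementTree`. The `parent_of` map and the `ancestors` helper are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

NESTED = """
<body id='parent'>
    <div id='child_1'>
        <div id='child_2'>
            <div id='child_3'>
                <div id='child_4'>I am not a parent.</div>
            </div>
        </div>
    </div>
</body>
"""

root = ET.fromstring(NESTED)
# ElementTree elements do not store their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the ids of an element's ancestors, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.get("id"))
    return chain


child_2 = root.find(".//div[@id='child_2']")
print(ancestors(child_2))                        # ids of its parents
print([child.get("id") for child in child_2])    # ids of its direct children
```

Here `child_2` is a child of `child_1` and, at the same time, the parent of `child_3`.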

<a id='attributes'></a>
### Attributes



```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**Once rendered, this element looks like this:**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
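
The `href` attribute is what a scraper usually wants from a link. As a small sketch, the standard library's `html.parser.HTMLParser` can collect it; the `LinkCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


collector = LinkCollector()
collector.feed('<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>')
print(collector.hrefs)
```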

## Add Style to HTML: CSS

```
body  { font-family:Verdana, Geneva, sans-serif; font-size:10pt; color:#828282; }
td    { font-family:Verdana, Geneva, sans-serif; font-size:10pt; color:#828282; }

.admin td   { font-family:Verdana, Geneva, sans-serif; font-size:8.5pt; color:#000000; }
.subtext td { font-family:Verdana, Geneva, sans-serif; font-size:  7pt; color:#828282; }

a:link    { color:#000000; text-decoration:none; }
a:visited { color:#828282; text-decoration:none; }
```

<a id='html-resources'></a>
### You Are Now Certified HTML Experts

![](assets/certified.jpg)



<a id='xpath'></a>

## What is XPath?

---

![](assets/obama_wiki.png)

## Example 1
Select all the ```<a>``` nodes that are direct children of `/html/body`
```
/html/body/a
```

## Example 2
Select all the ```<a>``` nodes whose `class` attribute equals "result-title hdrlnk"
```
//a[@class='result-title hdrlnk']
```

## Example 3
Select the third link in the second ```<p>``` node of /html/body
```
/html/body/p[2]/a[3]
```
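
The three examples can be tried without installing anything, with one caveat: the sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and expects paths relative to the root element (here `<html>`), so each expression is rewritten accordingly. The `PAGE` markup is made up.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <body>
    <a href="/top">top</a>
    <p><a href="/p1">one</a></p>
    <p>
      <a class="result-title hdrlnk" href="/a">A</a>
      <a href="/b">B</a>
      <a href="/c">C</a>
    </p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Example 1: /html/body/a, written relative to <html>.
direct_links = [a.get("href") for a in root.findall("body/a")]

# Example 2: //a[@class='result-title hdrlnk']
titled = [a.get("href") for a in root.findall(".//a[@class='result-title hdrlnk']")]

# Example 3: /html/body/p[2]/a[3]
third = root.find("body/p[2]/a[3]").get("href")

print(direct_links, titled, third)
```

Note that Example 1 only matches the link sitting directly under `<body>`, not the links nested inside the paragraphs.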





## XPath + Python

---

In [3]:
# !pip install scrapy

In [4]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

In [5]:
# HTML structure string:
HTML = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
"""

In [6]:
query = "//span[@class='bestof-text']/text()"
best = Selector(text=HTML).xpath(query).extract()
best

['best of']

## Where's Waldo? — XPath Edition

In this example, we'll find Waldo together. Find Waldo as:

- An element.
- An attribute.
- A text element.

In [7]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo I'm not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill Gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">Parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

**Tip:** We can use the asterisk character `*` as a placeholder for "all possible."

```python
# All elements where class='alpha':
Selector(text=HTML).xpath('//*[@class="alpha"]').extract()

# Returns:
# ['<div class="alpha">Bill Gates</div>',
#  '<div class="alpha">Zuckerberg</div>']
```


**Find the element 'Waldo':**

In [9]:
# Text contents of the element Waldo:
Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

['Waldo']

**Find the attribute(s) 'Waldo':**

In [10]:
# All elements that have any attribute whose value is "waldo":
for e in Selector(text=HTML).xpath('//*[@*="waldo"]').extract():
    print(e)
    print('-'*20)

<ul id="waldo">
            <li class="waldo">
                <span> yo I'm not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill Gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">Parker</div>
            </li>
        </ul>
--------------------
<li class="waldo">
                <span> yo I'm not here</span>
            </li>
--------------------
<li class="waldo">Height:  ???</li>
--------------------
<li class="waldo">Weight:  ???</li>
--------------------
<li class="waldo">Last Location:  ???</li>
--------------------


In [11]:
# All elements whose class attribute equals "waldo":

for e in Selector(text=HTML).xpath('//*[@class="waldo"]').extract():
    print(e)
    print('-'*20)

<li class="waldo">
                <span> yo I'm not here</span>
            </li>
--------------------
<li class="waldo">Height:  ???</li>
--------------------
<li class="waldo">Weight:  ???</li>
--------------------
<li class="waldo">Last Location:  ???</li>
--------------------


**Find the text element Waldo.**

In [12]:
# The whole element whose text is exactly 'Waldo':
Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

['<waldo>Waldo</waldo>']

## Requests + Beautiful Soup

---


<a id='step1'></a>
### 1) Fetch the content by URL.



In [15]:
# !pip install bs4


In [16]:
import requests
from bs4 import BeautifulSoup

In [20]:
url = "https://washingtondc.craigslist.org/nva/pet/d/adopt-kitten-or-cat-on/6717564911.html"
response = requests.get(url)
print('Status Code: ',response.status_code)
html = response.text
print(html[:500])

Status Code:  200
<!DOCTYPE html>
<html class="no-js">
<head>
<title>Adopt a Kitten or Cat on October 13th! - pets</title>
    	<link rel="canonical" href="https://washingtondc.craigslist.org/nva/pet/d/adopt-kitten-or-cat-on/6717564911.html">
	<meta name="description" content="AUTUMN KITTENPALOOZA EVENT OCTOBER 13th! Come meet our adoptable kittens and cats on October 13th! There will be 30-40 kittens and young adult cats available for adoption. It'll be held at: The...">
	<meta name="robots" content="noarchive,n


<a id='step2'></a>
### 2) Parse the HTML document with Beautiful Soup.

In [21]:
soup = BeautifulSoup(html, 'lxml')

In [22]:
# Singular element:
soup.html.title

<title>Adopt a Kitten or Cat on October 13th! - pets</title>

In [23]:
# Just the text between elements:
soup.html.title.text

'Adopt a Kitten or Cat on October 13th! - pets'

### 3) Find elements

In [24]:
element = soup.find_all("a", {"class": "header-logo"})
element[0].text

'CL'

In [26]:
for link in soup.find_all("a")[0:5]:
    print(link['href'])

/
/
/nva/
/search/nva/ccc
/search/nva/pet


## What is Scrapy?

---

> *"[Scrapy](http://scrapy.org/) is an application framework for writing web spiders that "crawl" around websites and extract data from them."*

Below we'll walk through the creation of a **spider** using Scrapy. Spiders are automated processes that will crawl through a web page or web pages to collect information.

> **Note:** This code should be written in a script outside of Jupyter.

<a id='scrapy-project'></a>
### 1) Create a new Scrapy project.

In your terminal, `cd` into a directory where you want to create your spider's folder. We recommend the desktop for easy access to the files.
> `scrapy startproject craigslist`

**It should produce output that looks like this:**
<blockquote>
```
2016-01-13 00:12:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-13 00:12:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-13 00:12:45 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'craigslist' created in:
    /Users/davidyerrington/virtualenvs/data/scraping/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```
</blockquote>

**That command generates a set of project files:**
<blockquote>
```
craigslist/
    scrapy.cfg
    craigslist/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
</blockquote>

Generally, these are our files. We'll go into more detail on these soon.

 * **`scrapy.cfg`:** The project's configuration file.
 * **`craigslist/`:** The project’s Python module — you’ll import your code from here later.
 * **`craigslist/items.py`:** The project’s items file.
 * **`craigslist/pipelines.py`:** The project’s pipelines file.
 * **`craigslist/settings.py`:** The project’s settings file.
 * **`craigslist/spiders/`:** A directory where you’ll store your spiders.
 
Please also add this line to your `craigslist/settings.py` file before continuing:
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>



--- 
<a id='define-item'></a>
### 2) Define an "item."

**Note:** Items are defined in the `items.py` file.

When we define an item, we're telling our new application what it will be collecting. In essence, an item is an entity that has attributes ("title," "description," "price," etc.) that are descriptive and relate to elements on pages we'll be scraping.  

In more precise terms, this is a model (for those who are familiar with object-relational mapping or relational database terms). Don't worry if this is a foreign concept.  The main idea is to understand that a model has attributes that closely resemble or relate to elements on our target web page(s).

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items.
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3) A spider that crawls.

An item is a model that resembles data on a web page. A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are Python classes. Let's write our first file, called `craigslist_spider.py`, and put it in our `/spiders` directory.**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory.**

```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It parsed over the content containing the HTML markup of each request URL.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in the directory we ran the crawl from. The file is named after the second-to-last segment of the URL, so in our case it should be called "sfc." This is taken directly from the Scrapy docs, and its only point is to illustrate the workflow so far. It's nice to have a reference to our HTML file.  
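
To see where the file name comes from, the URL handling in `parse()` can be reproduced on its own:

```python
url = "http://sfbay.craigslist.org/search/sfc/apa"

# Splitting on "/" leaves an empty string where the "//" after "http:" was.
parts = url.split("/")
print(parts)

# Index -2 is the second-to-last segment of the URL.
filename = parts[-2]
print(filename)
```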

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4) XPath and parsing with our spider.

So far, we've defined the fields we'll get, some URLs to fetch, and saved some content to a file. Now, it's about to get interesting.

**We should let our spider know about the item model we created earlier. At the top of `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import.**

```python
from craigslist.items import CraigslistItem
```

> **Check:** Why won't it work otherwise?

<br><br><br>
**Let's replace our `parse()` method to find some data from our Craigslist spider response and map them to our item model, `CraigslistItem`.**


```python
def parse(self, response):
    items = []  # Collects the scraped items.
    # Selector allows us to query the HTML in the response with XPath.
    hxs = scrapy.Selector(response)
    # Each search result is a paragraph inside an <li class="result-row">.
    for sel in hxs.xpath("//li[@class='result-row']/p"):
        item = CraigslistItem()
        # Title text from the 'a' element.
        item['title'] = sel.xpath("a/text()").extract()
        # Href/URL from the 'a' element.
        item['link'] = sel.xpath("a/@href").extract()
        # Price from the result-price span. extract_first() returns None
        # (instead of raising IndexError) when a listing has no price.
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        items.append(item)
    return items  # Scrapy logs the returned items and passes them to any export.

```



---

<a id='save-examine'></a>
### 5) Save and examine our scraped data.

By default, we can save our crawled data in a CSV format. To save our data, we just need to pass a few optional parameters to our crawl call:

<blockquote>
```
> scrapy crawl craigslist -o items.csv -t csv
```
</blockquote>

It's always good to iteratively check the data when developing a spider to make sure the set is close to what we want. 

> *Pro tip: The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called '`items.csv`' in the directory from which you ran the `scrapy crawl` command.
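
One quick way to check the export is to load it with the standard library's `csv` module. This is only a sketch: the rows below are made up and read from an in-memory string, the column names are assumed to match the item fields, and `summarize` is a hypothetical helper.

```python
import csv
import io

# Made-up rows standing in for the contents of items.csv.
SAMPLE_CSV = (
    "title,link,price\n"
    "Sunny studio,/apa/1.html,$2000\n"
    "Cozy room,/apa/2.html,\n"
)


def summarize(csv_file):
    """Return (number of rows, number of rows with an empty price)."""
    rows = list(csv.DictReader(csv_file))
    missing_price = sum(1 for row in rows if not row["price"])
    return len(rows), missing_price


total, missing = summarize(io.StringIO(SAMPLE_CSV))
print(f"{total} rows, {missing} without a price")
```

For the real file, pass an open file handle instead, for example `with open("items.csv", newline="") as f: summarize(f)`.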

<a id='addendum'></a>
## Addendum: Leveraging XPath to Get More Results

---

Generally, a workflow that's useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and, from that, derive your own XPath expressions based on the output.

`text()` selects only the text of a given element (between the tags), and `@attribute_name` is used to select attributes.

**Here are a few examples of `text()`:**
<blockquote>
```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```
</blockquote>

The XPath selector for this:

<blockquote>
```
//h1/text()
```
</blockquote>

**Here is an example of selecting by attribute:**

The description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
```
<h2>Description:</h2>

<div id="description">
Short documentary made for the Plymouth City Museum and Art Gallery regarding the set up of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```
</blockquote>

The XPath selector for this:
<blockquote>
```
//div[@id='description']
```
</blockquote>

---
<a id='follow-links'></a>
### Following Links for More Results

One hundred results is pretty good, but what if we want more? We need to follow the "next" links and find new pages to grab. To do that, the **`parse()`** method of our spider class needs to return another type of object: a `scrapy.Request` for the next page.

```python
def parse(self, response):
    hxs = scrapy.Selector(response)
    titles = hxs.xpath("//li[@class='result-row']/p")

    for sel in titles:
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()
        item['link'] = sel.xpath("a/@href").extract()
        # extract_first() returns None when a listing has no price.
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        # parse() is now a generator, so yield each item instead of returning a list.
        yield item

    print('####', response.url)  # Show which page was just parsed.

    # Does the next page exist? Let's get it:
    next_page = response.xpath("(//a[@class='button next']/@href)[1]")

    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse)

```
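
The control flow of following "next" links can be sketched without Scrapy or a network connection. Everything below is a stand-in: `PAGES` is a made-up in-memory site, and `crawl` plays the role of Scrapy's engine, which collects yielded items and schedules yielded requests.

```python
# Hypothetical in-memory "site": each page lists items and an optional next link.
PAGES = {
    "/search?s=0": {"items": ["a", "b"], "next": "/search?s=2"},
    "/search?s=2": {"items": ["c", "d"], "next": "/search?s=4"},
    "/search?s=4": {"items": ["e"], "next": None},
}


def parse(url):
    """Yield every item on a page, then a request for the next page if any."""
    page = PAGES[url]
    for item in page["items"]:
        yield ("item", item)
    if page["next"]:
        yield ("request", page["next"])


def crawl(start_url):
    """Collect items and keep following requests until none are left."""
    collected, queue = [], [start_url]
    while queue:
        for kind, value in parse(queue.pop(0)):
            if kind == "item":
                collected.append(value)
            else:
                queue.append(value)
    return collected


print(crawl("/search?s=0"))
```

The crawl stops on its own once a page has no "next" link, which mirrors the `if next_page:` check in the spider.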


### A Slightly Different Version Using CrawlSpider

```python
import scrapy
from craigslist.items import CraigslistItem
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class CraigslistSpider(CrawlSpider):
    name = "craigslist"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    # Follow every "next" button and pass each page it leads to on to parse_page().
    rules = (
        Rule(
            LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
            callback="parse_page",
            follow=True,
        ),
    )

    # You need to use a name different from "parse" now
    # because the method parse() is already used internally
    # by CrawlSpider.
    def parse_page(self, response):
        items = []  # Collects the scraped items.
        # Selector allows us to query the HTML in the response with XPath.
        hxs = scrapy.Selector(response)
        # Each search result is a paragraph inside an <li class="result-row">.
        for sel in hxs.xpath("//li[@class='result-row']/p"):
            item = CraigslistItem()
            # Title text from the 'a' element.
            item['title'] = sel.xpath("a/text()").extract()
            # Href/URL from the 'a' element.
            item['link'] = sel.xpath("a/@href").extract()
            # Price from the result-price span; None when a listing has no price.
            item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
            items.append(item)
        print(items)
        return items  # Scrapy logs the returned items and passes them to any export.
```