<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Web Scraping and Spiders With Scrapy

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

_Modified for DSI-EAST-2 by Justin Pounders_

---

### Learning Objectives
- Decipher the structure and content of HTML
- Use Beautiful Soup to parse HTML
- Use XPath to select HTML elements
- Practice using Scrapy to get data from Craigslist
- Walk through the construction of a spider built using Scrapy


<a id='introduction'></a>

**What is HTML?**

One of the largest sources of data in the world is all around us — the web. Most people consume the web in some form every day. One of the most powerful Python tool sets we'll learn allows us to **extract and normalize data from unstructured sources such as _web pages_**.  

**If you can see it, it can be scraped, mined, and put into a DataFrame.**

Before we begin the actual process of web scraping with Python, it's important to cover the basic constructs that describe HTML as unstructured data. 

We'll then cover a powerful selection technique called **XPath** and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org) _(which is officially described as an open source and collaborative framework for extracting the data from websites)_. However, many websites limit scraping, so Scrapy may not work on them.

<a id='html'></a>

## Hypertext Markup Language (HTML)

---

In the HTML document object model (DOM) we introduced in the previous lesson, **everything is a node**:
 * The document itself is a _document node_.
 * All HTML elements are _element nodes_.
 * All HTML attributes are _attribute nodes_.
 * Text inside HTML elements forms _text nodes_.
 * Comments are _comment nodes_.
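
As a rough illustration of these node types, the sketch below uses Python's built-in `html.parser` module (not Beautiful Soup or Scrapy) to record each node it meets in a tiny made-up document. The `NodeLister` class and the sample markup are assumptions introduced only for this example.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Collect (node_type, value) pairs while parsing."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # Each opening tag is an element node; its attributes are attribute nodes.
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))


sample = '<p id="intro">Hello<!-- a note --></p>'  # hypothetical markup
lister = NodeLister()
lister.feed(sample)
lister.close()

for node_type, value in lister.nodes:
    print(node_type, "->", value)
```

The document node itself is not listed here because the parser only reports what it finds inside the document.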

<a id='elements'></a>
### Elements
Elements begin and end with **opening and closing "tags",** which are tag names enclosed in angle brackets. The names in the ***opening and closing tags must be the same***; the closing tag adds a forward slash before the name.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

When you have several different titles or paragraphs on a single page, you can assign **ID values to elements to create unique reference points**. IDs are also useful for ***labelling nested elements***, as we'll see below.
```html
<title id ='title_1'>I am the first title.</title>
<p id ='para_1'>I am the first paragraph.</p>
<title id ='title_2'>I am the second title.</title>
<p id ='para_2'>I am the second paragraph.</p>
```


**Elements can have parents and children.**
It's important to remember that an element can be **both a parent and a child** — whether an element is referred to as a parent or child depends on the specific element you're referencing.


```html
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent.'
        <div id = 'child_2'>I am the child of 'child_1.'
            <div id = 'child_3'>I am the child of 'child_2.'
                <div id = 'child_4'>I am the child of 'child_3.'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```html
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2.'
        <div id = 'child_2'>I am the parent of 'child_3.'
            <div id = 'child_3'> I am the parent of 'child_4.'
                <div id = 'child_4'>I am not a parent. </div>
            </div>
        </div>
    </div>
</body>
```
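
The parent/child relationships above can also be listed programmatically. This is a minimal sketch using the standard library's `xml.etree.ElementTree` as a stand-in parser (it needs well-formed markup, which this small fragment is); the `parent_of` dictionary is a name made up for the example.

```python
import xml.etree.ElementTree as ET

nested = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3">
                <div id="child_4"></div>
            </div>
        </div>
    </div>
</body>
"""

root = ET.fromstring(nested.strip())

# ElementTree elements do not store their parent, so build a child -> parent map.
parent_of = {
    child.get("id"): parent.get("id")
    for parent in root.iter()
    for child in parent
}

for child_id, parent_id in parent_of.items():
    print(f"{child_id} is a child of {parent_id}")
```

Note that `child_1` through `child_3` appear both as keys (children) and as values (parents), which is the point made above: the role depends on which element you start from.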

<a id='attributes'></a>
### Attributes

HTML elements can also have attributes. They describe the **properties and characteristics of elements**. Some affect how the element behaves or looks in terms of the output rendered by the browser.

One of the most common elements is the anchor element `<a>`. Its most important attribute is `href`, which tells the browser where to go when the link is clicked _(the link's destination)_. Browsers typically render an anchor in a distinct color, often underlined, as a visual cue to differentiate it.

**The markup that describes an `anchor element <a>` with `href` attribute looks like this:**

```html
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this:**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
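
To see how an element, its attributes, and its text are separate pieces of the same tag, here is a small sketch that parses the anchor above with the standard library's `xml.etree.ElementTree`. It is used only as a convenient stand-in for this one well-formed tag; the lesson's actual tools come later.

```python
import xml.etree.ElementTree as ET

markup = '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
anchor = ET.fromstring(markup)

print(anchor.tag)          # the element name
print(anchor.attrib)       # all attributes as a dictionary
print(anchor.get("href"))  # the link's destination
print(anchor.text)         # the text a visitor sees
```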

<a id='element-hierarchy'></a>
### Element Hierarchy 
_(DOM)_

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally represented as:**

```html
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```

<a id='html-resources'></a>
### Additional HTML Resources

Read all about the different elements supported by modern browsers:
 * [HTML5 cheat sheet](http://websitesetup.org/html5-cheat-sheet/).
 * [Mozilla HTML element reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
 * [HTML5 visual cheat sheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf).
 

<a id='practical'></a>

## Using Requests and Beautiful Soup to Extract Information From a Web Page

---

**Beautiful Soup is a Python library that's useful for pulling data out of HTML and XML files**. It works with several parsers, such as `lxml` and Python's built-in `html.parser`, and it can be run interactively (for example, in a notebook), which makes it easier to work with when first extracting information from HTML.

First, make sure that the required packages are installed: 

```bash
# Beautiful Soup (in command prompt):
> conda install beautifulsoup4
> conda install lxml

# Or if conda doesn't work (in jupyter):
> pip install beautifulsoup4
> pip install lxml # lxml provides a simple, powerful API for parsing XML and HTML in Python. Parsing means converting raw text into a structured object, much as pandas read_csv converts a CSV file into a DataFrame.
```

In [1]:
from bs4 import BeautifulSoup

In [2]:
# checkout sample.html from directory, '../' goes back one directory from solution code-->skip to run on starter_code,
# since sample.html file is on the same directory level as starter_code
soup = BeautifulSoup(open("../sample.html"), "lxml") 

In [3]:
soup.extract()# extract html document

<!DOCTYPE html>
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<h1>Header 1</h1>
<h2>Header 2</h2>
<p>This is a paragraph</p>
<a href="https://www.google.com/">Google it!</a>
<h3>What's in a div?</h3>
<div class="divvy-it-up" id="foobar">
<p id="layer1">I'm in a div.  Yeah!</p>
<div>
<p id="layer2">I'm in a div, too!</p>
</div>
</div>
<div class="todo">
<ul>
<li> Take out trash</li>
<li> Walk dog</li>
</ul>
</div>
<div class="something">
<ol>
<li>One</li>
<li>Two</li>
</ol>
</div>
</body>
</html>

In [4]:
# more self-explanatory methods to print specific html elements
print(soup.title)
print(soup.title.text)

<title>Hello, World!</title>
Hello, World!


In [5]:
olist = soup.find('ol') # find ordered list from html doc
# find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters
for list_item in olist.find_all('li'): # finding + extracting elements within 'ol' 
    print(list_item.text) # .text finds the text from the given tag, 'li'

One
Two


[Additional reference](https://stackoverflow.com/questions/59780916/what-is-the-difference-between-find-and-find-all-in-beautiful-soup-python) on `find()` vs `find_all()`

In [6]:
# let's use find_all to get the url link within, 1st we can specify the anchor element 'a'
soup.find_all('a')

[<a href="https://www.google.com/">Google it!</a>]

In [7]:
# unpack the output from within list
soup.find_all('a')[0]

<a href="https://www.google.com/">Google it!</a>

In [8]:
# finally, pass additional filter
soup.find_all('a')[0]['href']

'https://www.google.com/'

In [9]:
# another example to extract what's within the paragraph id="layer2", within body-->div class="divvy-it-up"
div_results = soup.body.find_all('div', {'class':'divvy-it-up'})
div_results

[<div class="divvy-it-up" id="foobar">
 <p id="layer1">I'm in a div.  Yeah!</p>
 <div>
 <p id="layer2">I'm in a div, too!</p>
 </div>
 </div>]

In [10]:
div_results[0].find_all('p', {'id':'layer2'})

[<p id="layer2">I'm in a div, too!</p>]

- I like to think of this HTML parsing as starting at the top and peeling off the layers of the HTML file layer by layer until we get what we want.

### Let's buy a car!

![](../assets/craigslist.jpg)

Suppose we want to buy this one: https://atlanta.craigslist.org/atl/ctd/d/miami-2016-toyota-tacoma-double-cab/7408971613.html

<a id='step1'></a>
### 1) Fetch the content by URL.


In [11]:
import requests # http library for python
from bs4 import BeautifulSoup

# Target web page with 1 car listing:
url = "https://atlanta.craigslist.org/atl/ctd/d/miami-2016-toyota-tacoma-double-cab/7408971613.html"

# Establishing the connection to the web page using the .get() method:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print('Status Code: ',response.status_code)

# Pull the HTML string out of requests and convert it to a Python string.
html = response.text

# The first 700 characters of the content.
print("\nFirst part of HTML document fetched as string:\n")
print(html[:700])

Status Code:  200

First part of HTML document fetched as string:

<!DOCTYPE html>
<html>
<head>
    
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=Edge">
	<meta name="viewport" content="width=device-width,initial-scale=1">
	<meta property="og:site_name" content="craigslist">
	<meta name="twitter:card" content="preview">
	<meta property="og:title" content="2016 Toyota Tacoma Double Cab - Call Now! - cars &amp; trucks - by...">
	<meta name="description" content="2016 Toyota Tacoma Double Cab TRD Sport Pickup 4D 6 ft - $15,950.00 Call us today! 2016 Toyota Tacoma Double Cab For Sale by Elite Motor Cars of Miami LLC Vehicle Description for this Toyota Tacoma...">
	<meta property="og:description" content="2016 Toyota Tacoma Double Cab 


More information on [request status codes](http://www.restapitutorial.com/httpstatuscodes.html).
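
If you want a quick reminder of what a status code means without leaving Python, the standard library's `http.HTTPStatus` enum carries the official phrases. The helper name `describe_status` below is made up for this sketch.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code, e.g. '404 Not Found'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not a registered HTTP status.
        return f"{code} (unknown status)"


for code in (200, 400, 403, 404):
    print(describe_status(code))
```

With `requests`, you could pass `response.status_code` to this helper, or call `response.raise_for_status()` to turn 4xx and 5xx responses into exceptions.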

<a id='step2'></a>
### 2) Parse the HTML document with Beautiful Soup.

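This step allows us to access specific elements of the document by tag name, attribute, or CSS selector. (Beautiful Soup itself does not evaluate XPath expressions; we'll use XPath with Scrapy later on.)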

In [12]:
soup = BeautifulSoup(html, 'lxml') # parsing the html document

In [13]:
# Scraped output from url
soup.html 

<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="craigslist" property="og:site_name"/>
<meta content="preview" name="twitter:card"/>
<meta content="2016 Toyota Tacoma Double Cab - Call Now! - cars &amp; trucks - by..." property="og:title"/>
<meta content="2016 Toyota Tacoma Double Cab TRD Sport Pickup 4D 6 ft - $15,950.00 Call us today! 2016 Toyota Tacoma Double Cab For Sale by Elite Motor Cars of Miami LLC Vehicle Description for this Toyota Tacoma..." name="description"/>
<meta content="2016 Toyota Tacoma Double Cab TRD Sport Pickup 4D 6 ft - $15,950.00 Call us today! 2016 Toyota Tacoma Double Cab For Sale by Elite Motor Cars of Miami LLC Vehicle Description for this Toyota Tacoma..." property="og:description"/>
<meta content="https://images.craigslist.org/00g0g_88Ey8KXYStuz_0ak06T_600x450.jpg" property="og:image"/>
<meta content="https://atlanta.craigslist.

In [14]:
# Singular element:
soup.html.title

<title>2016 Toyota Tacoma Double Cab - Call Now! - cars &amp; trucks - by...</title>

In [15]:
# Just the "text" between the tags: (notice the <title> tags get dropped)
print(soup.html.title.text)

2016 Toyota Tacoma Double Cab - Call Now! - cars & trucks - by...


In [16]:
# Find single or multiple elements.
# First parameter:
element = soup.find_all("a", {"class": "header-logo"}) # find header-logo in the "soup.html" output
element[0]

<a class="header-logo" href="/" name="logoLink">CL</a>

## Leo's Advice on Scraping
- It's always advisable to right-click the item you want to extract on the web page and click Inspect.
- In the window that opens, you can directly see the tag, the class, etc. of the item you want, as shown below. The item you selected is already highlighted.
- It's then straightforward to write the `soup.find_all()` call (`findAll()` is its older alias), as shown in the next cell.



![](../assets/inspect.png)

In [17]:
price_search = soup.findAll('span', {"class": "price"}) # find price in the "soup.html" output
price_search[0].text

'$15950.00'

<a id='step3'></a>
### 3) Let's try to create a dataframe with ALL car listings.

In [19]:
# update our url to obtain ALL car listings in ATL:
response = requests.get("https://atlanta.craigslist.org/search/cto")

In [20]:
soup = BeautifulSoup(response.text, "lxml") # repeat step2 from above
result_list = soup.find_all('div', {'class':'result-info'}) # explore from url-->right click on a car listing-->inspect
print(len(result_list)) # confirm number of listings per page

120


In [21]:
# extracting only car name (text), price and hood details per listing to pass into dataframe
results = []
for result in result_list: # loop over each car listing
    car = {} # empty dict to house text, price, hood per car listing
    car['text'] = result.find('a', {'class':'result-title hdrlnk'}).text # use inspect on url to cross-check
    car['price'] = int(result.find('span', {'class':'result-price'}).text.replace('$','').replace(',',''))
    hood = result.find('span', {'class':'result-hood'})
    car['hood'] = hood.text.replace('(','').replace(')','') if hood else None # null if there is no hood information
    results.append(car)
    
print(results[0]) # print 1st car listing to verify

{'text': '1998 FORD ECONOLINE E250', 'price': 3000, 'hood': ' Atlanta city of atlanta '}
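
The loop above assumes every listing has a price; if `result.find(...)` returns `None`, the `.text` lookup raises an error. The printed `hood` value also keeps its surrounding spaces. A minimal sketch of two defensive helpers follows; `parse_price` and `clean_hood` are hypothetical names, not part of the original notebook.

```python
def parse_price(raw):
    """Turn a string such as '$3,000' into an int; return None if that is not possible."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return int(float(cleaned))
    except (ValueError, OverflowError):
        return None


def clean_hood(raw):
    """Strip the parentheses and surrounding whitespace from a neighbourhood string."""
    if raw is None:
        return None
    return raw.replace("(", "").replace(")", "").strip()


print(parse_price("$3,000"))
print(parse_price(None))
print(clean_hood(" (Atlanta city of atlanta) "))
```

Inside the loop, you would pass the tag's `.text` when the tag exists and `None` otherwise, in the same way the `hood` line already does.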


In [22]:
import pandas as pd
pd.DataFrame(results)

| | text | price | hood |
|---|---|---|---|
| 0 | 1998 FORD ECONOLINE E250 | 3000 | Atlanta city of atlanta |
| 1 | H1 Hummer | 25000 | Oxford city of atlanta |
| 2 | 2006 FORD POLICE INTERCEPTOR | 4900 | Marietta city of atlanta |
| 3 | 2009 Ford Crown Victoria Police Interceptor | 6000 | Marietta city of atlanta |
| 4 | 2004 Nissan frontier | 3900 | Stockbridge city of atlanta |
| ... | ... | ... | ... |
| 115 | F150 | 3000 | Stone mountain otp east |
| 116 | 2014 Kia Soul + | 8995 | Lilburn otp east |
| 117 | 2005 NISSAN UD 2300 LP DIESEL 75K MILES | 18500 | Duluth city of atlanta |
| 118 | 2013 Volkswagen Passat SE | 13500 | Brookhaven city of atlanta |


**Additional Practice**

- How would you get the next 120 results?
    - One possible solution, which loops over all pages, is shown below.
- How would you get the text associated with a particular car?
    - Self-practice to modify from above code

In [23]:
# to automate ALL listings on the link:
num_results = 0
results = []

while True:
    # Update the URL for each page
    url = f'https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s={num_results}' # url updates every 120 listings(click next to confirm)
    print(url)
    
    response = requests.get(url).text
    soup = BeautifulSoup(response, "lxml")
    
    if num_results==0:
        # For the first page, find the total number of results
        max_results=int(soup.find('span', {'class':'totalcount'}).text)
        print(f'Max results: {max_results}')
    
    # Find all postings
    result_list = soup.find_all('div', {'class':'result-info'})
    
    # Iterate over each posting to find text, price and hood
    for result in result_list:
        car = {}
        car['text'] = result.find('a', {'class':'result-title hdrlnk'}).text
        car['price'] = int(result.find('span', {'class':'result-price'}).text.replace('$','').replace(',',''))
        hood = result.find('span', {'class':'result-hood'})
        car['hood'] = hood.text.replace('(','').replace(')','') if hood else None
        results.append(car)
    
    # Increment to the next page
    num_results=num_results+120
    
    # Break the loop if the next page is out of the max number of listings
    if num_results>=max_results:
        break

https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=0
Max results: 2997
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=120
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=240
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=360
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=480
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=600
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=720
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=840
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=960
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=1080
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=1200
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=1320
https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto?s=1440
https://atlanta.craigslist.org/d/cars-trucks-by-owner/sea
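
The paging arithmetic in the loop above can be checked on its own, without sending any requests. This sketch assumes 120 listings per page and the total of 2997 printed above; `page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://atlanta.craigslist.org/d/cars-trucks-by-owner/search/cto"
PAGE_SIZE = 120  # listings per page, as observed above


def page_urls(max_results, page_size=PAGE_SIZE):
    """Yield one search URL per page of results."""
    for offset in range(0, max_results, page_size):
        yield f"{BASE_URL}?{urlencode({'s': offset})}"


urls = list(page_urls(2997))
print(len(urls))
print(urls[0])
print(urls[-1])
```

Using `range()` with a step gives the same offsets as the `while` loop and makes the stopping condition explicit.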

<a id='xpath'></a>

## What is XPath?

---

![](../assets/obama_wiki.png)

Understanding how to identify elements and attributes within HTML documents gives us the ability to write simple expressions that turn a web page into structured data, as we did above when going from a web page to a DataFrame.

We can think of **XPath like a _query language_ for HTML**.

To simplify this process, we'll be using the Chrome extension XPath Helper. It's not necessary but highly recommended when building XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en).

XPath expressions can select elements, element attributes, and element text. These selections can apply to a single item or multiple items. Generally, if you're not specific enough, you'll end up selecting multiple elements.


<a id='multiple-selections'></a>
### Multiple Selections

***Multiple selections*** are useful for capturing search results or any repeating element. For instance, the _titles_ from apartment listing search results on Craigslist.


**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**Example HTML Markup** _(varies depending on listing changes, but is a fragment of what we can see from "inspect" window of an apartment listing)_
```html
...
<a href="https://sfbay.craigslist.org/sfc/apa/d/san-francisco-spacious-top-floor-bdr/7394875800.html" data-id="7394875800" class="result-title hdrlnk" id="postid_7394875800">/:/:/:/:/: Spacious top floor 1 Bdr Apartment /:/:/:/:/:</a>
...
```

**XPath: Multiple Titles** _Copy this into the XPath Helper query box (Ctrl + Shift + X opens the query box while you're on the apartment listings page in Chrome)_:
```
//a[@class='result-title hdrlnk']
```

**Returns (All Ad Titles on page)**
```
/:/:/:/:/: Spacious top floor 1 Bdr Apartment /:/:/:/:/:
Edwardian Charm, Modern Convenience, Spacious 2 BR Apt
2Bedrooms / 1 Bath $2,000
Beautifully maintained and updated Pac Heights condo - 2 Bed 2 Bath
etc...
```
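
To try a multiple selection without a browser, the sketch below runs a similar query with the standard library's `xml.etree.ElementTree`. Note the assumptions: ElementTree supports only a subset of XPath (paths start with `.//` instead of `//`, and there is no `text()` or `@href` step), and the markup is a made-up, well-formed stand-in for a results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a search results page.
page = """
<ul>
    <li><a class="result-title hdrlnk" href="/apa/1.html">Spacious top floor 1 Bdr Apartment</a></li>
    <li><a class="result-title hdrlnk" href="/apa/2.html">Edwardian Charm, Modern Convenience</a></li>
    <li><a class="other" href="/about">About</a></li>
</ul>
"""

root = ET.fromstring(page.strip())

# './/' searches all descendants, like '//' in a full XPath engine.
matches = root.findall(".//a[@class='result-title hdrlnk']")

titles = [a.text for a in matches]        # element text
links = [a.get("href") for a in matches]  # attribute values

print(titles)
print(links)
```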

<a id='singular-selections'></a>

### Singular Selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements. Here's an example of a details page on Craigslist:


**HTML Markup**

```html
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
```

**XPath: Single Item** --> *gets 2nd p class='postinginfo', value inside time tag*

```
//p[@class='postinginfo'][2]/time 
```
**Returns (Time of Posting or Age of Post)**
```
2016-01-12 11:23pm
```
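
The same single-item query can be approximated with the standard library, again as a sketch rather than the XPath Helper workflow itself. It uses a trimmed copy of the markup above and ElementTree's limited XPath subset; there, the positional predicate `[2]` counts among sibling `<p>` elements, which gives the same result here because every `<p>` has the same class.

```python
import xml.etree.ElementTree as ET

fragment = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
</div>
"""

root = ET.fromstring(fragment.strip())

# find() returns only the first match, which suits a singular selection.
time_el = root.find(".//p[@class='postinginfo'][2]/time")

posted_text = time_el.text            # text between the <time> tags
posted_iso = time_el.get("datetime")  # machine-readable attribute value

print(posted_text)
print(posted_iso)
```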

<a id='scrapy'></a>

## A Simple Example Using Scrapy and XPath

---

Below is an example of how to get information out of fake HTML using the XPath capabilities of the **Scrapy** package. You'll likely need to install the Scrapy package using `conda install scrapy`.   

**Note:** `conda install` also pulls in Scrapy's compiled dependencies (such as `lxml`) as prebuilt packages. `pip install scrapy` installs the Python dependencies too, but on some systems it may need extra build tools for those compiled libraries.

We'll use the `Selector` class from the Scrapy library to help us construct our query.

A `Selector` takes the HTML target as an argument and supports several query types for extracting information. In this case, we'll call its `xpath()` method, which evaluates expressions written in the XPath language.

Just like with writing Python scripts, there are several ways you can access the exact same information in HTML. Let's try a few.

In [23]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# HTML structure string:
HTML = """
<body>
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
</body>
"""

In [24]:
# Option 1: Use the exact class name to get its associated text.
best = Selector(text=HTML).xpath("//span[@class='bestof-text']/text()").extract()
best

['best of']

In [25]:
# Option 2: Use the contains() function to extract any text that includes the text 'best of.'
best = Selector(text=HTML).xpath("//span[contains(text(), 'best of')]/text()").extract()
best

['best of']

In [26]:
# Option 3: Grabs the entire HTML post where class='bestof-link'.
best =  Selector(text=HTML).xpath("/html/body/div/p/a[@class='bestof-link']")
# Parse the first grabbed chunk of the text for the specific element with class='bestof-text'.
print(best)

nested_best =  best.xpath("./span[@class='bestof-text']/text()").extract()
print(nested_best)

[<Selector xpath="/html/body/div/p/a[@class='bestof-link']" data='<a class="bestof-link" data-flag="9" ...'>]
['best of']


**Note:** _Option 3 will probably be the most common because there's a good chance you'll want to grab information from several child elements that exist within one parent element._

## [Where's Waldo?](https://waldo.fandom.com/wiki/Waldo) — XPath Edition

In this example, we'll find Waldo together. Find Waldo as:

- An element.
- An attribute.
- A text element.

In [27]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo I'm not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill Gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">Parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

- Find the **element** 'Waldo'

In [28]:
# Text contents of the <waldo> element, using the syntax we saw previously:
Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

['Waldo']

- Find the **attribute(s)** 'Waldo'

We can use the asterisk character `*` as a placeholder for "**All possible**."

In [29]:
# All elements that have any attribute whose value is "waldo":
Selector(text=HTML).xpath('//*[@*="waldo"]').extract()

['<ul id="waldo">\n            <li class="waldo">\n                <span> yo I\'m not here</span>\n            </li>\n            <li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n            <li class="waldo">Last Location:  ???</li>\n            <li class="nerds">\n                <div class="alpha">Bill Gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">Parker</div>\n            </li>\n        </ul>\n        \n        <ul id="tim">\n            <li class="tdawg">\n                <span>yo im here</span>\n            </li>\n        </ul>\n        <li>stuff</li>\n        <li>stuff2</li>\n        \n        <div id="cooldiv">\n            <span class="dsi-rocks">\n               YO!\n            </span>\n        </div>\n        \n        \n        <waldo>Waldo</waldo>\n    </body>\n</html>\n',
 '<li class="waldo">\n                <span> yo I\'m not here<

In [30]:
# All elements whose class attribute is "waldo":
Selector(text=HTML).xpath('//*[@class="waldo"]').extract()

['<li class="waldo">\n                <span> yo I\'m not here</span>\n            </li>\n            <li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n            <li class="waldo">Last Location:  ???</li>\n            <li class="nerds">\n                <div class="alpha">Bill Gates</div>\n                <div class="alpha">Zuckerberg</div>\n                <div class="beta">Theil</div>\n                <div class="animal">Parker</div>\n            </li>\n        </ul>\n        \n        <ul id="tim">\n            <li class="tdawg">\n                <span>yo im here</span>\n            </li>\n        </ul>\n        <li>stuff</li>\n        <li>stuff2</li>\n        \n        <div id="cooldiv">\n            <span class="dsi-rocks">\n               YO!\n            </span>\n        </div>\n        \n        \n        <waldo>Waldo</waldo>\n    </body>\n</html>\n',
 '<li class="waldo">Height:  ???</li>\n            <li class="waldo">Weight:  ???</li>\n   

- Find the **text element** 'Waldo'

In [31]:
# All elements whose text is exactly 'Waldo':
Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

['<waldo>Waldo</waldo>\n    </body>\n</html>\n']
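
For comparison, here is a sketch of the same three searches using only the standard library. It is a stand-in, not the Scrapy approach shown above: ElementTree has no `@*` wildcard, so the attribute search is done with a plain Python loop, and the markup is a shortened, well-formed version of the Waldo document.

```python
import xml.etree.ElementTree as ET

doc = """
<body>
    <ul id="waldo">
        <li class="waldo">Height: ???</li>
        <li class="waldo">Weight: ???</li>
        <li class="nerds">Bill Gates</li>
    </ul>
    <waldo>Waldo</waldo>
</body>
"""

root = ET.fromstring(doc.strip())

# 1) Waldo as an element: match on the tag name.
as_element = root.findall(".//waldo")

# 2) Waldo as an attribute value: check every attribute of every element.
as_attribute = [el for el in root.iter() if "waldo" in el.attrib.values()]

# 3) Waldo as text: elements whose own text is exactly 'Waldo'.
as_text = [el for el in root.iter() if (el.text or "").strip() == "Waldo"]

print([el.tag for el in as_element])
print([(el.tag, el.attrib) for el in as_attribute])
print([el.tag for el in as_text])
```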

- Personally, I only use Beautiful Soup. It's been able to solve most of my web scraping needs so far.

<a id='scrapy-spiders'></a>
## What is a Scrapy Spider? 
<mark style="background-color: lightgray">_[BONUS knowledge. Not required for the lab. Craigslist has restrictions in place to prevent crawling, so following the steps below may not retrieve any content.]_</mark>

---

> *"[Scrapy](http://scrapy.org/) is an application framework for writing web spiders that "crawl" around websites and extract data from them."*

Below we'll walk through the creation of a **spider** using Scrapy. Spiders are automated processes that will crawl through a web page or web pages to collect information.

> **Note:** This code should be written in a script outside of Jupyter.

<a id='scrapy-project'></a>
### 1) Create a new Scrapy project.

In your terminal, `cd` into a directory where you want to create your spider's folder. We recommend the desktop for easy access to the files.
> `scrapy startproject craigslist`

**It should create an output that looks like this:**

```
New Scrapy project 'craigslist', using template directory '/Users/jmpounders/anaconda3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/jmpounders/dsi-east-2/scrapy/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```

**That command generates a set of project files:**

```
├── craigslist
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg
```

Generally, these are our files. We'll go into more detail on these soon.

 * **`scrapy.cfg`:** The project's configuration file.
 * **`craigslist/`:** The project’s Python module — you’ll import your code from here later.
 * **`craigslist/items.py`:** The project’s items file.
 * **`craigslist/pipelines.py`:** The project’s pipelines file.
 * **`craigslist/settings.py`:** The project’s settings file.
 * **`craigslist/spiders/`:** A directory where you’ll store your spiders.
 
Please also add this line to your `craigslist/settings.py` file (at the last line) before continuing:
 
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
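
Because many sites limit automated access, it may also be worth reviewing the politeness settings in the same file. The values below are illustrative assumptions, not part of the original lesson; `ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY`, and `AUTOTHROTTLE_ENABLED` are standard Scrapy settings.

```python
# craigslist/settings.py (excerpt) -- example values only

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```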



--- 
<a id='define-item'></a>
### 2) Define an "item."

When we define an **item**, we're telling our new application **what it will be collecting**. In essence, an item is an entity that has attributes ("title," "description," "price," etc.) that are descriptive and relate to elements on pages we'll be scraping.  

In more precise terms, this is a model (for those who are familiar with object-relational mapping or relational database terms). Don't worry if this is a foreign concept.  The main idea is to understand that a model has attributes that closely resemble or relate to elements on our target web page(s).

**Paste this inside `items.py`:** 
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items.
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3) A spider that crawls.

An item is a model that resembles data on a web page. A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are Python classes. Let's write our first file, called `craigslist_spider.py`, and put it in our `/spiders` directory.**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "https://atlanta.craigslist.org/search/cto"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory.**

Type below in command prompt, from inside the `/craigslist/craigslist` directory:
```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It parsed over the content containing the HTML markup of each request URL.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in the directory we ran the crawl from. The file is named after the second-to-last segment of the URL, so for our `start_urls` entry it should be called "search." (The same code run against `http://sfbay.craigslist.org/search/sfc/apa` would produce "sfc.") This is taken directly from the Scrapy docs and its only point is to illustrate the workflow so far. It's nice to have a reference to our HTML file.  
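
The naming rule in `parse()` is easy to check without running a crawl. This sketch wraps the same expression in a helper (the name `filename_for` is made up) and applies it to the Atlanta URL from `start_urls` and to the San Francisco URL used earlier in this lesson.

```python
def filename_for(url):
    """Mirror the spider's naming rule: the second-to-last '/'-separated piece of the URL."""
    return url.split("/")[-2]


print(filename_for("https://atlanta.craigslist.org/search/cto"))   # search
print(filename_for("http://sfbay.craigslist.org/search/sfc/apa"))  # sfc
```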

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4) XPath and parsing with our spider.

So far, we've defined the fields we'll get, some URLs to fetch, and saved some content to a file. Now, it's about to get interesting.

**We should let our spider know about the item model we created earlier. In the head of the `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import.**

```python
from craigslist.items import CraigslistItem
```

> **Check:** Why won't it work otherwise?

<br><br><br>
**Let's replace our `parse()` method to find some data from our Craigslist spider response and map them to our item model, `CraigslistItem`.**


```python
from scrapy.selector import Selector  # Place with the other imports at the top of the file.

def parse(self, response):  # Replaces the earlier parse() method inside CraigslistSpider.
    items = []  # List for storing scraped items.
    hxs = Selector(response)  # Selector lets us query the HTML of the response (target website).
    for sel in hxs.xpath("//li[@class='result-row']/p"):  # Each matching <p> holds one listing.
        item = CraigslistItem()
        item['title'] = sel.xpath("a/text()").extract()  # Title text from the 'a' element.
        item['link'] = sel.xpath("a/@href").extract()  # Href/URL from the 'a' element.
        # Price from the result-price class nested in span elements; None if the listing has no price.
        item['price'] = sel.xpath('span/span[@class="result-price"]/text()').extract_first()
        items.append(item)
    return items  # Scrapy collects the returned items and logs them in the terminal output.

```



---

<a id='save-examine'></a>
### 5) Save and examine our scraped data.

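We can export our crawled data in CSV format. To save our data, we just need to pass a few optional parameters to our crawl call. (Recent Scrapy releases infer the format from the `.csv` file extension and treat the `-t` option as deprecated, so drop `-t csv` if the command complains.)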

<blockquote>
```
> scrapy crawl craigslist -o items.csv -t csv
```
</blockquote>

It's always good to iteratively check the data when developing a spider to make sure the set is close to what we want. 

> *Pro tip: The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called '`items.csv`' in the directory from which you ran the `scrapy crawl` command.

<a id='addendum'></a>
## Addendum: Leveraging XPath to Get More Results

---

Generally, a workflow that's useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and, from that, derive your own XPath expressions based on the output.

`text()` selects only the **text of a given element (_between the tags_)**, and `@attribute_name` is used to select **attributes**.

**Example of `text()`:**

```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```

The XPath selector for this:

```
//h1/text()
```

**Example of `attributes`:**

Consider the below description contained inside a `<div>` tag with `id="description"`:

```
<h2>Description:</h2>

<div id="description">
Short documentary made for the Plymouth City Museum and Art Gallery regarding the set up of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```

The XPath selector for this:
```
//div[@id='description']
```

---
<a id='follow-links'></a>
### Following Links for More Results

About a hundred results per page is pretty good, but what if we want more? We need to follow the "next" links and find new pages to grab. To do that, the **`parse()`** method of our spider class needs to return another type of object: a request for the next page, in addition to the scraped items.

See [Stack Overflow](https://stackoverflow.com/questions/30152261/make-scrapy-follow-links-and-collect-data) for details!
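
One detail of following links is that the `href` of a "next" link is often relative, so it has to be resolved against the current page's URL before it can be requested. A minimal sketch with the standard library follows; the `next_href` value is a hypothetical example, not taken from a live page.

```python
from urllib.parse import urljoin

current_page = "https://atlanta.craigslist.org/search/cto"
next_href = "/search/cto?s=120"  # hypothetical value of a "next" link's href

# Resolve the relative link against the page it was found on.
next_url = urljoin(current_page, next_href)
print(next_url)
```

Inside a spider, Scrapy's `response.urljoin()` does the same resolution, and `response.follow()` builds the follow-up request directly from a relative link.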
