<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to HTML and XPath

_Authors: Kiefer Katovich (SF), Dave Yerrington (SF)_

---

### Learning Objectives
- Understand scraping basics
- Get familiar with import.io service
- Understand the structure and content of HTML
- Utilize XPath to extract information from HTML 


### Student Pre-Work
*Before this lesson, you should already:*
- Understand basic HTML concepts
- Have worked with Beautiful Soup
- Have signed up for import.io

### Lesson Guide

- [Introduction](#introduction)
- [HTML](#html)
    - [Elements](#elements)
    - [Attributes](#attributes)
- [What is XPath?](#xpath)
    - [Absolute References](#xpath_absolute)
    - [Relative References](#xpath_relative)
    - ["Where's Waldo?" Exercise](#waldo_exercise)
- [1 vs. N Selectors](#1_v_n)
- [Demo Code](#demo)
    - [Scrape DataTau](#scrape_tau)
- [Independent Practice](#ind_practice)

---

<a id='introduction'></a>
## Introduction: Scraping Overview (10 min)

Web scraping is a technique for extracting information from websites. It focuses on transforming unstructured data on the web into structured data that can be stored and analyzed.

There are a variety of ways to "scrape" what we want from the web:

- Using third-party services (import.io).
- By writing our own Python apps that pull HTML documents and parse them.
  - Mechanize
  - Scrapy
  - Requests
  - Libxml/XPath
  - Beautiful Soup
  - Regular expressions

## Discussion: What Do You Think Would Be the Most Challenging Aspect of Scraping Information?


_If you were asked to scrape Craigslist property listings and put them in a DataFrame, what would hold you up?_

<a id='html'></a>
## HTML Review

In the HTML document object model (DOM), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside an HTML element is a text node.
 * Comments are comment nodes.
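
To make the node types concrete, here is a minimal sketch using Python's standard-library `html.parser`. It is a tokenizer rather than a full DOM builder, and the `NodeLogger` class and the sample markup are made up for illustration; the point is only to show the kinds of nodes a parser reports.

```python
from html.parser import HTMLParser

class NodeLogger(HTMLParser):
    """Record each node the parser encounters, tagged with its node type."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        if data.strip():  # Skip whitespace-only text between tags.
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))

logger = NodeLogger()
logger.feed('<p id="intro">Hello<!-- a note --></p>')
for node_type, value in logger.nodes:
    print(node_type, value)
```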

## HTML in IPython

You can write raw HTML in a Jupyter notebook Markdown cell, and it renders the same way Markdown-styled text does:
```
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```
<p>I am a paragraph.</p>
<strong>I am bold.</strong>


<a id='elements'></a>
## Elements
Elements begin and end with **opening and closing tags**: a tag name wrapped in angle brackets, with a leading slash in the closing tag.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

_Note: The tags **title, p, and strong** are represented here._

## Element Parent/Child Relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements open and close with the same tag name, like so:**  `<p></p>`

**Elements can have parents and children:**

```html
<body>
    <div>I am inside the parent element.
        <div>I am inside a child element.</div>
        <div>I am inside another child element.</div>
        <div>I am inside yet another child element.</div>
    </div>
</body>
```
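
The parent/child structure above can be walked in code. This is a minimal sketch using the standard-library `xml.etree.ElementTree`, which works here only because the snippet happens to be well-formed XML; real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

snippet = """
<body>
    <div>I am inside the parent element.
        <div>I am inside a child element.</div>
        <div>I am inside another child element.</div>
        <div>I am inside yet another child element.</div>
    </div>
</body>
"""

body = ET.fromstring(snippet)
parent = body[0]         # The outer <div> is the only child of <body>.
children = list(parent)  # Iterating an element yields its direct children.

print(parent.text.strip())
for child in children:
    print("  child:", child.text)
```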

<a id='attributes'></a>

## Element Attributes

Elements can also have attributes! Attributes are defined inside **element tags** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- `href`
- `title`
- `id`
- `name`

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>
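
As a rough illustration of why attributes matter for scraping, the sketch below parses the `<a>` element from the attributes example with the standard-library `xml.etree.ElementTree` (an assumption made for convenience; the rest of this lesson uses Scrapy) and reads its attributes as a dictionary.

```python
import xml.etree.ElementTree as ET

link = ET.fromstring(
    '<a href="http://lmgtfy.com/?q=html+element+attributes" '
    'title="A title" id="web-link" name="hal">A Simple Link</a>'
)

print(link.tag)          # The element name.
print(link.attrib)       # All attributes as a dict.
print(link.get("href"))  # One attribute value by name.
print(link.text)         # The text node inside the element.
```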


## Can You Identify an Attribute, an Element, a Text Item, and a Child Element?

```HTML
<html>
   <title id="main-title">All this scraping is making me itch!</title>
   <body>
       <h1>Welcome to my Home Page</h1>
       <p id="welcome-paragraph" class="strong-paragraph">
           <span>Hello friends, let me tell you about this cool hair product.</span>
           <ul>
              <li>It's cool.</li>
              <li>It's fresh.</li>
              <li>It can tell the future.</li>
              <li>Always be closing.</li>
           </ul>
       </p>
   </body>
```

**Bonus:** What's missing? 

<a id='xpath'></a>

## Enter XPath

XPath uses path expressions to select nodes or node sets in an HTML/XML document. These path expressions look a lot like the expressions you see when you work with a traditional computer file system.

## XPath Features

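XPath 2.0 defines more than 100 built-in functions to help us select and manipulate HTML or XML documents. Libxml, which powers the demos in this lesson, implements XPath 1.0, whose smaller core library covers strings, numbers, Booleans, and node sets. Across versions, XPath has functions for: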

- String values.
- Numeric values.
- Date and time comparison.
- Sequence manipulation.
- Boolean values.
- And more.

## Basic XPath Expressions

XPath comes with a wide array of features, but selecting data is the most common problem it will help you solve.

You'll use **XPath** most often for selecting data from HTML documents. There are two ways you can **select elements** within HTML using **XPath**:

- Absolute references.
- Relative references.

<a id='xpath_absolute'></a>
## XPath:  Absolute References

_For our XPath demonstration, we'll use Scrapy, which uses Libxml under the hood. Libxml provides the basic functionality for XPath expressions._

In [None]:
# pip install scrapy
# pip install --upgrade zope2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

HTML = """
<html>
    <body>
        <span id="only-span">good</span>
    </body>
</html>
"""
# An "absolute" reference: the full path from the document root down to the text node.
Selector(text=HTML).xpath('/html/body/span/text()').extract()


<a id='xpath_relative'></a>
## XPath: Relative References

Relative references in XPath match the "ends" of structures: a leading `//` matches the path that follows it at any depth in the document. Because there's only a single `span` element, `//span/text()` matches **one text node**.

In [None]:
Selector(text=HTML).xpath('//span/text()').extract()
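
The difference between the two styles shows up once a document has the same tag at more than one depth. The sketch below uses the standard-library `xml.etree.ElementTree` instead of Scrapy, which is an assumption made so it runs without extra installs. ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so `./body/span` stands in for `/html/body/span` and `.//span` for `//span`. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html>
    <body>
        <span id="top">good</span>
        <div><span id="nested">also good</span></div>
    </body>
</html>
""")

# Step-by-step path from the root: only the <span> directly under <body>.
direct = doc.findall("./body/span")

# Descendant search: every <span>, however deeply it is nested.
anywhere = doc.findall(".//span")

print([span.get("id") for span in direct])
print([span.get("id") for span in anywhere])
```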

## Selecting Attributes

Attributes can be found **within a tag**, such as `id="only-span"` within our `span` element. We can get an attribute's value by placing the `@` symbol and the attribute name **after** the **element reference**, as in `//span/@id`.


In [None]:
Selector(text=HTML).xpath('//span/@id').extract()

<a id='waldo_exercise'></a>
## Where's Waldo? — XPath Edition (~10 min)

In this example, we will find Waldo together. Find Waldo as:

- An element.
- An attribute.
- A text node.

In [None]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo I'm not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill Gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">Parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

In [None]:
# # Text contents of the <waldo> element:
# Selector(text=HTML).xpath('/html/body/waldo/text()').extract()

# # All elements whose class attribute equals "waldo":
# Selector(text=HTML).xpath('//*[@class="waldo"]').extract()

# # All elements that have any attribute equal to "waldo":
# Selector(text=HTML).xpath('//*[@*="waldo"]').extract()

# # All elements whose text is exactly "Waldo":
# Selector(text=HTML).xpath("//*[text()='Waldo']").extract()

<a id='1_v_n'></a>

## 1 vs. N Selections

When selecting elements via relative reference, it's possible that you'll select multiple items. But, it's still possible to select single items if you're specific enough.

**Singular Reference**
- **Index** starts at **1**.
- Selections by offset.
- Selections by "first" or "last."
- Selections by **unique attribute value**.
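
The selection styles in this list can be sketched with the standard-library `xml.etree.ElementTree`, which supports the positional and attribute predicates used here. Using it instead of Scrapy is an assumption made so the example runs without extra installs, and the markup is a trimmed, made-up version of a page-statistics block.

```python
import xml.etree.ElementTree as ET

stats = ET.fromstring("""
<div class="page-stats-container">
    <li class="item" id="pageviews">1,333,443</li>
    <li class="item" id="last-viewed">01-22-2016</li>
    <li class="item" id="views-per-hour">1,532</li>
</div>
""")

first = stats.find("./li[1]")                   # XPath positions start at 1, not 0.
last = stats.find("./li[last()]")               # last() picks the final match.
by_id = stats.find("./li[@id='last-viewed']")   # A unique attribute value.
everything = stats.findall("./li")              # No predicate: all matches (N).

print(first.text, last.text, by_id.text, len(everything))
```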


In [None]:
HTML = """
<html>
    <body>
    
        <!-- Search Results -->
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=751hUX_q0Do" title="Rappin with Gas">Rapping with gas</a>
           <span class="link-details">This is a great video about gas.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=97byWqi-zsI" title="Casio Rapmap">The Rapmaster</a>
           <span class="link-details">My first synth ever.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=TSwqnR327fk" title="Cinco Products">Cinco Midi Organizer</a>
           <span class="link-details">Midi files at the speed of light.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=8TCxE0bWQeQ" title="Baddest Gates">BBG Baddest Moments</a>
           <span class="link-details">It's tough to be a gangster.</span>
        </div>
        
        <!-- Page stats -->
        <div class="page-stats-container">
            <li class="item" id="pageviews">1,333,443</li>
            <li class="item" id="somethingelse">bla</li>
            <li class="item" id="last-viewed">01-22-2016</li>
            <li class="item" id="views-per-hour">1,532</li>
            <li class="item" id="kiefer-views-per-hour">5,233.42</li>
        </div>
        
    </body>
</html>
"""

# An absolute path also works when a predicate narrows it to a single <li>.
kiefer_views = Selector(text=HTML).xpath('/html/body/div/li[@id="kiefer-views-per-hour"]/text()').extract()
kiefer_views

#### Selecting the first element in a series of elements.

In [None]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[0]  # extract() returns a Python list, so the first match is at index 0.

#### Selecting the last element in a series of elements.

In [None]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[-1]

#### Selecting all elements matching a selection.

In [None]:
Selector(text=HTML).xpath('//span').extract()

### Selecting Elements Matching an _Attribute_

This will be one of the most common ways you will select items. HTML DOM elements are often differentiated by their `class` and `id` attributes. Web developers mainly use these attributes to refer to a specific element, or to a broad set of elements, when applying visual characteristics with CSS.

```HTML 
//element[@attribute="value"]
```

**Generally:**

- "Class" attributes within elements usually refer to multiple items.
- "ID" attributes are supposed to be unique but aren't always.

_CSS stands for cascading style sheets. These are used to abstract the definition of visual elements on a micro and macro scale for the web. They are also our best friend as data miners. They give us strong hints and cues as to how a web document is structured._
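
One caveat worth sketching: `[@attribute="value"]` compares the entire attribute string, so an element with several classes will not match a single class name. The example below uses the standard-library `xml.etree.ElementTree` (an assumption, chosen so it runs without Scrapy) with made-up markup. In a full XPath engine, `contains(@class, "item")` is a common substring-based workaround.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<ul>
    <li class="item">one</li>
    <li class="item featured">two</li>
    <li class="item">three</li>
</ul>
""")

# Exact match: "item featured" is a different string from "item".
exact = doc.findall("./li[@class='item']")

# Token match done in Python: split the class string on whitespace.
has_token = [
    li for li in doc.findall("./li")
    if "item" in li.get("class", "").split()
]

print([li.text for li in exact])
print([li.text for li in has_token])
```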

<a id='demo'></a>

## Let's Code

 - How can we get a series of only text items for the page statistics section of our page?
 - We want to know only how many times Kiefer views the YouTube videos page per hour.

In [None]:
# Get all of the text elements for the page statistics section.
Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li/text()').extract()

In [None]:
#Option 1: Get only the text for "Kiefer's" number of views per hour.
Selector(text=HTML).xpath('//div[@class="page-stats-container"]/li[5]/text()').extract()



In [None]:
#Option 2: Get only the text for "Kiefer's" number of views per hour.
Selector(text=HTML).xpath('//li[@id="kiefer-views-per-hour"]/text()').extract()

## A Quick Note:  Requests

The requests module is the gateway to interacting with the web using Python. We can:

 - Fetch web documents as strings.
 - Decode JSON.
 - Perform basic data munging with web documents.
 - Download static files that are not text:
  - Images.
  - Videos.
  - Binary data.

Take some time and read up on requests:

http://docs.python-requests.org/en/master/user/quickstart/

<a id='scrape_tau'></a>

## Let's Scrape DataTau Headlines

DataTau is a great site for data science news. Let's take their headlines using Python **requests** and practice selecting various elements.

Using the <a href="https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en">XPath Helper Chrome plugin</a> _(cmd-shift-x)_ and the Chrome Inspect feature, let's explore the structure of the page.

_Here's a <a href="https://www.youtube.com/watch?v=i2Li1vnv09U">concise video</a> that demonstrates the basic Inspect feature in Chrome._

In [None]:
# Please only run this frame once to avoid hitting the site too hard all at once.
import requests

response = requests.get("http://www.datatau.com")
HTML = response.text  
HTML[0:150]           # View the first 150 characters of the HTML index document for DataTau.

#### Selecting Only the Headlines

We'll use XPath Helper to inspect the markup that comprises the **title** to find a pattern. As there is more than one **title**, we expect to find a series of elements representing the **title** data we're interested in.

![](https://snag.gy/m4K3UE.jpg)

In this example, we are referencing the **first center**, then the **third table row (`tr[3]`)** within it, then the second **`td` having a class of `"title"` (`td[@class="title"][2]`)** within that row, and finally the text of the anchor tag inside it **(`a/text()`)**.


In [None]:
import pandas as pd

titles = Selector(text=HTML).xpath('//td[@class="title"]/a/text()').extract()
titles[0:10] # The first ten titles.

## How Do We Get the URLs From the Titles?

In [None]:
urls = Selector(text=HTML).xpath('//td[@class="title"]/a/@href').extract()
urls[::-1]  # All URLs in reverse page order; use urls[0:5] for just the first five.
# Example of the markup being matched:
# <a href="http://tech.marksblogg.com/faster-queries-google-cloud-dataproc.html">Thirty-three-times-faster queries on Google Cloud's Dataproc using Facebook's Presto.</a>

#### How can we get the site domain after the title within the parentheses (i.e., stitchfix.com)?

In [None]:
domains = Selector(text=HTML).xpath("//span[@class='comhead']/text()").extract()

In [None]:
domains[0:5]

#### How about the points?

In [None]:
points = Selector(text=HTML).xpath('//td[@class="subtext"]/span/text()').extract()
points[0:5]

#### How about the "More" link?
Hint: You can use `element[text()='exact text']` to find an element whose text matches a specific string.

In [None]:
next_link = Selector(text=HTML).xpath('//a[text()="More"]/@href').extract()
next_link

<a id='ind_practice'></a>

## Independent Practice/Lab

For the next 30 minutes, try to grab the following:

- Story titles.
- Story URL (href).
- Domain.
- Points.

Stretch:
- Author.
- Comment count.

Then, put your results into a DataFrame.

- Perform a basic analysis of domains and point distributions.
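
Scraped values arrive as strings, so some cleanup usually comes before any analysis. The sketch below uses only the standard library and hypothetical sample values; the exact text formats (`" (stitchfix.com) "`, `"12 points"`) are assumptions about what the page returns, so check them against your own results. The same cleaning applies to DataFrame columns.

```python
import re
from collections import Counter
from statistics import mean, median

# Hypothetical scraped values; real results may be formatted differently.
raw_domains = [" (stitchfix.com) ", " (github.com) ", " (github.com) "]
raw_points = ["12 points", "1 point", "7 points"]

# Strip surrounding spaces and parentheses from each domain.
domains = [d.strip(" ()") for d in raw_domains]

# Pull the leading integer out of each points string.
points = [int(re.match(r"\d+", p).group()) for p in raw_points]

domain_counts = Counter(domains)
print(domain_counts.most_common(2))
print(mean(points), median(points))
```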

**Bonus**

Automatically find the next "More" link and mine the next page(s) until none exist. Logically, you can process each page with this pseudocode:

1) Does the next link exist (a tag with `text == "More"`)?
2) Fetch URL, prepended with domain (`datatau.com/(extracted link here)`).
3) Parse the page with `Selector(text=HTML).xpath('').extract()` to find the elements.
4) Add to DataFrame.

_Note: You might want to set a limit — something like 2–3 total requests per attempt — to avoid unnecessary transfer._
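
The pseudocode and the request limit can be sketched as a simple loop. To keep it runnable offline, the example replaces real HTTP requests with a hypothetical in-memory `FAKE_SITE` dictionary; the `crawl` function, its `fetch` argument, and the `fnid` hrefs are all made up for illustration. `urljoin` from the standard library handles prepending the domain to a relative link.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.datatau.com"

# Hypothetical stand-in for the site: URL -> (titles, href of "More" link or None).
FAKE_SITE = {
    "http://www.datatau.com": (["a", "b"], "/x?fnid=page2"),
    "http://www.datatau.com/x?fnid=page2": (["c"], "/x?fnid=page3"),
    "http://www.datatau.com/x?fnid=page3": (["d"], None),
}

def crawl(start_url, fetch, max_pages=3):
    """Collect titles page by page, stopping at max_pages or when no link remains."""
    rows, url, pages = [], start_url, 0
    while url is not None and pages < max_pages:
        titles, more_href = fetch(url)  # Steps 2 and 3: fetch and parse the page.
        rows.extend(titles)             # Step 4: accumulate the results.
        pages += 1
        # Step 1: does a next link exist? If so, make it absolute.
        url = urljoin(url, more_href) if more_href else None
    return rows

titles = crawl(BASE_URL, FAKE_SITE.get, max_pages=2)
print(titles)
```

A loop with an explicit counter makes the request limit easy to see and avoids deep recursion when a site has many pages.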


In [None]:
import requests
import numpy as np
import pandas as pd
from scrapy.selector import Selector

def parse_url(url="http://www.datatau.com", data=None, max_pages=3):
    # max_pages caps the total number of requests, per the note above.
    response  =  requests.get(url)
    sel       =  Selector(text=response.text)  # Parse the page once and reuse it.
    links     =  sel.xpath("//td[@class='title']/a/@href").extract()
    titles    =  sel.xpath("//td[@class='title']/a/text()").extract()
    points    =  sel.xpath("//td[@class='subtext']/span/text()").extract()
    domains   =  sel.xpath("//td[@class='title']/span/text()").extract()
    authors   =  sel.xpath("//td[@class='subtext']/a[contains(@href, 'user')]/text()").extract()
    comments  =  sel.xpath("//td[@class='subtext']/a[contains(@href, 'item')]/text()").extract()

    expected_length = 30

    def pad(values):
        # Trim to expected_length (this drops the "More" link), then fill the end
        # with NaN so every column has the same length when values are missing.
        values = values[:expected_length]
        return values + [np.nan]*(expected_length - len(values))

    scraped = dict(
        titles   =  pad(titles),
        links    =  pad(links),
        points   =  pad(points),
        domains  =  pad(domains),
        authors  =  pad(authors),
        comments =  pad(comments)
    )

    df = pd.DataFrame(scraped)

    # If there's data from earlier pages, append this page to it.
    # If not, it's the first iteration, so there's no need.
    if data is not None:
        data = pd.concat([data, df])
    else:
        data = df

    # Find the "More" link:
    more_anchor  =  sel.xpath("//a[text() = 'More']/@href").extract()

    if len(more_anchor) > 0 and max_pages > 1:
        more_url  =  "http://www.datatau.com%s" % more_anchor[0]
        print("Fetching %s..." % more_url)
        return parse_url(more_url, data=data, max_pages=max_pages - 1)
    else:
        return data.reset_index(drop=True)


df = parse_url("http://www.datatau.com")
df