# Overview
**What is web scraping?** Programmatically fetching web pages and extracting data from them by relying on the semi-predictable structure of HTML.

**Why should web scraping be used?** To obtain data when the server provides no public API, export (e.g. CSV), or database access. These alternatives are listed from most to least desirable.

**When should web scraping be used?** Some data is formatted conveniently as a table and can be “scraped” manually with copy and paste. Other information, such as [flight fares on Kayak](https://www.kayak.com/flights/BOS-DCA/2017-03-04-flexible), is laid out horizontally or scattered throughout the page. In terms of desirability, web scraping ranks no better than database access because it is also highly unreliable and maintenance-intensive. At this tier, the site's developers make no guarantees: they may introduce backward-incompatible changes to the data model or HTML document at any time without notice. Manual labor is often more accurate and cost-effective.

This is not to say that implementing web scraping is inherently a poor decision. Rather, the decision deserves particular thought. In some circumstances, web scraping is the best option, or even the only viable one.
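
Because the HTML can change without notice, one defensive habit is to check the shape of scraped data before trusting it. The following is a minimal sketch of that idea, not a prescribed method: the function name `validate_rows`, the `EXPECTED_HEADER` values, and the sample rows are all hypothetical and chosen only for illustration.

```python
# Hypothetical expected header; adjust to whatever the target table actually uses.
EXPECTED_HEADER = ["code", "name"]


def validate_rows(rows, expected_header=EXPECTED_HEADER):
    """Return the data rows, or raise ValueError if the layout looks different."""
    if not rows:
        raise ValueError("no rows scraped; the page layout may have changed")

    # Compare the first row against the header we expect to see.
    header = [cell.strip().lower() for cell in rows[0]]
    if header != expected_header:
        raise ValueError(f"unexpected header {header!r}; expected {expected_header!r}")

    # Every data row should have as many cells as the header.
    width = len(expected_header)
    bad = [row for row in rows[1:] if len(row) != width]
    if bad:
        raise ValueError(f"{len(bad)} row(s) do not have {width} cells")

    return rows[1:]


# Made-up sample of scraped output (a list of lists of strings).
scraped = [["Code", "Name"], ["aar", "Afar"], ["abk", "Abkhazian"]]
print(validate_rows(scraped))
```

Failing loudly on an unexpected header turns a silent data-quality problem into an immediate, visible error, which keeps the maintenance cost of a scraper easier to manage.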

# Relevant Skills and Technologies
- Chrome Developer Tools ([Keyboard Shortcuts](https://developers.google.com/web/tools/chrome-devtools/inspect-styles/shortcuts))
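    - Open Chrome Developer Tools: **Cmd + Opt + I** (macOS) or **Ctrl + Shift + I** (Windows/Linux)
    - Open / switch between inspect element mode and browser window: **Cmd + Shift + C** (macOS) or **Ctrl + Shift + C** (Windows/Linux)
    - Evaluate a CSS selector in the Console: **$$("&lt;selector&gt;")**
    - Evaluate an XPath expression in the Console: **$x("&lt;xpath&gt;")**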
- HTML
    - [Elements, Tags, and Content](http://www.w3schools.com/html/html_elements.asp)
    - [Attributes](http://www.w3schools.com/html/html_attributes.asp)
- [CSS Selector](http://www.w3schools.com/cssref/css_selectors.asp)
- [XPath](http://www.w3schools.com/xml/xpath_intro.asp)
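
To make the selector and XPath entries above concrete, here is a minimal sketch that runs three path queries against a small inline document. It uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. The sample markup, the `codes` ID, and the variable names are assumptions made up for illustration. Roughly equivalent CSS selectors are given in the comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real page).
SAMPLE = """
<html>
  <body>
    <table id="codes">
      <tr class="header"><th>Code</th><th>Name</th></tr>
      <tr><td>aar</td><td>Afar</td></tr>
      <tr><td>abk</td><td>Abkhazian</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Select by ID. Roughly equivalent CSS: "table#codes > tr"
rows_by_id = root.findall(".//table[@id='codes']/tr")

# Select by position. XPath positions are 1-indexed.
# Roughly equivalent CSS: "table#codes > tr:nth-of-type(2)"
second_row = root.find(".//table[@id='codes']/tr[2]")

# Select by class. Roughly equivalent CSS: "tr.header > th"
header_cells = [th.text for th in root.findall(".//tr[@class='header']/th")]

print(len(rows_by_id))
print([td.text for td in second_row])
print(header_cells)
```

Selecting by ID or class tends to survive layout changes better than selecting by position, but it only works when the page's authors have supplied such attributes.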

# Case Studies
- **[ISO 639](https://www.loc.gov/standards/iso639-2/php/code_list.php)** Fetch the data from this web page and extract the ISO 639-2 code, ISO 639-1 code, and the English name for each language. Return the data as a list of lists of strings.
    - Should you select using classes, IDs, paths, or something else?
    - Should you select using CSS selectors, XPath, or server-side code?

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import requests
from lxml import html


def main():
    """
    Get the language code data.

    Returns
    -------
    list
        Collection of table rows.
    """

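    response = requests.get(url='https://www.loc.gov/standards/iso639-2/php/code_list.php', timeout=30)
    # Fail early on HTTP errors instead of parsing an error page.
    response.raise_for_status()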

    # This is the equivalent XPath: "//table//table[1]//tr".
    # Note that this is 1-indexed.
    selector = 'table table:first-of-type tr'
    nodes = html.fromstring(response.text).cssselect(selector)

    return nodes


if __name__ == '__main__':
    as_data_frame = False
    nodes = main()

    # Convert each row element into a list of its cells' text.
    # text_content() also captures text nested inside child tags, unlike .text.
    nodes_parsed = [[child.text_content().strip() for child in node.iterchildren()] for node in nodes]

    if as_data_frame:
        data_frame = pd.DataFrame(nodes_parsed[1:], columns=nodes_parsed[0])
        print(data_frame.head())
    else:
        for node in nodes_parsed[:5]:
            print(node)
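
One caveat about turning cells into strings: an element's `.text` attribute holds only the text that appears before its first child element, so a cell that wraps its content in a link or another tag yields `None` or a truncated string. The sketch below illustrates this with the standard library's `xml.etree.ElementTree`, whose `.text` follows the same convention as lxml's. The sample row is invented for illustration and is not taken from the ISO 639 page.

```python
import xml.etree.ElementTree as ET

# Hypothetical table row with nested markup inside two of its cells.
ROW = (
    "<tr>"
    "<td>abk</td>"
    "<td><a href='#'>Abkhazian</a></td>"
    "<td>ab <i>(alpha-2)</i></td>"
    "</tr>"
)

row = ET.fromstring(ROW)

# .text stops at the first child element of each cell.
shallow = [cell.text for cell in row]

# itertext() walks the whole subtree, so nested text is included.
deep = ["".join(cell.itertext()).strip() for cell in row]

print(shallow)  # ['abk', None, 'ab ']
print(deep)     # ['abk', 'Abkhazian', 'ab (alpha-2)']
```

With lxml's HTML elements, `text_content()` plays the same role as the `itertext()` join shown here.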


# More Practice
- **[Osprey Bags and Packs](http://www.ospreypacks.com/us/en/category/packs-and-bags/)**