## Session 9 Examples

### New topics and Python packages involved in this notebook:
* "requests" package, which acts like a web browser by requesting web pages automatically.
* "re", the RegEx library, for pattern matching in arbitrary text content (especially NON-XML content).
* "lxml" package for parsing XML, or HTML, which is similar to XML but often very sloppy and invalid.

In [2]:
import requests
import lxml.html
import re
import pandas as pd
from time import sleep

# Extracting Sailboats for Sale

The website I'm using is http://www.sailboatlistings.com/

where a specific State's listings URL has the form: http://www.sailboatlistings.com/location/Michigan

In [3]:
def get_listings_for_state(state=None) -> lxml.html.HtmlElement:
    """Fetch web page from www.sailboatlistings.com for a single state and parse it into
    an element tree (DOM), returning the root element.

    :param state: if unspecified, will prompt user for input through the console."""

    state_list = ['Alaska', 'Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
             'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
             'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
             'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
             'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
             'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota',
             'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'Washington D.C.',
             'West Virginia', 'Wisconsin', 'Wyoming']

    while state is None:
        state = input('Enter the name of a state (e.g. Michigan) to search:')
        state = state.strip().title()

        # make certain the state is valid before requesting the URL with it:
        if state not in state_list:
            print('\nPlease enter a valid state or territory from this list:')
            print(state_list)
            state = None
            continue

    state = state.strip()

    # In the search URL below, the 'mh' key says how many results to return per page. Default is 100.
    url = 'https://www.sailboatlistings.com/cgi-bin/saildata/db.cgi' \
                  + '?db=default&sbltable=1&mh=5000&uid=default&view_records=1&sb=5&so=descend' \
                  + '&state=' + state 
        
    # Fetch the url web page, and convert the response into an HTML document tree:
    tree = None
    while tree is None:
        try:
            r = requests.get(url)
            tree = lxml.html.fromstring(r.content)
        except requests.exceptions.ConnectionError:
            # requests raises its own ConnectionError class, which is not a
            # subclass of Python's built-in ConnectionError.
            print('Error retrieving web page.  Retrying in 10 seconds...')
            sleep(10)
    return tree
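
The search URL above is assembled by string concatenation. As a minimal sketch of an alternative, the same query string can be built from a dictionary with the standard library's `urlencode`, which also escapes multi-word states such as "New York". The helper name `build_search_url` and the `max_hits` parameter are hypothetical, and whether the site accepts `+` for spaces is an assumption not confirmed by this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.sailboatlistings.com/cgi-bin/saildata/db.cgi'


def build_search_url(state: str, max_hits: int = 5000) -> str:
    """Return the search URL for one state (hypothetical helper)."""
    params = {
        'db': 'default',
        'sbltable': 1,
        'mh': max_hits,        # how many results to return per page
        'uid': 'default',
        'view_records': 1,
        'sb': 5,
        'so': 'descend',
        'state': state,        # urlencode escapes spaces as '+'
    }
    return BASE_URL + '?' + urlencode(params)


search_url = build_search_url('New York')
print(search_url)
```

Keeping the parameters in a dictionary makes it easier to see and change individual keys such as `mh`.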

## Here's what one table row looks like within the (invalid) HTML from that web page:

`<TR>
    <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
    <td background="/tb/b1.jpg">48</font></td>
    <td background="/tb/b2.jpg">2013</td>
    <td background="/tb/b3.jpg">
        <a href="/cgi-bin/saildata/db.cgi?...&manufacturer=Beneteau...">Beneteau</a></td>
    <td background="/tb/b4.jpg">
        <a href="/cgi-bin/saildata/db.cgi?...&model=48 Oceanis...">48 Oceanis</a></td>
    <td background="/tb/b5.jpg">
        <a href="/cgi-bin/saildata/db.cgi?...&city=Muskegon...">Muskegon</a></td>
    <td background="/tb/b6.jpg">Michigan</td>
    <td background="/tb/b7.jpg" align="right">$ 455,000</td>
    <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>`
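
As a rough check that `lxml.html` copes with markup like this, the sketch below parses a shortened version of the row above, including the stray `</font>` and the missing `</TR>`. The three-cell snippet is trimmed for illustration and is not the full row.

```python
import lxml.html

# A shortened, deliberately invalid fragment: upper-case <TR>, a stray
# </font> with no opening tag, and no closing </TR>.
snippet = """<table>
<TR>
    <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
    <td background="/tb/b1.jpg">48</font></td>
    <td background="/tb/b2.jpg">2013</td>
</table>"""

table = lxml.html.fromstring(snippet)

# Tag names are normalized to lower case, so 'tr' matches the <TR> above.
cells = table.xpath("//tr/td")
cell_texts = [c.text_content().strip() for c in cells]
print(cell_texts)
```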

### How to extract selected boat data from the table?  
The listings are in an HTML table, with one row per boat.  A minor complication is that some cells can be empty.

* Cell 1: link to another web page with the listing details
* Cell 2: Length in feet or feet and inches.
* Cell 3: Year made
* Cell 4: Manufacturer
* Cell 5: Model
* Cell 6: City/port where it's supposedly located
* Cell 7: State of listing/ownership
* Cell 8: Listed price
* Cell 9: A duplicate of the details link, which may contain a thumbnail picture.
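
The cell layout listed above can be written down as data. The following is a minimal sketch, assuming the cell order shown in the list; the `FIELDS` names are made up for illustration, the `href="#"` links are simplified, and the city cell is emptied on purpose to show what an empty cell yields.

```python
import lxml.html

# Hypothetical field names for cells 1-8; cell 9 (duplicate link) is ignored.
FIELDS = ['details_url', 'length', 'year', 'mfg', 'model', 'city', 'state', 'price']

row_html = """<table><tr>
  <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
  <td>48</td><td>2013</td>
  <td><a href="#">Beneteau</a></td>
  <td><a href="#">48 Oceanis</a></td>
  <td></td>
  <td>Michigan</td>
  <td>$ 455,000</td>
  <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
</tr></table>"""

row = lxml.html.fromstring(row_html).xpath("//tr")[0]
cells = row.xpath("td")

# zip() stops after the 8 named fields, so the 9th cell is dropped.
record = {name: cell.text_content().strip() for name, cell in zip(FIELDS, cells)}
# For the first cell we want the link target, not the word "Details".
record['details_url'] = cells[0].xpath("a/@href")[0]
print(record)
```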

In [4]:
tree = get_listings_for_state('Michigan')

## A simple XPath query:
Next is a very simple XPath query to extract just the HTML `<title>` tag's text. 

In this case `xpath()` returns a `list` of strings, which is empty if nothing matches (it does not return `None`).

## XPath return values

The return values of XPath evaluations vary, depending on the XPath expression used:

* True or False, when the XPath expression has a boolean result
* a float, when the XPath expression has a numeric result (integer or float)
* a (unicode) string, when the XPath expression has a string result.
* a list of items, when the XPath expression has a list as result. The items may include elements (also comments and processing instructions), strings and tuples. Text nodes and attributes in the result are returned as strings (the text node content or attribute value). Namespace declarations are returned as tuples of strings: (prefix, URI).
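
The return types listed above can be observed on a tiny document. This is a sketch on made-up markup, not on the live page:

```python
import lxml.html

doc = lxml.html.fromstring(
    "<table><tr><td>48</td><td>2013</td><td>Michigan</td></tr></table>")

has_cells = doc.xpath("boolean(//td)")      # boolean result -> bool
n_cells = doc.xpath("count(//td)")          # numeric result -> float
first_text = doc.xpath("string(//td[1])")   # string result  -> str
texts = doc.xpath("//td/text()")            # node-set       -> list of strings
no_match = doc.xpath("//th/text()")         # nothing matches -> empty list

print(type(has_cells), type(n_cells), type(first_text), type(texts), no_match)
```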




In [5]:
title = tree.xpath('//title/text()')
print('Page title: ', title[0])

Page title:  Michigan sailboats for sale by owner.


### Here's an xpath that only extracts the links to detail pages:

It works by finding a nested sequence of `<tr> <td> .. <a href...>`,
where the `<td>` is the first cell of its row and the link URL in the href attribute contains the string '/view/',
and then returning only the href value.

In [6]:
boat_detail_links = tree.xpath("//tr/td[1]/a[contains(@href,'/view/')]/@href")
print('Found ', len(boat_detail_links), 'boat listings.')
boat_detail_links[:10]

Found  591 boat listings.


['https://www.sailboatlistings.com/view/69804',
 'https://www.sailboatlistings.com/view/69827',
 'https://www.sailboatlistings.com/view/76277',
 'https://www.sailboatlistings.com/view/29925',
 'https://www.sailboatlistings.com/view/53788',
 'https://www.sailboatlistings.com/view/66305',
 'https://www.sailboatlistings.com/view/35713',
 'https://www.sailboatlistings.com/view/76229',
 'https://www.sailboatlistings.com/view/67181',
 'https://www.sailboatlistings.com/view/36236']

In [8]:
boat_detail_links[-5:]

['https://www.sailboatlistings.com/view/73046',
 'https://www.sailboatlistings.com/view/73664',
 'https://www.sailboatlistings.com/view/73177',
 'https://www.sailboatlistings.com/view/31834',
 'https://www.sailboatlistings.com/view/65703']
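
To see what the `contains(@href,'/view/')` predicate filters out, here is a small sketch on made-up markup. The header-row link with `?sort=length` is invented for illustration; only the two `/view/` URLs come from the output above.

```python
import lxml.html

page = lxml.html.fromstring("""<table>
  <tr><td><a href="/cgi-bin/saildata/db.cgi?sort=length">Length</a></td></tr>
  <tr><td><a href="https://www.sailboatlistings.com/view/69804">Details</a></td><td>63</td></tr>
  <tr><td><a href="https://www.sailboatlistings.com/view/69827">Details</a></td><td>55</td></tr>
</table>""")

# Without the predicate, every link in a first cell is returned:
all_links = page.xpath("//tr/td[1]/a/@href")

# With the predicate, only links whose href contains '/view/' remain:
detail_links = page.xpath("//tr/td[1]/a[contains(@href,'/view/')]/@href")

print(len(all_links), len(detail_links))
```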

### Here's another XPath that matches the same thing in a more convoluted way:

In [6]:
boat_detail_links = tree.xpath("//tr/td[@background='/tb/b1.jpg']/preceding-sibling::td/a/@href")
print('Found ', len(boat_detail_links), 'boat listings.')
boat_detail_links[:10]

Found  579 boat listings.


['https://www.sailboatlistings.com/view/69804',
 'https://www.sailboatlistings.com/view/69827',
 'https://www.sailboatlistings.com/view/69619',
 'https://www.sailboatlistings.com/view/29925',
 'https://www.sailboatlistings.com/view/53788',
 'https://www.sailboatlistings.com/view/35713',
 'https://www.sailboatlistings.com/view/66305',
 'https://www.sailboatlistings.com/view/65905',
 'https://www.sailboatlistings.com/view/71049',
 'https://www.sailboatlistings.com/view/67181']

### Here's an xpath that only extracts the prices:

It works by finding a table cell with `<a href...>` having 'Details' as its text.
Then it goes "up" the tree one level with `/../` to the enclosing `<td>`, finds that cell's 7th following sibling `<td>`
(the 8th cell in the row), and returns the text it contains.


In [7]:
boat_prices = tree.xpath("//tr/td/a[text()='Details']/../following-sibling::td[7]/text()")
boat_prices[:10]

['$ 300,000',
 '$ 124,900',
 '$ 159,900',
 '$ 280,000',
 '$ ',
 '$ 524,500',
 '$ 99,000',
 '$ 174,950',
 '$ 197,000',
 '$ 367,000']
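
The two navigation steps in that query, `..` and `following-sibling::td[7]`, can be tried on a single simplified row. The markup below is a cleaned-up sketch of the sample row, with the links in the middle cells removed for brevity.

```python
import lxml.html

row = lxml.html.fromstring("""<table><tr>
  <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
  <td>48</td><td>2013</td><td>Beneteau</td><td>48 Oceanis</td>
  <td>Muskegon</td><td>Michigan</td><td>$ 455,000</td>
  <td><a href="http://www.sailboatlistings.com/view/45613">Details</a></td>
</tr></table>""")

# '..' steps from the <a> up to its parent <td>:
link = row.xpath("//td/a[text()='Details']")[0]
parent_cell = link.xpath("..")[0]

# From the first cell, the 7th following sibling <td> is cell 8 (the price).
# The 'Details' link in cell 9 has no following siblings, so it adds nothing.
price = row.xpath("//tr/td/a[text()='Details']/../following-sibling::td[7]/text()")

print(parent_cell.tag, price)
```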

### Alternate XPath for same results?
Because of the way this website's table has different background images per cell, we could also get the prices this way:

In [8]:
boat_prices = tree.xpath("//tr/td[@background = '/tb/b7.jpg']/text()")
boat_prices[:10]

['$ 300,000',
 '$ 124,900',
 '$ 159,900',
 '$ 280,000',
 '$ ',
 '$ 524,500',
 '$ 99,000',
 '$ 174,950',
 '$ 197,000',
 '$ 367,000']
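
Note the `'$ '` entry above: a listing with no price still yields a string. One way to turn these strings into numbers with `re` is sketched below. The helper `parse_price` is hypothetical, and it assumes whole-dollar prices, since it simply discards every non-digit character.

```python
import re


def parse_price(text: str):
    """Return the price as an int, or None when no digits are present."""
    digits = re.sub(r'[^\d]', '', text)   # drop '$', commas, and spaces
    return int(digits) if digits else None


prices = [parse_price(p) for p in ['$ 300,000', '$ 124,900', '$ ']]
print(prices)
```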

### We COULD construct similar XPath queries for each column like above...
and then we'd quickly have a list of the values for each boat for sale.  **What's wrong with that algorithm?**

The problem with the approaches above is that each XPath query is independent of the others.
Thus, we can't be sure whether the matches we found for links (with IDs) belong to the same boats as the prices found.
If a cell is empty in one row, that column's list comes back one item shorter, and every value after it lines up with the wrong boat.

So the correct approach is to first locate where all the **table ROWS** are in the
document tree and **iterate through the rows** to pull out all the data we want from each,
one at a time.  The function below works that way.
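
Ahead of the full function, here is a toy illustration of the misalignment described above. The three-row table is made up; the second row has an empty manufacturer cell.

```python
import lxml.html

page = lxml.html.fromstring("""<table>
  <tr><td><a href="/view/1">Details</a></td><td><a href="#">Hunter</a></td><td>$ 10,000</td></tr>
  <tr><td><a href="/view/2">Details</a></td><td></td><td>$ 20,000</td></tr>
  <tr><td><a href="/view/3">Details</a></td><td><a href="#">Amel</a></td><td>$ 30,000</td></tr>
</table>""")

# Independent per-column queries: the lists have different lengths.
ids = page.xpath("//tr/td[1]/a/@href")        # 3 items
makers = page.xpath("//tr/td[2]/a/text()")    # only 2 items
naive = list(zip(ids, makers))                # pairs '/view/2' with 'Amel'

# Row-by-row extraction keeps each boat's values together.
by_row = []
for tr in page.xpath("//tr"):
    maker = tr.xpath("td[2]/a/text()")
    by_row.append((tr.xpath("td[1]/a/@href")[0], maker[0] if maker else ''))

print(naive)
print(by_row)
```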

In [9]:
def get_listings_from_tree(tree) -> pd.DataFrame:
    """Extract the boat sale listings and return a Pandas DataFrame.

    :param tree: an etree object parsed from the HTML."""

    # The following xpath matches every table row that contains a link called "Details".
    # It finds the link, then backs "up" the tree 2 levels to stop on the grandparent <tr>:
    boat_rows = tree.xpath("//tr/td/a[text()='Details']/../..")

    # Now we can iterate through the boat_rows, searching each time within a
    # different <tr> ... </tr>, containing one boat's data:

    boats = []  # let's load the data into a list of dictionaries
    for r in boat_rows:
        boat = {}

        detail_url = r.xpath("td[1]/a/@href")[0]  # href from 1st cell
        # Get the ID from the href url using a regex:
        boat['id'] = re.findall(r'/view/(\d+)', detail_url)[0]

        # Any of these cells can be empty, so fall back to '' when nothing matches:
        length = r.xpath("td[2]/text()")
        boat['length'] = length[0] if length else ''

        year = r.xpath("td[3]/text()")
        boat['year'] = year[0] if year else ''

        mfg = r.xpath("td[4]/a/text()")
        boat['mfg'] = mfg[0] if mfg else ''

        model = r.xpath("td[5]/a/text()")
        boat['model'] = model[0] if model else ''

        city = r.xpath("td[6]/a/text()")
        boat['city'] = city[0] if city else ''

        # The state and price cells could also be empty, so guard them the same way:
        state = r.xpath("td[7]/text()")
        boat['state'] = state[0] if state else ''
        price = r.xpath("td[8]/text()")
        boat['price'] = price[0] if price else ''

        # Next, remove the $ and comma symbols from prices.
        # In particular, the $ signs when rendered in GitHub pages sometimes
        # erroneously invoke math equation formatting, which corrupts the
        # proper DataFrame display.
        boat['price'] = boat['price'].replace('$', '').replace(',', '').strip()

        boats.append(boat)  # Add data from row into the list

    # Return a DataFrame constructed from the list of dicts.
    # Note: 'id' is collected above but is not among the columns selected here.
    columns = ['year', 'mfg', 'model', 'length', 'price', 'city', 'state']
    return pd.DataFrame(boats, columns=columns)
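
The ID extraction inside the function uses `re.findall(...)[0]`, which raises an `IndexError` if a URL has no `/view/` part. A minimal sketch of a more forgiving variant follows; the helper name `listing_id` is hypothetical.

```python
import re

VIEW_ID = re.compile(r'/view/(\d+)')   # same pattern as in the function above


def listing_id(url: str):
    """Return the numeric listing ID as a string, or None if the URL has none."""
    match = VIEW_ID.search(url)
    return match.group(1) if match else None


found = listing_id('https://www.sailboatlistings.com/view/69804')
missing = listing_id('http://www.sailboatlistings.com/location/Michigan')
print(found, missing)
```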

In [10]:
df = get_listings_from_tree(tree)
df

| | year | mfg | model | length | price | city | state |
|---|---|---|---|---|---|---|---|
| 0 | 1988 | Texas Boat Works | | 63 | 300000 | North Shore Marina Grand Haven | Michigan |
| 1 | 1985 | Sutton Boatworks | Sutton 42 Schooner | 55 | 124900 | Lake Huron | Michigan |
| 2 | 1985 | Amel | Mango Special | 53 | 159900 | Ft Lauderdale | Michigan |
| 3 | 2003 | Bavaria | Bavaria 50 | 50 | 280000 | | Michigan |
| 4 | 2013 | cradle | Oceanis 48 | 48 | | Muskegon | Michigan |
| 5 | 2008 | Passport Yachts | Passport 470 CC three stateroom | 47 | 524500 | La Salle | Michigan |
| 6 | 1992 | Carrol Marine | Tripp 47 | 47 | 99000 | Macatawa | Michigan |
| 7 | 2002 | Hunter | 466 | 46 | 174950 | Detroit | Michigan |
| 8 | 2003 | Fountaine Pajot | Bahia 46 | 45.9 | 197000 | Caribbean Martinique | Michigan |
| 9 | 2017 | Jeanneau | | 45.1 | 367000 | Kemah - Texas | Michigan |
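
Every value in this DataFrame is still a string. As a possible next step, and only a sketch, `pd.to_numeric` with `errors='coerce'` can convert the numeric columns, turning empty strings into `NaN`. The three sample rows below are copied from rows 0, 4, and 8 above; the choice of columns to convert is an assumption.

```python
import pandas as pd

# Three rows copied from the output above (rows 0, 4, and 8), as strings.
sample = pd.DataFrame({
    'year': ['1988', '2013', '2003'],
    'length': ['63', '48', '45.9'],
    'price': ['300000', '', '197000'],
})

# Convert the text columns to numbers; anything unparseable becomes NaN.
for col in ['year', 'length', 'price']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

print(sample.dtypes)
print(sample)
```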
