In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

# Scraping
<!-- requirement: images/async.png -->
<!-- requirement: small_data/Stanford-Tech-Listing.html -->

Today we'll talk about "scraping": how to get unstructured data and turn it into something usable. We'll primarily focus on _web scraping_. Python has mature tools that make this pretty easy.

The basic workflow is:

1. Find the data you want on the web.
2. Inspect the webpage and figure out how to select the content you want. This usually involves some combination of
    - Viewing the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree.  This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / [regular expressions](https://docs.python.org/2/howto/regex.html)__ in Python.
    - If the page is more complicated (and/or written in good style), we want to use the HTML parse tree => __[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) / [lxml](http://lxml.de/lxmlhtml.html)__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.

As an example, suppose we want to crawl the list of "Available Technologies" being licensed by MIT at http://tlo.mit.edu/explore-mit-technologies/view-technologies and store their basic info, their associated patents, and the reference counts on their associated patents.

## HTTP Requests and Responses


Before we delve into scraping, it's helpful to have a rudimentary understanding of how information is transmitted between servers and clients. Hypertext Transfer Protocol (HTTP) is a messaging protocol that describes the types and structure of messages that can be used for communication between servers and clients. Communication occurs in a request-response cycle, in which a client sends a request to the server which then replies with a response.

There are several kinds of requests we can send, but the most common is GET. Any time you open your web browser and navigate to a website, you are making a GET request. As the name implies, GET is a request for the server to send the client information. Another common request is POST, for sending information to a server. POST requests are common in web applications, but are uncommon in the context of web scraping, so we won't discuss them further here.

In [2]:
import requests

response = requests.get('http://www.google.com')

The server we've sent this GET request to returns a response. The response has several components but the most significant are the status code and the content.

In [3]:
response.status_code, response.reason

(200, 'OK')

In [4]:
response.content[:100]

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content'

In [5]:
from IPython.display import HTML
HTML(response.text)



We see that we got a response code of 200, corresponding with 'OK'; Google's servers were able to handle our request and respond without issue. They also sent us some HTML content that our browser could render. This is a typical response, though we may encounter other kinds of content including XML and JSON files, or even images, audio, or video.

In [6]:
# audio file
response = requests.get('https://ia800304.us.archive.org/25/items/ird059/tcp_d1_01_the_swedish_rhapsody_irdial.mp3')

with open('numbers_station.mp3', 'wb') as f:
    f.write(response.content)

In [7]:
from IPython.display import Audio
Audio('numbers_station.mp3')

In [8]:
# nonexistent page
response = requests.get('https://github.com/foo/bar')
response.status_code, response.reason

(404, 'Not Found')

In [9]:
HTML(response.text)

Status codes belong to 5 general categories:
- 1xx: Informational -- the request has been received and is processing
- 2xx: Success -- the request has been accepted
- 3xx: Redirection -- the client must go somewhere else to fulfill the request
- 4xx: Client error -- the request was faulty and could not be fulfilled
- 5xx: Server error -- the request was valid but the server could not fulfill it
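
Because the category is determined by the leading digit, it can be looked up with integer division. The following is a minimal sketch; the helper name `status_category` is our own and not part of `requests`.

```python
def status_category(code):
    """Map an HTTP status code to its general category via the leading digit."""
    categories = {
        1: 'Informational',
        2: 'Success',
        3: 'Redirection',
        4: 'Client error',
        5: 'Server error',
    }
    return categories.get(code // 100, 'Unknown')

for code in (200, 301, 404, 503):
    print("%d: %s" % (code, status_category(code)))
```

In practice you rarely need such a helper: `requests` offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.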

The response to a GET request is only determined by the URI requested. Often the URI will be structured in two pieces: a hierarchical URL and a query string. The query string is useful for communicating parameters that modify our GET request beyond the hierarchical path to the requested resource.
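
To make the two pieces concrete, here is a small sketch that takes a URL apart with the standard library, without sending any request. The `try`/`except` import is only there so the snippet runs on both Python 3 and Python 2.

```python
try:
    from urllib.parse import urlsplit, parse_qs   # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs       # Python 2

url = 'https://www.google.com/search?q=http&as_sitesearch=launchschool.com&num=1'
parts = urlsplit(url)

# The hierarchical part: scheme, host, and path
hierarchical = (parts.scheme, parts.netloc, parts.path)

# The query string, decoded into a dict mapping each key to a list of values
params = parse_qs(parts.query)

print(hierarchical)
print(sorted(params.items()))
```

Each value comes back as a list because a key may legally appear more than once in a query string.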

In [10]:
response = requests.get('https://www.google.com/search?q=http&as_sitesearch=launchschool.com&num=1')
HTML(response.text)

[Rendered HTML output: a Google results page reporting "About 129 results", whose top hit is "Introductory HTTP - Beginner-friendly book on HTTP - Launch School" (https://launchschool.com/books/http).]


In [11]:
# equivalently...
HTML(requests.get('https://www.google.com/search', params={'q': 'http', 'as_sitesearch': 'launchschool.com', 'num': 1}).text)



While we can formulate very detailed requests with many parameters, a very important feature of HTTP is that it is stateless. The server does not store information about a request after it has been fulfilled. Therefore all the information needed to fulfill a request must be contained in the request itself.

Remember HTTP is simply a messaging protocol, specifying the request-response cycle and the structure of the messages. The web server responding to requests is therefore neither aware of any applications used to fulfill requests nor of how the client generates requests or makes use of responses. This high level of abstraction is necessary to standardize web traffic for diverse services across the internet.

## Understanding URLs


Let's try to find the correct URL to use.

- _First try_:  Aha, a list of categories at the bottom.  Let's click on a few -- what do we see?  Many are empty, the categories are not obviously mutually exclusive, okay.  Maybe there's a better way.
- _Second try_: Let's just click the search button on http://tlo.mit.edu/explore-mit-technologies/view-technologies.  Okay, better but it only gives us 10 at a time.  Are we going to have to click through all of the following pages?  Let's just click on page 2 to see what happens.
- _Third try_: Aha, the URL for page 2 is http://tlo.mit.edu/technologies?search_api_views_fulltext=&page=1.  We can just advance the page number programmatically to visit all of the pages.

> Sometimes you can play with the query string to change other options.  In the past, we were able to set `limit=1000` to get all of the technologies listed on one page.  This no longer works, but there could be an equivalent parameter we haven't noticed.  In the end, you are limited by what the web server will provide.

For now, we'll just worry about getting the technologies off of the first page.
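
As a sketch of what "advancing the page number programmatically" could look like, the snippet below builds the page URLs with the standard library and touches no network. The helper name `page_url` is hypothetical; the base URL and parameter names are the ones observed in the browser above.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

BASE_URL = 'http://tlo.mit.edu/technologies'

def page_url(page):
    """Build the listing URL for a zero-based page number."""
    # A list of pairs keeps the parameter order fixed
    query = urlencode([('search_api_views_fulltext', ''), ('page', page)])
    return BASE_URL + '?' + query

first_three = [page_url(i) for i in range(3)]
for u in first_three:
    print(u)
```

Note that the numbering is zero-based: `page=1` is the second page of results.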

In [12]:
import requests

url = "http://tlo.mit.edu/technologies"
response = requests.get(url, params={"search_api_views_fulltext": ""})
print response.url
response.text[:1000] + "..."

http://tlo.mit.edu/technologies?search_api_views_fulltext=


u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"\n  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">\n\n<head profile="http://www.w3.org/1999/xhtml/vocab">\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return functio

## HTML and the DOM


To get started:

- Pull up http://tlo.mit.edu/technologies?search_api_views_fulltext= in Chrome.  
- Open __View->Developer->Developer Tools__.  
- Right click on one of the technology titles, and choose __"Inspect Element"__.

What are we looking at?  Well... this is the structure of the webpage.  Nested _tags_ of different _types_ and having a variety of _attributes_.

What we learned above:

  - All of the technologies are underneath ("_descendants of_")   `<section id="block-system-main">`
  - In fact, each of them is in its own `<div class="views-row">`
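
To see the nesting of tags and attributes outside the browser, here is a rough sketch using only the standard library's event-based HTML parser. The markup is a simplified, hypothetical stand-in for the real page (the `href` is made up), and the class `OutlineParser` is our own. It assumes every tag is explicitly closed, so it would miscount depth on void tags such as `<br>`.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

snippet = """
<section id="block-system-main">
  <div class="views-row">
    <a href="/technologies/example">Example technology</a>
  </div>
</section>
"""

parser = OutlineParser()
parser.feed(snippet)
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag + " " + str(attrs))
```

The indentation of the printed outline mirrors what Developer Tools shows: the `<a>` is a descendant of the `<div>`, which is in turn a descendant of the `<section>`.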
  
Now we're ready to move on:


## Parsing HTML

Now, we need to parse the raw HTML and actually grab the links of detailed info. The two main parser libraries in Python are `BeautifulSoup` and `lxml`. `lxml` is much faster (it leverages several C libraries), but it's also worse at dealing with malformed, crummy HTML. Because parsing speed isn't our bottleneck here, we'll use `BeautifulSoup`.

In [13]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")  # name the parser explicitly; otherwise bs4 picks one and warns
print soup.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return function()





In [14]:
parent = soup.find('section', attrs={'id': 'block-system-main'}) #Find (at most) *one*
parent

<section class="block block-system clearfix" id="block-system-main">\n<div class="view view-tech-search view-id-tech_search view-display-id-page_1 view-dom-id-bca125065b5252d664e210f9b3bbcd8f">\n<div class="view-content">\n<div class="views-row views-row-1 views-row-odd views-row-first">\n<div class="views-field views-field-title"> <span class="field-content"><a href="/technologies/heterogeneous-organic-gel-catalyst-photo-controlled-chemistry">Heterogeneous Organic Gel Catalyst for Photo-controlled Chemistry</a></span> </div>\n<div class="views-field views-field-field-header-and-body"> <span class="field-content">\n<h2>Applications</h2>\n      \n\nRadical polymerization is often used in the design and fabrication of new materials for a variety of applications, including but not limited to coatings, adhesives, and gels used in manufacturing processes.  \n\n\n  \n    \n    \n          <h2>Problem...</h2></span> </div> </div>\n<div class="views-row views-row-2 views-row-even">\n<div class

In [15]:
tech_divs = parent.find_all('div', attrs={'class':'views-row'})  #Find *all*
len(tech_divs)

10

## CSS selectors


This pattern of nested finds, based on tag type, ID, and class, is very common. It's so common that there are two special convenience languages for such traversals: [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp) and [XPath](http://www.w3schools.com/xml/xpath_intro.asp) (which works for all XML, not just HTML). We'll be using CSS selectors, which are more common for HTML and easier to learn.

With CSS selectors, we can write the above in a more concise and expressive way:

```python
tech_divs = soup.select('section#block-system-main div.views-row')
```

All selectors work like `find_all`.  Some basic building blocks of selectors are:

 - `'mytag'` picks out all tags of type `mytag`.
 - `'#myid'` picks out all tags whose _id_ is equal to `myid`
 - `'.myclass'` picks out all tags whose _class_ attribute includes `myclass` (a tag can have several classes)
 - `'mytag#myid'` will pick all tags of type `mytag` **and** `id` equal to `myid` (analogously for `'mytag.myclass'`)
 - If `'selector1'` and `'selector2'` are two selectors, then there is another selector `'selector1 selector2'`.  It picks out all tags satisfying `selector2` that are __descendants__(*) of something satisfying `selector1`, i.e. it's like our nested find.
 
 (*) It doesn't have to be a _direct_ descendant.  I.e. it can be a great-great-...-grandchild of something satisfying `selector1`.  For direct children we'd instead write `'selector1 > selector2'`
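
Here is how these building blocks behave on a toy document. The markup is hypothetical (it is not the MIT page), and the built-in `html.parser` backend is used so that nothing beyond `bs4` itself is required.

```python
from bs4 import BeautifulSoup

html = """
<section id="main">
  <div class="row first"><span><a href="/a">A</a></span></div>
  <div class="row"><a href="/b">B</a></div>
</section>
<div class="row"><a href="/c">C</a></div>
"""
toy = BeautifulSoup(html, "html.parser")

# Class selector: matches any div whose class list includes "row"
all_rows = toy.select('div.row')

# Descendant combinator: only rows somewhere inside section#main
nested_rows = toy.select('section#main div.row')

# Descendant vs. direct child: link A sits inside a <span>, link B does not
descendant_links = [a['href'] for a in toy.select('section#main a')]
child_links = [a['href'] for a in toy.select('section#main > div.row > a')]

print(len(all_rows), len(nested_rows))
print(descendant_links, child_links)
```

The first `div` carries two classes (`row` and `first`) and is still matched by `div.row`, which is why the real page's `div.views-row` selector works on tags like `<div class="views-row views-row-1 ...">`.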
 
Let's just see this in action:

In [16]:
print soup.select('section#block-system-main')[0].prettify()[:400]

<section class="block block-system clearfix" id="block-system-main">
 <div class="view view-tech-search view-id-tech_search view-display-id-page_1 view-dom-id-bca125065b5252d664e210f9b3bbcd8f">
  <div class="view-content">
   <div class="views-row views-row-1 views-row-odd views-row-first">
    <div class="views-field views-field-title">
     <span class="field-content">
      <a href="/technologi


In [17]:
print soup.select('section#block-system-main div.views-row')[0].prettify()[:400]

<div class="views-row views-row-1 views-row-odd views-row-first">
 <div class="views-field views-field-title">
  <span class="field-content">
   <a href="/technologies/heterogeneous-organic-gel-catalyst-photo-controlled-chemistry">
    Heterogeneous Organic Gel Catalyst for Photo-controlled Chemistry
   </a>
  </span>
 </div>
 <div class="views-field views-field-field-header-and-body">
  <span cla


In [18]:
tech_divs = soup.select('section#block-system-main div.views-row')
len(tech_divs)

10

Now we're ready to pull out some key pieces of info:

- The technology's "title" (the text in the `<a>` element)
- The link to follow for more info on the technology (the `href` attribute of the `<a>`)
- And a short blurb about the technology (in the following `<span>`)

Let's write some code to extract this.  But before we do, let's discuss what _form_ the output should take: It is often convenient to store data in a dictionary (i.e. as a _key-value_ hashtable) - in other words, to name the bits of data you are collecting.  One big advantage is that this makes it easier to add in extra fields progressively.

Let's see what the code looks like:

In [None]:
firsta = tech_divs[0].select('a')[0]
firsta.text, firsta['href']

We're going to use a "named tuple" to store our key-value data.
We could also have used a dictionary, with strings as keys.
Named tuples have some advantages:
 - Better notation with autocomplete: `x.field_name` instead of `x['field_name']`.
 - If you change your object structure later and fail to update your
   code to include the new fields, constructing the tuple raises an
   error, which makes the omission easy to find.
 - They are immutable, preventing certain sorts of bugs.

... and some disadvantages:
 - If you want to augment the object structure, you need a new type
   (or to go back and update your code).
 - They are immutable.
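
The trade-off can be seen in a few lines. This is only an illustration: the type names `Tech` and `TechWithShort` and the field values are made up for the example.

```python
from collections import namedtuple

Tech = namedtuple('Tech', 'title, url')   # hypothetical two-field record
as_tuple = Tech(title='Example technology', url='/technologies/example')
as_dict = {'title': 'Example technology', 'url': '/technologies/example'}

# A dictionary grows freely...
as_dict['short'] = 'A blurb added later'

# ...while a named tuple refuses new (or changed) attributes
try:
    as_tuple.short = 'A blurb added later'
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# Augmenting the structure means defining a new type
TechWithShort = namedtuple('TechWithShort', Tech._fields + ('short',))
extended = TechWithShort(short='A blurb added later', **as_tuple._asdict())

print(assignment_failed)
print(extended)
```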

In [None]:
from collections import namedtuple
TechBasic = namedtuple('TechBasic', 'title, url, short')

def td_info(td):
    la = td.select('a')
    ls = td.select('span')
    if len(la) != 1 or len(ls) != 2:
        print "Uh oh! We did something wrong for:"
        print "\n".join(">>> " + line for line in td.prettify().split("\n"))
        return
    return TechBasic(title=la[0].text, url=la[0]['href'], short=ls[1].text)

tech_links = filter(None, [td_info(td) for td in tech_divs])

tech_links[0]

## Fetching subsequent pages


Now that we can get all of the information off of one page, let's figure out how to grab the data from all the subsequent pages, too.  As a first step, let's encapsulate the parsing into a function.  For reasons that will become apparent soon, we take an HTTP response as an argument, instead of doing the request in the function.

In [None]:
def get_techs(response):
    soup = BeautifulSoup(response.text)
    tech_divs = soup.select('section#block-system-main div.views-row')
    return filter(None, [td_info(td) for td in tech_divs])

To start off, we'll only get data from the first 10 pages.  Since we'll be using them often, we'll write a function to give us the arguments to `requests.get` to obtain the $n^{th}$ page.

In [None]:
LIMIT = 10
def get_page_args(i):
    return {"url": url,
            "params": {"search_api_views_fulltext": "", "page": i}}

**Solution 1:** The first solution is to run the requests serially.  This is very slow.

In [None]:
%%timeit -n1 -r1
# Slow version

techs = [get_techs(requests.get(**get_page_args(i))) for i in xrange(LIMIT)]

The problem is that connecting to a remote server and fetching the pages takes a while. Scraping web pages is usually _IO-bound_ and not CPU-bound (that is, we spend most of our time waiting for data and not processing it). Fortunately, Python gives us lots of different ways to deal with this problem.

**Solution 2:** We can use Python's [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) interface, which can easily parallelize a map.  This is a very straightforward API to use.  The drawback is that it spins up independent processes, which have a potentially significant startup cost.

In [None]:
# We define this function in a separate cell due to a problematic
# interaction between timeit and multiprocessing.
def get_page(i):
    return requests.get(**get_page_args(i))

In [None]:
%%timeit -n1 -r1
# faster version -- using multiprocessing
from multiprocessing import Pool

p = Pool(3)  # three worker processes
responses = p.map(get_page, xrange(LIMIT))
techs = [get_techs(response) for response in responses]

**Solution 3:** For requests, there is a special library called [requests-futures](https://github.com/ross/requests-futures) which returns a placeholder object that holds a promise to return the webpage sometime later (in the "future").  This allows us to continue making other fetching requests while waiting for the first result to return.

![Synchronous vs. Asynchronous](images/async.png)

Requests-futures works by combining the `requests` library with `concurrent.futures`.  For a faster, though harder to debug, alternative, you can look at [`grequests`](https://github.com/kennethreitz/grequests).
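
To illustrate the "future" idea on its own, here is a minimal sketch using `concurrent.futures` directly. It is not how `requests-futures` is implemented internally; `fake_fetch` is a made-up stand-in that sleeps instead of contacting a server, so the example runs offline. `concurrent.futures` ships with Python 3; on Python 2 it is available through the `futures` backport package.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page):
    """Stand-in for requests.get: sleep to imitate network latency."""
    time.sleep(0.1)
    return "contents of page %d" % page

start = time.time()
with ThreadPoolExecutor(max_workers=5) as executor:
    # submit() returns immediately with a Future (a placeholder for the result)
    futures = [executor.submit(fake_fetch, page) for page in range(5)]
    # result() blocks until that particular future has finished
    results = [future.result() for future in futures]
elapsed = time.time() - start

print(results)
print("elapsed: %.2f seconds" % elapsed)
```

Because the five sleeps overlap, the elapsed time should be close to one sleep rather than five. Collecting results in submission order also keeps the pages in order, even if they finish out of order.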

In [None]:
%%timeit -n1 -r1
# faster version using requests-futures
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)
futures = [session.get(**get_page_args(i)) for i in xrange(LIMIT)]

techs = [get_techs(future.result()) for future in futures]

It'll take a little bit, but now we can scrape all of the listed technologies.  To avoid getting nested lists, we use a double comprehension, which flattens the result.
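
As a quick illustration of the flattening, here is the same double comprehension on toy data (the page contents are placeholders):

```python
from itertools import chain

# Each inner list stands for the technologies parsed from one page
pages = [['tech A', 'tech B'], ['tech C'], []]

# A single comprehension would keep the nesting
nested = [page for page in pages]

# The double comprehension reads left to right like nested for-loops
flat = [tech for page in pages for tech in page]

# itertools.chain.from_iterable is an equivalent alternative
flat_again = list(chain.from_iterable(pages))

print(nested)
print(flat)
```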

In [None]:
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)
techs = [tech 
         for future in [session.get(**get_page_args(i)) for i in xrange(61)]
         for tech in get_techs(future.result())]
len(techs)

In [None]:
techs[150]

**Exercises:**

1. Write a function `get_tech_details` to parse the individual page for each technology.  It can capture the longer description, the inventors, patent information, etc.

2. Use `requests_futures` to get all of the individual pages.  Parse them with your `get_tech_details` function to produce a more detailed database of MIT inventions.

## Scrapy in Python


If you are really interested in crawling, consider using `scrapy`.  [Scrapy](http://scrapy.org/) is a specialized Python package for scraping websites.  In particular, it has a few features:
1. The HTML is parsed and accessed through a `response` object in a `parse` method which in turn supports `response.xpath` and `response.css` methods, allowing one to use `xpath` and `css` selectors on the response DOM, respectively.
1. Data is stored in `scrapy.Item` objects (which are similar to `namedtuple`s) or as Python dictionaries.
1. Scrapy is object-oriented and calls its own `parse` method (a generator) that `yield`s values.
1. You can limit your crawls by specifying the class property `allowed_domains` and define the starting point of your crawl using the class property `start_urls`.
1. You can also build pipelines of crawling and data extraction steps to make your crawling code more reusable.
1. Additional scraping steps (e.g. scraping entries in a directory like in the example above) can be accessed via `scrapy.Request`.
1. It has command-line tools that let you interactively play with the `response` object from a webpage (`scrapy shell`) or view a page as the library sees it, which may be different from how your browser renders it (`scrapy view`).

The following is a canonical `scrapy` example:

In [None]:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

### More complicated example

Suppose we had picked Stanford instead of MIT.  Let's try to do the same thing (it's a bit harder to get a good listing URL, so I just downloaded one).

In [None]:
from collections import namedtuple
from urlparse import urljoin

from bs4 import BeautifulSoup, Comment

with open("small_data/Stanford-Tech-Listing.html", "r") as fin:
    soup = BeautifulSoup(fin)

In [None]:
#BeautifulSoup doesn't seem to support 'or' selectors, so:
selector = lambda x: x.has_attr("id") and x["id"].startswith("output_row")
tech_rows = soup.find_all(selector)[1:]
# Alternate -- showing how to go up and down the tree
#tech_rows = soup.find('tr', attrs={'id':'output_row_1'}).parent.findAll('tr')[1:]
print len(tech_rows)
print tech_rows[0].prettify()
print tech_rows[-1].prettify()

**Details:** Let's quickly break down two bits of Python syntax from that cell that we haven't explicitly talked about:
```python
selector = lambda x: x.has_attr('id') and x['id'].startswith('output_row')
```
This is a *lambda expression* -- a short, inline, unnamed function. Lambdas are pretty limited, so you should define a named function for anything complicated.
```
tech_rows = soup.find_all(selector)[1:]      
                                   ^^^^
```
This is list slice notation (we already used this above, e.g. with `[:100]`).  In this case, we're taking all but the zeroth entry (which is a header row).

**UH OH:** 
When originally preparing this, I was using Anaconda.  The same code only showed about _254_ of the _1727_ entries -- BeautifulSoup was incorrectly parsing the file.  These sorts of things are not entirely uncommon, so sometimes it helps to double-check.

In [None]:
# Warning: This is hacky code!
TechBlurb = namedtuple('TechBlurb', 'docket techid url title')
def parse_tr(tr):
    link = tr.select("td.output_data a")[0]
    
    docket = link.text
    url = link["href"]
    techid = url.split("=")[1]
    title = tr.select("td.output_data")[2].text
    return TechBlurb(docket=docket, techid=techid, url=url, title=title)
tech_blurbs=map(parse_tr, tech_rows)

In [None]:
# And this isn't much better!
def find_comment_by_text_in(soup, comment_text):
    return soup.find(text=lambda text: isinstance(text, Comment) and comment_text in text)

TechDetailed = namedtuple('TechDetailed', 'blurb, abstract, similar')
SimilarHint = namedtuple('SimilarHint', 'techid, docket, title')
def get_tech_details(response, blurb):
    # We're doing a lot of chaining with implicit assumptions here -- 
    #   it might fail in all sorts of ways, in which case we give up.
    # `blurb` is passed in explicitly instead of being read from a global.
    soup = BeautifulSoup(response.text)
    contents = soup.find_all('form')[1]
    abstract = (find_comment_by_text_in(contents, 'Abstract')
        .find_next_sibling('hr')
        .find('div')
        .text)
        
    def parse_similar_tr(r):
        tds = r.find_all('td')
        if len(tds) < 3:
            return None
        return SimilarHint(
            techid=tds[0].find('a')['href'].split('=')[1],
            docket="S" + tds[0].text.strip(),
            title=tds[2].text.strip()
        )

    similar_trs = (find_comment_by_text_in(contents, 'Similar Tech')
                      .find_next_sibling('table')
                      .find('div')
                      .find('table')
                      .find('table')
                      .find_all('tr'))
    similar = filter(None, [parse_similar_tr(tr) for tr in similar_trs])
    
    return TechDetailed(blurb=blurb, abstract=abstract, similar=similar)

In [None]:
# Since the point is to show that something goes wrong, we stop at the
# first failure instead of waiting until the end.

url_base = "http://techfinder.stanford.edu/"

for blurb in tech_blurbs:
    response = requests.get(urljoin(url_base, blurb.url))
    try:
        get_tech_details(response, blurb)
    except Exception:
        print "Something went wrong for", blurb.url
        break

#### Remark:

When we run the above code, it tells us that [this technology](http://techfinder.stanford.edu/technology_detail.php?ID=30261) did not have a list of similar technologies.  But going to the web page shows that it does!  What went wrong?

In [None]:
url = 'http://techfinder.stanford.edu/technology_detail.php'
soup = BeautifulSoup(requests.get(url, params={"ID": 30261}).text)
contents = soup.find_all('form')[1]
print contents

If we go and look at the same part of the **raw** HTML, we find that there is no `</form>` there:
```HTML
<!--- Applications --->
<h3>Applications</h3><br/>
<ul><li>Imaging apoptosis<ul type="circle" style="margin-bottom:0in"></li><li>Research</li><li>Clinical<ul type="circle" style="margin-bottom:0in"></li><li>Monitor therapeutic efficacy in cancer patients</li><li>Anti-cancer drug selection</ul></ul></li></ul><br/>

<!--- Advantages --->
<h3>Advantages</h3><br/>
<ul><li>High specificity for caspase-3 and -7</li><li>High sensitivity</li><li>Non-invasive</li><li>Biocompatible</li><li>Small size of probe allows:<ul type="circle" style="margin-bottom:0in"></li><li>Deep tissue penetration</li><li>More extensive biodistribution</ul></li><li>PET probes:<ul type="circle" style="margin-bottom:0in"></li><li>High tumor/muscle ratio in apoptotic tumors</li><li>High uptake value in apoptotic tumors</ul></li><li>Fluorescent probe:<ul type="circle" style="margin-bottom:0in"></li><li>Possess NIR spectral properties</ul></li><li>May help promote personalized cancer medicine</li><li>Potential for probe design strategy to be applied to other enzyme targets</li></ul><br/>
```

What there **is** is _malformed HTML_ that is bad enough to confuse BeautifulSoup.  (Note that it's not nearly bad enough to confuse a web browser however).  If you look at more examples, you will find even worse ones -- a stray `</html>` in the middle of a document is not unheard of.  
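
One rough way to spot this kind of problem is to check that closing tags arrive in the reverse order of the opening tags. The sketch below does that with the standard library's event-based parser. It is our own illustration, not what BeautifulSoup or a browser does internally, and the class name `NestingChecker` is made up. The test string is a shortened version of the list markup above.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

# Tags that never take a closing tag
VOID_TAGS = set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'])

class NestingChecker(HTMLParser):
    """Collect (closing tag, innermost open tag) pairs that do not match."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        self.problems.append((tag, self.stack[-1] if self.stack else None))
        # Recover roughly like a lenient parser: unwind to the matching opener
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def nesting_problems(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems

good = '<ul><li>Research</li><li>Clinical<br/></li></ul>'
bad = '<ul><li>Imaging apoptosis<ul type="circle"></li><li>Research</li></ul>'

print(nesting_problems(good))
print(nesting_problems(bad))
```

In the second string, a `</li>` arrives while the inner `<ul>` is still open, which is exactly the pattern in the page above.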

To fix this, we can pre-"tidy" the page before feeding it to BeautifulSoup using `pytidylib`.

In [None]:
from tidylib import tidy_document
url='http://techfinder.stanford.edu/technology_detail.php'

tidy_page, __ = tidy_document(requests.get(url, params={"ID": 30261}).text)
soup = BeautifulSoup(tidy_page)
contents = soup.find_all('form')[1]
print contents

### Exercises


1. Go back and modify `get_tech_details` to use this 'tidy' approach.

2. Sometimes web servers are slow and/or unreliable, and sometimes your connection is.  If we were to run the above test twice, we'd probably find that some of the failures were just due to a connection error.  We didn't notice this because the _outer_ `try` / `except` is also catching these.  So: Modify `get_tech_details` to allow up to 3 retries. <br/>Bonus points if you actually look at what exceptions `requests` throws in those cases instead of using a general catch-all mechanism.  Alternate type of bonus points if you figure out how to do it using the `retrying` package.  You can test these by toggling your internet connection on and off to simulate an unreliable connection.

### Exit Tickets

1. How would you design a web scraping app such that the user interface remained responsive? One that is robust to poor internet connections?
1. How would you deal with messy/malformed HTML/XML?

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*