Data Scraping in a Nutshell
=====

This is a notebook that has been adapted from the Harvard course CS109 taught by Verena Kaynig-Fittkau. You can see the course website at <a href="http://cs109.github.io/2015/">this URL</a>.

We will explore how to extract information directly from the web, without using an API. The primary module we will use is Beautiful Soup. We will also use IPython's `HTML` display class, which renders HTML inside the notebook.

In [1]:
## all imports
from IPython.display import HTML
import numpy as np
import urllib.request  # urlopen lives in the urllib.request submodule
import bs4 #this is beautiful soup
print(bs4.__version__)

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_context("talk")
sns.set_style("white")

4.11.1


Python data scraping
====================

* Why scrape the web?
    - vast source of information
    - automate tasks
    - keep up with sites
    - fun!

*** Can you think of examples ? ***
  
(Stock market monitoring, sports data, airline prices, etc.)

* copyrights and permission:
    - be careful and polite
    - give credit
    - care about media law
    - don't be evil (no spam, overloading sites, etc.)

Robots.txt
==========

* specified by web site owner
* gives instructions to web robots (aka your script)
* is located at the top-level directory of the web server

For example, you can have a look at

http://google.com/robots.txt


*** What does this one do? ***
User-agent: Google
Disallow:

User-agent: *
Disallow: /

Answer: This file allows Google to search through everything on the server, while all other robots are asked to stay away completely.
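
Python's standard library can evaluate such rules for you. The following is a minimal sketch that feeds the example file above to `urllib.robotparser` without touching the network. The agent name `MyScraper` and the path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, one line per list entry.
rules = [
    "User-agent: Google",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly instead of downloading them

# "MyScraper" and the path are hypothetical names used only for this sketch.
google_allowed = parser.can_fetch("Google", "/some/page.html")
scraper_allowed = parser.can_fetch("MyScraper", "/some/page.html")

print("Google allowed:   ", google_allowed)
print("MyScraper allowed:", scraper_allowed)
```

For a live site you would instead call `set_url()` with the address of the robots.txt file and then `read()` before asking `can_fetch()`.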

Things to consider:
-------------------

* can simply be ignored, because compliance is voluntary
* can be a security risk - *** Why? ***

Answer: The file is public, so you are basically telling everybody who cares to look into it where you have stored sensitive information.


Scraping with Python:
=====================

* scraping is all about HTML tags
* bad news: 
    - need to learn about tags
    - websites can be ugly

HTML
=====

* HyperText Markup Language

* standard for creating webpages

* HTML tags 
    - have angle brackets
    - typically come in pairs

The following is an example of a minimal webpage written in HTML. The root tag is `<html>`, followed by the `<head>` tag. The head typically contains the title of the page and may also hold other meta information, such as the author or keywords that are important for search engines. The `<body>` tag marks the actual content of the page. You can play around with the `<h2>` tag by trying different heading levels, which range from 1 to 6.

In [21]:
s = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
  </body>
</html>"""

h = HTML(s)
h
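
To see that tags really do come in pairs, here is a small sketch that uses the standard library's `html.parser` to log every opening and closing tag of the minimal page. The class name `TagLogger` is my own invention for this illustration.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every opening and closing tag in the order it is seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("open  <%s>" % tag)

    def handle_endtag(self, tag):
        self.events.append("close </%s>" % tag)

page = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
  </body>
</html>"""

logger = TagLogger()
logger.feed(page)
logger.close()
print("\n".join(logger.events))
```

Every `open` line has a matching `close` line, and the nesting order shows the tree structure that Beautiful Soup exposes later on.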

Useful Tags
===========

* heading
`<h1></h1> ... <h6></h6>`

* paragraph
`<p></p>` 

* line break
`<br>` 

* link with attribute

`<a href="http://www.example.com/">An example link</a>`


Scraping with Python:
=====================

* example of a beautifully simple webpage:

http://www.crummy.com/software/BeautifulSoup

Scraping with Python:
=====================

* good news: 
    - some browsers help
    - look for: inspect element
    - need only basic html
    
*** Try 'Ctrl-Shift I' in Chrome ***

*** Try 'Command-Option I' in Safari ***



Scraping with Python
==================

* different useful libraries:
    - urllib
    - beautifulsoup
    - pattern
    - LXML
    - ...
    

The following cell defines a URL as a string and then reads the data from that URL using the `urllib` library. The print command shows that we get the whole HTML content of the page in the variable `source`. Note that `read()` returns a `bytes` object, not a `str`.

In [22]:
url = 'http://www.crummy.com/software/BeautifulSoup'
source = urllib.request.urlopen(url).read()
print(source)


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:992)>
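
Network requests can fail, as the `URLError` above shows. A hedged sketch of how one might catch such failures follows. `fetch` is a hypothetical helper, not part of `urllib`, and the `data:` URL is only there so the demonstration runs without a network connection.

```python
import urllib.request
import urllib.error

def fetch(url, timeout=10):
    """Return the raw bytes behind `url`, or None if the request fails.

    `fetch` is a hypothetical helper for this sketch, not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # Covers DNS failures, refused connections, and SSL errors
        # such as CERTIFICATE_VERIFY_FAILED.
        print("Could not fetch %s: %s" % (url, err.reason))
        return None

# A data: URL keeps the demonstration offline; swap in a real address to scrape.
demo = fetch("data:text/plain,Hello%20Soup")
print(demo)
```

Returning `None` instead of raising lets a scraping loop skip a broken page and carry on with the next one.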

Exercise :
======

* Is the word 'Alice' mentioned on the Beautiful Soup homepage?
* How often does the word 'Soup' occur on the site?
* At what index does the substring 'high-profile projects' occur?

In [None]:
## is 'Alice' in source?
print(b'Alice' in source)

## count occurrences of 'Soup'
print(source.count(b'Soup'))

## find index of 'high-profile projects' (find returns -1 if it is missing)
position = source.find(b'high-profile projects')
print(position)

## quick test to see the substring in the source variable
## bytes objects can be sliced like lists
print(source[position:position + 21])

## or the tidier version:
print(source[position:position + len(b'high-profile projects')])
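
The `b` prefixes above are needed because `source` is a `bytes` object, and mixing `str` and `bytes` in these operations raises a `TypeError`. A sketch of the alternative, decoding once and then working with ordinary strings, is shown below. The sample bytes are made up, and UTF-8 is an assumption about the page's encoding.

```python
# Made-up stand-in for the downloaded page; urlopen(...).read() returns bytes like this.
sample = b"<p>Beautiful Soup is used in high-profile projects. Soup!</p>"

# Decode once (UTF-8 is an assumption about the page's encoding).
text = sample.decode("utf-8")

has_alice = "Alice" in text
soup_count = text.count("Soup")
position = text.find("high-profile projects")
snippet = text[position:position + len("high-profile projects")]

print(has_alice, soup_count, position, snippet)
```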

Beautiful Soup
==============

* designed to make your life easier
* many good functions for parsing html code

Some examples
=============

Now we create a beautiful soup object from the string variable source. Note that the `prettify()` function formats the output to show the different levels of the HTML code. 

In [None]:
## get bs4 object
soup = bs4.BeautifulSoup(source,'lxml')
 
## compare the two print statements
# print(soup)
# print(soup.prettify())

## show how to find all a tags
soup.findAll('a')

In [None]:
## ***Why does this not work? ***
soup.findAll('Soup')

The last command only returns an empty list, because `Soup` is not an HTML tag. It is just a string that occurs in the webpage.
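
If you do want to search the text rather than the tags, Beautiful Soup accepts a `string` argument. Here is a small sketch on an invented page, parsed with Python's built-in `html.parser` so that it does not depend on `lxml`.

```python
import re
import bs4

# Tiny invented page, parsed with the parser that ships with Python.
html = "<html><body><h1>Beautiful Soup</h1><p>Soup is good.</p><a href='/x'>link</a></body></html>"
demo_soup = bs4.BeautifulSoup(html, "html.parser")

# Tag search: there is no <Soup> tag, so the result is empty.
by_tag = demo_soup.find_all("Soup")

# Text search: match every string that contains the word 'Soup'.
by_text = demo_soup.find_all(string=re.compile("Soup"))

print(by_tag)
print(by_text)
```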

Some examples
=============

In [None]:
## get attribute value from an element:
## find tag: this only returns the first occurrence, not all tags in the string
first_tag = soup.find('a')

## get attribute `href`
first_tag.get('href')

In [None]:
## get all links in the page
link_list = [l.get('href') for l in soup.findAll('a')]
## filter all external links
# create an empty list to collect the valid links
external_links = []

# write a loop to filter the links
# if it starts with 'http' we are happy
for l in link_list:
       if l[:4] == 'http':
            external_links.append(l)

# this throws an error! It says something about 'NoneType'

In [None]:
# let's investigate. Have a close look at the link_list:
link_list

# Seems that there are None elements!
# Let's verify
print(sum([l is None for l in link_list]))

# So there are two elements in the list that are None!

In [None]:
# Let's filter those objects out in the for loop
external_links = []

# write a loop to filter the links
# if it is not None and starts with 'http' we are happy
for l in link_list:
    if l is not None and l[:4] == 'http':
        external_links.append(l)
        
external_links

Note: The above `if` condition works because of short-circuit evaluation in Python. An `and` expression is already false if its first part is false, so the second part is never evaluated. Thus a `None` entry in the list is never asked about its first four characters.
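
A small sketch makes this visible. The helper `starts_with_http` is hypothetical and exists only to record which entries actually reach the second half of the condition.

```python
links = ["http://example.com", None, "/local/page.html"]

checked = []  # records which entries actually reach the second test

def starts_with_http(link):
    """Hypothetical helper that logs each call before testing the prefix."""
    checked.append(link)
    return link[:4] == "http"

external = [l for l in links if l is not None and starts_with_http(l)]

print(external)
print(checked)
```

The `None` entry never shows up in `checked`, so slicing it is never attempted.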

In [None]:
# another option for the if statement: the startswith method
# (I didn't know about it until it was pointed out in class. Thanks!)
# we can put this in a list comprehension as well, it almost reads like a sentence.

[l for l in link_list if l is not None and l.startswith('http')]
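
Another way to avoid the `None` entries is to ask Beautiful Soup only for `a` tags that carry an `href` attribute in the first place. A sketch on invented markup, again using the built-in `html.parser`:

```python
import bs4

html = """<a name="top">anchor without href</a>
<a href="http://example.com/">external</a>
<a href="download.html">relative</a>"""
demo_soup = bs4.BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in demo_soup.find_all("a", href=True)]
external = [link for link in hrefs if link.startswith("http")]

print(hrefs)
print(external)
```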

Parsing the Tree
================



In [None]:
# redefining `s` without any line breaks
s = """<!DOCTYPE html><html><head><title>This is a title</title></head><body><h3> Test </h3><p>Hello world!</p></body></html>"""
## get bs4 object
tree = bs4.BeautifulSoup(s,'lxml')

## get html root node
root_node = tree.html

## get head from root using contents
head = root_node.contents[0]

## get body from root
body = root_node.contents[1]

## could directly access body
tree.body
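
The line breaks were removed from `s` for a reason: whitespace between tags ends up in `.contents` as text nodes, which shifts the indices. The sketch below shows this on an indented version of the page, using the built-in `html.parser` (other parsers may treat whitespace slightly differently).

```python
import bs4

# Same page as above, but with line breaks and indentation between the tags.
pretty = """<html>
  <head><title>This is a title</title></head>
  <body><h3> Test </h3><p>Hello world!</p></body>
</html>"""
pretty_tree = bs4.BeautifulSoup(pretty, "html.parser")
root = pretty_tree.html

# .contents mixes tags with the whitespace strings between them.
print([type(child).__name__ for child in root.contents])

# recursive=False restricts find_all to direct children that are tags.
child_tags = root.find_all(recursive=False)
print([tag.name for tag in child_tags])
```

Filtering for tags this way is usually more robust than relying on fixed positions in `.contents`.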

Exercise:
=====

* Find the `h3` tag by parsing the tree starting at `body`
* Create a list of all __Hall of Fame__ entries listed on the Beautiful Soup webpage
    - hint: it is the only unordered list in the page (tag `ul`)


In [None]:
## get h3 tag from body
body.contents[0]

In [None]:
## use ul as entry point
entry_point = soup.find('ul')

## get hall of fame list from entry point
## skip the first entry 
hall_of_fame_list = entry_point.contents[1:]

## reformat into a list containing strings
tmp = []
for li in hall_of_fame_list:
    tmp.append(li.contents)
    
tmp

`tmp` is now actually a list of lists containing the hall of fame entries. 
I had to ask a colleague to solve this for me, so thanks to Ray, here is some 
advanced Python that prints just one entry per list item.

The cool things about this are: 
* the use of `""` just to access the `join` method of strings
* the `join` method itself
* that you can nest a second `for` loop (here a generator expression) inside a list comprehension

In [None]:
test =  ["".join(str(a) for a in sublist) for sublist in tmp]
print('\n'.join(test))
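
If you only need the visible text of each entry and not the markup, `get_text()` is a shorter route. The following sketch runs on an invented list that merely mimics the shape of such a section. It is not the real page content.

```python
import bs4

# Invented list that mimics the shape of a Hall of Fame section.
html = """<ul>
<li><a href="http://example.com/a">Project A</a> uses Beautiful Soup.</li>
<li><a href="http://example.com/b">Project B</a> does   too.</li>
</ul>"""
demo_soup = bs4.BeautifulSoup(html, "html.parser")

entries = []
for li in demo_soup.find("ul").find_all("li"):
    # get_text() drops the markup; split/join collapses runs of whitespace.
    entries.append(" ".join(li.get_text().split()))

print("\n".join(entries))
```

Asking for the `li` tags directly also sidesteps the whitespace text nodes that sit between the list items in `.contents`.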

Advanced Example
================

Designed by Katharine Jarmul
----------------------------

https://github.com/kjam/python-web-scraping-tutorial




Scraping Happy Hours
====================

Scrape the happy hour list of LA for personal preferences
https://downtownla.com/explore/dining-and-drinks/happy-hour-finder

This example is part of her talk about data scraping at PyCon2014. She is a really good speaker and I enjoyed watching her talk. Check it out: http://www.youtube.com/watch?v=p1iX0uxM1w8

In [None]:
stuff_i_like = ['burger', 'sushi', 'sweet potato fries', 'BBQ']
found_happy_hours = []
my_happy_hours = []
# First, I'm going to identify the areas of the page I want to look at
url = 'https://downtownla.com/explore/dining-and-drinks/happy-hour-finder'
#'http://www.downtownla.com/3_10_happyHours.asp?action=ALL'
source = urllib.request.urlopen(url).read()
# name the parser explicitly, as in the earlier cells, to avoid a bs4 warning
tables = bs4.BeautifulSoup(source, 'lxml')

In [None]:
# Then, I'm going to sort out the *exact* parts of the page
# that match what I'm looking for...
for t in tables.findAll('div', {'class': 'gcard-label'}):
    text = t.text
    print(text)
    for s in t.findNextSiblings():
        text += '\n' + s.text
    found_happy_hours.append(text)

print("The scraper found %d happy hours!" % len(found_happy_hours))
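
To see what the sibling loop collects, here is a sketch on a tiny invented card. Only the class name `gcard-label` is taken from the cell above. The rest of the markup is an assumption about how such a card might look.

```python
import bs4

# Invented markup; only the class name 'gcard-label' is taken from the cell above.
html = """<div class="card">
<div class="gcard-label">Bar Example</div>
<p>burger and fries</p>
<p>Mon-Fri 4-7pm</p>
</div>"""
demo_soup = bs4.BeautifulSoup(html, "html.parser")

label = demo_soup.find("div", {"class": "gcard-label"})
# find_next_siblings() returns the tags that follow `label` under the same parent.
siblings = label.find_next_siblings()
card_text = "\n".join([label.get_text()] + [sib.get_text() for sib in siblings])

print(card_text)
```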

In [None]:
# Now I'm going to loop through the food I like
# and see if any of the happy hour descriptions match
for food in stuff_i_like:
    for hh in found_happy_hours:
        # checking for text AND making sure I don't have duplicates
        if food in hh and hh not in my_happy_hours:
            print("YAY! I found some %s!" % food)
            my_happy_hours.append(hh)

print("I think you might like %d of them, yipeeeee!" % len(my_happy_hours))
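
Note that the `in` test above is case-sensitive, so 'BBQ' would miss a description that says 'bbq'. A hedged variation that lowercases both sides before comparing is sketched below. The descriptions are invented stand-ins for the scraped text.

```python
liked_foods = ["burger", "sushi", "sweet potato fries", "BBQ"]

# Invented descriptions standing in for the scraped happy hours.
found = [
    "Bar One\nHalf-price Burgers 4-6pm",
    "Grill Two\nbbq sliders and sushi rolls",
    "Cafe Three\nDiscounted espresso",
]

matches = []
for hh in found:
    lowered = hh.lower()
    # any() stops at the first food that matches, so each entry is added once.
    if any(food.lower() in lowered for food in liked_foods):
        matches.append(hh)

print(len(matches))
```

Looping over the happy hours first and the foods second also removes the need for the duplicate check.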

In [None]:
# Now, let's make a mail message we can read:
message = 'Hey Katharine,\n\n\n'
message += 'OMG, I found some stuff for you in Downtown, take a look.\n\n'
message += '\n==============================\n'.join(my_happy_hours)
#message = message.encode('???')
# To read more about encoding:
# http://diveintopython.org/xml_processing/unicode.html
message = message.replace('\t', '').replace('\r', '')
message += '\n\nXOXO,\n Your Py Script'

print(message)