In [1]:
import pandas as pd
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Webscraping

Week 4 | Day 4

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe how web scraping works conceptually
- Explain how web scraping works using Python
- Define how to approach scraping project data

# Webscraping

In data science work, it is often necessary to retrieve data from websites. Occasionally, sites will provide an API that allows their data to be easily accessed, but often this isn't the case. When an API is not available, the only real option is to build a webscraper.


A webscraper retrieves the webpage in the same way your browser retrieves the page, but because we're doing it with code, we are able to parse the resulting site's content.


**So how can we retrieve webpage content programmatically?**<br>
The first step is to understand how HTTP works...

## HTTP

Hypertext Transfer Protocol, or HTTP, is a text-based standard that allows clients and servers to communicate over TCP/IP. 

**HTTP  = the language computers communicate with**<br>
**TCP/IP = the channel over which that communication takes place**

HTTP is based on a client-server model. A client makes a request for some resource, and the server responds with the status of that request and the resource if available.
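
Because HTTP is text-based, the request and response can be read as plain strings. The following is a minimal sketch of that idea, not how any particular library builds its messages: `build_get_request` and `parse_status_line` are hypothetical helper names, `example.com` is a placeholder host, and the response is hand-written rather than received from a server.

```python
def build_get_request(host, path="/"):
    """Return the text a client could send for a simple GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, resource, protocol version
        f"Host: {host}",          # HTTP/1.1 requires a Host header
        "Accept: */*",
        "Connection: close",
        "",                       # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(raw_response):
    """Split the first line of a raw response into (version, code, reason)."""
    status_line = raw_response.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request_text = build_get_request("example.com")
print(request_text)

# A hand-written response standing in for what a server might send back
sample_response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
print(parse_status_line(sample_response))
```

The client sends the request text; the server answers with a status line, headers, and (if available) the resource itself.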

## HTTP Requests

There are two common types of HTTP requests: **GET and POST**

### GET Requests

GET requests are by far the most common. They simply ask the server to retrieve some resource, typically a webpage, and return it.

<img src="http://i.imgur.com/qBG7jmB.png" width="900">

### POST Requests

A POST request is nearly identical to a GET request, but includes a payload of some sort in the request body. 

<img src="http://i.imgur.com/mzWB0wD.png" width=900>

## Typical Use Cases

GET requests are the standard way to request a webpage (as your browser would do). Some simple forms use GET as well.

More sophisticated forms use a POST request. GET requests pass parameters in the URL, while POST requests carry them in the request body. This keeps the parameters out of the address bar, browser history, and most server logs, which tends to make POST requests somewhat more secure.

N.B. Do not rely on POST alone as a security measure!
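
To make the difference concrete, here is a small sketch using only the standard library. Nothing is sent over the network; the `Request` objects are just built and inspected. The URL `http://example.com/search` and the form field `q` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "data science"}   # example form field (made up)
encoded = urlencode(params)      # URL-encodes the pair: 'q=data+science'

# GET: the parameters travel in the URL's query string
get_req = Request("http://example.com/search?" + encoded)

# POST: the same parameters travel in the request body instead
post_req = Request("http://example.com/search", data=encoded.encode("ascii"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

In both cases the parameters are sent as plain text unless the connection itself is encrypted, which is why POST alone is not a security measure.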

## So once you make a request, naturally you expect a response...


In the language of HTTP, a response begins with a status code.

## HTTP Response Codes

- 1XX - Informational
- 2XX - Success
- 3XX - Redirection
- 4XX - Client Error
- 5XX - Server Error

### Response Codes - The Greatest Hits

- **200 - OK** - The requested action was successfully executed
- **301 - Moved Permanently** - The resource has been relocated (and will not be back, so please stop asking me)
- **400 - Bad Request** - The client request is malformed in some way
- **403 - Forbidden** - The requesting client (i.e. you) does not have permission to view the resource
- **404 - Not Found** - The resource can't be found at the moment (may be in the future, so check back later)
- **405 - Method Not Allowed** - Used GET when only POST was applicable for example
- **418 - I'm a teapot** - For when the server is a teapot
- **420 - NOT an HTTP code** - you're thinking of something else
- **429 - Too Many Requests** - They're on to you and if you keep it up, they'll block you permanently
- **500 - Internal Server Error** - Something non-specific and bad happened on their end
- **502 - Bad Gateway** - The server was waiting on another resource and it ended badly
- **503 - Service Unavailable** - The server is overloaded or down at the moment
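
Python's standard library already knows these codes. The sketch below looks up the reason phrase with `http.HTTPStatus` and derives the category from the first digit; `describe_status` and `STATUS_CLASSES` are hypothetical names introduced here for illustration.

```python
from http import HTTPStatus

# First digit of the code -> category, as in the list above
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a readable description such as '404 Not Found (Client Error)'."""
    category = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus only contains registered codes
        phrase = "not a registered HTTP status"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 429, 503, 420):
    print(describe_status(code))
```

When scraping, checking the category first (2XX versus 4XX or 5XX) is usually enough to decide whether to parse the content, back off, or give up.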

## So that is the basic language of the web, now how do we actually use this to get our content...

## Python Requests

<img src="http://i.imgur.com/qpfNAPb.png" width="900">

Requests allows us to send the server a request (GET or POST), and in return we receive the response code and, where applicable, the content.

## First, we make a request to retrieve a website

In [2]:
import requests

In [15]:
r = requests.get('http://news.ycombinator.com')

## We can check the response code

In [16]:
r

<Response [200]>

### Check: What is a 200? Is that good or bad for what we're trying to do?

## Let's see the request headers we sent

In [5]:
r.request.headers

{'Connection': 'keep-alive', 'Cookie': '__cfduid=d722634c18102b0ad56dc7bc2f2d90c6f1477768734', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.10.0'}

In [6]:
# we can print those out nicely
for k, v in r.request.headers.items():
    print(k + ':', v)

('Connection:', 'keep-alive')
('Accept-Encoding:', 'gzip, deflate')
('Accept:', '*/*')
('User-Agent:', 'python-requests/2.10.0')
('Cookie:', '__cfduid=d722634c18102b0ad56dc7bc2f2d90c6f1477768734')


## We can also see the response headers

In [7]:
r.headers

{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Strict-Transport-Security': 'max-age=31556900; includeSubDomains', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare-nginx', 'Connection': 'keep-alive', 'Cache-Control': 'private, max-age=0', 'Date': 'Sat, 29 Oct 2016 19:18:55 GMT', 'X-Frame-Options': 'DENY', 'Content-Type': 'text/html; charset=utf-8', 'CF-RAY': '2f98f9e0dfb3183a-EWR'}

In [8]:
for k, v in r.headers.items():
    print(k + ':', v)

('Date:', 'Sat, 29 Oct 2016 19:18:55 GMT')
('Content-Type:', 'text/html; charset=utf-8')
('Transfer-Encoding:', 'chunked')
('Connection:', 'keep-alive')
('Vary:', 'Accept-Encoding')
('Cache-Control:', 'private, max-age=0')
('X-Frame-Options:', 'DENY')
('Strict-Transport-Security:', 'max-age=31556900; includeSubDomains')
('Content-Encoding:', 'gzip')
('Server:', 'cloudflare-nginx')
('CF-RAY:', '2f98f9e0dfb3183a-EWR')
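
The `Content-Type` header is worth a closer look, because its `charset` parameter says how to decode the response bytes into text. Below is a minimal sketch using the standard library's `email.message.Message` to pull that parameter out of a header value; `charset_from_content_type` is a hypothetical helper name, and the fallback encoding is an assumption you should adjust per site.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # Returns the lowercased charset, or `default` if none is declared
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html", default="latin-1"))
```

With Requests, the same information is available as `r.encoding`, and `r.text` returns the content already decoded with it.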


## Let's see what content came back

In [9]:
r.content

'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?qteYN9s39JuqPSkaxUEz">\n        <link rel="shortcut icon" href="favicon.ico">\n          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n        <title>Hacker News</title>\n      </head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">\n        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>\n                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n              <a href="newest">new</a> | <a href="newcomments">comments</

## We can wrap that in HTML() to render the page

In [10]:
from IPython.display import HTML
HTML(r.content.decode('utf-8'))

*[Rendered output: the Hacker News front page as it appeared at the time, with the navigation bar (new | comments | show | ask | jobs | submit | login), 30 ranked stories each showing points, submitter, age, and comment count, and the footer links (Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact).]*


## Exercise

- Using the requests library, retrieve a webpage of your choosing with a GET request
- Examine the response code, the headers, and the content
- Use `HTML()` from `IPython.display` to display the page in your notebook
- Compare the results with the actual page you requested in your browser

## Webscraping - The Struggle is real

- Robots.txt
- User Agent
- Ajax
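
The first two items can be sketched with the standard library alone. A site's `robots.txt` states which paths crawlers may fetch, and the User-Agent header identifies your client to the server. In the sketch below the robots.txt content, the `example.com` URLs, and the `lesson-scraper/0.1` name are all made up for illustration; the file is parsed from a string so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is asked to stay out of /private/
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "lesson-scraper/0.1"   # hypothetical name for our scraper

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())

for url in ("http://example.com/index.html",
            "http://example.com/private/data.html"):
    print(url, "->", robots.can_fetch(USER_AGENT, url))

# The same identifier can be sent with each request, for example:
# requests.get(url, headers={"User-Agent": USER_AGENT})
HEADERS = {"User-Agent": USER_AGENT}
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of parsing a string.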

## Ajax - The enemy of the webscraper

In [11]:
r2 = requests.get('https://www.google.com/#q=data+science')

In [12]:
r2

<Response [200]>

In [13]:
# notice anything missing?
HTML(r2.content.decode('latin-1'))

*[Rendered output: only Google's bare search page ("Advanced search", "Language tools"), with no search results.]*


## What is AJAX?

>Conventional web applications transmit information to and from the server using synchronous requests. It means you fill out a form, hit submit, and get directed to a new page with new information from the server.

>With AJAX, when you hit submit, JavaScript will make a request to the server, interpret the results, and **update the current screen**. In the purest sense, the user would never know that anything was even transmitted to the server.
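
This also helps explain the empty Google result above. In that URL the search term sits after a `#`, and the part of a URL after `#` (the fragment) is kept by the browser for the page's JavaScript to read; it is not sent to the server. A quick way to see this is to split the URL with the standard library:

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.google.com/#q=data+science")

print("path:    ", repr(parts.path))      # sent to the server
print("query:   ", repr(parts.query))     # sent to the server (empty here)
print("fragment:", repr(parts.fragment))  # stays in the browser, never sent
```

So a plain GET of this URL asks the server only for `/`, and without JavaScript running there is nothing to act on the fragment and fetch the results.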

## How do you handle Ajax?

If a site uses Ajax to load the content you need to scrape, **you will have to use a browser object** to retrieve it.

The difference between a library like requests and an actual browser object is that requests just sends and receives text. The browser object "renders" the webpage just like Firefox or Chrome does, which includes running the page's JavaScript.

So how do we do this? We'll need two tools to accomplish this...


- Selenium

- PhantomJS

## Selenium

- Selenium is a browser automation library (used extensively in testing)<br>

 <img src="http://i.imgur.com/WLs22wp.png" width=500>

## PhantomJS

PhantomJS is a "headless" browser. It gives us all the functionality available in a full browser, but without the overhead of a UI.

<img src="http://i.imgur.com/hN5trU9.png" width="500">

## Using Selenium with PhantomJS

In [14]:
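from selenium import webdriver

# NOTE: webdriver.PhantomJS requires Selenium 3.x; it was removed in Selenium 4.
# Point executable_path at wherever you unzipped PhantomJS on your machine.
driver = webdriver.PhantomJS(executable_path='/Users/ac/Downloads/\
phantomjs-2.1.1-macosx/bin/phantomjs')
driver.set_window_size(1024, 768)
driver.get('https://www.google.com/#q=data+science')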

WebDriverException: Message: 'phantomjs' executable needs to be in PATH. 


In [None]:
# .page_source gives us our document
HTML(driver.page_source)

## Exercise
1. Install a 3.x release of Selenium with `pip install "selenium<4"` (`webdriver.PhantomJS` was removed in Selenium 4)
2. Download and unzip PhantomJS 2.1.1 from https://bitbucket.org/ariya/phantomjs/downloads
3. Use the library to pull down an Ajax-based page such as Google search results

# Now how do we get the content we want from the page?

## DOM

> The Document Object Model (DOM) is a programming interface for HTML and XML documents. It provides a structured representation of the document and it defines a way that the structure can be accessed from programs so that they can change the document structure, style and content. The DOM provides a representation of the document as a structured group of nodes and objects that have properties and methods. Essentially, it connects web pages to scripts or programming languages.

## Typical Web Page Structure

    <html>
        <head>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>
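
One way to see the tree structure in markup like this is to print each opening tag indented by how deeply it is nested. The sketch below uses the standard library's `html.parser` on a shortened version of the page; `OutlineParser` is a hypothetical class written for this illustration, and it assumes every tag is explicitly closed, as it is here.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth (a rough DOM outline)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.outline.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1   # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1   # closing a tag returns to the parent's level


sample_html = """
<html><head></head><body>
<div id="main">I'm a div!<ul><li>I'm list item 1</li><li>I'm list item 2</li></ul></div>
</body></html>
"""

parser = OutlineParser()
parser.feed(sample_html)
print("\n".join(parser.outline))
```

Each level of indentation corresponds to a parent-child relationship in the DOM: the `li` elements are children of the `ul`, which is a child of the `div`, and so on up to `html`.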

In [None]:
page_html = """
    <html>
        <head>
        <title>Super Cool Website!</title>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>
"""

## We're going to feed this full HTML into a library called Beautiful Soup

<img src="http://i.imgur.com/klVeXY7.png" width="800">

## Coding BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

## Pass the HTML into the BeautifulSoup object

In [None]:
soup = BeautifulSoup(page_html, "lxml")

From there it can be searched and parsed

## Print the html

In [None]:
print(soup.prettify())

## Let's now do some parsing of the HTML using the DOM

## Get the title

In [None]:
soup.title

In [None]:
soup.title.text

## Find - get the first result

In [None]:
soup.find('div')

## find_all - get all matching results

In [None]:
# findAll is the older spelling of find_all; both work in bs4
for i, d in enumerate(soup.find_all('div')):
    print(i, d)
    print('\n')

## Get the page's text

In [None]:
print(soup.text)

## Get the class of an element

In [None]:
# find returns the first result
soup.find('div')['class']

## Search by the id of an element

In [None]:
print(soup.find(id='main'))

## Search by the class

In [None]:
# note the underscore after class_ (class is a reserved word in Python)
print(soup.find_all(class_='extraFancy'))

## Get the children of an element

In [None]:
my_ul = soup.find('ul')

In [None]:
print(my_ul)

In [None]:
my_ul.findChildren()

## Exercise

Using Requests and BeautifulSoup, pull down Hacker News and print out the headlines and the story links in your notebook.

In [None]:
# solution

hn = requests.get('http://news.ycombinator.com')

In [None]:
#solution

hn.content

In [None]:
# solution

# pass the content into BS
hn_soup = BeautifulSoup(hn.content, "lxml")

In [None]:
# solution

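# 'storylink' was the class on story links when this lesson was written;
# inspect the live page's HTML if this loop prints nothing
for link in hn_soup.findAll('a', class_='storylink'):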
    print(link.text)
    print(link['href'])
    print('\n')

## Now for the Easy Way

## Import.io

Go to Import.io and give it the URL "http://www.zillow.com/new-york-city-ny/apartments/"

## Independent Practice

1. Programmatically run a Google search for 'Data Science' using Selenium and PhantomJS

2. Retrieve only the links and their titles using BeautifulSoup - avoid getting the ads in your list

3. Place those into a DataFrame

In [None]:
#solution

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/Users/ac/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.set_window_size(1024, 768) 
driver.get('https://www.google.com/#q=data+science')

In [None]:
# solution

g_soup = BeautifulSoup(driver.page_source, "lxml")

In [None]:
# solution

link_list = []
title_list = []

for t in g_soup.findAll('h3', class_='r'):
    full_path = t.find('a')['href']
    full_title = t.find('a').text
    if 'search' not in full_path:
        # strip the first 7 characters (presumably Google's '/url?q=' redirect prefix)
        link_list.append(full_path[7:])
        title_list.append(full_title)

In [None]:
# solution

pd.DataFrame(list(zip(title_list, link_list)), columns=['title', 'links'])
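
The solution above trims each link with `full_path[7:]`, which removes a fixed-length prefix but leaves any trailing tracking parameters attached. A slightly more robust alternative, sketched here under the assumption that the links have the form `/url?q=<destination>&...`, is to parse the query string and read the `q` parameter. `extract_target` is a hypothetical helper name and the sample link is made up.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(href):
    """Return the destination URL from a '/url?q=...' style redirect link."""
    query = parse_qs(urlsplit(href).query)
    targets = query.get("q")
    return targets[0] if targets else None


# A made-up link in the assumed format
sample_href = "/url?q=https://en.wikipedia.org/wiki/Data_science&sa=U&ved=abc123"

print(extract_target(sample_href))   # destination only
print(sample_href[7:])               # slicing keeps the trailing parameters
```

Links without a `q` parameter return `None`, which can be used to filter them out before building the DataFrame.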