## Objectives

Today we'll learn two things:  
1. How to turn the entire internet into a data source  
2. How to do that amazingly fast using threading



![webscraping](images/webscrape.png)

* Learning the basics of webscraping
 - HTTP vs HTML vs CSS
 - Requests
 - Session-based scraping
 - Crawling
 - Best practices
* Unlearning the basics: there's a lot more depth
 - Common issues
 - Tor and Selenium
* An intro to threading

## Learning the basics of webscraping
---------

### HTTP vs HTML vs CSS

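* What is the difference between these terms?
* HTTP is the protocol used to request and deliver a page, HTML is the markup that structures the page's content, and CSS is the language that styles it
* CSS selectors (tags, classes and ids) are what allow us to access the information we need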
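
To make the distinction concrete, here is a minimal sketch that uses only the standard library. The raw response text, the `price` class name and the `ClassTextCollector` helper are all invented for illustration. HTTP is the envelope (status line and headers), HTML is the document carried in the body, and the CSS class is the hook we use to find the element we want.

```python
from html.parser import HTMLParser

# A made-up HTTP response: status line, headers, blank line, then the HTML body.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    '<html><body><p class="price">$10</p><p class="note">sale</p></body></html>'
)

# HTTP part: everything before the first blank line.
head, body = raw_response.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # An element may carry several space-separated classes.
        classes = (dict(attrs).get("class") or "").split()
        self.capturing = self.css_class in classes

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.matches.append(data)


# HTML part: parse the body and pick out elements by CSS class.
collector = ClassTextCollector("price")
collector.feed(body)

print(status_line)        # the HTTP status line
print(headers)            # the HTTP headers
print(collector.matches)  # text located via the CSS class
```

In practice a parser such as `bs4` replaces the hand-written collector: `soup.select('.price')` performs the same lookup with a CSS selector.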

### The requests library

* If you can see it, you can scrape it
* A simple, powerful library
* Also consider the standard library's `urllib` (`urllib2` exists only in Python 2)
* Along with a parsing library like `bs4` (aka BeautifulSoup), this is the de facto web scraping suite in Python

In [14]:
import requests
url = 'http://www.galvanize.com'
r = requests.get(url)
r.status_code

200

In [23]:
from IPython.core.display import HTML
HTML(r.text) 

In [16]:
r.text[:1000]

'<!DOCTYPE html>\n<html lang="en">\n\t<head>\n\t\t<script src="//cdn.optimizely.com/js/2974420093.js"></script>\n\t\t<meta charset="utf-8">\n    \t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n\n\t    <title>Galvanize | Learn Code, Analytics, Data Science | Startup Space</title>\n\t\t<script type="text/javascript">\n\t\t\tif (top != self) {\n\t\t\t\tdocument.getElementsByTagName("HTML")[0].style.display = "none";\n\t\t\t\ttop.location=self.location;\n\t\t\t}\n\t\t</script>\n\t\t<link rel="icon" type="image/png" href="http://www.galvanize.com/wp-content/themes/galvanize/favicon.ico">\n\t\t<link rel="stylesheet" type="text/css" href="http://www.galvanize.com/wp-content/themes/galvanize/css/bootstrap.min.css">\n\t\t<link rel="stylesheet" type="text/css" href="http://www.galvanize.com/wp-content/themes/galvanize/style.css" />\n\t\t<link href="//maxcdn.bootstrapcdn.com

### Session-based Scraping

In [24]:
# Check out http://galvanizesf.roomzilla.net
# This site uses HTTP Basic Auth, one of the forms of authentication possible with this library
z = requests.get('http://galvanizesf.roomzilla.net', auth=('', 'gVIP543'))
# HTML(z.text)

![formdata](images/chrome-view-post-data.gif)

Courtesy of https://wpscholar.com/blog/view-form-data-in-chrome/

In [18]:
z = requests.get('https://accounts.craigslist.org/login/home')
# HTML(z.text)

In [19]:
pwd = 'pass_for_demo'

In [20]:
s = requests.Session()
form_data = {'step':'confirmation',
         'p': 0,
         'rt': '',
         'rp': '',
         'inputEmailHandle':'conor.murphy@galvanize.com',
         'inputPassword':pwd}
headers = {"Host": "accounts.craigslist.org",
           "Origin": "https://accounts.craigslist.org",
           "Referer": "https://accounts.craigslist.org/login",
           "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}
s.headers.update(headers)
z = s.post('https://accounts.craigslist.org/login', data=form_data)
z

<Response [200]>

In [25]:
# HTML(z.text)

### Crawling

* A crawler works with two sets of URLs
 - **seeds**: the URLs the crawler begins with
 - **the crawl frontier**: the URLs a crawler discovers on the sites it visits and has yet to fetch
* You can build a simple crawler using the `requests` library and a parser.  For more advanced crawling, you'll have to deal with issues such as:
 - selection policies
 - politeness policies (to keep from overloading sites)
 - revisitation
 - parallelizing the work of different crawlers
* Try using Scrapy for crawling
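
As a rough sketch of the seed and frontier idea, the crawler below walks a tiny in-memory "web" so that it runs without network access. The `FAKE_WEB` pages, the `LinkParser` helper and the `crawl` function are hypothetical names made up for this example. It uses the standard library's `html.parser` in place of `bs4`, and it ignores politeness and revisitation entirely.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny in-memory "web" (hypothetical pages) so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/c": "no links here",
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: start from the seeds, grow the frontier as links are found."""
    frontier = deque(seeds)  # URLs waiting to be fetched
    seen = set(seeds)        # URLs already queued, so nothing is fetched twice
    visited = []             # URLs fetched, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        visited.append(url)
        if html is None:     # fetch failed or page unknown
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl(["http://example.com/"], fetch=FAKE_WEB.get)
print(pages)
```

To crawl real pages, pass a function such as `lambda url: requests.get(url).text` as `fetch`, and add a delay between requests to stay polite.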


### Best Practices

* Always save all raw data
* Follow this workflow:
 - Scrape
 - Store (unstructured, often with MongoDB)
 - Parse
 - Store (structured)
 - Analyze
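
The workflow above might look like the sketch below. To keep it self-contained, an in-memory `sqlite3` database stands in for MongoDB, the `scrape` function returns a canned page instead of calling `requests.get`, and a regular expression stands in for a real parser. All table and function names are assumptions made for illustration.

```python
import re
import sqlite3
import time


def scrape(url):
    """Stand-in for the scrape step so the sketch runs offline (hypothetical page)."""
    return "<html><head><title>Example Domain</title></head><body>hi</body></html>"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_pages (url TEXT, fetched_at REAL, html TEXT)")
db.execute("CREATE TABLE page_titles (url TEXT, title TEXT)")

# Steps 1-2: scrape, then store the raw response untouched.
url = "http://example.com/"
db.execute("INSERT INTO raw_pages VALUES (?, ?, ?)", (url, time.time(), scrape(url)))

# Steps 3-4: parse from the stored copy (not the live site), then store the structured result.
for page_url, html in db.execute("SELECT url, html FROM raw_pages").fetchall():
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    db.execute("INSERT INTO page_titles VALUES (?, ?)", (page_url, title))

# Step 5: analyze the structured table.
rows = db.execute("SELECT url, title FROM page_titles").fetchall()
print(rows)
```

Because the raw HTML is kept, a bug in the parsing step can be fixed and re-run later without scraping the site again.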

## Unlearning the basics: there's a lot more depth
---------

Common pitfalls:

* rate limiting
* blocking of certain ISPs or IP ranges (especially those belonging to AWS)
* A/B testing

* Tor and Selenium are two tools that let you work around some of these limitations
* In essence, Tor disguises the originator of a message by relaying it through a chain of volunteer-run nodes
* Selenium automates a real web browser, making you appear like a normal user
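
Of the pitfalls above, rate limiting is the one most easily handled in plain code: wait, then retry. The sketch below shows exponential backoff under stated assumptions. `RateLimited`, `fetch_with_backoff` and `flaky_fetch` are hypothetical names, and the fake server stands in for a site that answers with HTTP 429.

```python
import time


class RateLimited(Exception):
    """Raised by the hypothetical fetch function when the server says 'slow down'."""


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of retries, let the caller decide
            sleep(base_delay * 2 ** attempt)


# Demo: a fake server that rejects the first two requests.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited(url)
    return "<html>ok</html>"


# Record the waits instead of actually sleeping, to keep the demo instant.
waits = []
result = fetch_with_backoff(flaky_fetch, "http://example.com/", sleep=waits.append)
print(result, waits)
```

With `requests`, the `fetch` argument could be a small wrapper that raises `RateLimited` when `r.status_code == 429`.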

## An intro to threading
---------

* Threads can be thought of as multiple programs running simultaneously in the same process
* Threads share the same memory, so every thread can see the process's variables
* Threading is particularly helpful with I/O-bound problems, like making GET requests to websites

In [8]:
import _thread # low-level module, used here only for demonstration; prefer `threading` in real code
import time

def print_time(threadName, delay):
    '''
    INPUT: name of thread as a string, delay time in seconds
    OUTPUT: None, prints time 5 times
    '''
    count = 0
    while count < 5:
        time.sleep(delay)
        count += 1
        print("{}: {}".format(threadName, time.ctime(time.time())))

_thread.start_new_thread( print_time, ("Thread-1", 2, ) )
_thread.start_new_thread( print_time, ("Thread-2", 4, ) )
_thread.start_new_thread( print_time, ("Thread-3", 1, ) )
_thread.start_new_thread( print_time, ("Thread-4", 5, ) )

123145510092800

Thread-3: Mon Apr  3 13:10:29 2017
Thread-1: Mon Apr  3 13:10:30 2017
Thread-3: Mon Apr  3 13:10:30 2017
Thread-3: Mon Apr  3 13:10:31 2017
Thread-2: Mon Apr  3 13:10:32 2017
Thread-1: Mon Apr  3 13:10:32 2017
Thread-3: Mon Apr  3 13:10:32 2017
Thread-4: Mon Apr  3 13:10:33 2017
Thread-3: Mon Apr  3 13:10:33 2017
Thread-1: Mon Apr  3 13:10:34 2017
Thread-2: Mon Apr  3 13:10:36 2017
Thread-1: Mon Apr  3 13:10:36 2017
Thread-4: Mon Apr  3 13:10:38 2017
Thread-1: Mon Apr  3 13:10:38 2017
Thread-2: Mon Apr  3 13:10:40 2017
Thread-4: Mon Apr  3 13:10:43 2017
Thread-2: Mon Apr  3 13:10:44 2017
Thread-4: Mon Apr  3 13:10:48 2017
Thread-2: Mon Apr  3 13:10:48 2017
Thread-4: Mon Apr  3 13:10:53 2017
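
For comparison, here is a sketch of the same demo with the higher-level `threading` module. The main difference is `join()`, which lets the main program wait for its threads. With `_thread` nothing waits, so in a standalone script the process may exit before the threads finish printing. The delays and the `repeats` parameter are shortened, illustrative values.

```python
import threading
import time


def print_time(thread_name, delay, repeats=3):
    """Print the current time `repeats` times, sleeping `delay` seconds before each."""
    for _ in range(repeats):
        time.sleep(delay)
        print("{}: {}".format(thread_name, time.ctime()))


# Short delays here just to keep the demo quick.
threads = [
    threading.Thread(target=print_time, args=("Thread-{}".format(i + 1), delay))
    for i, delay in enumerate([0.02, 0.04, 0.01])
]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until this thread has finished
elapsed = time.time() - start

# The threads overlap, so the total is close to the slowest thread, not the sum of all three.
print("All threads done in {:.2f} seconds".format(elapsed))
```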


## Lab Overview
---------

In [None]:
import queue
import time
from threading import Thread

import requests

start = time.time()

# specify sitemap to get all site links
url_list = ['http://www.nytimes.com'] ## USE YOUR URLS HERE

# create the queue instance and add urls to the queue
q = queue.LifoQueue()
for url in url_list:
    q.put(url)

# define how each URL is processed
def grab_data_from_queue():
    while True:
        try:
            url = q.get_nowait() # get the next item without blocking
        except queue.Empty:
            break # another thread may have taken the last item; we're done
        try:
            r = requests.get(url) # request the url

            ## ADD YOUR CODE HERE

        finally:
            q.task_done() # mark the item done even if the request fails

# create and start threads
for i in range(12): # aka the number of threads
    t1 = Thread(target = grab_data_from_queue) # target is the above function
    t1.start() # start the thread

q.join()

print('This code took {} seconds'.format(time.time()-start))
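
As an alternative to managing the queue and threads by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` wraps the same pattern. The sketch below is one possible shape for it, not the lab's required solution. `fake_fetch` is a hypothetical stand-in for `requests.get` so the example runs offline, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for requests.get(url): sleeps to mimic network latency."""
    time.sleep(0.05)
    return url, 200


url_list = ["http://example.com/page{}".format(i) for i in range(12)]

start = time.time()
with ThreadPoolExecutor(max_workers=12) as pool:
    # map() hands each URL to a worker thread and yields results in input order.
    results = list(pool.map(fake_fetch, url_list))
elapsed = time.time() - start

print("Fetched {} urls in {:.2f} seconds".format(len(results), elapsed))
```

Leaving the `with` block waits for all workers to finish, which plays the role of `q.join()` in the queue-based version.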