In [2]:
from IPython.display import HTML

def css_styling():
    # Read the course stylesheet and hand it to the notebook as HTML
    with open("../data/www/styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

# Web scraping and crawling

Now we're moving forward in terms of difficulty - writing code to traverse and capture data from the web.

You largely already have the skills necessary to do this; the major skill is being able to parse the structure and text of an HTML document. Now we are simply going to put together the mental map of how to instruct a program to walk from page to page.

# Orders of complexity

There is an increasing level of difficulty in how one scrapes web pages and the intransigence of your target should be the determining factor in which approach you implement (i.e. don't buy a bazooka to go to a knife fight).

* Exploiting regularly structured urls (`requests`)
* Crawling a site with typically static content (`scrapy`)
* Crawling a site with dynamic content and human restrictions (`selenium`)

## So let's continue - regularly structured urls

To illustrate this approach, I want to use company financial filings since they contain a wealth of information. For any publicly traded company, you can access all of their filings through the [SEC Edgar website](https://www.sec.gov/edgar/searchedgar/companysearch.html).

However, to access the filings you will need to have a company's CIK number (this is used to disambiguate companies). Fortunately, the SEC provides that search function for you.

<img src='../images/edgar_search.png'>

Now, the trick here is that once you press the search button and get the results you should check the url bar.

<img src='../images/edgar_url.png'>

Notice anything....pertinent? Repeatable?

The trick is that you make sure that the url has your search query (`Google` in our case) in plain text - then modify the search term in place and try the new url. Does it work? If it does...you can 'scrape' any site easily.
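As a minimal sketch of this trick, the snippet below swaps each company name into the query string of a search url. The endpoint and parameter names (`browse-edgar`, `company`, `action=getcompany`) are assumptions modeled on EDGAR's legacy company search and may not match what your url bar shows, so copy the real ones from your browser.

```python
from urllib.parse import urlencode

# ASSUMPTION: this base url and these parameter names mirror EDGAR's legacy
# company search; replace them with whatever appears in your own url bar.
BASE_URL = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_search_url(company):
    """Return the search url with `company` substituted into the query string."""
    params = {"company": company, "action": "getcompany"}
    # urlencode escapes spaces and special characters for us
    return BASE_URL + "?" + urlencode(params)


for name in ["Google", "Zebra"]:
    print(build_search_url(name))
```

Each url can then be handed to `requests.get()` and the response parsed like any other HTML document.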

## Exercise

I want you to scrape all the CIKs for the following list of companies.

In [4]:
#Exercise

companies = ['Google', 'Zebra', 'Cisco', 'Oracle', 'Amazon']

And now with these CIKs I want you to pull all filing descriptions. Keep them associated with the CIK and save them to a file in a folder you create in `classdata`.

In [18]:
#Exercise


Pretty good! But one issue with our lazy scraping - what about pages that have more than 40 descriptions?
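One way to handle this, sketched under assumptions: keep requesting pages with an increasing offset until a page comes back with fewer rows than the page size. Here `fetch_page` is a hypothetical stand-in for your own function that requests the results starting at a given offset and returns the descriptions as a list. How the offset is passed (often a query parameter) depends on the site, so check the url after clicking through to the next page.

```python
def collect_all(fetch_page, page_size=40):
    """Gather rows from every page by advancing an offset.

    fetch_page(start) is a hypothetical callable that returns the list of
    rows found on the page beginning at offset `start`.
    """
    rows = []
    start = 0
    while True:
        page = fetch_page(start)
        rows.extend(page)
        # A short (or empty) page means there is nothing left to request.
        if len(page) < page_size:
            break
        start += page_size
    return rows


# Stand-in for a real request: 95 fake descriptions served 40 at a time.
fake_rows = ["filing {}".format(i) for i in range(95)]
result = collect_all(lambda start: fake_rows[start:start + 40])
print(len(result))
```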

In [42]:
#Exercise


And you could just as easily change this to follow the links and download the original documents that were filed.

# Crawling static content 

You could just as easily continue using requests and this type of logic to crawl an entire website (find all `<a>` tags, follow them, track which urls have already been travelled to, etc.).
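The crawl logic described here can be sketched with the standard library alone. The `fetch` argument is a placeholder for whatever downloads a page (for example `lambda url: requests.get(url).text`); the in-memory `site` dictionary and the `example.com` urls are made up so that the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` must return the page's HTML text."""
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            # Stay on the starting site and skip pages already seen.
            if link.startswith(start_url) and link not in visited:
                queue.append(link)
    return visited


# Tiny in-memory "site" (hypothetical urls) standing in for real requests.
site = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="http://other.org/">elsewhere</a>',
}
pages = crawl("http://example.com/", lambda url: site.get(url, ""))
print(sorted(pages))
```

The `visited` set is what keeps the crawler from looping forever when pages link back to each other.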

Here we will work with a library called Scrapy. One of its benefits is that its maintainers offer a hosted service (Scrapy Cloud) that you can deploy your scraper to.

In [67]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
    100% |################################| 256kB 3.3MB/s
Collecting service-identity (from scrapy)
  Downloading service_identity-17.0.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy)
  Downloading cssselect-1.0.3-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Downloading parsel-1.2.0-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-1.18.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.5.tar.gz
Collecting queuelib (from scrapy)
  Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from scrapy)
  Downloading Twisted-17.9.0.tar.bz2 (3.0MB)
    100% |################################| 3.0MB 305kB/s
Collecting attrs (from service-identity->scrapy)
  Downloading attrs-17.3.0-py2.py3-none-any.whl
Collecting zope.interface>=4.0.2 (from Twisted>=13.1.0->scrapy)
  Download

The downside of Scrapy is that it requires quite a bit of boilerplate to get going. A spider has to be defined as a class that subclasses `scrapy.Spider`.

However, after that it is pretty simple. It needs one method, `parse`, which receives each downloaded response and yields the scraped items (or further requests to follow).

In [72]:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        #Yields the title and url of a story
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        #Yields a response follow object with the next page data
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

Now for the other pain - we actually can't run Scrapy code in the Jupyter notebook easily. You'll need to write it as a script, which I have already done for you [here](scrapy_example.py).

However, we can run the bash command to execute this file from the Jupyter notebook. We just need to put the `!` in front of it so the notebook shell knows that we are executing a bash command.

We can also even store the output as a python variable to then interact with it in the notebook!

In [73]:
blog_urls = !scrapy runspider scrapy_example.py

In [75]:
blog_urls

['2017-12-27 14:29:06 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)',
 "2017-12-27 14:29:06 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}",
 '2017-12-27 14:29:06 [scrapy.middleware] INFO: Enabled extensions:',
 "['scrapy.extensions.telnet.TelnetConsole',",
 " 'scrapy.extensions.corestats.CoreStats',",
 " 'scrapy.extensions.logstats.LogStats',",
 " 'scrapy.extensions.memusage.MemoryUsage']",
 '2017-12-27 14:29:06 [scrapy.middleware] INFO: Enabled downloader middlewares:',
 "['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',",
 " 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',",
 " 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',",
 " 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',",
 " 'scrapy.downloadermiddlewares.retry.RetryMiddleware',",
 " 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',",
 " 'scrapy.downloadermiddlewares.httpcompression.HttpCompr

Alternatively, you could save the results to a file directly from the scraping code by writing each title to a file inside `parse` instead of yielding it.

Writing a real spider is a bit more complicated and will require usage of python scripts and bash commands. I will attempt to do as much in the notebook as possible, but we will need to switch back and forth.

In [76]:
!scrapy startproject scraping_tutorial

New Scrapy project 'scraping_tutorial', using template directory '/Users/adampah/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /Users/adampah/Dropbox (Personal)/Teaching/CSSMA/kphd-css/lessons/scraping_tutorial

You can start your first spider with:
    cd scraping_tutorial
    scrapy genspider example example.com


We will simply follow the tutorial for now and scrape quotes from `quotes.toscrape.com` to be nice and not unleash 20 spiders on an unsuspecting website.

So where to start? The first step is understanding the directory structure that was created.

In [78]:
!ls -l scraping_tutorial

total 16
drwxr-xr-x@ 9 adampah  staff  306 Dec 27 14:45 scraping_tutorial
-rw-r--r--@ 1 adampah  staff  278 Dec 27 14:45 scrapy.cfg


You have a config file (as noted by the `cfg` extension) and the directory that contains all of the code. Right now the config file will be pretty barren, but this is where you will add options for your project to manage the multiple scrapers. 

Right now the bigger deal is the `scraping_tutorial` folder:

In [82]:
ls -l scraping_tutorial/scraping_tutorial/

total 64
-rw-r--r--@ 1 adampah  staff     0 Dec 27 14:13 __init__.py
drwxr-xr-x@ 2 adampah  staff    68 Dec 27 14:13 __pycache__/
-rw-r--r--@ 1 adampah  staff   295 Dec 27 14:45 items.py
-rw-r--r--@ 1 adampah  staff  1914 Dec 27 14:45 middlewares.py
-rw-r--r--@ 1 adampah  staff   296 Dec 27 14:45 pipelines.py
-rw-r--r--@ 1 adampah  staff  3236 Dec 27 14:45 settings.py
drwxr-xr-x@ 4 adampah  staff   136 Dec 27 14:13 spiders/


Since this is a relatively simple scraper, we only care about adding spiders to the spiders folder (which is currently empty).

In [83]:
ls -l scraping_tutorial/scraping_tutorial/spiders

total 16
-rw-r--r--@ 1 adampah  staff  161 Dec 27 14:13 __init__.py
drwxr-xr-x@ 2 adampah  staff   68 Dec 27 14:13 __pycache__/


And now we have to create our Quotes spider. To do this you will need to create a python script `quotes_spider.py` in the `spiders/` folder. The url is `'http://quotes.toscrape.com/page/1/'` and all you need to do is write the body of the page into an HTML file.

In [85]:
!cd scraping_tutorial && scrapy crawl quotes

2017-12-27 15:08:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scraping_tutorial)
2017-12-27 15:08:20 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scraping_tutorial', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'scraping_tutorial.spiders', 'SPIDER_MODULES': ['scraping_tutorial.spiders']}
2017-12-27 15:08:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole']
2017-12-27 15:08:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.Ret

In [86]:
!ls scraping_tutorial/

quotes-1.html     scraping_tutorial scrapy.cfg


We can also change what the spider does with each page: instead of getting the page body and saving it, we could yield the quotes directly by selecting the appropriate css elements.

In [88]:
!cd scraping_tutorial && scrapy crawl quotes

2017-12-27 15:19:46 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scraping_tutorial)
2017-12-27 15:19:46 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['scraping_tutorial.spiders'], 'BOT_NAME': 'scraping_tutorial', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'scraping_tutorial.spiders'}
2017-12-27 15:19:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-12-27 15:19:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.Ret

You can build far more complicated scrapers (scraping and processing scraped items), but this is a sufficient introduction.

For the last part, we will tackle the most complicated approach - scraping dynamic content by impersonating a human with a real web browser.

In [89]:
!pip install selenium

Collecting selenium
  Downloading selenium-3.8.0-py2.py3-none-any.whl (941kB)
    100% |################################| 942kB 1.0MB/s
Installing collected packages: selenium
Successfully installed selenium-3.8.0
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


You will also need to move the `geckodriver` executable into `/usr/local/bin/` (or follow the Windows instructions).

Now watch for something totally crazy.

In [92]:
!python selenium_example.py

Traceback (most recent call last):
  File "/Users/adampah/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/site-packages/selenium/webdriver/common/service.py", line 74, in start
    stdout=self.log_file, stderr=self.log_file)
  File "/Users/adampah/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "/Users/adampah/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/subprocess.py", line 1544, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "selenium_example.py", line 4, in <module>
    driver = webdriver.Firefox()
  File "/Users/adampah/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/site-packages/selenium/webdriver/firefox/webdriver.py", line 148, in __init__
    self.service.start()
  File "/Users/adam

Yup, that's right. It started an entire web browser (Firefox in this case). This is why selenium is the most powerful (and costly) solution to scraping. 

So now let's inspect this code:

In [93]:
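from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Launch a real Firefox window (requires geckodriver on your PATH)
driver = webdriver.Firefox()
driver.get("http://www.python.org")
# Locate the search box; find_element(By.NAME, ...) also works on newer
# selenium releases where find_element_by_name has been removed
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
# Fail loudly if the search came back empty
assert "No results found." not in driver.page_source
driver.close()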


You start by creating a webdriver for the browser of your choice (Firefox here).

Using `driver.get()` you give it a url to load.

Once there, you can instruct it to search for a specific element by its name. In this case `q` is the input field for searching the site.

As a pre-emptive move, the code clears the box and then sends the query `pycon`.

It then hits return and asserts that the text `No results found.` does not appear in the page source (that is, that the search returned something) before closing.

Simple, right?

Now let's try to search for `Obama` on CNN.

In [94]:
#Exercise


Amazing! **But complicated**. We can also use the forward and back buttons for the browser.

In [107]:
driver.back()

In [108]:
driver.forward()

And you could print the page source (and thus save it) or put it into Beautiful Soup.

In [109]:
driver.page_source

'<html class="js no-flash geolocation websockets localstorage webworkers no-touchevents fontface supports textshadow csscolumns csscolumns-width no-csscolumns-span csscolumns-fill csscolumns-gap csscolumns-rule csscolumns-rulecolor csscolumns-rulestyle csscolumns-rulewidth no-csscolumns-breakbefore no-csscolumns-breakafter no-csscolumns-breakinside flexbox csstransforms3d no-mobile no-phone no-tablet mobilegradea no-ios no-android no-iospre10 no-iemobile no-ieunsupported no-ie11unsupported no-ie no-edge" style=""><head><script src="http://connect.facebook.net/signals/config/731697573629176?v=2.8.6&amp;r=stable" async=""></script><script async="" src="//connect.facebook.net/en_US/fbevents.js"></script><script type="text/javascript" src="http://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="http://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></sc

But this won't work magic: if it's not in the source in your browser, then it won't be in the source for selenium either.
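As a sketch of the Beautiful Soup route, the helper below pulls the text out of every element with a given class. The `sample` string stands in for `driver.page_source` so the snippet runs without a browser, and the `headline` class name is made up for illustration.

```python
from bs4 import BeautifulSoup


def headline_texts(page_source, css_class):
    """Return the stripped text of every element carrying `css_class`."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=css_class)]


# Stand-in for driver.page_source; the markup and class names are hypothetical.
sample = """
<html><body>
  <h3 class="headline">First story</h3>
  <h3 class="headline">Second story</h3>
  <p class="byline">Staff</p>
</body></html>
"""
print(headline_texts(sample, "headline"))
```

With a live session you would call `headline_texts(driver.page_source, ...)` using whatever class name the inspector shows.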

We can also find all of the elements that share the same class name.

In [110]:
from selenium.webdriver.common.by import By
headlines = driver.find_elements(By.CLASS_NAME, "cnn-search__result-headline")

In [112]:
headlines

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="386c0cd9-a87f-6e42-a189-3c05e0c5a21b")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="54f31c16-64f7-304d-a5df-f1d74d68eb7a")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="304c30b5-e277-c44b-b648-de2f70c8f528")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="5b6f32ea-8718-6041-aef6-4913804be23d")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="f4fc62db-c913-f042-8b02-521c1880a2c9")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="967631c3-8725-7547-bacc-0fb9b1b1d260", element="bcc32614-009c-1c4c-bcad-1a49365ae5b4")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement 

In [118]:
for hl in headlines:
    print(hl.text)

Prince Harry and Meghan Markle should snub President Trump
All tweets belong in the Library of Congress
Justice Dept. watchdog criticizes past handling of sexual harassment allegations
Prince Harry asked if Obama invited to wedding
Gallup: Obama, Hillary Clinton remain most admired
Prince Harry sidesteps Obama-Trump wedding guest list controversy
Tax cuts might create jobs. But where are the workers?
5 things for December 27: Snow, anthrax threats, infrastructure bill
9 most memeable political personalities of 2017
Obama cautions irresponsible social media use


# Final exercise

Pick your poison (`requests`, `scrapy`, or `selenium`) and scrape Instagram. Pick one of these five users as your initial seed account:

* hueypotatothefrenchie
* bigchunkymonkey
* eggnogthebulldog
* thepawsofbrixton.andcash
* chicago_bulldogs

Go through and identify user accounts that comment on photos from the given account.

# The value of accessing inaccessible content

This week we are examining Malmgren RD, Ottino JM, Amaral LAN. (2010). The role of mentorship in protégé performance. Nature 465, 622-626.

The article relied entirely on data from the [Math Genealogy Network](https://www.genealogy.math.ndsu.nodak.edu) and [MathSciNet](https://mathscinet.ams.org/mathscinet/) to construct lineages of mentors and individual productivity. This research is not possible without extracting and combining these two data sources. Put together, they unlock the possibility of examining an important and previously inaccessible question at scale.

Importantly, both of these websites lack the resources to provide an API to download the data.