# Web scraping and crawling

Now we step up in difficulty: writing code to traverse and capture data from the web.

You already have most of the skills needed for this; the main one is being able to parse the structure and text of an HTML document. Now we simply need to build a mental map of how to instruct a program to walk from page to page.

# Orders of complexity

Scraping approaches come in increasing levels of difficulty, and the intransigence of your target should determine which approach you implement (i.e. don't buy a bazooka to go to a knife fight).

* Exploiting regularly structured URLs (`requests`)
* Crawling a site with typically static content (`scrapy`)
* Crawling a site with dynamic content and human restrictions (`selenium`)



## So let's continue - regularly structured URLs

To illustrate this approach, I want to use company financial filings since they contain a wealth of information. For any publicly traded company, you can access all of their filings through the [SEC Edgar website](https://www.sec.gov/edgar/searchedgar/companysearch.html).

However, to access the filings you will need a company's CIK (Central Index Key) number, which the SEC uses to disambiguate companies. Fortunately, the SEC provides that search function for you.

<img src='../images/edgar_search.png'>

Now, the trick here is that once you press the search button and get the results, you should check the URL bar.

<img src='../images/edgar_url.png'>

Notice anything....pertinent? Repeatable?

The trick is to make sure that the URL contains your search query (`Google` in our case) in plain text, then modify the search term in place and try the new URL. Does it work? If it does...you can 'scrape' any site built this way just by generating URLs.
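To make the idea concrete, here is a minimal sketch of generating search URLs by swapping the query term. The endpoint and the parameter names (`company`, `owner`, `action`) are assumptions based on the classic EDGAR company search, so compare them with what you actually see in your URL bar before relying on them. `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint: check it against the URL your own search produces.
EDGAR_SEARCH = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_search_url(company):
    """Return a search URL with `company` substituted into the query string."""
    # Assumed parameter names; urlencode also escapes spaces and punctuation.
    params = {"company": company, "owner": "exclude", "action": "getcompany"}
    return f"{EDGAR_SEARCH}?{urlencode(params)}"


# Swap the search term in place, one URL per company name.
for name in ["Google", "Cisco"]:
    print(name, build_search_url(name))
```

Each generated URL could then be handed to `requests.get()` to fetch the results page.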


## Exercise

I want you to scrape all the CIKs for the following list of companies and save them to a folder you create in `../data/classdata/company_searches/`.

In [None]:
#Exercise

companies = ['Google', 'Zebra', 'Cisco', 'Oracle', 'Amazon']

print(companies)


And now with these CIKs I want you to pull all filing descriptions. Keep them associated with the CIK and save them to a file in a folder you create in `../data/classdata/sec_descriptions`.

In [None]:
#Exercise


Pretty good! But one issue with our lazy scraping - what about pages that have more than 40 descriptions?
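One possible way to handle longer listings is to request the results one page at a time. The sketch below only builds the page URLs; the `start` and `count` parameters, the endpoint, and the helper name `filing_page_urls` are all assumptions, so confirm them by clicking through to a second page of results and reading the URL bar.

```python
from urllib.parse import urlencode

EDGAR_BROWSE = "https://www.sec.gov/cgi-bin/browse-edgar"  # assumed endpoint
PAGE_SIZE = 40  # descriptions per page, matching the limit noted above


def filing_page_urls(cik, max_pages):
    """Yield one listing URL per page by advancing the assumed `start` offset."""
    for page in range(max_pages):
        params = {
            "action": "getcompany",
            "CIK": cik,
            "start": page * PAGE_SIZE,  # assumed offset parameter
            "count": PAGE_SIZE,         # assumed page-size parameter
        }
        yield f"{EDGAR_BROWSE}?{urlencode(params)}"


# Hypothetical CIK, used only to show the URL pattern.
for url in filing_page_urls("0000000000", max_pages=3):
    print(url)
```

In practice you would probably stop requesting pages as soon as one comes back with fewer than 40 descriptions, rather than fixing `max_pages` in advance.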

And you could just as easily change this to follow the links and download the original documents that were filed.

# Browser Driven Scraping

For the last part, we will tackle the most complicated approach - scraping dynamic content by impersonating a human with a real web browser.

In [None]:
#!pip install --upgrade selenium

We will need to download and install `geckodriver` according to the instructions for your system. You will also need to move the `geckodriver` executable into `/usr/local/bin/` (macOS/Linux) or `C:\Windows\System32\` (Windows).
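Before launching a browser, it can help to confirm that the driver really is on your `PATH`. This is a small sketch using only the standard library; `find_driver` is a hypothetical helper name, not part of selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH; move it into a PATH directory first")
else:
    print(f"geckodriver found at {path}")
```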

Now watch for something totally crazy.

In [1]:
!python selenium_example.py

Yup, that's right. It started an entire web browser (Firefox in this case). This is why selenium is the most powerful (and costly) solution to scraping. 

So now let's inspect this code:

In [None]:
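from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Start a Firefox session and load the page
driver = webdriver.Firefox()
driver.get("http://www.python.org")
# Selenium 4 removed find_element_by_name; use find_element(By.NAME, ...) instead
elem = driver.find_element(By.NAME, "q")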
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()


You start by creating a webdriver for your browser of choice (Firefox here).

Using `driver.get()`, you give it a URL to load.

Once there, you can give instructions to search for a specific element by its name. In this case `q` is the input field for searching the site.

As a pre-emptive move, the code clears the box and then sends the query `pycon`.

It then hits return and asserts that the text "No results found." does not appear in the page source (i.e. that the search actually returned results) before closing the browser.

Simple, right?

Now let's try to search for `Obama` on CNN.

In [None]:
#Exercise


Amazing! **But complicated**. We can also use the browser's forward and back buttons.

In [None]:
driver.back()

In [None]:
driver.forward()

And you could print the page source (and thus save it) or put it into Beautiful Soup.

In [None]:
driver.page_source

But this won't work magic: if it's not in the source in your browser, then it won't be in the source for selenium either.
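Saving the page source is just writing a string to disk. Here is a minimal sketch of that step; `save_page_source` and the output folder are hypothetical, and a literal HTML string stands in for `driver.page_source` so the cell runs without a browser.

```python
import tempfile
from pathlib import Path


def save_page_source(html, path):
    """Write an HTML string to `path`, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


# In the notebook this would be driver.page_source; a literal stands in here.
html = "<html><body><h1>Example</h1></body></html>"

# Hypothetical output location inside the system temp directory.
target = Path(tempfile.gettempdir()) / "scraped_pages" / "example.html"
saved = save_page_source(html, target)
print(saved, saved.stat().st_size, "bytes")
```

Keeping the raw HTML on disk means you can re-parse it later without launching the browser again.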

We can also find all elements that share the same class name.

In [None]:
from selenium.webdriver.common.by import By
headlines = driver.find_elements(By.CLASS_NAME, "cnn-search__result-headline")

In [None]:
headlines

In [None]:
for hl in headlines:
    print(hl.text)

# The value of accessing inaccessible content

This week we are examining Malmgren RD, Ottino JM, Amaral LAN. (2010). The role of mentorship on protégé performance. Nature 463, 622-626.

The article relied entirely on data from the [Mathematics Genealogy Project](https://www.genealogy.math.ndsu.nodak.edu) and [MathSciNet](https://mathscinet.ams.org/mathscinet/) to construct lineages of mentors and measures of individual productivity. This research is not possible without extracting and combining these two data sources. Put together, they unlock the possibility of examining an important and previously inaccessible question at scale.

Importantly, both of these websites lack the resources to provide an API to download the data.