In [2]:
from IPython.core.display import HTML

def css_styling():
    styles = open("../data/www/styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Web scraping and crawling

Now we're moving forward in terms of difficulty - writing code to traverse and capture data from the web.

You largely already have the skills necessary to do this, the major skill is being able to parse the structure and text of a HTML document. Now we are simply going to put together the mental map of how to instruct a program to walk.

# Orders of complexity

There is an increasing level of difficulty in how one scrapes web pages and the intransigence of your target should be the determining factor in which approach you implement (i.e. don't buy a bazooka to go to a knife fight).

* Exploiting regularly structured urls (`requests`)
* Crawling a site with typically static content (`scrapy`)
* Crawling a site with dynamic content and human restrictions (`selenium`)

## So let's continue - regularly structured urls

To illustrate this approach, I want to use company financial filings since they contain a wealth of information. For any publicly traded company, you can access all of their filings through the [SEC Edgar website](https://www.sec.gov/edgar/searchedgar/companysearch.html).

However, to access the filings you will need to have a company's CIK number (this is used to disambiguate companies). Fortunately, the SEC provides that search function for you.

<img src='../images/edgar_search.png'>

Now, the trick here is that once you press the search button and get the results you should check the url bar.

<img src='../images/edgar_url.png'>

Notice anything....pertinent? Repeatable?

The trick is that you make sure that the url has your search query (`Google` in our case) in plain text - then modify the search term in place and try the new url. Does it work? If it does...you can 'scrape' any site easily.

## Exercise

I want you to scrape all the CIKs for the following list of companies.

In [4]:
#Exercise

companies = ['Google', 'Zebra', 'Cisco', 'Oracle', 'Amazon']

In [14]:
#Answer
import requests
import bs4
import re

ciks = []

companies = ['Google', 'Zebra', 'Cisco', 'Oracle', 'Amazon']

sec_url = 'https://www.sec.gov/cgi-bin/browse-edgar?company={0}&owner=exclude&action=getcompany'

cikre = re.compile('CIK=[0-9]{10}')

for company in companies[0]:
    response = requests.get(sec_url.format(company))
    ciks += cikre.findall(response.text)

And now with these CIKs I want you to pull all filing descriptions. Keep them associated with the CIK and save them to a file in a folder you create in `classdata`.

In [18]:
#Exercise


In [40]:
#Answer
import os
try:
    os.mkdir('../classdata/sec_descriptions')
except:
    pass

cik_url = 'https://www.sec.gov/cgi-bin/browse-edgar?{0}&owner=exclude&action=getcompany&Find=Search'

for cik in ciks:
    response = requests.get(cik_url.format(cik))
    #Start the soup
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    tds = soup.findAll(attrs={'class':'small'})
    #Now write it out
    with open('../classdata/sec_descriptions/{0}.txt'.format(cik.split('=')[-1]), 'w') as wfile:
        for td in tds:
            print(td.text.encode('utf-8'), file=wfile)

Pretty good! But one issue with our lazy scraping - what about pages that have more than 40 descriptions?

In [42]:
#Exercise


In [65]:
#Answer
cik_url = 'https://www.sec.gov/cgi-bin/browse-edgar?{0}&owner=exclude&action=getcompany&Find=Search'

def tdscrape(url, tds=[]):
    print(url)
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    tds += soup.findAll(attrs={'class':'small'})
    #Round and roud we go if we find a next button
    next_button = soup.findAll('input', attrs={'value':"Next 40"})
    if next_button != []:
        next_button_url_part = next_button[0].attrs['onclick'].split("location='")[-1].strip("'")
        tdscrape('https://www.sec.gov/' + next_button_url_part, tds = tds)
    return tds

for cik in ciks:
    #Start the soup
    tds = tdscrape(cik_url.format(cik))
    #Now write it out
    with open('../classdata/sec_descriptions/{0}.txt'.format(cik.split('=')[-1]), 'w') as wfile:
        for td in tds:
            print(td.text.encode('utf-8'), file=wfile)
            
    

https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001582104&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001088811&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001107288&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001144026&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001158094&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001281881&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0000916845&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0000893810&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0000835446&owner=exclude&action=getcompany&Find=Search
https://www.sec.gov/cgi-bin/browse-edgar?CIK=0001590206&owner=exclude&action=getcompany&Find=Search


And you could just as easily change this to follow the links and download the original documents that were filed

# Crawling static content 

You could just as easily continue using requests and this type of logic to crawl an entire web page (find all `<a>` tags, follow them, track which urls have already been travelled to, etc.)

Here we will work with a library called Scrapy. One of the benefits of Scrapy is that they have a cloud that you can deploy your scraper to 

In [67]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
[K    100% |################################| 256kB 3.3MB/s 
[?25hCollecting service-identity (from scrapy)
  Downloading service_identity-17.0.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from scrapy)
  Downloading cssselect-1.0.3-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Downloading parsel-1.2.0-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-1.18.0-py2.py3-none-any.whl
Collecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.5.tar.gz
Collecting queuelib (from scrapy)
  Downloading queuelib-1.4.2-py2.py3-none-any.whl
Collecting Twisted>=13.1.0 (from scrapy)
  Downloading Twisted-17.9.0.tar.bz2 (3.0MB)
[K    100% |################################| 3.0MB 305kB/s 
[?25hCollecting attrs (from service-identity->scrapy)
  Downloading attrs-17.3.0-py2.py3-none-any.whl
Collecting zope.interface>=4.0.2 (from Twisted>=13.1.0->scrapy)
  Download

The downside of Scrapy is that it requires quite a bit of boilerplate to get going. A spider has to be initialized as a class that is an instance of `scrapy.Spider`. 

However, after that it is pretty simple. It will have one function, `parse` and emits them.

In [72]:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        #Yields the title and url of a story
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        #Yields a response follow object with the next page data
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

Now for the other pain - we actually can't run Scrapy code in the Jupyter notebook easily. You'll actually need to write it as a script (which I have alreaday done for you in this file [here](scrapy_example.py)

However, we can run the bash command to execute this file from the Jupyter notebook. We just need to put the `!` in front of it so the notebook shell knows that we are executing a bash command.

We can also even store the output as a python variable to then interact with it in the notebook!

In [73]:
blog_urls = !scrapy runspider scrapy_example.py

In [75]:
blog_urls

['2017-12-27 14:29:06 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)',
 "2017-12-27 14:29:06 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}",
 '2017-12-27 14:29:06 [scrapy.middleware] INFO: Enabled extensions:',
 "['scrapy.extensions.telnet.TelnetConsole',",
 " 'scrapy.extensions.corestats.CoreStats',",
 " 'scrapy.extensions.logstats.LogStats',",
 " 'scrapy.extensions.memusage.MemoryUsage']",
 '2017-12-27 14:29:06 [scrapy.middleware] INFO: Enabled downloader middlewares:',
 "['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',",
 " 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',",
 " 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',",
 " 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',",
 " 'scrapy.downloadermiddlewares.retry.RetryMiddleware',",
 " 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',",
 " 'scrapy.downloadermiddlewares.httpcompression.HttpCompr

Alternatively, you could save the urls to a file directly from the scraping code if you did not yield the titles and instead wrote it to a file. 

Writing a real spider is a bit more complicated and will require usage of python scripts and bash commands. I will attempt to 