### Scrapy
- Has its own API, making it easy to write and debug spiders
    - **Spider**: program that crawls/scrapes a website
- Can run from the command line or in Jupyter
- Lessons will use Jupyter; for reference: [API docs](https://doc.scrapy.org/en/latest/topics/api.html) & [examples](https://doc.scrapy.org/en/latest/intro/examples.html)
- Built on top of the Twisted reactor
    - Event driven; the reactor cannot be restarted within a session
    - Need to restart the kernel each time a Scrapy script is run in Jupyter (the command line is better)

### Basic scraper
- Everyday Sexism Project: collection of user-submitted stories of experiencing or observing sexism
    - Entries are stored/displayed as simple text entries
    - Categories are user-provided
    - No API or other method to access data aside from the webpage
- Scraper should:
    - Pull data from every page
    - Save entry text, category labels, date posted, name provided by poster
- This simple scraper will save the entire webpage as an html file
    - Provide one or more starting urls
    - Scraper sends request to the server for the information at that url
    - Server responds either with information or error code
    - Scrapy processes response in accordance with the instructions it's been given

In [3]:
# imports in each cell because of kernel restarts
import scrapy
from scrapy.crawler import CrawlerProcess

class ESSpider(scrapy.Spider):
    # naming spiders is important when running more than one spider of
    # the same class simultaneously
    name = 'ESS'
    
    # starting url(s)
    start_urls = ['http://www.everydaysexism.com',]
    
    # download all code and save to mainpage.html file
    def parse(self, response):
        with open('mainpage.html', 'wb') as f:
            f.write(response.body)

# instantiate crawler
process = CrawlerProcess()

# start crawler with spider
process.crawl(ESSpider)
process.start()

2018-08-16 17:07:47 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-16 17:07:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 26 2018, 08:42:37) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-08-16 17:07:47 [scrapy.crawler] INFO: Overridden settings: {}
2018-08-16 17:07:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-08-16 17:07:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.dow

### Parsing HTML
- Webpages are written in HTML, XML, and CSS
- Inspecting webpages:
    - The Elements tab of the browser's developer tools shows the code for the page
    - The matching element is highlighted on the webpage to show what a piece of code refers to
- Everyday Sexism Project:
    - Each submission is contained within an `<article>` element
    - Each article has nested `<header>`, `<section>`, and `<footer>` elements
    - Submissions are standardized in format, so one template can be used to build parsing instructions
- Getting information out of HTML code and into the parser
    - Regex: ignores context
        - HTML is context based: the content described by a tag depends on where that tag is positioned, opened, and closed
        - Combining HTML and regex results in fragile code that breaks easily
    - XPath: context-aware query language supported natively by Scrapy selectors (along with CSS selectors)
        - Treats nested code like file paths
        - Each level is referred to as a node
    - XPath expression to find the title of a submission on the ESP page: `//article/header/h2/a/@title`
        - `//article`: locate every article node in the page
        - `/header`: find the header node nested directly under the article node
        - `/h2`: find the h2 node nested directly under the header node
        - `/a`: find the a node nested directly under the h2 node
        - `/@title`: find the title attribute of the a node
    - [XPath tester](http://www.freeformatter.com/xpath-tester.html) for experimenting with XPath
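
The contrast between regex and path-based lookups can be sketched without Scrapy. The example below is only an illustration under stated assumptions: the `PAGE` markup is a made-up, simplified stand-in for the real site, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (it has no `/@title` step, so the attribute is read with `.get('title')`). Scrapy's own selectors accept the full expression shown above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the structure described above.
# Note that the second title link lists its attributes in a different order,
# and the second entry also contains an unrelated link inside its text.
PAGE = """
<html><body>
  <article>
    <header><h2><a href="/1" title="Penny">Penny</a></h2></header>
    <section class="entry-content"><p>First story.</p></section>
  </article>
  <article>
    <header><h2><a title="Michelle" href="/2">Michelle</a></h2></header>
    <section class="entry-content">
      <p>Second story. <a href="/x" title="unrelated">link</a></p>
    </section>
  </article>
</body></html>
""".strip()

# Path-based lookup: only <a> nodes nested at article/header/h2 are visited.
root = ET.fromstring(PAGE)
path_titles = [a.get("title") for a in root.findall(".//article/header/h2/a")]

# Regex lookup: matches raw text, so it depends on attribute order and
# cannot tell a title link from any other link on the page.
regex_titles = re.findall(r'<a href="[^"]*" title="([^"]*)"', PAGE)

print(path_titles)
print(regex_titles)
```

With this sample markup the path-based lookup returns both submission titles, while the regex misses the second title and picks up the unrelated link instead.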
    
### Storing output as JSON
- Scrapy can write scraped information to a file
- Raw HTML isn't very useful for analysis
- JSON stores data in a format similar to Python dictionaries
    - Useful for non-flat (semi-structured) data where rows can have different numbers of columns
    - JSON is easy to read back into Python
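
A minimal sketch of why JSON suits this kind of data, using only the standard library `json` module. The records are invented placeholders, not real submissions; they just mirror the field names the scraper is meant to save.

```python
import json

# Placeholder records: each entry can carry a different number of tags
# and paragraphs, which a flat table with fixed columns handles poorly.
entries = [
    {"name": "example_a", "tags": ["School"], "text": ["One paragraph."]},
    {
        "name": "example_b",
        "tags": ["Workplace", "Public space"],
        "text": ["First paragraph.", "Second paragraph."],
    },
]

# Serialize to a JSON string, then read it straight back into Python objects.
serialized = json.dumps(entries, indent=2)
restored = json.loads(serialized)

# The nested lists survive the round trip with their differing lengths.
tag_counts = [len(entry["tags"]) for entry in restored]
print(tag_counts)
```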


In [1]:
# imports in each cell because of kernel restarts
import scrapy
from scrapy.crawler import CrawlerProcess

class ESSpider(scrapy.Spider):
    # naming spiders is important when running more than one spider of
    # the same class simultaneously
    name = 'ESS'
    
    # starting url(s)
    start_urls = ['http://www.everydaysexism.com',]
    
    # xpath to parse response
    def parse(self, response):
        
        # iterate over every <article> element on page
        for article in response.xpath('//article'):
            
            # yield a dictionary with desired values
            yield {
                'name': article.xpath('header/h2/a/@title').extract_first(),
                'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
                'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }

# pass arguments telling script how to run the crawler
# (Scrapy 2.1+ deprecates FEED_FORMAT/FEED_URI in favor of the FEEDS setting)
process = CrawlerProcess({
    'FEED_FORMAT': 'json', # store in json format
    'FEED_URI': 'firstpage.json', # name storage file
    'LOG_ENABLED': False # turn off logging
})

# start crawler with spider
process.crawl(ESSpider)
process.start()
print('success')

success


### Checking the output
- Should have a JSON file, 'firstpage.json'
- **Note**:
    - Scrapy will append new output to the end of the old JSON file if the code is run more than once in a row
    - The file then has the closing `]` of the first run in the middle, so it is no longer valid JSON
    - Delete the JSON file before re-running, or pandas will raise a 'Trailing data' error when reading it
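
The append problem in the note above can be reproduced without running a crawl. In this sketch, `fake_run` is a hypothetical stand-in for one Scrapy run that appends a complete JSON array to the feed file, and the file lives in a temporary directory rather than the notebook folder. The standard library `json` module reports the damage as "Extra data", where pandas reports trailing data.

```python
import json
import os
import tempfile

# Work in a throwaway directory so no real output file is touched.
workdir = tempfile.mkdtemp()
feed_path = os.path.join(workdir, "firstpage.json")


def fake_run(path):
    """Hypothetical stand-in for one crawl: append a full JSON array."""
    with open(path, "a", encoding="utf-8") as f:
        json.dump([{"name": "placeholder"}], f)
        f.write("\n")


# Two runs in a row leave two arrays back to back in one file.
fake_run(feed_path)
fake_run(feed_path)

try:
    with open(feed_path, encoding="utf-8") as f:
        json.load(f)
    second_run_parses = True
except json.JSONDecodeError as err:
    second_run_parses = False
    print("Could not parse feed:", err.msg)

# Deleting the old file before re-running keeps the output valid.
if os.path.exists(feed_path):
    os.remove(feed_path)
fake_run(feed_path)

with open(feed_path, encoding="utf-8") as f:
    clean = json.load(f)
print(clean)
```

Placing the `os.path.exists` / `os.remove` check at the top of the scraping cell is one way to make re-runs safe.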

In [2]:
import pandas as pd

firstpage = pd.read_json('firstpage.json', orient='records')
print(firstpage.shape)
print(firstpage.head())

(10, 4)
        date            name  \
0 2018-08-13           Penny   
1 2018-08-13           Nope.   
2 2018-08-13        Michelle   
3 2018-08-13  UK IT Employee   
4 2018-08-13              me   

                                                tags  \
0                                     [Public space]   
1                                        [Workplace]   
2                                           [School]   
3  [UK Equality; UK Gender Pay Gap; Conscious and...   
4                                           [School]   

                                                text  
0  [I was walking along the road the other day at...  
1  [A technician position opened up at my job. Ou...  
2  [I have had numerous displeasant encounters wi...  
3  [I work for a large multi-national IT organisa...  
4  [Once in high school, a boy pinned me to the w...  


### Recursion