# Before we start

- Web scraping can lead to messy data.
- Read websites' terms and conditions.
- Respect websites' robots.txt.
- Be nice!
- The Firefox inspector is your friend!
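
As a rough illustration of respecting robots.txt, the sketch below checks whether a URL may be fetched using the standard library's robots.txt parser. The rules, the `my-bot` user agent and the `example.com` URLs are made-up placeholders; a real crawler would fetch the site's actual robots.txt instead of parsing an inline string.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2

# Hypothetical robots.txt content, used here so that no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "my-bot" is a placeholder user agent name.
allowed_public = parser.can_fetch("my-bot", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-bot", "http://example.com/private/data.html")

print(allowed_public)   # True: no rule covers /index.html
print(allowed_private)  # False: /private/ is disallowed for all agents
```

Scrapy can also enforce this for a whole crawl through its `ROBOTSTXT_OBEY` setting.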

# HTML 101

Web pages are written in HTML. You can read more in the [W3Schools HTML tutorial](http://www.w3schools.com/html/).

We will work with a simple [example page](example.html).

In [55]:
html = open("example.html").read()

In [56]:
print(html)

<html>
    <head>
        <title>Test page</title>
    </head>
    <body>
        <h1 id="title">This is an example page</h1>
        <h2 id="list-header">This is a list</h2>
        <ul>
            <li>This is a list item!</li>
            <li>Another item!</li>
        </ul>
        <h2>Links</h2>
        This is a <a href="example2.html">link</a> to a second page.
        <h2>Image</h2>
        This is a histogram (whose axes should be labeled!).
        <img src="histogram.png" width="50%" alt="Histogram of something.">
    </body>
</html>


In [1]:
from bs4 import BeautifulSoup

In [58]:
soup = BeautifulSoup(html, "lxml")

In [36]:
soup.body

<body>\n<h1 id="title">This is an example page</h1>\n<h2 id="list-header">This is a list</h2>\n<ul>\n<li>This is a list item!</li>\n<li>Another item!</li>\n</ul>\n<h2>Links</h2>\n        This is a <a href="example2.html">link</a> to a second page.\n        <h2>Image</h2>\n        This is a histogram (whose axes should be labeled!).\n        <img alt="Histogram of something." src="histogram.png" width="50%"/>\n</body>

In [37]:
soup.body.ul("li")

[<li>This is a list item!</li>, <li>Another item!</li>]

In [38]:
soup("li")

[<li>This is a list item!</li>, <li>Another item!</li>]
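
Calling a soup or tag object, as in the two cells above, is Beautiful Soup's shorthand for `find_all()`. The sketch below spells that out on a small inline stand-in for the example list. It uses the `html.parser` backend bundled with Python instead of `lxml` so it has one dependency less, and the name `demo_soup` is only there to avoid clobbering the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# Small inline stand-in for the list in example.html.
snippet = "<ul><li>This is a list item!</li><li>Another item!</li></ul>"
demo_soup = BeautifulSoup(snippet, "html.parser")

# Calling the object is shorthand for find_all().
by_call = demo_soup("li")
by_find_all = demo_soup.find_all("li")

# find() returns only the first match, just like attribute access (demo_soup.li).
first = demo_soup.find("li")

item_texts = [li.get_text() for li in by_find_all]
print(item_texts)
```

The explicit `find_all()` form is a little longer but tends to be easier to read in scripts that other people have to maintain.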

In [39]:
soup('img')

[<img alt="Histogram of something." src="histogram.png" width="50%"/>]

In [40]:
soup.img

<img alt="Histogram of something." src="histogram.png" width="50%"/>

In [41]:
for img in soup("img"):
    print(img.attrs)

{'alt': 'Histogram of something.', 'src': 'histogram.png', 'width': '50%'}


In [42]:
soup.img['src']

'histogram.png'

In [43]:
soup.li.parent

<ul>\n<li>This is a list item!</li>\n<li>Another item!</li>\n</ul>

In [44]:
soup.li.parent.name

'ul'

In [45]:
soup.li.parent.text

u'\nThis is a list item!\nAnother item!\n'

In [51]:
soup.ul.contents

[u'\n', <li>This is a list item!</li>, u'\n', <li>Another item!</li>, u'\n']

In [52]:
soup.ul.children

<listiterator at 0x10d93bfd0>

In [53]:
list(soup.ul.children)

[u'\n', <li>This is a list item!</li>, u'\n', <li>Another item!</li>, u'\n']
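
The whitespace strings between the `<li>` tags are children too, which is easy to trip over when indexing into `.contents`. A minimal sketch of two ways to keep only the tag children, again on an inline stand-in parsed with `html.parser` (the name `demo_soup` is illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<ul>\n<li>This is a list item!</li>\n<li>Another item!</li>\n</ul>"
demo_soup = BeautifulSoup(snippet, "html.parser")

# All direct children, including the newline strings between the tags.
all_children = list(demo_soup.ul.children)

# Keep only real tags by checking the node type.
tag_children = [child for child in demo_soup.ul.children if isinstance(child, Tag)]

# recursive=False restricts find_all() to direct children, which never
# includes plain strings.
direct_items = demo_soup.ul.find_all("li", recursive=False)

print(len(all_children))
print([child.name for child in tag_children])
```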

In [59]:
soup(lambda tag: tag.has_attr('alt'))

[<img alt="Histogram of something." src="histogram.png" width="50%"/>]
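
A function filter is the most general option, but for a simple "has this attribute" test Beautiful Soup also accepts keyword filters. The sketch below compares the forms on an inline snippet; the second image, `plain.png`, is a made-up element added only so that the filter has something to exclude.

```python
from bs4 import BeautifulSoup

# Two images: only the first one carries an alt attribute.
snippet = (
    '<img src="histogram.png" width="50%" alt="Histogram of something.">'
    '<img src="plain.png">'
)
demo_soup = BeautifulSoup(snippet, "html.parser")

# Function filter, as in the cell above.
with_lambda = demo_soup(lambda tag: tag.has_attr("alt"))

# Keyword filter: True matches any tag that has the attribute at all.
with_keyword = demo_soup.find_all(alt=True)

# Same idea, restricted to <img> tags and using the attrs dictionary.
images_with_alt = demo_soup.find_all("img", attrs={"alt": True})

sources = [img["src"] for img in with_keyword]
print(sources)
```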

# Scrapy

In [2]:
import scrapy

In [3]:
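from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://localhost:8888/files/example.html']
    # Feed export settings; these can also be given on the command line.
    custom_settings = {'FEED_FORMAT': 'json',
                       'FEED_URI': 'result.json'}

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        # Emit one item per list entry, tagged with the page it came from.
        for list_item in soup('li'):
            yield {'list_item': list_item.text,
                   'url': response.url}
        # Follow every link that has an href, resolving relative targets
        # against the URL of the current page.
        for link in soup('a', href=True):
            target = urljoin(response.url, link['href'])
            yield scrapy.Request(target, callback=self.parse)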
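
The spider relies on `urljoin` to turn the relative `href` values found in the page into absolute URLs that Scrapy can request. A small standalone sketch of how that resolution behaves; the `/other/page.html` and `example.com` targets are invented for illustration:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

page_url = "http://localhost:8888/files/example.html"

# A bare file name replaces the last path segment of the page URL.
relative = urljoin(page_url, "example2.html")
print(relative)  # http://localhost:8888/files/example2.html

# A leading slash resolves against the server root.
root_relative = urljoin(page_url, "/other/page.html")
print(root_relative)  # http://localhost:8888/other/page.html

# An absolute URL is returned unchanged.
absolute = urljoin(page_url, "http://example.com/")
print(absolute)  # http://example.com/
```

Scrapy responses also offer `response.urljoin(href)`, which performs the same resolution against the response's own URL.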

# Running

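If the spider class is saved in a file, for example `test_spider.py`, it can be run with

    scrapy runspider test_spider.py

or, to write the scraped items to a CSV file,

    scrapy runspider --output=results.csv test_spider.py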

To experiment with a page interactively, also try

    scrapy shell

In [4]:
from scrapy.crawler import CrawlerProcess

In [5]:
process = CrawlerProcess()

2017-01-20 14:24:01 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
2017-01-20 14:24:01 [scrapy] INFO: Overridden settings: {}


In [6]:
process.crawl(TestSpider)

2017-01-20 14:24:02 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-01-20 14:24:02 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']

<Deferred at 0x10524aa28>

In [7]:
process.start()

2017-01-20 14:24:03 [scrapy] DEBUG: Crawled (200) <GET http://localhost:8888/files/example.html> (referer: None)
2017-01-20 14:24:03 [scrapy] DEBUG: Scraped from <200 http://localhost:8888/files/example.html>
{'url': 'http://localhost:8888/files/example.html', 'list_item': u'This is a list item!'}
2017-01-20 14:24:03 [scrapy] DEBUG: Scraped from <200 http://localhost:8888/files/example.html>
{'url': 'http://localhost:8888/files/example.html', 'list_item': u'Another item!'}
2017-01-20 14:24:03 [scrapy] DEBUG: Crawled (200) <GET http://localhost:8888/files/example2.html> (referer: http://localhost:8888/files/example.html)
2017-01-20 14:24:03 [scrapy] DEBUG: Scraped from <200 http://localhost:8888/files/example2.html>
{'url': 'http://localhost:8888/files/example2.html', 'list_item': u'This is a list item from page 2!'}
2017-01-20 14:24:03 [scrapy] DEBUG: Scraped from <200 http://localhost:8888/files/example2.html>
{'url': 'http://localhost:8888/files/example2.html', 'list_item': u'Another