# Using Scrapy to Scrape Specific Information

This notebook demonstrates how to use the Scrapy Python library to extract specific information from a website and organize it into a pandas DataFrame, displayed as a table.

The spider also follows each page's "next" link, so it collects the relevant data from every page of the site rather than only the first.

In [None]:
!pip install scrapy

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies this spider
    name = "quotes"
    # The crawl starts from this page
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "next" link, if there is one, and parse that page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
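
The page-turning step works because `response.follow` accepts a relative link and resolves it against the URL of the page being parsed. The standard-library sketch below illustrates that resolution on its own, without running a crawl. The helper `resolve_next` and the example URLs are hypothetical and only stand in for what the spider sees at runtime; this approximates the behavior rather than reproducing Scrapy's internals.

```python
from urllib.parse import urljoin


def resolve_next(current_url, next_href):
    """Return the absolute URL of the next page, or None on the last page.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    if next_href is None:
        # No "next" link was found, so there is nothing left to follow.
        return None
    # Resolve the relative href against the current page, as a browser would.
    return urljoin(current_url, next_href)


# Assumed example values: a page URL and the href a "next" link might carry.
print(resolve_next("http://quotes.toscrape.com/page/1/", "/page/2/"))
print(resolve_next("http://quotes.toscrape.com/page/10/", None))
```

When the selector finds no "next" link, `.get()` returns `None`, which is why the spider checks for `None` before following.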


In [None]:
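# Scrapy 2.1+ configures exports through FEEDS; FEED_FORMAT and FEED_URI are deprecated.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {
        # 'overwrite' (Scrapy 2.4+) replaces the file on each run instead of appending to it
        'quotes.json': {'format': 'json', 'overwrite': True},
    },
})

process.crawl(QuotesSpider)
# Blocks until the crawl finishes. The Twisted reactor cannot be restarted,
# so restart the kernel before running this cell again.
process.start()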


In [None]:
import pandas as pd

data = pd.read_json('quotes.json')
data.head()
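
Once the quotes are in a DataFrame, the `tags` column holds a list per row, which makes it awkward to count directly. The sketch below shows one way to summarize such data with `explode` and `groupby`. It runs on a few made-up rows that only mimic the shape of the spider's output (`text`, `author`, `tags`), so the values are placeholders, not scraped results.

```python
import pandas as pd

# Made-up sample rows with the same keys the spider yields.
sample = pd.DataFrame([
    {"text": "Quote A", "author": "Author One", "tags": ["life", "love"]},
    {"text": "Quote B", "author": "Author Two", "tags": ["life"]},
    {"text": "Quote C", "author": "Author One", "tags": []},
])

# explode() gives every tag its own row; an empty list becomes NaN,
# which value_counts() ignores by default.
tag_counts = sample.explode("tags")["tags"].value_counts()

# Number of quotes collected per author.
quotes_per_author = sample.groupby("author").size()

print(tag_counts)
print(quotes_per_author)
```

The same two lines should apply to the `data` DataFrame loaded from `quotes.json`, assuming the feed contains the three fields shown above.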
