# Introduction to Scrapy

This notebook provides an introduction to Scrapy, a powerful Python framework for scraping and crawling through webpages. In general, there are two approaches to scraping a page:


*   **using CSS selectors** - CSS selectors are a family of patterns used to pick the elements of a webpage that a style applies to. Because styling targets specific elements, CSS selectors are useful for extracting information from a particular component of a page using its "personal characteristics" (tag name, class, id and so on).
*   **using XPATH** - XPATH is a family of expressions used to navigate XML documents. XML documents are very similar to HTML documents, the main difference being that XML tags are defined by the user rather than built in. As a result, XPATH is also useful for navigating HTML documents.

The Python web scraping ecosystem is full of packages that can handle either the CSS selector or the XPATH based approach. We use Scrapy, as it provides a simple interface to both. Below is a short comparison of the main tools for scraping webpages with Python.


*   [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - probably the most common library used for scraping webpages with Python. It relies on an underlying parser and extracts content mainly through `find()` style methods. As a result, BeautifulSoup is easy and user friendly; it offers CSS selectors through its `select()` method but has no support for XPATH expressions.
*   [LXML](https://lxml.de/tutorial.html) - a powerful and fast library mainly used for scraping a page using XPATH expressions. It also developed a CSS selector based approach (user-provided CSS selectors are translated by lxml to XPATH expressions), which was later spun off into a standalone library (known as [cssselect](https://cssselect.readthedocs.io/en/latest/)).
*   [Scrapy](https://scrapy.org/) - a powerful, fast and user-friendly framework supporting both CSS selectors and XPATH expressions. Scrapy also extends traditional CSS selectors with additional features for easy extraction of web content. As a framework, it provides rich functionality for supervising the crawler: setting scraping delays, obeying the robots.txt file, exporting data into different formats etc.

In this notebook, Scrapy will be used to extract information from the website http://quotes.toscrape.com/. The same content will be scraped using both the CSS selector and the XPATH approaches. The result of the scraping process will be independent lists of all author names, quotes and tags on the whole website. The overall stepwise logic applied to scraping will be as follows:


1.   **Checking whether scraping is allowed** - usually, websites provide a `robots.txt` file hosted in the root folder of their domain (in this case: http://quotes.toscrape.com/robots.txt), which lists the pages that are disallowed for scraping. If the file does not exist, or it shows that for `User-agent: *` the required pages are not disallowed, then move on to the next steps.
2.   **Getting the HTML content** - the scraping task in general is done locally. The process assumes sending a request to the webpage to get its HTML content, which can later be used for scraping. The third-party `requests` library will be used to `get()` the HTML content of the webpage.
3.   **Converting textual content to Scrapy object** - to make Scrapy functionalities available, the textual content received at point 2. will be converted to a Scrapy object using the `TextResponse()` class.
4.   **Scraping a single page** - to scrape the necessary content one needs to inspect the HTML code and learn the necessary CSS selectors or XPATH expressions. In this notebook, we will first scrape a single page to see whether our selectors or expressions are correct, and then move on. The `css()` and `xpath()` methods will be used to find content on the page matching the input selector or expression, and the `extract()` method will be used to extract the content of the matching objects.
5.   **Developing a function to scrape a page** - based on the code at point 4. we will develop a function to scrape a single page, to avoid copy-pasting the same code.
6.   **Crawling and scraping the website** - the function above will be used inside a for or while loop to collect data from all of the pages of the website.
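
The check in step 1 can also be done programmatically. The sketch below uses the standard-library `urllib.robotparser` module; the robots.txt content and the helper name `is_allowed` are made up for illustration so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_page = is_allowed(EXAMPLE_ROBOTS_TXT, "http://example.com/page/1/")
blocked_page = is_allowed(EXAMPLE_ROBOTS_TXT, "http://example.com/private/data")
print(allowed_page, blocked_page)
```

For a live website, one would instead call `parser.set_url()` with the address of the site's robots.txt, followed by `parser.read()`, before asking `can_fetch()`.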





## Overview of the functionality

The code below provides an overall comparison of the main CSS selectors and XPATH expressions used in this notebook (more on this topic can be found on https://devhints.io/xpath). Whether the `css()` or the `xpath()` method is used, the extraction of content in Scrapy is the same:

- `extract()` - extracts all the matching content as a list
- `extract_first()` - extracts only the first match
- `get()` - same as `extract_first()`
- `getall()` - same as `extract()`

In all the code below, we will be using `extract()` to extract all the matching content.

```python
#extracting all the divisions on the page
response.css('div').extract()
response.xpath('//div').extract()

#extracting all the divisions on the page with a particular class (example: my_div)
response.css('div[class="my_div"]').extract()
response.xpath('//div[@class="my_div"]').extract()

#CSS shortcut for the same task
#XPATH has no class shortcut, so the attribute test is used again
#note: .my_div also matches class="my_div other", while @class="my_div" needs an exact match
response.css('div.my_div').extract()
response.xpath('//div[@class="my_div"]').extract()

#extracting all the divisions on the page with a particular id (example: my_div)
response.css('div[id="my_div"]').extract()
response.xpath('//div[@id="my_div"]').extract()

#CSS shortcut for the same task (XPATH again needs the attribute test)
response.css('div#my_div').extract()
response.xpath('//div[@id="my_div"]').extract()

#extracting a paragraph which is the direct child of a division
response.css('div.my_div > p.my_p').extract()
response.xpath('//div[@class="my_div"]/p[@class="my_p"]').extract()

#extracting a paragraph which is an indirect child (descendant at any depth) of a division
response.css('div.my_div p.my_p').extract()
response.xpath('//div[@class="my_div"]//p[@class="my_p"]').extract()

#all the code above extracts matching HTML
#code below shows how to extract only text from matching HTML
response.css('div.my_div p.my_p::text').extract()
response.xpath('//div[@class="my_div"]//p[@class="my_p"]/text()').extract()

#to extract not the HTML content or the text, but an attribute (say href), use the following code
response.css('div.my_div p.my_p::attr(href)').extract()
response.xpath('//div[@class="my_div"]//p[@class="my_p"]/@href').extract()

#if the exact class name is not known, but you know that it includes "my" inside, use this code
response.css('div[class*=my]::text').extract()
response.xpath('//div[contains(@class, "my")]/text()').extract()
```

## Part 1 - steps 1 to 4

In [0]:
#required libs
import time #to make scraper sleep between requests
import requests #to send a request and get the page
from scrapy.http import TextResponse # to convert textual HTML content to Scrapy object


In [0]:
#Step 2
url = "http://quotes.toscrape.com/"
page = requests.get(url)

#page.status_code
#will show whether the page was successfully received (200)
#or not (basically anything else)

#Step 3
response = TextResponse(url=page.url,body=page.text,encoding='utf-8')
#TextResponse() wraps the HTML text in an object that provides the css() and xpath() methods

In [3]:
#code to extract all authors on a single page
authors = response.css("small::text").extract()
#for xpath: response.xpath("//small/text()").extract()
print(authors)

['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']


In [4]:
#code to extract all quotes on a single page
quotes = response.css("span[class='text']::text").extract()
#for xpath: response.xpath("//span[@class='text']/text()").extract()
print(quotes)

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']


In [5]:
#code to extract all tags on a single page
tags = response.css("a[class='tag']::text").extract()
#for xpath: response.xpath("//a[@class='tag']/text()").extract()
print(tags)

['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle', 'miracles', 'aliteracy', 'books', 'classic', 'humor', 'be-yourself', 'inspirational', 'adulthood', 'success', 'value', 'life', 'love', 'edison', 'failure', 'inspirational', 'paraphrased', 'misattributed-eleanor-roosevelt', 'humor', 'obvious', 'simile', 'love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']


In [6]:
#in case you want to scrape the hyperlinks behind the tags:
tag_urls = response.css("a[class='tag']::attr(href)").extract()
#for xpath: response.xpath("//a[@class='tag']/@href").extract()
print(tag_urls)


['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/tag/life/page/1/', '/tag/love/page/1/', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/tag/misattributed-eleanor-roosevelt/page/1/', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/']
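
The extracted hyperlinks are relative to the website root. A minimal sketch for turning them into absolute URLs with the standard-library `urljoin` (only two of the links above are used to keep the example short):

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"
relative_urls = ["/tag/change/page/1/", "/tag/love/"]

# Resolve each relative link against the address of the page it was found on.
absolute_urls = [urljoin(base_url, href) for href in relative_urls]
print(absolute_urls)
```

Scrapy responses also offer `response.urljoin(href)`, which resolves a link against the response's own URL.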


## Part 2 - step 5

This section will use only CSS selectors for simplicity.
One can easily develop functions with XPATH expressions as well, just by changing the `response.css(...)` line to the corresponding `response.xpath(...)` call.

In [0]:
#function for scraping author names from a single page, based on the code above
def author_scraper(url):
    page=requests.get(url)
    response=TextResponse(url=page.url, body=page.text,encoding="utf-8" )
    authors=response.css("small::text").extract()
    return authors
  
#function for scraping quotes from a single page, based on the code above
def quote_scraper(url):
    page=requests.get(url)
    response=TextResponse(url=page.url, body=page.text,encoding="utf-8" )
    quotes=response.css("span[class='text']::text").extract()
    return quotes
  
  
#function for scraping tags from a single page, based on the code above
def tag_scraper(url):
    page=requests.get(url)
    response=TextResponse(url=page.url, body=page.text,encoding="utf-8" )
    tags=response.css("a[class='tag']::text").extract()
    return tags

## Part 3 - step 6

This step will be implemented only for authors for simplicity. One can do the same for the other components just by replacing the `author_scraper()` function with `quote_scraper()` or `tag_scraper()` inside the loop.

To scrape all the pages in a loop, we will construct the URLs inside the loop as well. For that, we need to know the number of pages on the website, which is 10. Alternatively, we can use a while loop that keeps scraping until the list of authors returned for a page is empty (no more authors to scrape!).

In [8]:
all_authors=[] #empty list to be populated with all authors afterwards
for i in range(1,11): # as we have 10 pages
    url = "http://quotes.toscrape.com/page/{}/".format(i) #construct the current page's URL
    current_page_authors = author_scraper(url) #scrape the current page and save the result in a list
    all_authors.extend(current_page_authors) #add the authors from this page to the total list
    time.sleep(2) #wait a bit before scraping the next page
print(all_authors)

['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin', 'Marilyn Monroe', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'Douglas Adams', 'Elie Wiesel', 'Friedrich Nietzsche', 'Mark Twain', 'Allen Saunders', 'Pablo Neruda', 'Ralph Waldo Emerson', 'Mother Teresa', 'Garrison Keillor', 'Jim Henson', 'Dr. Seuss', 'Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'J.K. Rowling', 'Bob Marley', 'Mother Teresa', 'J.K. Rowling', 'Charles M. Schulz', 'William Nicholson', 'Albert Einstein', 'Jorge Luis Borges', 'George Eliot', 'George R.R. Martin', 'C.S. Lewis', 'Marilyn Monroe', 'Marilyn Monroe', 'Albert Einstein', 'Marilyn Monroe', 'Marilyn Monroe', 'Martin Luther King Jr.', 'J.K. Rowling', 'James Baldwin', 'Jane Austen', 'Eleanor Roosevelt', 'Marilyn Monroe', 'Albert Einstein', 'Haruki Murakami', 'Alexandre Dumas fils', 'Stephen

In [9]:
#while loop with predefined number of iterations
all_authors=[]
i=1
while i<11:
    url = "http://quotes.toscrape.com/page/{}/".format(i)
    current_page_authors = author_scraper(url)
    all_authors.extend(current_page_authors)
    time.sleep(2) #wait a bit before scraping the next page
    i=i+1
print(all_authors)

['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin', 'Marilyn Monroe', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'Douglas Adams', 'Elie Wiesel', 'Friedrich Nietzsche', 'Mark Twain', 'Allen Saunders', 'Pablo Neruda', 'Ralph Waldo Emerson', 'Mother Teresa', 'Garrison Keillor', 'Jim Henson', 'Dr. Seuss', 'Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'J.K. Rowling', 'Bob Marley', 'Mother Teresa', 'J.K. Rowling', 'Charles M. Schulz', 'William Nicholson', 'Albert Einstein', 'Jorge Luis Borges', 'George Eliot', 'George R.R. Martin', 'C.S. Lewis', 'Marilyn Monroe', 'Marilyn Monroe', 'Albert Einstein', 'Marilyn Monroe', 'Marilyn Monroe', 'Martin Luther King Jr.', 'J.K. Rowling', 'James Baldwin', 'Jane Austen', 'Eleanor Roosevelt', 'Marilyn Monroe', 'Albert Einstein', 'Haruki Murakami', 'Alexandre Dumas fils', 'Stephen

In [10]:
#while loop without predefined number of iterations
#thus, we will use stopping condition
all_authors=[]
i=1
while True: #run until stopped by break
    url = "http://quotes.toscrape.com/page/{}/".format(i)
    current_page_authors = author_scraper(url)
    if not current_page_authors: #an empty list means there is nothing left to scrape
        break
    all_authors.extend(current_page_authors)
    time.sleep(2)
    i=i+1
print(all_authors)

['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin', 'Marilyn Monroe', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'Douglas Adams', 'Elie Wiesel', 'Friedrich Nietzsche', 'Mark Twain', 'Allen Saunders', 'Pablo Neruda', 'Ralph Waldo Emerson', 'Mother Teresa', 'Garrison Keillor', 'Jim Henson', 'Dr. Seuss', 'Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Bob Marley', 'Dr. Seuss', 'J.K. Rowling', 'Bob Marley', 'Mother Teresa', 'J.K. Rowling', 'Charles M. Schulz', 'William Nicholson', 'Albert Einstein', 'Jorge Luis Borges', 'George Eliot', 'George R.R. Martin', 'C.S. Lewis', 'Marilyn Monroe', 'Marilyn Monroe', 'Albert Einstein', 'Marilyn Monroe', 'Marilyn Monroe', 'Martin Luther King Jr.', 'J.K. Rowling', 'James Baldwin', 'Jane Austen', 'Eleanor Roosevelt', 'Marilyn Monroe', 'Albert Einstein', 'Haruki Murakami', 'Alexandre Dumas fils', 'Stephen