## What is Scrapy?
Scrapy is a fast, open-source web crawling framework written in Python. It is used to extract the data you need from websites in a fast, simple, yet extensible way.

### Why use Scrapy?
It makes large crawling projects easier to build and scale.<br>
It has a built-in mechanism called Selectors for extracting data from web pages.<br>
It handles requests asynchronously, so it does not wait for one request to finish before sending the next.<br>
It generates feed exports in formats such as JSON, CSV, and XML.<br>
Selectors can pick out data using either XPath or CSS expressions.<br>

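### Creating a project

Run `scrapy startproject first_scrapy` from the directory where you want to store your code.

This will create a first_scrapy directory containing the project configuration file (scrapy.cfg) and a Python package with a spiders directory, where your spiders will live.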

## First Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # identifies the spider; used when running it

    def start_requests(self):
        # Yield the initial requests the spider will start crawling from.
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # callback names the method that handles each downloaded response
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Called with the response downloaded for each request.
        page = response.url.split("/")[-2]  # page number taken from the URL
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

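### How to run our spider

Go to the project's top-level directory and run `scrapy crawl quotes`, where quotes is the name we gave the spider.

You should notice that two new files have been created: quotes-1.html and quotes-2.html.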
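The file names come from the second-to-last segment of each URL. The sketch below repeats that string handling outside Scrapy; `page_filename` is a hypothetical helper name used only for illustration. It also shows that the result depends on the trailing slash in the URL.

```python
def page_filename(url):
    """Build the output file name the same way the spider's parse() does."""
    # Splitting on "/" leaves an empty string after a trailing slash,
    # so the page number is the second-to-last element.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


with_slash = page_filename("http://quotes.toscrape.com/page/1/")
without_slash = page_filename("http://quotes.toscrape.com/page/1")

print(with_slash)      # quotes-1.html
print(without_slash)   # quotes-page.html (no trailing slash shifts the index)
```

If the URLs you crawl do not always end in a slash, stripping it first with `url.rstrip("/")` and taking the last element would be a more predictable choice.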

In [None]:
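import scrapy


class QuotesSpider(scrapy.Spider):
    # Shortcut: list the URLs in start_urls instead of writing start_requests().
    # Scrapy requests each URL and calls parse() by default, so the parse()
    # method from the previous cell is still needed.
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]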

### Extracting data

The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. The Scrapy shell is an interactive shell where you can try out and debug your scraping code quickly, without having to run the spider.


In [None]:
scrapy shell "http://quotes.toscrape.com/page/1/"

When you run the command above, the shell starts and lists the Scrapy objects available to work with, including response.

### Using the shell
Using the shell, you can try selecting elements using CSS with the response object:

In [None]:
response.css('title')

[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList.<br>
To extract the text from the title above, you can do:

In [None]:
response.css('title::text').getall()    # getall() returns a list of all matches
#['Quotes to Scrape']

In [None]:
response.css('title').getall()          # without ::text, the full <title> element is returned
#['<title>Quotes to Scrape</title>']

In [None]:
response.css('title::text').get()       # get() returns only the first match
#'Quotes to Scrape'

response.css('title::text')[0].get()    # same result, but raises IndexError if nothing matches
#'Quotes to Scrape'

Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:

In [None]:
response.css('title::text').re(r'Quotes.*')
#['Quotes to Scrape']

response.css('title::text').re(r'Q\w+')
#['Quotes']

response.css('title::text').re(r'(\w+) to (\w+)')
#['Quotes', 'Scrape']
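To see why the last pattern returns two strings, the same regular expressions can be tried with Python's `re` module on the title text. This is only an approximation of what the outputs above show, not Scrapy's implementation; `re_flat` is a hypothetical helper that flattens capture groups into one list.

```python
import re

title = "Quotes to Scrape"  # text of the <title> element from the cells above


def re_flat(pattern, text):
    """Return every match as a flat list of strings.

    If the pattern has capture groups, the groups are returned
    instead of the whole match.
    """
    results = []
    for match in re.finditer(pattern, text):
        if match.groups():
            results.extend(match.groups())
        else:
            results.append(match.group(0))
    return results


whole = re_flat(r'Quotes.*', title)
word = re_flat(r'Q\w+', title)
groups = re_flat(r'(\w+) to (\w+)', title)

print(whole)   # ['Quotes to Scrape']
print(word)    # ['Quotes']
print(groups)  # ['Quotes', 'Scrape']
```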

### XPath: another way to select

In [None]:
response.xpath('//title')
#[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

response.xpath('//title/text()').get()
#'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood.
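XPath is not specific to Scrapy. As a rough illustration, Python's standard `xml.etree.ElementTree` module supports a limited XPath subset, enough to run the equivalent of `//title` on a small, well-formed snippet. The markup string below is a made-up stand-in for the real page, and ElementTree is used only because it needs no extra install; it is not what Scrapy uses internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the downloaded page.
markup = (
    "<html><head><title>Quotes to Scrape</title></head>"
    "<body><div class='quote'>sample</div></body></html>"
)

root = ET.fromstring(markup)

# ".//title" finds <title> at any depth, like "//title" in the shell.
titles = root.findall(".//title")
title_text = titles[0].text

# A predicate on an attribute, similar in spirit to the CSS selector div.quote.
quotes = root.findall(".//div[@class='quote']")

print(title_text)    # Quotes to Scrape
print(len(quotes))   # 1
```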

### Extracting quotes and authors

In [None]:
response.css("div.quote")
#[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
# <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,...]

Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

In [None]:
quote = response.css("div.quote")[0]

Now extract text, author and the tags from that quote using the quote object we just created:

In [None]:
text = quote.css("span.text::text").get()
print(text)
#“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

author = quote.css("small.author::text").get()
print(author)
#Albert Einstein

In [None]:
tags = quote.css("div.tags a.tag::text").getall()
print(tags)
#['change', 'deep-thoughts', 'thinking', 'world']

Now iterate over all the quote elements and put the fields together into a dictionary:

In [None]:
for quote in response.css("div.quote"):
    text = quote.css("span.text::text").get()
    author = quote.css("small.author::text").get()
    tags = quote.css("div.tags a.tag::text").getall()
    print(dict(text=text, author=author, tags=tags))

{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}

### Extracting data in our spider
Let's integrate the extraction logic above into our spider.

In [None]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

### Storing the scraped data

In [None]:
scrapy crawl quotes -O quotes.json        # the file extension (.json, .csv, ...) sets the output format

The -O command-line switch overwrites any existing file; use -o instead to append new content to any existing file.
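One caveat with appending: a JSON file holds a single array, so adding a second array to the end of it leaves the file invalid. JSON Lines (one object per line) does not have this problem. The sketch below shows the difference with plain strings; the two sample items stand in for the output of two separate runs.

```python
import json

first_run = [{"author": "Albert Einstein"}]
second_run = [{"author": "J.K. Rowling"}]

# Appending a second JSON array after an existing one is not valid JSON.
appended_json = json.dumps(first_run) + json.dumps(second_run)
try:
    json.loads(appended_json)
    json_valid = True
except json.JSONDecodeError:
    json_valid = False

# With JSON Lines, every line is its own document, so appending is safe.
appended_lines = "".join(json.dumps(item) + "\n" for item in first_run + second_run)
items = [json.loads(line) for line in appended_lines.splitlines()]

print(json_valid)   # False
print(len(items))   # 2
```

When you plan to append across runs, a JSON Lines file such as `scrapy crawl quotes -o quotes.jsonl` is the safer choice.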

### Quick scrape: product name, price, and link

### To create a project

In [None]:
scrapy startproject whiskyscrapper    # whiskyscrapper is the name of the project

### Open Scrapy shell
When we fetch a page in the Scrapy shell, the result is saved in the response variable.

In [None]:
cd whiskyscrapper    # move into the project directory

scrapy shell         # start the shell; a URL can also be passed: scrapy shell "site.com"

fetch('https://www.whiskyshop.com/scotch-whisky')    # run inside the shell: downloads the page and stores it in response

# A 200 status in the output means the page was fetched; otherwise, check the URL.

### Inspection of website
Inspecting the page helps us find the elements we want to scrape.

In [None]:
response.css('div.product-item-info')          # selects every div with the class product-item-info

response.css('div.product-item-info').get()    # get() returns only the first matching element, as a string

In [None]:
response.css('div.product-item-info').extract()  # older name for getall(): returns every matching element as a string

In [None]:
product = response.css('div.product-item-info')
len(product)                               # 100, i.e. the number of products per page

In [None]:
product.css('a.product-item-link::text').getall()     #will give name of all 100 products

product.css('span.price::text').getall()           #will give prices of all 100 products

In [None]:
product.css('a.product-item-link').attrib['href']         # href of the first matching link only; 'a.product-item-link::attr(href)' with getall() returns all of them

### Creating Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

In [None]:
scrapy genspider whiskyspider www.whiskyshop.com/scotch-whisky         # syntax: scrapy genspider <spider_name> <domain>; creates whiskyspider.py in the spiders directory
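The second argument matters because it ends up in the spider's allowed_domains, which is meant to hold bare domains rather than URLs with a path. The sketch below is a simplified approximation of an offsite check, not Scrapy's actual code; `domain_of` and `is_allowed` are hypothetical helpers, and the `?p=2` URL is a made-up example of a second page.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return just the host part of a URL, e.g. for use in allowed_domains."""
    return urlparse(url).hostname


def is_allowed(url, allowed_domains):
    """Simplified check: the host equals an allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


next_page = "https://www.whiskyshop.com/scotch-whisky?p=2"  # hypothetical URL

host = domain_of("https://www.whiskyshop.com/scotch-whisky")
ok_with_domain = is_allowed(next_page, ["www.whiskyshop.com"])
ok_with_path = is_allowed(next_page, ["www.whiskyshop.com/scotch-whisky"])

print(host)             # www.whiskyshop.com
print(ok_with_domain)   # True
print(ok_with_path)     # False: a host name never contains a path
```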

### Changes made in the spider
These changes go in the whiskyspider.py file.

In [None]:
import scrapy


class WhiskyspiderSpider(scrapy.Spider):
    name = 'whisky'
    # allowed_domains takes bare domains only; with a path in it, follow-up
    # requests to the same site are treated as offsite and dropped
    allowed_domains = ['www.whiskyshop.com']
    start_urls = ['http://www.whiskyshop.com/scotch-whisky/']

    def parse(self, response):
        for product in response.css('div.product-item-info'):
            # get() returns None when a product has no price element
            price = product.css('span.price::text').get()
            yield {
                'name': product.css('a.product-item-link::text').get(),
                'price': price.replace('£', '') if price else 'Not Available',
                'link': product.css('a.product-item-link::attr(href)').get(),
            }
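The price handling can also be pulled into a small function, which makes it easy to test without running a crawl. This is a sketch under assumptions: `clean_price` is a hypothetical helper, the sample strings are made up, and returning a float (or None) instead of a string is a design choice that differs from the spider above.

```python
def clean_price(raw):
    """Turn a scraped price string such as '£1,250.00' into a float.

    Returns None when the price is missing or is not a number.
    """
    if raw is None:
        return None
    digits = raw.replace("£", "").replace(",", "").strip()
    try:
        return float(digits)
    except ValueError:
        return None


print(clean_price("£45.99"))      # 45.99
print(clean_price("£1,250.00"))   # 1250.0
print(clean_price(None))          # None
print(clean_price("Sold out"))    # None
```

Keeping prices numeric makes later sorting and filtering simpler; a missing value stays None rather than a placeholder string.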

### To crawl the spider
Crawling means running the spider.

In [None]:
scrapy crawl whisky    # syntax: scrapy crawl <spider_name>

In [None]:
# save the scraped items to a JSON file
scrapy crawl whisky -O whisky.json

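Libraries that can help avoid being blocked:<br>
scrapy-user-agents, which rotates the User-Agent header<br>
scrapy-proxy-pool, which rotates requests through proxies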

In [None]:
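# Pagination: place this at the end of parse(), after the product loop.
# If the callback never fires, check that allowed_domains holds only a
# domain (no path); otherwise the follow-up request is dropped as offsite.
# To inspect the link in the shell first: response.css('a.action.next')

# ::attr(href) with get() returns None on the last page instead of raising
next_pg = response.css('a.action.next::attr(href)').get()
if next_pg is not None:
    # follow() resolves relative links and calls parse() on the next page
    yield response.follow(next_pg, callback=self.parse)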
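response.follow accepts relative links as well as absolute ones, resolving them against the current page's URL. The standard library's `urljoin` shows the same kind of resolution. The href values below are hypothetical examples of what a "next" link might contain, not values taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://www.whiskyshop.com/scotch-whisky"

# Hypothetical href values for the "next page" link.
from_root = urljoin(page_url, "/scotch-whisky?p=2")
query_only = urljoin(page_url, "?p=2")
absolute = urljoin(page_url, "https://www.whiskyshop.com/scotch-whisky?p=2")

print(from_root)    # https://www.whiskyshop.com/scotch-whisky?p=2
print(query_only)   # https://www.whiskyshop.com/scotch-whisky?p=2
print(absolute)     # https://www.whiskyshop.com/scotch-whisky?p=2
```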