# Web Scraping using Scrapy

- An open source and collaborative framework for extracting the data you need from websites.

Installation: ```pip install scrapy```

Dependency (on Windows): [Microsoft Visual C++ Build Tools](http://go.microsoft.com/fwlink/?LinkId=691126&fixForIE=.exe)
<br>
If the installation fails with an error related to rc.exe, see this answer: [StackOverflow](https://stackoverflow.com/questions/43858836/python-installing-clarifai-vs14-0-link-exe-failed-with-exit-status-1158)

To set up a Scrapy project, open a terminal in the desired folder and run the following (inside a Jupyter notebook, prefix the command with ```!```):

```
scrapy startproject myfirstproject
```

The command prints output similar to:

```
New Scrapy project 'myfirstproject', using template directory 'c:\users\rusta\appdata\local\programs\python\python38\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\rusta\Documents\Python Scripts\Using Scrapy\myfirstproject

You can start your first spider with:
    cd myfirstproject
    scrapy genspider example example.com
```


## Writing a Spider

A spider is a class that defines how to scrape information from a website. Every spider inherits from the base class ```scrapy.Spider```.

To write a spider, create a Python file in the project's ```spiders``` folder with the following code:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies the spider; used by `scrapy crawl quotes`
    name = 'quotes'

    # Scrapy calls start_requests() to obtain the initial GET requests
    def start_requests(self):
        # URLs to start crawling from
        urls = ['http://quotes.toscrape.com/page/1/']

        # Generator: yield one Request per URL; parse() handles each response
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # For ".../page/1/" the second-to-last path segment is the page number
        page_id = response.url.split("/")[-2]

        # Save the raw HTML body to quotes-<page>.html
        filename = "quotes-%s.html" % page_id
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)
```

To run the spider, open a terminal in the project directory and run ```scrapy crawl quotes```. It saves the page as an HTML file (```quotes-1.html```).
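The file name written by ```parse``` comes from splitting the response URL on ```/```. The sketch below reproduces that single step in plain Python, without a running spider; ```page_filename``` is a hypothetical helper name and not part of Scrapy.

```python
def page_filename(url):
    """Build the file name the same way parse() does (hypothetical helper)."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page_id = url.split("/")[-2]
    return "quotes-%s.html" % page_id


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
```

The indexing relies on the trailing slash in the URL: without it, index ```-2``` would pick up ```page``` instead of the page number.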

### Using Shell to extract information

To open a page in the Scrapy shell: ```scrapy shell "url"```

To inspect the response object: ```response```

Extract data using CSS selectors:

```response.css('title')``` : Returns a list-like ```SelectorList``` of all matching elements; each selector wraps the element together with metadata.
<br>
```response.css('title').getall()```: Returns a list of the matched elements as HTML strings
<br>
```response.css('title::text').getall()``` : Returns a list containing only the text inside each match
<br>
```response.css('title::text').get()``` : Returns only the first match, or ```None``` if nothing matches
<br>
```response.css("div.quote").getall()``` : Selects every ```div``` tag with the class ```quote```
<br>
```quote = response.css('div.quote')[0]``` : The first quote block on the page <br>
```title0 = quote.css("span.text").get()``` : Full HTML of the ```span``` holding the quote text <br>
```title0 = quote.css("span.text::text").get()``` : Only the quote text <br>
```author = quote.css('small.author::text').get()``` : The author's name <br>

Yielding the extracted data as items, so that it can be exported as JSON:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies the spider; used by `scrapy crawl quotes`
    name = 'quotes'

    # Scrapy calls start_requests() to obtain the initial GET requests
    def start_requests(self):
        # URLs to start crawling from
        urls = ['http://quotes.toscrape.com/page/1/']

        # Generator: yield one Request per URL; parse() handles each response
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each div.quote block holds one quote with its author and tags
        for q in response.css("div.quote"):
            text = q.css("span.text::text").get()
            author = q.css("small.author::text").get()
            tags = q.css("a.tag::text").getall()

            # Yield one dict (item) per quote; Scrapy collects these for export
            yield {
                'text': text,
                'author': author,
                'tags': tags
            }
```

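To run the above spider and save its items: ```scrapy crawl quotes -o quotes.json``` 
<br> ```-o``` is a command-line option that writes the scraped items to the given file; the output format is inferred from the file extension. It appends to an existing file, so delete an old ```quotes.json``` before re-running, otherwise the result is not valid JSON.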
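Once the crawl has finished, the export is an ordinary JSON array of objects with the ```text```, ```author``` and ```tags``` keys yielded by ```parse```. A minimal sketch of reading it back with the standard library, assuming the file is named ```quotes.json``` as in the command above; ```quotes_by_author``` is a hypothetical helper, not something Scrapy provides.

```python
import json
from collections import defaultdict


def quotes_by_author(path):
    """Group exported quote texts by author (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # list of dicts, one per yielded item

    grouped = defaultdict(list)
    for item in items:
        grouped[item["author"]].append(item["text"])
    return dict(grouped)
```

Calling ```quotes_by_author("quotes.json")``` in the project directory would return a dictionary that maps each author to the list of their quote texts.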

### Recursive Crawler

Check for the next button: ```response.css('li.next a').get()```
<br>Extracting its ```href```: ```response.css('li.next a::attr(href)').get()```
<br> Another way: ```response.css('li.next a').attrib["href"]```

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies the spider; used by `scrapy crawl quotes`
    name = 'quotes'

    # Scrapy calls start_requests() to obtain the initial GET requests
    def start_requests(self):
        # URLs to start crawling from
        urls = ['http://quotes.toscrape.com/page/1/']

        # Generator: yield one Request per URL; parse() handles each response
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each div.quote block holds one quote with its author and tags
        for q in response.css("div.quote"):
            text = q.css("span.text::text").get()
            author = q.css("small.author::text").get()
            tags = q.css("a.tag::text").getall()

            yield {
                'text': text,
                'author': author,
                'tags': tags
            }

        # Follow the "next" link, if any; parse() is called again for that page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # The href is relative, so resolve it against the current page URL
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
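The ```href``` on the next button is a relative link, which is why ```parse``` passes it through ```response.urljoin``` before requesting it. The sketch below shows the same resolution step with the standard library's ```urllib.parse.urljoin```, used here as a stand-in so that it runs without a live response; ```/page/2/``` is an assumed example value.

```python
from urllib.parse import urljoin

current_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"  # assumed value of the next button's href

# Resolve the relative link against the URL of the page it was found on
next_url = urljoin(current_url, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/

# Links that are already absolute pass through unchanged
same = urljoin(current_url, "http://quotes.toscrape.com/page/3/")
print(same)  # http://quotes.toscrape.com/page/3/
```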