* Review official documentation of Scrapy
* Set up Scrapy project using Scrapy CLI
* Review Scrapy Project Folder Structure
* Add Spider to the Scrapy Project
* Update Scrapy Settings to write to json file
* Run and Validate the Scrapy Project
* Exercise and Solution - Scrape Data to JSON Files

* Review official documentation of Scrapy

You can find the complete documentation of Scrapy at https://docs.scrapy.org/en/latest/index.html

1. Go through the tutorial in the documentation.
2. Understand the code provided in the tutorial (the Spider class, `parse`, `start_urls`, and other important attributes and methods).

* Set up Scrapy project using Scrapy CLI

1. Run `scrapy startproject quotes` to set up the Scrapy project.
2. Check whether a folder named `quotes` has been created.

* Review Scrapy Project Folder Structure

1. Check the configuration file `scrapy.cfg`
2. Review `spiders` folder
3. Review `settings.py`
4. Review `pipelines.py`
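
The layout generated by `scrapy startproject` can also be checked with a short script. This is a minimal sketch using only the standard library; the helper name `missing_project_files` is hypothetical, and the file list assumes the default project template.

```python
from pathlib import Path


def missing_project_files(root, project_name):
    """Return the expected project files that are not present under root."""
    root = Path(root)
    package = root / project_name
    expected = [
        root / 'scrapy.cfg',                  # project configuration file
        package / '__init__.py',
        package / 'items.py',
        package / 'middlewares.py',
        package / 'pipelines.py',             # item pipelines
        package / 'settings.py',              # project settings
        package / 'spiders' / '__init__.py',  # spiders are added to this folder
    ]
    return [str(p.relative_to(root)) for p in expected if not p.exists()]


if __name__ == '__main__':
    # Assumes `scrapy startproject quotes` was run in the current directory.
    print(missing_project_files('quotes', 'quotes'))
```

An empty list means every expected file was found.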

* Add Spider to the Scrapy Project

1. Recreate the project with the name `quotes_scraper`.
2. Add a program file named `quotes.py` under the `quotes_scraper/quotes_scraper/spiders` folder with the logic to scrape the data. Alternatively, run `scrapy genspider quotes https://www.goodreads.com/quotes` to create the spider with boilerplate code, and then update the generated `quotes.py` with the logic to scrape the data.
3. Review and understand the code. The code processes the quotes in all 100 pages that are available under the base URL - https://www.goodreads.com/quotes

```python
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'

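    @staticmethod
    def generate_urls(base_url):
        # Build one URL per page, from page 1 through page 100.
        urls = []
        for i in range(1, 101):
            urls.append(f'{base_url}?page={i}')
        return urls

    def start_requests(self):
        """Special method used in place of start_urls. Scrapy calls it automatically to get the initial requests."""
        # generate_urls is defined on the class, so it must be called through self.
        urls = self.generate_urls('https://www.goodreads.com/quotes')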
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get()
            }
            yield payload
```
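
The page URLs are plain strings, so the generation step can be tried on its own outside Scrapy. A minimal sketch, in which the `pages` parameter is an addition for illustration and not part of the spider above:

```python
def generate_urls(base_url, pages=100):
    """Return one URL per page, numbered from 1 to `pages` inclusive."""
    return [f'{base_url}?page={i}' for i in range(1, pages + 1)]


# Print the first three page URLs to confirm the pattern.
urls = generate_urls('https://www.goodreads.com/quotes', pages=3)
for url in urls:
    print(url)
```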

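You can also develop the logic to scrape the pages by following the `next_page` link on each page. Note that the example below starts at page 90, so it only follows the links from that page onward.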

```python
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes?page=90']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get()
            }
            yield payload

        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```
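
`response.follow` accepts relative links and resolves them against the URL of the current response. The standard library's `urljoin` performs the same kind of resolution, so the sketch below uses it to illustrate the idea. The `href` value is a hypothetical example, not copied from the site.

```python
from urllib.parse import urljoin

current_url = 'https://www.goodreads.com/quotes?page=90'
href = '/quotes?page=91'  # hypothetical value of the next_page link's href

# Resolve the relative link against the page it was found on.
next_url = urljoin(current_url, href)
print(next_url)
```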

* Update Scrapy Settings to write to json file

Make sure to review the details related to feed exports in the official documentation - https://docs.scrapy.org/en/latest/index.html.

1. Go to `settings.py` in the Scrapy project folder.
2. Append the setting below to the file and save it. You can also specify the full path for the output file.

```python
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'overwrite': True
    }
}
```
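
As a sketch of the full-path variant, the feed key can be an absolute path, and the `encoding` and `indent` feed options make the output easier to read. The path below is hypothetical; adjust it to your environment. Use this in place of the setting above rather than alongside it, because a second `FEEDS` assignment replaces the first.

```python
# settings.py (fragment)
FEEDS = {
    '/tmp/quotes_output/quotes.json': {  # hypothetical absolute path
        'format': 'json',
        'encoding': 'utf8',   # write non-ASCII characters as-is
        'indent': 4,          # pretty-print the JSON output
        'overwrite': True,    # replace the file on every run
    }
}
```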

* Run and Validate the Scrapy Project

1. Run `scrapy crawl quotes` from the project folder to run the spider.
2. Review the data in `quotes.json`.

Note: If you run the project multiple times, the file will be overwritten each time, because `overwrite` is set to `True`.
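
The review step can be partly automated. The following is a minimal sketch, assuming the feed was written as a single JSON array to `quotes.json`; the helper name `summarize_quotes` is hypothetical.

```python
import json
from pathlib import Path


def summarize_quotes(path, required_keys=('quoteText', 'authorOrTitle')):
    """Load the feed file and count items, including those missing a required value."""
    items = json.loads(Path(path).read_text(encoding='utf-8'))
    incomplete = [
        item for item in items
        if any(not item.get(key) for key in required_keys)
    ]
    return {'total': len(items), 'incomplete': len(incomplete)}


if __name__ == '__main__':
    # Assumes the crawl has already written quotes.json to the current directory.
    if Path('quotes.json').exists():
        print(summarize_quotes('quotes.json'))
```

A nonzero `incomplete` count suggests that a selector did not match on some items.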

Here is the updated spider code for your reference, with `generate_urls` defined inside `start_requests`.

```python
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        """Special method in place of start urls. This will be called automatically to get the list of urls"""

        def generate_urls(base_url):
            urls = []
            for i in range(1, 101):
                urls.append(f'{base_url}?page={i}')
            return urls
        
        urls = generate_urls('https://www.goodreads.com/quotes')
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get()
            }
            yield payload
```

* Exercise - Scrape Data to JSON Files

Scrape the quote text, author or title, author or title URL, and author or title URL text into JSON format.

1. Make sure to have a project named `quotes_scraper`.
2. Define a spider to scrape all 100 pages.
3. Save the quote text, author or title, author or title URL, and author or title URL text to a JSON file named `quotes.json`.

* Solution - Scrape Data to JSON Files

Scrape the quote text, author or title, author or title URL, and author or title URL text into JSON format.

1. Make sure to have a project named `quotes_scraper`.

```shell
scrapy startproject quotes_scraper
```

2. Define a spider to scrape all 100 pages.

Create a file named `quotes_spider.py` in the `spiders` folder.

3. Save the quote text, author or title, author or title URL, and author or title URL text to a JSON file named `quotes.json`.

Update `quotes_spider.py` with the code below.

```python
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        """Special method in place of start urls. This will be called automatically to get the list of urls"""

        def generate_urls(base_url):
            urls = []
            for i in range(1, 101):
                urls.append(f'{base_url}?page={i}')
            return urls
        
        urls = generate_urls('https://www.goodreads.com/quotes')
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload
```
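
Values extracted with `::text` can carry leading and trailing whitespace from the page markup. If that turns out to be the case in `quotes.json`, a small helper can tidy each item before it is yielded. This is only a sketch: `clean_payload` is a hypothetical helper, not part of Scrapy, and the sample values are made up.

```python
def clean_payload(payload):
    """Strip surrounding whitespace from string values; leave None and other types untouched."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in payload.items()
    }


# Made-up sample item with the kind of padding that raw text nodes may contain.
raw = {'quoteText': '\n      Example quote text\n  ', 'authorOrTitle': None}
print(clean_payload(raw))
```

In the spider, this would mean replacing `yield payload` with `yield clean_payload(payload)`.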

Make sure the setting below is added to `settings.py`.

```python
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'overwrite': True
    }
}
```

Run `scrapy crawl quotes` to add the data to `quotes.json`.