* Validate MongoDB Database Setup
* Overview of Scrapy Pipelines
* Overview of using hash code for quote text
* Update Spider Logic to include hash code
* Develop Pipeline Logic to write to MongoDB
* Run the Pipeline to write to MongoDB
* Validate Data in MongoDB Collection
* Exercise and Solution

* Validate MongoDB Database Setup

1. Make sure MongoDB is running. One way to check is `telnet localhost 27017`, which connects if the server is listening on the default port.
2. Launch the Mongo shell using `mongosh`.
3. We can also use `pymongo` to connect to the MongoDB database from Python.

```python
import pymongo
client = pymongo.MongoClient('localhost', 27017)

for db in client.list_databases():
    print(db['name'])

# Get a handle to a database; MongoDB creates it lazily on the first write.
# We can then use the relevant APIs to deal with collections and documents.
db = client['quotes_db']

# If the database is empty, you will not see any collections
for collection in db.list_collections():
    print(collection)
```
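
Where `telnet` is not installed, the reachability check from step 1 can be approximated in Python. This is a minimal sketch using only the standard library; the helper name `is_port_open` and the 2-second timeout are illustrative choices, not part of `pymongo` or MongoDB. It only confirms that something accepts TCP connections on the port, not that the listener is MongoDB.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


if __name__ == '__main__':
    # 27017 is MongoDB's default port, as used throughout this section
    print(is_port_open('localhost', 27017))
```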

* Overview of Scrapy Pipelines

Here are the details about Scrapy pipelines.
1. We define pipelines in `pipelines.py`.
2. The pipeline class holds the logic to write the data to the specified target.
3. The logic to process HTML content (the spider) and the logic to write to a target such as a database (the pipeline) are clearly separated.

We will see how to write the extracted data into a MongoDB database using Scrapy pipelines.
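
To illustrate the separation described above, here is a minimal sketch of the method hooks Scrapy calls on a pipeline class, using an in-memory list as a stand-in target so it runs without Scrapy or MongoDB. The class name `InMemoryPipeline` and the driver loop at the bottom are hypothetical; in a real project Scrapy itself calls these methods.

```python
class InMemoryPipeline:
    """Illustrative pipeline: same hooks as a Scrapy pipeline, list as the target."""

    def open_spider(self, spider):
        # Called once when the spider starts: acquire the target (here, a list)
        self.items = []

    def process_item(self, item, spider):
        # Called for every item the spider yields: write it to the target
        self.items.append(dict(item))
        # Return the item so any later pipeline can process it too
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources
        self.closed = True


# Hypothetical driver that mimics the order in which Scrapy calls the hooks
pipeline = InMemoryPipeline()
pipeline.open_spider(spider=None)
for scraped in [{'quoteText': 'first'}, {'quoteText': 'second'}]:
    pipeline.process_item(scraped, spider=None)
pipeline.close_spider(spider=None)
print(len(pipeline.items))
```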

* Overview of using hash code for quote text

```python
import hashlib

s = 'Hello World'
hashlib.md5(s.encode()).hexdigest()
```
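
The hash gives each quote a short, fixed-length key derived only from its text, so the same quote always maps to the same key across crawls. A minimal sketch, assuming a hypothetical helper name `quote_hash`; the spider in this section inlines the same `hashlib.md5(...).hexdigest()` call instead.

```python
import hashlib


def quote_hash(text: str) -> str:
    """Return the MD5 hex digest of the UTF-8 encoded text (32 hex characters)."""
    return hashlib.md5(text.encode()).hexdigest()


first = quote_hash('Hello World')
second = quote_hash('Hello World')
other = quote_hash('Hello World!')

print(first == second)  # identical text gives an identical key
print(first == other)   # a change in the text gives a different key
print(len(first))       # an MD5 hex digest is always 32 characters
```

MD5 serves here only as a content key for matching documents, not as a security measure.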

* Update Spider Logic to include hash code

```python
import hashlib
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        """Special method in place of start urls. This will be called automatically to get the list of urls"""

        def generate_urls(base_url):
            urls = []
            for i in range(1, 4): # considers 3 pages
                urls.append(f'{base_url}?page={i}')
            return urls
        
        urls = generate_urls('https://www.goodreads.com/quotes')
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            quote_text = quoteDetails.css('.quoteText::text').get()
            payload = {
                'quoteTextHash': hashlib.md5(quote_text.encode()).hexdigest(),
                'quoteText': quote_text,
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleUrlText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload

```

* Develop Pipeline Logic to write to MongoDB

1. Connect to MongoDB Database
2. Process the data and store into MongoDB Database
3. Close the connection to MongoDB Database

Update `pipelines.py`

```python
from pymongo import MongoClient


class QuotesScraperPipeline:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        mongo_uri = crawler.settings.get('MONGO_URI')
        mongo_db = crawler.settings.get('MONGO_DATABASE')
        collection_name = crawler.settings.get('MONGO_COLLECTION')
        return cls(mongo_uri, mongo_db, collection_name)

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

Update `settings.py` with the MongoDB connection details and the pipeline registration. Make sure to comment out or delete the `FEEDS` setting, which writes the data to a file.

```python
ITEM_PIPELINES = {
    'quotes_scraper.pipelines.QuotesScraperPipeline': 300
}

MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'quotes_db'
MONGO_COLLECTION = 'quotes'
```

* Run the Pipeline to write to MongoDB

Run the pipeline using `scrapy crawl quotes`. It processes the data from the specified URLs and loads it into the MongoDB collection.

* Validate Data in MongoDB Collection

1. Launch the Mongo shell.
2. Switch to `quotes_db` using `use quotes_db`.
3. Check the count in the collection using `db.quotes.countDocuments({})`.
4. Get the first few records in a readable format using `db.quotes.find({}).pretty()`.
5. Delete the data from the collection using `db.quotes.deleteMany({})`.

Here are the Python code snippets to validate and delete the data from the collection:

```python
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')

for item in client.list_databases():
    print(item['name'])

db = client['quotes_db']

for coll in db.list_collections():
    print(coll['name'])

quotes_coll = db['quotes']

quotes_coll.count_documents({})

quotes_coll.find({})

for coll_item in quotes_coll.find({}):
    print(coll_item)

quotes_coll.delete_many({})
```
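
Because the pipeline above uses `insert_one`, running the crawl more than once stores the same quotes again. A minimal sketch for spotting this, assuming a hypothetical helper `duplicate_hashes` that accepts any iterable of documents (for example the result of `quotes_coll.find({})`); the sample documents are made up so the block runs without a database.

```python
from collections import Counter


def duplicate_hashes(documents):
    """Return the quoteTextHash values that occur in more than one document."""
    counts = Counter(doc['quoteTextHash'] for doc in documents)
    return sorted(h for h, n in counts.items() if n > 1)


# Made-up documents standing in for quotes_coll.find({})
sample_docs = [
    {'quoteTextHash': 'aaa', 'quoteText': 'first'},
    {'quoteTextHash': 'bbb', 'quoteText': 'second'},
    {'quoteTextHash': 'aaa', 'quoteText': 'first'},
]
print(duplicate_hashes(sample_docs))
```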

* Exercise - Include page urls while writing to MongoDB

1. Add the page URL to the payload in the `parse` function. The attribute name should be `pageUrl`, populated from `response.url`. Place it right after `quoteTextHash`.
2. Make sure the data is upserted (merged). If there is no document in MongoDB with the given `quoteTextHash`, the document should be inserted; otherwise the existing document should be updated.
3. Validate by reviewing the data in the MongoDB collection.

Here is the sample code to upsert data into a Mongo collection:

```python
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')

db = client['demo']

coll = db['users']

coll.insert_one({
    "user_id": 1, 
    "first_name": "Scott", 
    "last_name": "Tiger", 
    "username": "stiger", 
    "email": None
})

coll.insert_one({
    "user_id": 2, 
    "first_name": "Donald", 
    "last_name": "Duck", 
    "username": "dduck", 
    "email": None
})

from pprint import pprint

for item in coll.find({}):
    pprint(item)

coll.update_one(
    {"user_id": 1},
    {"$set": {
        "email": "stiger@email.com"
    }},
    upsert=True
)

for item in coll.find({}):
    pprint(item)

query = {'user_id': 3}
update = {'$set': {
    'first_name': 'Mickey',
    'last_name': 'Mouse',
    'username': 'mmouse',
    'email': 'mmouse@email.com'
}}
coll.update_one(filter=query, update=update, upsert=True)

for item in coll.find({}):
    pprint(item)

coll.drop()
```

Here are the equivalent mongo shell commands:

```js
use demo

db.users.insertOne({
    "user_id": 1, 
    "first_name": "Scott", 
    "last_name": "Tiger", 
    "username": "stiger", 
    "email": null
})

db.users.insertOne({
    "user_id": 2, 
    "first_name": "Donald", 
    "last_name": "Duck", 
    "username": "dduck", 
    "email": null
})

db.users.updateOne(
    {"user_id": 1}, // query
    {"$set": {
        "email": "stiger@email.com"
    }}, // update
    {"upsert": true} // update or insert
)

db.users.updateOne(
    {"user_id": 3},
    {"$set": {
        "first_name": "Mickey",
        "last_name": "Mouse",
        "username": "mmouse",
        "email": "mmouse@email.com"
    }},
    {"upsert": true}
)

db.users.find().pretty()

db.users.drop()
```
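
The upsert behaviour used above can be modelled without a server. This is a simplified in-memory sketch, not how MongoDB is implemented: the hypothetical `upsert_one` function handles only equality filters and `$set`-style field updates on a list of dicts.

```python
def upsert_one(collection, query, fields):
    """Mimic update_one(query, {'$set': fields}, upsert=True) on a list of dicts."""
    for doc in collection:
        # A document matches when every key in the query has an equal value
        if all(doc.get(key) == value for key, value in query.items()):
            doc.update(fields)  # update only the first match, like update_one
            return 'updated'
    # No match: insert a new document built from the query plus the new fields
    collection.append({**query, **fields})
    return 'inserted'


users = [{'user_id': 1, 'username': 'stiger', 'email': None}]

print(upsert_one(users, {'user_id': 1}, {'email': 'stiger@email.com'}))
print(upsert_one(users, {'user_id': 3}, {'username': 'mmouse'}))
print(users)
```

Note that the inserted document carries the filter field (`user_id`) as well as the `$set` fields, which is why the Mickey Mouse example above does not need to repeat `user_id` in the update.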

* Solution - Include page urls while writing to MongoDB

1. Update `parse` function in `quotes_spider.py`

```python
import hashlib
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes?page=90']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            quote_text = quoteDetails.css('.quoteText::text').get()
            payload = {
                # Hash each quote on its own so the same text always yields the same key
                'quoteTextHash': hashlib.sha256(quote_text.encode()).hexdigest(),
                'pageUrl': response.url,
                'quoteText': quote_text,
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload

        # Follow the pagination link, if any, and parse the next page the same way
        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```

2. Update `pipelines.py` with the changes required to upsert into the MongoDB collection

```python
from pymongo import MongoClient


# The class name must match the entry registered in ITEM_PIPELINES in settings.py
class QuotesScraperPipeline:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        mongo_uri = crawler.settings.get('MONGO_URI')
        mongo_db = crawler.settings.get('MONGO_DATABASE')
        collection_name = crawler.settings.get('MONGO_COLLECTION')
        return cls(mongo_uri, mongo_db, collection_name)

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.collection.insert_one(dict(item))
        # Upsert keyed on quoteTextHash: update the matching document, or insert if none exists
        doc = dict(item)
        query = {'quoteTextHash': doc['quoteTextHash']}
        update = {'$set': doc}
        self.collection.update_one(query, update, upsert=True)
        return item
```

3. Run `scrapy crawl quotes` to crawl the data and populate the MongoDB collection.
4. Run the Mongo shell commands below to validate. You can also use the `pymongo`-based approach.

```js
use quotes_db
db.quotes.countDocuments({})
db.quotes.find({}).pretty()
db.quotes.countDocuments({"pageUrl": "https://www.goodreads.com/quotes?page=90"})
db.quotes.find({"pageUrl": "https://www.goodreads.com/quotes?page=90"}).pretty()
```