## What is Scrapy?

**Scrapy** is a popular Python framework for web scraping and crawling.
It lets you write small programs (called **spiders**) that automatically browse websites and collect data from many pages very quickly.

**Why use Scrapy?**

* It’s **fast** and **efficient** for big projects, because it sends many requests concurrently instead of waiting for each page in turn.
* It handles requests, data extraction, and data export out of the box.
* It’s easy to add advanced features (like pipelines and middlewares) as your project grows.

**Real-life Example:**
Imagine you want to collect the names and prices of all books from an online bookstore.
With Scrapy, you can:

* Tell it which website to visit.
* Show it where to find the name and price on each page.
* Let it visit every book page, collect the data, and save it to a file for you, **automatically**.

---

## Full Flow with Scrapy

Here’s how it all connects:

1. **Scrapy Spider**: You write a spider that tells Scrapy which website to visit and what data to collect.
2. **Middlewares**: Each request and response passes through middlewares, which apply extra rules (such as proxies, retries, or user-agent changes) while Scrapy browses.
3. **Pipelines**: Every item the spider yields passes through pipelines, which clean, validate, or store it.
4. **Data Export**: Scrapy writes the final, clean data to files (CSV, JSON, etc.), ready for you to use.
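
As a rough sketch of how these parts are switched on, the dictionary below uses real Scrapy setting names (`ITEM_PIPELINES`, `DOWNLOADER_MIDDLEWARES`, `DOWNLOAD_DELAY`, `USER_AGENT`, `FEEDS`). The two class paths and the user-agent string are hypothetical placeholders, not part of this notebook.

```python
# Sketch of a Scrapy settings dict. The class paths under "myproject" and
# the user-agent string are hypothetical placeholders.
settings = {
    # Pipelines: items pass through them in ascending order of the number.
    "ITEM_PIPELINES": {"myproject.pipelines.CleanPricePipeline": 300},
    # Middlewares: hooks that see every request and response.
    "DOWNLOADER_MIDDLEWARES": {"myproject.middlewares.CustomUserAgentMiddleware": 543},
    # Seconds to wait between consecutive requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    "USER_AGENT": "books-tutorial-bot/0.1",
    # Data export: write every scraped item to books.csv in CSV format.
    "FEEDS": {"books.csv": {"format": "csv"}},
}
```

A dictionary like this can be passed to `CrawlerProcess(settings)`, in the same way the smaller settings dictionary is passed later in this notebook.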

---

## Visual Diagram (Text Form)

```
[Website]
    ↓
[Scrapy Spider] → [Middlewares] → [Pipelines] → [Export File (CSV/JSON)]
```
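
To make the middleware stage a little more concrete: in Scrapy, a downloader middleware is a plain class whose `process_request` method sees each outgoing request before the page is downloaded. The sketch below is a minimal illustration under that assumption. The class name and the user-agent string are made up for this example.

```python
class CustomUserAgentMiddleware:
    """Hypothetical downloader middleware that sets a User-Agent header."""

    def __init__(self, user_agent="books-tutorial-bot/0.1"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Only fill in the header if the request does not already set one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing this request.
        return None
```

To enable a class like this, its import path would go in the `DOWNLOADER_MIDDLEWARES` setting.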

---

**In Short:**

* **Scrapy** is the engine.
* **Spider** does the crawling.
* **Middlewares** adjust requests and responses on the way.
* **Pipelines** clean and save the results.
* **Data export** gives you the final output.


## Goal:
I want to collect the title and price of all books from an online bookstore, and save them to a CSV file.


In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd

In [None]:
# define the spider
class BooksSpider(scrapy.Spider):
    name = "books"
    # placeholder: replace with the bookstore's full URL, including http:// or https://
    start_urls = ['website']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
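
The "next page" link on a listing page is often a relative URL, which is why the spider uses `response.follow`: it accepts a relative link and resolves it against the URL of the current page. The sketch below shows roughly the same resolution with the standard library. The bookstore URL and the `href` value are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and the relative href found in its "next" link.
current_page = "https://bookstore.example/catalogue/page-1.html"
next_href = "page-2.html"

# Resolve the relative link against the page it was found on.
next_url = urljoin(current_page, next_href)
print(next_url)
```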

In [4]:
# pipeline that collects every scraped item in an in-memory list
results = []

class StoreResultsPipeline:
    def process_item(self, item, spider):
        results.append(item)
        return item
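
`StoreResultsPipeline` only stores the items. If the cleaning should happen inside Scrapy rather than afterwards in pandas, a second pipeline could do it. The class below is a hypothetical sketch, assuming prices arrive as strings such as `'£51.77'`.

```python
class CleanPricePipeline:
    """Hypothetical pipeline: turn a price string such as '£51.77' into a float."""

    def process_item(self, item, spider):
        price = item.get("price")
        if isinstance(price, str):
            # Drop the currency symbol and surrounding whitespace, then convert.
            item["price"] = float(price.replace("£", "").strip())
        return item
```

Pipelines run in ascending order of the number given in `ITEM_PIPELINES`, so a cleaning pipeline like this would need a lower number than `StoreResultsPipeline` to run first.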


In [5]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
process = CrawlerProcess({
    'ITEM_PIPELINES': {'__main__.StoreResultsPipeline': 1}, # Use our custom pipeline
    'LOG_ENABLED': False,  # Less output
})
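process.crawl(BooksSpider)
# start() blocks until the crawl finishes; the underlying reactor cannot be
# restarted, so restart the kernel before running this cell again
process.start()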

In [8]:
df = pd.DataFrame(results)
# strip the currency symbol and convert the price to a number
df['price'] = df['price'].str.replace('£', '', regex=False).astype(float)

In [9]:
df.head()

|   | title | price |
|---|-------|-------|
| 0 | A Light in the Attic | 51.77 |
| 1 | Tipping the Velvet | 53.74 |
| 2 | Soumission | 50.1 |
| 3 | Sharp Objects | 47.82 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 |


In [10]:
df.to_csv('books.csv', index=False)
print("Saved to books.csv!")

Saved to books.csv!
