
# Scrapy Code Examples — From Basics to Useful Patterns



## How to Use
- `%%bash` cells are terminal commands (kept here for convenience).
- Python cells show minimal patterns; swap in the selectors and URLs for your site.


## 0) Install / Environment

In [None]:

%%bash
# Install Scrapy (uncomment to use). Prefer virtualenv or conda.
# python -m pip install --upgrade pip
# pip install scrapy


## 1) Start a Project (once)

In [None]:

%%bash
# Create a standard Scrapy project.
# scrapy startproject myproject



### Project Layout (reference)
```
myproject/
├─ myproject/
│  ├─ spiders/
│  ├─ items.py
│  ├─ pipelines.py
│  ├─ settings.py
│  └─ __init__.py
└─ scrapy.cfg
```


## 2) Minimal Spider (single page)

In [None]:

%%bash
mkdir -p myproject/myproject/spiders
cat > myproject/myproject/spiders/minimal_spider.py << 'PY'
import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal_example"
    allowed_domains = ["example.com"]  # optional safety net
    start_urls = ["https://example.com"]

    def parse(self, response):
        for card in response.css("div.card"):
            yield {
                "title": card.css("h2::text").get(),
                "link": response.urljoin(card.css("a::attr(href)").get()),
            }
PY



**Run (terminal):**
```bash
cd myproject
scrapy crawl minimal_example -O minimal.csv
```


## 3) Selectors Cheat Sheet (copy & tweak)

In [None]:

# Inside parse():
# CSS:
# response.css("article.product_pod h3 a::attr(title)").get()
# response.css(".price_color::text").get()
# response.css("li.next a::attr(href)").get()
#
# XPath:
# response.xpath("//article[contains(@class,'product_pod')]//h3/a/@title").get()
# response.xpath("//p[@class='price_color']/text()").get()
# response.xpath("//li[@class='next']/a/@href").get()


## 4) Pagination Spider (follow 'Next')

In [None]:

%%bash
cat > myproject/myproject/spiders/pagination_spider.py << 'PY'
import scrapy

class PaginationSpider(scrapy.Spider):
    name = "pagination_example"
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "FEEDS": {"pagination_books.csv": {"format": "csv", "overwrite": True}},
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 0.5,
        "AUTOTHROTTLE_ENABLED": True,
        "USER_AGENT": "Scrapy-Examples (+https://example.org/edu)",
    }

    def parse(self, response):
        for product in response.css("article.product_pod"):
            yield {
                "title": product.css("h3 a::attr(title)").get(),
                "price": product.css(".price_color::text").get(),
                "url": response.urljoin(product.css("h3 a::attr(href)").get()),
            }
        next_rel = response.css("li.next a::attr(href)").get()
        if next_rel:
            # response.follow() resolves relative URLs against the current page.
            yield response.follow(next_rel, callback=self.parse)
PY



**Run (terminal):**
```bash
cd myproject
scrapy crawl pagination_example
```
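
Relative `next` links are resolved against the URL of the page they appear on, so the same-looking `href` can mean different things on different pages. The sketch below uses only `urllib.parse.urljoin` to illustrate that resolution. The helper name `resolve_next` and the `page-N.html` paths are assumptions in the style of books.toscrape.com, and a page that declares a `<base>` tag could resolve differently.

```python
from urllib.parse import urljoin

def resolve_next(page_url, next_href):
    """Return the absolute URL for a relative 'next' link, or None if there is none."""
    if not next_href:
        return None
    return urljoin(page_url, next_href)

# On the home page the link includes the "catalogue/" prefix...
first = resolve_next("https://books.toscrape.com/", "catalogue/page-2.html")
# ...while on later pages it is relative to the "catalogue/" directory.
second = resolve_next("https://books.toscrape.com/catalogue/page-2.html", "page-3.html")
# The last page has no 'next' link, so the crawl stops there.
last = resolve_next("https://books.toscrape.com/catalogue/page-50.html", None)

print(first)
print(second)
print(last)
```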


## 5) List → Detail Pattern

In [None]:

%%bash
cat > myproject/myproject/spiders/list_detail_spider.py << 'PY'
import scrapy

class ListDetailSpider(scrapy.Spider):
    name = "list_detail_example"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for product in response.css("article.product_pod"):
            detail_url = response.urljoin(product.css("h3 a::attr(href)").get())
            yield response.follow(detail_url, callback=self.parse_detail, meta={
                "title": product.css("h3 a::attr(title)").get(),
                "price": product.css(".price_color::text").get(),
            })

        next_rel = response.css("li.next a::attr(href)").get()
        if next_rel:
            yield response.follow(next_rel, callback=self.parse)

    def parse_detail(self, response):
        item = {
            "title": response.meta.get("title"),
            "price": response.meta.get("price"),
            "url": response.url,
        }
        desc = response.css("#product_description ~ p::text").get()
        item["description"] = desc.strip() if desc else None
        yield item
PY


## 6) Items & Pipelines

In [None]:

%%bash
cat > myproject/myproject/items.py << 'PY'
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    rating_num = scrapy.Field()
    available = scrapy.Field()
PY

cat > myproject/myproject/pipelines.py << 'PY'
import re

class CleanPipeline:
    # Digits with an optional decimal part, e.g. 51.77; a lone "." no longer matches.
    price_re = re.compile(r"\d+(?:\.\d+)?")
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def process_item(self, item, spider):
        # price -> float (None when no number is found)
        m = self.price_re.search(item.get("price") or "")
        item["price"] = float(m.group(0)) if m else None

        # rating words -> numbers (None for missing or unknown ratings)
        item["rating_num"] = self.rating_map.get(item.get("rating"))

        # available flag (example): True when the text mentions "in stock"
        avail = (item.get("available") or item.get("availability") or "").lower()
        item["available"] = "in stock" in avail
        return item
PY

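python - << 'PY'
import re
from pathlib import Path

p = Path("myproject/myproject/settings.py")
txt = p.read_text(encoding="utf-8")
# The generated settings.py ships with a commented-out "#ITEM_PIPELINES = {...}" sample,
# so check for an uncommented assignment rather than a bare substring.
if re.search(r"^ITEM_PIPELINES\s*=", txt, flags=re.MULTILINE):
    print("ITEM_PIPELINES is already set; add CleanPipeline to it by hand if needed.")
else:
    txt += "\n\nITEM_PIPELINES = {\n    'myproject.pipelines.CleanPipeline': 300,\n}\n"
    p.write_text(txt, encoding="utf-8")
    print("Enabled CleanPipeline in settings.py")
PY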
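
Because `process_item` only needs a mapping, the cleaning rules can be exercised on plain dicts before anything is wired into a crawl. The sketch below repeats the pipeline logic in a standalone class, named `CleanPipelineSketch` to make clear it is an illustration, and feeds it two made-up items.

```python
import re

class CleanPipelineSketch:
    """Standalone copy of the cleaning rules, for experimenting outside Scrapy."""

    price_re = re.compile(r"\d+(?:\.\d+)?")
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def process_item(self, item, spider=None):
        # price text -> float, or None when no number is present
        m = self.price_re.search(item.get("price") or "")
        item["price"] = float(m.group(0)) if m else None

        # rating word -> number, or None when missing or unknown
        item["rating_num"] = self.rating_map.get(item.get("rating"))

        # availability text -> boolean flag
        avail = (item.get("available") or item.get("availability") or "").lower()
        item["available"] = "in stock" in avail
        return item

pipeline = CleanPipelineSketch()

# Made-up sample items (assumed values, not scraped data).
full = pipeline.process_item(
    {"price": "£51.77", "rating": "Three", "availability": "In stock (22 available)"}
)
sparse = pipeline.process_item({"price": None})

print(full)
print(sparse)
```

Items with missing fields come out with `None` or `False` rather than raising, which keeps a single bad record from stopping the crawl.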


## 7) Export Options

In [None]:

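# CLI (the format is inferred from the file extension):
# scrapy crawl <name> -O out.csv     # -O overwrites the file
# scrapy crawl <name> -O out.jsonl   # JSON Lines
# scrapy crawl <name> -o out.jsonl   # -o appends instead of overwriting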
#
# settings.py (global):
# FEEDS = {
#   'output/items.csv': {'format': 'csv', 'overwrite': True},
#   'output/items.jsonl': {'format': 'jsonlines'}
# }
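
After an export, a quick way to verify the feed is to load it and count records and empty fields. The sketch below assumes a JSON Lines file; it writes a tiny sample to a temporary directory first so that it runs on its own. The helper name `summarize_jsonl` and the sample records are illustrative, not part of Scrapy.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def summarize_jsonl(path):
    """Return (record_count, Counter of fields whose value is None or empty)."""
    missing = Counter()
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            for key, value in record.items():
                if value is None or value == "":
                    missing[key] += 1
    return count, missing

with tempfile.TemporaryDirectory() as tmp:
    # Made-up records standing in for a real export such as output/items.jsonl.
    sample = Path(tmp) / "items.jsonl"
    sample.write_text(
        '{"title": "Book One", "price": "10.00"}\n'
        '{"title": "Book Two", "price": null}\n',
        encoding="utf-8",
    )
    total, missing_fields = summarize_jsonl(sample)

print(total, dict(missing_fields))
```

A field that is missing in most records usually points to a selector that no longer matches the page.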


## 8) Settings Essentials

In [None]:

%%bash
python - << 'PY'
import re
from pathlib import Path

p = Path("myproject/myproject/settings.py")
txt = p.read_text(encoding="utf-8")

additions = [
    "ROBOTSTXT_OBEY = True",
    "DOWNLOAD_DELAY = 0.5",
    "AUTOTHROTTLE_ENABLED = True",
    "USER_AGENT = 'Scrapy-Examples (+https://example.org/edu)'",
    "RETRY_ENABLED = True",
    "RETRY_TIMES = 2",
    # "LOG_LEVEL = 'INFO'",
]
for line in additions:
    name = line.split("=")[0].strip()
    # Only an uncommented assignment counts as "already set"; the generated
    # settings.py contains commented-out samples of several of these names.
    if not re.search(rf"^{re.escape(name)}\s*=", txt, flags=re.MULTILINE):
        txt = txt.rstrip("\n") + "\n" + line + "\n"
p.write_text(txt, encoding="utf-8")
print("Updated settings.py with essentials (idempotent).")
PY


## 9) Scrapy Shell (validate selectors fast)

In [None]:

%%bash
# Usually run in a terminal:
# scrapy shell https://books.toscrape.com/
#
# Then, at the shell's Python prompt:
# response.css("article.product_pod h3 a::attr(title)").getall()
# response.css("li.next a::attr(href)").get()
# response.xpath("//p[@class='price_color']/text()").get()


## 10) Programmatic Run (template)

In [None]:

# Run Scrapy without CLI — template (copy to a .py script inside project):
# ---------------------------------------------------------
# from scrapy.crawler import CrawlerProcess
# from scrapy.utils.project import get_project_settings
# from myproject.spiders.pagination_spider import PaginationSpider
#
# process = CrawlerProcess(get_project_settings())
# process.crawl(PaginationSpider)
# process.start()
# ---------------------------------------------------------


## 11) Checklist to Adapt to Any Site


1. Identify target pages and **start URLs**.  
2. Use **Scrapy Shell** to perfect selectors.  
3. Build a **minimal spider** that yields a few fields.  
4. Add **pagination** (follow 'next').  
5. Add optional **detail page** parsing for extra fields.  
6. Enable **FEEDS** and verify output.  
7. Add **pipelines** for cleaning/validation.  
8. Configure **politeness** and **retries** in settings.
