# Douban Top250 Books Crawler

## Code

import os
import re
import time
import random
from typing import List, Dict

import requests
from lxml import etree


BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25
TOTAL_PAGES = 10
IMG_DIR = "BookImages"
REQUEST_TIMEOUT = 10

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/143.0.0.0 Safari/537.36"
    )
}


def fetch_page_html(session: requests.Session, start: int) -> str:
    """
    Fetch one page of Douban Top250 HTML.

    :param session: shared requests.Session for connection reuse
    :param start: pagination offset (0, 25, 50, ...)
    :return: raw HTML text
    """
    url = f"{BASE_URL}?start={start}"
    response = session.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    return response.text


def download_image(session: requests.Session, img_url: str, filename: str) -> None:
    """
    Download and save a book cover image.

    :param session: requests.Session
    :param img_url: image URL
    :param filename: image filename without extension
    """
    save_path = os.path.join(IMG_DIR, f"{filename}.jpg")

    response = session.get(img_url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()

    with open(save_path, "wb") as f:
        f.write(response.content)

    print(f"[IMG] Saved: {save_path}")


def parse_books(html: str) -> List[Dict[str, str]]:
    """
    Parse book information from Douban Top250 HTML.

    :param html: raw HTML text
    :return: list of book info dictionaries
    """
    tree = etree.HTML(html)
    book_tables = tree.xpath("//div[@class='indent']/table")

    books = []

    for table in book_tables:
        info_td = table.xpath("./tr/td[2]")[0]

        # title
        title = info_td.xpath("./div[1]/a/text()")[0].strip()

        # quote (optional)
        quote_nodes = info_td.xpath("./p[@class='quote']/span/text()")
        quote = quote_nodes[0].strip() if quote_nodes else ""

        # publish info
        publish_info = info_td.xpath("./p[@class='pl']/text()")[0].strip()
        info_parts = [item.strip() for item in publish_info.split("/")]

        price = info_parts[-1]
        pub_date = info_parts[-2]
        publisher = info_parts[-3]
        authors = " | ".join(info_parts[:-3])

        # rating
        rating = info_td.xpath(
            ".//span[@class='rating_nums']/text()"
        )[0].strip()

        rating_text = info_td.xpath(
            ".//span[@class='pl']/text()"
        )[0]

        rating_count_match = re.search(r"\d+", rating_text)
        rating_count = rating_count_match.group() if rating_count_match else "0"

        # cover image
        img_url = table.xpath("./tr/td[1]/a/img/@src")[0]

        books.append({
            "title": title,
            "authors": authors,
            "publisher": publisher,
            "pub_date": pub_date,
            "price": price,
            "rating": rating,
            "rating_count": rating_count,
            "quote": quote,
            "img_url": img_url,
        })

    return books


def main() -> None:
    os.makedirs(IMG_DIR, exist_ok=True)

    session = requests.Session()
    book_index = 1

    for page in range(TOTAL_PAGES):
        start = page * PAGE_SIZE
        print(f"\n===== Page {page + 1} (start={start}) =====")

        html = fetch_page_html(session, start)
        books = parse_books(html)

        for book in books:
            print(
                f"\nNo.{book_index} 《{book['title']}》\n"
                f"Authors      : {book['authors']}\n"
                f"Publisher    : {book['publisher']}\n"
                f"Publish Date : {book['pub_date']}\n"
                f"Price        : {book['price']}\n"
                f"Rating       : {book['rating']}/10\n"
                f"Reviews      : {book['rating_count']}\n"
                f"Quote        : {book['quote']}"
            )

            safe_title = re.sub(r"[\\/:*?\"<>|]", "_", book["title"])
            image_name = f"{book_index}-{safe_title}"
            download_image(session, book["img_url"], image_name)

            book_index += 1

        # polite crawling
        time.sleep(random.uniform(4, 8))


if __name__ == "__main__":
    main()


===== Page 1 (start=0) =====

No.1 《红楼梦》
Authors      : [清] 曹雪芹 著
Publisher    : 人民文学出版社
Publish Date : 1996-12
Price        : 59.70元
Rating       : 9.6/10
Reviews      : 455966
Quote        : 都云作者痴，谁解其中味？
[IMG] Saved: BookImages/1-红楼梦.jpg

No.2 《活着》
Authors      : 余华
Publisher    : 作家出版社
Publish Date : 2012-8
Price        : 28.00元
Rating       : 9.4/10
Reviews      : 901138
Quote        : 生的苦难与伟大
[IMG] Saved: BookImages/2-活着.jpg

No.3 《哈利·波特》
Authors      : J.K.罗琳 (J.K.Rowling) | 苏农
Publisher    : 人民文学出版社
Publish Date : 2008-12-1
Price        : 498.00元
Rating       : 9.7/10
Reviews      : 130239
Quote        : 从9¾站台开始的旅程
[IMG] Saved: BookImages/3-哈利·波特.jpg

No.4 《1984》
Authors      : [英] 乔治·奥威尔 | 刘绍铭
Publisher    : 北京十月文艺出版社
Publish Date : 2010-4-1
Price        : 28.00
Rating       : 9.4/10
Reviews      : 306821
Quote        : 栗树荫下，我出卖你，你出卖我
[IMG] Saved: BookImages/4-1984.jpg

No.5 《三体全集》
Authors      : 刘慈欣
Publisher    : 重庆出版社
Publish Date : 2012-1
Price        : 168.00元
Rating     

## Doc: A Beginner’s Tutorial: Crawling Douban Top 250 Books with Python

This program is a **web crawler (spider)** that:

1. Visits Douban’s *Top 250 Books* pages
2. Extracts book information (title, author, rating, etc.)
3. Downloads each book’s cover image
4. Prints the results neatly in the terminal

You’ll learn **real-world Python skills** such as:

* Using third-party libraries
* Working with functions
* Parsing HTML
* Handling files and folders
* Writing clean, reusable code

---

### 1. What You Need Before Running This Code

#### 1.1 Python version

Python **3.8+** is recommended.

#### 1.2 Required libraries

This code uses two third-party libraries:

```bash
pip install requests lxml
```

Everything else (`os`, `re`, `time`, `random`, `typing`) is part of Python's standard library.

---

### 2. Importing Modules (The Toolbox)

```python
import os
import re
import time
import random
from typing import List, Dict

import requests
from lxml import etree
```

#### Why we need them

| Module       | Purpose                                      |
| ------------ | -------------------------------------------- |
| `os`         | File and directory operations                |
| `re`         | Regular expressions (extract numbers safely) |
| `time`       | Delay execution                              |
| `random`     | Random sleep time (polite crawling)          |
| `typing`     | Type hints (for readability)                 |
| `requests`   | Send HTTP requests                           |
| `lxml.etree` | Parse HTML using XPath                       |

👉 **Tip for beginners**:
Think of `import` as *borrowing tools* before you start working.

---

### 3. Global Configuration (The Program's "Settings Area")

```python
BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25
TOTAL_PAGES = 10
IMG_DIR = "BookImages"
REQUEST_TIMEOUT = 10
```

These are **constants** that control how the crawler behaves.

* Each page has **25 books**
* 10 pages × 25 = **250 books**
* Images will be saved into `BookImages/`

#### Why define constants at the top?

* Easy to modify later
* Avoid magic numbers scattered everywhere
* More readable and professional

---

### 4. HTTP Headers: Pretending to Be a Browser

```python
HEADERS = {
    "User-Agent": "Mozilla/5.0 ..."
}
```

Websites may block requests that don’t look like a real browser.

This `User-Agent` tells Douban:

> “I am a normal Chrome browser, not a bot.”

Without this header, `requests` identifies itself as `python-requests/<version>`, which many sites reject outright.

---

### 5. Function 1: Fetching HTML Pages

```python
def fetch_page_html(session: requests.Session, start: int) -> str:
```

#### What this function does

* Builds a page URL
* Sends a GET request
* Returns the HTML source code as a string

```python
url = f"{BASE_URL}?start={start}"
response = session.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
return response.text
```

#### Key concepts for beginners

* **Function**: reusable block of code
* **f-string**: `f"{BASE_URL}?start={start}"`
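* **Session**: reuses the underlying TCP connection across requests, which is faster than opening a new one each time
* **raise_for_status()**: raises `requests.HTTPError` if the server answers with a 4xx or 5xx status code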
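
Because `raise_for_status()` raises on failure, a single bad response stops the whole crawl. One possible way to soften that is a small retry wrapper. The sketch below is only an illustration: `with_retries` is a hypothetical helper that is not part of the original program, and it catches every `Exception` so that it runs without the network. In the crawler you would likely narrow that to `requests.RequestException`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call func() until it succeeds or the attempts run out (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # assumption: narrow to requests.RequestException in real use
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

In `main()`, it could be used as `html = with_retries(lambda: fetch_page_html(session, start))`.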

---

### 6. Function 2: Downloading Images

```python
def download_image(session: requests.Session, img_url: str, filename: str) -> None:
```

#### What it does

1. Sends a request to the image URL
2. Saves the binary response body to disk as a `.jpg` file inside `IMG_DIR`

```python
with open(save_path, "wb") as f:
    f.write(response.content)
```

#### Important beginner knowledge

* `"wb"` = **write binary**
* Images are not text → must use binary mode
* `with open(...)` automatically closes the file
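
The program always appends `.jpg`, which is fine as long as every cover really is a JPEG. If that assumption ever fails, the extension could be read from the URL instead. This is a minimal sketch: `cover_extension` is a hypothetical helper, and the `example.com` URLs in the usage note are placeholders rather than real Douban image links.

```python
import os
from urllib.parse import urlparse


def cover_extension(img_url: str, default: str = ".jpg") -> str:
    """Return the lowercase file extension found in the URL path, or a default."""
    path = urlparse(img_url).path          # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default
```

For example, `cover_extension("https://example.com/a/b.PNG?size=l")` gives `".png"`, while a URL with no extension falls back to `".jpg"`.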

---

### 7. Function 3: Parsing HTML (The Core Difficulty)

```python
def parse_books(html: str) -> List[Dict[str, str]]:
```

This is the **heart of the crawler**.

#### Step 1: Convert HTML into a tree

```python
tree = etree.HTML(html)
```

Think of HTML as a **tree structure**:

* nodes
* children
* attributes

---

#### Step 2: Find all book blocks

```python
book_tables = tree.xpath("//div[@class='indent']/table")
```

This XPath means:

> “Find all `<table>` elements inside `<div class="indent">`”

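XPath is a query language for selecting nodes in XML and HTML documents. It plays a role similar to **CSS selectors**, and `lxml` lets you run XPath queries from Python.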

---

#### Step 3: Extract data from each book

Example: extracting title

```python
title = info_td.xpath("./div[1]/a/text()")[0].strip()
```

* For node and `text()` queries like this one, `xpath()` returns a **list** (possibly empty)
* `[0]` gets the first element
* `.strip()` removes spaces/newlines
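
To see these return values in isolation, the following sketch runs three queries against a tiny hand-written HTML snippet. The markup is an assumption made for illustration and is far simpler than the real Douban page.

```python
from lxml import etree

# Hand-written sample markup, not real Douban HTML.
SAMPLE_HTML = """
<div class="indent">
  <table><tr><td>
    <div><a href="#"> 红楼梦 </a></div>
  </td></tr></table>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# A text() query returns a list of strings, whitespace included.
titles = tree.xpath("//div[@class='indent']/table//a/text()")
clean_title = titles[0].strip()

# A query that matches nothing returns an empty list, not None.
missing = tree.xpath("//p[@class='quote']/span/text()")

# XPath functions such as count() return a number instead of a list.
table_count = tree.xpath("count(//table)")
```

Indexing `missing[0]` here would raise `IndexError`, which is why the crawler checks optional fields before indexing.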

---

#### Step 4: Handling optional data safely

```python
quote_nodes = info_td.xpath("./p[@class='quote']/span/text()")
quote = quote_nodes[0].strip() if quote_nodes else ""
```

Some books **do not have quotes**.

This avoids:

```text
IndexError: list index out of range
```

👉 This is **defensive programming**, very important.
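
The same check-then-index pattern appears several times in `parse_books`, so it could be factored into a small helper. This is a sketch only: `first_text` is a hypothetical name and is not used by the original code.

```python
from typing import List


def first_text(nodes: List[str], default: str = "") -> str:
    """Return the first XPath text result, stripped, or a default if the list is empty."""
    return nodes[0].strip() if nodes else default
```

With it, the quote line would become `quote = first_text(info_td.xpath("./p[@class='quote']/span/text()"))`.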

---

#### Step 5: Splitting publish information

```python
info_parts = [item.strip() for item in publish_info.split("/")]
```

Then we assign:

```python
price = info_parts[-1]
pub_date = info_parts[-2]
publisher = info_parts[-3]
authors = " | ".join(info_parts[:-3])
```

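The fields are read from the end because the number of authors and translators varies from book to book, while price, publish date, and publisher are expected to be the last three items. An entry with fewer than three fields would raise `IndexError`.

This shows:

* List slicing
* Negative indexing
* Joining strings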
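
As a worked example, the sketch below applies the same split to one line of text. The input string is an assumption reconstructed from the sample output for *1984*, not copied from the page, and `split_publish_info` is a hypothetical wrapper that adds a length check the original code does not have.

```python
from typing import Dict


def split_publish_info(publish_info: str) -> Dict[str, str]:
    """Split 'author / ... / publisher / date / price' into named fields."""
    parts = [item.strip() for item in publish_info.split("/")]
    if len(parts) < 3:
        # Guard added for illustration; the original indexes without checking.
        raise ValueError(f"Expected at least 3 fields, got {len(parts)}: {publish_info!r}")
    return {
        "authors": " | ".join(parts[:-3]),  # everything before the last three items
        "publisher": parts[-3],
        "pub_date": parts[-2],
        "price": parts[-1],
    }


info = split_publish_info("[英] 乔治·奥威尔 / 刘绍铭 / 北京十月文艺出版社 / 2010-4-1 / 28.00")
```

Here `info["authors"]` joins the author and the translator with `" | "`, matching the printed output for that book.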

---

#### Step 6: Extract numbers with regex

```python
rating_count_match = re.search(r"\d+", rating_text)
```

* `\d+` means “one or more digits”
* `re.search` returns the first match, or `None` if the text contains no digits, which is why the code falls back to `"0"`
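
A short sketch of that behavior, using made-up input strings. The exact text Douban places around the number is an assumption here, and `extract_count` is a hypothetical name for the two lines used in `parse_books`.

```python
import re


def extract_count(rating_text: str) -> str:
    """Return the first run of digits in the text, or "0" if there is none."""
    match = re.search(r"\d+", rating_text)
    return match.group() if match else "0"


with_number = extract_count("(\n    455966人评价\n)")
without_number = extract_count("no digits here")
```

Note that only the first run of digits is returned, so `extract_count("12 and 34")` yields `"12"`.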

---

#### Step 7: Store data as dictionaries

```python
books.append({
    "title": title,
    "authors": authors,
    ...
})
```

Each book = **dictionary**
All books = **list of dictionaries**

This structure is:

* Easy to print
* Easy to save to JSON / database later
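
For instance, a list of dictionaries can be serialized with the standard `json` module. This is a minimal sketch: `books_to_json` is a hypothetical helper, and the sample record contains only a few of the real fields.

```python
import json
from typing import Dict, List


def books_to_json(books: List[Dict[str, str]]) -> str:
    """Serialize the book list; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(books, ensure_ascii=False, indent=2)


sample = [{"title": "活着", "authors": "余华", "rating": "9.4"}]
text = books_to_json(sample)
```

To write the result to disk, open the file with an explicit encoding, for example `open("books.json", "w", encoding="utf-8")`, so the output does not depend on the platform default.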

---

### 8. The `main()` Function: Program Entry

```python
def main() -> None:
```

#### Step 1: Create image folder

```python
os.makedirs(IMG_DIR, exist_ok=True)
```

* Creates folder if it doesn’t exist
* Safe to run multiple times

---

#### Step 2: Loop through pages

```python
for page in range(TOTAL_PAGES):
    start = page * PAGE_SIZE
```

This generates:

* 0
* 25
* 50
* …
* 225
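
The full sequence of offsets and the URLs they produce can be checked without sending any requests. The constants are repeated here only so the snippet runs on its own.

```python
BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25
TOTAL_PAGES = 10

# One offset per page: 0, 25, 50, ... up to (TOTAL_PAGES - 1) * PAGE_SIZE.
offsets = [page * PAGE_SIZE for page in range(TOTAL_PAGES)]

# The same URL pattern that fetch_page_html builds.
urls = [f"{BASE_URL}?start={start}" for start in offsets]
```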

---

#### Step 3: Print book info

Formatted output improves readability:

```python
print(f"Rating : {book['rating']}/10")
```

This uses **f-strings**, which embed expressions directly inside the string and are available from Python 3.6 onward.

---

#### Step 4: Safe filenames

```python
safe_title = re.sub(r"[\\/:*?\"<>|]", "_", book["title"])
```

The characters `\ / : * ? " < > |` are **not allowed in Windows file names**, and `/` is a path separator on every major platform.
We replace each of them with `_`.
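
The substitution can be tried on its own. `safe_filename` is a hypothetical wrapper around the same regular expression, and the first test title is invented for illustration.

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters that are invalid in Windows file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', "_", title)


cleaned = safe_filename("A/B: C?")
unchanged = safe_filename("哈利·波特")
```

Each offending character is replaced one for one, so the length of the title does not change, and titles without such characters pass through untouched.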

---

#### Step 5: Polite crawling

```python
time.sleep(random.uniform(4, 8))
```

This:

* Lowers the risk of being rate-limited or blocked
* Waits a random 4 to 8 seconds between pages, so requests do not arrive at a fixed rhythm
* Is considered good crawler etiquette

---

### 9. Why `if __name__ == "__main__":`

```python
if __name__ == "__main__":
    main()
```

This means:

* The code runs **only when executed directly**
* Not when imported as a module

This is **standard Python practice**.

---

### 10. What You Should Learn From This Project

After understanding this code, you should be able to:

✅ Read and understand real Python projects
✅ Write clean functions
✅ Use requests + XPath
✅ Parse real websites
✅ Handle files safely
✅ Write polite, professional crawlers

---

### 11. Suggested Exercises for Beginners

1. Save book data to a **JSON file**
2. Save data to **CSV**
3. Add **retry mechanism**
4. Add **logging instead of print**
5. Crawl **another Douban category**