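# Scraping Static Pages

Using the Douban Movies site as an example, we fetch pages with requests, parse them with pyquery and re, and store each movie's basic information in MongoDB.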

## Import the required packages

In [1]:
import requests
import logging
import re
import pymongo
from pyquery import PyQuery as pq

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')
url = 'https://movie.douban.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.115'}

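## Define a basic function for fetching HTML

When scraping the Douban home page, the request is blocked and returns no data unless a request header (here, a User-Agent) is sent.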

In [2]:
def get_html(url, headers):
    """Fetch a page and return its HTML text, or None if the request fails."""
    logging.info('scraping %s...', url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        logging.error('get invalid status code %s while scraping %s', response.status_code, url)
    except requests.RequestException:
        logging.error('error occurred while scraping %s', url, exc_info=True)
    # Reached on a non-200 status or a request error.
    return None
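
As a quick offline check of the header note above, the sketch below builds request objects without sending them. It uses `urllib.request.Request` from the standard library rather than requests, purely so that nothing is fetched. `demo_url` and `demo_headers` are illustrative names, and the shortened User-Agent string is a placeholder, not the one used in this notebook.

```python
from urllib.request import Request

# Illustrative values; building a Request object does not open a connection.
demo_url = 'https://movie.douban.com'
demo_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

bare = Request(demo_url)
with_headers = Request(demo_url, headers=demo_headers)

# urllib stores header names capitalized as 'User-agent'.
print(bare.has_header('User-agent'))
print(with_headers.get_header('User-agent'))
```

Without an explicit header the request object carries no User-Agent of its own, so the HTTP client falls back to a default that identifies it as a script. That default is the likely reason the site rejects the request.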

## Define a function that collects all detail-page URLs

In [9]:
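def get_all_urls():
    """Return the unique detail-page URLs found in the home page's .ui-slide-item elements."""
    doc = pq(url, headers=headers)
    slide_items = doc('.ui-slide-item')
    # Pull every href out of the slider markup.
    hrefs = re.findall('href="(.*?)"', str(slide_items), re.S)
    detail_urls = []
    for href in hrefs:
        # Keep links marked from=showing, skip gallery links, and drop duplicates.
        if 'gallery' not in href and 'from=showing' in href and href not in detail_urls:
            detail_urls.append(href)
    return detail_urls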
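
The filtering step can be tried without touching the network. The sketch below applies the same three conditions to a small hand-written HTML fragment. `SAMPLE_HTML` and `filter_detail_urls` are hypothetical names, and the gallery and ticket links are made up for illustration; only the two `subject` URLs come from the log output in this notebook.

```python
import re

# Hypothetical fragment standing in for the '.ui-slide-item' markup.
SAMPLE_HTML = '''
<li class="ui-slide-item">
  <a href="https://movie.douban.com/subject/30444960/?from=showing">poster</a>
  <a href="https://movie.douban.com/subject/30444960/?from=showing">title</a>
  <a href="https://movie.douban.com/gallery/topic/1/?from=showing">gallery</a>
  <a href="https://movie.douban.com/ticket/redirect/?movie_id=30444960">ticket</a>
</li>
<li class="ui-slide-item">
  <a href="https://movie.douban.com/subject/26754233/?from=showing">poster</a>
</li>
'''


def filter_detail_urls(html):
    """Return unique from=showing links (excluding gallery links) in first-seen order."""
    hrefs = re.findall(r'href="(.*?)"', html, re.S)
    urls = []
    for href in hrefs:
        if 'gallery' not in href and 'from=showing' in href and href not in urls:
            urls.append(href)
    return urls


detail_urls = filter_detail_urls(SAMPLE_HTML)
print(detail_urls)
```

The list-membership check keeps the first-seen order of the links, which a plain `set` would not guarantee.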

## Define a function that parses the detail information

In [3]:
def parse_info(html):
    """Parse a detail page and return the movie's basic information."""
    doc = pq(html)

    def first_match(pattern, text):
        # Return the first captured group, or None when the pattern is absent.
        match = re.search(pattern, text, re.S)
        return match.group(1) if match else None

    cover = doc('#mainpic').children('a').children().attr('src')
    name = first_match('<span property="v:itemreviewed">(.*?)</span>', str(doc('h1')))
    info = str(doc('#info'))
    director = first_match('rel="v:directedBy">(.*?)</a>', info)
    # The runtime is extracted here but is not part of the returned record.
    runtime = first_match('<span property="v:runtime" content="(.*?)">', info)
    # '上映日期' is the page's "release date" label and must stay in the pattern.
    publish = first_match('上映日期:</span> <span property="v:initialReleaseDate" content="(.*?)">', info)
    categories = re.findall('<span property="v:genre">(.*?)</span>', info, re.S)
    # findall returns a list, so the score is stored as e.g. ['7.9'].
    score = re.findall(' <strong class="ll rating_num" property="v:average">(.*?)<', str(doc('#interest_sectl')), re.S)
    return {
        'cover': cover,
        'name': name,
        'categories': categories,
        'published_at': publish,
        'director': director,
        'score': score
    }
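
The regular expressions used in `parse_info` can be exercised on a small fragment before running them against live pages. The sketch below is a minimal illustration under assumptions: `SAMPLE_INFO` is a hand-written, heavily simplified stand-in for the `#info` block, the director and genre values are placeholders, and `first_match` is defined inline so the block runs on its own.

```python
import re

# Hypothetical, simplified stand-in for the '#info' block of a detail page.
SAMPLE_INFO = (
    '<div id="info">'
    '<a href="/celebrity/0/" rel="v:directedBy">Some Director</a>'
    '<span property="v:genre">Drama</span> / <span property="v:genre">Action</span>'
    '<span property="v:runtime" content="150">150 min</span>'
    '</div>'
)


def first_match(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


director = first_match(r'rel="v:directedBy">(.*?)</a>', SAMPLE_INFO)
runtime = first_match(r'<span property="v:runtime" content="(.*?)">', SAMPLE_INFO)
genres = re.findall(r'<span property="v:genre">(.*?)</span>', SAMPLE_INFO, re.S)
# This fragment has no rating, so the lookup yields None instead of raising.
rating = first_match(r'property="v:average">(.*?)<', SAMPLE_INFO)

print(director, runtime, genres, rating)
```

Reading the capture group with `group(1)` avoids stripping the surrounding tags with a second `re.sub` call, and the `None` fallback keeps one missing field from aborting the whole page.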

## Chain the functions together

In [10]:
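def main():
    detail_urls = get_all_urls()
    for detail_url in detail_urls:
        html = get_html(detail_url, headers)
        if html is None:
            # get_html already logged the failure, so skip this page.
            continue
        data = parse_info(html)
        logging.info('get detail data %s', data)

if __name__ == '__main__':
    main()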

2020-09-06 15:55:10,769 - INFO: scraping https://movie.douban.com/subject/30444960/?from=showing...
2020-09-06 15:55:11,688 - INFO: get detail data {'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2618403186.jpg', 'name': '信条 Tenet', 'categories': ['剧情', '动作', '科幻'], 'published_at': '2020-09-04(中国大陆)', 'director': '克里斯托弗·诺兰', 'score': ['7.9']}
2020-09-06 15:55:11,689 - INFO: scraping https://movie.douban.com/subject/26754233/?from=showing...
2020-09-06 15:55:12,410 - INFO: get detail data {'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg', 'name': '八佰', 'categories': ['剧情', '历史', '战争'], 'published_at': '2020-08-21(中国大陆)', 'director': '管虎', 'score': ['7.7']}
2020-09-06 15:55:12,411 - INFO: scraping https://movie.douban.com/subject/27126336/?from=showing...
2020-09-06 15:55:13,159 - INFO: get detail data {'cover': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2616924633.jpg', 'name': '假面饭店 マスカレード・ホテル', 'categories':

## Store the data in MongoDB

### Basic parameters for connecting to MongoDB

In [11]:
MONGO_CONNECTION_STRING = 'mongodb://localhost:27017'
MONGO_DB_NAME = 'movies'
MONGO_COLLECTION_NAME = 'movies'

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]

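Here we declare a few variables:

- MONGO_CONNECTION_STRING: the MongoDB connection string, which holds the basic connection information such as host and port, and can also carry a username and password.
- MONGO_DB_NAME: the name of the MongoDB database.
- MONGO_COLLECTION_NAME: the name of the MongoDB collection.

We use MongoClient to create a connection object, then select the database and the collection used for storage in turn.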
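
As noted above, the connection string can also carry a username and password. The sketch below shows one way such a string might be assembled. `build_mongo_uri` and the sample credentials are hypothetical. The credentials are percent-escaped with `urllib.parse.quote_plus` so that characters such as `@` or `/` do not break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, username=None, password=None):
    """Build a MongoDB connection string, percent-escaping any credentials."""
    if username is None or password is None:
        return f'mongodb://{host}:{port}'
    return f'mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}'


plain_uri = build_mongo_uri('localhost', 27017)
auth_uri = build_mongo_uri('localhost', 27017, 'scraper', 'p@ss/word')
print(plain_uri)
print(auth_uri)
```

Either resulting string could then be passed to `pymongo.MongoClient` in place of the literal used in this notebook.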

Next, we implement a function that saves the data to MongoDB:

### Define the storage function

In [12]:
def save_data(data):
    """Update the document with the same name, or insert it if none exists (upsert)."""
    collection.update_one({
        'name': data.get('name')
    }, {
        '$set': data
    }, upsert=True)
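
To make the effect of `upsert=True` concrete without a running database, the sketch below emulates the same "update if present, insert if absent" behavior on a plain dictionary keyed by movie name. This is a pure-Python stand-in, not pymongo. `upsert_by_name`, `store`, and the sample records are hypothetical.

```python
def upsert_by_name(store, data):
    """Emulate update_one({'name': ...}, {'$set': data}, upsert=True) on a dict."""
    key = data.get('name')
    existing = store.setdefault(key, {})  # create an empty record if the name is new
    existing.update(data)                 # like '$set': overwrite only the given fields


store = {}
upsert_by_name(store, {'name': 'Movie A', 'score': ['7.9']})
upsert_by_name(store, {'name': 'Movie A', 'score': ['8.0']})  # same name: updated in place
upsert_by_name(store, {'name': 'Movie B', 'score': ['7.7']})

print(len(store), store['Movie A']['score'])
```

Running the scraper twice therefore does not create duplicate records. One consequence of matching on `name` is that two different movies sharing a title would overwrite each other, so a more unique key may be preferable if that matters.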


### Rewrite the main function

In [14]:
def main():
    detail_urls = get_all_urls()
    for detail_url in detail_urls:
        html = get_html(detail_url, headers)
        if html is None:
            # get_html already logged the failure, so skip this page.
            continue
        data = parse_info(html)
        logging.info('get detail data %s', data)
        logging.info('saving data to mongodb')
        save_data(data)
        logging.info('data saved successfully')

if __name__ == '__main__':
    main()

2020-09-06 15:59:06,818 - INFO: scraping https://movie.douban.com/subject/30444960/?from=showing...
2020-09-06 15:59:07,820 - INFO: get detail data {'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2618403186.jpg', 'name': '信条 Tenet', 'categories': ['剧情', '动作', '科幻'], 'published_at': '2020-09-04(中国大陆)', 'director': '克里斯托弗·诺兰', 'score': ['7.9']}
2020-09-06 15:59:07,821 - INFO: saving data to mongodb
2020-09-06 15:59:07,822 - INFO: data saved successfully
2020-09-06 15:59:07,822 - INFO: scraping https://movie.douban.com/subject/26754233/?from=showing...
2020-09-06 15:59:08,894 - INFO: get detail data {'cover': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2615992304.jpg', 'name': '八佰', 'categories': ['剧情', '历史', '战争'], 'published_at': '2020-08-21(中国大陆)', 'director': '管虎', 'score': ['7.7']}
2020-09-06 15:59:08,894 - INFO: saving data to mongodb
2020-09-06 15:59:08,896 - INFO: data saved successfully
2020-09-06 15:59:08,896 - INFO: scraping https://movie