<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Introduction to Data Science - The Python Way </h1>

<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Lesson 5: Data Collection - Python Web Scraping in Practice I </h1>

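# Crawler Overview
Before reading this example, it helps to know what a web crawler is and to have a basic grasp of URLs, crawling techniques, and HTML pages.

This notebook depends on Beautiful Soup, a Python library for parsing HTML. Install it with ```pip install beautifulsoup4``` or ```pip3 install beautifulsoup4```.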
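
After installing, one quick way to confirm that the package can be found is to ask Python's import system for it. This is only a small sketch; the helper name `is_installed` is hypothetical and not part of any library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python's import system can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Beautiful Soup is imported under the name "bs4"
print("bs4 available:", is_installed("bs4"))
```

If this prints `False`, the install most likely went into a different Python environment than the one running the notebook.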

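# Defining the Crawler's Task

## Syntax Involved
The code uses classes (object-oriented programming), lists, dictionaries, loops, functions, string operations, and file I/O.

## Overview
This crawler fetches the first two pages of the site, starting at http://quotes.toscrape.com/page/1/, extracts the text, author, and tags of each quote, and saves the results to a file in JSON format.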
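
To make the target format concrete, here is a rough sketch of a single record with the three extracted fields, serialized onto one line the way the crawler in this notebook writes it. The record is one of the quotes that appears in the notebook's output; the variable names are illustrative.

```python
import json

# One record with the three fields the crawler extracts
item = {
    "author": "J.K. Rowling",
    "text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "tags": ["abilities", "choices"],
}

line = json.dumps(item)      # one JSON object serialized as a single line of text
restored = json.loads(line)  # parsing that line gives back an equal dictionary

print(line)
print(restored == item)
```

Writing one object per line means the file can later be read back line by line, without loading everything at once.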


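## How to Modify

To adapt the crawler, you only need to change two methods, `__init__` and `parse`. Put simply, `__init__` decides which pages are crawled, and `parse` specifies what is extracted from each page.
- `__init__`: sets the list of URLs to crawl and the output file path. `self.urls` is the list of URLs to crawl, and `self.file` is a file object.
- `parse`: parses each page after its URL has been fetched successfully.
   For details on parsing a specific page, consult the official Beautiful Soup documentation as needed: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/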
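
Before pointing a custom `parse` at a live site, it can help to try the extraction logic on a small HTML string. The sketch below assumes Beautiful Soup is installed. `SAMPLE_HTML` is hand-written markup that only imitates the structure the crawler expects (it is not a copy of the real page), and `parse_quotes` is a hypothetical stand-alone helper.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the expected structure (an assumption, not the real page)
SAMPLE_HTML = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="#">sample</a>
    <a class="tag" href="#">demo</a>
  </div>
</div>
"""


def parse_quotes(html):
    """Extract author, text, and tags from every div with class 'quote'."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for quote in soup.find_all("div", class_="quote"):
        items.append({
            "author": quote.find("small", class_="author").get_text(),
            "text": quote.find(class_="text").get_text(),
            # CSS selector: elements with class "tag" nested inside class "tags"
            "tags": [t.get_text() for t in quote.select(".tags .tag")],
        })
    return items


print(parse_quotes(SAMPLE_HTML))
```

To target a different site, the tag names and class names passed to `find_all`, `find`, and `select` are the parts that would change.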

In [4]:
from bs4 import BeautifulSoup
import urllib.request  # import the submodule explicitly; a bare "import urllib" may not load it
import time
import json
import os

class MySpider():
    
    name = "spider"
    
    
    
    def __init__(self):
        
        self.file = open('demo1_quotes_bs.json', 'w')
        
        # Build the list of URLs to crawl
        self.urls = []
        for i in range(1,3):
            self.urls.append('http://quotes.toscrape.com/page/' + str(i) + '/' )
            
#       Equivalent to the loop above:
#         self.urls = [
#             'http://quotes.toscrape.com/page/1/',
#             'http://quotes.toscrape.com/page/2/',
#         ]

        # Set the request headers (the standard HTTP header name is User-Agent)
        self.headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}

        
        print(self.urls)
        
    
    # Fetch the page's HTML with urllib and load it into BeautifulSoup
    def bs_request(self, url):
        request = urllib.request.Request(url, headers=self.headers)
        html = urllib.request.urlopen(request).read()
        response = BeautifulSoup(html,'html.parser')
        self.parse(url,response)
        
    # start method: call it to begin crawling
    def start(self):
        for url in self.urls:
            self.bs_request(url)
    

    # parse method: extracts data from the parsed HTML
    def parse(self, url, response):
        
        # Find all quote blocks on the page
        quotes = response.find_all("div", class_="quote")
        for quote in quotes:
            # Extract the author's name
            author = quote.find("small", class_="author").get_text()
            # Extract the text of the quote
            text = quote.find(class_="text").get_text()
            # Extract the quote's tags
            tags = [t.get_text() for t in quote.select(".tags .tag")]
            # Build a dictionary for this quote
            item = {"author":author, "text": text, "tags":tags }
            # Convert the dictionary to a JSON string
            line = json.dumps(item)
            # Write one record per line
            self.file.write(line + "\n")
        # Flush so the data reaches the file promptly instead of sitting in a buffer
        self.file.flush()
        os.fsync(self.file.fileno())
        # Print the URL that was just parsed; this is only a progress indicator
        print("over: " + url)
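
Constructing a `urllib.request.Request` does not contact the server; only `urlopen` does. That makes it possible to inspect offline what would be sent. The sketch below is a minimal check under that assumption, and the User-Agent string in it is made up for illustration.

```python
import urllib.request

url = "http://quotes.toscrape.com/page/1/"
# Made-up User-Agent value; the standard header name uses a hyphen, not an underscore
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-spider)"}

# No network traffic happens here; the object only describes the request
request = urllib.request.Request(url, headers=headers)

print(request.full_url)
print(request.get_method())
print(request.header_items())
```

Note that `Request` stores header names in capitalized form, so the stored key is `User-agent`.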

In [5]:
# Create the crawler object
spider = MySpider()

# Start crawling
spider.start()

['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']
over: http://quotes.toscrape.com/page/1/
over: http://quotes.toscrape.com/page/2/


In [6]:
print(spider)

<__main__.MySpider object at 0x103cf2710>


In [12]:
data_list = []  # list that holds the parsed JSON objects

with open("demo1_quotes_bs.json", "r") as json_file:
    for line in json_file:
        try:
            json_data = json.loads(line)
            data_list.append(json_data)
        except json.JSONDecodeError as e:
            print(f"JSON parse error: {e}")

# data_list now contains every JSON object from the file
for item in data_list:
    print(item)


{'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'tags': ['abilities', 'choices']}
{'author': 'Albert Einstein', 'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'author': 'Marilyn Monroe', 'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'tags': ['be-yourself', 'inspirational']}
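
Once the records are loaded, simple summaries are straightforward. As a rough sketch, the snippet below counts quotes per author. To stay self-contained it reads from an in-memory stand-in for the file, holding three of the records printed above with the `text` field left out for brevity. `count_by_author` is a hypothetical helper name.

```python
import io
import json
from collections import Counter


def count_by_author(lines):
    """Count records per author from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["author"]] += 1
    return counts


# In-memory stand-in for the output file (records trimmed to author and tags)
sample = io.StringIO(
    '{"author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}\n'
    '{"author": "J.K. Rowling", "tags": ["abilities", "choices"]}\n'
    '{"author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]}\n'
)

counts = count_by_author(sample)
print(counts.most_common())
```

With the real output, the same function can be given the open file directly, for example `with open("demo1_quotes_bs.json") as f: count_by_author(f)`.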
