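**Task description**: With the TV series 《司藤》 currently a hit, 阿普闪购 has decided to run a flash sale of merchandise built around well-reviewed domestic (Chinese) TV series. To maximize the campaign's chances of success, we need to analyze the titles, casts, and ratings of existing TV series in order to predict how a given series will be rated. Before any prediction or analysis can happen, we first need to collect data on current domestic TV series; in other words, we need to build a dataset of domestic TV series and their ratings. As an intern in the data analysis department, you are not yet ready to take on the analysis and modeling itself, so the job of building the dataset naturally falls to you.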

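**Task specification**: Collect data on domestic TV series, as complete as possible, with at least three fields: rating, title, and lead actors. Then store the data in a CSV file with the following header:

- title: the name of the TV series;
- rating: the rating of the TV series;
- stars: the lead actors of the TV series.

The CSV file is named `tv_rating.csv`.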
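
As a quick illustration of the target layout, here is a minimal sketch that writes a file with this header using Python's `csv.DictWriter`. The helper name `write_rows`, the file name `sample_tv_rating.csv`, and the single row are placeholders made up for this example; they are not real data.

```python
import csv

FIELDNAMES = ["title", "rating", "stars"]


def write_rows(path, rows):
    """Write rows (a list of dicts) to path under the title/rating/stars header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder row for illustration only; not real data.
write_rows(
    "sample_tv_rating.csv",
    [{"title": "Example Title", "rating": "0.0", "stars": "Actor A,Actor B"}],
)
```

Because the `stars` value contains a comma, the writer quotes that field automatically, so the file still has exactly three columns when read back.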


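**Task analysis**: Two candidate sources were shortlisted:

- Douban TV home page: <https://movie.douban.com/tv/#!type=tv&tag=%E5%9B%BD%E4%BA%A7%E5%89%A7&sort=recommend&page_limit=20&page_start=0>
- 看剧网: <https://www.kanjugo.com/list/?2-1.html>

Both sites have a section for domestic TV series. However, the second and later pages of the Douban TV home page cannot be downloaded through a URL, because "load more" content is generated dynamically inside the first page, and dynamically generated content like this is hard to obtain from Python. Both problems can be worked around with some tricks, but that adds a lot of extra work, so we will not scrape Douban for now. 看剧网 happens to meet our needs, so we choose it as the data source.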

The code below downloads the first listing page:

In [None]:
import urllib3
import random

# Some sites block crawlers, so we send a browser-like User-Agent. See: https://stackoverflow.com/questions/38785877/spoofing-ip-address-when-web-scraping-python/56654164#56654164
user_agent_list = (
    #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer (MSIE / Trident)
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
)


def download_content(url):
    """Download url with a randomly chosen User-Agent and return the body as text."""
    http = urllib3.PoolManager()
    user_agent = random.choice(user_agent_list)
    headers = {'User-Agent': user_agent, "Accept-Language": "en-US, en;q=0.5"}
    response = http.request("GET", url, headers=headers)
    # bytes.decode() defaults to UTF-8; this assumes the page is UTF-8 encoded.
    html_content = response.data.decode()
    return html_content


def save_to_file(filename, content):
    """Write content to filename as UTF-8 text."""
    # The with block closes the file even if the write fails.
    with open(filename, "w", encoding="utf-8") as fo:
        fo.write(content)


# Download the first page
kj1_content = download_content("https://www.kanjugo.com/list/?13.html")
save_to_file("kanju-1.html", kj1_content)

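The URL of the first page follows a different pattern from the later pages, so the first page is downloaded separately and the rest are downloaded in a loop: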
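
The two URL shapes can also be captured in one small helper. The sketch below is only an illustration: the function name `page_url` and the constant `LIST_BASE` are hypothetical, and the pattern is the one used by the download code in this notebook (`?13.html` for page 1, `?13-N.html` for page N).

```python
# Hypothetical helper; the URL pattern is taken from the download code in this notebook.
LIST_BASE = "https://www.kanjugo.com/list/?13"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        # The first page has no page-number suffix.
        return LIST_BASE + ".html"
    return LIST_BASE + "-" + str(page) + ".html"


for n in (1, 2, 100):
    print(n, page_url(n))
```

With a helper like this, a single `for i in range(1, 101)` loop could cover every page, at the cost of one extra function.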

In [10]:
import time

for i in range(2, 101):
    url = "https://www.kanjugo.com/list/?13-" + str(i) + ".html"
    print("begin downloading:", url)
    content = download_content(url)
    filename = "kanju-" + str(i) + ".html"
    save_to_file(filename, content)
    print("downloaded:", filename)
    # Pause between requests so we do not send many downloads in a short time and waste the site's bandwidth.
    time.sleep(1)

begin downloading: https://www.kanjugo.com/list/?13-2.html
downloaded: kanju-2.html
begin downloading: https://www.kanjugo.com/list/?13-3.html
downloaded: kanju-3.html
begin downloading: https://www.kanjugo.com/list/?13-4.html
downloaded: kanju-4.html
begin downloading: https://www.kanjugo.com/list/?13-5.html
downloaded: kanju-5.html
begin downloading: https://www.kanjugo.com/list/?13-6.html
downloaded: kanju-6.html
begin downloading: https://www.kanjugo.com/list/?13-7.html
downloaded: kanju-7.html
begin downloading: https://www.kanjugo.com/list/?13-8.html
downloaded: kanju-8.html
begin downloading: https://www.kanjugo.com/list/?13-9.html
downloaded: kanju-9.html
begin downloading: https://www.kanjugo.com/list/?13-10.html
downloaded: kanju-10.html
begin downloading: https://www.kanjugo.com/list/?13-11.html
downloaded: kanju-11.html
begin downloading: https://www.kanjugo.com/list/?13-12.html
downloaded: kanju-12.html
begin downloading: https://www.kanjugo.com/list/?13-13.html
downloaded
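
If a long download loop like this one is interrupted, restarting from page 2 repeats requests the site has already served. A minimal sketch of one way to resume, assuming the `kanju-N.html` file naming used above: the hypothetical helper `pending_pages` lists the page numbers whose file is missing or empty, so only those need to be fetched again.

```python
import os


def pending_pages(directory, first, last):
    """Return page numbers in [first, last] with no non-empty kanju-N.html in directory."""
    pages = []
    for i in range(first, last + 1):
        path = os.path.join(directory, "kanju-" + str(i) + ".html")
        # A missing or zero-byte file counts as not yet downloaded.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            pages.append(i)
    return pages


# Pages 2..100 that still need downloading in the current directory.
print(pending_pages(".", 2, 100))
```

The download loop could then iterate over this list instead of `range(2, 101)`. Note that the check only detects missing or empty files; it does not verify that a saved page is complete or valid HTML.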

Next, we extract the data from the downloaded pages and save it as a CSV file.

In [16]:
from bs4 import BeautifulSoup
import csv


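def create_doc(doc_name):
    """Read a saved HTML file and parse it into a BeautifulSoup document."""
    with open(doc_name, "r", encoding="utf-8") as fo:
        doc_content = fo.read()
    # Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
    return BeautifulSoup(doc_content, "html.parser")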


def extract_content(doc, content_list):
    all_div = doc.find_all("div", class_="myui-vodlist__box")
    for div in all_div:
        title = div.find_all("a", class_="myui-vodlist__thumb")[0]["title"].strip()
        rating = div.find_all("span", class_="pic-tag pic-tag-top")[0].get_text().strip()
        rating = rating.replace("豆瓣", "")
        rating = rating.replace("站内", "")
        stars = div.find_all("p", class_="text text-overflow text-muted hidden-xs")[0].get_text().strip()
        stars = stars.replace("主演：", "")
        content_list.append({"title": title, "rating": rating, "stars": stars})
    return 0


kj_header = ["title", "rating", "stars"]
# The task specification asks for the output file to be named tv_rating.csv.
with open("tv_rating.csv", "w", encoding="utf-8", newline='') as kj_csv_file:
    kj_csv_doc = csv.DictWriter(kj_csv_file, kj_header)
    kj_csv_doc.writeheader()

    # Pages 1..100 were saved as kanju-1.html .. kanju-100.html.
    for i in range(1, 101):
        kj_list = []
        extract_content(create_doc("kanju-" + str(i) + ".html"), kj_list)
        kj_csv_doc.writerows(kj_list)
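
The `rating` column is stored as text once the `豆瓣` or `站内` prefix has been stripped, and nothing above checks that the remainder is a number. For later analysis it may help to convert it. The sketch below is a hedged example: `parse_rating` is a hypothetical helper, and the sample strings are made up to show the two prefixes handled by `extract_content`, not values taken from the site.

```python
def parse_rating(text):
    """Strip the source prefix from a rating label and return a float, or None if not numeric."""
    cleaned = text.replace("豆瓣", "").replace("站内", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        # Empty or non-numeric labels are reported as missing rather than raising.
        return None


# Made-up sample labels for illustration only.
for label in ("豆瓣8.5", "站内 7.0", ""):
    print(repr(label), parse_rating(label))
```

Returning `None` for unparseable values keeps such rows in the dataset, so whoever does the analysis can decide how to handle missing ratings.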