# Basic Information


| Class  | Name | Student ID |
| :------- | ------ | ------ |
| SSSSSS | AA   | 2222 |
|        | BB   | 333  |

# Task: Measuring and Interpreting Network Centrality (TV Drama Lead Actors)

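1. Build the "group's favorite actors" network. Based on the group members' preferences, crawl the lead-actor ("主演") information of at least 200 Chinese TV dramas (including Hong Kong, Macau, and Taiwan) rated 7 or higher on Douban. Create an edge between every pair of actors who appear in the same drama. The edge weight is the number of collaborations, that is, the number of dramas in which both actors appear.

2. Build the "officially favored actors" network. Based on the winners' lists of China's three major TV awards (the Feitian Award, the Golden Eagle Award, and the Magnolia Award), crawl from Douban the lead-actor information of at least 200 dramas on those lists. Edges and weights are defined as in item 1.

3. Use Python to draw the two actor collaboration networks (weighted graphs).
4. For each network, compute four centrality measures for every node (degree, betweenness, closeness, and eigenvector centrality), sort the results, and output the top 30 actors and their ranks for each measure. Interpret what each centrality measure means in practice, and compare how the actors who appear in both networks score on each measure in the two networks. [Hint: for cross-network comparison, use the normalized XX centrality scores.]

5. For the five highest-scoring nodes under each centrality measure, write a loop that removes the current highest-scoring node on each iteration and prints the resulting node ranking. Observe how the ranking of nodes by XX centrality changes, and use this to deepen your understanding of how each centrality measure relates to network structure.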


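# Task Requirements

1. Every task requires submitting at least two files:
    1. The data
    2. The code and all of its output, submitted as a Jupyter notebook.

2. In the first part of the file, use a markdown cell to list the class and each group member's name and student ID.
3. In the second part of the file, use a markdown cell to reproduce the task description assigned by the instructor.
4. In the third part of the file, use a markdown cell to explain the approach behind the code (for example: 1. create a list/dict/tuple that stores the directed relations between node pairs; 2. use function XX to convert the list/dict/tuple into matrix a; 3. use function XX from package XX to compute the transpose of matrix a and store it as matrix b; 4. ...).
5. In the fourth part of the file, place the code, and add comments before and after each code block explaining what it is intended to do. [Note: if the instructor cannot follow part of your code because comments are missing, you may be asked to explain the code in person before a grade is assigned.]
6. Grading criteria: correct use of markdown (parts one and two), a clear and detailed approach (part three), code that runs (part four), and correct results with a correct interpretation (part four; where a task requires interpreting results, add the interpretation in a markdown cell after the code).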

# Approach



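### Data Collection

1. Use the Douban API to collect a list of at least 200 dramas tagged "国产剧" (mainland Chinese dramas) and "港剧" (Hong Kong dramas). The main fields collected are the drama ID and the rating; dramas rated below 7 are filtered out.
2. Manually collect the titles of past winners of the Magnolia Award, the Feitian Award, and the Golden Eagle Award from sites such as Baidu Baike.
3. Use Douban's movie search to look up the drama ID for each title collected in step 2.
4. Save both sets of drama information to a file for later use.
5. Crawl the lead actors of each drama by drama ID and save them:
    1. The mapping from actor ID to actor name.
    2. The mapping from drama ID to the list of actor IDs.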
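
Steps 1 and 3 above reduce to two small operations that can be checked offline: dropping records whose rating is missing or below 7, and pulling the numeric ID out of a subject URL. The sketch below uses made-up records; the helper names `keep_rated` and `subject_id` and the sample titles and IDs are illustrative assumptions, not part of the notebook.

```python
def keep_rated(records, min_rate=7.0):
    """Keep records whose 'rate' field parses as a number >= min_rate."""
    kept = []
    for rec in records:
        rate = rec.get("rate", "")
        if rate == "":
            # Unrated dramas come back with an empty rating string; skip them.
            continue
        if float(rate) >= min_rate:
            kept.append(rec)
    return kept


def subject_id(url):
    """Return the last path segment of a subject URL, ignoring a trailing slash."""
    return url.strip().strip("/").split("/")[-1]


# Hypothetical sample records shaped like the fields the notebook reads.
sample = [
    {"id": "1001", "title": "Drama A", "rate": "8.2"},
    {"id": "1002", "title": "Drama B", "rate": ""},
    {"id": "1003", "title": "Drama C", "rate": "6.9"},
    {"id": "1004", "title": "Drama D", "rate": "7.0"},
]
kept_ids = [rec["id"] for rec in keep_rated(sample)]
print(kept_ids)
print(subject_id("https://movie.douban.com/subject/35797771/"))
```

Keeping the threshold as a parameter makes it easy to confirm that a rating of exactly 7.0 is retained, which is what "not lower than 7" requires.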

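### Network Construction

1. Read the drama lookup file, using the label column to distinguish dramas favored by the group from dramas favored officially.
2. Read the actor lookup file.
3. Read the drama-to-actor mapping.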

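### Data Analysis

1. For each label (group favorites vs. official favorites):
    1. Count actor collaborations from the drama-to-actor mapping.
    2. Build the network graph with the networkx library: the `create_graph()` function.
    3. Visualize the graph: the `plot_graph()` function.
    4. Analyze the graph: the `analysis()` function.
        1. Degree centrality: `networkx.degree_centrality(graph)`
        2. Betweenness centrality: `networkx.betweenness_centrality(graph)`
        3. Closeness centrality: `networkx.closeness_centrality(graph)`
        4. Eigenvector centrality: `networkx.eigenvector_centrality(graph)`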
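
Task item 4 also asks for a comparison of the actors who appear in both networks. `networkx.degree_centrality` divides each degree by n - 1, so its scores are already on a common scale across networks of different sizes. The sketch below is one possible way to line up the shared nodes; the two toy graphs and the helper name `compare_shared` are assumptions made for illustration.

```python
import networkx as nx

# Toy stand-ins for the two actor networks (node names are made up).
G_group = nx.Graph()
G_group.add_weighted_edges_from(
    [("a", "b", 2), ("a", "c", 1), ("b", "c", 1), ("c", "d", 1)]
)
G_official = nx.Graph()
G_official.add_weighted_edges_from([("a", "c", 1), ("c", "e", 3), ("e", "f", 1)])


def compare_shared(G1, G2, centrality_fn):
    """Return {node: (score in G1, score in G2, difference)} for nodes in both graphs."""
    s1, s2 = centrality_fn(G1), centrality_fn(G2)
    shared = set(G1) & set(G2)
    return {n: (s1[n], s2[n], s1[n] - s2[n]) for n in sorted(shared)}


diff = compare_shared(G_group, G_official, nx.degree_centrality)
for node, (g, o, d) in diff.items():
    print(f"{node}\tgroup={g:.3f}\tofficial={o:.3f}\tdiff={d:+.3f}")
```

Passing the centrality function as an argument lets the same helper be reused with `nx.betweenness_centrality` or `nx.closeness_centrality`.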

### Visualization


In [84]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import quote, unquote


In [85]:
# param definitions
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Sec-Ch-Ua": '''"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"''',
    "Referer": "https://movie.douban.com/",
    "Sec-Fetch-Dest": "document",
    "cookie": """bid=WKVIFxu-_BI; ll="108169"; _ga=GA1.1.191779183.1701751376; _cc_id=23b54b9d3ef3a33a34eac66eed7b4a6a; panoramaId_expiry=1702776749037; panoramaId=f1310b00fced7a55f575e1c2e5e316d539385cb9e22ad6923b1d93788d069ada; panoramaIdType=panoIndiv; cto_bundle=BXoJSV91eU5RTFdCSmcxTFh6ZzJ4M3VUSFVkalFicWVrSGZKNkF0SEJKamlRc0lnY0huQzBRTEdRbmVkTm9sdVFIVThqMTVoanNrJTJCYzlRUUhLZEtSbFZNWVNSV1lJV3NHME9JUzVzQjJhWHhubldVbTQlMkJHcVVZeEluaHB1eFJSU2FXYWs1JTJGcWJWT0NNNjJnMTZCVVVXYmg3bERDOE5mVEN5MzN6ZTVEckt0N3Y1WGRQUmQ2JTJGa2xWMGU0b1V3ZDNzbHZ5OEpEQXpTVXZpaWJ5RThiMXZuQ1MybFElM0QlM0Q; _ga_YD7QXHZJ4Y=GS1.1.1702171947.1.1.1702172062.0.0.0; _ga_Y4GN1R87RG=GS1.1.1702171948.3.1.1702172062.0.0.0; _pk_id.100001.2939=23fb1116456f2945.1702181127.; __yadk_uid=IG3yWBrT25a5voDknmjWyvbI8e2gWUf3; __utmz=30149280.1702188613.6.3.utmcsr=movie.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/celebrity/1352362/; _pk_ref.100001.2939=%5B%22%22%2C%22%22%2C1702379326%2C%22https%3A%2F%2Fmovie.douban.com%2Fcelebrity%2F1352362%2F%22%5D; _pk_ses.100001.2939=1; ap_v=0,6.0; __utma=30149280.191779183.1701751376.1702188613.1702379330.7; __utmc=30149280; dbcl2="65813696:ATM4RixqcdE"; ck=Bp8_; push_noty_num=0; push_doumail_num=0; __utmt=1; __utmv=30149280.6581; __gads=ID=a9b321e888688167:T=1702171894:RT=1702380648:S=ALNI_MbJAOSJsVQBSYsoHKK1YgOhrt4-vg; __gpi=UID=00000ca895d8a964:T=1702171894:RT=1702380648:S=ALNI_MY587WTpStdnaM_AGYG1dCqa678-w; __utmb=30149280.9.10.1702379330""",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

GUOCHANJUAN_TAG = "国产剧"
GANGJU_TAG = "港剧"
REWARD_TAG = "三大电视奖"

GROUP_LIKE_LABEL = 'group_like'
OFFICIAL_LIKE_LABEL = 'official_like'

# input
REWARD_TV_FILE = "data/reward_tv_titles.csv"

GET_DOUBAN_TV_TAG_URL = "https://movie.douban.com/j/search_subjects"
SEARCH_TV_URL = "https://www.douban.com/search?cat=1002&q="
GET_ACTOR_URL = "https://movie.douban.com/subject/"
ACTOR_DETAIL_URL = "https://movie.douban.com/celebrity/"

# output
TV_LOOKUP_FILE = "data/tv_list.csv" # id->title
ACTOR_LOOKUP_FILE = "data/actor_list.csv" # id->name
TV_ACTORS_FILE = "data/tv_actor_list.csv"

In [86]:
# function definitions

def get_douban_tv_by_tag(tag, num = 1):
    url = GET_DOUBAN_TV_TAG_URL
    params = {
        "type": "tv",
        "tag": tag,
        "page_limit": num,  # You can adjust the limit as needed
        "page_start": 0
    }
    response = requests.get(url, params=params, headers = headers)
    dramas = response.json().get('subjects', [])

    filtered_dramas = []
    for drama in dramas:
        title = drama["title"]
        url = drama["url"].strip('/')
        rate = drama["rate"]
        if rate == '' or float(rate) < 7:
            continue
        did = drama["id"]
        filtered_dramas.append([tag, did, title, rate, url])
    return filtered_dramas

def search_tv_by_name(search_text):
    quoted_search_text = quote(search_text)

    url = f"{SEARCH_TV_URL}{quoted_search_text}"

    # Send the request
    r = requests.get(url, headers=headers)
    status = r.status_code
    if status == 200:
        # The bytes are decoded explicitly, so from_encoding is unnecessary
        soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")
        a = soup.find('div', class_='result-list')
        if a is None:
            print(f"fail soup find for {search_text}")
            return None
        num = len(a)
        if num == 0:
            print('request succ, but null result')
        else:
            href = a.find('a', href=True)
            link = href['href'].split('?url=')[-1].split('%2F&query=')[0]
            uq_link = unquote(link).strip('/')
            return uq_link
    else:
        print(f"Request failed, code:{status}")
    return None

def get_tv_actors(tv_id):

    url = f"{GET_ACTOR_URL}{tv_id}"

    r = requests.get(url, headers=headers)
    status = r.status_code
    return_actors = []
    if status == 200:
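        # The bytes are decoded explicitly, so from_encoding is unnecessary
        soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")
        a = soup.find('div', class_='subject clearfix')
        actor_span = a.find('span', class_='actor') if a is not None else None
        attrs = actor_span.find('span', class_='attrs') if actor_span is not None else None
        # Pages without a lead-actor block yield an empty list instead of raising
        if attrs is not None:
            for item in attrs.find_all('a', href=True):
                aname = item.get_text()
                aid = item['href'].strip('/')
                return_actors.append((aname, aid))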
    return return_actors

def get_reward_tv(num):
    with open(REWARD_TV_FILE, 'r') as f:
        result = []
        for line in f.readlines():
            title = line.strip()
            link = search_tv_by_name(title)
            if link is not None:
                tv_id = link.strip().strip('/').split('/')[-1]
                result.append([REWARD_TAG, tv_id, title, '10', link])
            if len(result) >= num:
                break
        return result
        
def tv_write(data, tag, label):
    
    with open(TV_LOOKUP_FILE, 'a+') as f:
        for tv in data:
            [_, tv_id, title, rate, link] = tv
            ntitle = title.replace(',', '&')
            simple_tv = [tv_id, ntitle, tag, label]
            f.write(','.join(simple_tv) + '\n')
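
Writing rows by joining on commas means any comma inside a title has to be replaced first. An alternative worth considering is the standard-library `csv` module, which quotes such fields and restores them on reading. The sketch below is a minimal round trip under that assumption; the rows are made up and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical rows in the same column order as the lookup file:
# tv_id, title, tag, label
rows = [
    ["1001", "Drama, Part One", "国产剧", "group_like"],
    ["1002", 'A "Quoted" Title', "港剧", "group_like"],
]

# io.StringIO stands in for open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)

# Read the same data back; commas and quotes inside titles survive intact.
buffer.seek(0)
restored = [row for row in csv.reader(buffer)]
print(restored)
```

If the writer is switched to `csv.writer`, the code that reads the file back should switch to `csv.reader` as well, because a plain `split(',')` would break quoted titles apart.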


In [18]:
# Quick smoke test of the functions above
ta = get_douban_tv_by_tag("国产剧", 1)
print(ta)

tb = search_tv_by_name("一念关山")
print(tb)

tc = get_tv_actors("35797771")
print(tc)


td = get_reward_tv(1)
print(td)

[['国产剧', '35797771', '一念关山', '7.3', 'https://movie.douban.com/subject/35797771']]
https://movie.douban.com/subject/35797771
[('刘诗诗', 'celebrity/1274533'), ('刘宇宁', 'celebrity/1401585'), ('方逸伦', 'celebrity/1359360'), ('何蓝逗', 'celebrity/1376538'), ('陈昊宇', 'celebrity/1351561'), ('常华森', 'celebrity/1437324'), ('王艳', 'celebrity/1274509'), ('吕行', 'celebrity/1313870'), ('李欢', 'celebrity/1349341'), ('陈宥维', 'celebrity/1386145'), ('陈都灵', 'celebrity/1342249'), ('王一哲', 'celebrity/1397908'), ('陈小纭', 'celebrity/1361294'), ('张芷溪', 'celebrity/1323727'), ('黄梦莹', 'celebrity/1349244'), ('张帆', 'celebrity/1420285'), ('叶青', 'celebrity/1315746'), ('叶筱玮', 'celebrity/1371589'), ('原若航', 'celebrity/1439518'), ('吴弘', 'celebrity/1342009'), ('张垒', 'celebrity/1276088'), ('苏梦芸', 'celebrity/1452637'), ('张乔耳', 'celebrity/1440942'), ('周陆啦', 'celebrity/1418957'), ('尹铸胜', 'celebrity/1313561'), ('张天阳', 'celebrity/1339958'), ('常铖', 'celebrity/1314524'), ('曾柯琅', 'celebrity/1424329'), ('景如洋', 'celebrity/1408694')]
[['三大电视奖', '6

In [None]:
# Crawl up to 300 dramas per tag, plus up to 300 award-winning dramas
NUM = 300
gcj_list = get_douban_tv_by_tag(GUOCHANJUAN_TAG, NUM)
gj_list = get_douban_tv_by_tag(GANGJU_TAG, NUM)
reward_list = get_reward_tv(NUM)



fail soup find for 永生不忘
fail soup find for 信仰·使命
fail soup find for 夜凖
fail soup find for 鸣沙湾
fail soup find for 侦察兵的荣誉


In [None]:
# Write the drama ID -> title lookup file
tv_write(gcj_list, GUOCHANJUAN_TAG, GROUP_LIKE_LABEL)
tv_write(gj_list, GANGJU_TAG, GROUP_LIKE_LABEL)
tv_write(reward_list, REWARD_TAG, OFFICIAL_LIKE_LABEL)

In [None]:
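# Collect and deduplicate the drama IDs written above
tv_id_set = set()
with open(TV_LOOKUP_FILE, 'r') as f:
    for line in f.readlines():
        item = line.strip().split(',')
        # Each row is tv_id,title,tag,label (see tv_write), so the ID is column 0
        tv_id_set.add(item[0])

tv_id_list = list(tv_id_set)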

In [None]:
# Crawl the lead-actor list for each of the dramas above
tv2actor = {}

actor_map = {}
tv2actor_simple = {}
for tv_id in tv_id_list:
    v = get_tv_actors(tv_id)
    aid_list = []
    for item in v:
        [name, aid] = item
        aid = aid.split('/')[-1]
        actor_map[aid] = name
        aid_list.append(aid)
    tv2actor_simple[tv_id] = aid_list

print(f"actor size: {len(actor_map)}")

In [None]:
# Write the actor ID -> name lookup file
def actor_write(data):
    with open(ACTOR_LOOKUP_FILE, 'w') as f:
        for k, v in data.items():
            v = v.replace(',', '&')
            f.write(f'{k},{v}\n')
actor_write(actor_map)
        

In [None]:
# Write the drama ID -> actor ID list file
with open(TV_ACTORS_FILE, 'w') as f:
    for k, v in tv2actor_simple.items():
        v_str = ' '.join(v)
        f.write(f'{k},{v_str}\n')

In [None]:
# Read the drama ID -> actor ID list file
tv2actor_id = {}
with open(TV_ACTORS_FILE, 'r') as f:
    for line in f.readlines():
        [k, v] = line.strip().split(',')
        actors = v.split(' ')
        tv2actor_id[k] = actors

# Read the drama lookup file (id -> title) and group drama IDs by label
tv_id2title = {}
label2tvlist = {'group_like': {}, 'official_like': {}}
with open(TV_LOOKUP_FILE, 'r') as f:
    for line in f.readlines():
        item = line.strip().split(',')
        tv_id2title[item[0]] = item[1]
        label2tvlist[item[3]][item[0]] = 1

# Read the actor lookup file (id -> name)
actor_id2name = {}
with open(ACTOR_LOOKUP_FILE, 'r') as f:
    for line in f.readlines():
        item = line.strip().split(',')
        actor_id2name[item[0]] = item[1]
print(f'tv2actor_id size:{len(tv2actor_id)}')
print(f'tv_id2title size:{len(tv_id2title)}')
print(f'actor_id2name size:{len(actor_id2name)}')

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

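def create_graph(label):
    # Count how many dramas each pair of actors has appeared in together
    edges = {}
    for tv_id, actors in tv2actor_id.items():
        if tv_id not in label2tvlist[label]:
            continue
        # Deduplicate and sort the cast so each pair has one canonical key (a, b);
        # otherwise (a, b) and (b, a) would be counted separately and the later
        # G.add_edge call would overwrite the earlier weight.
        cast = sorted(set(a for a in actors if a))
        for i in range(len(cast)):
            for j in range(i + 1, len(cast)):
                pair = (cast[i], cast[j])
                edges[pair] = edges.get(pair, 0) + 1

    # Build the undirected weighted graph from the pair counts
    G = nx.Graph()
    for (vi, vj), weight in edges.items():
        G.add_edge(vi, vj, weight=weight)
    return G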
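
One detail worth checking when building a weighted undirected graph: calling `add_edge` again for a pair that already exists replaces its `weight` attribute rather than adding to it, whichever order the endpoints are given in. The sketch below contrasts that with accumulating on the existing edge; the cast lists and node names are made up for illustration.

```python
import networkx as nx
from itertools import combinations

# Hypothetical casts: the same two actors are listed in opposite order.
casts = {
    "tv1": ["x", "y", "z"],
    "tv2": ["y", "x"],
}

# Repeated add_edge calls on an undirected graph replace the weight.
overwritten = nx.Graph()
overwritten.add_edge("x", "y", weight=1)
overwritten.add_edge("y", "x", weight=1)

# Accumulating on the existing edge keeps a running count instead.
G = nx.Graph()
for cast in casts.values():
    for u, v in combinations(sorted(set(cast)), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

print(overwritten["x"]["y"]["weight"], G["x"]["y"]["weight"])
```

Here the first graph ends with weight 1 for the pair, while the accumulating version records both collaborations.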
    

In [None]:
group_like_actor_network = create_graph(GROUP_LIKE_LABEL)
print(group_like_actor_network)


official_like_actor_network = create_graph(OFFICIAL_LIKE_LABEL)
print(official_like_actor_network)

In [None]:
def plot_graph(G):
    elarge = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] > 1]
    esmall = [(u, v) for (u, v, d) in G.edges(data=True) if d["weight"] <= 1]

    pos = nx.spring_layout(G, seed=10)  # positions for all nodes - seed for reproducibility

    # nodes
    nx.draw_networkx_nodes(G, pos, node_size=700)

    # edges
    nx.draw_networkx_edges(G, pos, edgelist=elarge, width=6)
    nx.draw_networkx_edges(
        G, pos, edgelist=esmall, width=6, alpha=0.5, edge_color="b", style="dashed"
    )

    # node labels
    nx.draw_networkx_labels(G, pos, font_size=10, font_family="sans-serif")
    # edge weight labels
    edge_labels = nx.get_edge_attributes(G, "weight")
    nx.draw_networkx_edge_labels(G, pos, edge_labels)

    ax = plt.gca()
    ax.margins(0.08)
    plt.axis("on")
    plt.tight_layout()
    plt.show()

In [None]:
plot_graph(group_like_actor_network)
plot_graph(official_like_actor_network)
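
With several hundred dramas the full network can contain far more nodes than a single figure can show legibly. One option, offered here only as a suggestion, is to draw the subgraph induced by the most strongly connected actors. The helper name `top_k_subgraph` and the toy graph below are assumptions; the result could then be passed to `plot_graph`.

```python
import networkx as nx


def top_k_subgraph(G, k):
    """Return the subgraph induced by the k nodes with the largest weighted degree."""
    strength = dict(G.degree(weight="weight"))
    keep = sorted(strength, key=strength.get, reverse=True)[:k]
    # copy() gives an independent graph rather than a read-only view
    return G.subgraph(keep).copy()


# Toy weighted graph with made-up node names.
G = nx.Graph()
G.add_weighted_edges_from(
    [("a", "b", 3), ("a", "c", 2), ("b", "c", 1), ("c", "d", 1), ("d", "e", 1)]
)
H = top_k_subgraph(G, 3)
print(sorted(H.nodes()), H.number_of_edges())
```

The centrality calculations should still run on the full graph; the reduced graph is only for drawing.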


In [None]:
# Compute the four centrality measures
def calculate_centralities(G):
    degree_centrality = nx.degree_centrality(G)
    betweenness_centrality = nx.betweenness_centrality(G)
    closeness_centrality = nx.closeness_centrality(G)
    eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)
    return degree_centrality, betweenness_centrality, closeness_centrality, eigenvector_centrality

def map_actors_name(top_actors):
    new_top = []
    for item in top_actors:
        (idx, score) = item
        new_top.append((idx, actor_id2name[idx], score))
    return new_top

def top_actors(centrality, n=30):
    return sorted(centrality.items(), key=lambda x: x[1], reverse=True)[:n]

def analysis(G):
    centralities_ = calculate_centralities(G)
    # Print the top 30 actors for each centrality measure
    for i, centrality in enumerate(['Degree', 'Betweenness', 'Closeness', 'Eigenvector']):
        print(f"\t{centrality} Centrality Top Actors in G:")
        # map_actors_name returns (id, name, score) tuples, in that order
        for (idx, name, score) in map_actors_name(top_actors(centralities_[i])):
            print(f'\t\t{name}\t{idx}\t{score}')

print(f'{GROUP_LIKE_LABEL} actor network analysis:')
analysis(group_like_actor_network)

print(f'{OFFICIAL_LIKE_LABEL} actor network analysis:')
analysis(official_like_actor_network)
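
Task item 5 asks for a loop that removes the current top-scoring node and prints the new ranking each time. The sketch below shows one way this could be structured; the helper name `removal_rounds` and the toy graph are assumptions, and on the real networks the graph argument would be one of the actor networks built earlier.

```python
import networkx as nx


def removal_rounds(G, centrality_fn, rounds=5, top_n=5):
    """Repeatedly delete the current top-ranked node and record each new ranking."""
    H = G.copy()  # work on a copy so the original network is untouched
    history = []
    for _ in range(rounds):
        if H.number_of_nodes() <= 1:
            break
        scores = centrality_fn(H)
        removed = max(scores, key=scores.get)
        H.remove_node(removed)
        # Recompute on the reduced graph so the ranking reflects the new structure
        after = centrality_fn(H)
        new_ranking = sorted(after, key=after.get, reverse=True)[:top_n]
        history.append((removed, new_ranking))
    return history


# Toy graph: a star centred on "hub" plus a short chain (names are made up).
G = nx.Graph([("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d"), ("d", "e")])
history = removal_rounds(G, nx.degree_centrality, rounds=2)
for removed, ranking in history:
    print(f"removed {removed}; new top nodes: {ranking}")
```

The same helper accepts `nx.betweenness_centrality` or `nx.closeness_centrality`. Eigenvector centrality may need extra care, since the power iteration can fail to converge once removals fragment the graph.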