#### Machine Learning Workflow Overview

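- **Load the data**: Load the data from the CSV file and check that it is complete and correct.
- **Clean the data**: Check for missing values and outliers, and handle them appropriately.
- **Process the genre parameters**: Convert the game genre columns (`genres`) into a format suitable for cluster analysis. Each genre's weight is determined from how often that genre appears overall and from which genres the game in question includes.
- **Build the matrix**: Create a matrix in which each row represents a player and each column a game genre. Each value is the player's weighted playtime for that genre (playtime multiplied by the genre's weight).
- **Reduce dimensionality**: Apply principal component analysis (PCA) to reduce the dimensionality of the data, which simplifies the model and extracts the most informative features. The number of components to keep should be based on the explained variance ratio and the needs of the problem.
- **Cluster**: Run the DBSCAN clustering algorithm on the PCA-transformed data. Choose the `eps` and `min_samples` parameters carefully, since they determine the quality of the clustering.
- **Present the clustering results**: Use scatter plots or other visualizations to show the DBSCAN result, with each cluster drawn in a different color or marker. This may also include a visualization of the reduced feature space to help analyze and interpret the clusters.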
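
The last three steps can be sketched end to end on toy data. This is a minimal illustration, not the notebook's actual pipeline: the player-by-genre matrix is synthetic, the `StandardScaler` step is an added assumption, and the 90% variance threshold and the `eps`/`min_samples` values are placeholders rather than tuned settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic player-by-genre matrix (hypothetical): two groups of 20 players
# with clearly different weighted playtime profiles.
rng = np.random.default_rng(0)
genres = ["Action", "Adventure", "Strategy", "Simulation"]
group_a = rng.normal(loc=[300, 200, 10, 10], scale=5, size=(20, 4))
group_b = rng.normal(loc=[10, 10, 300, 200], scale=5, size=(20, 4))
matrix = pd.DataFrame(np.vstack([group_a, group_b]), columns=genres)

# Standardize the columns, then keep enough principal components
# to explain at least 90% of the variance.
scaled = StandardScaler().fit_transform(matrix)
pca = PCA(n_components=0.90, svd_solver="full")
reduced = pca.fit_transform(scaled)

# Cluster in the reduced space; DBSCAN labels noise points as -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster labels found:", sorted(set(labels)))
```

On real data, `eps` is usually chosen by inspecting nearest-neighbor distances, and the number of retained components by inspecting the cumulative explained variance.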

### Loading the Data

In [1]:
import pandas as pd
import re
from ast import literal_eval  # Parses a string holding a Python literal back into its original type (e.g. a list)
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
data = pd.read_csv('../../data/processed/all_steam_and_game_data_after_cleaned.csv')
data.head(5)

Unnamed: 0.1,Unnamed: 0,steamid,communityvisibilitystate,profilestate,personaname,profileurl,avatar,avatarmedium,avatarfull,avatarhash,...,playtime_2weeks_game3,playtime_forever_game3_2weeks,steam_appid_2weeks_game_3,price_overview_2weeks_game_3,genres_2weeks_game_3,developers_2weeks_game_3,publishers_2weeks_game_3,categories_2weeks_game_3,release_date_2weeks_game_3,metacritic_2weeks_game_3
0,0,76561197960269904,3,1,ツxxツ,https://steamcommunity.com/id/xcari/,https://avatars.steamstatic.com/c8499ee4d5ebde...,https://avatars.steamstatic.com/c8499ee4d5ebde...,https://avatars.steamstatic.com/c8499ee4d5ebde...,c8499ee4d5ebdebd78f07fc3fa19ce5370da82be,...,0,0,0,0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
1,1,76561197960280448,3,1,recon,https://steamcommunity.com/id/pzrecon/,https://avatars.steamstatic.com/628974cb0fcec1...,https://avatars.steamstatic.com/628974cb0fcec1...,https://avatars.steamstatic.com/628974cb0fcec1...,628974cb0fcec15a07cd1601fdadc7aa44ac245d,...,0,0,0,0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
2,2,76561197960290464,3,1,JKBe,https://steamcommunity.com/profiles/7656119796...,https://avatars.steamstatic.com/c698ae39dd85c1...,https://avatars.steamstatic.com/c698ae39dd85c1...,https://avatars.steamstatic.com/c698ae39dd85c1...,c698ae39dd85c1a1567e184a4b1735e9077a475f,...,0,0,0,0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown
3,3,76561197960315952,3,1,chuNami5000,https://steamcommunity.com/id/chuNk--/,https://avatars.steamstatic.com/b16314c3aff86b...,https://avatars.steamstatic.com/b16314c3aff86b...,https://avatars.steamstatic.com/b16314c3aff86b...,b16314c3aff86bf70b0bb5e54570fe8aa3efdd5d,...,67,272,952060,40,['Action'],"['CAPCOMCo.,Ltd.']","['CAPCOMCo.,Ltd.']","['Single-player', 'Multi-player', 'PvP', 'Onli...","2Apr,2020",77
4,4,76561197960331200,3,1,Mikki,https://steamcommunity.com/profiles/7656119796...,https://avatars.steamstatic.com/fef49e7fa7e199...,https://avatars.steamstatic.com/fef49e7fa7e199...,https://avatars.steamstatic.com/fef49e7fa7e199...,fef49e7fa7e1997310d705b2a6158ff8dc1cdfeb,...,0,0,0,0,Unknown,Unknown,Unknown,Unknown,Unknown,Unknown


### Data Cleaning

In [3]:
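'''Handle placeholder values and convert data types'''

# Replace the 'Unknown' placeholder with a missing value (pd.NA)
data.replace('Unknown', pd.NA, inplace=True)

for col in data.columns:
    # The r prefix marks a raw string (it does not mean "read")
    if re.search(r'genres', col):
        # Parse stringified lists such as "['Action']" into real lists; leave missing values untouched
        data[col] = data[col].apply(lambda x: literal_eval(x) if pd.notna(x) else x)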
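
To make the conversion concrete, here is a small standalone sketch of what `literal_eval` does to a single cell. The helper name `parse_genres` is hypothetical and only illustrates the idea; it is not part of the notebook.

```python
from ast import literal_eval


def parse_genres(value):
    """Return a real list for a stringified list; pass anything else through."""
    if isinstance(value, str):
        return literal_eval(value)
    return value


raw = "['Action', 'Indie']"   # how a genre list is stored in the CSV
parsed = parse_genres(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed)
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts and so on), so it will not execute arbitrary code found in the file.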

#### Processing the Genre Parameters

In [7]:
'''Merge all genre columns into a single column'''

all_genres = pd.Series(dtype='object')

for column in data:
    if 'genres' in column:
        non_na_series = data[column].dropna()
        all_genres = pd.concat([all_genres, non_na_series], ignore_index=True)

# Every element of all_genres should now be a list
all_genres

0                                     [Action, FreetoPlay]
1                                     [Action, FreetoPlay]
2                          [Action, Indie, Racing, Sports]
3                                                 [Action]
4        [Action, Adventure, Indie, MassivelyMultiplaye...
                               ...                        
33182                                             [Action]
33183                                 [Action, FreetoPlay]
33184                                             [Action]
33185                                             [Action]
33186                                 [Action, FreetoPlay]
Length: 33187, dtype: object

In [8]:
'''Count how often each genre appears'''
genre_counts = {}
for genre_list in all_genres:
    for genre in genre_list:
        if genre in genre_counts:
            genre_counts[genre] += 1
        else:
            genre_counts[genre] = 1
genre_counts

{'Action': 24055,
 'FreetoPlay': 11575,
 'Indie': 9360,
 'Racing': 1294,
 'Sports': 1924,
 'Adventure': 10562,
 'MassivelyMultiplayer': 6275,
 'RPG': 7395,
 'Simulation': 5816,
 'Strategy': 5286,
 'Casual': 3549,
 'Utilities': 721,
 'Animation&Modeling': 452,
 'Design&Illustration': 463,
 'GameDevelopment': 120,
 'EarlyAccess': 1570,
 'PhotoEditing': 324,
 'AudioProduction': 98,
 'VideoProduction': 135,
 'Education': 98,
 'WebPublishing': 50,
 'SoftwareTraining': 35,
 'Nudity': 1,
 'Gore': 2,
 'Violent': 5,
 'Экшены': 2,
 'Казуальныеигры': 1,
 'Инди': 2,
 'Гонки': 1,
 'Симуляторы': 1,
 'Приключенческиеигры': 1,
 'Стратегии': 1}
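
The same tally can be written more compactly with the standard library. A minimal sketch on a three-row sample (the sample data is made up for illustration):

```python
from collections import Counter
from itertools import chain

# Hypothetical sample standing in for all_genres
sample_genres = [["Action", "FreetoPlay"], ["Action"], ["Action", "Indie"]]

# Flatten the list of lists and count each genre in one pass
sample_counts = Counter(chain.from_iterable(sample_genres))
print(sample_counts)
```

`Counter` is a `dict` subclass, so the result can be passed to `pd.Series` in the same way as a hand-built dictionary.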

In [9]:
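'''Remove extremely rare genres so that noise does not distort the weights'''
# Labels to drop. 'Казуальныеигры', 'Гонки' and 'Симуляторы' are not listed here, so they stay in genre_counts.
need_genre_counts = ['Nudity', 'Gore', 'Violent', 'Экшены', 'Приключенческиеигры', 'Инди', 'Стратегии']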

for key in need_genre_counts:
    if key in genre_counts:
        del genre_counts[key]
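
As an alternative to a hand-written list, rare labels could be dropped with a count threshold. The sketch below uses a subset of the counts shown above and assumes a hypothetical cutoff `min_count = 10`; the notebook itself does not define such a threshold.

```python
# Subset of the genre counts printed above, used only for illustration
sample_counts = {
    "Action": 24055,
    "Adventure": 10562,
    "SoftwareTraining": 35,
    "Violent": 5,
    "Gore": 2,
    "Гонки": 1,
}

min_count = 10  # hypothetical cutoff, not a value taken from the notebook

# Keep only genres that appear at least min_count times
filtered_counts = {g: c for g, c in sample_counts.items() if c >= min_count}
print(filtered_counts)
```

A threshold catches every rare label at once, including ones that a manual list might miss.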

In [10]:
genre_counts = pd.Series(genre_counts)
genre_counts.to_csv('./data/genre_count.csv')

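Weighting Method
- **Compute the genre weights**:
   - **Take the inverse of the share**: First compute each genre's share of all genre occurrences, then take the reciprocal of that share.
   - **Normalize the weights**: Divide each genre's inverse weight by the sum of all inverse weights so that the weights sum to 1. Because the overall total cancels out during normalization, 1/count can be used directly as the raw weight.

- **Compute the weighted time**:
   - Distribute each player's playtime for a game across that game's genres according to the normalized inverse weights.

Suppose player A plays Red Dead Redemption for 200 minutes, and the game's genres are [Action, Adventure]:

- **Weight calculation**:
  - Raw weight for Action: 1/24055
  - Raw weight for Adventure: 1/10562
  - Normalized weights: sum the two raw weights, then divide each raw weight by that sum.

- **Weighted time calculation**:
  - Multiply each normalized weight by 200 minutes to get the weighted playtime assigned to Action and to Adventure.

This gives a fairer measure of how engaged a player is with genres of differing popularity, because time spent on a game is credited more heavily to its rarer genres.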
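
The worked example above can be checked with a few lines of Python. This is a minimal sketch of the per-game normalization described in the example; the function name `weighted_playtime` is hypothetical, and the two counts are taken from the `genre_counts` output in this notebook.

```python
# Counts for the two genres in the example, from the genre_counts output
example_counts = {"Action": 24055, "Adventure": 10562}


def weighted_playtime(minutes, genres, counts):
    """Split `minutes` across `genres` using normalized inverse-count weights."""
    inverse = {g: 1 / counts[g] for g in genres}
    total = sum(inverse.values())
    return {g: minutes * inv / total for g, inv in inverse.items()}


result = weighted_playtime(200, ["Action", "Adventure"], example_counts)
for genre, minutes in result.items():
    print(f"{genre}: {minutes:.2f} minutes")
```

Of the 200 minutes, roughly 61 go to Action and roughly 139 to Adventure, so the less common genre receives the larger share.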


In [11]:
genre_counts

Action                  24055
FreetoPlay              11575
Indie                    9360
Racing                   1294
Sports                   1924
Adventure               10562
MassivelyMultiplayer     6275
RPG                      7395
Simulation               5816
Strategy                 5286
Casual                   3549
Utilities                 721
Animation&Modeling        452
Design&Illustration       463
GameDevelopment           120
EarlyAccess              1570
PhotoEditing              324
AudioProduction            98
VideoProduction           135
Education                  98
WebPublishing              50
SoftwareTraining           35
Казуальныеигры              1
Гонки                       1
Симуляторы                  1
dtype: int64

In [25]:
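'''Select the columns needed for clustering'''
for_game_cluster_columns = []
for col in data:
    # Keep the player id, every genres column and every playtime_forever_game column
    if re.search(r'steamid|genres|playtime_forever_game', col):
        for_game_cluster_columns.append(col)
for_game_cluster_df = data[for_game_cluster_columns]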

In [34]:
genre_counts['Action']

24055

In [51]:
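test_genre_list = for_game_cluster_df.iloc[3, 4]

genre_count_list = []
for genre in test_genre_list:
    # Genres that were removed from genre_counts fall back to a count of 0
    genre_size = genre_counts.get(genre, 0)
    genre_count_list.append(genre_size)
    print(f'{genre}:{genre_size}')

# Inverse-frequency weights normalized to sum to 1, as described in the weighting method;
# genres with a count of 0 get a weight of 0
inverse_list = [1 / count if count > 0 else 0 for count in genre_count_list]
total_inverse = sum(inverse_list)
weight_list = [inv / total_inverse for inv in inverse_list] if total_inverse > 0 else []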



Action:24055


50721