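## Competition Background
The iFLYTEK Open Platform provides AI capabilities and solutions tailored to different industries and scenarios. It empowers developers' products and applications, helps developers solve practical problems with AI, and enables products that can listen and speak, see and recognize, and understand and think.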

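New-user prediction is a key step in analyzing how users engage with a product and in forecasting user growth. It informs later iterations and upgrades of products and applications.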

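## Competition Task
The competition provides a large volume of application data from the iFLYTEK Open Platform as training samples. Participants build a model on these samples to predict whether each user is a new user.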

## Evaluation Rules

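### Data Description
| Field            | Meaning                                                                                            |
| :--------------- | :------------------------------------------------------------------------------------------------- |
| mid              | User behavior module ID                                                                            |
| eid              | User behavior event ID                                                                             |
| did              | User ID                                                                                            |
| device\_brand    | Device brand / manufacturer                                                                        |
| ntt              | Network type                                                                                       |
| operator         | Carrier                                                                                            |
| common\_country  | Country                                                                                            |
| common\_province | Province                                                                                           |
| common\_city     | City                                                                                               |
| appver           | App version                                                                                        |
| channel          | App distribution channel                                                                           |
| common\_ts       | Event time (millisecond timestamp)                                                                 |
| os\_type         | Distinguishes Android from iOS                                                                     |
| udmap            | Custom event attributes (standard JSON text containing botId, the assistant ID, and pluginId, the plugin ID) |
| is\_new\_did     | Prediction target: whether the user is a new user                                                  |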
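
The table describes `udmap` as JSON text carrying `botId` and `pluginId`, but the notebook drops the column before modeling. Below is a minimal sketch of one way to pull those two keys out. It assumes the keys are spelled exactly `botId` and `pluginId` and that a missing key means the attribute is absent. The helper name `parse_udmap` and the sample strings are hypothetical.

```python
import json

def parse_udmap(text):
    """Return (botId, pluginId) from a udmap JSON string; None where a key is missing."""
    try:
        data = json.loads(text)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON: treat as "no attributes".
        return (None, None)
    if not isinstance(data, dict):
        return (None, None)
    return (data.get("botId"), data.get("pluginId"))

# Hypothetical sample values, not taken from the competition data.
samples = ['{}', '{"botId": "123"}', '{"botId": "123", "pluginId": "456"}', 'not json']
parsed = [parse_udmap(s) for s in samples]
print(parsed)
```

With pandas, the same helper could be applied as `train["udmap"].map(parse_udmap)` and the resulting tuples split into two columns.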

### Evaluation Metric
Submissions are scored with `f1_score`; the higher the score, the better.
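
For reference, here is a small sketch of how a binary F1 score is computed from precision and recall. It assumes the positive class is `is_new_did == 1` and that the competition uses plain binary F1, which is the default behavior of scikit-learn's `f1_score`. The source does not state the averaging mode. The function name `f1_binary` and the toy labels are illustrative only.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_binary(y_true, y_pred)
print(score)
```

Because F1 ignores true negatives, it is sensitive to how the roughly 16% positive class is handled, which is why the decision threshold matters for this task.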

In [41]:
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
import lightgbm as lgb
from sklearn.ensemble import HistGradientBoostingClassifier

In [32]:
train = pd.read_csv("train.csv")
test = pd.read_csv("testA_data.csv")

# common_ts is a Unix timestamp in milliseconds; the result is a timezone-naive UTC datetime.
train["common_ts"] = pd.to_datetime(train["common_ts"], unit="ms")
test["common_ts"] = pd.to_datetime(test["common_ts"], unit="ms")

In [33]:
train["common_month"] = train["common_ts"].dt.month
test["common_month"] = test["common_ts"].dt.month

train["common_day"] = train["common_ts"].dt.day
test["common_day"] = test["common_ts"].dt.day

train["common_hour"] = train["common_ts"].dt.hour
test["common_hour"] = test["common_ts"].dt.hour
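
The `train.describe` output in this notebook shows `common_ts` running from 2025-02-28 16:00 to 2025-03-31 15:59 once parsed as UTC. That range would line up with the calendar month of March in UTC+8, which suggests the events were logged in China Standard Time, although the source does not say so. Under that assumption, the sketch below shifts the timestamps by a fixed eight hours before extracting calendar features. The two sample timestamps are the millisecond equivalents of the first and last values in that output.

```python
import pandas as pd

# Millisecond timestamps for 2025-02-28 16:00:00.115 UTC and 2025-03-31 15:59:57.196 UTC.
ts_ms = pd.Series([1740758400115, 1743436797196])

# Assumption: events are best interpreted in UTC+8. China observes no daylight
# saving time, so a fixed offset is enough for this sketch.
local = pd.to_datetime(ts_ms, unit="ms") + pd.Timedelta(hours=8)

features = pd.DataFrame({
    "month": local.dt.month,
    "day": local.dt.day,
    "hour": local.dt.hour,
    "dayofweek": local.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(features)
```

If the assumption holds, `hour` reflects local time of day, and the `month` feature becomes constant and can be dropped.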

In [34]:
train.shape, test.shape

((3429925, 18), (1143309, 17))

In [35]:
train.columns

Index(['mid', 'eid', 'did', 'device_brand', 'ntt', 'operator',
       'common_country', 'common_province', 'common_city', 'appver', 'channel',
       'common_ts', 'os_type', 'udmap', 'is_new_did', 'common_month',
       'common_day', 'common_hour'],
      dtype='object')

In [36]:
train.describe(include="all")

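|       | mid | eid | did | device_brand | ntt | operator | common_country | common_province | common_city | appver | channel | common_ts | os_type | udmap | is_new_did | common_month | common_day | common_hour |
| :---- | :-- | :-- | :-- | :----------- | :-- | :------- | :------------- | :-------------- | :---------- | :----- | :------ | :-------- | :------ | :---- | :--------- | :----------- | :--------- | :---------- |
| count | 3429925.0 | 3429925.0 | 3429925 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925 | 3429925.0 | 3429925 | 3429925.0 | 3429925.0 | 3429925.0 | 3429925.0 |
| unique |  |  | 270837 |  |  |  |  |  |  |  |  | 3254416 |  | 8077 |  |  |  |  |
| top |  |  | 20cd6a7d3a60fd193d925b21af6660f1e |  |  |  |  |  |  |  |  | 2025-03-13 00:05:47.273000 |  | {} |  |  |  |  |
| freq |  |  | 68403 |  |  |  |  |  |  |  |  | 41 |  | 3162776 |  |  |  |  |
| first |  |  |  |  |  |  |  |  |  |  |  | 2025-02-28 16:00:00.115000 |  |  |  |  |  |  |
| last |  |  |  |  |  |  |  |  |  |  |  | 2025-03-31 15:59:57.196000 |  |  |  |  |  |  |
| mean | 22.64608 | 136.6922 |  | 88.92087 | 2.60518 | 1.929892 | 80.9136 | 145.9701 | 240.0957 | 58.74532 | 5.914274 |  | 0.6228171 |  | 0.156034 | 2.997912 | 15.78363 | 8.535301 |
| std | 13.93127 | 76.87001 |  | 52.89133 | 1.148252 | 1.140171 | 2.438861 | 79.85908 | 141.4514 | 28.18215 | 4.161772 |  | 0.4846814 |  | 0.3628876 | 0.04564792 | 8.624008 | 5.39119 |
| min | 0.0 | 0.0 |  | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 0.0 |  | 0.0 | 2.0 | 1.0 | 0.0 |
| 25% | 13.0 | 64.0 |  | 58.0 | 2.0 | 1.0 | 81.0 | 79.0 | 89.0 | 26.0 | 2.0 |  | 0.0 |  | 0.0 | 3.0 | 8.0 | 4.0 |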


In [37]:
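for col in ['mid', 'eid', 'did', 'device_brand', 'ntt', 'operator',
       'common_country', 'common_province', 'common_city', 'appver', 'channel',
       'os_type']:
    # Frequency encoding: how often each value occurs within its own split.
    # Train and test are counted separately, so the two scales differ.
    train[col + "_count"] = train[col].map(train[col].value_counts())
    test[col + "_count"] = test[col].map(test[col].value_counts())

    # Target encoding: mean of is_new_did per value, computed once on the full training set.
    # Each training row's own label feeds its feature, so cross-validated scores built on
    # these columns are optimistic. Values unseen in train map to NaN in test.
    target_mean = train.groupby(col)["is_new_did"].mean()
    train[col + "_target"] = train[col].map(target_mean)
    test[col + "_target"] = test[col].map(target_mean)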
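
The cell above computes each `_target` feature from the same rows it is later validated on. A common alternative is out-of-fold target encoding, where every training row is encoded using labels from the other folds only. The sketch below is one possible version, not the notebook's method. The function name `oof_target_encode`, the fold assignment by random permutation, and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def oof_target_encode(train_df, col, target, n_splits=5, seed=0):
    """Encode `col` with the target mean, using only rows outside each row's fold."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits folds of near-equal size.
    fold_id = rng.permutation(len(train_df)) % n_splits
    encoded = pd.Series(np.nan, index=train_df.index, dtype=float)
    for k in range(n_splits):
        holdout = fold_id == k
        fit_part = train_df.loc[~holdout]
        means = fit_part.groupby(col)[target].mean()
        # Categories absent from the fitting folds fall back to the fitting-fold prior.
        values = train_df.loc[holdout, col].map(means).fillna(fit_part[target].mean())
        encoded.loc[holdout] = values.to_numpy()
    return encoded

# Toy data standing in for one categorical column and the label.
toy = pd.DataFrame({
    "channel": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_new_did": [1, 0, 1, 0, 0, 0, 0, 1],
})
toy["channel_target"] = oof_target_encode(toy, "channel", "is_new_did", n_splits=4)
print(toy)
```

For the test set, the mapping would still come from the full training set, as in the original cell. High-cardinality columns such as `did` are where the difference between the two approaches is largest.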

In [42]:
# Out-of-fold predictions from scikit-learn's default 5-fold cross-validation.
pred = cross_val_predict(
    HistGradientBoostingClassifier(),
    train.drop(["did", "udmap", "is_new_did", "common_ts"], axis=1),
    train["is_new_did"]
)
f1_score(train["is_new_did"], pred)

0.9050957798130362
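
The `describe` output shows only 270,837 distinct `did` values across 3,429,925 rows, so the default folds place events from the same user in both the training and validation parts. If test-set users are mostly unseen in training (the source does not say), a group-aware split may give a more realistic estimate. The sketch below uses scikit-learn's `GroupKFold` on made-up data to show that no group crosses the split boundary.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical user ids: several rows per user, as with `did` in the real data.
groups = np.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "f"])
X = np.arange(len(groups)).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 0])

cv = GroupKFold(n_splits=3)
overlaps = []
for train_idx, valid_idx in cv.split(X, y, groups=groups):
    # Count users that appear on both sides of the split (should be none).
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(len(shared))
print(overlaps)
```

In the notebook, this would correspond to passing `cv=GroupKFold(n_splits=5)` and `groups=train["did"]` to `cross_val_predict`.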

In [54]:
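# Fit LightGBM with default hyperparameters on the full training set.
model = lgb.LGBMClassifier()
model.fit(
    train.drop(["did", "udmap", "is_new_did", "common_ts"], axis=1),
    train["is_new_did"]
)
# predict() returns hard 0/1 labels using a 0.5 probability cutoff.
pred = model.predict(test.drop(["did", "udmap", "common_ts"], axis=1))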

[LightGBM] [Info] Number of positive: 535185, number of negative: 2894740
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.039179 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3023
[LightGBM] [Info] Number of data points in the train set: 3429925, number of used features: 38
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.156034 -> initscore=-1.688038
[LightGBM] [Info] Start training from score -1.688038
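
The log reports 535,185 positives against 2,894,740 negatives (`pavg=0.156034`), so the classes are imbalanced, and the default 0.5 cutoff used by `predict` is not necessarily the one that maximizes F1. A minimal sketch of scanning candidate thresholds on held-out probabilities follows. The function name `best_f1_threshold`, the grid, and the toy probabilities are assumptions for illustration. In practice the probabilities would come from out-of-fold `predict_proba` output.

```python
import numpy as np

def best_f1_threshold(y_true, proba, grid=None):
    """Return (threshold, f1) for the grid value with the highest F1."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = proba >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest of tied thresholds
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy held-out labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.02, 0.11, 0.22, 0.31, 0.37, 0.41, 0.47, 0.83]
threshold, score = best_f1_threshold(y_true, proba)
print(threshold, score)
```

The chosen threshold would then be applied to `model.predict_proba(...)[:, 1]` on the test set instead of calling `predict` directly.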


In [55]:
pd.DataFrame({"is_new_did": pred}).to_csv("submit.csv", index=False)