<center><h1><a href="https://challenge.xfyun.cn/topic/info?type=molecular-properties&ch=dw24_AtTCK9">Molecular Property AI Prediction Challenge</a></h1></center>

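# 1. Background

Artificial intelligence (AI) is reaching ever deeper into scientific research, and it shows particular promise in chemistry and drug discovery. Accurate prediction of molecular properties helps screen efficiently for high-quality drug candidates. Take PROTACs as an example: a PROTAC is a molecule built from three parts (a target-protein ligand, a linker, and an E3 ligase ligand) that directs the target protein to degradation. This competition focuses on using AI algorithms to predict the degradation potency of PROTACs. It aims to encourage innovative thinking among participants, to bring AI and chemical biology closer together, and to improve the efficiency and success rate of drug development. Through the competition, the organizers hope to see and incubate more accurate and efficient molecular property prediction models.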

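# 2. Task

Participants work from the provided demo dataset. They may expand it through data augmentation or by collecting additional data themselves, and they choose their own data splits. The goal is to predict the degradation ability of PROTACs using deep learning, reinforcement learning, or other, better-performing AI methods. If DC50 > 100 nM and Dmax < 80%, degradation ability is considered poor (Label=0 in the demo dataset); if DC50 <= 100 nM or Dmax >= 80%, it is considered good (Label=1 in the demo dataset).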
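The labeling rule can be written as a small function. This is a minimal sketch of the rule as stated in the task description; the function name `degradation_label` is hypothetical, and the handling of missing DC50 or Dmax values is not specified by the task, so it is not addressed here.

```python
def degradation_label(dc50_nm: float, dmax_pct: float) -> int:
    """Return 1 for good degradation and 0 for poor, following the task rule."""
    # Label 0 requires BOTH conditions: DC50 > 100 nM and Dmax < 80 %.
    if dc50_nm > 100 and dmax_pct < 80:
        return 0
    # Otherwise DC50 <= 100 nM or Dmax >= 80 %, which the task defines as Label 1.
    return 1


# A weak DC50 is still labeled 1 when Dmax reaches 80 %.
print(degradation_label(405.0, 90.0))  # 1
print(degradation_label(405.0, 50.0))  # 0
print(degradation_label(50.0, 50.0))   # 1
```

Because the label is a deterministic function of DC50 and Dmax, those two columns must not be used as model inputs.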

# 3. Evaluation Rules

Submissions are evaluated by F1-score, computed on the submitted result file.
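For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts and compares the result with `sklearn.metrics.f1_score`. It assumes the positive class is Label=1 and that the organizers use the plain binary F1; the rules do not state the averaging mode. The helper `f1_from_counts` and the toy labels are illustrative only.

```python
from sklearn.metrics import f1_score


def f1_from_counts(y_true, y_pred):
    """Binary F1 with Label=1 as the positive class (an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not competition data: 3 TP, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

manual_f1 = f1_from_counts(y_true, y_pred)
sklearn_f1 = f1_score(y_true, y_pred)
print(manual_f1, sklearn_f1)  # both 0.75
```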

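1. Data for this task is available for download. Participants develop and debug their algorithms locally and submit results on the competition page.

2. Each team may submit at most 3 times per day.

3. Scores are sorted from high to low, and the leaderboard ranks each team by its best historical score.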

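# 4. Submission Requirements

After the leaderboard closes, the top three teams must submit code, models, documentation, and an analysis report, specifically:

1. Code and model: submit the complete code, including the model and data processing. The code should be clear, well structured, and commented where necessary.

2. Documentation: describe in detail the model's design rationale, the techniques and algorithms used, the model architecture, and how to train and test the model. Include key information such as the loss during training.

3. Analysis report: provide the model's test results on the given dataset, including the F1 score described above as the key performance metric, plus an analysis of the model's strengths, weaknesses, and possible improvements. Reports that are thorough and draw on additional evaluation metrics (for example PR curves and ROC curves) may receive bonus points at the organizers' discretion.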
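The PR and ROC curves mentioned for the analysis report can be computed with scikit-learn. The sketch below uses synthetic labels and scores, not competition data; in practice `y_score` would be a model's predicted probability for Label=1. All variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

rng = np.random.default_rng(0)

# Synthetic ground truth and noisy scores that correlate with the label.
y_true = rng.integers(0, 2, size=200)
y_score = 0.6 * y_true + rng.normal(0.0, 0.4, size=200)

# Points of the ROC curve (false positive rate vs. true positive rate).
fpr, tpr, _ = roc_curve(y_true, y_score)
# Points of the precision-recall curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)

# Scalar summaries of the two curves.
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)
print(f"ROC AUC = {roc_auc:.3f}, average precision = {avg_precision:.3f}")
```

The arrays `fpr`/`tpr` and `recall`/`precision` can be passed directly to a plotting library to draw the curves.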

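# 5. Schedule

This task uses a single-round format.

## Timeline

June 9 to August 9

1. June 9, 10:00: the training, development, and test sets are released (the leaderboard opens).

2. The submission deadline is August 9, 17:00; rankings are announced on August 16, 10:00.

## On-site Defense

1. The final top three teams will be invited to the iFLYTEK AI Developer Competition finals to defend their work on site.

2. The defense consists of a 10-minute presentation followed by 5 minutes of Q&A.

3. The final score combines the submission score and the defense score (submission 70%, on-site defense 30%).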

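# 6. Awards

This task awards a first, a second, and a third prize, three in total, as detailed below:

## Prizes

1.  The top 3 teams receive award certificates
2.  Prize money: 5,000 yuan for first place, 3,000 yuan for second place, 2,000 yuan for third place

## Resources

1.  iFLYTEK Open Platform premium AI capability personal resource pack
2.  iFLYTEK AI full-chain startup support resources
3.  iFLYTEK fast-track internship/employment channel

Notes:

1.  Participants are encouraged to send articles about their competition experience, technical approach, or experience with related technologies and products to the organizing committee (AICompetition@iflytek.com) for a chance to receive competition merchandise.
2.  iFLYTEK reserves the right to interpret the competition rules and prize distribution. All prize amounts are pre-tax; the organizer will withhold and pay individual income tax.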

In [120]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

In [121]:
train = pd.read_excel('traindata.xlsx')
test = pd.read_excel('testdata.xlsx')

In [122]:
train.shape, test.shape

((351, 90), (353, 87))

In [123]:
train.head(1)

Unnamed: 0,uuid,Label,Uniprot,Target,E3 ligase,PDB,Name,Smiles,DC50 (nM),Dmax (%),...,XLogP3,Heavy Atom Count,Ring Count,Hydrogen Bond Acceptor Count,Hydrogen Bond Donor Count,Rotatable Bond Count,Topological Polar Surface Area,Molecular Formula,InChI,InChI Key
0,1,1,Q9NWZ3,IRAK4,CRBN,,,CC(C)NC1=CC(N2C=CC3=CC(C#N)=CN=C32)=NC=C1C(=O)...,405.0,90.0,...,2.14,62,7,14,5,16,255.84,C43H46N10O9,InChI=1S/C43H46N10O9/c1-24(2)49-31-19-34(52-15...,YNNBDJQWDDDBRU-OAMJFVEXSA-N


In [124]:
test.head(1)

Unnamed: 0,uuid,Uniprot,Target,E3 ligase,PDB,Name,Smiles,Assay (DC50/Dmax),Percent degradation (%),Assay (Percent degradation),...,XLogP3,Heavy Atom Count,Ring Count,Hydrogen Bond Acceptor Count,Hydrogen Bond Donor Count,Rotatable Bond Count,Topological Polar Surface Area,Molecular Formula,InChI,InChI Key
0,1,Q9H8M2,BRD9,VHL,,,COC1=CC(C2=CN(C)C(=O)C3=CN=CC=C23)=CC(OC)=C1CN...,Degradation of BRD9 in HeLa cells after 4 h tr...,,,...,3.69,74,8,16,3,22,199.15,C54H69FN8O10S,InChI=1S/C54H69FN8O10S/c1-34-47(74-33-58-34)35...,MXAKQOVZPDLCDK-UDVNCTHFSA-N


In [125]:
[x for x in train.columns if x not in test.columns]

['Label', 'DC50 (nM)', 'Dmax (%)']

In [126]:
train = train.drop(['DC50 (nM)', 'Dmax (%)'], axis=1)

In [127]:
train.dtypes.value_counts()

float64    53
object     28
int64       7
Name: count, dtype: int64

In [128]:
train.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
uuid,351.0,,,,176.0,101.469207,1.0,88.5,176.0,263.5,351.0
Label,351.0,,,,0.621083,0.48581,0.0,0.0,1.0,1.0,1.0
Uniprot,329,49,P10275,48,,,,,,,
Target,351,65,AR,39,,,,,,,
E3 ligase,351,6,CRBN,181,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
Rotatable Bond Count,351.0,,,,17.37037,6.482031,6.0,12.0,17.0,21.0,37.0
Topological Polar Surface Area,351.0,,,,203.396752,40.392664,96.43,174.45,204.0,224.04,316.08
Molecular Formula,351,228,C46H59N7O7S,6,,,,,,,
InChI,351,241,InChI=1S/C46H59N7O7S/c1-30-41(61-29-49-30)32-1...,6,,,,,,,


In [129]:
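# Compare the number of distinct values of each text (object) column in train and test.
# Columns with more than 200 distinct values in train are flagged with a leading '#'.
for col in train.columns[1:]:
    if train[col].dtype == object:
        if train[col].nunique() > 200:
            print('#', col)
        print(col, train[col].nunique(), test[col].nunique())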

Uniprot 49 48
Target 65 55
E3 ligase 6 6
PDB 4 8
Name 49 159
# Smiles
Smiles 231 280
Assay (DC50/Dmax) 85 110
Percent degradation (%) 3 4
Assay (Percent degradation) 1 1
Assay (Protac to Target, IC50) 24 32
Assay (Protac to Target, Kd) 8 21
Assay (Protac to Target, Ki) 3 2
Assay (Protac to E3, IC50) 2 5
Assay (Protac to E3, Kd) 4 7
Assay (Ternary complex, IC50) 2 3
Assay (Ternary complex, Kd) 5 12
Assay (Cellular activities, IC5 29 33
EC50 (nM, Cellular activities) 13 17
Assay (Cellular activities, EC5 7 8
GI50 (nM, Cellular activities) 5 17
Assay (Cellular activities, GI5 4 16
Assay (Permeability, PAMPA Papp 1 1
Assay (Permeability, Caco-2 A2B 2 1
Assay (Permeability, Caco-2 B2A 2 1
Article DOI 63 86
# Molecular Formula
Molecular Formula 228 285
# InChI
InChI 241 296
# InChI Key
InChI Key 250 298


In [130]:
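# Replace every text (object) column with a boolean "is missing" flag.
# The text itself is discarded; only whether a value was recorded is kept.
for col in train.columns[2:]:
    if train[col].dtype == object or test[col].dtype == object:
        train[col] = train[col].isnull()
        test[col] = test[col].isnull()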
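To make the effect of this encoding concrete, the sketch below applies the same transformation to a toy frame. The column names and values are made up for illustration, and the text column is created with an explicit `object` dtype so the dtype check behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame: one text column with a missing entry, one numeric column.
toy = pd.DataFrame({
    "Assay": pd.Series(
        ["Degradation of X in HeLa cells", None, "Degradation of Y"],
        dtype=object,
    ),
    "XLogP3": [2.14, 3.69, np.nan],
})

encoded = toy.copy()
for col in encoded.columns:
    if encoded[col].dtype == object:
        # Text becomes True where the value is missing, False otherwise.
        encoded[col] = encoded[col].isnull()

print(encoded)
```

Numeric columns pass through unchanged, so their missing values remain `NaN` and still have to be handled by the model or by an explicit fill.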

In [131]:
pred = cross_val_predict(
    DecisionTreeClassifier(),
    train.iloc[:, 2:],
    train['Label']
)
print(f1_score(train['Label'], pred))

0.6585956416464891
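`cross_val_predict` is called here without a `cv` argument, which for a classifier means 5-fold stratified splitting; each sample is predicted by a model that did not see it during training. The sketch below makes the splitter explicit and fixes the random seeds so the score is reproducible. It runs on synthetic data from `make_classification`, not on the competition data, and the chosen seeds and class weights are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly the same size and class balance as train.
X, y = make_classification(
    n_samples=351, n_features=20, weights=[0.38, 0.62], random_state=0
)

# Explicit, shuffled, seeded splitter instead of the implicit default.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: one prediction per sample.
oof_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
oof_f1 = f1_score(y, oof_pred)
print(f"out-of-fold F1 = {oof_f1:.3f}")
```

Seeding both the splitter and the tree removes the run-to-run variation that an unseeded `DecisionTreeClassifier()` would otherwise show.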


In [132]:
pred = cross_val_predict(
    LogisticRegression(max_iter=1000),
    train.iloc[:, 2:].fillna(0),
    train['Label']
)
print(f1_score(train['Label'], pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.7405660377358491
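The convergence warning in this output suggests scaling the data. One common way to do that, sketched below on synthetic data, is to put a `StandardScaler` in a pipeline with the classifier so the scaler is fitted inside each cross-validation fold. This is an illustration of the warning's suggestion, not the notebook author's method; on the real data, missing values would still need filling first, as the cell does with `fillna(0)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with feature scales spanning several orders of magnitude.
X, y = make_classification(n_samples=351, n_features=20, random_state=0)
X = X * np.logspace(0, 4, X.shape[1])

# The scaler is part of the estimator, so it only ever sees training folds.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scaled_pred = cross_val_predict(clf, X, y)
scaled_f1 = f1_score(y, scaled_pred)
print(f"F1 with scaling = {scaled_f1:.3f}")
```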




In [133]:
pred = cross_val_predict(
    RandomForestClassifier(),
    train.iloc[:, 2:].fillna(0),
    train['Label']
)
print(f1_score(train['Label'], pred))

0.7913043478260869


In [134]:
pred = cross_val_predict(
    LGBMClassifier(max_depth=10, num_leaves=128),
    train.iloc[:, 2:].values,
    train['Label']
)
print(f1_score(train['Label'], pred))

[LightGBM] [Info] Number of positive: 174, number of negative: 106
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000111 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 437
[LightGBM] [Info] Number of data points in the train set: 280, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.621429 -> initscore=0.495616
[LightGBM] [Info] Start training from score 0.495616
[LightGBM] [Info] Number of positive: 174, number of negative: 107
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000125 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 475
[LightGBM] [Info] Number of data points in the train set: 281, number of used features: 17
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.619217 -> initscore=0.486226
[LightGBM] [Info] Start training from score 0.486226

In [137]:
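# Fit on the full training set, then predict the test set.
# Selecting test columns by name keeps the feature order aligned with train.
features = train.columns[2:]
model = LGBMClassifier()
model.fit(train[features].values, train['Label'])
pred = model.predict(test[features].values)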

[LightGBM] [Info] Number of positive: 218, number of negative: 133
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000426 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 541
[LightGBM] [Info] Number of data points in the train set: 351, number of used features: 19
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.621083 -> initscore=0.494146
[LightGBM] [Info] Start training from score 0.494146


In [138]:
# Write the submission file: one row per test uuid with its predicted Label.
pd.DataFrame(
    {
        'uuid': test['uuid'],
        'Label': pred
    }
).to_csv('submit.csv', index=False)
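Before uploading, it can help to run a quick sanity check on the file. The sketch below assumes the expected format is exactly the two columns written above (`uuid`, `Label`) with binary labels; the official format is not spelled out in this notebook. The helper `check_submission` is hypothetical, and the example round-trips a toy frame through an in-memory buffer rather than reading `submit.csv`.

```python
import io

import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError if the frame does not look like a valid submission."""
    if list(df.columns) != ["uuid", "Label"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(df)}")
    if not df["uuid"].is_unique:
        raise ValueError("duplicate uuid values")
    if not df["Label"].isin([0, 1]).all():
        raise ValueError("Label must be 0 or 1")


# Toy submission written to and read back from an in-memory CSV.
toy = pd.DataFrame({"uuid": [1, 2, 3], "Label": [1, 0, 1]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

check_submission(reloaded, expected_rows=3)
print("submission looks valid")
```

For the real file, the same check would be `check_submission(pd.read_csv('submit.csv'), expected_rows=len(test))`.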