# Drug Screening Assignment

> 10185101210 陈俊潼

Use a `Random Forest` model to predict which organic compounds have antibacterial activity.


### Preparing the activity data

Import the RDKit libraries:

In [1]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

Import the data-processing libraries:

In [2]:
import pandas as pd
import numpy as np

Load the activity data for the molecules:

In [3]:
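# Load the experimental antibacterial-activity table (comma-separated, header in the first row)
df_all = pd.read_csv('./Experimental_anti_bact.csv', delimiter=',', header=0)
# Split the SMILES strings by their activity label
act_smiles = df_all[df_all['Activity']=='Active']['SMILES'].tolist()
inact_smiles = df_all[df_all['Activity']=='Inactive']['SMILES'].tolist()
# Number of active and inactive molecules
print(len(act_smiles), len(inact_smiles))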

120 2215
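
The counts above show a strongly imbalanced dataset, which matters when reading any accuracy score computed on it. As a minimal sketch using only the two counts printed above (the variable names are illustrative, not from the notebook), the share of actives and the accuracy of a trivial classifier that always answers `INACTIVE` can be worked out as:

```python
# Class counts taken from the output above
n_active = 120
n_inactive = 2215
n_total = n_active + n_inactive

# Fraction of molecules labelled active
active_fraction = n_active / n_total

# Accuracy of a trivial model that predicts INACTIVE for every molecule
majority_baseline = n_inactive / n_total

print(f"total molecules:   {n_total}")
print(f"active fraction:   {active_fraction:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

An accuracy figure for a model trained on this data is best compared against that baseline of roughly 0.95 rather than against 0.5.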


Compute a molecular fingerprint for every molecule:

In [4]:
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mols_act = [Chem.MolFromSmiles(x) for x in act_smiles]
fps_act = rdFingerprintGenerator.GetFPs(mols_act)

mols_inact = [Chem.MolFromSmiles(x) for x in inact_smiles]
fps_inact = rdFingerprintGenerator.GetFPs(mols_inact)

fps = fps_act + fps_inact
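
`Chem.MolFromSmiles` returns `None` for a SMILES string it cannot parse, and a `None` entry cannot be fingerprinted. A minimal sketch of a more defensive variant follows. It assumes Morgan fingerprints with radius 2 and 2048 bits, which are illustrative choices rather than settings confirmed by the notebook, and the three SMILES strings are made-up examples:

```python
from rdkit import Chem, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.error")  # silence the parse-error message for the bad SMILES

# Hypothetical inputs: two valid SMILES and one with an unclosed ring
smiles_list = ["CCO", "c1ccccc1", "C1CC"]

# Parse each string and keep only the molecules RDKit could build
parsed = [(smi, Chem.MolFromSmiles(smi)) for smi in smiles_list]
valid = [(smi, mol) for smi, mol in parsed if mol is not None]
skipped = [smi for smi, mol in parsed if mol is None]

# Assumed settings: Morgan fingerprint, radius 2, 2048 bits
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
example_fps = [generator.GetFingerprint(mol) for _, mol in valid]

print("kept:", [smi for smi, _ in valid])
print("skipped:", skipped)
print("bits per fingerprint:", example_fps[0].GetNumBits())
```

Keeping the SMILES string next to each parsed molecule makes it easy to report which inputs were dropped.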

Prepare the sample labels:

In [5]:
# One label per fingerprint, in the same order as fps (actives first, then inactives)
tag = ["ACTIVE"] * len(fps_act) + ["INACTIVE"] * len(fps_inact)

### Using the random forest model

Split the data into a training set and a test set:

In [6]:
from sklearn.model_selection import train_test_split
# 20% for testing, 80% for training
X_train, X_test, y_train, y_test = train_test_split(fps, tag, test_size=0.20, random_state = 0)
print(len(X_train), len(y_test))

1868 467
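
With so few actives, a purely random split can leave the test set with an unrepresentative share of them. One option, sketched below on placeholder data with the same class counts (the integer features stand in for the fingerprints and are not from the notebook), is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts:

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the class counts reported earlier: 120 active, 2215 inactive
labels = ["ACTIVE"] * 120 + ["INACTIVE"] * 2215
features = list(range(len(labels)))  # stand-ins for the fingerprints

# Same 80/20 split as above, but preserving the ACTIVE/INACTIVE ratio in both parts
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    features, labels, test_size=0.20, random_state=0, stratify=labels
)

print(len(Xs_train), len(Xs_test))
print("actives in train:", ys_train.count("ACTIVE"))
print("actives in test: ", ys_test.count("ACTIVE"))
```

Whether stratification changes the reported accuracy on the real data is not something this sketch can show; it only illustrates the mechanism.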


Train the model and measure its accuracy on the test set:

In [9]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_jobs=-1, n_estimators=100)
forest.fit(X_train, y_train) # Build a forest of trees from the training set 

from sklearn import metrics
y_pred = forest.predict(X_test) # Predict class for X
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Model Accuracy: %.2f" %accuracy)

Model Accuracy: 0.96
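
Beyond hard labels, a fitted forest can report class probabilities through `predict_proba`, whose columns follow the order of `classes_`. The sketch below uses a tiny made-up dataset (the feature values and the names `toy_forest`, `X_toy`, `y_toy` are illustrative only) to show how the column for `ACTIVE` can be looked up by name instead of being assumed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up two-feature dataset; labels use the same strings as the notebook
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]])
y_toy = ["INACTIVE"] * 4 + ["ACTIVE"] * 4

toy_forest = RandomForestClassifier(n_estimators=10, random_state=0)
toy_forest.fit(X_toy, y_toy)

# classes_ lists the labels in the column order used by predict_proba
print(toy_forest.classes_)
active_col = list(toy_forest.classes_).index("ACTIVE")

# Probability of ACTIVE for one new point
proba = toy_forest.predict_proba(np.array([[6, 6]]))
print("P(ACTIVE) =", proba[0][active_col])
```

For string labels scikit-learn stores the classes in sorted order, so `ACTIVE` comes before `INACTIVE` and column 0 holds the probability of activity.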


### Importing the drug information


In [10]:
df_new = pd.read_csv('./Drug_HUB.csv', delimiter='\t', header=0)
df_new = df_new[['Name', 'SMILES']]
df_new.head()

| | Name | SMILES |
|---|---|---|
| 0 | cefmenoxime | `CO\N=C(\C(=O)NC1C2SCC(CSc3nnnn3C)=C(N2C1=O)C(O...` |
| 1 | ulifloxacin | `CC1Sc2c(C(O)=O)c(=O)c3cc(F)c(cc3n12)N1CCNCC1` |
| 2 | cefotiam | `CN(C)CCn1nnnc1SCC1=C(N2[C@H](SC1)[C@H](NC(=O)C...` |
| 3 | ceftriaxone | `CO\N=C(/C(=O)N[C@H]1[C@H]2SCC(CSc3nc(=O)c(O)nn...` |
| 4 | balofloxacin | `CNC1CCCN(C1)c1c(F)cc2c(c1OC)n(cc(C(O)=O)c2=O)C...` |


Screen the drugs and save the results to a CSV file:

In [11]:
print("Running...")
# Column of predict_proba that holds the probability of the ACTIVE class
active_idx = list(forest.classes_).index('ACTIVE')
rows = []  # one dict per drug predicted to be active
for i, (name, smiles) in enumerate(zip(df_new['Name'], df_new['SMILES']), start=1):
    mol = Chem.MolFromSmiles(smiles)
    fingerPrint = rdFingerprintGenerator.GetFPs([mol])
    y_pred = forest.predict(fingerPrint)
    y_prob = forest.predict_proba(fingerPrint)
    # Progress line: counter, drug name, predicted label, class probabilities
    print('\r', str(i) + "/" + str(len(df_new)), name, y_pred, y_prob)
    if y_pred[0] == 'ACTIVE':
        rows.append({"Name": name, "SMILES": smiles, "Probability": y_prob[0][active_idx]})
# Build the result table once from the collected rows (DataFrame.append no longer exists in pandas 2.x)
df_result = pd.DataFrame(rows, columns=["Name", "Probability", "SMILES"])
print('Finished.')
df_result.to_csv("./Drug_avtive.csv", index = False)


Running...
 1/4496 cefmenoxime ['ACTIVE'] [[0.7 0.3]]
 2/4496 ulifloxacin ['ACTIVE'] [[0.52 0.48]]
 3/4496 cefotiam ['ACTIVE'] [[0.6 0.4]]
 4/4496 ceftriaxone ['ACTIVE'] [[0.6 0.4]]
 5/4496 balofloxacin ['ACTIVE'] [[0.79 0.21]]
 6/4496 cefminox ['ACTIVE'] [[0.58 0.42]]
 7/4496 danofloxacin ['ACTIVE'] [[0.61 0.39]]
 8/4496 besifloxacin ['ACTIVE'] [[0.79 0.21]]
 9/4496 cefazolin ['ACTIVE'] [[0.54 0.46]]
 10/4496 cefodizime ['ACTIVE'] [[0.56 0.44]]
 11/4496 trovafloxacin ['ACTIVE'] [[0.57 0.43]]
 12/4496 cefpirome ['ACTIVE'] [[0.63 0.37]]
 13/4496 cefotiam-cilexetil ['INACTIVE'] [[0.45 0.55]]
 14/4496 sitafloxacin ['ACTIVE'] [[0.62 0.38]]
 15/4496 ceftizoxim ['INACTIVE'] [[0.45 0.55]]
 16/4496 cefmetazole ['ACTIVE'] [[0.54 0.46]]
 17/4496 cefoselis ['ACTIVE'] [[0.57 0.43]]
 18/4496 cefotaxime ['ACTIVE'] [[0.57 0.43]]
 19/4496 ceftazidime ['ACTIVE'] [[0.77 0.23]]
 20/4496 cefetamet ['INACTIVE'] [[0.48 0.52]]
 21/4496 cefamandole ['ACTIVE'] [[0.78 0.22]]
 22/4496 cefuroxime ['INACTIVE'] [[0.