# Automated LigandNet workflow

## Author: Daniel Castaneda Mogollon
### e-mail: <dann.dignus.discere@gmail.com>
### Sirimulla Research Lab
### Last modified: 08/03/2018


### This workflow provides the tools to build a regression or classification model for predicting protein-ligand features.

## Importing modules

In [1]:
import os
import sys
from rdkit import Chem
from rdkit.Chem import AllChem
import glob
from tqdm import tqdm
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import make_scorer, roc_auc_score, recall_score, accuracy_score, precision_score


## Let's start

In [2]:
def start():
    print("What would you like to do? (please type only the number of your choice)")
    print("1. Classification")
    print("2. Regression")
    counter = 0
    answer = input()
    # Re-prompt until the answer is "1" or "2"; give up after five invalid answers
    while answer not in ("1", "2"):
        counter = counter+1
        if counter == 5:
            print("Too many attempts. Exiting program")
            sys.exit(1)
        answer = input("Please type ONLY 1 or 2: ")
    print("\n")
    return answer
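
The same retry loop is needed again later in `model_generation()`. One way to avoid repeating it is a small helper, sketched below. `ask_choice`, its parameters, and the `read` hook are hypothetical names introduced for illustration only; they are not part of the original notebook.

```python
def ask_choice(prompt, valid, max_attempts=5, read=input):
    """Ask until the reply is one of `valid`; return None after too many tries.

    Hypothetical helper for this sketch. `read` defaults to the built-in
    input() and can be replaced by a scripted function when testing.
    """
    answer = read(prompt).strip()
    attempts = 0
    while answer not in valid:
        attempts += 1
        if attempts >= max_attempts:
            return None
        answer = read("Please type one of {}: ".format(", ".join(valid))).strip()
    return answer


# Scripted demo: the first reply is invalid, the second one is accepted.
scripted = iter(["7", "2"])
print(ask_choice("Choice: ", ["1", "2"], read=lambda _prompt: next(scripted)))  # prints 2
```

Returning `None` instead of calling `sys.exit` leaves the decision to exit with the caller, which also makes the helper easier to test.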

## Defining methods
### Here we define a method that runs the MayaChemTools fingerprint script on a temporary SDF file and returns the fingerprint values (don't worry, the temp files will be deleted). This method is used for '.txt' input files.

In [3]:
def TPATF_txt(main_path,temp_path,fingerprints_path,maya_path,pname):
    # Run the fingerprint script on the temporary SDF file; it writes <pname>_tpatf.csv
    f_run = "perl " + maya_path+'TopologicalPharmacophoreAtomTripletsFingerprints.pl'
    command = f_run + " -r " + fingerprints_path+pname+"_tpatf"+" --AtomTripletsSetSizeToUse FixedSize -v ValuesString -o "+temp_path+pname+"temp.sdf"
    os.system(command)
    line = None
    with open(fingerprints_path+pname+"_tpatf.csv","r") as f_file:
        for lines in f_file:
            if 'Cmpd' in lines:
                # Sixth ';'-separated field holds the space-separated values (keeps its trailing newline)
                line = lines.split(';')[5].replace('"','')
                line = ','.join(line.split(' '))
    if line is None:
        raise ValueError("No fingerprint row found for "+pname)
    return line
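
Building the command by string concatenation and passing it to `os.system` breaks when a path contains spaces and silently ignores a failing script. A minimal sketch of the same call with an argument list and `subprocess.run` follows. The flags are copied unchanged from the command above; `build_tpatf_command` and `run_tpatf` are hypothetical helper names, not part of the notebook.

```python
import os
import subprocess


def build_tpatf_command(maya_path, fingerprints_path, temp_path, pname):
    """Return the fingerprint command as an argument list (no shell quoting needed)."""
    script = os.path.join(maya_path, "TopologicalPharmacophoreAtomTripletsFingerprints.pl")
    return [
        "perl", script,
        "-r", os.path.join(fingerprints_path, pname + "_tpatf"),
        "--AtomTripletsSetSizeToUse", "FixedSize",
        "-v", "ValuesString",
        "-o",
        os.path.join(temp_path, pname + "temp.sdf"),
    ]


def run_tpatf(command):
    """Run the command; check=True raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

With `check=True`, a failed Perl run surfaces as an exception that the calling loop can catch and log instead of reading a stale or missing CSV file.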

### This method skips ligand SMILES that cannot be read and writes one fingerprint row per ligand; the temporary files are removed afterwards by `fingerprints()`. It works only for '.txt' input files.

In [4]:
def generateFingerprint_txt(pname,main_path,fingerprints_path,temp_path,maya_path):
    with open(main_path+pname+'.txt','r') as active_file, open(fingerprints_path+pname+'.csv','w') as output:
        for line in active_file:
            try:
                fields = line.rstrip('\n').split('\t')
                smiles = fields[0]
                # Activity value: third tab-separated field, after the '=' sign (may be absent)
                try:
                    pic50 = fields[2].split('=')[1]
                except IndexError:
                    pic50 = None
                mol = Chem.MolFromSmiles(smiles)        # None when the SMILES cannot be parsed
                AllChem.Compute2DCoords(mol)
                mol.SetProp("smiles",smiles)
                w = Chem.SDWriter(temp_path+pname+'temp.sdf')
                w.write(mol)
                w.close()                               # make sure the record is on disk before the script runs
                fingerprints = TPATF_txt(main_path,temp_path,fingerprints_path,maya_path,pname)
                if main_path.endswith('actives/'):
                    if pic50 is None:
                        continue                        # active without an activity value: skip it
                    output.write(smiles+','+pic50+','+fingerprints)
                else:
                    output.write(smiles+','+fingerprints)
            except Exception:
                # Skip ligands that cannot be read or fingerprinted
                continue
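
The '.txt' parsing can be isolated and checked without RDKit. The sketch below assumes the layout implied by the code above: tab-separated fields, SMILES first, and an optional third field of the form `name=value`. The helper name `parse_ligand_line` and the sample field label `pIC50` are assumptions made for illustration.

```python
def parse_ligand_line(line):
    """Split one input line into (smiles, activity); activity is None when absent."""
    fields = line.rstrip("\n").split("\t")
    smiles = fields[0]
    activity = None
    # The activity is read from the third field, after the '=' sign
    if len(fields) > 2 and "=" in fields[2]:
        activity = fields[2].split("=", 1)[1]
    return smiles, activity


print(parse_ligand_line("CCO\tligand_1\tpIC50=7.2\n"))  # prints ('CCO', '7.2')
print(parse_ligand_line("c1ccccc1\n"))                  # prints ('c1ccccc1', None)
```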

### Same method as TPATF_txt, but used for '.sdf' input files

In [5]:
def TPATF_sdf(main_path,temp_path,fingerprints_path,maya_path,pname):
    # Run the fingerprint script on the temporary SDF file; it writes <pname>_tpatf.csv
    f_run = "perl " + maya_path+'TopologicalPharmacophoreAtomTripletsFingerprints.pl'
    command = f_run + " -r " + fingerprints_path+pname+"_tpatf"+" --AtomTripletsSetSizeToUse FixedSize -v ValuesString -o "+temp_path+pname+"temp.sdf"
    os.system(command)
    line = None
    with open(fingerprints_path+pname+"_tpatf.csv","r") as f_file:
        for lines in f_file:
            if 'Cmpd' in lines:
                # Sixth ';'-separated field holds the space-separated values (keeps its trailing newline)
                line = lines.split(';')[5].replace('"','')
                line = ','.join(line.split(' '))
    if line is None:
        raise ValueError("No fingerprint row found for "+pname)
    return line

### Same method as generateFingerprint_txt, but for '.sdf' input files

In [6]:
def generateFingerprint_sdf(pname,main_path,fingerprints_path,temp_path,maya_path):
    with open(main_path+pname+'.sdf') as sdf_file, open(fingerprints_path+pname+'.csv','w') as output2:
        w = open(temp_path+pname+'temp.sdf','w')
        for line in sdf_file:
            w.write(line)
            if '$$$$' in line:                          # end of one molecule record
                w.close()
                try:
                    fingerprint = TPATF_sdf(main_path,temp_path,fingerprints_path,maya_path,pname)
                    output2.write(fingerprint)
                except Exception:
                    pass                                # skip molecules that cannot be fingerprinted
                w = open(temp_path+pname+'temp.sdf','w')
        w.close()                                       # close the last (empty) temporary file
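
The record-splitting step can be expressed as a generator that yields one molecule block at a time, which keeps the file handling separate from the fingerprint call. This is a minimal sketch; `iter_sdf_records` is a hypothetical name, and it matches the `$$$$` delimiter on a line of its own rather than anywhere in the line, which is slightly stricter than the check above.

```python
def iter_sdf_records(lines):
    """Yield each SDF record (text up to and including its '$$$$' line)."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []
    # Lines after the last '$$$$' do not form a complete record and are dropped


sample = ["mol_1\n", "M  END\n", "$$$$\n", "mol_2\n", "M  END\n", "$$$$\n"]
for number, block in enumerate(iter_sdf_records(sample), start=1):
    print(number, block.splitlines()[0])
```

Each yielded block can then be written to the temporary file and passed to the fingerprint script, one molecule at a time.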

### Getting the fingerprints for actives and decoys. The input should be in either '.sdf' or '.txt' format. If it is '.txt', please make sure each line starts with the SMILES string

In [7]:
def fingerprints():
    print("You have chosen a classification approach. Here, we will try to generate a model (of your choice) that will predict") 
    print("if a ligand is an active or decoy, based on your input.")
    print('\n')
    print("For this, we need you to provide TWO files in a 'sdf' format. One file will contain the ACTIVES for a specific") 
    print("protein, and the other will have the DECOYS.")
    print("\n")
    print("In case you don't have the file for decoys, we recommend using http://dude.docking.org/generate or downloading") 
    print("DecoyFinder through http://urvnutrigenomica-ctns.github.io/DecoyFinder/#Downloads_\n")
    main_path = input("Please type the path where all of your files and folders are located (i.e /Users/Daniel/Desktop/ligand_net)")
    if not main_path.endswith('/'):                                 #Adding the last / to the path given
        main_path = main_path+'/'
    os.chdir(main_path)
    actives_path = main_path+'actives/'                             #Defining paths for everything
    decoys_path = main_path+'decoys/'
    temp_path = main_path+'temp/'
    fingerprints_path = main_path+'fingerprints/'
    maya_path = main_path+'mayachemtools/bin/'                      #Folder with the fingerprint program
    print("Getting fingerprints for actives . . .")
    #Getting the fingerprints of actives in .txt or .sdf format
    for pname in tqdm(os.listdir(actives_path)):
        if pname=='.DS_Store':                                      #Ignoring this file
            continue
        elif pname[-4:]=='.txt':
            generateFingerprint_txt(pname[:-4],actives_path,fingerprints_path,temp_path,maya_path)
            os.remove(temp_path+pname[:-4]+'temp.sdf')              #Removing temporary files
            os.remove(fingerprints_path+pname[:-4]+'_tpatf.csv')    #Removing temporary files
        elif pname[-4:]=='.sdf':
            generateFingerprint_sdf(pname[:-4],actives_path,fingerprints_path,temp_path,maya_path)
            os.remove(temp_path+pname[:-4]+'temp.sdf')              #Removing temporary files
            os.remove(fingerprints_path+pname[:-4]+'_tpatf.csv')    #Removing temporary files
            
    print("Fingerprints for actives obtained.")
    print("Getting fingerprints for decoys . . .")
    #Getting the fingerprints of decoys in .txt or .sdf format
    for pname in tqdm(os.listdir(decoys_path)):
        if pname=='.DS_Store':
            continue
        elif pname[-4:]=='.txt':
            generateFingerprint_txt(pname[:-4],decoys_path,fingerprints_path,temp_path,maya_path)
            os.remove(temp_path+pname[:-4]+'temp.sdf')
            os.remove(fingerprints_path+pname[:-4]+'_tpatf.csv')
        elif pname[-4:]=='.sdf':
            generateFingerprint_sdf(pname[:-4],decoys_path,fingerprints_path,temp_path,maya_path)
            os.remove(temp_path+pname[:-4]+'temp.sdf')              #Removing temporary files
            os.remove(fingerprints_path+pname[:-4]+'_tpatf.csv')    #Removing temporary files
            
    print("Fingerprints for decoys obtained.")
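
The function above expects a fixed folder layout under the main path: `actives/`, `decoys/`, `temp/`, `fingerprints/` and `mayachemtools/bin/`. A quick check before starting can save a long run from failing halfway. The sketch below is one way to do it; `REQUIRED_DIRS` and `missing_dirs` are hypothetical names, not part of the notebook.

```python
from pathlib import Path

# Sub-folders that fingerprints() reads from or writes to
REQUIRED_DIRS = ("actives", "decoys", "temp", "fingerprints", "mayachemtools/bin")


def missing_dirs(main_path):
    """Return the required sub-folders that do not exist under main_path."""
    root = Path(main_path)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dirs(main_path)` right after the path prompt and printing any names it returns would tell the user exactly which folder to create.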

### Defining the SVM method.

In [8]:
def svm(fingerprints_path, classifier_loglevel=False):
    print('Reading actives fingerprints . . .')
    frame1 = []                                  # one integer vector per compound, across all files
    for item in tqdm(os.listdir(fingerprints_path)):
        with open(fingerprints_path + item, 'r') as active:
            for lines in active:
                if 'Cmpd' in lines:
                    # Sixth ';'-separated field holds the space-separated fingerprint values
                    line = lines.split(';')[5].strip().replace('"','').split(' ')
                    frame1.append(np.array([int(value) for value in line]))
    active_val = [1] * len(frame1)               # label 1 = active
    c_list = [1,10,100]
    gamma_list = [1,10,100]
    classifier_sv = SVC(class_weight='balanced', kernel='linear', random_state=1, verbose=classifier_loglevel)
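
The `c_list` and `gamma_list` values suggest a grid search over the SVM hyperparameters, although the cell stops before using them. A minimal sketch of how that search could look with `GridSearchCV` and `StratifiedKFold` is given below. Several things here are assumptions: the RBF kernel (chosen because `gamma` has no effect on a linear kernel), ROC AUC as the score, three folds, the helper name `tune_svc`, and the synthetic data that stands in for real fingerprints.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def tune_svc(X, y, c_list=(1, 10, 100), gamma_list=(1, 10, 100), folds=3):
    """Grid-search C and gamma for an RBF SVC, scored by cross-validated ROC AUC."""
    param_grid = {"C": list(c_list), "gamma": list(gamma_list)}
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1)
    search = GridSearchCV(
        SVC(kernel="rbf", class_weight="balanced", random_state=1),
        param_grid,
        scoring="roc_auc",
        cv=cv,
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for fingerprint vectors: 20 "actives" (1) and 20 "decoys" (0)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (20, 5))])
y = np.array([1] * 20 + [0] * 20)
search = tune_svc(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on the full data by default and could be saved with `joblib.dump` for later predictions.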

## Now we will generate models with a variety of machine learning approaches. You can generate all of them or only the one you prefer.

In [9]:
def model_generation():
    print("What would you like us to run for your model(s)?: ")
    print("1. All the Machine Learning approaches we have (Random Forests,Support Vector Machine, Neural Networks, Extra Tree")
    print("Classifier)")
    print("2. Just one")
    all_or_one = input("Please answer with a number (1 or 2):\n")
    counter = 0
    while all_or_one not in ("1", "2"):
        counter = counter+1
        if counter==5:
            print("Too many attempts. Exiting program now.")
            sys.exit(1)
        all_or_one = input("Invalid option, please select 1 to generate all models or 2 for a single one: ")
    if all_or_one == "2":
        print("You have selected only one ML approach. Which one would you like to run? (Please choose a number)")
        print("1. Support Vector Machine (SVM)")
        print("2. Random Forest (RF)")
        print("3. Extra Tree Classifier")
        print("4. Neural Networks (NN)")
        ml_model = input()
        counter=0
        while ml_model not in ("1", "2", "3", "4"):
            counter=counter+1
            if counter==5:
                print("Too many attempts. Exiting program now.")
                sys.exit(1)                      # exit only after five invalid attempts
            ml_model = input("Invalid option, please choose an integer number between 1 and 4: ")
        #if ml_model==str(1):
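
The menu numbers still have to be turned into models. One possible approach, sketched below, is a dictionary that maps each menu number to a function that builds the estimator. Everything here is an assumption for illustration: the names `MODEL_FACTORIES` and `build_models`, the hyperparameters, and the choice of `ExtraTreesClassifier` for the "Extra Tree Classifier" menu entry (scikit-learn also has a single-tree `ExtraTreeClassifier`, and the notebook does not say which one is meant).

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Menu number -> function that creates a fresh, unfitted estimator
MODEL_FACTORIES = {
    "1": lambda: SVC(class_weight="balanced", kernel="linear", random_state=1),
    "2": lambda: RandomForestClassifier(class_weight="balanced", random_state=1),
    "3": lambda: ExtraTreesClassifier(class_weight="balanced", random_state=1),
    "4": lambda: MLPClassifier(random_state=1),
}


def build_models(all_or_one, ml_model=None):
    """Return every estimator for choice "1", or only the selected one for choice "2"."""
    if all_or_one == "1":
        return [make() for make in MODEL_FACTORIES.values()]
    return [MODEL_FACTORIES[ml_model]()]


print([type(model).__name__ for model in build_models("2", "1")])  # prints ['SVC']
```

A lookup table keeps the menu and the model list in one place, so adding a fifth approach means adding one entry rather than another `if` branch.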
           
           

## Main method (where we run everything)

In [11]:
def main():
    answer = start()
    if answer == str(1):
        fingerprints()
        model_generation()
    elif answer == str(2):
        print('Regression is being built at the moment...')
    
    
main()

What would you like to do? (please type only the number of your choice)
1. Classification
2. Regression
1


You have chosen a classification approach. Here, we will try to generate a model (of your choice) that will predict
if a ligand is an active or decoy, based on your input.


For this, we need you to provide TWO files in a 'sdf' format. One file will contain the ACTIVES for a specific
protein, and the other will have the DECOYS.


In case you don't have the file for decoys, we recommend using http://dude.docking.org/generate or downloading
DecoyFinder through http://urvnutrigenomica-ctns.github.io/DecoyFinder/#Downloads_

Please type the path where all of your files and folders are located (i.e /Users/Daniel/Desktop/ligand_net)/Users/Danniel/Desktop/testing_ligandnet


  0%|          | 0/3 [00:00<?, ?it/s]

Getting fingerprints for actives . . .


100%|██████████| 3/3 [00:51<00:00, 17.07s/it]
  0%|          | 0/3 [00:00<?, ?it/s]

Fingerprints for actives obtained.
Getting fingerprints for decoys . . .


100%|██████████| 3/3 [04:50<00:00, 96.97s/it]


Fingerprints for decoys obtained.
What would you like us to run for your model(s)?: 
1. All the Machine Learning approaches we have (Random Forests,Support Vector Machine, Neural Networks, Extra Tree
Classifier)
2. Just one
Please answer with a number (1 or 2):
2
You have selected only one ML approach. Which one would you like to run? (Please choose a number)
1. Support Vector Machine (SVM)
2. Random Forest (RF)
3. Extra Tree Classifier
4. Neural Networks (NN)
1
