# **Bioinformatics Project - Computational Drug Discovery: Bioactivity Prediction App**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a Streamlit web app that predicts the bioactivity of compounds against acetylcholinesterase.

The app uses a Random Forest model, trained here with the same procedure as in Part 4, to make predictions on new compounds.

---

## **1. Install required libraries**

In [None]:
# pickle is part of the Python standard library, so it needs no separate install
! pip install streamlit pandas numpy scikit-learn

## **2. Import libraries**

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
import pickle
import subprocess
import os

## **3. Load the dataset and train/save the model**

In [None]:
# Load the dataset (use updated dataset if available)
df = pd.read_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv')

# Prepare features (same as Part 4)
X = df.drop('pIC50', axis=1)
Y = df.pIC50

# Apply variance threshold (same as Part 4)
selection = VarianceThreshold(threshold=(.8 * (1 - .8)))
X = selection.fit_transform(X)

# Train the model (same as Part 4)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, Y_train)

# Save the model and feature selector; the context managers close the files
with open('bioactivity_model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('feature_selector.pkl', 'wb') as f:
    pickle.dump(selection, f)

print('Model trained and saved successfully!')
print(f'Test-set R² score: {model.score(X_test, Y_test):.3f}')
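
The cutoff `.8 * (1 - .8)` comes from the variance of a binary feature: a bit that is on in a fraction p of the compounds has variance p * (1 - p), so a cutoff of about 0.16 removes bits that take the same value in more than 80% of the compounds. Below is a minimal standard-library sketch of that arithmetic. The column names `bit_a` to `bit_c` and their values are made up for illustration and are not taken from the dataset.

```python
from statistics import pvariance

# Same cutoff as the notebook: variance of a binary feature with p = 0.8
THRESHOLD = 0.8 * (1 - 0.8)

# Hypothetical fingerprint bits for ten compounds (illustrative values only)
columns = {
    'bit_a': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # on in 50% of compounds
    'bit_b': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # on in 90% of compounds
    'bit_c': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never on
}

# Population variance of a 0/1 column equals p * (1 - p)
variances = {name: pvariance(values) for name, values in columns.items()}

# Keep only the columns whose variance exceeds the cutoff
kept = [name for name, var in variances.items() if var > THRESHOLD]

print(variances)
print('kept:', kept)
```

Only the evenly split bit survives here; the nearly constant and constant bits carry too little information to pass the cutoff.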

## **4. Function to calculate PubChem fingerprints from SMILES**

In [None]:
def calculate_pubchem_fingerprints(smiles_list):
    """
    Calculate PubChem fingerprints for a list of SMILES strings using PaDEL-Descriptor
    (Same method as used in Part 3)
    """
    # Create a temporary SMILES file
    with open('molecule.smi', 'w') as f:
        for i, smiles in enumerate(smiles_list):
            f.write(f'{smiles}\tmol_{i}\n')
    
    # Check if PaDEL-Descriptor is available
    if os.path.exists('PaDEL-Descriptor/PaDEL-Descriptor.jar'):
        # Use PaDEL-Descriptor (Java-based, same as Part 3)
        cmd = [
            'java', '-Xms1G', '-Xmx1G', '-Djava.awt.headless=true',
            '-jar', 'PaDEL-Descriptor/PaDEL-Descriptor.jar',
            '-removesalt', '-standardizenitro', '-fingerprints',
            '-descriptortypes', 'PaDEL-Descriptor/PubchemFingerprinter.xml',
            '-dir', './', '-file', 'descriptors_output.csv'
        ]
        subprocess.run(cmd, check=True)
    else:
        st.error('PaDEL-Descriptor not found. Please run Part 3 first to download it.')
        return None
    
    # Load and return the descriptors
    if os.path.exists('descriptors_output.csv'):
        descriptors = pd.read_csv('descriptors_output.csv')
        descriptors = descriptors.drop('Name', axis=1)
        return descriptors
    else:
        return None
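
The function above passes `-dir ./`, which points PaDEL-Descriptor at the whole working directory, so other structure files sitting there may be picked up as well (an assumption about PaDEL's directory handling that is worth verifying). One possible way to sidestep this is to write the SMILES file into a fresh temporary directory and point `-dir` at it. The helper names `write_smiles_file` and `build_padel_command` are hypothetical; the command-line flags are the ones already used above.

```python
import tempfile
from pathlib import Path


def write_smiles_file(smiles_list, directory):
    """Write one 'SMILES<TAB>name' line per molecule and return the file path."""
    path = Path(directory) / 'molecule.smi'
    with open(path, 'w') as f:
        for i, smiles in enumerate(smiles_list):
            f.write(f'{smiles.strip()}\tmol_{i}\n')
    return path


def build_padel_command(input_dir, output_csv, padel_dir='PaDEL-Descriptor'):
    """Assemble the same java command as above, with configurable paths."""
    return [
        'java', '-Xms1G', '-Xmx1G', '-Djava.awt.headless=true',
        '-jar', str(Path(padel_dir) / 'PaDEL-Descriptor.jar'),
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', str(Path(padel_dir) / 'PubchemFingerprinter.xml'),
        '-dir', str(input_dir), '-file', str(output_csv),
    ]


with tempfile.TemporaryDirectory() as workdir:
    smi_path = write_smiles_file(['CCO', 'c1ccccc1'], workdir)
    cmd = build_padel_command(workdir, Path(workdir) / 'descriptors_output.csv')
    written = smi_path.read_text()
    # subprocess.run(cmd, check=True) would go here once PaDEL-Descriptor is present

print(written)
print(cmd)
```

Separating file writing from command construction also lets both pieces be checked without Java or PaDEL-Descriptor installed.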

## **5. Streamlit App**

In [None]:
st.write('''
# Bioactivity Prediction App (Acetylcholinesterase)

This app predicts the **bioactivity** (pIC50 value) of compounds towards the **Acetylcholinesterase** enzyme.

**Credits:** App built in `Python` + `Streamlit` by [Data Professor](https://youtube.com/dataprofessor)
''')

In [None]:
# Sidebar
st.sidebar.header('User Input Features')

# SMILES input
SMILES_input = st.sidebar.text_area("SMILES input", "CCO")

st.sidebar.markdown("""
[Example SMILES](https://www.rdkit.org/docs/GettingStartedInPython.html)
""")
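
`st.sidebar.text_area` returns a single string, so pasting several SMILES on separate lines yields one string with embedded newlines. Below is a small sketch of splitting that text into a clean list before fingerprints are computed. `parse_smiles_input` is a hypothetical helper, not part of the app code in this notebook.

```python
def parse_smiles_input(text):
    """Split text-area content into a list of non-empty SMILES strings."""
    smiles = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Keep only the first whitespace-separated token, so an optional
            # trailing name (as in .smi files) is ignored
            smiles.append(line.split()[0])
    return smiles


example = 'CCO\n\n  CC(=O)Oc1ccccc1C(=O)O aspirin \n'
print(parse_smiles_input(example))
```

The resulting list could be passed to `calculate_pubchem_fingerprints` in place of the single-element list.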

In [None]:
# Calculate descriptors and make prediction
if st.sidebar.button('Predict'):
    try:
        # Calculate fingerprints
        with st.spinner('Calculating molecular descriptors...'):
            descriptors = calculate_pubchem_fingerprints([SMILES_input])
        
        if descriptors is not None:
            # Apply the variance threshold fitted during training
            with open('feature_selector.pkl', 'rb') as f:
                selection = pickle.load(f)
            descriptors_selected = selection.transform(descriptors)

            # Load model and make prediction
            with open('bioactivity_model.pkl', 'rb') as f:
                model = pickle.load(f)
            prediction = model.predict(descriptors_selected)
            
            st.header('Predicted pIC50 Value')
            st.write(f'**{prediction[0]:.2f}**')
            
            # Interpretation
            if prediction[0] > 6:
                st.success('Compound is likely **ACTIVE** (pIC50 > 6)')
            elif prediction[0] < 5:
                st.warning('Compound is likely **INACTIVE** (pIC50 < 5)')
            else:
                st.info('Compound is likely **INTERMEDIATE** (5 ≤ pIC50 ≤ 6)')
            
            st.header('Calculated molecular descriptors')
            st.write(descriptors)
        else:
            st.error('Failed to calculate descriptors. Please check your SMILES input.')
            
    except Exception as e:
        st.error(f'Error: {str(e)}')
        st.info('Please check your SMILES input and ensure all dependencies are installed.')
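
The interpretation thresholds can be pulled into a small pure function, which makes them easy to test outside Streamlit. Since pIC50 is the negative base-10 logarithm of the IC50 in molar units, pIC50 values of 6 and 5 correspond to IC50 values of 1,000 nM and 10,000 nM. The function names below are illustrative and do not appear elsewhere in this notebook.

```python
def classify_pic50(value, active_above=6.0, inactive_below=5.0):
    """Map a predicted pIC50 to the three labels used in the app."""
    if value > active_above:
        return 'active'
    if value < inactive_below:
        return 'inactive'
    return 'intermediate'


def pic50_to_ic50_nm(pic50):
    """Convert pIC50 (-log10 of IC50 in mol/L) to IC50 in nanomolar."""
    return 10 ** (9 - pic50)


for p in (4.2, 5.5, 7.1):
    print(p, classify_pic50(p), f'{pic50_to_ic50_nm(p):.0f} nM')
```

The boundary values 5 and 6 fall into the intermediate class, matching the `5 ≤ pIC50 ≤ 6` message shown by the app.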

## **6. Run the Streamlit App**

In [None]:
# Streamlit apps are launched as scripts rather than from inside a notebook.
# Copy the Streamlit code cells above into a Python file,
# e.g. bioactivity_prediction_app.py, then start the app from a terminal:
#
#   streamlit run bioactivity_prediction_app.py
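
Copying cells by hand is error-prone, so the extraction can be scripted. An `.ipynb` file is JSON with a top-level `cells` list, where each cell has a `cell_type` and a `source`. The sketch below collects the code cells into one script and skips lines starting with `!` or `%`, which are notebook-only shell and magic commands. The function name `notebook_to_script` is hypothetical. Note that this sketch copies every code cell, including the training cell, so the result may still need trimming down to the Streamlit parts.

```python
import json


def notebook_to_script(notebook_path, script_path):
    """Concatenate the code cells of a notebook into a plain Python script."""
    with open(notebook_path, encoding='utf-8') as f:
        notebook = json.load(f)

    chunks = []
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') != 'code':
            continue
        source = cell.get('source', [])
        # 'source' may be stored as one string or as a list of lines
        text = source if isinstance(source, str) else ''.join(source)
        lines = [
            line for line in text.splitlines()
            if not line.lstrip().startswith(('!', '%'))
        ]
        if any(line.strip() for line in lines):
            chunks.append('\n'.join(lines))

    with open(script_path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(chunks) + '\n')
    return len(chunks)
```

With a placeholder notebook name, a call would look like `notebook_to_script('my_notebook.ipynb', 'bioactivity_prediction_app.py')`, after which the script can be started with `streamlit run`.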