# 🧪 Business Analytics Project: Docked Ligand Analysis

## Step 1: Project Definition and Data Understanding
- **Objective:** Identify the molecular descriptors that influence docking scores of ligands to aid drug discovery.
- **Business Context:** In drug design, understanding which chemical properties lead to better binding affinity can significantly reduce costs and accelerate screening.
- **Key Questions:**
  - What molecular descriptors are most predictive of docking performance?
  - Can we segment ligands into performance-based categories?
  - Can we build predictive models with reasonable accuracy?
- **Data Dictionary:**
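  - `MW`: molecular weight (Dalton); `TPSA`: topological polar surface area; `SlogP`: estimated octanol-water partition coefficient (lipophilicity); `logSw`: log aqueous solubility.
  - `Hdon`, `Hacc`: hydrogen-bond donor and acceptor counts.
  - `LF_dG`, `LF_LE`: docking metrics, read here as the estimated binding free energy and (presumably) ligand efficiency.
  - The remaining columns are further physicochemical descriptors or docking scores.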

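One way to approach the segmentation question is to bin ligands by docking score. The sketch below is a hypothetical illustration on synthetic data, not part of the original analysis: the `demo` frame, the class labels (`strong`, `intermediate`, `weak`), and the quartile cut points are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the docking table; real values would come from the Excel file.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"LF_dG": rng.normal(loc=-8.0, scale=1.5, size=200)})

# A more negative dG means stronger predicted binding, so the lowest
# quartile is labelled "strong". Labels and cut points are illustrative.
demo["binding_class"] = pd.qcut(
    demo["LF_dG"],
    q=[0.0, 0.25, 0.75, 1.0],
    labels=["strong", "intermediate", "weak"],
)

class_counts = demo["binding_class"].value_counts()
print(class_counts)
```

Quantile-based bins keep class sizes predictable; fixed energy thresholds would be the alternative if domain cut-offs are known.
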
## Step 2: Data Collection and Integration

In [None]:
import pandas as pd

# Load dataset
file_path = '/content/VABS_All_18907_docked_ligand_results.xlsx'
df = pd.read_excel(file_path)
df.head()

In [None]:
df.info()

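Before rows are dropped in the next step, it may help to see how much data each column would cost. The following is a minimal sketch on a made-up frame; `missing_summary` and `toy` are hypothetical names, and the values are not from the dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(frame)).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "SlogP": [1.2, np.nan, 3.1, 0.4],
    "TPSA": [75.0, 90.2, np.nan, np.nan],
    "Hdon": [1, 2, 0, 3],
})
report = missing_summary(toy)
rows_kept = len(toy.dropna())
print(report)
print(f"rows surviving dropna(): {rows_kept} of {len(toy)}")
```

If a single sparse column accounts for most of the loss, dropping that column instead of the rows might preserve more ligands.
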
## Step 3: Data Cleaning and Preparation

In [None]:
from sklearn.preprocessing import StandardScaler

# Handle missing values (copy so later column assignments do not trigger SettingWithCopyWarning)
df_clean = df.dropna().copy()

# Rename columns for compatibility
df_clean.columns = df_clean.columns.str.replace(r"[^\w]", "_", regex=True)

# Convert categorical columns to string
df_clean['Library'] = df_clean['Library'].astype(str)
df_clean['Role'] = df_clean['Role'].astype(str)

# Select numerical columns
numerical_cols = [
    'MW_Molecular_Weight__Unit_Dalton', '_Atoms', 'SlogP', 'TPSA', 'Flexibility',
    '_RB', 'LF_Rank_Score', 'LF_dG', 'LF_VSscore', 'LF_LE',
    'tPSA', 'Hacc', 'Hdon', 'logSw'
]

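# Derived feature, computed on the raw counts before scaling
# (a ratio of z-scored columns would have no chemical meaning)
df_clean['Hdon_Hacc_ratio'] = df_clean['Hdon'] / (df_clean['Hacc'] + 1e-5)

# Standardize numerical features (this also rescales the target LF_dG,
# so downstream errors are in standard-deviation units)
scaler = StandardScaler()
df_clean[numerical_cols] = scaler.fit_transform(df_clean[numerical_cols])

# One-hot encode categorical columns
df_encoded = pd.get_dummies(df_clean, columns=['Library', 'Role'], drop_first=True)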

df_encoded.head()

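The cell above fits the scaler on every row, so any later test split has already influenced the scaling statistics. A possible alternative, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn `Pipeline` so the scaler is fit on training rows only. The frame `X_demo`, the target `y_demo`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic descriptors and target; not the real docking data.
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "TPSA": rng.normal(90, 25, n),
    "SlogP": rng.normal(2.5, 1.2, n),
    "logSw": rng.normal(-4, 1, n),
})
y_demo = -6 - 0.02 * X_demo["TPSA"] - 0.5 * X_demo["SlogP"] + rng.normal(0, 0.5, n)

# Split first, then let the pipeline learn scaling statistics from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
scaled_lr.fit(X_tr, y_tr)

fitted_scaler = scaled_lr.named_steps["standardscaler"]
print("training means used for scaling:", fitted_scaler.mean_)
print("held-out R^2:", scaled_lr.score(X_te, y_te))
```

The same pipeline object can be passed to cross-validation helpers, which then refit the scaler inside each fold.
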
## Step 4: Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Distributions
for col in numerical_cols[:8]:
    plt.figure(figsize=(6, 4))
    sns.histplot(df_encoded[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# Box plots
for col in numerical_cols[:8]:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df_encoded[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 8))
corr = df_encoded[numerical_cols].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

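A heatmap with 14 features can be hard to read, and the feature list contains both `TPSA` and `tPSA`, which may be largely redundant. The sketch below lists strongly correlated pairs as a table instead. The helper `top_correlated_pairs`, the 0.9 threshold, and the synthetic `toy` frame are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """List feature pairs whose absolute correlation is at least `threshold`."""
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

# Synthetic data with one deliberately near-duplicate column.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "TPSA": base,
    "tPSA": base + rng.normal(scale=0.05, size=500),
    "SlogP": rng.normal(size=500),
})
redundant = top_correlated_pairs(toy.corr(), threshold=0.9)
print(redundant)
```

On the real data this would be called as `top_correlated_pairs(corr)`; pairs near 1.0 are candidates for dropping one member before regression.
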
## Step 5: Statistical Analysis

In [None]:
from scipy.stats import ttest_ind
import statsmodels.api as sm
import statsmodels.formula.api as smf

# T-tests comparing the best-scoring 25% (most negative LF_dG) with the worst-scoring 25%
q1 = df_encoded['LF_dG'].quantile(0.25)
q3 = df_encoded['LF_dG'].quantile(0.75)
top = df_encoded[df_encoded['LF_dG'] <= q1]
bottom = df_encoded[df_encoded['LF_dG'] >= q3]

for feature in ['MW_Molecular_Weight__Unit_Dalton', 'TPSA', 'SlogP', 'LF_LE', 'logSw']:
    stat, p = ttest_ind(top[feature], bottom[feature])
    print(f"{feature}: t = {stat:.2f}, p = {p:.4f}")

# Regression model
model = smf.ols("LF_dG ~ MW_Molecular_Weight__Unit_Dalton + TPSA + SlogP + LF_LE + logSw", data=df_encoded).fit()
print(model.summary())

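With large groups, a t-test can report a tiny p-value for a difference that is too small to matter, so it may be worth reporting an effect size next to it. The sketch below uses Welch's variant (`equal_var=False`), which does not assume equal variances as the default call does, together with Cohen's d. The helper `cohens_d` and the synthetic groups are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return float((np.mean(a) - np.mean(b)) / np.sqrt(pooled_var))

# Synthetic descriptor values for two groups with a small true difference in means.
rng = np.random.default_rng(7)
strong_binders = rng.normal(loc=0.30, scale=1.0, size=4000)
weak_binders = rng.normal(loc=0.15, scale=1.2, size=4000)

stat, p_value = ttest_ind(strong_binders, weak_binders, equal_var=False)
effect = cohens_d(strong_binders, weak_binders)
print(f"Welch t = {stat:.2f}, p = {p_value:.4g}, Cohen's d = {effect:.2f}")
```

A common rule of thumb treats |d| near 0.2 as small, 0.5 as medium, and 0.8 as large, which helps rank descriptors by practical relevance instead of by p-value alone.
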
## Step 6: Advanced Analytics - Multiple Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X = df_encoded[['MW_Molecular_Weight__Unit_Dalton', 'TPSA', 'SlogP', 'LF_LE', 'logSw']]
y = df_encoded['LF_dG']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

# `estimator` avoids overwriting the statsmodels `model` fitted in Step 5
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    preds = estimator.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    print(f"{name} -> MSE: {mse:.3f}, R²: {r2:.3f}")

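A single train/test split gives one R² estimate and no direct view of which features drive it. The sketch below shows one possible way to add both: k-fold cross-validation for a more stable score, and permutation importance on held-out rows for feature ranking. It runs on synthetic data; `X_demo`, `y_demo`, and the `noise` column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic features: two informative columns and one unrelated to the target.
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    "TPSA": rng.normal(size=n),
    "SlogP": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = 2.0 * X_demo["TPSA"] + 0.5 * X_demo["SlogP"] + rng.normal(scale=0.3, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# Cross-validated R^2: mean and spread over five folds.
cv_r2 = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="r2")
print(f"5-fold R^2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")

# Permutation importance: drop in held-out score when one column is shuffled.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
forest.fit(X_tr, y_tr)
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
importance = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(importance)
```

If `LF_LE` is a ligand efficiency derived from the docking energy itself, as the name suggests, a high importance for it may reflect that derivation more than chemistry; refitting without it is a cheap check.
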
## Step 6 (continued): Clustering and Segment Analysis

In [None]:
from sklearn.cluster import KMeans

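# Clustering (n_init is set explicitly so results do not depend on the scikit-learn version's default)
cluster_features = ['MW_Molecular_Weight__Unit_Dalton', 'TPSA', 'SlogP', 'LF_LE', 'logSw']
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)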
df_encoded['Cluster'] = kmeans.fit_predict(df_encoded[cluster_features])

# Summary of clusters
df_encoded.groupby('Cluster')['LF_dG'].describe()

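The choice of three clusters is fixed in the cell above. One common way to check such a choice is to compare silhouette scores across several values of k, sketched here on synthetic blobs. The data, the four blob centers, and the candidate range 2 to 7 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups.
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.6,
    random_state=42,
)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

On the real descriptors the curve is often flatter than on blobs, so the score is better read as a guide than as a definitive answer.
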
## ✅ Summary & Insights
- Most predictive features: **LF_LE**, **TPSA**, **SlogP**
- Random Forest and Gradient Boosting outperform Linear Regression on the 20% held-out test set
- Best test-set R²: ~0.71
- Clustering revealed distinct ligand groups with differing binding scores
- These findings can guide further virtual screening and synthesis.