<a href="https://colab.research.google.com/github/aaishams/PeriodHealthGummies/blob/main/bbb_permeability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Blood-Brain Barrier Permeability ML Project**

## **Create Dataset**

### Install the Required Packages

In [46]:
!pip install pubchempy
!pip install rdkit
!pip install adme_py



### List the Compounds

In [47]:
compounds = [
    # --- Simple molecules, solvents, aromatics ---
    "Benzene", "Toluene", "Xylene", "Ethanol", "Methanol", "Isopropanol", "Acetone",
    "Formaldehyde", "Acetic acid", "Formic acid", "Benzoic acid", "Phenol", "Aniline",
    "Pyridine", "Imidazole", "Pyrazole", "Indole", "Quinoline", "Isoquinoline",
    "Naphthalene", "Anthracene", "Phenanthrene", "Styrene", "Cyclohexane",
    "Cyclopentane", "Cyclobutane", "Cyclopropane", "Propane", "Butane", "Pentane",
    "Hexane", "Heptane", "Octane", "Decane", "Ethylene", "Propylene", "Butadiene",
    "Acetonitrile", "Dimethylformamide", "Dimethyl sulfoxide", "Urea", "Thiourea",
    "Glycerol", "Glucose", "Fructose", "Sucrose",

    # --- Amino acids ---
    "Alanine", "Glycine", "Valine", "Leucine", "Isoleucine", "Serine", "Threonine",
    "Cysteine", "Methionine", "Phenylalanine", "Tyrosine", "Tryptophan", "Histidine",
    "Aspartic acid", "Glutamic acid", "Lysine", "Arginine", "Proline",

    # --- NSAIDs & painkillers ---
    "Aspirin", "Paracetamol", "Ibuprofen", "Diclofenac", "Naproxen", "Indomethacin",
    "Ketoprofen", "Celecoxib", "Etoricoxib", "Meloxicam",

    # --- Diabetes (antidiabetics) ---
    "Metformin", "Glibenclamide", "Gliclazide", "Pioglitazone", "Rosiglitazone",
    "Sitagliptin", "Linagliptin", "Saxagliptin", "Canagliflozin", "Dapagliflozin",
    "Empagliflozin",

    # --- Statins ---
    "Atorvastatin", "Simvastatin", "Rosuvastatin", "Pravastatin", "Lovastatin",
    "Fluvastatin", "Pitavastatin",

    # --- Antihypertensives ---
    "Amlodipine", "Nifedipine", "Felodipine", "Verapamil", "Diltiazem",
    "Metoprolol", "Atenolol", "Propranolol", "Carvedilol", "Bisoprolol", "Nebivolol",
    "Losartan", "Valsartan", "Candesartan", "Irbesartan", "Telmisartan", "Olmesartan",
    "Ramipril", "Lisinopril", "Enalapril", "Perindopril", "Captopril", "Quinapril",
    "Hydrochlorothiazide", "Chlorthalidone", "Furosemide", "Bumetanide", "Torsemide",
    "Spironolactone", "Eplerenone",

    # --- Cardiac drugs ---
    "Digoxin", "Amiodarone", "Lidocaine", "Flecainide", "Procainamide", "Quinidine",

    # --- Antiepileptics ---
    "Phenytoin", "Carbamazepine", "Valproic acid", "Lamotrigine", "Levetiracetam",
    "Topiramate", "Gabapentin", "Pregabalin",

    # --- Benzodiazepines ---
    "Diazepam", "Lorazepam", "Alprazolam", "Clonazepam", "Midazolam",

    # --- Antipsychotics ---
    "Haloperidol", "Risperidone", "Olanzapine", "Quetiapine", "Clozapine",
    "Aripiprazole", "Chlorpromazine", "Thioridazine",

    # --- Antidepressants ---
    "Fluoxetine", "Sertraline", "Paroxetine", "Citalopram", "Escitalopram",
    "Venlafaxine", "Duloxetine", "Bupropion", "Mirtazapine", "Trazodone",

    # --- Mood stabilizers / others ---
    "Lithium carbonate", "Thiopental", "Ketamine",

    # --- Opioids / pain management ---
    "Morphine", "Codeine", "Fentanyl", "Tramadol", "Methadone",

    # --- Antibiotics ---
    "Amoxicillin", "Ampicillin", "Penicillin G", "Cefalexin", "Ceftriaxone",
    "Ciprofloxacin", "Levofloxacin", "Moxifloxacin", "Gentamicin", "Streptomycin",
    "Erythromycin", "Azithromycin", "Clarithromycin", "Tetracycline", "Doxycycline",
    "Minocycline", "Chloramphenicol", "Clindamycin", "Vancomycin", "Linezolid",
    "Rifampicin", "Isoniazid", "Pyrazinamide", "Ethambutol",

    # --- Antivirals / antifungals ---
    "Acyclovir", "Zidovudine", "Lamivudine", "Tenofovir", "Oseltamivir",
    "Remdesivir", "Amphotericin B", "Fluconazole", "Itraconazole", "Ketoconazole",

    # --- Simple inorganics / lab chemicals ---
    "Sodium chloride", "Potassium chloride", "Calcium carbonate", "Magnesium sulfate",
    "Ammonium chloride", "Sodium bicarbonate", "Potassium permanganate",
    "Hydrochloric acid", "Sulfuric acid", "Nitric acid", "Phosphoric acid",

    # --- Small alcohols & organics ---
    "Butanol", "Pentanol", "Hexanol", "Octanol", "Phenylethanol", "Allyl alcohol", "Glycolic acid",
    "Lactic acid", "Malic acid", "Succinic acid", "Citric acid", "Oxalic acid", "Tartaric acid",
    "Fumaric acid", "Maleic acid", "Adipic acid", "Glutaric acid", "Stearic acid", "Palmitic acid",
    "Oleic acid", "Linoleic acid", "Linolenic acid",

    # --- Amines ---
    "Methylamine", "Ethylamine", "Aniline", "Piperidine", "Piperazine", "Morpholine",
    "Triethylamine", "Diethylamine", "Dimethylamine", "Imidazole", "Histamine",

    # --- Sugars & related ---
    "Maltose", "Lactose", "Ribose", "Deoxyribose", "Mannose", "Galactose", "Trehalose",
    "Cellobiose", "Starch", "Cellulose",

    # --- Vitamins & cofactors ---
    "Vitamin A", "Vitamin B1", "Vitamin B2", "Vitamin B3", "Vitamin B5", "Vitamin B6",
    "Vitamin B7", "Vitamin B9", "Vitamin B12", "Vitamin C", "Vitamin D", "Vitamin E",
    "Vitamin K1", "Vitamin K2", "Niacinamide", "Folic acid", "Biotin", "Coenzyme Q10",
    "Thiamine pyrophosphate", "Riboflavin mononucleotide", "NADH", "NADPH", "FAD", "FMN",

    # --- Nucleobases & nucleotides ---
    "Adenine", "Guanine", "Cytosine", "Thymine", "Uracil",
    "AMP", "ADP", "ATP", "GMP", "GDP", "GTP", "CMP", "UMP", "TMP",

    # --- Flavonoids & polyphenols ---
    "Quercetin", "Kaempferol", "Myricetin", "Luteolin", "Apigenin", "Genistein", "Daidzein",
    "Catechin", "Epicatechin", "Epigallocatechin gallate", "Hesperidin", "Naringenin",
    "Rutin", "Galangin", "Chrysin", "Baicalein",

    # --- Terpenes ---
    "Limonene", "Menthol", "Camphor", "Pinene", "Myrcene", "Farnesol", "Squalene",
    "Geraniol", "Citral", "Linalool", "Eucalyptol",

    # --- Alkaloids ---
    "Caffeine", "Theobromine", "Theophylline", "Nicotine", "Cotinine", "Cocaine", "Quinine",
    "Reserpine", "Strychnine", "Scopolamine", "Atropine", "Hyoscyamine", "Pilocarpine",
    "Ergotamine", "Psilocybin",

    # --- Common drugs (extensions) ---
    "Omeprazole", "Pantoprazole", "Lansoprazole", "Esomeprazole", "Rabeprazole",
    "Ranitidine", "Famotidine", "Cimetidine", "Sucralfate",

    # --- Cancer drugs ---
    "Cisplatin", "Carboplatin", "Oxaliplatin", "Paclitaxel", "Docetaxel", "Vincristine",
    "Vinblastine", "Etoposide", "Doxorubicin", "Daunorubicin", "Methotrexate",
    "5-Fluorouracil", "Gemcitabine", "Capecitabine", "Cytarabine", "Cyclophosphamide",
    "Ifosfamide", "Tamoxifen", "Letrozole", "Anastrozole", "Imatinib", "Dasatinib",
    "Nilotinib", "Erlotinib", "Gefitinib", "Sorafenib", "Sunitinib", "Bevacizumab", "Trastuzumab",

    # --- Antivirals (HIV, Hepatitis, Influenza, COVID) ---
    "Efavirenz", "Nevirapine", "Delavirdine", "Etravirine", "Raltegravir", "Dolutegravir",
    "Bictegravir", "Abacavir", "Didanosine", "Stavudine", "Zalcitabine",
    "Sofosbuvir", "Velpatasvir", "Ledipasvir", "Glecaprevir", "Pibrentasvir",
    "Favipiravir", "Molnupiravir", "Paxlovid",

    # --- Antifungals (extensions) ---
    "Posaconazole", "Voriconazole", "Caspofungin", "Micafungin", "Anidulafungin",
    "Terbinafine", "Griseofulvin", "Nystatin",

    # --- Steroids & hormones ---
    "Cortisol", "Cortisone", "Prednisone", "Prednisolone", "Dexamethasone", "Hydrocortisone",
    "Betamethasone", "Estradiol", "Progesterone", "Testosterone", "Dihydrotestosterone",
    "Aldosterone", "Dehydroepiandrosterone",

    # --- Antibiotics (extensions) ---
    "Meropenem", "Imipenem", "Ertapenem", "Doripenem",
    "Cefepime", "Ceftazidime", "Cefotaxime", "Cefuroxime", "Cefdinir",
    "Aztreonam", "Tigecycline", "Colistin", "Polymyxin B",

    # --- Misc natural products ---
    "Curcumin", "Resveratrol", "Piperine", "Capsaicin", "Ginkgolide A", "Ginkgolide B",
    "Bilobalide", "Artemisinin", "Artemether", "Dihydroartemisinin",

    # --- Inorganics / salts / acids / bases ---
    "Sodium hydroxide", "Potassium hydroxide", "Calcium hydroxide", "Magnesium hydroxide",
    "Sodium sulfate", "Ammonium sulfate", "Boric acid", "Sodium phosphate",
    "Potassium phosphate", "Ammonium nitrate", "Sodium acetate", "Potassium acetate",

    # --- Metabolites & miscellaneous ---
    "Pyruvate", "Acetoacetate", "Beta-hydroxybutyrate", "Glutathione", "Cystine",
    "Ornithine", "Citrulline", "Homocysteine", "S-Adenosylmethionine", "Creatine",
    "Creatinine", "Bilirubin", "Bile acid", "Cholic acid", "Chenodeoxycholic acid",
    "Lithocholic acid", "Uric acid", "Allopurinol", "Probenecid"
]
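
The list above repeats a couple of names across groups ("Aniline" and "Imidazole" each appear under both the aromatics and the amines), so those compounds are looked up and added to the dataset twice. A minimal sketch of order-preserving de-duplication, shown on a short hypothetical stand-in list (`sample`) rather than on the full `compounds` list:

```python
# Hypothetical stand-in for the full `compounds` list above.
sample = ["Benzene", "Aniline", "Imidazole", "Ethanol", "Aniline", "Imidazole"]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_sample = list(dict.fromkeys(sample))

# Names that occur more than once, reported in order of first appearance.
seen, duplicates = set(), []
for name in sample:
    if name in seen and name not in duplicates:
        duplicates.append(name)
    seen.add(name)

print(unique_sample)
print(duplicates)
```

Applying the same call to `compounds` would change the number of rows, so the outputs shown later in this notebook would no longer match exactly.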


### Calculate the Parameters

In [48]:
import pubchempy as pcp
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, Crippen
from adme_py import ADME

# One row per compound: SMILES, six descriptors (features), and the B3P label
cols = ["Canonical SMILES", "MW", "logP", "HBD", "HBA", "TPSA", "Rotatable Bonds", "B3P"]
details = []
for comp in compounds:
    # Look the compound up on PubChem by name; skip names with no match
    matches = pcp.get_compounds(comp, "name")
    if not matches:
        print(comp, "not found!")
        continue
    smiles = matches[0].connectivity_smiles

    # Parse the SMILES; skip anything RDKit cannot interpret
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(comp, "could not be parsed!")
        continue

    # Physicochemical descriptors used as model features
    molwt = Descriptors.MolWt(mol)
    logp = Crippen.MolLogP(mol)
    hbd = Lipinski.NumHDonors(mol)
    hba = Lipinski.NumHAcceptors(mol)
    tpsa = Descriptors.TPSA(mol)
    rotbonds = Lipinski.NumRotatableBonds(mol)

    # Target label: BBB permeability as computed by adme_py (a prediction,
    # not an experimental measurement), encoded as 1 (permeant) or 0
    adme = ADME(smiles).calculate()
    b3p = int(bool(adme["pharmacokinetics"]["blood_brain_barrier_permeant"]))

    details.append([smiles, molwt, logp, hbd, hba, tpsa, rotbonds, b3p])

Starch not found!
Vitamin D not found!
TMP not found!
Sucralfate not found!
Bevacizumab not found!
Trastuzumab not found!
Polymyxin B not found!
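
The loop above prints each name that PubChem cannot resolve and then moves on. If those names need to be reviewed later, it can help to collect them in a list, and to keep the compound name next to its SMILES string. A minimal sketch, with the lookup passed in as a function so the pattern can be shown without network access; `build_rows`, `lookup`, `fake_table`, and `fake_lookup` are hypothetical names and not part of PubChemPy:

```python
from typing import Callable, Optional


def build_rows(
    names: list[str], lookup: Callable[[str], Optional[str]]
) -> tuple[list[list[str]], list[str]]:
    """Return (rows, missing): one [name, smiles] row per resolved name,
    plus the names for which `lookup` returned None."""
    rows, missing = [], []
    for name in names:
        smiles = lookup(name)
        if smiles is None:
            missing.append(name)
            continue
        rows.append([name, smiles])
    return rows, missing


# Hypothetical offline table standing in for a PubChem query.
fake_table = {"Benzene": "C1=CC=CC=C1", "Ethanol": "CCO"}


def fake_lookup(name: str) -> Optional[str]:
    return fake_table.get(name)


rows, missing = build_rows(["Benzene", "Starch", "Ethanol"], fake_lookup)
print(rows)
print(missing)
```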


### Build the Dataset

In [49]:
input_df = pd.DataFrame(details, columns = cols)
input_df

,Canonical SMILES,MW,logP,HBD,HBA,TPSA,Rotatable Bonds,B3P
0,C1=CC=CC=C1,78.114,1.68660,0,0,0.00,0,0
1,CC1=CC=CC=C1,92.141,1.99502,0,0,0.00,0,0
2,CC1=CC=CC=C1C,106.168,2.30344,0,0,0.00,0,1
3,CCO,46.069,-0.00140,1,1,20.23,0,0
4,CO,32.042,-0.39150,1,1,20.23,0,0
...,...,...,...,...,...,...,...,...
455,CC(CCC(=O)O)C1CCC2C1(CCC3C2C(CC4C3(CCC(C4)O)C)O)C,392.580,4.47790,3,3,77.76,4,0
456,CC(CCC(=O)O)C1CCC2C1(CCC3C2CCC4C3(CCC(C4)O)C)C,376.581,5.50710,2,2,57.53,4,1
457,C12=C(NC(=O)N1)NC(=O)NC2=O,168.112,-1.76720,4,3,114.37,0,0
458,C1=NNC2=C1C(=O)NC=N2,136.114,-0.35380,2,3,74.43,0,0


## **Data Preparation**

### Separate the Features (x) and Target (y)

In [50]:
y = input_df["B3P"]
y

,B3P
0,0
1,0
2,1
3,0
4,0
...,...
455,0
456,1
457,0
458,0


In [51]:
x = input_df[["MW", "logP", "HBD", "HBA", "TPSA", "Rotatable Bonds"]]
x

,MW,logP,HBD,HBA,TPSA,Rotatable Bonds
0,78.114,1.68660,0,0,0.00,0
1,92.141,1.99502,0,0,0.00,0
2,106.168,2.30344,0,0,0.00,0
3,46.069,-0.00140,1,1,20.23,0
4,32.042,-0.39150,1,1,20.23,0
...,...,...,...,...,...,...
455,392.580,4.47790,3,3,77.76,4
456,376.581,5.50710,2,2,57.53,4
457,168.112,-1.76720,4,3,114.37,0
458,136.114,-0.35380,2,3,74.43,0


### Data Splitting

In [52]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 100)
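
The split above samples rows at random, so the proportion of permeant (1) and non-permeant (0) compounds can differ between the training and test sets. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on synthetic data (`x_demo` and `y_demo` are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 30% in the positive class.
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

# With stratification, both parts keep the same share of positives.
print("Positive share (train):", y_tr.mean())
print("Positive share (test): ", y_te.mean())
```

For the notebook's data the equivalent change would be passing `stratify = y` to the existing call; doing so would change which rows land in each set and therefore the results shown below.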

In [53]:
x_train

,MW,logP,HBD,HBA,TPSA,Rotatable Bonds
21,178.234,3.99300,0,0,0.00,0
136,284.746,3.15380,0,2,32.67,1
19,128.174,2.83980,0,0,0.00,0
235,45.085,-0.03500,1,1,26.02,0
37,41.053,0.52988,0,1,23.79,0
...,...,...,...,...,...,...
343,853.918,3.73570,4,14,221.29,10
359,293.374,2.92876,0,5,78.29,4
323,324.424,3.17320,1,4,45.59,4
280,126.115,-0.62838,2,2,65.72,0


In [54]:
x_test

,MW,logP,HBD,HBA,TPSA,Rotatable Bonds
300,610.565,-1.15660,8,15,234.29,7
422,305.418,3.78960,2,3,58.56,9
360,493.615,4.59032,2,7,86.28,7
78,357.435,2.49090,1,6,71.53,7
421,285.343,2.99720,0,3,38.77,3
...,...,...,...,...,...,...
51,105.093,-1.60940,3,3,83.55,2
443,104.105,-0.15810,2,2,57.53,2
1,92.141,1.99502,0,0,0.00,0
308,152.237,2.40170,0,1,17.07,0


In [55]:
y_train

,B3P
21,1
136,1
19,1
235,0
37,0
...,...
343,0
359,1
323,1
280,0


In [56]:
y_test

,B3P
300,0
422,1
360,0
78,1
421,1
...,...
51,0
443,0
1,0
308,1


In [57]:
import numpy as np

# Print NumPy arrays in full rather than abbreviating long ones with "..."
np.set_printoptions(threshold = np.inf)

## **Model Building - Random Forest Classifier**

### Training the Model

In [58]:
from sklearn.ensemble import RandomForestClassifier

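# Shallow trees (max_depth = 2) keep the model simple; random_state fixes the seed for reproducibility
rfc = RandomForestClassifier(max_depth = 2, random_state = 100)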
rfc.fit(x_train, y_train)
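
Once fitted, a random forest exposes `feature_importances_`, which gives a rough picture of which inputs drive the predictions. A minimal sketch on synthetic data, where the label depends only on the first of three made-up features by construction (`x_demo`, `y_demo`, and the feature names are illustrative assumptions, not results from this dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three random features, label determined by feature 0 only.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 3))
y_demo = (x_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(max_depth=2, random_state=100)
model.fit(x_demo, y_demo)

# Importances are non-negative and sum to 1 across features.
for name, score in zip(["f0", "f1", "f2"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

For the notebook's model, the same idea would pair `rfc.feature_importances_` with `x.columns`.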

### Applying the Model to Make a Prediction

In [59]:
y_rfc_train_pred = rfc.predict(x_train)
y_rfc_test_pred = rfc.predict(x_test)

In [60]:
y_rfc_train_pred

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,

In [61]:
y_rfc_test_pred

array([0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 1])

### Evaluate Model Performance

In [62]:
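from sklearn.metrics import mean_squared_error, r2_score

# Note: these are regression metrics. With 0/1 labels and 0/1 predictions, MSE equals
# the misclassification rate (1 - accuracy); R2 is not a standard measure of classifier quality.
rfc_train_mse = mean_squared_error(y_train, y_rfc_train_pred)
rfc_train_r2 = r2_score(y_train, y_rfc_train_pred)

rfc_test_mse = mean_squared_error(y_test, y_rfc_test_pred)
rfc_test_r2 = r2_score(y_test, y_rfc_test_pred)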

In [63]:
print("RFC MSE (train): ", rfc_train_mse)
print("RFC R2 (train): ", rfc_train_r2)
print("RFC MSE (test): ", rfc_test_mse)
print("RFC R2 (test): ", rfc_test_r2)

RFC MSE (train):  0.08695652173913043
RFC R2 (train):  0.5520219119716971
RFC MSE (test):  0.09782608695652174
RFC R2 (test):  0.601923076923077
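
For 0/1 labels and 0/1 predictions, each squared error is 1 for a misclassified compound and 0 otherwise, so the MSE is the fraction of misclassified compounds and accuracy is 1 - MSE. Applied to the MSE values printed above (the variable names here are chosen for illustration):

```python
# MSE values copied from the output above.
train_mse = 0.08695652173913043
test_mse = 0.09782608695652174

# With binary labels and predictions, accuracy = 1 - MSE.
train_accuracy = 1 - train_mse
test_accuracy = 1 - test_mse

print(f"Train accuracy: {train_accuracy:.3f}")  # 0.913
print(f"Test accuracy:  {test_accuracy:.3f}")   # 0.902
```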


In [64]:
rfc_results = pd.DataFrame([rfc_train_mse, rfc_train_r2, rfc_test_mse, rfc_test_r2]).transpose()
rfc_results.columns = ["Training MSE", "Training R2", "Testing MSE", "Testing R2"]

rfc_results

,Training MSE,Training R2,Testing MSE,Testing R2
0,0.086957,0.552022,0.097826,0.601923
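
For a binary classifier, accuracy, precision, recall, and the confusion matrix are usually easier to interpret than MSE and R2. A minimal sketch using scikit-learn's metric functions on short made-up label vectors (`y_true` and `y_pred` are illustrative and are not the notebook's predictions):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = permeant, 0 = non-permeant.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # share of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted positives, share that are truly positive
rec = recall_score(y_true, y_pred)      # of true positives, share that were found
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print("Accuracy: ", acc)
print("Precision:", prec)
print("Recall:   ", rec)
print(cm)
```

In this notebook the same functions could be called on the existing variables, for example `accuracy_score(y_test, y_rfc_test_pred)`.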
