<a href="https://colab.research.google.com/github/aysanraza/antibiotic-finder/blob/main/antibiotic_finder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction and Problem Statement

**Introduction**

Beta-lactam antibiotics are a class of antibiotics that inhibit the synthesis of the bacterial cell wall, leading to cell death. They do this by blocking penicillin-binding proteins (PBPs), which are enzymes that catalyze the cross-linking of peptidoglycan, a major component of the bacterial cell wall. Without PBP activity, peptidoglycan cannot be cross-linked, and the bacterial cell wall becomes weak and susceptible to lysis.

Antimicrobial resistance (AMR) is the ability of microorganisms to withstand the effects of antimicrobial drugs. AMR is a serious global health threat, estimated to cause 1-2 million deaths each year. Contributing factors include overuse and misuse of antibiotics, poor infection control practices, and the spread of resistant bacteria.

Drug repurposing is the identification of new therapeutic uses for existing drugs. It is a promising approach to drug discovery, as it can accelerate the development of new treatments for diseases with unmet medical needs. There are many examples of successful drug repurposing, including thalidomide, sildenafil, and rituximab.

**Problem Statement:**

Antibiotic resistance is a growing global concern, and beta-lactams, a major class of antibiotics, have shown a dramatic decline in efficacy in recent years. This creates a need for new alternatives to beta-lactam antibiotics that can be identified quickly. This project aims to broaden the range of candidate solutions to the global problem of antibiotic resistance.

# Imports

In [None]:
#imports

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from google.colab import data_table

# Data Preprocessing

## Data Import

We used the ChEMBL database to extract 25 inhibitors of the ATP-dependent Clp protease proteolytic subunit (ClpP) as the positive class, and drew a random decoy set of 85 compounds from ChEMBL as the negative class. We then used the Mordred toolkit to compute molecular descriptors for all 110 compounds in our dataset. This produced two CSV files, one for the ClpP inhibitors and one for the random compounds, each with 1614 columns: a leading index column followed by 1613 molecular descriptors. The inhibitor file holds 106 rows, of which the first 25 are kept.

⏬

In [None]:
# positive class: ATP-dependent Clp protease proteolytic subunit (ClpP) inhibitors

positive = pd.read_csv("/data/des_4.csv")  # ClpP inhibitor descriptors
print("shape before preprocessing", positive.shape)
positive = positive.iloc[:, 1:]  # drop the leading index column
positive = positive.head(25).copy()  # keep the first 25 compounds as an independent frame
print("shape after preprocessing", positive.shape)

shape before preprocessing (106, 1614)
shape after preprocessing (25, 1613)


In [None]:
# negative class: random decoy compounds

negative = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data/des_2.csv")  # random chemical compounds
print("shape before preprocessing", negative.shape)
negative = negative.iloc[:, 1:].copy()  # drop the leading index column
print("shape after preprocessing", negative.shape)

shape before preprocessing (85, 1614)
shape after preprocessing (85, 1613)
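
The two loading cells repeat the same steps, so they could be folded into one helper. The sketch below is illustrative only: `load_descriptors` and the in-memory CSV are hypothetical stand-ins that do not appear in the notebook, and it assumes the first CSV column is a saved index that should be discarded.

```python
import io

import pandas as pd


def load_descriptors(source, n_rows=None):
    """Read a descriptor CSV, drop the leading index column,
    and optionally keep only the first n_rows compounds."""
    frame = pd.read_csv(source)
    frame = frame.iloc[:, 1:]  # drop the saved index column
    if n_rows is not None:
        frame = frame.head(n_rows)
    return frame.copy()  # independent frame, safe to add columns to


# Tiny made-up CSV standing in for a real descriptor file.
toy_csv = io.StringIO(
    ",ABC,nAcid,MW\n"
    "0,22.4,0,409.2\n"
    "1,28.8,0,536.1\n"
    "2,31.6,0,573.2\n"
)
toy = load_descriptors(toy_csv, n_rows=2)
print(toy.shape)
```

With real files, the same helper would be called once per class, passing `n_rows=25` for the inhibitor file and no limit for the decoy file.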


## Data Wrangling and Cleaning

To prepare the dataset for training a machine learning model, we first added a binary label to each class (1 for ClpP inhibitors, 0 for random compounds) and then concatenated the two classes into a single DataFrame. Missing values were then forward-filled, so each gap takes the last valid value above it in the same column. The rows are not shuffled at this stage; train_test_split shuffles them by default when the dataset is split. These preprocessing steps ensure that the data is in a format the machine learning algorithm can accept and that it contains no missing values.

⏬

In [None]:
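# data wrangling and cleaning

data_table.enable_dataframe_formatter()  # interactive table rendering in Colab
positive['labels'] = 1  # ClpP inhibitors
negative['labels'] = 0  # random decoy compounds
frames = [positive, negative]
data = pd.concat(frames)
print(data.shape)
data.ffill(inplace=True)  # forward-fill missing values (equivalent to fillna(method='pad'))
data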

(110, 1614)


Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2,labels
0,22.426819,18.243028,0,0,37.507567,2.290527,4.581054,37.507567,1.250252,4.285711,...,64.616280,409.200156,7.178950,2846,39,142.0,156.0,10.888889,6.861111,1
1,28.816147,22.557659,0,3,47.134629,2.479550,4.959101,47.134629,1.273909,4.530292,...,73.702766,536.128328,8.647231,4068,57,192.0,221.0,11.652778,8.069444,1
2,31.607127,23.718932,0,0,52.712937,2.425344,4.850687,52.712937,1.285681,4.621618,...,77.780518,573.202872,7.852094,5954,59,206.0,233.0,12.562500,9.111111,1
3,28.628869,22.300380,0,0,48.015221,2.458294,4.916588,48.015221,1.297709,4.525457,...,73.439868,518.124288,8.635405,4202,55,188.0,215.0,10.951389,8.222222,1
4,24.467109,19.634654,0,0,40.628559,2.415312,4.830624,40.628559,1.269642,4.369076,...,67.427325,453.170510,7.552842,2998,44,158.0,177.0,9.979167,7.166667,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,25.315579,20.744317,0,0,37.740036,2.445647,4.880853,37.740036,1.179376,4.393059,...,83.930327,460.133395,9.202668,3310,48,174.0,201.0,13.826389,6.472222,0
81,23.308764,19.230101,0,0,39.837573,2.519325,5.030429,39.837573,1.327919,4.334756,...,79.489530,430.129550,8.115652,2326,50,158.0,189.0,8.590278,6.750000,0
82,22.003940,19.119250,0,0,37.102148,2.533727,5.067453,37.102148,1.279384,4.288204,...,64.782600,396.204907,6.950963,2106,48,148.0,176.0,9.840278,6.666667,0
83,27.728860,21.277320,0,0,43.453727,2.445033,4.880237,43.453727,1.241535,4.486406,...,87.145719,512.085399,10.450722,4426,51,188.0,217.0,12.375000,7.305556,0
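
Forward-filling has two side effects worth checking. A gap in the first row of a column has no earlier value to copy, so it stays missing. And because the positive rows are stacked above the negative rows, a gap in the first negative row is filled from the last positive row. The sketch below shows both on a toy frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two classes (values are made up).
pos = pd.DataFrame({"d1": [np.nan, 2.0], "d2": [5.0, 6.0], "labels": 1})
neg = pd.DataFrame({"d1": [3.0, np.nan], "d2": [np.nan, 8.0], "labels": 0})

toy = pd.concat([pos, neg])
missing_before = int(toy.isna().sum().sum())

filled = toy.ffill()
missing_after = int(filled.isna().sum().sum())

print(missing_before, missing_after)
# The first row of d1 has no earlier value, so it stays NaN.
# The first negative row of d2 is filled from the last positive row.
print(filled)
```

Counting missing values before and after the fill, as done here, is a cheap way to confirm whether any gaps survive in the real dataset.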


## Feature Extraction

We used pandas to extract a sub-DataFrame containing the features from our dataset. The features are the 1613 molecular descriptors, that is, every column except the final labels column, for all 110 compounds from the two sources (ClpP inhibitors and random compounds).

⏬

In [None]:
# features extraction
features = data.iloc[:, :-1]

## Label Extraction

We used pandas to extract a sub-DataFrame containing the labels from our dataset. The labels form a single column with two classes: 1 for ClpP inhibitors and 0 for random compounds.

⏬

In [None]:
# labels extraction

labels = data.iloc[:,-1:]
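
Positional slicing works as long as the labels column stays last. As a hedged alternative, the same split can be made by column name, which does not depend on column order. The toy frame below is made up for illustration. Selecting the labels as a 1-D Series rather than a one-column DataFrame is an assumption about what is convenient downstream, since scikit-learn estimators generally expect a 1-D target.

```python
import pandas as pd

# Toy frame with two descriptor columns and a labels column (made-up values).
toy = pd.DataFrame(
    {"ABC": [22.4, 28.8, 25.3], "MW": [409.2, 536.1, 460.1], "labels": [1, 1, 0]}
)

X = toy.drop(columns="labels")  # every descriptor column
y = toy["labels"]               # 1-D Series of class labels

print(X.shape, y.shape)
```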

## Splitting Dataset

The features and labels parameters are the input and output variables of the dataset, respectively. The test_size parameter specifies the proportion of the dataset to be used for the test set. The random_state parameter is used to ensure that the split is reproducible.

The train_test_split() function returns four variables:

* X_train: The training data features
* X_test: The test data features
* y_train: The training data labels
* y_test: The test data labels

Here test_size is set to 0.33, so about one third of the 110 compounds (37) is held out for testing and the remaining 73 are used for training. The random_state parameter is set to 50, so the same split is produced on every run.

⏬

In [None]:
# splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=50)
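
With only 25 positives among 110 compounds, a purely random split can leave the test set with noticeably more or fewer inhibitors than the training set. The sketch below uses synthetic data with the same class sizes to show the resulting split sizes and how the `stratify` argument of `train_test_split` keeps the class ratio similar in both parts. Stratification is an optional addition here, not something the notebook does, and the random features are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's class sizes: 25 positives, 85 negatives.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(110, 5)))
y = pd.Series([1] * 25 + [0] * 85, name="labels")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=50, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())  # share of positives in each split
```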