# Introduction and Problem Statement

**Introduction**

Beta-lapachone is a naphthoquinone compound that has been shown to have a variety of biological activities, including anti-cancer, anti-inflammatory, and anti-obesity effects. In recent years, there has been growing interest in the potential of beta-lapachone for the treatment of obesity and its associated metabolic disorders.

One of the key mechanisms by which beta-lapachone is thought to exert its anti-obesity effects is through the regulation of adipogenesis, the process by which fat cells are formed. Beta-lapachone has been shown to inhibit adipogenesis by downregulating the expression of key adipogenesis-related genes and by inducing the apoptosis of adipocytes.

Alpha-lapachone is another naphthoquinone compound that has been shown to have a variety of biological activities, including anti-cancer and anti-inflammatory effects. However, the potential anti-obesity activity of alpha-lapachone has not been well studied.

**Problem Statement**

Given the promising anti-obesity activity of beta-lapachone, it is of interest to investigate whether alpha-lapachone, another member of the naphthoquinone class, also possesses anti-obesity activity. This information could lead to the development of new and more effective anti-obesity therapies.

# Imports

In [None]:
# imports

import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC
from IPython.display import Image

# Data Preprocessing

## Data Importation

To screen for potential drug target proteins for alpha-lapachone, we used the SwissTargetPrediction server, which predicts the likely targets of small molecules using several methods, including machine learning. We also used the UniProt database, which holds information on over 230 million proteins from a variety of organisms, to assemble a decoy protein dataset.

Once we had identified the candidate target proteins and the decoy proteins, we used ProPy3, a Python package for computing protein sequence descriptors, to compute 8,000 tripeptide features for each protein. A tripeptide is a short peptide made up of three amino acids, and the 20 standard amino acids give 20 × 20 × 20 = 8,000 possible tripeptides.

As a result of this screening process, we generated a dataset of 336,000 tripeptide feature values for the alpha-lapachone targets (42 proteins × 8,000 tripeptides) and 672,000 for the decoy dataset (84 proteins × 8,000 tripeptides). This dataset will be used to further investigate the potential anti-obesity activity of alpha-lapachone, which could in turn inform the development of new anti-obesity therapies.
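To make the origin of the 8,000 columns concrete, here is a minimal sketch that counts overlapping tripeptides in a sequence using only the standard library. It is not the ProPy3 implementation: the function name `tripeptide_counts`, the column order, and the toy sequence are assumptions made for illustration, and the use of raw counts is inferred from the integer values in the dataset.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20**3 = 8,000 possible tripeptides, in a fixed (assumed) order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_counts(sequence):
    """Count overlapping tripeptides in a protein sequence.

    Returns a list of 8,000 integers aligned with TRIPEPTIDES.
    Windows containing non-standard residues are skipped.
    """
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    for i in range(len(sequence) - 2):
        window = sequence[i:i + 3]
        if window in counts:
            counts[window] += 1
    return [counts[t] for t in TRIPEPTIDES]


# Hypothetical toy sequence, not taken from the dataset.
toy_sequence = "MKVLAAGMKV"
vector = tripeptide_counts(toy_sequence)
print(len(vector))  # number of features per protein
print(sum(vector))  # number of tripeptide windows counted
```

Each protein yields one such 8,000-element vector, so stacking the vectors for 42 proteins gives a 42 × 8,000 table.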

⏬

In [None]:
# data importation

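# tripeptide features of the predicted alpha-lapachone targets (one row per protein)
df1 = pd.read_csv("/data/alpha_lapachone_targets_tripeptide.csv", header=None)
print(df1.shape)
# tripeptide features of the decoy proteins; keep only the first 84 rows
df2 = pd.read_csv('/data/decoy.csv', header=None)
df2 = df2.head(84)
print(df2.shape)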

(42, 8000)
(84, 8000)


## Data Transformation and Wrangling


To prepare the dataset for training a machine learning model, we first added binary labels (1 for alpha-lapachone targets, 0 for decoys), then concatenated the two classes and shuffled the rows. After shuffling, we filled any missing values by forward filling, which copies the value from the preceding row. These preprocessing steps ensure that the data is in a format the machine learning algorithm can accept and that it contains no missing values.
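As a rough illustration of the missing-value step, the sketch below counts missing cells in a toy DataFrame and compares forward filling with dropping incomplete rows. The frame `toy` and its values are hypothetical and unrelated to the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not the real dataset.
toy = pd.DataFrame(
    {0: [1.0, np.nan, 2.0], 1: [0.0, 3.0, 0.0], "label": [1, 0, 1]}
)

n_missing = int(toy.isna().sum().sum())  # total number of missing cells
print(n_missing)

filled = toy.ffill()    # forward fill: copy the value from the previous row
dropped = toy.dropna()  # alternative: discard rows with any missing value

print(filled.shape, dropped.shape)
```

Forward filling keeps every row but borrows values from a neighbouring sample, and a missing value in the very first row would stay missing. Dropping rows avoids borrowed values at the cost of losing samples.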

⏬

In [None]:
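# data transformation and wrangling

df1['label'] = 1  # positive class: alpha-lapachone targets
df2['label'] = 0  # negative class: decoy proteins
df = pd.concat([df1, df2], ignore_index=True)
df = shuffle(df)  # no random_state is set, so the row order differs between runs
print(df.shape)
print(df)
df = df.ffill()  # forward-fill missing values (replaces the deprecated fillna(method='pad'))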

(126, 8001)
     0  1  2  3  4  5  6  7  8  9  ...  7991  7992  7993  7994  7995  7996  \
27   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     1     0   
99   2  0  0  0  0  0  0  0  0  1  ...     0     0     0     1     0     0   
38   2  0  1  1  0  0  0  0  0  0  ...     0     0     0     0     0     0   
87   0  0  0  0  0  0  0  0  0  0  ...     1     0     0     0     0     0   
69   0  0  0  2  0  0  0  0  0  1  ...     0     0     0     0     0     0   
..  .. .. .. .. .. .. .. .. .. ..  ...   ...   ...   ...   ...   ...   ...   
76   1  0  1  1  0  0  0  0  1  0  ...     0     0     2     1     0     0   
119  0  1  2  0  0  2  0  4  0  1  ...     1     0     0     0     1     0   
1    0  0  0  0  1  1  0  0  0  0  ...     0     0     0     1     0     0   
111  1  0  1  1  0  1  2  2  0  2  ...     0     1     1     2     2     0   
108  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   

     7997  7998  7999  label  
27      0     0     

## Features Extraction

We used pandas to extract a sub-DataFrame containing the features, that is, every column except the last one (`label`). The features are the 8,000 tripeptide columns for all 126 samples, which come from two sources: the alpha-lapachone targets and the decoy proteins.

⏬

In [None]:
# features extraction

features = df.iloc[:, :-1]; features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7990,7991,7992,7993,7994,7995,7996,7997,7998,7999
27,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
99,2,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
38,2,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
87,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
69,0,0,0,2,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,1,0,1,1,0,0,0,0,1,0,...,0,0,0,2,1,0,0,0,0,0
119,0,1,2,0,0,2,0,4,0,1,...,1,1,0,0,0,1,0,0,0,0
1,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
111,1,0,1,1,0,1,2,2,0,2,...,0,0,1,1,2,2,0,0,0,0


## Labels Extraction

We used pandas to extract a sub-DataFrame containing the labels, that is, the last column of the dataset. Each label takes one of two values: 1 for an alpha-lapachone target and 0 for a decoy protein.

⏬

In [None]:
# labels extraction

labels = df.iloc[:,-1:]; labels

|  | label |
|---|---|
| 27 | 1 |
| 99 | 0 |
| 38 | 1 |
| 87 | 0 |
| 69 | 0 |
| ... | ... |
| 76 | 0 |
| 119 | 0 |
| 1 | 1 |
| 111 | 0 |
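Slicing with `iloc[:, -1:]` returns a one-column DataFrame rather than a one-dimensional Series. scikit-learn estimators generally expect a one-dimensional target and may emit a `DataConversionWarning` when given a column vector. The sketch below, on a hypothetical toy frame, shows the shapes involved and two ways to obtain a one-dimensional target.

```python
import pandas as pd

# Hypothetical toy frame; not the real dataset.
toy = pd.DataFrame({0: [0, 2, 1], 1: [1, 0, 0], "label": [1, 0, 1]})

labels_frame = toy.iloc[:, -1:]                # DataFrame with a single column
labels_series = toy["label"]                   # one-dimensional Series
labels_flat = labels_frame.to_numpy().ravel()  # one-dimensional NumPy array

print(labels_frame.shape, labels_series.shape, labels_flat.shape)
```

Either `labels_series` or `labels_flat` could be passed as the target when fitting a model, if the warning becomes a nuisance.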


## Splitting Dataset

We split the dataset into training and test sets with scikit-learn's train_test_split() function. The features and labels arguments are the input and output variables of the dataset, respectively. The test_size parameter specifies the proportion of the dataset to hold out for the test set, and the random_state parameter seeds the shuffling so that the split is reproducible.

The train_test_split() function returns four variables:

* X_train: The training data features
* X_test: The test data features
* y_train: The training data labels
* y_test: The test data labels

Here test_size is set to 0.3, so 30% of the dataset (38 of the 126 samples) is used for the test set and the remaining 88 samples are used for training. The random_state parameter is set to 50.

⏬

In [None]:
# splitting dataset

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=50)
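
The sizes produced by this split can be checked on synthetic data with the same class balance (42 positives, 84 negatives). The sketch below uses placeholder arrays `X` and `y` rather than the real features. The second call adds `stratify=y`, which is an optional variant and not part of the original notebook; it keeps the class ratio roughly equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's class sizes: 42 positives, 84 negatives.
X = np.zeros((126, 5))  # 5 placeholder features instead of 8,000
y = np.array([1] * 42 + [0] * 84)

# Same arguments as the notebook's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50
)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not in the original notebook):
# stratify=y preserves the 1:2 class ratio in both subsets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.3, random_state=50, stratify=y
)
print(int(ys_train.sum()), int(ys_test.sum()))
```

With only 126 samples, an unstratified split can leave the test set with noticeably more or fewer positives than the 1:2 ratio would suggest, which is why stratification may be worth considering.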