# Introduction and Problem Statement

**Introduction:**

Parkinson's disease (PD) is a neurodegenerative disorder of the central nervous system that primarily affects the motor system. Its exact pathogenesis remains incompletely understood, and current treatments focus on increasing dopamine levels in the brain rather than addressing the underlying cause. This gap in understanding poses a major challenge for the development of new and effective therapies.

Genome-wide association studies (GWAS) are a powerful tool for identifying genetic variants associated with complex diseases such as PD. To date, more than 50 GWAS of PD have been published, and together they have identified a large number of genetic risk factors for the disease. The data generated by these studies help us to better understand the underlying mechanisms of PD and to develop new diagnostic and therapeutic approaches.

Classification is a supervised machine learning task that involves training a model to predict the class labels of new data points based on a set of training data points with known class labels. Classification algorithms have been widely used in biomedical research to predict disease risk, identify disease subtypes, and develop personalized treatment plans.

**Problem Statement:**

GWAS have provided a deeper understanding of the genetic basis of complex diseases, including PD. However, there is still a need to identify new disease-gene associations for PD. Identifying them has the potential to generate new and improved hypotheses about the pathogenesis of PD.

# Imports

In [None]:
#imports

import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC
from joblib import dump, load
from IPython.display import Image
import matplotlib.pyplot as plt
from sklearn import metrics

# Data Processing

## Data Importation

We used the DisGeNET database to extract 235 Parkinson's disease (PD) gene associations as the positive class. As the negative class, we created a random decoy set of 250 proteins from UniProt. We then used the Python-based tool Propy3 to compute tripeptide features for all 485 proteins. Each protein is described by 8000 features, one for each possible tripeptide of the 20 standard amino acids (20 × 20 × 20 = 8000). This resulted in two files: one with 8000 tripeptide features for each of the 235 PD proteins and the other with 8000 tripeptide features for each of the 250 decoy proteins.
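
To make the 8000-column layout concrete, here is a minimal sketch of how tripeptide counts can be computed for one sequence in plain Python. It is not Propy3's implementation: the function name `tripeptide_counts`, the alphabetical column order, the use of raw counts rather than normalized frequencies, and the example sequence are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every possible tripeptide: 20 ** 3 = 8000 feature names.
# The ordering here is an assumption, not necessarily Propy3's column order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_counts(sequence):
    """Count overlapping tripeptides in a protein sequence (hypothetical helper)."""
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    for i in range(len(sequence) - 2):
        tri = sequence[i:i + 3]
        if tri in counts:  # skip windows containing non-standard residues
            counts[tri] += 1
    return [counts[t] for t in TRIPEPTIDES]


# A made-up 10-residue sequence, used only to exercise the function.
vector = tripeptide_counts("MKVLAAGMKV")
print(len(vector))                        # one value per possible tripeptide
print(vector[TRIPEPTIDES.index("MKV")])   # "MKV" occurs at positions 0 and 7
```

A sequence of length n contributes n - 2 overlapping windows, so most of the 8000 entries are zero for a typical protein, which matches the sparse rows shown in the outputs of this notebook.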

⏬


In [None]:
# dataset importation

df1 = pd.read_csv('/data/tripeptide_data_PD.csv', header=None)
df2 = pd.read_csv('/data/tripeptide_data_non-pd.csv', header=None)

print(df1.shape)
print(df2.shape)

(235, 8000)
(250, 8000)

## Data Wrangling

To prepare the dataset for training a machine learning model, we first added a binary label to each class (1 for PD proteins, 0 for decoy proteins), then concatenated the two classes and shuffled the rows. After shuffling, we handled missing values by forward-filling them, that is, replacing each missing entry with the last valid value above it in the same column. These preprocessing steps ensure that the data is in a format the machine learning algorithms can accept and that it contains no missing values.

⏬

In [None]:
# data wrangling and cleaning

df1['label'] = 1  # positive class: PD-associated proteins
df2['label'] = 0  # negative class: decoy proteins
df = pd.concat([df1, df2], ignore_index=True)
df = shuffle(df)
print(df.shape)
print(df)
df.ffill(inplace=True)  # forward-fill null values (replaces the deprecated fillna(method='pad'))

(485, 8001)
     0  1  2  3  4  5  6  7  8  9  ...  7991  7992  7993  7994  7995  7996  \
276  1  1  0  1  0  1  0  0  0  0  ...     1     0     0     0     1     0   
429  1  1  0  0  0  0  0  1  0  0  ...     0     0     0     0     0     0   
12   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
332  0  1  0  1  1  0  1  0  0  0  ...     0     0     0     1     0     0   
99   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
..  .. .. .. .. .. .. .. .. .. ..  ...   ...   ...   ...   ...   ...   ...   
284  0  0  0  0  0  0  2  0  0  0  ...     0     0     0     0     0     0   
255  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
366  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   
145  2  0  0  0  0  0  1  2  1  1  ...     0     0     0     0     0     0   
229  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0   

     7997  7998  7999  label  
276     0     0     
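
The labelling, concatenation, shuffling, and missing-value steps can be tried on a toy dataset. The sketch below is illustrative only: the tiny frames `pos` and `neg`, the fixed `random_state`, and the extra back-fill and `dropna` variants are assumptions that do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Toy stand-ins for the two tripeptide files (hypothetical values).
pos = pd.DataFrame([[1, 0, 2], [0, np.nan, 1]])
neg = pd.DataFrame([[0, 0, 0], [1, 1, 0], [0, 2, 0]])

pos["label"] = 1  # positive class
neg["label"] = 0  # negative class

toy = pd.concat([pos, neg], ignore_index=True)
toy = shuffle(toy, random_state=0)  # fixed seed so the row order is reproducible

# Count missing values before deciding how to handle them.
n_missing = int(toy.isna().sum().sum())
print("missing values:", n_missing)

# Option 1: forward-fill, then back-fill in case the first row has a gap.
filled = toy.ffill().bfill()

# Option 2: drop any row that contains a missing value.
dropped = toy.dropna()
print(filled.shape, dropped.shape)
```

Forward-filling after a shuffle copies a value from an unrelated protein, so it is worth checking how many values are actually missing; if there are none, both options leave the data unchanged.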

## Features Extraction

We used pandas to select every column except the last one as the feature matrix. The features are the 8000 tripeptide columns for all samples, which come from two sources: the PD proteins and the decoy proteins.

⏬

In [None]:
features = df.iloc[:, :-1]; features

     0  1  2  3  4  5  6  7  8  9  ...  7990  7991  7992  7993  7994  7995  7996  7997  7998  7999
276  1  1  0  1  0  1  0  0  0  0  ...     0     1     0     0     0     1     0     0     0     0
429  1  1  0  0  0  0  0  1  0  0  ...     0     0     0     0     0     0     0     0     0     0
12   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     0     0
332  0  1  0  1  1  0  1  0  0  0  ...     0     0     0     0     1     0     0     0     0     2
99   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     0     0
..  .. .. .. .. .. .. .. .. .. ..  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
284  0  0  0  0  0  0  2  0  0  0  ...     0     0     0     0     0     0     0     0     0     0
255  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     0     0
366  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     1     0
145  2  0  0  0  0  0  1  2  1  1  ...     0     0     0     0     0     0     0     0     1     1


## Labels Extraction

We used pandas to select the last column of the dataset as the labels. Each sample carries one of two class labels: 1 for a PD protein or 0 for a decoy protein.

⏬

In [None]:
labels = df.iloc[:,-1:]; labels

     label
276      0
429      0
12       1
332      0
99       1
..     ...
284      0
255      0
366      0
145      1
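
As an alternative to positional slicing, the features and labels can be selected by column name. The sketch below uses a toy frame with made-up values; the name-based selection and the 1-D label Series are suggestions, not what the notebook does. A 1-D label vector is what scikit-learn estimators expect for `y`, whereas a single-column DataFrame such as `df.iloc[:, -1:]` typically triggers a `DataConversionWarning` at fit time.

```python
import pandas as pd

# Toy frame with two feature columns and a label column (hypothetical values).
toy = pd.DataFrame({0: [1, 0, 2], 1: [0, 0, 1], "label": [1, 0, 1]})

# Select features by dropping the label column instead of relying on position.
X = toy.drop(columns="label")

# Select labels as a 1-D Series rather than a single-column DataFrame.
y = toy["label"]

print(X.shape, y.shape)
```

If the labels are already stored as a single-column DataFrame, `labels.to_numpy().ravel()` yields the same 1-D shape.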


## Splitting Dataset

We split the dataset with scikit-learn's train_test_split() function. The features and labels arguments are the input and output variables of the dataset, respectively. The test_size parameter specifies the proportion of the dataset to hold out for the test set; it is set to 0.33, so about 33% of the samples go to the test set. The random_state parameter is set to 50, which ensures that the split is reproducible.

The train_test_split() function returns four variables:

* X_train: The training data features
* X_test: The test data features
* y_train: The training data labels
* y_test: The test data labels

⏬

In [None]:
# splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=50)
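
To see what this split produces for a dataset of this size, the sketch below repeats it on synthetic data with the same number of samples and the same class counts (235 positive, 250 negative). The random feature values, the reduced feature count of 20, and the `stratify=y` argument are assumptions added for illustration; the original call does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 485 samples, 20 features instead of 8000 to keep it small.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(485, 20))
y = np.array([1] * 235 + [0] * 250)

# stratify=y keeps the class ratio nearly identical in both subsets (an addition).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=50, stratify=y
)

print(X_train.shape, X_test.shape)    # 324 training rows, 161 test rows
print(y_train.mean(), y_test.mean())  # fraction of positives in each subset
```

scikit-learn rounds the test set size up, so test_size=0.33 on 485 samples gives 161 test samples and leaves 324 for training.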