# Preprocessing

In this notebook, I build a class that scales the data in two ways: standardized scaling (subtracting the mean and dividing the difference by the standard deviation) and min-max scaling (subtracting the minimum and dividing the difference by the range between the maximum and minimum of the dataset).
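As a rough illustration of the two formulas, here is a minimal NumPy sketch on made-up numbers (the arrays `train` and `test` are hypothetical and not part of the project data). The statistics are computed on the training rows only and then reused for the test rows, which is also how the scikit-learn scalers are used later in this notebook.

```python
import numpy as np

# Hypothetical toy data: 3 training rows and 1 test row, 2 features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

# Standardized scaling: (x - mean) / standard deviation, per column
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses
train_std = (train - mean) / std
test_std = (test - mean) / std  # test data reuses the training statistics

# Min-max scaling: (x - min) / (max - min), per column
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
train_mm = (train - col_min) / col_range
test_mm = (test - col_min) / col_range  # can fall outside [0, 1]

print(train_std)
print(test_mm)
```

Because the minimum and range come from the training rows, a scaled test value can land outside the 0 to 1 interval, as the first feature of the test row does here.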

In addition to the scaling, the class also includes a train/test split, so that the data is ready for the models in the next notebook.
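The split used throughout the class is a stratified 70/30 split. The sketch below shows what that looks like on a small made-up array (the `X` and `y` here are hypothetical placeholders); the `test_size`, `random_state`, and `stratify` arguments mirror the ones used in the class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, 30% positive labels
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# stratify=y keeps the share of each class roughly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying matters most when the classes are imbalanced, since a plain random split could leave the test set with too few examples of one class.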

Finally, the class produces two sets of data representing fictitious proteins of variable length with random amino acid sequences (barring the required start with the methionine residue).
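A minimal sketch of generating one such fictitious sequence is shown below. The helper name `random_protein` and the seed are assumptions made for illustration only; the idea is simply to fix the first residue as methionine (M) and draw the rest at random from the 20 standard amino acids.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def random_protein(length, rng):
    """Return a random sequence of `length` residues that starts with methionine."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # First residue is fixed; the remaining length - 1 residues are drawn uniformly
    return "M" + "".join(rng.choices(AMINO_ACIDS, k=length - 1))


rng = random.Random(7)  # seeded only so the example is repeatable
seq = random_protein(30, rng)
print(seq)
```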

This project aims to determine whether there is a clear way to indicate a high probability that two proteins will interact with each other, based on the qualities of the respective proteins.  The proteins were analyzed according to their sequences; from those sequences, several features were engineered, chiefly quantitative counts of the different kinds of amino acids they contain.  These features include the number of hydrophobic residues, hydrophilic residues, basic residues, etc.
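To make the feature idea concrete, here is a small sketch that counts residues per category for one sequence. The residue groupings are the same ones used in the class further down in this notebook; the function name `count_features` and the example sequence are made up for illustration.

```python
# Residue groups, matching the dictionaries used in the class in this notebook
GROUPS = {
    "phobic": set("AFGILMPVWY"),
    "philic": set("CDEHKNQRST"),
    "basic": set("HKR"),
    "acidic": set("DE"),
    "aromatic": set("FHWY"),
    "sulfur": set("CM"),
}


def count_features(seq):
    """Count how many residues of `seq` fall into each group."""
    counts = {"len": len(seq)}
    for name, residues in GROUPS.items():
        counts[f"{name}_count"] = sum(1 for aa in seq if aa in residues)
    return counts


features = count_features("MKTAYIAKQR")  # hypothetical 10-residue sequence
print(features)
# {'len': 10, 'phobic_count': 5, 'philic_count': 5, 'basic_count': 3,
#  'acidic_count': 0, 'aromatic_count': 1, 'sulfur_count': 1}
```

Note that the groups overlap (for example, H counts as hydrophilic, basic, and aromatic), so the category counts do not have to sum to the sequence length.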

With these features developed, the data will be fed into a classification model, which will predict whether or not protein 1 and protein 2 interact with each other.

Inputs: the features of the proteins

Outputs: a 0 if the proteins do not interact or a 1 if the proteins do interact

The data was originally a set of two CSV files: one for interacting proteins and one for non-interacting proteins.  The individual CSVs each contained two protein sequences.
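One plausible way to turn two such files into a single labeled table is sketched below. This is an assumption about the workflow, not a record of what was actually done: the two small DataFrames stand in for the CSV files, and the column names are borrowed from the feature-engineered data used later in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (real files would be read with pd.read_csv)
positive = pd.DataFrame(
    {"protein_1_seq": ["MKT", "MAA"], "protein_2_seq": ["MLL", "MGG"]}
)
negative = pd.DataFrame({"protein_1_seq": ["MCC"], "protein_2_seq": ["MDD"]})

# Label each source: 1 for interacting pairs, 0 for non-interacting pairs
positive["protein_interaction"] = 1
negative["protein_interaction"] = 0

# Stack the two tables into one dataset with a fresh index
combined = pd.concat([positive, negative], ignore_index=True)
print(combined)
```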

The data comes from Kaggle: https://www.kaggle.com/datasets/spandansureja/ppi-dataset?resource=download

# Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import random

In [164]:
class preprocessing:
    
    def __init__(self, df_path):
        # load the feature-engineered data and drop the unnamed index column stored in the CSV
        self.df = pd.read_csv(df_path)
        self.df = self.df.drop(columns='Unnamed: 0')
        
        
    def show(self):
        return self.df
    
    def basic(self):
        
        df = self.df
        features = list(df.columns[df.columns != 'protein_interaction'])
        
        X = df[features]
        
        y = df['protein_interaction']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 7, stratify=y)
        
        return X_train, X_test, y_train, y_test
        
    
    def standard(self):
        
        df = self.df
        
        features = list(df.columns[df.columns != 'protein_interaction'])
        
        features.remove('protein_1_seq')
        
        features.remove('protein_2_seq')
        
        X = df[features]
        
        y = df['protein_interaction']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 7, stratify=y)
        
        stan_scal = StandardScaler()
        
        # fit on the training data only, then apply the same transform to the test data
        X_train = stan_scal.fit_transform(X_train)
        
        X_test = stan_scal.transform(X_test)
        
        return X_train, X_test, y_train, y_test
    
    
    def minmax(self):
        
        df = self.df
        
        features = list(df.columns[df.columns != 'protein_interaction'])
        
        features.remove('protein_1_seq')
        
        features.remove('protein_2_seq')
        
        X = df[features]
        
        y = df['protein_interaction']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 7, stratify=y)
        
        mm_scal = MinMaxScaler()
        
        # fit on the training data only, then apply the same transform to the test data
        X_train = mm_scal.fit_transform(X_train)
        
        X_test = mm_scal.transform(X_test)
        
        return X_train, X_test, y_train, y_test
    
    
    
    def standard_nonsense(self):
        
        structure = {'protein_1_len':0, '1_phobic_count':0, '1_philic_count':0, '1_basic_count': 0, '1_acidic_count': 0, '1_aromatic_count':0, '1_sulfur_count': 0, 'protein_2_len':0, '2_phobic_count':0, '2_philic_count':0, '2_basic_count': 0, '2_acidic_count': 0, '2_aromatic_count':0, '2_sulfur_count': 0}
        
        new_df = pd.DataFrame(data=structure, index=[0])
        
        amino_acids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
        
        rng = np.random.default_rng()
        
        for x in range(2500):
        
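            size_rng_1 = rng.integers(low=24, high=33000)
        
            # every sequence starts with methionine (M); the remaining residues are random
            seq_rng_1 = ['M'] + random.choices(amino_acids, k=int(size_rng_1) - 1)
            
            size_rng_2 = rng.integers(low=24, high=33000)
        
            seq_rng_2 = ['M'] + random.choices(amino_acids, k=int(size_rng_2) - 1)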
            
            seq_hold = [seq_rng_1, seq_rng_2]
        
            p1_list = []
            p2_list = []

            for i in range(2):

                phobic = {'A':0, 'F':0, 'G':0, 'I':0, 'L':0, 'M':0, 'P':0, 'V':0, 'W':0, 'Y':0, 'phobic_total':0}
                philic = {'C':0, 'D':0, 'E':0, 'H':0, 'K':0, 'N':0, 'Q':0, 'R':0, 'S':0, 'T':0, 'philic_total':0}
                basic = {'H':0, 'K':0, 'R':0, 'basic_total':0}
                acidic = {'D':0, 'E':0, 'acidic_total':0}
                aromatic = {'F':0, 'H':0, 'W':0, 'Y':0, 'aromatic_total':0}
                sulfur = {'C':0 , 'M':0, 'sulfur_total':0}

                seq = seq_hold[i]
                    
                seq_len = len(seq)

                for z in seq:
                    
                    if z in phobic:
                        phobic[z] += 1
                        phobic['phobic_total'] += 1

                    if z in philic:
                        philic[z] += 1
                        philic['philic_total'] += 1

                    if z in basic:
                        basic[z] += 1
                        basic['basic_total'] += 1

                    if z in acidic:
                        acidic[z] += 1
                        acidic['acidic_total'] += 1

                    if z in aromatic:
                        aromatic[z] += 1
                        aromatic['aromatic_total'] += 1

                    if z in sulfur:
                        sulfur[z] += 1
                        sulfur['sulfur_total'] += 1


                if i == 0:

                    p1_list = [seq_len, phobic['phobic_total'], philic['philic_total'], basic['basic_total'], acidic['acidic_total'], aromatic['aromatic_total'], sulfur['sulfur_total']]

                else:

                    p2_list = [seq_len, phobic['phobic_total'], philic['philic_total'], basic['basic_total'], acidic['acidic_total'], aromatic['aromatic_total'], sulfur['sulfur_total']]

            tot_list = p1_list + p2_list
            col_list = [k for k,v in structure.items()]

            moving_dict = dict(zip(col_list, tot_list))

            moving_df = pd.DataFrame(data = moving_dict, index= [0])

            new_df = pd.concat([new_df, moving_df], axis = 0)
            
        new_df = new_df.reset_index(drop=True)
        
        new_df = new_df.drop(index=[0])
        
        # random 0/1 labels: the nonsense data carries no real interaction signal
        zeros_ones = (rng.integers(low=0, high=2, size=2500)).tolist()
            
        new_df['protein_interaction'] = zeros_ones
        
        #return new_df
        
        features = list(new_df.columns[new_df.columns != 'protein_interaction'])
            
        X = new_df[features]
        
        y = new_df['protein_interaction']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 7, stratify=y)
        
        stan_scal = StandardScaler()
        
        X_train = stan_scal.fit_transform(X_train)
        
        X_test = stan_scal.transform(X_test)
        
        return X_train, X_test, y_train, y_test
    
    
    
    def minmax_nonsense(self):
        
        structure = {'protein_1_len':0, '1_phobic_count':0, '1_philic_count':0, '1_basic_count': 0, '1_acidic_count': 0, '1_aromatic_count':0, '1_sulfur_count': 0, 'protein_2_len':0, '2_phobic_count':0, '2_philic_count':0, '2_basic_count': 0, '2_acidic_count': 0, '2_aromatic_count':0, '2_sulfur_count': 0}
        
        new_df = pd.DataFrame(data=structure, index=[0])
        
        amino_acids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
        
        rng = np.random.default_rng()
        
        for x in range(2500):
        
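            size_rng_1 = rng.integers(low=24, high=33000)
        
            # every sequence starts with methionine (M); the remaining residues are random
            seq_rng_1 = ['M'] + random.choices(amino_acids, k=int(size_rng_1) - 1)
            
            size_rng_2 = rng.integers(low=24, high=33000)
        
            seq_rng_2 = ['M'] + random.choices(amino_acids, k=int(size_rng_2) - 1)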
            
            seq_hold = [seq_rng_1, seq_rng_2]
        
            p1_list = []
            p2_list = []

            for i in range(2):

                phobic = {'A':0, 'F':0, 'G':0, 'I':0, 'L':0, 'M':0, 'P':0, 'V':0, 'W':0, 'Y':0, 'phobic_total':0}
                philic = {'C':0, 'D':0, 'E':0, 'H':0, 'K':0, 'N':0, 'Q':0, 'R':0, 'S':0, 'T':0, 'philic_total':0}
                basic = {'H':0, 'K':0, 'R':0, 'basic_total':0}
                acidic = {'D':0, 'E':0, 'acidic_total':0}
                aromatic = {'F':0, 'H':0, 'W':0, 'Y':0, 'aromatic_total':0}
                sulfur = {'C':0 , 'M':0, 'sulfur_total':0}

                seq = seq_hold[i]
                    
                seq_len = len(seq)

                for z in seq:
                    
                    if z in phobic:
                        phobic[z] += 1
                        phobic['phobic_total'] += 1

                    if z in philic:
                        philic[z] += 1
                        philic['philic_total'] += 1

                    if z in basic:
                        basic[z] += 1
                        basic['basic_total'] += 1

                    if z in acidic:
                        acidic[z] += 1
                        acidic['acidic_total'] += 1

                    if z in aromatic:
                        aromatic[z] += 1
                        aromatic['aromatic_total'] += 1

                    if z in sulfur:
                        sulfur[z] += 1
                        sulfur['sulfur_total'] += 1


                if i == 0:

                    p1_list = [seq_len, phobic['phobic_total'], philic['philic_total'], basic['basic_total'], acidic['acidic_total'], aromatic['aromatic_total'], sulfur['sulfur_total']]

                else:

                    p2_list = [seq_len, phobic['phobic_total'], philic['philic_total'], basic['basic_total'], acidic['acidic_total'], aromatic['aromatic_total'], sulfur['sulfur_total']]

            tot_list = p1_list + p2_list
            col_list = [k for k,v in structure.items()]

            moving_dict = dict(zip(col_list, tot_list))

            moving_df = pd.DataFrame(data = moving_dict, index= [0])

            new_df = pd.concat([new_df, moving_df], axis = 0)
            
        new_df = new_df.reset_index(drop=True)
        
        new_df = new_df.drop(index=[0])
        
        # random 0/1 labels: the nonsense data carries no real interaction signal
        zeros_ones = (rng.integers(low=0, high=2, size=2500)).tolist()
            
        new_df['protein_interaction'] = zeros_ones
        
        #return new_df
        
        features = list(new_df.columns[new_df.columns != 'protein_interaction'])
            
        X = new_df[features]
        
        y = new_df['protein_interaction']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 7, stratify=y)
        
        mm_scal = MinMaxScaler()
        
        X_train = mm_scal.fit_transform(X_train)
        
        X_test= mm_scal.transform(X_test)
        
        return X_train, X_test, y_train, y_test
        
        
        
    

In [160]:
df_loaded = preprocessing('feature_engineered.csv')

In [95]:
#reveal = df_loaded.show()

In [96]:
#reveal.head()

In [97]:
#base = df_loaded.basic()

In [98]:
#print (base[0])

In [109]:
#stan = df_loaded.standard()

In [126]:
#stan[0].shape

In [127]:
#minmax = df_loaded.minmax()

In [128]:
#print (minmax[1])

In [129]:
#minmax[2].shape

In [155]:
stan_non = df_loaded.standard_nonsense()

In [156]:
print (stan_non[1])

[[ 1.12379772  1.1267519   1.12063889 ... -1.26430076 -1.24957043
  -1.28293183]
 [-0.8544594  -0.85435459 -0.85440987 ... -0.97910993 -0.96978497
  -0.98903774]
 [-0.36785339 -0.35483857 -0.38080872 ... -1.27908843 -1.31186229
  -1.23974309]
 ...
 [ 0.57132659  0.56945448  0.57309647 ...  0.4648007   0.56481188
   0.53942221]
 [-1.54703788 -1.54584722 -1.54794963 ...  0.77956688  0.78336128
   0.76695312]
 [ 0.07875188  0.07684711  0.08064344 ...  1.26967261  1.22890643
   1.25782999]]


In [158]:
stan_non[1].shape

(750, 14)

In [161]:
minmax_non = df_loaded.minmax_nonsense()

In [162]:
print (minmax_non[0])

[[0.73627743 0.73724875 0.72957541 ... 0.0618095  0.06723827 0.05133411]
 [0.59552611 0.59171848 0.59472757 ... 0.03015826 0.02888087 0.02552204]
 [0.36541852 0.3630108  0.3650003  ... 0.64467005 0.63086643 0.63486079]
 ...
 [0.52729317 0.52507998 0.52542064 ... 0.18333831 0.18306258 0.17024362]
 [0.68491277 0.68280316 0.6817105  ... 0.76470588 0.78203971 0.74767981]
 [0.49106437 0.48795799 0.49037235 ... 0.3592117  0.34792419 0.35817865]]


# Summary:

I now have training and test data that are split appropriately, with both standard and min-max scaling options available.