# Upsampling

The baseline neural network models and the initial hyperparameter search showed that the class imbalance in the data causes major problems. To address this, we upsample the data *within each protein* to create a balanced deleterious/beneficial/neutral distribution.
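
As a minimal sketch of what upsampling means here, the toy example below draws rows from a minority class with replacement until it matches the majority count. The toy frame, its values, and the `random_state` are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 4 'Deleterious' rows vs. 1 'Beneficial' row (illustrative values only)
toy = pd.DataFrame({
    "protein": ["A"] * 5,
    "type": ["Deleterious"] * 4 + ["Beneficial"],
})

majority = toy[toy["type"] == "Deleterious"]
minority = toy[toy["type"] == "Beneficial"]

# Draw minority rows with replacement until they match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)

toy_balanced = pd.concat([majority, minority_up])
print(toy_balanced["type"].value_counts())
```

Because the minority rows are duplicated rather than synthesized, the upsampled frame contains repeated rows; no new information is created.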

Based on the initial hyperparameter search, we use 0.05 as the threshold offset going forward, since it struck a better balance in performance.

In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample

In [2]:
# Read in the Data
data = pd.read_csv('data/merged.csv')

In [6]:
data.shape

(49581, 113)

In [3]:
# Convert the continuous scaled effect into a categorical label
def label_type(row):
    if row['scaled_effect'] < 0.9:
        return 'Deleterious'
    elif row['scaled_effect'] > 1.1:
        return 'Beneficial'
    else:
        return 'Neutral'

# Add the categorical label, then drop the continuous target
data['type'] = data.apply(label_type, axis=1)
processed_data = data.drop(['scaled_effect'], axis=1)

# Get Unique proteins
proteins = processed_data.protein.unique()

# Final Upsampled Data Structure
upsampled_data = pd.DataFrame(columns = processed_data.columns)
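
The row-wise `apply` above can also be written as a vectorized operation, which tends to be faster on frames of this size. The sketch below is one possible way to do that. The helper name `label_types` and its `offset` parameter are hypothetical, and it assumes the offset is the half-width of a neutral band centered on 1; the default of 0.1 simply mirrors the 0.9 and 1.1 cutoffs used in the cell above.

```python
import numpy as np
import pandas as pd

def label_types(scaled_effect: pd.Series, offset: float = 0.1) -> pd.Series:
    """Label values relative to a neutral band of [1 - offset, 1 + offset].

    Hypothetical helper: the band definition is an assumption, not the
    notebook's confirmed meaning of "threshold offset".
    """
    lower, upper = 1 - offset, 1 + offset
    labels = np.select(
        [scaled_effect < lower, scaled_effect > upper],
        ["Deleterious", "Beneficial"],
        default="Neutral",
    )
    return pd.Series(labels, index=scaled_effect.index, name="type")

# Illustrative values only
demo = pd.Series([0.5, 0.95, 1.0, 1.05, 1.5])
print(label_types(demo).tolist())
```

With this helper, the labeling step would read `data['type'] = label_types(data['scaled_effect'])`.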

In [4]:
for protein in proteins:
    prot_data = processed_data[processed_data.protein == protein]
    print(protein)

    # Keep every deleterious sample; its count is the per-protein target size
    del_samples = prot_data[prot_data.type == 'Deleterious']
    num_del_samples = del_samples.shape[0]
    print(f"Del Samples: {num_del_samples}")
    upsampled_data = pd.concat([upsampled_data, del_samples])

    ben_samples = prot_data[prot_data.type == 'Beneficial']
    neut_samples = prot_data[prot_data.type == 'Neutral']

    # Upsample with replacement to the deleterious count. Deleterious is
    # *always* the largest class here. A class with no samples cannot be
    # resampled, so it is skipped and stays absent for that protein.
    if ben_samples.shape[0] != 0:
        ben_resampled = resample(ben_samples,
                                 replace=True,
                                 n_samples=num_del_samples)
        upsampled_data = pd.concat([upsampled_data, ben_resampled])
        print(f"Ben Resampled: {ben_resampled.shape[0]}")

    if neut_samples.shape[0] != 0:
        neut_resampled = resample(neut_samples,
                                  replace=True,
                                  n_samples=num_del_samples)
        upsampled_data = pd.concat([upsampled_data, neut_resampled])
        print(f"Neut Resampled: {neut_resampled.shape[0]}")

TEM-1
Del Samples: 9735
Ben Resampled: 9735
Neut Resampled: 9735
Kka2
Del Samples: 12120
Ben Resampled: 12120
Neut Resampled: 12120
Uba1
Del Samples: 528
Neut Resampled: 528
PSD95pdz3
Del Samples: 436
Ben Resampled: 436
Neut Resampled: 436
Pab1
Del Samples: 404
Neut Resampled: 404
hsp90
Del Samples: 109
Neut Resampled: 109
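
The loop above can also be packaged as a reusable function with a fixed seed so that reruns draw the same rows. The sketch below is an alternative and not the code that produced the log: the function name `upsample_within_groups`, its parameters, and the toy data are hypothetical. It also targets the largest class in each group, which matches the loop only under the stated condition that deleterious is always the largest class.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_within_groups(df, group_col="protein", label_col="type", random_state=0):
    """Upsample every class within each group to that group's largest class size."""
    frames = []
    for _, group in df.groupby(group_col, sort=False):
        target = group[label_col].value_counts().max()
        for _, cls in group.groupby(label_col, sort=False):
            if len(cls) == target:
                frames.append(cls)  # already at the target size, kept as-is
            else:
                frames.append(resample(cls, replace=True, n_samples=target,
                                       random_state=random_state))
    # Concatenate once at the end instead of growing a frame inside the loop
    return pd.concat(frames, ignore_index=True)

# Toy data: protein "B" has no beneficial samples, like Uba1, Pab1, and hsp90 above
toy = pd.DataFrame({
    "protein": ["A"] * 6 + ["B"] * 4,
    "type": ["Deleterious"] * 4 + ["Beneficial", "Neutral"]
            + ["Deleterious"] * 3 + ["Neutral"],
})
balanced = upsample_within_groups(toy)
counts = balanced.groupby(["protein", "type"]).size()
print(counts)
```

Classes that are absent from a protein never appear in the inner `groupby`, so they are skipped in the same way as in the loop.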


In [7]:
upsampled_data.to_csv('data/upsampled_data.csv')

In [8]:
upsampled_data.shape

(68955, 112)
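
As a quick sanity check, the row count can be reproduced from the printed log: each protein contributes its deleterious count once for every class present after upsampling. The counts below are copied from the log above; the variable names are illustrative.

```python
# (deleterious count, number of classes present after upsampling), from the log
log = {
    "TEM-1": (9735, 3),
    "Kka2": (12120, 3),
    "Uba1": (528, 2),       # no beneficial samples
    "PSD95pdz3": (436, 3),
    "Pab1": (404, 2),       # no beneficial samples
    "hsp90": (109, 2),      # no beneficial samples
}

expected_rows = sum(n_del * n_classes for n_del, n_classes in log.values())
print(expected_rows)  # 68955, matching upsampled_data.shape[0]
```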