## Data processing

This Jupyter notebook converts binary attributes to +/-1 and categorical attributes to one-hot encoding.

In [1]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

We load the data produced by the `data cleaning` step.

In [2]:
Xy = np.loadtxt('nki_cleaned.dat', dtype = 'str')

print(Xy.shape)
print(Xy)

(272, 1567)
[['43.0' '14.817248000000001' '14.817248000000001' ... '-0.369153'
  '0.15379500000000002' '0.0']
 ['48.0' '14.261465' '14.261465' ... '-0.643019' '-0.014098' '0.0']
 ['38.0' '6.644764' '6.644764' ... '0.258495' '-0.19891099999999998'
  '0.0']
 ...
 ['50.0' '2.6192' '2.149213' ... '0.25190300000000004' '-0.822792' '1.0']
 ['52.0' '2.2905' '2.209446' ... '0.35681599999999997' '0.345088' '1.0']
 ['52.0' '3.737' '2.12731' ... '-1.0893030000000001' '-0.326193' '1.0']]


### Attributes

We count the number of unique values (`nu`) in each column to get an idea of which variables are continuous, binary, or categorical. The thresholds depend on the data, but as a rule of thumb: nu = 2 --> binary; nu = 3 or 4 --> categorical; nu > 4 --> continuous. This is only a heuristic, so the data should still be inspected in detail.

In [3]:
X = Xy[:,:-1]
l,n = X.shape
nu = np.array([len(np.unique(X[:,i])) for i in range(n)])
print('number of uniques of each variable:')
print(nu)

number of uniques of each variable:
[ 27 261 262 ... 272 272 272]


We then classify each variable as binary, categorical, or continuous according to these thresholds, and print the indices of the binary and categorical ones.

In [4]:
i_binary = []
i_category = []
i_continuous = []
for i in range(X.shape[1]):
    nu = np.unique(X[:,i])
    if len(nu) == 2: # binary 
        i_binary.append(i)
    elif len(nu) < 5:
        i_category.append(i)
    else:
        i_continuous.append(i)
        
print('i_binary:',i_binary)
print('i_category:',i_category)

i_binary: [3, 4, 5]
i_category: [9, 10, 11]
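
Because the unique-count rule is only a heuristic, it can help to look at the actual values in the flagged columns before encoding them. The following is a minimal sketch on a made-up toy array; `show_unique_values` and `X_toy` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np

def show_unique_values(X, column_indices, max_values=10):
    """Return {column index: list of unique values} for the given columns."""
    summary = {}
    for i in column_indices:
        values = np.unique(X[:, i])
        # Truncate long lists so that the summary stays readable.
        summary[i] = values[:max_values].tolist()
    return summary

# Toy stand-in for X (strings, as loaded with dtype='str'); the values are made up.
X_toy = np.array([
    ['43.0', '0', '1'],
    ['48.0', '1', '2'],
    ['38.0', '0', '3'],
    ['50.0', '1', '1'],
])

for i, values in show_unique_values(X_toy, [1, 2]).items():
    print(i, values)
```

On the real data, the same helper could be called as `show_unique_values(X, i_binary + i_category)`.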


We then encode the type of each variable as a number: 1 for continuous, 2 for binary, 3 for categorical.

In [5]:
variable_type  = np.ones(n) # continuous
variable_type[i_binary] = 2 # binary
variable_type[i_category] = 3 # categorical

print(variable_type)

[1. 1. 1. ... 1. 1. 1.]
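
Since the printed array is elided, counting how many columns fall into each type is a more informative check. A small sketch, assuming a made-up type vector (`variable_type_toy` is hypothetical):

```python
import numpy as np

# Toy type vector: 8 columns, two binary and one categorical (made-up indices).
variable_type_toy = np.ones(8)      # 1: continuous
variable_type_toy[[3, 4]] = 2       # 2: binary
variable_type_toy[[6]] = 3          # 3: categorical

# Count how many columns were assigned to each type code.
types, counts = np.unique(variable_type_toy, return_counts=True)
type_counts = {int(t): int(c) for t, c in zip(types, counts)}
print(type_counts)
```

The same two lines with `variable_type` in place of `variable_type_toy` give the counts for the real data.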


We now convert the binary variables to +/-1 and the categorical variables to one-hot encoding; continuous variables are left unchanged.

In [6]:
def convert_binary_and_category(x,variable_type):
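    """
    Convert binary columns to -1/+1 and categorical columns to one-hot;
    leave continuous columns unchanged.
    """
    
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output` (`sparse` was removed in 1.4).
    onehot_encoder = OneHotEncoder(sparse=False,categories='auto')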

    # create 2 initial columns
    x_new = np.zeros((x.shape[0],2))

    for i,i_type in enumerate(variable_type):
        if i_type == 1: # continuous
            x_new = np.hstack((x_new,x[:,i][:,np.newaxis]))
        elif i_type == 2: # binary
            unique_value = np.unique(x[:,i])
            x1 = np.array([-1. if value == unique_value[0] else 1. for value in x[:,i]])        
            x_new = np.hstack((x_new,x1[:,np.newaxis]))
        else: # category      
            x1 = onehot_encoder.fit_transform(x[:,i].reshape(-1,1))
            x_new = np.hstack((x_new,x1))        

    # drop the 2 initial columns
    x_new = x_new[:,2:]
    
    return x_new.astype(float)
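
To see what this encoding produces, it can be reproduced by hand on a tiny array. The sketch below is not the function above: it is a NumPy-only variant (no `OneHotEncoder`) written for illustration, and `encode_toy` and `x_toy` are hypothetical names with made-up values.

```python
import numpy as np

def encode_toy(x, variable_type):
    """Binary -> -1/+1, category -> one-hot, continuous unchanged (NumPy only)."""
    columns = []
    for i, i_type in enumerate(variable_type):
        col = x[:, i]
        if i_type == 1:  # continuous: just cast to float
            columns.append(col.astype(float)[:, np.newaxis])
        elif i_type == 2:  # binary: first sorted unique value -> -1, the other -> +1
            first = np.unique(col)[0]
            columns.append(np.where(col == first, -1.0, 1.0)[:, np.newaxis])
        else:  # category: one indicator column per sorted unique value
            _, codes = np.unique(col, return_inverse=True)
            codes = codes.ravel()
            columns.append(np.eye(codes.max() + 1)[codes])
    return np.hstack(columns)

x_toy = np.array([
    ['0.5', 'no',  'a'],
    ['1.5', 'yes', 'c'],
    ['2.5', 'no',  'b'],
])
encoded = encode_toy(x_toy, variable_type=[1, 2, 3])
print(encoded.shape)  # (3, 5)
print(encoded)
```

Each categorical column is replaced by one indicator column per category, so the output is wider than the input: here 3 columns become 5.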

In [7]:
# convert X
X_new = convert_binary_and_category(X,variable_type)

print(X_new.shape)
print(X_new)

(272, 1572)
[[ 4.3000000e+01  1.4817248e+01  1.4817248e+01 ... -8.7631000e-02
  -3.6915300e-01  1.5379500e-01]
 [ 4.8000000e+01  1.4261465e+01  1.4261465e+01 ... -2.3154700e-01
  -6.4301900e-01 -1.4098000e-02]
 [ 3.8000000e+01  6.6447640e+00  6.6447640e+00 ... -1.1429800e-01
   2.5849500e-01 -1.9891100e-01]
 ...
 [ 5.0000000e+01  2.6192000e+00  2.1492130e+00 ... -5.1088400e-01
   2.5190300e-01 -8.2279200e-01]
 [ 5.2000000e+01  2.2905000e+00  2.2094460e+00 ... -3.9653100e-01
   3.5681600e-01  3.4508800e-01]
 [ 5.2000000e+01  3.7370000e+00  2.1273100e+00 ...  7.9495200e-01
  -1.0893030e+00 -3.2619300e-01]]


### Target

In [8]:
## target
y = Xy[:,-1].astype(float)

print(np.unique(y,return_counts=True))

(array([0., 1.]), array([195,  77]))


In [9]:
y_new = y


print(np.unique(y_new,return_counts=True))

(array([0., 1.]), array([195,  77]))


In [10]:
# combine X and y
Xy_new = np.hstack((X_new,y_new[:,np.newaxis]))

In [11]:
np.savetxt('nki_processed.dat',Xy_new,fmt='%f')
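
As a sanity check, the saved file can be read back and compared with the array in memory. The sketch below does this on a made-up toy array written to a temporary directory; `Xy_toy` and the file name `toy_processed.dat` are hypothetical.

```python
import os
import tempfile
import numpy as np

# Toy array standing in for Xy_new; the values are made up.
Xy_toy = np.array([[43.0, -1.0, 0.153795, 0.0],
                   [52.0,  1.0, -1.089303, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_processed.dat')  # hypothetical file name
    np.savetxt(path, Xy_toy, fmt='%f')
    reloaded = np.loadtxt(path)

# '%f' writes six digits after the decimal point, so compare with a tolerance.
shape_ok = reloaded.shape == Xy_toy.shape
values_ok = np.allclose(reloaded, Xy_toy, atol=1e-6)
print(shape_ok, values_ok)
```

For the real data, the equivalent check would compare `np.loadtxt('nki_processed.dat')` against `Xy_new`.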