## Data processing

This Jupyter notebook converts binary attributes to +/-1 and categorical attributes to one-hot encoding.

In [6]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

We load the data produced by the `data cleaning` step.

In [7]:
Xy = np.loadtxt('wisconsin_cleaned.dat', dtype = 'str')

print(Xy.shape)
print(Xy)

(569, 31)
[['17.99' '10.38' '122.8' ... '0.4601' '0.1189' 'M']
 ['20.57' '17.77' '132.9' ... '0.275' '0.08902' 'M']
 ['19.69' '21.25' '130.0' ... '0.3613' '0.08757999999999999' 'M']
 ...
 ['16.6' '28.08' '108.3' ... '0.2218' '0.0782' 'M']
 ['20.6' '29.33' '140.1' ... '0.4087' '0.124' 'M']
 ['7.76' '24.54' '47.92' ... '0.2871' '0.07039' 'B']]


### Attributes

We count the unique values in each column to get an idea of which variables are continuous, binary, or categorical. This depends on the data, but as a rule of thumb nu = 2 suggests binary, nu = 3 or 4 suggests categorical, and nu > 4 suggests continuous. Of course, we still have to look at the data in detail as well.

In [8]:
X = Xy[:,:-1]
l,n = X.shape
nu = np.array([len(np.unique(X[:,i])) for i in range(n)])
print('number of uniques of each variable:')
print(nu)

number of uniques of each variable:
[456 479 522 539 474 537 537 542 432 499 540 519 533 528 547 541 533 507
 498 545 457 511 514 544 411 529 539 492 500 535]
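
The rule of thumb on unique counts can be written as a small helper. This is only a sketch: the function name `guess_variable_type` and the cutoff `max_categories=4` are assumptions made for illustration, and the result should still be checked against the data by eye.

```python
import numpy as np

def guess_variable_type(nu, max_categories=4):
    """Guess a type code per column from its number of unique values.

    1: continuous, 2: binary, 3: category.
    `max_categories` is an assumed cutoff, not a fixed rule.
    """
    nu = np.asarray(nu)
    variable_type = np.ones(nu.shape[0])                   # default: continuous
    variable_type[nu == 2] = 2                             # exactly two values: binary
    variable_type[(nu > 2) & (nu <= max_categories)] = 3   # a few values: category
    return variable_type

# toy unique counts (made up for illustration)
print(guess_variable_type([456, 2, 3, 4, 5]))
```

Applied to the counts printed above, such a helper would mark every column as continuous, since the smallest count is 411.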


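We then define the variable type: 1 for continuous, 2 for binary, 3 for categorical. Every column here has far more than 4 unique values, so all attributes are treated as continuous.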

In [9]:
variable_type  = np.ones(n) # continuous


print(variable_type)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
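
For a data set that does contain binary or categorical columns, the conversion to +/-1 and one-hot could look roughly like the sketch below. The helper `convert_columns` and the toy array are hypothetical, and the one-hot step uses plain NumPy rather than the `OneHotEncoder` imported at the top, to keep the example free of version-specific arguments. Which binary value maps to -1 is also an assumption (the first one in sorted order).

```python
import numpy as np

def convert_columns(X, variable_type):
    """Convert a string array column by column according to its type code."""
    converted = []
    for i in range(X.shape[1]):
        col = X[:, i]
        if variable_type[i] == 1:
            # continuous: cast to float
            converted.append(col.astype(float)[:, np.newaxis])
        elif variable_type[i] == 2:
            # binary: first unique value (sorted) -> -1, the other -> +1
            values = np.unique(col)
            converted.append(np.where(col == values[0], -1.0, 1.0)[:, np.newaxis])
        else:
            # category: one output column per unique value (one-hot)
            values, codes = np.unique(col, return_inverse=True)
            converted.append(np.eye(len(values))[codes])
    return np.hstack(converted)

# toy data (made up): one continuous, one binary, one categorical column
X_toy = np.array([['1.5', 'no', 'a'],
                  ['2.0', 'yes', 'b'],
                  ['0.3', 'no', 'c']])
print(convert_columns(X_toy, [1, 2, 3]))
```

With the toy input, the three original columns expand to five: one float column, one +/-1 column, and three one-hot columns.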


### Target

In [12]:
## target
y = Xy[:,-1]

print(np.unique(y,return_counts=True))

(array(['B', 'M'], dtype='<U21'), array([357, 212]))


In [13]:
# convert target to 0 (B) and 1 (M)
y_new = np.ones(y.shape[0])
y_new[y =='B'] = 0

print(np.unique(y_new,return_counts=True))

(array([0., 1.]), array([357, 212]))


In [14]:
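# combine X and y
# X_new is not defined in the cells shown; since every attribute is
# continuous, we assume it is simply X cast to float
X_new = X.astype(float)
Xy_new = np.hstack((X_new,y_new[:,np.newaxis]))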

In [15]:
np.savetxt('wisconsin_processed.dat',Xy_new,fmt='%f')
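
A quick way to check the saved file is to load it back and compare it with the array in memory. The sketch below does this on a made-up toy array written to a temporary file (the file name `toy_processed.dat` is hypothetical). Note that `fmt='%f'` writes six decimal places, so the comparison only holds up to that rounding.

```python
import os
import tempfile
import numpy as np

# toy array (made up): one feature column and one 0/1 target column
Xy_toy = np.array([[0.08757999999999999, 1.0],
                   [0.07039, 0.0]])

path = os.path.join(tempfile.mkdtemp(), 'toy_processed.dat')  # hypothetical file name
np.savetxt(path, Xy_toy, fmt='%f')

# read the file back as floats
Xy_back = np.loadtxt(path)
print(Xy_back.shape)
# '%f' keeps six decimals, so values agree only up to that rounding
print(np.allclose(Xy_back, Xy_toy, atol=1e-6))
```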