# Preprocessing Techniques

### Import some packages

In [1]:
import numpy as np
from sklearn import preprocessing



In [2]:
input_data = np.array([[5.1, -2.9, 3.3], 
                       [-1.2, 7.8, -6.1], 
                       [3.9, 0.4, 2.1], 
                       [7.3, -9.9, -4.5]]) 

## Types of preprocessing techniques

### 1. Binarization
Binarization converts numerical values into boolean values.
With 2.1 as the threshold, every value strictly greater than 2.1 becomes 1 and every other value becomes 0.

In [3]:
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print("\nBinarized data:\n", data_binarized)


Binarized data:
 [[1. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
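
As a rough illustration of what the binarizer does, the same result can be sketched with plain NumPy. This assumes the rule is a strict greater-than comparison, which matches the output above (the value 2.1 itself maps to 0). The names `threshold` and `manual_binarized` are introduced only for this sketch.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Assumption: values strictly above the threshold become 1, all others 0
threshold = 2.1
manual_binarized = np.where(input_data > threshold, 1.0, 0.0)

print(manual_binarized)
```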


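### 2. Mean Removal
Mean removal centers each feature around zero, which helps to remove bias from the feature vector. The `scale` function used below also divides each feature by its standard deviation, so every feature ends up with a standard deviation of 1.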

In [4]:
# Print the mean and standard deviation of each feature (column)
print("\nBefore:")
print("Mean = ", input_data.mean(axis=0))
print("Standard Deviation = ", input_data.std(axis=0))


Before:
Mean =  [ 3.775 -1.15  -1.3  ]
Standard Deviation =  [3.12039661 6.36651396 4.0620192 ]


In [6]:
data_scaled = preprocessing.scale(input_data)
print("\nAfter:")
print("Mean = ", data_scaled.mean(axis = 0))
print("Standard Deviation = ", data_scaled.std(axis = 0))


After:
Mean =  [1.11022302e-16 0.00000000e+00 2.77555756e-17]
Standard Deviation =  [1. 1. 1.]


Note that the mean of each feature is now effectively zero (the tiny nonzero values are floating-point rounding error) and the standard deviation is 1.
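
The standardization above can also be sketched by hand: subtract each column's mean and divide by its standard deviation. The sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which is consistent with the printed standard deviation of 1. The names `col_mean`, `col_std`, and `manual_scaled` are hypothetical.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-feature (column) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Center each column on zero, then rescale it to unit standard deviation
manual_scaled = (input_data - col_mean) / col_std

print("Mean = ", manual_scaled.mean(axis=0))
print("Standard Deviation = ", manual_scaled.std(axis=0))
```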

### 3. Scaling

Scaling brings the values of different features into the same range, so that features with naturally large measurements do not overwhelm features with smaller measurements.

In [9]:
# Min-max scaling: map each feature to the range [0, 1]
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)

print("\nInput Data: \n", input_data)
print("\nMin max scaled data: \n", data_scaled_minmax)


Input Data: 
 [[ 5.1 -2.9  3.3]
 [-1.2  7.8 -6.1]
 [ 3.9  0.4  2.1]
 [ 7.3 -9.9 -4.5]]

Min max scaled data: 
 [[0.74117647 0.39548023 1.        ]
 [0.         1.         0.        ]
 [0.6        0.5819209  0.87234043]
 [1.         0.         0.17021277]]


Note that each column (feature) is scaled independently so that its minimum value becomes 0 and its maximum value becomes 1, with all other values placed proportionally in between.
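
A minimal sketch of the min-max computation, assuming the usual formula `(x - min) / (max - min)` applied per column. It reproduces the values printed above. The names `col_min`, `col_max`, and `manual_minmax` are introduced only for illustration.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Per-feature (column) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so that its minimum becomes 0 and its maximum becomes 1
manual_minmax = (input_data - col_min) / (col_max - col_min)

print(manual_minmax)
```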

### 4. Normalization

Normalization modifies the values in each feature vector so that they can be measured on a common scale.

Two common forms of normalization rescale each row so that a norm of the row equals 1:

>L1 normalization, also referred to as Least Absolute Deviations, ensures that the sum of the absolute values in each row is 1.

>L2 normalization, also referred to as least squares, ensures that the sum of the squares of the values in each row is 1.

In general, L1 normalization is considered more robust than L2 normalization because it is resistant to outliers in the data. If outliers are important to the problem being solved, L2 normalization may be the better choice.
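
The two definitions can be sketched directly in NumPy: divide each row by the sum of its absolute values (L1) or by the square root of the sum of its squares (L2). This is a hand-written illustration under those definitions, not the library's implementation. The names `manual_l1` and `manual_l2` are hypothetical.

```python
import numpy as np

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# L1: divide each row by the sum of its absolute values
l1_norm = np.abs(input_data).sum(axis=1, keepdims=True)
manual_l1 = input_data / l1_norm

# L2: divide each row by the square root of the sum of its squares
l2_norm = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))
manual_l2 = input_data / l2_norm

print("L1 row sums of absolute values:", np.abs(manual_l1).sum(axis=1))
print("L2 row sums of squares:", (manual_l2 ** 2).sum(axis=1))
```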

In [10]:
# Normalize each row with the L1 norm and with the L2 norm
data_normalized_l1 = preprocessing.normalize(input_data, norm="l1")
data_normalized_l2 = preprocessing.normalize(input_data, norm="l2")

print("\nInput data:\n", input_data)
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)


Input data:
 [[ 5.1 -2.9  3.3]
 [-1.2  7.8 -6.1]
 [ 3.9  0.4  2.1]
 [ 7.3 -9.9 -4.5]]

L1 normalized data:
 [[ 0.45132743 -0.25663717  0.2920354 ]
 [-0.0794702   0.51655629 -0.40397351]
 [ 0.609375    0.0625      0.328125  ]
 [ 0.33640553 -0.4562212  -0.20737327]]

L2 normalized data:
 [[ 0.75765788 -0.43082507  0.49024922]
 [-0.12030718  0.78199664 -0.61156148]
 [ 0.87690281  0.08993875  0.47217844]
 [ 0.55734935 -0.75585734 -0.34357152]]


To confirm that the rows are normalized as described, we can inspect the values and sum them row by row:

In [15]:
len(data_normalized_l1)

4

In [19]:
range(len(data_normalized_l1))

range(0, 4)

In [17]:
for a in range(len(data_normalized_l1)):
    for val in data_normalized_l1[a]:
        print(val)

0.4513274336283185
-0.2566371681415929
0.29203539823008845
-0.07947019867549669
0.5165562913907285
-0.40397350993377484
0.609375
0.0625
0.328125
0.33640552995391704
-0.45622119815668205
-0.20737327188940094


In [45]:

for a in range(len(data_normalized_l1)):
    abs_val = []
    for val in data_normalized_l1[a]:
        abs_val.append(abs(val))
    print(f"Sum of absolute values of L1 normalized row {a+1}:")
    print(sum(abs_val))
    
# For L2 normalization, the sum of squares (not of absolute values) of each row is 1
print('\n')
for a in range(len(data_normalized_l2)):
    sq_val = []
    for val in data_normalized_l2[a]:
        sq_val.append(val ** 2)
    print(f"Sum of squares of L2 normalized row {a+1}:")
    # Rounded to 6 decimals to hide floating-point noise
    print(round(sum(sq_val), 6))

Sum of absolute values of L1 normalized row 1:
0.9999999999999998
Sum of absolute values of L1 normalized row 2:
1.0
Sum of absolute values of L1 normalized row 3:
1.0
Sum of absolute values of L1 normalized row 4:
1.0


Sum of squares of L2 normalized row 1:
1.0
Sum of squares of L2 normalized row 2:
1.0
Sum of squares of L2 normalized row 3:
1.0
Sum of squares of L2 normalized row 4:
1.0
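
The row-by-row loops can be condensed into a vectorized check. The sketch below copies the normalized values from the printed output earlier in this section (rounded to 8 decimals, so the norms are only approximately 1) and uses `np.linalg.norm` with `axis=1` to compute one norm per row. The variable names are introduced only for this sketch. It also shows that the absolute values of an L2-normalized row do not sum to 1.

```python
import numpy as np

# Values copied from the printed L1 and L2 normalized output (8 decimals)
l1_rows = np.array([[0.45132743, -0.25663717, 0.2920354],
                    [-0.0794702, 0.51655629, -0.40397351],
                    [0.609375, 0.0625, 0.328125],
                    [0.33640553, -0.4562212, -0.20737327]])
l2_rows = np.array([[0.75765788, -0.43082507, 0.49024922],
                    [-0.12030718, 0.78199664, -0.61156148],
                    [0.87690281, 0.08993875, 0.47217844],
                    [0.55734935, -0.75585734, -0.34357152]])

# ord=1: sum of absolute values per row; ord=2: square root of sum of squares per row
l1_norms = np.linalg.norm(l1_rows, ord=1, axis=1)
l2_norms = np.linalg.norm(l2_rows, ord=2, axis=1)

# For contrast: absolute-value sums of the L2-normalized rows are greater than 1
l2_abs_sums = np.abs(l2_rows).sum(axis=1)

print("L1 norms of L1-normalized rows:", l1_norms)
print("L2 norms of L2-normalized rows:", l2_norms)
print("Absolute-value sums of L2-normalized rows:", l2_abs_sums)
```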
