# 3.4.2

## Normalize data

No library imports are allowed; the only helpers used are custom-built utilities for generating random data and keeping track of time.

In [1]:
from utils import random_data, start_time, elapsed_time
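The `utils` module itself is not shown in this section. For anyone who wants to run the cells, stand-ins with the same call signatures might look like the sketch below. The function bodies, the value range of -100 to 100, and the fixed seed are all assumptions, not the original implementation.

```python
import random
import time


def random_data(num_features: int, num_samples: int, seeded: bool = False) -> list[list[int]]:
    """Hypothetical stand-in: a num_samples x num_features matrix of random integers."""
    # Assumption: a fixed seed of 0 when seeded=True, and values in [-100, 100].
    rng = random.Random(0) if seeded else random.Random()
    return [[rng.randint(-100, 100) for _ in range(num_features)]
            for _ in range(num_samples)]


def start_time() -> float:
    """Hypothetical stand-in: return a timestamp to pass to elapsed_time."""
    return time.perf_counter()


def elapsed_time(start: float) -> float:
    """Hypothetical stand-in: seconds elapsed since start."""
    return time.perf_counter() - start
```

With these in place, `random_data(num_features=3, num_samples=4, seeded=True)` returns the same 4x3 matrix on every call, though not necessarily the values printed in this notebook.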

Functions to normalize data: for each sample, subtract the feature mean from the feature value and divide by the feature standard deviation. Built-in functionality from libraries like numpy would be much faster, but this is implemented by hand for the purpose of the JQR.
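As a quick illustration of the formula on a single feature, here is a minimal worked example. The sample values are made up for this sketch and were chosen so that the mean and standard deviation come out to round numbers.

```python
# Hypothetical values for one feature across eight samples.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n                                     # 5.0
s_dev = (sum((x - mean) ** 2 for x in values) / n) ** 0.5  # 2.0

# Normalized value: (sample feature - feature mean) / feature standard deviation
normalized = [(x - mean) / s_dev for x in values]
print(normalized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```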

In [2]:
def compute_means(data: list[list[int]], num_samples: int, num_features: int) -> list[float]:
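    '''
    Feature mean computed as the sum of the feature's values over all samples divided by the number of samples.
    '''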
    means = []
    for feat in range(num_features):
        means.append(sum([data[i][feat] for i in range(num_samples)]) / num_samples)
    return means

In [3]:
def compute_standard_deviations(data: list[list[int]], means: list[float], num_samples: int, num_features: int
                               ) -> list[float]:
    '''
    Feature standard deviation computed as square root of the mean squared feature distance from feature mean
    (population standard deviation: the sum of squares is divided by num_samples).
    '''
    s_devs = [] 
    for feat in range(num_features):
        s_devs.append((sum([(data[i][feat] - means[feat]) ** 2 for i in range(num_samples)]) / num_samples) ** 0.5)
    return s_devs
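Because the sum of squares is divided by `num_samples`, the function above computes the population standard deviation rather than the sample standard deviation, which divides by `num_samples - 1`. The sketch below shows the difference using the standard library's `statistics` module, purely as a cross-check outside the no-imports constraint. The column is the first feature of the small dataset printed later in this section.

```python
import statistics

# First feature column from the small 4x3 example.
column = [-18, -42, -8, -26]

population_sd = statistics.pstdev(column)  # divides by N, as compute_standard_deviations does
sample_sd = statistics.stdev(column)       # divides by N - 1

print(f"population: {population_sd:.4f}")
print(f"sample:     {sample_sd:.4f}")
```

For small datasets the two differ noticeably; for the 1,000,000-sample case the difference is negligible.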

In [4]:
def normalize(data: list[list[int]]) -> list[list[float]]:
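    '''
    @param data Assumes a num_samples x num_features matrix of integers.

    Normalizes data in place and returns the same list. Raises ZeroDivisionError if a feature
    has the same value in every sample (standard deviation of 0).
    '''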
    num_samples = len(data)
    num_features = len(data[0])
    
    # compute mean for each feature
    means = compute_means(data, num_samples, num_features)
    
    # compute standard deviation for each feature
    s_devs = compute_standard_deviations(data, means, num_samples, num_features)

    # normalize each feature for each sample
    for i in range(num_samples):
        for feat in range(num_features):
            data[i][feat] = (data[i][feat] - means[feat]) / s_devs[feat]
    return data
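Note that `normalize` overwrites the input matrix and fails with a `ZeroDivisionError` when a feature is constant. A possible variant that avoids both is sketched below. The name `normalize_copy` and the choice to map constant features to `0.0` are assumptions made for illustration, not part of the original notebook.

```python
def normalize_copy(data: list[list[int]]) -> list[list[float]]:
    """Hypothetical variant of normalize that leaves the input unchanged."""
    num_samples = len(data)
    columns = list(zip(*data))  # transpose: one tuple per feature
    means = [sum(col) / num_samples for col in columns]
    s_devs = [
        (sum((x - m) ** 2 for x in col) / num_samples) ** 0.5
        for col, m in zip(columns, means)
    ]
    # Assumption: a constant feature (standard deviation 0) maps to 0.0.
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, s_devs)]
        for row in data
    ]


original = [[1, 5], [3, 5]]
print(normalize_copy(original))  # [[-1.0, 0.0], [1.0, 0.0]]
print(original)                  # [[1, 5], [3, 5]]
```

Building a new matrix roughly doubles peak memory, which matters for the large dataset, so the in-place version may still be the better fit there.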

Visually inspect results with a small dataset.

In [6]:
# generate 4x3 matrix
st = start_time()
data = random_data(num_features=3, num_samples=4, seeded=True)
et = elapsed_time(st)
print(f"{et:.2f} seconds to generate data.")
print()

# print data before normalization
for sample in data:
    print(sample)
print()

# normalize
st = start_time()
data = normalize(data)
et = elapsed_time(st)

# print data after normalization
for sample in data:
    print(sample)
print()
print(f"{et:.2f} seconds to normalize data.")

0.00 seconds to generate data.

[-18, 81, -26]
[-42, 57, -37]
[-8, 92, -24]
[-26, 75, 10]

[0.44212732908197744, 0.3747162805875646, -0.3835676173160103]
[-1.4871555614575604, -1.518587031854867, -1.0086407714606196]
[1.2459952001401182, 1.2424802987903458, -0.2699179529260813]
[-0.2009669677645352, -0.09860954752304332, 1.6621263417027112]

0.00 seconds to normalize data.
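Beyond visual inspection, one way to sanity-check the result is to confirm that each normalized feature has a mean of about 0 and a population standard deviation of about 1. The sketch below does this with the standard library's `statistics` module, used only for verification, on the values copied from the output above.

```python
import statistics

# Normalized values copied from the output above.
normalized = [
    [0.44212732908197744, 0.3747162805875646, -0.3835676173160103],
    [-1.4871555614575604, -1.518587031854867, -1.0086407714606196],
    [1.2459952001401182, 1.2424802987903458, -0.2699179529260813],
    [-0.2009669677645352, -0.09860954752304332, 1.6621263417027112],
]

# zip(*normalized) yields one tuple per feature (column).
for feat, column in enumerate(zip(*normalized)):
    mean = statistics.fmean(column)
    s_dev = statistics.pstdev(column)
    print(f"feature {feat}: mean={mean:.6f}, std={s_dev:.6f}")
```

Small floating-point error means the means will not be exactly zero, so any automated check should compare against a tolerance rather than test for equality.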


See how long a larger dataset takes. This pure-Python approach is not efficient; numpy's built-in functionality would be much faster.

In [7]:
# generate 1,000,000x100 matrix
st = start_time()
data = random_data(num_features=100, num_samples=1000000, seeded=True)
et = elapsed_time(st)
print(f"{et:.2f} seconds to generate data.")

# normalize
st = start_time()
data = normalize(data)
et = elapsed_time(st)
print(f"{et:.2f} seconds to normalize data.")

114.44 seconds to generate data.
69.75 seconds to normalize data.
