# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
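
The rescaling to the range 0 to 1 described above is often called min-max scaling. A minimal sketch, assuming a small hypothetical array `values` that is not part of this lab's dataset:

```python
import numpy as np

# Hypothetical sample values in the range 0 to 5,000 (not the lab's dataset).
values = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (values - values.min()) / (values.max() - values.min())

# The smallest value maps to 0 and the largest maps to 1.
print(scaled)
```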

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also rescales the data, but instead of mapping the values to the interval between 0 and 1, it centers them on zero and spreads them over a small interval around it. For example, if we have a dataset that has values between 0 and 5,000, after mean normalization the values will lie in some small range around 0, for example between -3 and 3. Because the mean of the data is subtracted from every value, the average (mean) of the normalized data is zero, up to floating-point rounding. Therefore, when you perform *mean normalization*, your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [2]:
# import NumPy into Python
import numpy as np


# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, (1000, 20))

# print the shape of X
print(X.shape)


(1000, 20)
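
The call above passes 5001 rather than 5000 because `np.random.randint` excludes its upper bound. A small check of that behavior, assuming a hypothetical one-dimensional `sample` drawn with the same bounds:

```python
import numpy as np

# Draw many integers with the same bounds as the lab's call.
sample = np.random.randint(0, 5001, size=100_000)

# The lower bound is inclusive and the upper bound is exclusive,
# so every value lies between 0 and 5000.
print(sample.min() >= 0, sample.max() <= 5000)
```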


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [3]:
# Average of the values in each column of X (axis=0 averages down the rows)
ave_cols = np.average(X, axis=0)
print(ave_cols)

# Standard deviation of the values in each column of X
std_cols = np.std(X, axis=0)
print(std_cols)


[... 20 column averages, values vary from run to run ...]
[1450.99891301 1431.90989455 1446.30950344 1443.63522327 1453.06277037
 1431.23093193 1465.2432951  1456.07994938 1425.24488662 1453.212906
 1438.78571797 1465.01610489 1423.41752048 1441.48984526 1430.25956103
 1449.72547773 1445.06359317 1428.09763859 1453.36129786 1422.29621932]


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [4]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)
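
The shape check above depends on the `axis` argument: `axis=0` produces one value per column, while leaving `axis` at its default of `None` collapses the whole array to a single scalar. The sketch below illustrates this on a small hypothetical array `A` (not the lab's `X`) and shows how the resulting vectors broadcast across the rows:

```python
import numpy as np

# Hypothetical 3 x 2 array standing in for X (3 rows, 2 columns).
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = A.mean(axis=0)   # one mean per column, shape (2,)
col_stds = A.std(axis=0)     # one standard deviation per column, shape (2,)
overall_mean = A.mean()      # axis=None: a single scalar over all elements

print(col_means.shape, col_stds.shape, np.ndim(overall_mean))

# Broadcasting stretches the (2,) vectors across all 3 rows of A.
A_norm = (A - col_means) / col_stds
print(A_norm)
```

Both columns of `A_norm` come out identical here because the second column of `A` is just ten times the first, and mean normalization removes that difference in scale.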


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [5]:
# Mean normalize X: subtract each column's average, then divide by its standard deviation
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

print(X_norm)


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and they should be evenly distributed in some small interval around zero. You can verify this by filling in the code below:

In [8]:
# Print the average of all the values of X_norm
average = np.average(X_norm)
print(average)

# Print the average of the minimum value in each column of X_norm
min_in_cols = np.min(X_norm, axis=0)
average = np.mean(min_in_cols)
print("The average of the minimum value in each column of X_norm is", average)

# Print the average of the maximum value in each column of X_norm
max_in_cols = np.max(X_norm, axis=0)
average = np.mean(max_in_cols)
print("The average of the maximum value in each column of X_norm is", average)


Because $X$ was created using random integers, the exact values will vary from run to run. The average of all the elements should be zero up to floating-point rounding error. The averages of the column minima and maxima should be close to $-\sqrt{3} \approx -1.73$ and $\sqrt{3} \approx 1.73$, because for values drawn uniformly from an interval the endpoints lie $\sqrt{3}$ standard deviations from the mean.
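
One way to make this verification repeatable is to seed the random generator and test the results with `np.allclose` instead of reading the printed numbers. The following is a sketch under that assumption; the names `rng`, `X_demo`, and `X_demo_norm` are hypothetical and the lab itself uses the unseeded `np.random.randint`:

```python
import numpy as np

# Seeded generator (an assumption for repeatability, not the lab's method).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Mean normalize each column.
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Each column should have mean 0 and standard deviation 1, up to rounding error.
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))

# For roughly uniform data, the extremes sit near -sqrt(3) and +sqrt(3).
print(X_demo_norm.min(axis=0).mean(), X_demo_norm.max(axis=0).mean(), np.sqrt(3))
```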

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [None]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)
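
The call above returns each of the integers 0 through 4 exactly once, in a random order. A small check of that property, assuming a seeded generator `rng` that is used here only to make the sketch repeatable:

```python
import numpy as np

# Seeded generator, used only so the sketch is repeatable (an assumption;
# the lab calls np.random.permutation directly).
rng = np.random.default_rng(42)
perm = rng.permutation(5)
print(perm)

# Sorting a permutation of 0..4 recovers 0, 1, 2, 3, 4,
# so no index is repeated or missing.
print(np.array_equal(np.sort(perm), np.arange(5)))
```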

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [9]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
import numpy as np

# Create a 1000 x 20 ndarray with random integers.
X = np.random.randint(0, 5001, (1000, 20))

# Calculate the mean normalized version of X:
# subtract each column's average, then divide by its standard deviation.
X_norm = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Get the number of rows in X_norm.
num_rows = X_norm.shape[0]

# Create a random permutation of the row indices.
row_indices = np.random.permutation(num_rows)

print(row_indices)



[555 299 611 899 190 737 379 814 924 721 243  16 888 347 803 254 891 315
 137 651 818 726 722 138 727 495 106 627 584 905  35 472 904  44 788 704
  93 995 268 573  87 761 471 731 346 135 393 199  74 127 357 202 497 978
 822 311 797 851 140 656 869 340 742 612 988 481 867 680 177 361 173 150
 209 147 939 856 229 907 483  12 475 649 353 720 446 431 499 736  59 990
 104 595 146 285 863 819 713  84 945 777 385 300 437 873 143 717 642 113
 992 171 971 670  69 594 432 596 382 531 227 328 666 575 943  42 233 602
 181 614 544 480 947 770  39 946 417 800 675 433 169 326 378 148 505 660
  72 194 961 884 370  50 936 689 120  98 205 706 157 628 607 592 458 816
 479 593 695 482 979 145 752 567 599 944 118  13 287 206 170 957 312 732
 806 685 342 696 890  32 121   0 420 375 404 778 674 430 376  99 837 500
 320 250 493 175 841 349 180 621 690 558   5  54 855 112 847 183  46 377
  52 668 987 405 386 316 136 810 322 832 184 407 126 256 415 697  51 251
  70 264 271 161 915 548 763 416 246 383 839 681 58

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [11]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.

# Set the fraction of rows for the training set, cross-validation set, and test set.
train_ratio = 0.6
cross_val_ratio = 0.2
test_ratio = 0.2

# Calculate the number of rows in the training set, cross-validation set, and test set.
num_train_rows = int(num_rows * train_ratio)
num_cross_val_rows = int(num_rows * cross_val_ratio)
num_test_rows = int(num_rows * test_ratio)


# Create a Training Set
X_train = X_norm[row_indices[:num_train_rows]]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[num_train_rows:num_train_rows + num_cross_val_rows]]

# Create a Test Set
X_test = X_norm[row_indices[num_train_rows + num_cross_val_rows:]]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [12]:
# Print the shape of X_train
print(X_train.shape)


# Print the shape of X_crossVal
print(X_crossVal.shape)


# Print the shape of X_test
print(X_test.shape)


(600, 20)
(200, 20)
(200, 20)
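
The shapes above confirm the sizes of the three sets, but not that every row was used exactly once. The sketch below shows one way to check that, using `np.split` as an alternative to manual slicing. The names `data`, `indices`, `train_idx`, `val_idx`, and `test_idx` are hypothetical, and `data` is a random stand-in for `X_norm`:

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only for repeatability (assumption)
data = rng.normal(size=(1000, 20))    # hypothetical stand-in for X_norm

n_rows = data.shape[0]
indices = rng.permutation(n_rows)

# Integer arithmetic avoids any floating-point truncation in the set sizes.
n_train = (60 * n_rows) // 100
n_val = (20 * n_rows) // 100

# np.split cuts the shuffled index array at positions n_train and n_train + n_val.
train_idx, val_idx, test_idx = np.split(indices, [n_train, n_train + n_val])

# Every row index should appear in exactly one of the three sets.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))

# Select the rows for each set.
train, val, test = data[train_idx], data[val_idx], data[test_idx]
print(train.shape, val.shape, test.shape)
```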
