# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
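As a minimal sketch of the 0-to-1 scaling described above (often called min-max scaling), the snippet below rescales a small, made-up array. The array `values` and its contents are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data with values between 0 and 5,000.
values = np.array([0, 1250, 2500, 5000], dtype=float)

# Min-max scaling: shift so the minimum is 0, then divide by the range.
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)  # the smallest value maps to 0 and the largest to 1
```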

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization scales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean is subtracted from the data, the average (mean) of all the elements will be zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [15]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, size=(1000,20))
print(X)

# print the shape of X
print(X.shape)

[[3626 4654 3385 ... 2172 1649 4241]
 [4885  227 1771 ... 1199   37 2903]
 [1963 2602 3467 ... 2634 1667 1106]
 ...
 [ 343 3907 4488 ...   54 1216 1169]
 [1394 4606 3796 ... 3877 2197 4705]
 [1298 3293 3311 ... 1651 4431 1478]]
(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. This form of scaling is also commonly called standardization or z-score normalization. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [16]:
# Average of the values in each column of X
ave_cols = np.mean(X, axis=0)

print(ave_cols)

# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis=0)

print(std_cols)

[2408.407 2452.799 2515.491 2489.561 2539.863 2490.677 2466.815 2516.166
 2445.649 2542.257 2472.652 2524.653 2520.539 2483.974 2476.303 2544.862
 2497.276 2443.415 2540.145 2480.836]
[1435.56252018 1460.96956252 1465.81462058 1469.31949156 1457.9693228
 1417.74292404 1461.55797996 1428.88905743 1433.20671007 1438.3184164
 1413.04581698 1427.45695227 1442.48221981 1444.9623799  1458.99179134
 1429.62579613 1435.50875296 1457.24391876 1427.54405325 1414.83678603]
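To make the role of `axis=0` concrete, here is a small sketch on a hypothetical 3 x 2 array `A` (not part of the lab data). It also shows that `np.std` uses `ddof=0` by default, which is the population standard deviation; passing `ddof=1` gives the sample standard deviation instead.

```python
import numpy as np

# Hypothetical 3 x 2 array: 3 rows (samples), 2 columns (features).
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = np.mean(A, axis=0)  # one value per column, shape (2,)
row_means = np.mean(A, axis=1)  # one value per row, shape (3,)

# np.std defaults to ddof=0 (population standard deviation);
# ddof=1 gives the sample standard deviation instead.
col_std_pop = np.std(A, axis=0)
col_std_sample = np.std(A, axis=0, ddof=1)

print(col_means, row_means)
print(col_std_pop, col_std_sample)
```

The lab uses the default `ddof=0`, so the sketch keeps both variants only to show that the choice changes the result slightly.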


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)` since $X$ has 20 columns. You can verify this by filling in the code below:

In [17]:
# Print the shape of ave_cols
print( ave_cols.shape )

# Print the shape of std_cols
print( std_cols.shape ) 


(20,)
(20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [18]:
# Mean normalize X
X_norm = (X - ave_cols) / std_cols

print(X_norm)

[[ 0.84816438  1.50667136  0.59319165 ... -0.18625228 -0.62425044
   1.24407565]
 [ 1.72517251 -1.52350813 -0.50790256 ... -0.85395107 -1.75346252
   0.29838353]
 [-0.31026653  0.10212465  0.64913324 ...  0.13078456 -0.61164137
  -0.97172763]
 ...
 [-1.43874403  0.99536707  1.34567426 ... -1.63968089 -0.92756857
  -0.92719953]
 [-0.70662683  1.47381647  0.87358182 ...  0.98376461 -0.24037437
   1.57202868]
 [-0.77349958  0.57509822  0.54270778 ... -0.5437765   1.32455107
  -0.70879978]]
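The one-line normalization works because NumPy broadcasts a vector of shape `(columns,)` across every row of a `(rows, columns)` array. The sketch below illustrates this on a hypothetical 3 x 2 array `A` and compares the broadcast result with an explicit loop over columns; the names `A`, `mu`, and `sigma` are used here for illustration only.

```python
import numpy as np

# Hypothetical 3 x 2 array and its per-column statistics.
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
mu = A.mean(axis=0)     # shape (2,)
sigma = A.std(axis=0)   # shape (2,)

# Broadcasting stretches the (2,) vectors across all 3 rows.
A_norm = (A - mu) / sigma

# The same computation written as an explicit loop over columns.
A_loop = np.empty_like(A)
for j in range(A.shape[1]):
    A_loop[:, j] = (A[:, j] - mu[j]) / sigma[j]

print(np.allclose(A_norm, A_loop))  # True
```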


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should be spread over a small interval around zero. You can verify this by filling in the code below:

In [19]:
# Print the average of all the values of X_norm
avg = np.mean(X_norm)
print(avg)

# Print the average of the minimum value in each column of X_norm
mini = np.mean(np.min(X_norm, axis=0))
print(mini)

# Print the average of the maximum value in each column of X_norm
maxi = np.mean(np.max(X_norm, axis=0))
print(maxi)

-4.263256414560601e-18
-1.7275608849810968
1.737236422640774


You should note that since $X$ was created using random integers, the above values will vary. 
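The near-zero average, however, is not a coincidence; it follows from the normalization equation itself. As a short derivation, write $x_{ki}$ for the entry in row $k$ of column $i$ and $N$ for the number of rows (both symbols are notation introduced here, not part of the lab). The average of the normalized column is then:

$\frac{1}{N}\sum_{k=1}^{N}\frac{x_{ki} - \mu_i}{\sigma_i} = \frac{1}{\sigma_i}\left(\frac{1}{N}\sum_{k=1}^{N}x_{ki} - \mu_i\right) = \frac{1}{\sigma_i}\left(\mu_i - \mu_i\right) = 0$

In exact arithmetic every column therefore averages to zero, so a printed value on the order of `1e-18` reflects floating-point rounding rather than a true offset.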

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [22]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([0, 3, 2, 4, 1])
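To see why a permutation is useful for picking rows, the sketch below indexes a hypothetical 5 x 2 array `A` with a random permutation. Because every index from 0 to 4 appears exactly once, the rows are reordered but none is repeated or dropped.

```python
import numpy as np

# Hypothetical 5 x 2 array whose rows are easy to recognize.
A = np.arange(10).reshape(5, 2)

# Random ordering of the row indices 0..4.
perm = np.random.permutation(A.shape[0])

# Indexing with the permutation reorders the rows of A.
shuffled = A[perm]

# Every index from 0 to 4 appears exactly once.
print(np.array_equal(np.sort(perm), np.arange(5)))  # True
print(shuffled.shape)  # (5, 2)
```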

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [31]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[199 430 236 570 585 413 624  20 311 591 425 544 558 323 259  53 571 225
 393  46 316 388 221 390 928 529 535 857 380 511  15 439 720 489 492 469
 951 603 907 632 828 971 694 404 867  24 105 937 668 169 934 128 373 650
 251 972  21 882 689 223  79 760 317 656 729 711 791 709 186 191 981 524
 395 773 789 305  47  34 449 295 759 495  86 637 414 182 655 133 261 173
  43 712 429   5 834 947 697 665 410 963 663 876 431 550 800 214 910 865
 672  25 761  23 945 980 942 921 427 629 171  52 461 386 308 358 874 866
 325 192 279 690 633 121 911 872 455 130 244 331 324 420 372 284 156 744
 285  30 605 997 932 909 856 715 567 576 260 854  13 269 908 795 753 770
 974 734 398 467 885 454 434 831 200 838 377 302  58 925 612 452  93 582
 914 277  11 676 155  50 818 505 792 657 660 357 416 751 955 580 504 557
 680 523 695 816 253 614 470 881 384 336 100 829 853  49 202 767 183 587
 180 212 138 887  77 179 378 641 292 339 220 742 966 487 490 894 385 164
 328 266 243 545 415 991 899 851 878 888 197 993 10

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [35]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
train_end = int(0.6 * row_indices.size)  # end of the Training Set slice
val_end = int(0.8 * row_indices.size)    # end of the Cross Validation Set slice

# Create a Training Set from the normalized data
print('Training data [%d, %d]' % (0, train_end))
X_train = X_norm[row_indices[:train_end], :]

# Create a Cross Validation Set from the normalized data
print('Cross Validation data [%d, %d]' % (train_end, val_end))
X_crossVal = X_norm[row_indices[train_end:val_end], :]

# Create a Test Set from the normalized data
print('Test data [%d, %d]' % (val_end, row_indices.size))
X_test = X_norm[row_indices[val_end:], :]

Training data [0, 600]
Cross Validation data [600, 800]
Test data [800, 1000]


If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [36]:
# Print the shape of X_train
print("Shape of X_train: ", X_train.shape)

# Print the shape of X_crossVal
print("Shape of X_crossVal: ", X_crossVal.shape)

# Print the shape of X_test
print("Shape of X_test: ", X_test.shape)

Shape of X_train:  (600, 20)
Shape of X_crossVal:  (200, 20)
Shape of X_test:  (200, 20)
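The three slicing steps can also be wrapped in a small reusable function. The sketch below is one possible way to do this, not the lab's required solution: `split_rows` and its parameters `train_frac` and `val_frac` are hypothetical names, and `data` is stand-in data with the same shape as `X_norm`. It uses `np.split` to cut the permuted indices at the two split points and then checks that every row lands in exactly one set.

```python
import numpy as np

def split_rows(data, train_frac=0.6, val_frac=0.2):
    """Hypothetical helper: shuffle the rows of `data` and split them
    into training, cross validation, and test sets."""
    n_rows = data.shape[0]
    indices = np.random.permutation(n_rows)

    # Compute the split points once instead of repeating the arithmetic.
    train_end = int(train_frac * n_rows)
    val_end = train_end + int(val_frac * n_rows)

    # np.split cuts the index array at the two split points.
    train_idx, val_idx, test_idx = np.split(indices, [train_end, val_end])
    return data[train_idx], data[val_idx], data[test_idx]

# Stand-in data with the same shape as X_norm; each row is unique.
data = np.arange(1000 * 20).reshape(1000, 20)
train, val, test = split_rows(data)

print(train.shape, val.shape, test.shape)  # (600, 20) (200, 20) (200, 20)

# Every original row lands in exactly one of the three sets.
recombined = np.concatenate([train, val, test])
print(np.array_equal(np.sort(recombined[:, 0]), data[:, 0]))  # True
```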
