# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
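
As a point of comparison, the 0-to-1 scaling described above can be sketched as follows. This is only an illustration: the array `data` and its values are made up for the example, and min-max scaling is one common way (not the only way) to map values into the range 0 to 1.

```python
import numpy as np

# Hypothetical toy data with values between 0 and 5,000
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: subtract the minimum, then divide by the range
data_scaled = (data - data.min()) / (data.max() - data.min())

# The scaled values are 0, 0.25, 0.5 and 1
print(data_scaled)
```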

In this lab, you will be performing a different kind of feature scaling known as *mean normalization*. Mean normalization will scale the data, but instead of making the values be between 0 and 1, it will center the values on zero and spread them over some small interval around it. For example, if we have a dataset that has values between 0 and 5,000, after mean normalization the values will fall in some small range around 0, for example between -3 and 3. Because the average of each column is subtracted from the values in that column, the average (mean) of all elements will be zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but it will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [3]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001,size=(1000,20))

# Print X (X.shape gives its shape, which is (1000, 20))
print(X)

[[4222 4153 1363 ..., 1794 1122 4329]
 [2960 4885   49 ..., 3435 1008  479]
 [4931 3298 1283 ..., 1212 1509  536]
 ..., 
 [4382 2800 1469 ..., 1706 1481 2114]
 [3897 3820  927 ..., 4346 2423 2951]
 [ 515  553 3214 ..., 4777 3695    7]]
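
Because the array is random, every run produces different values. If reproducible values are wanted, one option is a seeded generator. The sketch below assumes NumPy 1.17 or newer (which provides `np.random.default_rng`); the name `X_demo` and the seed value are made up for the example and are not part of the lab.

```python
import numpy as np

# Assumption: NumPy 1.17+; the seed value 0 is arbitrary
rng = np.random.default_rng(seed=0)

# integers() excludes the upper bound by default, so 5001 makes 5000 reachable
X_demo = rng.integers(0, 5001, size=(1000, 20))

print(X_demo.shape)                    # (1000, 20)
print(X_demo.min() >= 0)               # True
print(X_demo.max() <= 5000)            # True
```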


Now that you created the array we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then by dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [5]:
# Average of the values in each column of X
ave_cols = X.mean(axis=0)
# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)
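
To see what `axis=0` does, it may help to try it on a tiny array first. The sketch below uses a made-up 3 x 2 array `A`. It also shows that `std()` uses the population formula (`ddof=0`) by default, while `ddof=1` gives the sample standard deviation.

```python
import numpy as np

# Hypothetical 3 x 2 array: two columns on very different scales
A = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

col_means = A.mean(axis=0)          # one mean per column, shape (2,)
row_means = A.mean(axis=1)          # one mean per row, shape (3,)
col_std = A.std(axis=0)             # population std (ddof=0), the default
col_std_sample = A.std(axis=0, ddof=1)  # sample std, for comparison

print(col_means)       # means of the columns: 2 and 200
print(row_means)       # means of the rows: 50.5, 101 and 151.5
print(col_std)
print(col_std_sample)  # sample std of the columns: 1 and 100
```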

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)` since $X$ has 20 columns. You can verify this by filling in the code below:

In [6]:
# Print ave_cols (one average per column, 20 values in total)
print(ave_cols)

# Print std_cols (one standard deviation per column, 20 values in total)
print(std_cols)

[ 2568.225  2539.403  2493.194  2514.301  2533.095  2541.651  2564.767
  2537.747  2599.536  2489.54   2470.046  2510.175  2500.214  2578.642
  2557.28   2527.465  2576.928  2461.544  2463.59   2549.254]
[ 1429.46066626  1452.18287712  1461.17139664  1445.01442152  1415.95485662
  1441.57111972  1454.92903219  1397.15204004  1462.26941864  1456.96588581
  1472.97235544  1425.68773733  1461.31747687  1455.26301672  1450.99242851
  1493.77325815  1475.34858146  1441.71864941  1427.45491344  1421.54948542]
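
The shapes themselves can also be checked directly with the `shape` attribute. The sketch below rebuilds a stand-in array so that it runs on its own; the `_demo` names are made up for the example.

```python
import numpy as np

# Stand-in for X: 1000 x 20 random integers in [0, 5000]
X_demo = np.random.randint(0, 5001, size=(1000, 20))

ave_demo = X_demo.mean(axis=0)
std_demo = X_demo.std(axis=0)

print(ave_demo.shape)  # (20,)
print(std_demo.shape)  # (20,)
```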


You can now take advantage of Broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [7]:
# Mean normalize X
X_norm = (X-ave_cols)/std_cols
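
To make the role of Broadcasting more concrete, the sketch below compares the one-line version with an explicit loop over columns. The `_demo` and `X_loop` names are made up for the example; the loop is shown only for comparison and is not how the lab expects the calculation to be done.

```python
import numpy as np

# Stand-in for X: 1000 x 20 random integers in [0, 5000]
X_demo = np.random.randint(0, 5001, size=(1000, 20))
ave_demo = X_demo.mean(axis=0)   # shape (20,)
std_demo = X_demo.std(axis=0)    # shape (20,)

# Broadcasting: the (20,) vectors are applied to each of the 1000 rows
X_norm_demo = (X_demo - ave_demo) / std_demo

# Equivalent explicit loop over the columns, for comparison only
X_loop = np.empty(X_demo.shape, dtype=float)
for i in range(X_demo.shape[1]):
    X_loop[:, i] = (X_demo[:, i] - ave_demo[i]) / std_demo[i]

print(X_norm_demo.shape)                 # (1000, 20)
print(np.allclose(X_norm_demo, X_loop))  # True
```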

If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and they should be evenly distributed in some small interval around zero. You can verify this by filling in the code below:

In [8]:
# Print X_norm
print(X_norm)

# Print the smallest of the column averages of X_norm
print(X_norm.mean(axis=0).min(axis=0))

# Print the same value again; the average of the column minimums and maximums
# would be X_norm.min(axis=0).mean() and X_norm.max(axis=0).mean()
print(X_norm.mean(axis=0).min(axis=0))

[[ 1.15692235  1.11115275 -0.77348489 ..., -0.46301961 -0.93984755
   1.25197611]
 [ 0.2740719   1.6152215  -1.67276338 ...,  0.67520525 -1.01970997
  -1.45633622]
 [ 1.65291362  0.52238393 -0.82823548 ..., -0.86670447 -0.66873566
  -1.41623913]
 ..., 
 [ 1.26885268  0.17945192 -0.70094036 ..., -0.52405787 -0.68835099
  -0.3061828 ]
 [ 0.92956388  0.88184279 -1.07187562 ...,  1.30708998 -0.02843522
   0.28261134]
 [-1.43636341 -1.36787386  0.49330695 ...,  1.60603874  0.86266122
  -1.78836827]]
-1.76192394008e-16
-1.76192394008e-16


You should note that since $X$ was created using random integers, the above values will vary. 
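
One possible way to check all three quantities (overall average, average column minimum, average column maximum) is sketched below on a freshly generated stand-in array. The `_demo` names are made up for the example. For uniformly distributed integers like these, the normalized minimums and maximums should land at roughly -1.7 and 1.7, since the standard deviation of a uniform distribution is its range divided by $\sqrt{12}$.

```python
import numpy as np

# Stand-in for X and X_norm
X_demo = np.random.randint(0, 5001, size=(1000, 20))
X_norm_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

overall_mean = X_norm_demo.mean()           # average of all the values
avg_col_min = X_norm_demo.min(axis=0).mean()  # average of the column minimums
avg_col_max = X_norm_demo.max(axis=0).mean()  # average of the column maximums

print(overall_mean)   # very close to 0
print(avg_col_min)    # roughly -1.7 for this kind of data
print(avg_col_max)    # roughly 1.7 for this kind of data
```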

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 
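
For a dataset with 1000 rows, the 60/20/20 split works out as in the short sketch below. The variable names are made up for the example, and integer arithmetic is used so that the sizes are whole numbers.

```python
n_rows = 1000

# Integer arithmetic keeps the sizes exact
n_train = (n_rows * 60) // 100
n_cross_val = (n_rows * 20) // 100
n_test = n_rows - n_train - n_cross_val  # whatever is left goes to the test set

print(n_train, n_cross_val, n_test)  # 600 200 200
```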

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [None]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)
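
The order returned changes from run to run, but the result always contains each integer exactly once. A small sketch of that property (the name `perm` is made up for the example):

```python
import numpy as np

perm = np.random.permutation(5)

print(perm)           # some ordering of 0, 1, 2, 3, 4
print(np.sort(perm))  # [0 1 2 3 4]

# Sorting the permutation recovers 0..4, so no value is missing or repeated
print(np.array_equal(np.sort(perm), np.arange(5)))  # True
```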

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that for a rank 2 ndarray the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [10]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[206 798 703 562 313 304 936 484 783 104 173 801 600 752   8 809 489 211
 370  40 499 648 188 247 439 663 180 971 990 336 352  32 975 825  44 535
 174 636 755 705 467 981 602 284 172 814 579 493 457 556 440 430 788 119
 818 516 838 324 305 926 615 185 931 414 582 893 586 606 799 946 747 144
 341 107 943 265 358 281 460 248  61  97 258 343 639 877 744 710 696 665
 717 182 689 496  19 782 157 348  27  53 997 634 154 552 220 302 836 517
 408 833 177 622 858 789 332 596 593 960 243  37 840 892 631 653 441 879
 774 954 570 567 952 860 682 957 355 565 687 816 264 779 557 934 534 966
 384 554 670 213 168 314 401 310 807 290 340 768 203 514 259 813 181 994
 764 146  29 597 649 431 945 624 411 702 519 580 178 453  55 745  50 272
 192 700 388 359 283 743 842  46  25   5 413 539  17 518 847 111 335 191
 325 929 463 784 964 619 832  22 620 193 856 307  76 486 233 360 426 140
 465  15 886 548 464 365 389 541 231 459 127 369  13 511 739 986 817 381
 215 450 806 894 255  77 544 321  48 475 685 105 55

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [15]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
n_rows = row_indices.shape[0]
train_end = (n_rows * 60) // 100
crossVal_end = (n_rows * 80) // 100
training_idx = row_indices[:train_end]
crossVal_idx = row_indices[train_end:crossVal_end]
test_idx = row_indices[crossVal_end:]

# Create a Training Set
X_train = X_norm[training_idx,:]

# Create a Cross Validation Set
X_crossVal = X_norm[crossVal_idx,:]

# Create a Test Set
X_test = X_norm[test_idx,:]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [18]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
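
Beyond the shapes, it can be worth confirming that the three index sets do not overlap and together cover every row. The sketch below does this on a stand-in permutation; all names in it are made up for the example.

```python
import numpy as np

n_rows = 1000
row_indices_demo = np.random.permutation(n_rows)

train_end = (n_rows * 60) // 100
cross_val_end = (n_rows * 80) // 100

train_idx_demo = row_indices_demo[:train_end]
cross_val_idx_demo = row_indices_demo[train_end:cross_val_end]
test_idx_demo = row_indices_demo[cross_val_end:]

# Put the three index sets back together and sort them
combined = np.concatenate([train_idx_demo, cross_val_idx_demo, test_idx_demo])

print(len(train_idx_demo), len(cross_val_idx_demo), len(test_idx_demo))  # 600 200 200

# If the sorted indices equal 0..n_rows-1, every row appears exactly once
print(np.array_equal(np.sort(combined), np.arange(n_rows)))  # True
```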
