# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
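As a minimal sketch of the rescaling described above, the following maps a few values between 0 and 5,000 onto the range 0 to 1 using min-max scaling. The array `data` and the name `data_scaled` are made up for illustration and are not part of the lab.

```python
import numpy as np

# Hypothetical example data with values between 0 and 5,000
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: subtract the minimum, then divide by the range of the data
data_scaled = (data - data.min()) / (data.max() - data.min())

print(data_scaled)
```

The smallest value maps to 0 and the largest to 1; every other value lands in between in proportion to its position in the original range.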

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also scales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the average of each column is subtracted from that column, the average (mean) of all elements will be zero. Therefore, when you perform *mean normalization*, your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [5]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray of random integers in the half-open interval [0, 5001)
X = np.random.randint(0, 5001, size=(1000, 20))

# Print X (NumPy shows an abbreviated view of large arrays)
print(X)

[[3751 3698 1368 ... 1751 2100 4198]
 [  50 2061 1723 ... 2886 1295 3429]
 [4727 3901   19 ... 4263   45 3666]
 ...
 [2044 1383 2717 ... 1097 4493 1912]
 [4067 2807 4050 ... 4716 3338 4690]
 [ 850 3666 3107 ... 4562 3736 2959]]


Now that you created the array we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.
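As a small illustration of the per-column statistics in this equation, the sketch below computes $\mu_i$ and $\sigma_i$ for a tiny array. The array `A` and the names `mu` and `sigma` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical 3 x 2 array: each column plays the role of one Col_i
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# axis=0 reduces over the rows, leaving one value per column
mu = A.mean(axis=0)     # column averages
sigma = A.std(axis=0)   # column standard deviations (population form, ddof=0)

print(mu, mu.shape)
print(sigma, sigma.shape)
```

Both results have shape `(2,)`, one entry per column, which is the same pattern that yields shape `(20,)` for an array with 20 columns.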

In [0]:
# Average of the values in each column of X
ave_cols = X.mean(axis=0)

# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [7]:
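# Print ave_cols; it should hold 20 values, one per column, so ave_cols.shape is (20,)
print(ave_cols)

# Print std_cols; it should hold 20 values, one per column, so std_cols.shape is (20,)
print(std_cols)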

[2459.481 2504.424 2587.354 2458.037 2409.019 2529.851 2532.697 2543.763
 2451.532 2486.793 2518.736 2499.584 2392.423 2548.261 2509.006 2454.108
 2485.529 2496.94  2536.005 2479.052]
[1427.71882093 1405.74935683 1456.80996519 1464.96794901 1447.319077
 1442.91365397 1437.59920186 1444.83397483 1421.35254071 1425.69347342
 1457.17694749 1451.56455831 1463.74067378 1445.3911446  1423.30948566
 1482.47827179 1486.56730798 1443.08389929 1434.34032258 1444.60611839]


You can now take advantage of Broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [0]:
# Mean normalize X
# ave_cols and std_cols are broadcast across the rows of X, so no slicing such as ave_cols[:] is needed
X_norm = (X - ave_cols) / std_cols
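To make the broadcasting step concrete, here is a minimal sketch on a tiny array that compares the one-line broadcast expression with an explicit loop over columns. The names `A`, `mu`, `sigma`, `A_norm`, and `A_loop` are hypothetical and exist only in this example.

```python
import numpy as np

A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])   # shape (3, 2)
mu = A.mean(axis=0)           # shape (2,)
sigma = A.std(axis=0)         # shape (2,)

# Broadcasting: the (2,) vectors are stretched across all 3 rows of A
A_norm = (A - mu) / sigma

# The same computation written out one column at a time
A_loop = np.empty_like(A)
for i in range(A.shape[1]):
    A_loop[:, i] = (A[:, i] - mu[i]) / sigma[i]

print(np.allclose(A_norm, A_loop))
```

Broadcasting works here because the trailing dimension of `A` matches the length of `mu` and `sigma`, so NumPy can align each vector entry with one column.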

If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should lie in some small interval around zero. You can verify this by filling in the code below:

In [30]:
# Print X_norm
print(X_norm)

# Print the minimum value in each column of X_norm
print(X_norm.min(axis=0))

# Print the maximum value in each column of X_norm
print(X_norm.max(axis=0))


[[ 0.90460319  0.84906743 -0.83700279 ... -0.51690688 -0.30397598
   1.18990774]
 [-1.68764393 -0.31543603 -0.59331966 ...  0.26960317 -0.86520959
   0.65758271]
 [ 1.58821118  0.9934744  -1.76299865 ...  1.22380965 -1.73669035
   0.82164127]
 ...
 [-0.29101038 -0.79774107  0.08899308 ... -0.97010299  1.3643868
  -0.39253053]
 [ 1.12593529  0.21524178  1.00400604 ...  1.53772071  0.55913857
   1.530485  ]
 [-1.12730951  0.82630377  0.35670129 ...  1.43100481  0.83661805
   0.33223451]]
[-1.72196441 -1.77800117 -1.77604084 -1.67446462 -1.66308801 -1.74359082
 -1.7617546  -1.75297857 -1.72478814 -1.7414634  -1.72781762 -1.71923735
 -1.63104233 -1.75956592 -1.75928428 -1.65473454 -1.66661071 -1.73028055
 -1.75481715 -1.71607469]
[1.76611736 1.77312975 1.65542937 1.73448368 1.78812056 1.70775915
 1.71626626 1.70001332 1.78806308 1.7585877  1.69661207 1.71912161
 1.77871466 1.69417047 1.74662926 1.71664708 1.68607973 1.73244258
 1.71716221 1.7388463 ]


You should note that since $X$ was created using random integers, the above values will vary. 
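One way to check the claim about the average directly is to reduce the normalized array to a few summary numbers. The sketch below is self-contained and assumes a seeded generator so that it is repeatable; the names `rng`, `X_demo`, and `X_demo_norm`, and the seed value, are made up for this example and are not part of the lab.

```python
import numpy as np

# Seeded generator (seed chosen arbitrarily) so the sketch is repeatable
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5001, size=(1000, 20))   # high bound is exclusive

# Same normalization as in the lab, applied to the demo array
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Average of all values: zero up to floating-point rounding
print(X_demo_norm.mean())

# Standard deviation of each column: one, up to rounding
print(X_demo_norm.std(axis=0))

# Average of the per-column minimum and maximum values
print(X_demo_norm.min(axis=0).mean())
print(X_demo_norm.max(axis=0).mean())
```

For uniformly distributed data the extremes sit about $\sqrt{3} \approx 1.73$ standard deviations from the mean, which is consistent with the per-column minimum and maximum values printed earlier.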

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [10]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([4, 1, 3, 0, 2])
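A permutation like this becomes useful when it is used as an index. The following sketch, with a hypothetical array `A` and hypothetical names `perm` and `A_shuffled`, shows that indexing with the permutation reorders the rows without dropping or repeating any of them.

```python
import numpy as np

# Hypothetical 5 x 2 array; row k holds the values 2k and 2k + 1
A = np.arange(10).reshape(5, 2)

# Random ordering of the row indices 0..4
perm = np.random.permutation(A.shape[0])

# Indexing with an integer array selects rows in the permuted order
A_shuffled = A[perm]

print(perm)
print(A_shuffled)
```

Each row of `A` appears exactly once in `A_shuffled`, only in a different order, so slicing the permutation into pieces yields non-overlapping groups of rows.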

# To Do

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the  `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [11]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
# Use X_norm.shape[0] rather than hardcoding 1000, so the code works for any number of rows
row_indices = np.random.permutation(X_norm.shape[0])

print(row_indices)

[114   9 269 527 691 625 958 546  20 423 978 616  60 583 739 223 899 682
 697 699 125 757 775 701 749 299 317 861   3 629 648 336 783 552 890 973
 181 243 307 346 257 518 294 825 429  89 881 841 303 678 865 779 818 515
 187 409 517 424  52 639 743 631 179 439 907 142 143 389 994 796 755  26
 696 160 853 592 268 846 989 437 960 765 731 675 383 319 205 468 717  11
 108 882 452 813 446  23 220 721 819 714  47 118 347   5 427  48 886 298
 287 374 964 174 357 472 852 673 225 967  10  36 767 434 460 122 203 420
 656 504 671 442   8 538  57 497  82 763 754 932 947  88 635 931 713 244
  90 371  46 840 184 919 131 365 360 641 896 137 175  25 589  62 147 222
 688 681 172 431 540 981 503 380 267 490 516 156 506 521 830 483 577 955
 403 185  16 526 615 241 785  79 952 111 470 653 151  19 773 760 769 774
 293 419  93 345 780 218 145 903 880 927 944 549 604 658 213 200 992  81
 186 605 916  33 548 643 976 100 822 310 991 473 435 938 650 400 664  78
 557 782 132 121 533  87  84  94 709  66 670 898 60

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [18]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.

# Derive the split sizes from the number of rows instead of hardcoding them
total_sample_size = X_norm.shape[0]

train_portion = int(total_sample_size * 0.6)
cross_valid_portion = int(total_sample_size * 0.2)
test_portion = int(total_sample_size * 0.2)


# Because the sizes are computed from the data, the same code works for any number of rows
# and the index boundaries never have to be rewritten by hand

train_indices = row_indices[:train_portion]
print("train_indices: \n", train_indices)
cross_val_indices = row_indices[train_portion:train_portion+cross_valid_portion]
print("cross_val_indices: \n", cross_val_indices)
test_indices = row_indices[train_portion+cross_valid_portion:]
print("test_indices: \n", test_indices)

# Create a Training Set
X_train = X_norm[train_indices]
print("X_train: \n", X_train)

# Create a Cross Validation Set
X_crossVal = X_norm[cross_val_indices]
print("X_crossVal: \n", X_crossVal)

# Create a Test Set
X_test = X_norm[test_indices]
print("X_test: \n", X_test)


train_indices: 
 [114   9 269 527 691 625 958 546  20 423 978 616  60 583 739 223 899 682
 697 699 125 757 775 701 749 299 317 861   3 629 648 336 783 552 890 973
 181 243 307 346 257 518 294 825 429  89 881 841 303 678 865 779 818 515
 187 409 517 424  52 639 743 631 179 439 907 142 143 389 994 796 755  26
 696 160 853 592 268 846 989 437 960 765 731 675 383 319 205 468 717  11
 108 882 452 813 446  23 220 721 819 714  47 118 347   5 427  48 886 298
 287 374 964 174 357 472 852 673 225 967  10  36 767 434 460 122 203 420
 656 504 671 442   8 538  57 497  82 763 754 932 947  88 635 931 713 244
  90 371  46 840 184 919 131 365 360 641 896 137 175  25 589  62 147 222
 688 681 172 431 540 981 503 380 267 490 516 156 506 521 830 483 577 955
 403 185  16 526 615 241 785  79 952 111 470 653 151  19 773 760 769 774
 293 419  93 345 780 218 145 903 880 927 944 549 604 658 213 200 992  81
 186 605 916  33 548 643 976 100 822 310 991 473 435 938 650 400 664  78
 557 782 132 121 533  87  84  94 7

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [20]:
# Print the shape of X_train
print("X_train shape: ", X_train.shape)

# Print the shape of X_crossVal
print("X_crossVal shape: ", X_crossVal.shape)

# Print the shape of X_test
print("X_test shape: ", X_test.shape)


X_train shape:  (600, 20)
X_crossVal shape:  (200, 20)
X_test shape:  (200, 20)
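The splitting steps can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not the lab's required solution: the function name `split_rows`, its parameters, and the stand-in array `data` are all hypothetical. It also checks that the three index sets together cover every row exactly once.

```python
import numpy as np

def split_rows(data, train_frac=0.6, val_frac=0.2):
    """Hypothetical helper: return shuffled row indices for three sets."""
    n = data.shape[0]
    idx = np.random.permutation(n)

    # int() truncates, so any leftover rows end up in the test set
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

# Stand-in data with the same shape as the lab's array
data = np.random.rand(1000, 20)
train_idx, val_idx, test_idx = split_rows(data)

print(len(train_idx), len(val_idx), len(test_idx))

# Every row index should appear exactly once across the three sets
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(data.shape[0])))
```

With 1,000 rows the sizes come out as 600, 200, and 200. Taking the test set as the remainder of the permutation means no row is lost when the number of rows is not a multiple of 5.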
