## Import data and take a look at it

In [1]:
#Import gen_data function
from data_gen import gen_data

#Get the data by calling the gen_data function
data1,data2 = gen_data()

#Print 10 entries from data1 and data2
print(data1[:10])
print(data2[:10])

[47, 64, 32, 64, 21, 72, 31, 11, 56, 26]
[27, 11, 17, 36, 24, 36, 5, 9, 39, 49]


## Standardize the data:
1. Calculate its mean $\mu = \frac{\Sigma x_i}{n}$
2. Calculate its standard deviation $\sigma = (\frac{\Sigma x_i^2}{n} - \mu^2)^{1/2}$.
3. For each element perform the following:

    $z_i = \frac{x_i - \mu}{\sigma}$
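As a quick sanity check of the three steps above, here is a minimal sketch on a small made-up list. The list `values` and the variable names are assumptions for illustration only and are not part of the exercise data. The numbers are chosen so that the mean and standard deviation come out as whole numbers.

```python
from math import sqrt

# Hypothetical toy data, not from gen_data()
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean
mu = sum(values) / len(values)

# Step 2: standard deviation via sqrt(mean of squares - square of mean)
sigma = sqrt(sum(x**2 for x in values) / len(values) - mu**2)

# Step 3: standardize each element
z = [(x - mu) / sigma for x in values]

print(mu, sigma)  # 5.0 2.0
print(z)          # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Note that this formula divides by $n$, so it gives the population standard deviation rather than the sample version that divides by $n-1$.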

In [3]:
from math import sqrt

In [4]:
### edTest(test_std) ###
#Use list comprehension to square each element of the list data1
data_sq1 = [x**2 for x in data1]

#Calculate mean and standard deviation using formula provided in the markdown cell above
mean1 = sum(data1)/len(data1)
std1 = sqrt( sum(data_sq1)/len(data_sq1) - mean1**2 )

#Standardize the data using list comprehension and display 10 elements
std_data = [(x - mean1)/std1 for x in data1]
print(std_data[:10])

[-0.25458086605908237, 0.3971093497223363, -0.829601644689746, 0.3971093497223363, -1.2512835490188992, 0.7037870983253569, -0.8679363632651235, -1.634630734772675, 0.09043160111931575, -1.0596099561420114]


### Similarly standardize data2

In [5]:
#Use list comprehension to square each element of the list data2
data_sq2 = [x**2 for x in data2]

mean2 = sum(data2)/len(data2)
std2 = sqrt( sum(data_sq2)/len(data_sq2) - mean2**2 )

std_data2 = [(x - mean2)/std2 for x in data2]
print(std_data2[:10])

[0.3549398997058298, -0.46336534457274015, -0.15650087796827644, 0.8152365996125254, 0.2015076664035979, 0.8152365996125254, -0.7702298111772039, -0.5656535001075614, 0.9686688329147572, 1.4801096105888634]


### ⏸ If you had 1000 such data sets, what would be the most efficient way of standardizing them all?

#### A. Copy-paste the code for each dataset.
#### B. Call the TA and ask him/her to do it.
#### C. Write a function to standardize the data.

In [6]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')

answer = 'C'

## Writing a Function
Manually copy-pasting code to process each dataset would be tedious. It also hurts readability, which increases the chance of small errors.

That is why we declare a function to do the job for us. Every time we want to standardize data, all we have to do is call the function.

In [7]:
### edTest(test_func) ###
#Define a function which calculates mean and std of input data, and returns standardized data
def standardize(data):
  data_sq = [x**2 for x in data]
  mean = sum(data)/len(data)
  std = sqrt( sum(data_sq)/len(data_sq) - mean**2 )
  return [(x - mean)/std for x in data]

In [8]:
#Call the standardize function on data1 and display 10 elements
data1_std = standardize(data1)
print(data1_std[:10])

[-0.25458086605908237, 0.3971093497223363, -0.829601644689746, 0.3971093497223363, -1.2512835490188992, 0.7037870983253569, -0.8679363632651235, -1.634630734772675, 0.09043160111931575, -1.0596099561420114]


In [9]:
#Call the standardize function on data2 and display 10 elements
data2_std = standardize(data2)
print(data2_std[:10])

[0.3549398997058298, -0.46336534457274015, -0.15650087796827644, 0.8152365996125254, 0.2015076664035979, 0.8152365996125254, -0.7702298111772039, -0.5656535001075614, 0.9686688329147572, 1.4801096105888634]
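Once the logic lives in a function, standardizing many datasets reduces to a loop or a list comprehension. The sketch below assumes a hypothetical list of lists named `datasets` (three small made-up lists stand in for the 1000 mentioned earlier) and repeats the `standardize` definition so it runs on its own.

```python
from math import sqrt

def standardize(data):
    # Same logic as the standardize function defined in this notebook
    data_sq = [x**2 for x in data]
    mean = sum(data) / len(data)
    std = sqrt(sum(data_sq) / len(data_sq) - mean**2)
    return [(x - mean) / std for x in data]

# Hypothetical collection of datasets (made-up values for illustration)
datasets = [[1, 2, 3, 4], [10, 20, 30, 40], [5, 7, 9, 11]]

# One call per dataset, no copy-pasting required
standardized = [standardize(d) for d in datasets]

for z in standardized:
    print([round(v, 3) for v in z])
```

Each standardized list should have a mean close to 0 and a standard deviation close to 1, up to floating-point rounding.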


## De-standardization function
Often in data science, we perform manipulations on the standardized dataset (because it's usually easier) and then convert it back to the original scale by de-standardizing.
So let's write a function to retrieve the data by de-standardizing.

## Function to de-standardize
You will require the original `mean` and `std` values in order to de-standardize. Perform the following on each element:

$x_i = z_i \cdot \sigma + \mu$

In [10]:
### edTest(test_de) ###
# Write a function which takes data, mean and std as input 
# and returns de-standardized data
# Make sure you use the correct mean and std for 
# data1 and data2 calculated earlier
def destandardize(mean,std,data):
  return [x * std + mean for x in data]

In [13]:
### edTest(test_de1) ###
# Use mean and std of data1 calculated earlier and destandardize data1_std
data_de1 = destandardize(mean1,std1,data1_std)
# Display first 10 elements of the destandardized data.
print(data_de1[:10])

[47.0, 64.0, 32.0, 64.0, 21.0, 72.0, 31.0, 11.0, 56.0, 26.0]


In [14]:
### edTest(test_de2) ###
# Use mean and std of data2 calculated earlier and destandardize data2_std
data_de2 = destandardize(mean2,std2,data2_std)
# Display first 10 elements of the destandardized data.
print(data_de2[:10])

[27.0, 11.0, 17.0, 36.0, 24.0, 36.0, 5.0, 9.0, 39.0, 49.0]



### ⏸ By looking at what data is required for destandardizing, do you observe something out of place?

#### A. No, all looks good.
#### B. `mean` and `std` got over-written when copy-pasting code.
#### C. Function to de-standardize requires extra data (`mean`, `std`) which were not returned by the standardize function.
#### D. B and C.

In [15]:
### edTest(test_chow2) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')

answer = 'C'
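One possible way to close the gap described in option C is to have the standardizing function return the mean and standard deviation along with the standardized data. This is a sketch of that idea, not the notebook's required solution. The name `standardize_with_stats` and the toy list `sample` are hypothetical.

```python
from math import sqrt

def standardize_with_stats(data):
    """Return (standardized data, mean, std). Hypothetical helper name."""
    mean = sum(data) / len(data)
    std = sqrt(sum(x**2 for x in data) / len(data) - mean**2)
    return [(x - mean) / std for x in data], mean, std

def destandardize(mean, std, data):
    # Same logic as the destandardize function defined in this notebook
    return [x * std + mean for x in data]

# Made-up data for illustration
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# The statistics travel with the standardized data, so nothing has to be
# recomputed or looked up from an earlier cell
z, mu, sigma = standardize_with_stats(sample)
restored = destandardize(mu, sigma, z)
print(restored[:4])
```

Returning a tuple keeps the two functions paired: whatever `standardize_with_stats` produces is exactly what `destandardize` needs.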