## Standardization
Goal: Apply the transformation to validation and test sets correctly.
The following code shows two ways to standardize validation and test sets (only the test set is shown here).
- 1- Run the following code to see the values of X_test_std1 and X_test_std2
- 2- Re-apply standardization using StandardScaler from scikit-learn
- 3- Assuming the StandardScaler result is the correct transformation, is the following statement correct? **Yes, because X_test_std2 is the same as the StandardScaler output.**
- "We should re-use the parameters estimated from the training set to transform validation and test sets"

**Note: The default standard deviation differs between numpy and pandas. Pandas divides by N-1 by default (sample standard deviation), whereas numpy divides by N by default (population standard deviation). We must use the numpy definition in this assignment, since StandardScaler uses the numpy std for scaling. Therefore, you should use `df.std(axis=0, ddof=0)` or `df.values.std(axis=0)` to calculate the standard deviation for the purpose of scaling.**
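
The difference described in the note can be checked on a small example. The sketch below is illustrative only: the column name `x` and the variable names are assumptions, and the values are the same training values used in this exercise.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column name "x" is an assumption for this sketch
df = pd.DataFrame({"x": [10, 20, 30]})

sample_std = df.std(axis=0)              # pandas default: ddof=1, divides by N-1
population_std = df.std(axis=0, ddof=0)  # pandas with ddof=0, divides by N
numpy_std = df.values.std(axis=0)        # numpy default: ddof=0, divides by N

print(sample_std["x"])      # sample standard deviation
print(population_std["x"])  # population standard deviation
print(numpy_std[0])         # matches the population value
```

With these values the sample standard deviation is 10.0, while the population standard deviation is about 8.165, which is the value used for scaling in this exercise.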


In [13]:
import pandas as pd

X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])

mu_train, sigma_train = X_train.mean(axis=0), X_train.values.std(axis=0)
mu_test, sigma_test = X_test.mean(axis=0), X_test.values.std(axis=0)

X_train_std = (X_train - mu_train) / sigma_train
X_test_std1 = (X_test - mu_test) / sigma_test
X_test_std2 = (X_test - mu_train) / sigma_train
print(X_test_std1)
print('\n')
print(X_test_std2)

          0
0 -1.224745
1  0.000000
2  1.224745


          0
0 -1.837117
1 -1.714643
2 -1.592168


In [15]:
# Add your code for step 2 here

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_test_std)

[[-1.83711731]
 [-1.71464282]
 [-1.59216833]]
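
One way to confirm the answer to step 3 programmatically is to compare the manual result with the scaler output, and to inspect the parameters the scaler learned from the training set. This is a minimal sketch that rebuilds the same data; the variable name `X_test_manual` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame([10, 20, 30])
X_test = pd.DataFrame([5, 6, 7])

# Fit on the training set only, so the test set never influences the parameters
scaler = StandardScaler().fit(X_train)

# Manual standardization of the test set with the training-set parameters
mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0, ddof=0)  # population std, as StandardScaler uses
X_test_manual = (X_test - mu_train) / sigma_train

print(scaler.mean_, scaler.scale_)  # learned training mean and standard deviation
print(np.allclose(X_test_manual.values, scaler.transform(X_test)))
```

The comparison should print `True`, since both approaches subtract the training mean and divide by the training standard deviation.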


## Missing Data
Goal: Try a new data imputation strategy and practice using the scikit-learn documentation
- 1- Create, in code, a dataframe identical to the one on slide #36.
- 2- Examine its shape and dimensions
- 3- Print it to make sure it is what you expect
- 4- Use SimpleImputer to impute missing values with a constant value of -1
- (Hint: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)



In [8]:
import pandas as pd
import numpy as np

# 1, 2, 3
tupleList = [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, np.nan, 8.0), (10.0, 11.0, 12.0, np.nan)]
colNames = ['A', 'B', 'C', 'D']
data = pd.DataFrame(tupleList, columns=colNames)
print(data)
print('\nShape:', data.shape)
print('\n')

# 4
from sklearn.impute import SimpleImputer

const_imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
const_imp.fit(data)
print(repr(const_imp.transform(data)))

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

Shape: (3, 4)


array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., -1.,  8.],
       [10., 11., 12., -1.]])
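
The output above shows that `transform` returns a plain numpy array, so the column names are lost. If the labels are needed afterwards, one possible approach is to wrap the result back into a dataframe. This is a sketch, not part of the original exercise; the variable name `imputed` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame(
    [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, np.nan, 8.0), (10.0, 11.0, 12.0, np.nan)],
    columns=["A", "B", "C", "D"],
)

const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=-1)

# fit_transform returns a numpy array; restore the original column and row labels
imputed = pd.DataFrame(
    const_imp.fit_transform(data), columns=data.columns, index=data.index
)
print(imputed)
```

Here `fit_transform` is used as a shorthand for calling `fit` and then `transform` on the same data.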
