###  Python Basics Tutorial

#### Data Prep Basics Tutorial

####  Machine Learning Mastery with Python
####  Jason Brownlee

#### In this recipe:
- rescale
- standardize
- normalize
- binarize

###  Rescale Data

- Many ML techniques perform better when all attributes are on the same scale

In [1]:
# Scikit-learn has MinMaxScaler class

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

filename = 'pima-indians-diabetes.data.csv'
path = 'D:\\OneDrive - QJA\\My Files\\DataScience\\DataSets'

# name columns
names = ['preg', 'plas', 'pres', 'skin', 'test',
        'mass', 'pedi', 'age', 'class']

dataframe = read_csv(path + '\\' + filename, names = names)

In [7]:
array = dataframe.values
# print(array)

# split array into input and output
X = array[:, 0:8] # columns 1 through 8 (the input attributes)
Y = array[:, 8] # column 9 only (the class label)

# set object to scale data between 0 and 1
scaler = MinMaxScaler(feature_range = (0,1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
set_printoptions(precision = 3)
print(rescaledX[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
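
For intuition, the min-max rescaling can be sketched by hand with NumPy. This is only an illustration under stated assumptions: `X_toy` is made-up data (not the Pima dataset) and `min_max_rescale` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Toy data (hypothetical values): 3 rows, 2 attributes on different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 40.0]])

def min_max_rescale(X, feature_range=(0.0, 1.0)):
    """Linearly rescale each column of X into feature_range."""
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Constant columns would give max - min == 0; divide by 1 instead
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span * (hi - lo) + lo

rescaled_toy = min_max_rescale(X_toy)
print(rescaled_toy)
```

Each column is scaled independently, so the smallest value in a column maps to 0 and the largest to 1.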


### Standardize Data

- transform attributes with Gaussian distributions and differing means and standard deviations to a mean of 0 and a standard deviation of 1

- useful for techniques such as linear regression, logistic regression, and linear discriminant analysis (LDA)

In [8]:
from sklearn.preprocessing import StandardScaler
# from pandas import read_csv
# from numpy import set_printoptions

array = dataframe.values
# print(array)

# split array into input and output
X = array[:, 0:8] # columns 1 through 8 (the input attributes)
Y = array[:, 8] # column 9 only (the class label)

# set object to standardize data mean=0, sd=1
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision = 3)
print(rescaledX[0:5, :])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
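
The standardization itself is just "subtract the column mean, divide by the column standard deviation". A minimal NumPy sketch, assuming made-up data in `X_toy` and a hypothetical helper `standardize`; it uses the population standard deviation (`ddof=0`), which to my knowledge is also what `StandardScaler` uses.

```python
import numpy as np

# Toy data (hypothetical values): 3 rows, 2 attributes
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

def standardize(X):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns have std == 0; divide by 1 instead
    std = np.where(std > 0, std, 1.0)
    return (X - mean) / std

standardized_toy = standardize(X_toy)
print(standardized_toy.mean(axis=0))  # approximately 0 for each column
print(standardized_toy.std(axis=0))   # approximately 1 for each column
```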


### Normalize Data

- rescale each observation (row) to have a norm (length) of 1, independently of the other rows
- useful for sparse data sets when using algorithms that weight input values

In [11]:
from sklearn.preprocessing import Normalizer
# from pandas import read_csv
# from numpy import set_printoptions

array = dataframe.values

# split array into input and output
X = array[:, 0:8] # columns 1 through 8 (the input attributes)
Y = array[:, 8] # column 9 only (the class label)

# set object to rescale each row to unit norm (L2 by default)
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

set_printoptions(precision = 3)
print(normalizedX[0:5, :])



[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
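
Row normalization can be sketched by hand: divide each row by its own Euclidean (L2) length. This assumes the L2 norm, which is the `Normalizer` default as far as I know; `X_toy` is made-up data and `l2_normalize_rows` is a hypothetical helper.

```python
import numpy as np

# Toy data (hypothetical values), including an all-zero row
X_toy = np.array([[3.0, 4.0],
                  [0.0, 0.0],
                  [1.0, 1.0]])

def l2_normalize_rows(X):
    """Scale each row of X to unit Euclidean length; zero rows stay zero."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero rows
    norms = np.where(norms > 0, norms, 1.0)
    return X / norms

normalized_toy = l2_normalize_rows(X_toy)
print(normalized_toy)
```

Unlike rescaling and standardizing, this works across each row rather than down each column, so no row affects any other.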


### Binarize Data

- use a threshold to mark values as 0 or 1
    - above threshold = 1, at or below threshold = 0
- useful when you have probabilities you want to convert into binary values (e.g. yes/no)
- useful for adding new binary features

In [13]:
from sklearn.preprocessing import Binarizer
# from pandas import read_csv
# from numpy import set_printoptions

array = dataframe.values

# split array into input and output
X = array[:, 0:8]  # columns 1 through 8 (the input attributes)
Y = array[:, 8]  # column 9 only (the class label)

# set object to binarize data based on threshold
# threshold: all values <= 0 turned to 0's, 
# all others turned to 1's
binarizer = Binarizer(threshold = 0.0).fit(X)
binaryX = binarizer.transform(X)

set_printoptions(precision = 3)
print(binaryX[0:5, :])


[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]
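
The thresholding rule is a single comparison in NumPy. A minimal sketch, assuming made-up probabilities in `X_toy` and a hypothetical helper `binarize`; note that a value exactly equal to the threshold maps to 0.

```python
import numpy as np

# Toy data (hypothetical values), e.g. predicted probabilities
X_toy = np.array([[0.0, 0.2],
                  [0.5, 0.9],
                  [-1.0, 0.5]])

def binarize(X, threshold=0.0):
    """Return 1.0 where X is strictly above threshold, else 0.0."""
    return (X > threshold).astype(float)

binary_toy = binarize(X_toy, threshold=0.5)
print(binary_toy)
```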
