# Scaling and Normalization - PCA

## Overview
* Scaling changes the range of the data (for example, mapping every feature to the interval 0 to 1)
* Normalization is often described as changing the shape of the distribution of the data; note that scikit-learn's `normalize`, used below, instead rescales each row to unit norm
* Standardization changes the mean and standard deviation of the data
* Z-score normalization is the standardization that sets the mean to 0 and the standard deviation to 1

![Normalization](Normalization-Formula.jpg)
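
The definitions above can be sketched on a tiny array. This is a minimal illustration using NumPy only; the array `values` is made-up example data, not taken from the diabetes dataset.

```python
import numpy as np

# Hypothetical example values (not from the dataset)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: changes the range to [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: mean becomes 0, standard deviation becomes 1
z_scores = (values - values.mean()) / values.std()

print(min_max)
print(z_scores)
```

Min-max scaling keeps the relative spacing of the values and only changes their range, while the z-score centers them on 0 and expresses each value in units of standard deviation.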

## Data Dictionary
* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction: Diabetes pedigree function
* Age: Age (years)
* Outcome: Class variable (0 or 1)

## Import Libraries

In [1]:
import pandas as pd


## Load Data

In [2]:
df = pd.read_csv('pima-indians-diabetes.csv')

## Scaling Data

In [3]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
from sklearn.preprocessing import normalize, scale

X = df.drop('Outcome', axis=1)
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [5]:
Xs = scale(X)
Xs[:5]

array([[ 0.63994726,  0.84832379,  0.14964075,  0.90726993, -0.69289057,
         0.20401277,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575,  0.53090156, -0.69289057,
        -0.68442195, -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, -1.28821221, -0.69289057,
        -1.10325546,  0.60439732, -0.10558415],
       [-0.84488505, -0.99820778, -0.16054575,  0.15453319,  0.12330164,
        -0.49404308, -0.92076261, -1.04154944],
       [-1.14185152,  0.5040552 , -1.50468724,  0.90726993,  0.76583594,
         1.4097456 ,  5.4849091 , -0.0204964 ]])

In [6]:
pd.DataFrame(Xs).head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.639947,0.848324,0.149641,0.90727,-0.692891,0.204013,0.468492,1.425995
1,-0.844885,-1.123396,-0.160546,0.530902,-0.692891,-0.684422,-0.365061,-0.190672
2,1.23388,1.943724,-0.263941,-1.288212,-0.692891,-1.103255,0.604397,-0.105584
3,-0.844885,-0.998208,-0.160546,0.154533,0.123302,-0.494043,-0.920763,-1.041549
4,-1.141852,0.504055,-1.504687,0.90727,0.765836,1.409746,5.484909,-0.020496
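
As a quick check of what `scale` does, the sketch below confirms that each column ends up with mean 0 and standard deviation 1, and that the result matches `StandardScaler`. It uses synthetic data (the array `demo` and its parameters are assumptions for illustration) so that it runs without the CSV file.

```python
import numpy as np
from sklearn.preprocessing import scale, StandardScaler

# Synthetic stand-in data: 50 rows, 3 columns on very different scales
rng = np.random.default_rng(0)
demo = rng.normal(loc=[100.0, 30.0, 0.5], scale=[30.0, 7.0, 0.3], size=(50, 3))

demo_scaled = scale(demo)

# Each column is standardized independently
col_means = demo_scaled.mean(axis=0)
col_stds = demo_scaled.std(axis=0)

# scale() gives the same result as StandardScaler().fit_transform()
same_as_scaler = np.allclose(demo_scaled, StandardScaler().fit_transform(demo))

print(col_means.round(6), col_stds.round(6), same_as_scaler)
```

`StandardScaler` is usually the more practical choice when the same transformation must later be applied to new data, because it stores the fitted means and standard deviations.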


## Normalize Data

In [7]:
Xn = normalize(X)
Xn[:5]

array([[0.03355237, 0.82762513, 0.40262844, 0.19572216, 0.        ,
        0.18789327, 0.00350622, 0.27960308],
       [0.008424  , 0.71604034, 0.55598426, 0.24429612, 0.        ,
        0.22407851, 0.00295683, 0.26114412],
       [0.04039768, 0.92409698, 0.32318146, 0.        , 0.        ,
        0.11765825, 0.00339341, 0.16159073],
       [0.00661199, 0.58846737, 0.43639153, 0.15207584, 0.62152733,
        0.185797  , 0.0011042 , 0.13885185],
       [0.        , 0.5963863 , 0.17412739, 0.15236146, 0.73133502,
        0.18762226, 0.00996009, 0.14365509]])

In [8]:
pd.DataFrame(Xn).head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.033552,0.827625,0.402628,0.195722,0.0,0.187893,0.003506,0.279603
1,0.008424,0.71604,0.555984,0.244296,0.0,0.224079,0.002957,0.261144
2,0.040398,0.924097,0.323181,0.0,0.0,0.117658,0.003393,0.161591
3,0.006612,0.588467,0.436392,0.152076,0.621527,0.185797,0.001104,0.138852
4,0.0,0.596386,0.174127,0.152361,0.731335,0.187622,0.00996,0.143655
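
Note that `normalize` works per row, not per column: with its default settings it divides each row by that row's Euclidean (L2) norm. The sketch below reproduces the first row of the output above by hand, using the first row of the dataset shown earlier.

```python
import numpy as np
from sklearn.preprocessing import normalize

# First row of X, copied from the df.head() output above
row = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])

row_normalized = normalize(row)  # default norm is 'l2'

# Manual equivalent: divide the row by its Euclidean length
manual = row / np.linalg.norm(row, axis=1, keepdims=True)

# After normalization the row has length 1
row_norm = np.linalg.norm(row_normalized)

print(row_normalized)
print(row_norm)
```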


## PCA - Principal Component Analysis
* Dimensionality reduction technique that reduces the number of columns in a dataset by projecting the data onto a smaller set of new columns (principal components)
* Example: when a dataset has 5,300 columns, PCA can reduce it to 10, 20, or 30 components

In [11]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
# X.head()
Xp = pca.fit_transform(X)
Xp[:5]

array([[-7.57146549e+01, -3.59507826e+01, -7.26078895e+00],
       [-8.23582676e+01,  2.89082132e+01, -5.49667139e+00],
       [-7.46306434e+01, -6.79064965e+01,  1.94618081e+01],
       [ 1.10774227e+01,  3.48984859e+01, -5.30177923e-02],
       [ 8.97437881e+01, -2.74693708e+00,  2.52128586e+01]])

In [12]:
pd.DataFrame(Xp, columns=['PC1', 'PC2', 'PC3']).head()

Unnamed: 0,PC1,PC2,PC3
0,-75.714655,-35.950783,-7.260789
1,-82.358268,28.908213,-5.496671
2,-74.630643,-67.906496,19.461808
3,11.077423,34.898486,-0.053018
4,89.743788,-2.746937,25.212859
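
To make the projection step concrete, the sketch below shows that `fit_transform` reduces the column count and that the result equals the centered data multiplied by the fitted component vectors. The data is synthetic (`demo` is an assumed stand-in with 8 columns, like `X`), so the numbers differ from the output above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 100 rows, 8 columns
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 8))

pca_demo = PCA(n_components=3)
projected = pca_demo.fit_transform(demo)

# PCA centers the data, then projects it onto the component directions
manual = (demo - pca_demo.mean_) @ pca_demo.components_.T

print(demo.shape, "->", projected.shape)
print(np.allclose(projected, manual))
```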


In [13]:
# Related topics to explore: log transforms, vectors for multi-dimensional data, cosine similarity, Euclidean distance

### PCA Explained Variance Ratio

In [14]:
pca.explained_variance_ratio_ , sum(pca.explained_variance_ratio_)

(array([0.88854663, 0.06159078, 0.02579012]), 0.9759275372391633)
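
Because PCA follows variance, a column with a much larger numeric range can dominate the first component when the data is not scaled first. The sketch below illustrates this effect on synthetic data (the array `demo` and the factor of 100 are assumptions chosen for illustration); it does not reproduce the ratios above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: four independent columns, the first on a 100x larger scale
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 4)) * np.array([100.0, 1.0, 1.0, 1.0])

# PCA on raw data: the large-scale column dominates the first component
raw_ratio = PCA(n_components=2).fit(demo).explained_variance_ratio_

# PCA on standardized data: variance is spread across components
demo_std = StandardScaler().fit_transform(demo)
scaled_ratio = PCA(n_components=2).fit(demo_std).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

Whether to standardize before PCA depends on whether the original units are meaningful for the analysis; when features are measured in unrelated units, standardizing first is a common choice.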