# Standardization
Standardization rescales each feature of a dataset so that it has a mean of 0 and a standard deviation of 1. It is also known as z-score normalization and is a common preprocessing step in data analysis and machine learning.

The formula for standardization is given by:

![](chat.png)

where:
- \( z \) is the standardized value.
- \( x \) is the original value of a feature.
- \( \mu \) is the mean of the feature.
- \( \sigma \) is the standard deviation of the feature.
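
Written out with the symbols defined above, the z-score of a single value is:

\[ z = \frac{x - \mu}{\sigma} \]

For example, a value that lies exactly one standard deviation above the feature's mean (\( x = \mu + \sigma \)) maps to \( z = 1 \), and a value equal to the mean maps to \( z = 0 \).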

The purpose of standardization is to put all features on a common scale, which makes their values easier to compare and interpret. Because every standardized feature has a mean of 0 and a standard deviation of 1, no single feature dominates the analysis solely because of its units or range.

Standardization is particularly important for machine learning algorithms that are sensitive to the scale of the input features, such as support vector machines, k-nearest neighbors, and principal component analysis. These methods rely on distances or variances, so a feature measured in large units can outweigh the others. Models fitted with iterative optimizers also tend to converge faster when the inputs are on a similar scale.
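
The scale sensitivity described above can be made concrete with a small sketch. The three (Age, EstimatedSalary) rows and the helper names `zscore` and `salary_share` are illustrative assumptions, not part of the original notebook. The sketch measures how much of the squared Euclidean distance between two rows comes from the salary column, before and after standardizing.

```python
import numpy as np

# Illustrative (Age, EstimatedSalary) values; assumed for this sketch only.
X = np.array([[19, 19000],
              [35, 20000],
              [26, 43000]], dtype=float)


def zscore(data):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    return (data - data.mean(axis=0)) / data.std(axis=0)


def salary_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 1 (salary)."""
    squared_diffs = (a - b) ** 2
    return squared_diffs[1] / squared_diffs.sum()


Z = zscore(X)

raw_share = salary_share(X[0], X[1])      # distance on the original scale
scaled_share = salary_share(Z[0], Z[1])   # distance after standardization

print("salary share of distance, raw:   ", raw_share)
print("salary share of distance, scaled:", scaled_share)
```

On the raw values, the salary difference accounts for almost the entire distance even though the two rows differ far more in age relative to the spread of that column. After standardization each column contributes in proportion to its own spread.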

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Social_Network_Ads.csv')

In [3]:
df.sample(5)

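|  | User ID | Gender | Age | EstimatedSalary | Purchased |
| --- | --- | --- | --- | --- | --- |
| 98 | 15575694 | Male | 35 | 73000 | 0 |
| 50 | 15694395 | Female | 24 | 32000 | 0 |
| 13 | 15704987 | Male | 32 | 18000 | 0 |
| 29 | 15669656 | Male | 31 | 18000 | 0 |
| 73 | 15782530 | Female | 33 | 113000 | 0 |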


In [4]:
df = df.iloc[:,2:]
df

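|  | Age | EstimatedSalary | Purchased |
| --- | --- | --- | --- |
| 0 | 19 | 19000 | 0 |
| 1 | 35 | 20000 | 0 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 0 |
| ... | ... | ... | ... |
| 395 | 46 | 41000 | 1 |
| 396 | 51 | 23000 | 1 |
| 397 | 50 | 20000 | 1 |
| 398 | 36 | 33000 | 0 |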


### Train Test Split

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Define the features x (Age, EstimatedSalary) and the target y (Purchased)
x = df.iloc[:,:2]
y = df['Purchased']

In [7]:
x

|  | Age | EstimatedSalary |
| --- | --- | --- |
| 0 | 19 | 19000 |
| 1 | 35 | 20000 |
| 2 | 26 | 43000 |
| 3 | 27 | 57000 |
| 4 | 19 | 76000 |
| ... | ... | ... |
| 395 | 46 | 41000 |
| 396 | 51 | 23000 |
| 397 | 50 | 20000 |
| 398 | 36 | 33000 |


In [8]:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.3,
                                                   random_state=0)

In [9]:
X_train.shape, X_test.shape

((280, 2), (120, 2))

### StandardScaler
![](sf.png)

In [10]:
from sklearn.preprocessing import StandardScaler

In [11]:
scaler = StandardScaler()

In [12]:
# Fit the scaler on the training set only; it learns each feature's mean and standard deviation
scaler.fit(X_train)

In [13]:
# Transform the train and test sets with the parameters learned from the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [14]:
scaler.mean_

array([3.78642857e+01, 6.98071429e+04])
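
Besides `mean_`, a fitted `StandardScaler` also exposes `scale_`, the per-feature standard deviation it divides by. The sketch below uses synthetic stand-in data (the names `X_demo` and `demo_scaler` are assumptions for illustration, not the notebook's variables) to check that `transform` matches the z-score formula applied by hand, and that `inverse_transform` recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for (Age, EstimatedSalary); not the notebook's data.
X_demo = np.column_stack([
    rng.integers(18, 61, size=200),
    rng.integers(15_000, 150_001, size=200),
]).astype(float)

demo_scaler = StandardScaler().fit(X_demo)

# mean_ holds the per-feature means, scale_ the per-feature population
# standard deviations (ddof=0) learned during fit.
manual = (X_demo - demo_scaler.mean_) / demo_scaler.scale_

# The manual z-scores should agree with what transform() returns.
matches = np.allclose(manual, demo_scaler.transform(X_demo))
print("manual z-score matches transform():", matches)

# inverse_transform maps standardized values back to the original units.
restored = demo_scaler.inverse_transform(demo_scaler.transform(X_demo))
print("round trip recovers the input:", np.allclose(restored, X_demo))
```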

In [15]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [16]:
X_train_scaled 

|  | Age | EstimatedSalary |
| --- | --- | --- |
| 0 | -1.163172 | -1.584970 |
| 1 | 2.170181 | 0.930987 |
| 2 | 0.013305 | 1.220177 |
| 3 | 0.209385 | 1.075582 |
| 4 | 0.405465 | -0.486047 |
| ... | ... | ... |
| 275 | 0.993704 | -1.151185 |
| 276 | -0.869053 | -0.775237 |
| 277 | -0.182774 | -0.514966 |
| 278 | -1.065133 | -0.457127 |


In [17]:
X_test_scaled

|  | Age | EstimatedSalary |
| --- | --- | --- |
| 0 | -0.771013 | 0.497201 |
| 1 | 0.013305 | -0.572804 |
| 2 | -0.280814 | 0.150172 |
| 3 | -0.771013 | 0.265849 |
| 4 | -0.280814 | -0.572804 |
| ... | ... | ... |
| 115 | 1.091743 | -0.139018 |
| 116 | 0.699584 | 1.769639 |
| 117 | -0.672973 | 0.555039 |
| 118 | 0.797624 | 0.352606 |


In [18]:
X_train.describe()

|  | Age | EstimatedSalary |
| --- | --- | --- |
| count | 280.0 | 280.0 |
| mean | 37.864286 | 69807.142857 |
| std | 10.218201 | 34641.201654 |
| min | 18.0 | 15000.0 |
| 25% | 30.0 | 43000.0 |
| 50% | 37.0 | 70500.0 |
| 75% | 46.0 | 88000.0 |
| max | 60.0 | 150000.0 |


In [19]:
np.round(X_train_scaled.describe())

|  | Age | EstimatedSalary |
| --- | --- | --- |
| count | 280.0 | 280.0 |
| mean | 0.0 | 0.0 |
| std | 1.0 | 1.0 |
| min | -2.0 | -2.0 |
| 25% | -1.0 | -1.0 |
| 50% | -0.0 | 0.0 |
| 75% | 1.0 | 1.0 |
| max | 2.0 | 2.0 |
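
Rounding to whole numbers hides a small detail: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `std()` and `describe()` report the sample standard deviation (`ddof=1`). For 280 rows the unrounded `std` should therefore come out at about 1.0018 rather than exactly 1. The sketch below reproduces this on synthetic values; the frame `demo` and the column name `feature` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

n = 280  # same number of rows as the training set in this notebook
rng = np.random.default_rng(0)

# Synthetic values; not the notebook's data.
demo = pd.DataFrame({"feature": rng.normal(loc=40, scale=10, size=n)})

# Standardize with the population std (ddof=0), as StandardScaler does.
z = (demo - demo.mean()) / demo.std(ddof=0)

pandas_std = z["feature"].std()            # pandas default is ddof=1
population_std = z["feature"].std(ddof=0)  # matches the scaler's convention
expected = np.sqrt(n / (n - 1))            # ratio between the two estimators

print("pandas std (ddof=1):    ", pandas_std)
print("population std (ddof=0):", population_std)
print("sqrt(n / (n - 1)):      ", expected)
```

The gap shrinks as the number of rows grows, so it rarely matters in practice, but it explains why an unrounded `describe()` on standardized data does not show a standard deviation of exactly 1.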


### Effect of Standardization

### Why Is Scaling Important?

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
lr = LogisticRegression()
lr_scaled = LogisticRegression()

In [22]:
# Actual (unscaled) data
lr.fit(X_train, y_train)
# Standardized data
lr_scaled.fit(X_train_scaled, y_train)

In [23]:
y_pred = lr.predict(X_test)
y_pred_scaled = lr_scaled.predict(X_test_scaled)

In [24]:
from sklearn.metrics import accuracy_score

In [25]:
print('Actual', accuracy_score(y_test, y_pred))
print('Scaled', accuracy_score(y_test, y_pred_scaled))

Actual 0.6583333333333333
Scaled 0.8666666666666667
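
One way to avoid keeping scaled copies of the train and test sets in sync by hand is to chain the scaler and the model with scikit-learn's `make_pipeline`. The sketch below is a minimal example on synthetic data, not the notebook's dataset: the arrays `X_demo` and `y_demo`, the `Xd_*`/`yd_*` split names, and the rule that generates the labels are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales and a noisy
# label that depends on both of them.
age = rng.uniform(18, 60, size=400)
salary = rng.uniform(15_000, 150_000, size=400)
noise = rng.normal(0, 0.5, size=400)
X_demo = np.column_stack([age, salary])
y_demo = ((age - 39) / 12 + (salary - 82_500) / 39_000 + noise > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# fit() learns the scaling parameters from the training rows only and then
# fits the classifier on the standardized training data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(Xd_train, yd_train)

# score() applies the same learned scaling to the test rows before predicting.
pipe_accuracy = pipe.score(Xd_test, yd_test)
print("pipeline accuracy on synthetic data:", pipe_accuracy)
```

Because the scaler lives inside the pipeline, the test set is never used to estimate the mean or standard deviation, which mirrors the fit-on-train, transform-both pattern used earlier in this notebook.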


### When to Use Standardization
![](whentouse.png)