# Feature scaling:

**What is feature scaling?**  
*Feature scaling is a technique for bringing the independent features of a dataset onto a comparable scale or into a fixed range.*  

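**Why is feature scaling important?**  
*It transforms the features onto a similar scale, which can bring several advantages:*
* Improved model performance: features with large numeric ranges no longer dominate features with small ranges.
* Gradient descent convergence: optimizers tend to converge faster when all features are on a similar scale.
* Distance-based algorithms: distances (for example in K-Nearest Neighbours or K-means) reflect all features rather than mostly the one with the largest range.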
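
The effect on distance-based algorithms can be sketched with two made-up customers. The values, means and standard deviations below are hypothetical and chosen only to resemble an age/salary dataset; this is an illustration, not a result from the data used later.

```python
import numpy as np

# Two hypothetical customers: (Age in years, EstimatedSalary)
a = np.array([25.0, 50000.0])
b = np.array([55.0, 51000.0])

# Raw Euclidean distance: the salary gap (1000) swamps the age gap (30)
raw_distance = np.linalg.norm(a - b)

# Hypothetical per-feature means and standard deviations (illustration only)
means = np.array([38.0, 70000.0])
stds = np.array([10.0, 34000.0])

a_scaled = (a - means) / stds
b_scaled = (b - means) / stds
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

# Share of the squared distance contributed by salary, before and after scaling
raw_salary_share = (a[1] - b[1]) ** 2 / raw_distance ** 2
scaled_salary_share = (a_scaled[1] - b_scaled[1]) ** 2 / scaled_distance ** 2

print(raw_salary_share, scaled_salary_share)
```

Before scaling, almost all of the distance comes from salary even though the two customers differ far more in age, relative to how each feature typically varies.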

## Types of feature scaling

There are two main types:
1. Standardization
2. Normalization
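
A minimal sketch contrasting the two types on a small hypothetical feature. It assumes that "normalization" means min-max scaling into the range [0, 1], which is one common reading of the term.

```python
import numpy as np

x = np.array([18.0, 30.0, 38.0, 46.0, 60.0])  # hypothetical ages

# Standardization: zero mean and unit standard deviation
x_std = (x - x.mean()) / x.std()

# Normalization (assumed here to mean min-max scaling): values mapped into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std)
print(x_norm)
```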

# Standardization:

Standardization in machine learning is the process of transforming each feature of a dataset so that it has a mean of 0 and a standard deviation of 1. It is also known as z-score normalization and is one form of feature scaling. It is commonly applied to features that have different units or scales, so that all features contribute comparably to the learning process and no feature dominates the others during model training.

The standardization process involves the following steps:

1. Calculate the mean: Compute the mean value for each feature in the dataset.

2. Calculate the standard deviation: Calculate the standard deviation of each feature.

3. Transform the data: Subtract the mean from each feature value and then divide by the standard deviation. This centers the data around 0 and scales it to have a standard deviation of 1.

The formula for standardization of a feature:

$$x_{\text{standardized}} = \frac{x - \text{mean}(x)}{\text{std}(x)}$$ 
where mean(x) is the mean of feature x and std(x) is the standard deviation of feature x.
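
The three steps and the formula can be sketched by hand with NumPy, applied column by column. The small matrix is hypothetical; `axis=0` computes one mean and one standard deviation per feature.

```python
import numpy as np

# Hypothetical data: each row is a sample, columns are (Age, EstimatedSalary)
X_demo = np.array([
    [18.0, 15000.0],
    [30.0, 43000.0],
    [38.0, 69500.0],
    [46.0, 88000.0],
    [60.0, 150000.0],
])

# Step 1: mean of each feature (column)
col_mean = X_demo.mean(axis=0)

# Step 2: standard deviation of each feature (population form, ddof=0)
col_std = X_demo.std(axis=0)

# Step 3: subtract the mean and divide by the standard deviation
X_demo_standardized = (X_demo - col_mean) / col_std

print(X_demo_standardized.mean(axis=0))  # close to 0 for every column
print(X_demo_standardized.std(axis=0))   # close to 1 for every column
```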

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
dataset = pd.read_csv('D:\\AI\\project by siddharkhan\\project code and data\\data\\Social_Network_Ads.csv')

In [3]:
dataset.sample(5)

Unnamed: 0,Age,EstimatedSalary,Purchased
144,34,25000,0
34,27,90000,0
199,35,22000,0
270,43,133000,0
40,27,17000,0


In [4]:
X=dataset.drop('Purchased',axis=1)
Y=dataset['Purchased']

<span style="color:blue">Split the data before feature scaling, so that the scaler learns its parameters from the training set only.</span>

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=0)

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training set only; it learns each feature's mean and standard deviation
scaler.fit(X_train)

#transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
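
What `fit` learns can be inspected through the `mean_` and `scale_` attributes of a fitted `StandardScaler`. The sketch below uses a hypothetical toy array rather than the dataset above, and the name `toy_scaler` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are (Age, EstimatedSalary)
train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test = np.array([[50.0, 80000.0]])

toy_scaler = StandardScaler().fit(train)

# Parameters learned from the training rows only
print(toy_scaler.mean_)   # per-column means
print(toy_scaler.scale_)  # per-column standard deviations (ddof=0)

# The test row is scaled with the training mean and standard deviation,
# so it can fall outside the range seen during training
test_scaled = toy_scaler.transform(test)
print(test_scaled)
```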

In [7]:
print(X_train_scaled[:5])
# transform returns a NumPy array, which is harder to read than a DataFrame

[[ 1.92295008  2.14601566]
 [ 2.02016082  0.3787193 ]
 [-1.3822153  -0.4324987 ]
 [-1.18779381 -1.01194013]
 [ 1.92295008 -0.92502392]]


In [8]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [9]:
X_train_scaled
# now the scaled data is a DataFrame again

Unnamed: 0,Age,EstimatedSalary
0,1.922950,2.146016
1,2.020161,0.378719
2,-1.382215,-0.432499
3,-1.187794,-1.011940
4,1.922950,-0.925024
...,...,...
315,0.950843,-1.156800
316,-0.896162,-0.780164
317,-0.215686,-0.519415
318,-1.090583,-0.461471


In [10]:
X_train.describe()

Unnamed: 0,Age,EstimatedSalary
count,320.0,320.0
mean,38.21875,69928.125
std,10.30304,34570.057299
min,18.0,15000.0
25%,30.0,43000.0
50%,38.0,69500.0
75%,46.0,88000.0
max,60.0,150000.0


In [11]:
X_train_scaled.describe()

Unnamed: 0,Age,EstimatedSalary
count,320.0,320.0
mean,0.0,1.1102230000000002e-17
std,1.001566,1.001566
min,-1.96548,-1.591382
25%,-0.798951,-0.7801636
50%,-0.021265,-0.01240367
75%,0.756421,0.5235797
max,2.117372,2.319848
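
The `std` row reads 1.001566 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 320 rows the two differ by a factor of sqrt(320/319). A minimal sketch on synthetic values (not the dataset above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 320  # same number of rows as the training set above
values = rng.normal(loc=38.0, scale=10.0, size=n)  # synthetic stand-in for a feature

# Standardize with the population standard deviation, as StandardScaler does
z = (values - values.mean()) / values.std(ddof=0)

population_std = z.std(ddof=0)          # 1.0 up to floating-point error
sample_std = z.std(ddof=1)              # the convention describe() uses
expected_ratio = np.sqrt(n / (n - 1))   # sqrt(320 / 319)

print(population_std, sample_std, expected_ratio)
```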


## When do we use feature scaling?

**Algorithms that typically require feature scaling:**  
1. Gradient Descent-Based Algorithms:
    * Linear Regression.
    * Logistic Regression.
    * Support Vector Machines(SVM)
2. K-means:
3. K-Nearest-Neighbours:
4. PCA (Principal Component Analysis):
5. Artificial Neural Networks:

**Algorithms that typically do not require feature scaling:**  
1. Tree-Based Algorithms: 
    * Decision Trees.
    * Random Forests.
    * Gradient Boosting Machines(GBM).
    * XGBoost
2. Naive Bayes:
3. Ensemble methods built on trees (for example bagging or boosting of decision trees)
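
The difference between the two groups can be sketched on synthetic data in which the label depends only on the small-scale feature. The data, the sample size and the variable names are assumptions made for illustration. A decision tree should give the same predictions with or without scaling, whereas K-Nearest Neighbours is expected to do better once the features are standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(18, 60, size=n)
salary = rng.uniform(15000, 150000, size=n)
X_syn = np.column_stack([age, salary])
y_syn = (age > 40).astype(int)  # the label depends on age only

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)
syn_scaler = StandardScaler().fit(X_tr)  # fitted on the training part only
X_tr_s, X_te_s = syn_scaler.transform(X_tr), syn_scaler.transform(X_te)

# A tree splits on one feature at a time, so rescaling does not change the split order
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_tr_s, y_tr).predict(X_te_s)
tree_same = bool((tree_raw == tree_scaled).all())

# KNN uses Euclidean distance, which the unscaled salary column dominates
knn_raw_acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
knn_scaled_acc = KNeighborsClassifier().fit(X_tr_s, y_tr).score(X_te_s, y_te)

print(tree_same, knn_raw_acc, knn_scaled_acc)
```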

## How to apply standardization when the dataset contains categorical data?

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

In [13]:
dataset=pd.read_csv('D:\\AI\\project by siddharkhan\\project code and data\\data\\loan_approval_prediction.csv')

In [14]:
dataset.sample(5)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
70,LP001243,Male,Yes,0,Graduate,No,3208,3066.0,172.0,360.0,1.0,Urban,Y
117,LP001405,Male,Yes,1,Graduate,No,2214,1398.0,85.0,360.0,,Urban,Y
100,LP001345,Male,Yes,2,Not Graduate,No,4288,3263.0,133.0,180.0,1.0,Urban,Y
481,LP002536,Male,Yes,3+,Not Graduate,No,3095,0.0,113.0,360.0,1.0,Rural,Y
580,LP002892,Male,Yes,2,Graduate,No,6540,0.0,205.0,360.0,1.0,Semiurban,Y


In [15]:
X=dataset.drop('Loan_Status',axis=1)
Y=dataset['Loan_Status']

In [16]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=42)

In [17]:
# Separate numerical and categorical columns
numeric_cols = X_train.select_dtypes(include=['float', 'int']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

In [18]:
# Pipeline that standardizes the numerical columns
numeric_pipeline = Pipeline([('scaler', StandardScaler())])

In [19]:
"""
ColumnTransformer is used to apply different transformations to these columns.
The numerical columns are standardized using StandardScaler,
while the categorical columns are left unchanged using the 'passthrough' option.
"""
preprocessor = ColumnTransformer(
    transformers=[('numeric', numeric_pipeline, numeric_cols),('categorical', 'passthrough', categorical_cols)])

# Fit the preprocessor on the training set and transform it:
X_train_processed = preprocessor.fit_transform(X_train)

# Transform the test set with the parameters learned from the training set (no refitting):
X_test_processed = preprocessor.transform(X_test)
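
The test set should be transformed with the statistics learned from the training set. Refitting on the test set re-centres it around its own mean, which hides any shift between the two sets and leaks test information into preprocessing. A minimal sketch on hypothetical single-feature data (the name `train_only_scaler` is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the test values sit higher than the training values
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0], [50.0], [60.0]])

train_only_scaler = StandardScaler().fit(train)

# Recommended: reuse the training mean and standard deviation
test_ok = train_only_scaler.transform(test)

# Problematic: refitting makes the test set look just like the training set
test_refit = StandardScaler().fit_transform(test)

print(test_ok.ravel())
print(test_refit.ravel())
```

With `transform`, the scaled test values stay well above the training range, which reflects the real shift in the data. With `fit_transform`, that shift disappears.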

In [20]:
# ColumnTransformer outputs the numeric columns first, then the passthrough columns
output_cols = numeric_cols + categorical_cols
X_train_processed = pd.DataFrame(X_train_processed, columns=output_cols)
X_test_processed = pd.DataFrame(X_test_processed, columns=output_cols)

In [21]:
X_train_processed

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,Property_Area
0,-0.501334,0.278657,0.39638,0.308731,-2.369951,LP002788,Male,Yes,0,Not Graduate,No,Urban
1,-0.428032,0.451038,0.094446,0.308731,0.42195,LP002950,Male,Yes,0,Not Graduate,,Rural
2,-0.566972,0.232088,-0.149424,0.308731,0.42195,LP001868,Male,No,0,Graduate,No,Semiurban
3,-0.477011,0.040931,-0.462972,0.308731,0.42195,LP002587,Male,Yes,0,Not Graduate,No,Rural
4,0.219858,-0.597514,-0.195876,0.308731,0.42195,LP002716,Male,No,0,Not Graduate,No,Semiurban
...,...,...,...,...,...,...,...,...,...,...,...,...
424,-0.597793,0.106653,-0.5791,0.308731,0.42195,LP001245,Male,Yes,2,Not Graduate,Yes,Semiurban
425,0.991862,-0.174639,0.907346,0.308731,0.42195,LP001369,Male,Yes,2,Graduate,No,Urban
426,-0.37089,-0.597514,-1.357162,0.308731,0.42195,LP001888,Female,No,0,Graduate,No,Urban
427,0.763626,-0.597514,,-1.430479,0.42195,LP002393,Female,,,Graduate,No,Semiurban
