# Feature Engineering – Feature Scaling
 
**Feature Scaling** is a preprocessing technique used to adjust the range of numerical features in a dataset, so they fall within a specific scale, often to make them comparable and improve model performance.

## Why is Feature Scaling Important?
**1. Ensures Fairness**                         
Suppose you have two features:                   
* Age: Ranges from 18 to 100.                  
* Income: Ranges from 10,000 to 100,000.

If you don’t scale the features, Income (the larger range) can dominate models that are sensitive to feature magnitude, such as regularized Linear Regression or Logistic Regression and any model trained with gradient descent, simply because its numerical values are much larger than those of Age. The model's treatment of the two features then reflects their units rather than their predictive value, which might not be ideal.

Example (Without Scaling):
* A change of +1 in Age (e.g., 25 → 26) covers about 1.2% of Age's range, while a change of +1 in Income (e.g., 50,000 → 50,001) covers about 0.001% of Income's range. A model that works with raw values sees both as a step of the same size.

  
**2. Model Convergence**                          
* Gradient Descent-based models (e.g., Logistic Regression, Neural Networks) learn by iteratively minimizing a loss function.
* If one feature has values between 0 and 1 and another between 1,000 and 10,000, the gradient magnitudes differ greatly across the corresponding weights, so a single learning rate is too large for one weight or too small for the other, which slows convergence.
* Example:
Imagine walking down into a valley that is gentle in one direction but steep and narrow in the other. You zigzag across the steep walls instead of heading straight for the bottom. Similarly, the optimization algorithm struggles when feature scales are inconsistent.
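
One way to make the convergence argument concrete is the condition number of the feature covariance matrix: for a linear model with squared loss, a larger condition number means a more elongated loss surface, on which plain gradient descent tends to converge more slowly. The sketch below uses made-up values for two hypothetical features and only the standard library; it is an illustration, not a benchmark.

```python
import math
import statistics

# Hypothetical features on very different scales (values are made up).
small = [0.1, 0.4, 0.5, 0.9, 0.3]        # roughly between 0 and 1
large = [2000, 9000, 4000, 7000, 1500]   # roughly between 1,000 and 10,000

def standardize(values):
    """Z-score a list using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def condition_number(a, b):
    """Ratio of largest to smallest eigenvalue of the 2x2 covariance matrix of (a, b)."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    n = len(a)
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    trace = var_a + var_b
    det = var_a * var_b - cov ** 2
    lam_max = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2
    lam_min = det / lam_max  # product of the eigenvalues equals the determinant
    return lam_max / lam_min

cond_raw = condition_number(small, large)
cond_scaled = condition_number(standardize(small), standardize(large))

print(f"condition number, raw features:    {cond_raw:.3g}")
print(f"condition number, scaled features: {cond_scaled:.3g}")
```

On this toy data the raw condition number is many orders of magnitude larger than the standardized one, which is the "steep and narrow valley" in numerical form.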

**3. Improves Accuracy for Distance-Based Algorithms**                          
* Distance-based algorithms like KNN and SVM rely on calculating the distance between points.                      
* Example (Without Scaling):
The Euclidean distance between two data points is given by:

$$
\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$

Where:
- \(x\) represents **Income**   
- \(y\) represents **Age**
 
If Income is on a much larger scale, the distance is determined almost entirely by differences in Income, and Age is effectively ignored.
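
As a small numerical sketch of this effect, the snippet below computes the distance between two hypothetical people before and after min-max scaling. The two data points are made up, and the feature ranges are the Age and Income ranges assumed earlier in this section.

```python
import math

# Two hypothetical people as (income, age); the values are made up.
person_a = (50_000, 25)
person_b = (52_000, 60)

raw_distance = math.dist(person_a, person_b)

# Assumed feature ranges, taken from the Age/Income example earlier in this section.
income_min, income_max = 10_000, 100_000
age_min, age_max = 18, 100

def min_max(point):
    """Rescale an (income, age) pair to the [0, 1] range using the assumed ranges."""
    income, age = point
    return (
        (income - income_min) / (income_max - income_min),
        (age - age_min) / (age_max - age_min),
    )

scaled_distance = math.dist(min_max(person_a), min_max(person_b))

print(f"raw distance:    {raw_distance:.2f}")     # almost entirely the 2,000 income gap
print(f"scaled distance: {scaled_distance:.3f}")  # now mostly the 35-year age gap
```

Before scaling, a 35-year age difference barely changes the distance; after scaling, it accounts for most of it.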

**4. Consistency**                            
* Features on vastly different scales make the training result depend on an arbitrary choice of units (e.g., income in dollars vs. thousands of dollars), leading to inconsistent results.
* Example (With Scaling):                            
After scaling, both Age and Income will have a mean of 0 and a standard deviation of 1 (if using StandardScaler). This ensures the model treats all features equally, leading to more reliable predictions.                

## When is Feature Scaling Needed?
* Required for algorithms that rely on distances or gradients (e.g., KNN, SVM, Logistic Regression, Neural Networks).           
* Not Required for tree-based algorithms like Decision Trees, Random Forests, and XGBoost (they are scale-invariant).         
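
To illustrate why threshold-based splits do not care about scale, here is a minimal sketch of a single-feature split search. The function `best_split_left_rows` is a hypothetical, simplified stand-in for what a decision tree does at one node, not any library's implementation, and the data is made up.

```python
def best_split_left_rows(xs, ys):
    """Return the row indices sent left by the single-threshold split with the fewest errors."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    def errors(rows):
        # Predicting the majority class misclassifies the minority rows.
        ones = sum(ys[i] for i in rows)
        return min(ones, len(rows) - ones)

    best_cut = min(
        range(1, len(order)),
        key=lambda cut: errors(order[:cut]) + errors(order[cut:]),
    )
    return sorted(order[:best_cut])

# Hypothetical income values and binary labels (made up for illustration).
income = [20_000, 35_000, 50_000, 80_000, 95_000]
label = [0, 0, 0, 1, 1]

lo, hi = min(income), max(income)
income_scaled = [(x - lo) / (hi - lo) for x in income]

print(best_split_left_rows(income, label))         # [0, 1, 2]
print(best_split_left_rows(income_scaled, label))  # [0, 1, 2]
```

Because min-max scaling preserves the ordering of the values, the same rows end up on each side of the split; only the numeric threshold changes.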

## 1. Importing Necessary Libraries

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Libraries for Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PowerTransformer

# Libraries for visualizations
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 
warnings.filterwarnings('ignore')

## 2. Load the Dataset

In [5]:
# Load the dataset
df = pd.read_excel("C:\\Users\\BINPAT\\Documents\\Python Self\\Feature Engineering\\Datasets\\Scaling.xlsx")  
df.head()

Unnamed: 0,Experience,Salary,Country_Type
0,1,30,Devloping
1,3,35,Developed
2,4,43,Devloping
3,5,36,Developed
4,6,27,Devloping


## 3. Extracting basic information about the dataset

In [7]:
# Display basic information about the dataset
print("Dataset Information:")
df.info()

# Display the column names
print("\nColumn Names:")
print(df.columns)

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Experience    15 non-null     int64 
 1   Salary        15 non-null     int64 
 2   Country_Type  15 non-null     object
dtypes: int64(2), object(1)
memory usage: 492.0+ bytes

Column Names:
Index(['Experience', 'Salary', 'Country_Type'], dtype='object')


## 1. Standardization (Z-Score Normalization)
Standardization is the process of rescaling the features so that they have a mean of 0 and a standard deviation of 1. This technique is commonly used when the data follows a Gaussian (normal) distribution.

**Formula:**
$$
Z = \frac{X - \mu}{\sigma}
$$
Where:  
- \(X\) is the feature value.  
- \(\mu\) is the mean of the feature.  
- \(\sigma\) is the standard deviation of the feature.

**When to Use:**  
- When the data is normally distributed (or close to it).  
- For algorithms like Logistic Regression, KNN, SVM, and Neural Networks that are sensitive to the scale of the data.

**Pros:**  
- Preserves the shape of the distribution (it only shifts and rescales the values).  
- Useful for many machine learning algorithms.  
- Works well when features have different units.

**Cons:**  
- Sensitive to outliers: extreme values shift the mean and inflate the standard deviation, which compresses the remaining values.
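
Before using `StandardScaler`, it can help to apply the formula by hand once. This is a minimal sketch on hypothetical values (not the notebook's dataset), using the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

# Hypothetical feature values (not taken from the notebook's dataset).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.fmean(values)       # 5.0
sigma = statistics.pstdev(values)   # 2.0 (population standard deviation)

z_scores = [(x - mu) / sigma for x in values]
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each z-score reads as "how many standard deviations above or below the mean" the original value was.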

In [9]:
scaler = StandardScaler()

# Standardizing the 'Experience' and 'Salary' columns
df[['Experience_SS', 'Salary_SS']] = scaler.fit_transform(df[['Experience', 'Salary']])

print("Standardization (Z-Score Normalization):")
df.head()

Standardization (Z-Score Normalization):


Unnamed: 0,Experience,Salary,Country_Type,Experience_SS,Salary_SS
0,1,30,Devloping,-1.689306,0.596869
1,3,35,Developed,-0.355643,1.04452
2,4,43,Devloping,0.311188,1.760763
3,5,36,Developed,0.978019,1.134051
4,6,27,Devloping,1.644851,0.328278


## 2. Min-Max Scaling (Normalization)
Min-Max Scaling scales the data to a fixed range, typically [0, 1], based on the minimum and maximum values of each feature.

**Formula:**
$$
X' = \frac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}
$$
Where:  
- X is the feature value.  
- min(X) is the minimum value of the feature.  
- max(X) is the maximum value of the feature.

**When to Use:**  
- When you want to scale the features to a specific range.  
- For algorithms like KNN and Neural Networks, where feature scale affects distance computations or gradient-based training.

**Pros:**  
- Simple and easy to interpret.  
- Scales the data into a predefined range.

**Cons:**  
- Sensitive to outliers (outliers can distort the range).
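
The following sketch applies the Min-Max formula by hand to hypothetical values and shows how one extreme value squeezes everything else toward 0. The helper `min_max_scale` is illustrative only and assumes the values are not all identical (otherwise the denominator is zero).

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical values, without and with one extreme outlier.
clean = [10, 20, 30, 50]
with_outlier = [10, 20, 30, 50, 1000]

print(min_max_scale(clean))         # [0.0, 0.25, 0.5, 1.0]
print([round(v, 3) for v in min_max_scale(with_outlier)])
# [0.0, 0.01, 0.02, 0.04, 1.0]
```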

In [11]:
min_max_scaler = MinMaxScaler()

# Applying Min-Max Scaling to 'Experience' and 'Salary' columns
df[['Experience_MM', 'Salary_MM']] = min_max_scaler.fit_transform(df[['Experience', 'Salary']])

print("\nMin-Max Scaling (Normalization):")
df.head()


Min-Max Scaling (Normalization):


Unnamed: 0,Experience,Salary,Country_Type,Experience_SS,Salary_SS,Experience_MM,Salary_MM
0,1,30,Devloping,-1.689306,0.596869,0.0,0.628571
1,3,35,Developed,-0.355643,1.04452,0.4,0.771429
2,4,43,Devloping,0.311188,1.760763,0.6,1.0
3,5,36,Developed,0.978019,1.134051,0.8,0.8
4,6,27,Devloping,1.644851,0.328278,1.0,0.542857


## 3. Robust Scaling
Robust Scaling uses the median and the interquartile range (IQR) for scaling, which makes it less sensitive to outliers compared to Standardization and Min-Max Scaling.

**Formula:**
$$
X' = \frac{X - \text{Median}(X)}{\text{IQR}(X)}
$$
Where:  
- Median(X) is the median of the feature.  
- IQR(X) is the interquartile range of the feature.

**When to Use:**  
- When the dataset contains outliers that you do not want to distort the scaling.  
- Useful for scale-sensitive algorithms (e.g., KNN, SVM, Logistic Regression) when outliers are present; tree-based algorithms do not need scaling at all.

**Pros:**  
- Much less sensitive to outliers than Standardization or Min-Max Scaling (the outliers themselves remain in the data).  
- Works well with skewed data.

**Cons:**  
- Can be less effective if the data is normally distributed.
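
To see the difference outliers make, the sketch below applies the robust formula and the z-score formula by hand to the same hypothetical values. The quartiles come from `statistics.quantiles` with `method="inclusive"`, chosen here to interpolate linearly between data points; treat any exact match with `RobustScaler`'s quartiles as an assumption.

```python
import statistics

def robust_scale(values):
    """Scale with (x - median) / IQR, using linearly interpolated quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in values]

def z_score(values):
    """Standardize with the mean and population standard deviation, for comparison."""
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical values with one extreme outlier.
values = [1, 2, 3, 4, 100]

print(robust_scale(values))                     # [-1.0, -0.5, 0.0, 0.5, 48.5]
print([round(z, 2) for z in z_score(values)])   # the four typical values end up close together
```

With robust scaling the four typical values keep a sensible spread and the outlier stays far away; with the z-score the outlier stretches the standard deviation so much that the typical values become nearly indistinguishable.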

In [13]:
robust_scaler = RobustScaler()

# Applying Robust Scaling to 'Experience' and 'Salary' columns
df[['Experience_RS', 'Salary_RS']] = robust_scaler.fit_transform(df[['Experience', 'Salary']])

print("\nRobust Scaling:")
df.head()


Robust Scaling:


Unnamed: 0,Experience,Salary,Country_Type,Experience_SS,Salary_SS,Experience_MM,Salary_MM,Experience_RS,Salary_RS
0,1,30,Devloping,-1.689306,0.596869,0.0,0.628571,-1.2,0.439024
1,3,35,Developed,-0.355643,1.04452,0.4,0.771429,-0.4,0.682927
2,4,43,Devloping,0.311188,1.760763,0.6,1.0,0.0,1.073171
3,5,36,Developed,0.978019,1.134051,0.8,0.8,0.4,0.731707
4,6,27,Devloping,1.644851,0.328278,1.0,0.542857,0.8,0.292683


## 4. Log Transformation
Log Transformation applies the logarithmic function to each feature, which compresses large values and helps to normalize data that is positively skewed.

**Formula:**
$$
X' = \log(X + 1)
$$
Where:  
- \(X\) is the feature value.  
- The "+1" keeps the transformation defined at zero, since \(\log(0)\) is undefined. Values of -1 or below are still invalid.

**When to Use:**  
- When the data is highly skewed (e.g., income, population).  
- When you want to reduce the effect of large values in the dataset.

**Pros:**  
- Reduces skewness and helps with non-linear relationships.  
- Can handle data with exponential growth patterns.

**Cons:**  
- Only works for values greater than -1 (with the "+1" shift); plain \(\log(X)\) is undefined for zero or negative values.  
- Can distort the interpretation of original values.
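
A quick sketch with the standard library shows both the compression effect and the domain limit. The values are hypothetical; `math.log1p(x)` computes \(\log(1 + x)\), the same function that `np.log1p` applies element-wise.

```python
import math

# Hypothetical right-skewed values, including a zero.
values = [0, 9, 99, 999, 9999]

logged = [math.log1p(x) for x in values]   # log1p(x) computes log(1 + x)
print([round(v, 2) for v in logged])       # [0.0, 2.3, 4.61, 6.91, 9.21]

# The shift only helps down to zero: values at or below -1 are still invalid.
try:
    math.log1p(-1)
except ValueError as exc:
    print("log1p(-1) failed:", exc)
```

Values that span four orders of magnitude end up on a roughly even scale between 0 and about 9.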

In [15]:
# Applying Log Transformation with log1p, i.e. log(1 + x) (values must be greater than -1)

df['Experience_log'] = np.log1p(df['Experience'])
df['Salary_log'] = np.log1p(df['Salary'])

print("\nLog Transformation:")
df.head()


Log Transformation:


Unnamed: 0,Experience,Salary,Country_Type,Experience_SS,Salary_SS,Experience_MM,Salary_MM,Experience_RS,Salary_RS,Experience_log,Salary_log
0,1,30,Devloping,-1.689306,0.596869,0.0,0.628571,-1.2,0.439024,0.693147,3.433987
1,3,35,Developed,-0.355643,1.04452,0.4,0.771429,-0.4,0.682927,1.386294,3.583519
2,4,43,Devloping,0.311188,1.760763,0.6,1.0,0.0,1.073171,1.609438,3.78419
3,5,36,Developed,0.978019,1.134051,0.8,0.8,0.4,0.731707,1.791759,3.610918
4,6,27,Devloping,1.644851,0.328278,1.0,0.542857,0.8,0.292683,1.94591,3.332205


## 5. Power Transformation (Box-Cox) 
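Power Transformation is used to stabilize variance and make data more Gaussian. Box-Cox is a family of power transformations that includes the log and square-root transformations as special cases, controlled by a parameter \(\lambda\). Note that scikit-learn's `PowerTransformer` defaults to the closely related Yeo-Johnson method, which also accepts zero and negative values; pass `method='box-cox'` to get Box-Cox itself.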

**Formula (Box-Cox):**
$$
X' = \frac{X^\lambda - 1}{\lambda} \quad \text{for} \quad \lambda \neq 0
$$
If \(\lambda = 0\), the transformation is defined as \(X' = \log(X)\), i.e. the log transformation.

**When to Use:**  
- When data is skewed or has non-constant variance.  
- Used when we want to transform the data to approximate a normal distribution.

**Pros:**  
- Stabilizes variance.  
- Useful for linear models that assume normality.

**Cons:**  
- Requires the data to be strictly positive (cannot be used with zero or negative values).
- The value of \(\lambda\) has to be estimated from the data; scikit-learn's `PowerTransformer` does this by maximum likelihood during `fit`.
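
The Box-Cox formula itself is short enough to write out. The function `box_cox` below is a hypothetical helper for a single value and a fixed \(\lambda\); it does not estimate \(\lambda\) and does not standardize the result, both of which `PowerTransformer` does.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

x = 4.0
print(box_cox(x, 1))      # 3.0 (a plain shift: x - 1)
print(box_cox(x, 0.5))    # 2.0 (square-root based)
print(box_cox(x, 0))      # log(4), about 1.386
print(box_cox(x, 1e-6))   # very close to log(4), showing the lambda -> 0 limit
```

The last two lines show why the \(\lambda = 0\) case is defined as the logarithm: it is the limit of the general formula as \(\lambda\) approaches 0.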

In [17]:
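# PowerTransformer defaults to method='yeo-johnson' and standardize=True;
# pass method='box-cox' for Box-Cox (requires strictly positive data)
power_transformer = PowerTransformer()

# Applying Power Transformation to 'Experience' and 'Salary' columns
df[['Experience_PT', 'Salary_PT']] = power_transformer.fit_transform(df[['Experience', 'Salary']])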

print("\nPower Transformation (Box-Cox):")
df.head()


Power Transformation (Box-Cox):


Unnamed: 0,Experience,Salary,Country_Type,Experience_SS,Salary_SS,Experience_MM,Salary_MM,Experience_RS,Salary_RS,Experience_log,Salary_log,Experience_PT,Salary_PT
0,1,30,Devloping,-1.689306,0.596869,0.0,0.628571,-1.2,0.439024,0.693147,3.433987,-1.649095,0.686905
1,3,35,Developed,-0.355643,1.04452,0.4,0.771429,-0.4,0.682927,1.386294,3.583519,-0.381695,1.032429
2,4,43,Devloping,0.311188,1.760763,0.6,1.0,0.0,1.073171,1.609438,3.78419,0.291395,1.523073
3,5,36,Developed,0.978019,1.134051,0.8,0.8,0.4,0.731707,1.791759,3.610918,0.984215,1.097565
4,6,27,Devloping,1.644851,0.328278,1.0,0.542857,0.8,0.292683,1.94591,3.332205,1.693885,0.461058


# Now let's check which one is effective

In [19]:
df.columns

Index(['Experience', 'Salary', 'Country_Type', 'Experience_SS', 'Salary_SS',
       'Experience_MM', 'Salary_MM', 'Experience_RS', 'Salary_RS',
       'Experience_log', 'Salary_log', 'Experience_PT', 'Salary_PT'],
      dtype='object')

In [20]:
# Variance Calculation
# Calculate variance before and after scaling
variance_before = df[['Experience', 'Salary']].var()
variance_after = df[['Experience_SS', 'Salary_SS', 'Experience_MM', 'Salary_MM', 
                     'Experience_RS', 'Salary_RS', 'Experience_log', 'Salary_log', 
                     'Experience_PT', 'Salary_PT']].var()

# Display variances
print("Variance Before Scaling:") 
print(variance_before)
print("\nVariance After Scaling:")
print(variance_after)

Variance Before Scaling:
Experience      2.409524
Salary        133.666667
dtype: float64

Variance After Scaling:
Experience_SS     1.071429
Salary_SS         1.071429
Experience_MM     0.096381
Salary_MM         0.109116
Experience_RS     0.385524
Salary_RS         0.318065
Experience_log    0.156705
Salary_log        0.270403
Experience_PT     1.071429
Salary_PT         1.071429
dtype: float64


### Conclusion:

Before scaling, the two columns had very different variances: 2.41 for `Experience` and 133.67 for `Salary`, so `Salary` is spread over a much wider range of values. After applying the various scaling techniques, we observe that:

- **Standard Scaling (Experience_SS, Salary_SS)** brought both features to the same variance of 1.071429, so they are on a comparable scale. The value is 15/14 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.var()` reports the sample variance (`ddof=1`) over the 15 rows.
- **Min-Max Scaling (Experience_MM, Salary_MM)** resulted in low variances (about 0.1), as all values were compressed into the [0, 1] range.
- **Robust Scaling (Experience_RS, Salary_RS)** and **Log Transformation (Experience_log, Salary_log)** also decreased variance. Robust scaling is designed to be more resistant to outliers, although the variance alone does not show this.
- **Power Transformation (Experience_PT, Salary_PT)** also produced a variance of 1.071429, because `PowerTransformer` standardizes its output by default, while additionally reshaping the distribution to be more Gaussian.

In summary, Standard Scaling and Power Transformation bring every feature to the same unit variance, while Min-Max Scaling and Robust Scaling leave the features with similar but not identical variances. Which one is preferable depends on the algorithm and model requirements; comparing variances alone does not show which technique is most effective.
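
The 1.071429 figure in the output above follows from mixing two variance conventions, which this short sketch reproduces. The 15 values are hypothetical; only the row count matches the notebook's dataset.

```python
import statistics

# Hypothetical column with 15 rows, the same row count as the notebook's dataset.
values = [1, 3, 4, 5, 6, 2, 7, 3, 4, 5, 8, 2, 6, 4, 5]
n = len(values)

mu = statistics.fmean(values)
sigma_pop = statistics.pstdev(values)          # divides by n, as StandardScaler does
z_scores = [(x - mu) / sigma_pop for x in values]

print(statistics.pvariance(z_scores))  # about 1.0   (ddof=0)
print(statistics.variance(z_scores))   # about 1.071 (ddof=1), i.e. n / (n - 1)
print(n / (n - 1))                     # 15 / 14, about 1.071429
```

Calling `.var(ddof=0)` in pandas would report a variance of 1 for the standardized columns.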

In [22]:
# Standard Deviation Calculation
# Calculate standard deviation before and after scaling
std_before = df[['Experience', 'Salary']].std()
std_after = df[['Experience_SS', 'Salary_SS', 'Experience_MM', 'Salary_MM', 
                'Experience_RS', 'Salary_RS', 'Experience_log', 'Salary_log', 
                'Experience_PT', 'Salary_PT']].std()

# Display standard deviations
print("Standard Deviation Before Scaling:") 
print(std_before)
print("\nStandard Deviation After Scaling:")
print(std_after)

Standard Deviation Before Scaling:
Experience     1.552264
Salary        11.561430
dtype: float64

Standard Deviation After Scaling:
Experience_SS     1.035098
Salary_SS         1.035098
Experience_MM     0.310453
Salary_MM         0.330327
Experience_RS     0.620906
Salary_RS         0.563972
Experience_log    0.395860
Salary_log        0.520003
Experience_PT     1.035098
Salary_PT         1.035098
dtype: float64


### Conclusion:

Before scaling, the standard deviation of `Experience` was 1.552264, and `Salary` had a much higher standard deviation of 11.561430, indicating a much wider spread. After applying the various scaling techniques, we observe:

- **Standard Scaling (Experience_SS, Salary_SS)** brought the standard deviation of both features to 1.035098, so both are on the same scale and centered around 0. The value is the square root of 15/14 rather than exactly 1 because pandas `.std()` reports the sample standard deviation, while the scaler normalizes by the population standard deviation.
- **Min-Max Scaling (Experience_MM, Salary_MM)** reduced the standard deviation further, to 0.310453 for `Experience` and 0.330327 for `Salary`, which shows that the values are compressed into a smaller range.
- **Robust Scaling (Experience_RS, Salary_RS)** also decreased the standard deviation, to 0.620906 for `Experience` and 0.563972 for `Salary`. Its resistance to outliers comes from using the median and IQR and cannot be read off these numbers.
- **Log Transformation (Experience_log, Salary_log)** reduced the standard deviation, especially for `Salary`, with values of 0.395860 for `Experience` and 0.520003 for `Salary`, because it compresses large values.
- **Power Transformation (Experience_PT, Salary_PT)** gave the same standard deviation (1.035098) as Standard Scaling, because its output is standardized by default.

In conclusion, Standard Scaling and Power Transformation give every feature the same spread. Min-Max Scaling compresses the data into a small range, while Robust Scaling is more robust against outliers. Log and Power Transformations also help reduce skewness, which can improve model performance for algorithms sensitive to the data distribution.

## For a better understanding, we have to train an ML model and compare the MSE or RMSE scores.

## Pros and Cons of Feature Scaling

**Pros:**
* Ensures faster convergence in optimization.            
* Prevents bias towards features with larger values.            
* Improves accuracy for distance-based models.               

**Cons:**                              
* Adds preprocessing complexity.                      
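* Can distort evaluation results if applied incorrectly (e.g., fitting the scaler on the full dataset before train-test splitting, which leaks test-set information into training).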
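
The train-test point can be sketched in a few lines: the scaling parameters are learned from the training rows only and then reused unchanged on the test rows. The values and the split below are hypothetical, and the z-score is computed by hand to keep the example self-contained.

```python
import statistics

# Hypothetical feature values, already split into train and test rows.
train = [20, 30, 40, 50, 60]
test = [35, 90]

# Fit: learn the scaling parameters from the training rows only.
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

# Transform: apply the same training parameters to both splits.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print([round(z, 3) for z in train_scaled])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
print([round(z, 3) for z in test_scaled])   # [-0.354, 3.536]
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_train)` and `scaler.transform(X_test)`. Test values may fall outside the training range, as the 3.536 here does, and that is expected.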