# Data scaling in machine learning
Data scaling is a preprocessing technique used to standardize the range of the independent
variables (features) in a dataset.

# Scaling Techniques:
1. Min-Max Normalization
2. Standardization (Z-Score)
3. Log Transformation
4. Robust Scaler
5. Max Absolute Scaler

# Min-Max Scaling
Min-Max scaling rescales data so that all values fall within a fixed range, such as 0 to 1 or -1 to 1.
It is helpful when features have different units or scales, because it puts them on a comparable footing.
The Formula:
\[
x_{\text{scaled}} = \frac{x - \min}{\max - \min}
\]

- \( x \): The original data value.
- \( \min \): The smallest value in the dataset.
- \( \max \): The largest value in the dataset.

How It Works:
1. Subtract the minimum value from \( x \) to "shift" the dataset to start at 0.
2. Divide by the range (\( \max - \min \)) to scale it proportionally within the range.
Example:
Say your data is: [10, 20, 30]

min = 10, max = 30.
Step-by-step scaling to 0–1:

- For x = 10: (10 − 10) / (30 − 10) = 0
- For x = 20: (20 − 10) / (30 − 10) = 0.5 ...
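
The formula and worked example above can be checked with a few lines of plain Python. This is a minimal sketch: the function name `min_max_scale` and its `new_min` / `new_max` parameters are hypothetical helpers introduced for illustration, not part of any library.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers to [new_min, new_max] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the range (max - min) is zero")
    # Shift so the minimum maps to 0, divide by the range,
    # then stretch and shift to the requested target range.
    return [new_min + (x - lo) / (hi - lo) * (new_max - new_min) for x in values]


scaled = min_max_scale([10, 20, 30])
print(scaled)  # [0.0, 0.5, 1.0]
```

Calling `min_max_scale([10, 20, 30], -1.0, 1.0)` maps the same data to the range -1 to 1 instead.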

# Key Points to Remember:
- **Range is fixed**: Data is scaled between 0 and 1 (or another specified range).
- **Outliers are not removed**: They still affect the scaling, since they determine the min and max values.
- **When to Use**: Models like neural networks or KNN benefit from Min-Max scaling, because features on the same range contribute comparably instead of being dominated by the feature with the largest scale.
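
The points about the fixed range and about outliers can be seen in a small experiment. The sketch below uses made-up data (not from this notebook) with one extreme value, and relies on the `feature_range` parameter of `MinMaxScaler` to choose a target range other than 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data (an assumption, not from the notebook):
# four close values and one outlier, as a single feature column.
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Default target range is 0 to 1. The outlier sets max,
# so the remaining values are squeezed close to 0.
scaled_01 = MinMaxScaler().fit_transform(values)
print(scaled_01.ravel())

# feature_range selects a different target range, here -1 to 1.
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)
print(scaled_pm1.ravel())
```

The outlier still maps to the top of the range in both cases; it is kept, and it compresses everything else.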

# Scikit-learn implementation:

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [3]:
data = {
    'Feature1' : [1,5,10,4,5],
    'Feature2' : [6,7,8,19,10]
}
df = pd.DataFrame(data)
df.head()

   Feature1  Feature2
0         1         6
1         5         7
2        10         8
3         4        19
4         5        10


In [4]:
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns = df.columns)

In [9]:
print("Original DataFrame:\n",df)
print("Scaled DataFrame:\n",scaled_df)

Original DataFrame:
    Feature1  Feature2
0         1         6
1         5         7
2        10         8
3         4        19
4         5        10
Scaled DataFrame:
    Feature1  Feature2
0  0.000000  0.000000
1  0.444444  0.076923
2  1.000000  0.153846
3  0.333333  1.000000
4  0.444444  0.307692
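
As a cross-check, the same result can be obtained by applying the Min-Max formula directly with pandas. This is a sketch, assuming the same two-column DataFrame as above; `manual_scaled_df` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Feature1': [1, 5, 10, 4, 5],
    'Feature2': [6, 7, 8, 19, 10],
})

# Apply (x - min) / (max - min) to each column independently.
# df.min() and df.max() return per-column values, so the arithmetic broadcasts by column.
manual_scaled_df = (df - df.min()) / (df.max() - df.min())
print(manual_scaled_df.round(6))
```

The values should agree with the scaled DataFrame printed above, since `MinMaxScaler` also works column by column.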


# Standardization (Z-Score Scaling)

### What is Standardization?
Standardization (also known as **Z-Score Scaling**) is a technique used to transform data so that it has:
1. A **mean (average)** of 0.
2. A **standard deviation** of 1.

This puts every feature on a comparable scale, so that no feature dominates a machine learning model simply because of the units it is measured in.

---

### Formula:
\[
z = \frac{x - \mu}{\sigma}
\]

Where:
- \( z \): The standardized value.
- \( x \): The original value.
- \( \mu \): The mean of the dataset.
- \( \sigma \): The standard deviation of the dataset.

---

### Steps:
1. Subtract the **mean** (\( \mu \)) from each value (\( x \)) to center the data around 0.
2. Divide by the **standard deviation** (\( \sigma \)) to ensure a standard deviation of 1.

---

### Example:
Suppose the data is: **[10, 20, 30]**

1. **Calculate Mean (\( \mu \)):**
   \[
   \mu = \frac{10 + 20 + 30}{3} = 20
   \]

2. **Calculate Standard Deviation (\( \sigma \)):** (population form, dividing by the number of values)
   \[
   \sigma = \sqrt{\frac{(10-20)^2 + (20-20)^2 + (30-20)^2}{3}} \approx 8.16
   \]

3. **Apply Z-Score Formula:**
   - For \( x = 10 \):
     \[
     z = \frac{10 - 20}{8.16} = -1.22
     \]
   - For \( x = 20 \):
     \[
     z = \frac{20 - 20}{8.16} = 0
     \]
   - For \( x = 30 \):
     \[
     z = \frac{30 - 20}{8.16} = 1.22
     \]

**Result**: **[-1.22, 0, 1.22]**
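
The worked example above can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, pstdev

data = [10, 20, 30]

mu = mean(data)        # mean: 20
sigma = pstdev(data)   # population standard deviation (divides by N): about 8.16

# Center each value on the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in data]
print([round(z, 2) for z in z_scores])  # [-1.22, 0.0, 1.22]
```

Using the sample standard deviation (`statistics.stdev`, which divides by N - 1) would give \( \sigma = 10 \) and z-scores of -1, 0 and 1, so it matters which form is used.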

---

### Key Points:
- After standardization:
  - The **mean becomes 0**.
  - The **standard deviation becomes 1**.
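- **Outliers** are not suppressed: the data is not bounded to a fixed range, and extreme values still influence the mean and standard deviation.
- **When to Use**: 
  - Use for algorithms that are sensitive to feature scale or work best with zero-centered features, such as **SVM, Logistic Regression, or PCA**. Standardization does not make the data normally distributed, and these algorithms do not require it to be.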
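
The key points above can be verified numerically. The sketch below uses made-up data (not from this notebook) with one extreme value, and checks the mean and standard deviation after applying `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption, not from the notebook):
# four close values and one outlier, as a single feature column.
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

z = StandardScaler().fit_transform(values)

print(z.mean())   # approximately 0
print(z.std())    # approximately 1 (NumPy's default is the population standard deviation)
print(z.ravel())  # the outlier keeps the largest z-score; no fixed bounds apply
```

The outlier also pulls the mean upward and inflates the standard deviation, which is why the four ordinary values end up with similar, slightly negative z-scores.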


In [10]:
import pandas as pd

In [11]:
data = {
    'Feature 1': [1,20,3,40,5],
    'Feature 2': [6,7,18,19,10]
}
df= pd.DataFrame(data)
df.head()

   Feature 1  Feature 2
0          1          6
1         20          7
2          3         18
3         40         19
4          5         10


# Scikit-learn implementation:

In [12]:
from sklearn.preprocessing import StandardScaler

In [14]:
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized_data, columns= df.columns)

In [20]:
print('Original DataFrame: \n',df)
print('Standardized DataFrame: sklearn \n',standardized_df)

Original DataFrame: 
    Feature 1  Feature 2
0          1          6
1         20          7
2          3         18
3         40         19
4          5         10
Standardized DataFrame: sklearn 
    Feature 1  Feature 2
0  -0.869803  -1.095445
1   0.421311  -0.912871
2  -0.733896   1.095445
3   1.780378   1.278019
4  -0.597989  -0.365148
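
For comparison, the z-score formula can be applied directly with pandas. This is a sketch, assuming the same DataFrame as above; `manual_standardized_df` is a name introduced here for illustration. The one detail to watch is `ddof`: `StandardScaler` uses the population standard deviation, while `DataFrame.std` defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

df = pd.DataFrame({
    'Feature 1': [1, 20, 3, 40, 5],
    'Feature 2': [6, 7, 18, 19, 10],
})

# ddof=0 gives the population standard deviation, matching StandardScaler.
# With the pandas default (ddof=1) the results would be slightly smaller in magnitude.
manual_standardized_df = (df - df.mean()) / df.std(ddof=0)
print(manual_standardized_df.round(6))
```

With `ddof=0` the values should agree with the scikit-learn output printed above.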
