# Feature scaling
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [feature-scaling.ipynb](https://github.com/diegoinacio/data-science-notebooks/blob/master/data-analytics/feature-scaling.ipynb)
---
Overview and practical applications of key *feature scaling* methods.

In [None]:
# Data analysis
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Feature engineering
from sklearn.model_selection import train_test_split

# Model (SVM Classification)
from sklearn.svm import SVC

# Metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

In [None]:
plt.rcParams['figure.figsize'] = (16, 8)

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 0. Data
---
Before discussing the methods, let's acquire and prepare the data we need to run a classification model.

### Download from Kaggle
---
For this experiment, we first download the [diamond prices](https://www.kaggle.com/datasets/shivam2503/diamonds) dataset.

In [None]:
!kaggle datasets download -d "shivam2503/diamonds"

In [None]:
!unzip "diamonds.zip"

### Data preparation
---
Prepare the dataset to run a classification model based on numeric independent variables.

In [None]:
# Read data
df_diamonds = pd.read_csv("diamonds.csv")
df_diamonds.head()

In [None]:
# Drop the index column that was saved into the CSV
df_diamonds = df_diamonds.drop(['Unnamed: 0'], axis=1)
df_diamonds.head()

In [None]:
df_diamonds.groupby("color").count()

In [None]:
# Keep a single color and clarity so that "cut" is the only categorical variable left
df_classification = (df_diamonds
    .where(df_diamonds.color == "G")
    .where(df_diamonds.clarity == "VS2")
    .dropna()
)

# Numerical independent features
X = df_classification[[
    "carat", "depth", "table", 
    "price", "x", "y", "z"
]]

# Target data (classification)
y = df_classification["cut"]

In [None]:
X.describe()

## 1. What is *feature scaling*?
---
*Feature scaling* is a process used to rescale and normalize independent variables. This is an important step because columns often have different units and ranges, which can degrade the performance of many algorithms that rely on distances or dissimilarities between samples.
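
To make the effect of mismatched ranges concrete, here is a minimal sketch with three made-up points described by `[carat, price]`. The values and the helper `euclidean` are illustrative assumptions, not taken from the diamonds dataset. On the raw values the price column decides which point is "closest"; after a simple min-max rescale both columns contribute comparably.

```python
import numpy as np

# Three hypothetical diamonds as [carat, price]; values are made up for illustration
a = np.array([0.30, 500.0])
b = np.array([1.50, 520.0])  # very different carat, similar price
c = np.array([0.31, 900.0])  # almost the same carat, different price

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw distances: the price difference dominates the result
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)
print(f"raw:    d(a, b) = {d_ab:.3f}, d(a, c) = {d_ac:.3f}")

# Min-max scale each column using the three points themselves
pts = np.vstack([a, b, c])
scaled = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))

d_ab_s = euclidean(scaled[0], scaled[1])
d_ac_s = euclidean(scaled[0], scaled[2])
print(f"scaled: d(a, b) = {d_ab_s:.3f}, d(a, c) = {d_ac_s:.3f}")
```

On the raw data `c` looks far from `a` only because price is measured in much larger numbers than carat; once both columns share a range, the carat gap between `a` and `b` counts as much as the price gap between `a` and `c`.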

For example, if we have a classifier based on distance metrics (such as KNN or SVM), the model may not work correctly when the variables have very different magnitudes. Let's see how our model performs without applying feature scaling:

In [None]:
# Split model data
X_train, X_test, y_train, y_test = (
    train_test_split(X, y, random_state=0, test_size=0.1)
)

LABELS = y_train.unique()

# Model | SVC Classification
svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

# Confusion matrix
CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))

### 1.1. Normalization
---
Normalization (also known as min-max scaling) is probably the most common method for scaling features. It uses each variable's *minimum* and *maximum* to map its values to the range $[0, 1]$.

$$ \large
x' = \frac{x - min(x)}{max(x) - min(x)}
$$

where:
- $x$ is an independent variable;
- $min(x)$ is the minimum value of $x$;
- $max(x)$ is the maximum value of $x$.
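
The same formula is available in scikit-learn as `MinMaxScaler`. The sketch below is an alternative to the hand-written version used in this notebook, not a replacement for it: it assumes a synthetic feature matrix (`X_demo`, `y_demo` are made-up names and data) and learns the minimum and maximum from the training rows only, so that the test rows do not influence the scaling parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a feature matrix (not the diamonds data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[1.0, 4000.0], scale=[0.5, 2500.0], size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.1
)

scaler = MinMaxScaler()                   # default feature_range is (0, 1)
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on the test rows

print("train range:", X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print("test range: ", X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this approach the training columns span exactly $[0, 1]$, while test values may fall slightly outside that interval whenever a test row is smaller or larger than anything seen during training.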

In [None]:
X_ = (X - X.min())/(X.max() - X.min())
X_.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=0, test_size=0.1)

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))

Alternatively, we can rescale our variable to another range $[a, b]$.

$$ \large
x' = a + (b - a) \cdot \frac{x - min(x)}{max(x) - min(x)}
$$

where $a$ and $b$ are the lower and upper limits of the range, respectively.
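
As a quick check of the formula, the following sketch applies it to a small hand-picked array. The function name `rescale` and the guard against constant features are assumptions added for illustration.

```python
import numpy as np

def rescale(x, a=0.0, b=1.0):
    """Min-max rescale a 1-D array to the interval [a, b]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # The formula divides by max(x) - min(x), so a constant feature is undefined
        raise ValueError("constant feature: max(x) - min(x) is zero")
    return a + (b - a) * (x - x.min()) / span

x = np.array([2.0, 4.0, 6.0, 10.0])
x_scaled = rescale(x, a=1, b=5)
print(x_scaled)  # [1. 2. 3. 5.]
```

The smallest value lands on $a$ and the largest on $b$, with everything else placed proportionally in between. In scikit-learn the same interval can be requested with `MinMaxScaler(feature_range=(a, b))`.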

In [None]:
a, b = 1, 5
X_ = a + (b - a)*(X - X.min())/(X.max() - X.min())
X_.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=0, test_size=0.1)

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))

### 1.2. Mean Normalization
---
This is similar to simple normalization, but the values are centered at the origin: the mean is subtracted instead of the minimum.

$$ \large
x' = \frac{x - \overline{x}}{max(x) - min(x)}
$$

where:
- $\overline{x}$ is the average (or arithmetic mean) value of the variable.
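
A small sketch of what this formula guarantees, using a hand-picked array (the helper name `mean_normalize` is an assumption): the result has zero mean and its total width, maximum minus minimum, is exactly 1.

```python
import numpy as np

def mean_normalize(x):
    """Subtract the mean and divide by the range max(x) - min(x)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

x = np.array([2.0, 4.0, 6.0, 10.0])
x_mn = mean_normalize(x)

print(x_mn)
print("mean: ", x_mn.mean())
print("width:", x_mn.max() - x_mn.min())
```

Unlike plain normalization, the endpoints are not fixed: where the minimum and maximum end up depends on how far the mean sits from each of them.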

In [None]:
X_ = (X - X.mean())/(X.max() - X.min())
X_.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=0, test_size=0.1)

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))

### 1.3. InterQuartile Range Normalization
---
Normalization is very sensitive to outliers, because they directly affect the minimum and maximum values and can squeeze the remaining values into a narrow part of the range. To make the scaling robust to outliers, we can use measures of position instead. So instead of the minimum and maximum, let's use the median and the *InterQuartile Range* (IQR):

$$ \large
x' = \frac{x - median(x)}{Q3 - Q1}
$$

where:
- $median(x)$ is the median;
- $Q1$ is the first quartile;
- $Q3$ is the third quartile.
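
The sketch below illustrates the robustness claim on a toy array. The data and the helper names (`min_max`, `iqr_scale`, `spread`) are assumptions for illustration. It measures how wide the five "ordinary" values remain after scaling, with and without one extreme value appended.

```python
import numpy as np

def min_max(x):
    """Plain min-max normalization to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def iqr_scale(x):
    """Center on the median and divide by the interquartile range."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

def spread(v):
    """Width of a set of values: max minus min."""
    return float(v.max() - v.min())

x_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_outlier = np.append(x_clean, 100.0)  # same values plus one extreme point

# Width occupied by the five shared values, before and after adding the outlier
spread_mm_clean = spread(min_max(x_clean))
spread_mm_out = spread(min_max(x_outlier)[:5])
spread_iqr_clean = spread(iqr_scale(x_clean))
spread_iqr_out = spread(iqr_scale(x_outlier)[:5])

print(f"min-max: {spread_mm_clean:.3f} -> {spread_mm_out:.3f}")
print(f"IQR:     {spread_iqr_clean:.3f} -> {spread_iqr_out:.3f}")
```

With min-max scaling the ordinary values collapse into a tiny slice of $[0, 1]$ once the outlier is present, while the IQR-based version changes only moderately. scikit-learn's `RobustScaler` applies this median and IQR scheme with its default settings.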

In [None]:
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
X_ = (X - X.median())/(Q3 - Q1)
X_.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=0, test_size=0.1)

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))

### 1.4. Standardization
---
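This is another very popular scaler. It is less affected by outliers than min-max normalization, although the mean and the standard deviation are still influenced by extreme values. The method centers each variable at its mean and divides it by its *standard deviation*, so the result has zero mean and unit variance.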

$$ \large
x' = \frac{x - \overline{x}}{\sigma}
$$

where:
- $\overline{x}$ is the mean value;
- $\sigma$ is the standard deviation of $x$.
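
One detail worth knowing when comparing implementations: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below, on a tiny made-up column, shows that the two results differ by the constant factor $\sqrt{n/(n-1)}$.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative column (not the diamonds data)
df = pd.DataFrame({"x": [2.0, 4.0, 6.0, 10.0]})
n = len(df)

# pandas: sample standard deviation (ddof=1)
z_pandas = ((df - df.mean()) / df.std()).to_numpy()

# scikit-learn: population standard deviation (ddof=0)
z_sklearn = StandardScaler().fit_transform(df)

# Both versions are proportional; the ratio is sqrt(n / (n - 1))
ratio = z_sklearn / z_pandas
print(ratio.ravel())
print(np.sqrt(n / (n - 1)))
```

For a dataset with thousands of rows the factor is very close to 1, so the choice rarely matters in practice, but it explains small numeric differences between a manual pandas computation and `StandardScaler`.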

In [None]:
X_ = (X - X.mean())/X.std()
X_.describe()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y, random_state=0, test_size=0.1)

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

CM = confusion_matrix(y_test, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(confusion_matrix=CM,display_labels=LABELS)
disp.plot()
plt.show()

In [None]:
print(classification_report(y_test, y_pred, labels=LABELS))